For any project-related inquiries, please write an e-mail to hello@euplex.org

When using the dataset, please cite

Hurka, S., Haag, M. and Kaplaner, C. (2022) Policy complexity in the European Union, 1993-today: introducing the EUPLEX dataset. Journal of European Public Policy. (see publications)

Technical documentation

This section describes our current approach to maintaining a database of EU law-making procedures and texts and to generating datasets from it.

Please note: the numbers on this page relate to the original dataset version as used in the article (see the article's replication material under publications).

Procedure and document selection

In accordance with our approach of creating a 'living' dataset or database, the information retrieved from EU sources is saved to a document database in order to avoid multiple downloads of the same information. Similarly, extracted information is stored in SQL database tables to allow for quick access and inference. Our procedure for data collection and generation is set up as follows:

  • Collect (new) procedure URLs from the EUR-Lex procedure search
  • For each procedure
    • Download XML notice
    • Parse notice and extract document identifiers contained in procedure events
    • Download documents based on identifiers
      • Extract text and generate complexity measures
  • Create procedure and document dataset
  • Merge documents to procedures by event-document identifiers
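
As a rough illustration of this loop (not the actual EUPLEX code; the XPath expression and URL template below are simplified assumptions), a single procedure could be processed along these lines:

```python
import requests
from lxml import etree

def process_procedure(notice_url: str) -> list[dict]:
    """Sketch of one iteration of the collection loop (simplified, no caching)."""
    # 1. Download the XML notice of the procedure
    notice = requests.get(notice_url).content
    tree = etree.fromstring(notice)
    # 2. Extract CELEX identifiers attached to the procedure events
    #    (placeholder XPath; the real notice schema is more involved)
    celex_ids = tree.xpath("//event//celex/text()")
    records = []
    for celex in celex_ids:
        # 3. Download the HTML version of the document via its CELEX number
        doc_url = f"https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:{celex}"
        html = requests.get(doc_url).text
        # 4. Text extraction and complexity measures would be computed here
        records.append({"celex": celex, "html": html})
    return records
```

In the actual workflow, downloaded notices and documents are additionally written to the document database so that the same information is not fetched more than once.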

Please note: Currently, the EUPLEX dataset only contains procedures for which the Commission has initiated a proposal.

Events in procedures may contain multiple documents. In order to identify and select the main document of an event (e.g. the legislative proposal document in our current use case), we first filter out documents that do not have a CELEX identifier and then check the document type denoted in the CELEX identifier. This allows us to identify the document that is usually listed as the 'main document' on EUR-Lex procedure pages.
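
As an illustration of this check (a minimal sketch; the descriptor mapping and the example identifiers are simplifications, not the actual EUPLEX code), the document type can be read directly from the CELEX number, e.g. the descriptor 'PC' in 52016PC0378 marks a Commission proposal (COM) document:

```python
import re
from typing import Optional

# CELEX numbers of preparatory documents: sector '5' + year + descriptor + number,
# e.g. 52016PC0378 (descriptor 'PC' denoting Commission proposal / COM documents)
CELEX_PATTERN = re.compile(r"^(?P<sector>\d)(?P<year>\d{4})(?P<type>[A-Z]{1,2})(?P<number>\d{4})")

def is_main_proposal(celex: Optional[str]) -> bool:
    """Heuristic check whether a CELEX number denotes a legislative proposal document."""
    if not celex:            # documents without a CELEX identifier are filtered out first
        return False
    match = CELEX_PATTERN.match(celex)
    return bool(match) and match.group("type") == "PC"

# Example: select the main document among the documents attached to an event
event_documents = ["52016PC0378", "52016SC0211", None]
main_documents = [d for d in event_documents if is_main_proposal(d)]  # ['52016PC0378']
```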

Text (pre-)processing and analysis

We currently focus only on documents available in HTML format in order to streamline the pre-processing procedure. After obtaining the HTML copy of a document, we extract the text and perform additional cleaning to remove unnecessary line-breaks and white-space as well as special characters that might hinder further parsing. Next, we try to split the text into its citations, recitals and enacting terms. These parts are then further split into individual citations, recitals, articles (for the enacting part) and (sub)paragraphs. The parsing steps are performed using a custom spaCy pipeline component. Within the component, we currently rely mostly on our own implementations for entity recognition and parsing tasks, unless noted otherwise. While the proposal texts available on EUR-Lex are usually complete regarding the main EU policy elements, annexes are sometimes missing or (partly) stored in separate documents. We therefore exclude annex texts from our analysis. We do, however, count references to annexes as part of our reference count indicators. The parts used in the generation of each indicator can also be found in table 1 of the main text.
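
As a rough sketch of what such a component can look like (illustrative only, using simple keyword anchors rather than the actual EUPLEX parsing logic, and omitting error handling):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Document-level extension attributes holding the identified parts
Doc.set_extension("citations", default=None)
Doc.set_extension("recitals", default=None)
Doc.set_extension("enacting_terms", default=None)

@Language.component("split_legal_parts")
def split_legal_parts(doc: Doc) -> Doc:
    """Split an EU legal text into citations, recitals and enacting terms (crude heuristic)."""
    text = doc.text
    # Recitals typically begin with 'Whereas', the enacting terms with 'HAVE ADOPTED ...'
    recitals_start = text.find("Whereas")
    enacting_start = text.find("HAVE ADOPTED")
    doc._.citations = text[:recitals_start]
    doc._.recitals = text[recitals_start:enacting_start]
    doc._.enacting_terms = text[enacting_start:]
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("split_legal_parts", last=True)
```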

In order to count citations, recitals and articles and to calculate the average depth measures, we then perform counting and matching procedures on the individual extracted parts. For identifying and counting references, we rely on an iterative approach of matching, splitting and parsing reference phrases, performed over the individual articles in the enacting terms of a text. In theory, this would also allow us to extract the content of a reference beyond a simple internal/external distinction.
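
A much simplified version of such a reference-matching step could look as follows (the pattern and the internal/external heuristic are illustrative assumptions, not the actual implementation):

```python
import re

# Matches phrases such as 'Article 5(2) of Regulation (EC) No 1049/2001' or 'Article 12'
REFERENCE_PATTERN = re.compile(
    r"Articles?\s+\d+[a-z]?(\(\d+\))?"                                                # cited article
    r"(\s+of\s+(Regulation|Directive|Decision)\s*\([A-Z]{2,}\)\s*(No\s*)?\d+/\d+)?",  # optional other act
    flags=re.IGNORECASE,
)

def count_references(article_text: str) -> dict:
    """Count internal vs. external references in one article (simplified heuristic)."""
    internal, external = 0, 0
    for match in REFERENCE_PATTERN.finditer(article_text):
        # A phrase naming another act is treated as external, otherwise as internal
        if match.group(2):
            external += 1
        else:
            internal += 1
    return {"internal": internal, "external": external}

print(count_references("Article 3 applies subject to Article 5(2) of Regulation (EC) No 1049/2001."))
# -> {'internal': 1, 'external': 1}
```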

For our readability indicators, we rely on a forked version of the spacy_readability pipeline component (Holtzscher 2018, fork available from https://github.com/ghxm/euplexCy_readability). For indicators that require the number of sentences, we use the default DependencyParser component implemented in spaCy version 3.0 (Honnibal et al. 2020) with some additional checks (removing obvious non-sentences, such as paragraph numbering) in order to segment the text into sentences. In order to calculate the word entropy, we use the lemmas as identified by the default spaCy Lemmatizer. The linguistic indicators are computed for the cleaned version of the document text containing only the previously identified citations, recitals and enacting terms.
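
For illustration, word entropy over the lemma distribution can be computed as standard Shannon entropy (a minimal sketch; the exact EUPLEX implementation may differ in details such as token filtering):

```python
import math
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # includes the default Lemmatizer and dependency parser

def word_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the lemma frequency distribution of a text."""
    doc = nlp(text)
    lemmas = [token.lemma_.lower() for token in doc if token.is_alpha]
    if not lemmas:
        return 0.0
    counts = Counter(lemmas)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(round(word_entropy("The Commission shall adopt implementing acts. The acts shall apply."), 2))
```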

Bad formatting

Apart from the general cleaning routine described above, some EU legal texts stored in EUR-Lex, particularly legislative proposals, exhibit irregularities that mostly seem to stem from the export of other formats into HTML by EUR-Lex or from sloppy or incomplete adherence to the EU style guide. Within our sample of downloaded legal texts, we encounter formatting problems where

  • the document text is missing line-breaks and newlines,
  • the document text contains multiple proposals,
  • the document text does not contain a formal ending phrase like 'Done at',
  • or a combination of the above.

Currently, we exclude documents with one of the identified formatting problems from the complexity analysis, with the exception of missing line-breaks, where we try to re-introduce the necessary breaks and only discard documents if the ratio of line-breaks/newlines to text lines is < 0.003. Within our sample of downloaded proposal texts, we identify the following distribution of formatting problems:

| Formatting problem | Complexity indicators available: no | Complexity indicators available: yes | Total |
|---|---|---|---|
| Missing line-breaks | 265 | 1,761 | 2,026 |
| Multiple proposals + missing line-breaks | 284 | 0 | 284 |
| Multiple proposals in document | 1,573 | 0 | 1,573 |
| No formal end | 857 | 0 | 857 |
| No formal end + missing line-breaks | 211 | 0 | 211 |
| None | 489 | 4,393 | 4,882 |
| Total | 3,679 | 6,154 | 9,833 |

For some texts, we did not detect any formatting problems but were nonetheless unable to generate the complexity indicators. This may be due to documents being entirely empty or not containing the expected texts.

Intercoder reliability

Intercoder reliability for hand coding vs. automated coding (Krippendorff's alpha)

| Element | Metric | 2.5% | Mean | 97.5% |
|---|---|---|---|---|
| Citations | nominal | 0.9353712 | 0.9741485 | 1.0000000 |
| Citations | ordinal | 0.9838406 | 0.9939187 | 1.0000000 |
| Citations | interval | 0.9554066 | 0.9840738 | 1.0000000 |
| Recitals | nominal | 0.8754734 | 0.9273595 | 0.9792456 |
| Recitals | ordinal | 0.8363162 | 0.9441479 | 0.9993540 |
| Recitals | interval | 0.8035378 | 0.9170202 | 0.9960159 |
| Articles | nominal | 0.9078703 | 0.9539352 | 0.9884838 |
| Articles | ordinal | 0.9865558 | 0.9954892 | 0.9999882 |
| Articles | interval | 0.9974955 | 0.9987478 | 0.9996869 |
| Int. references | nominal | 0.2428854 | 0.3734224 | 0.5039594 |
| Int. references | ordinal | 0.7039010 | 0.7974818 | 0.8775587 |
| Int. references | interval | 0.7062923 | 0.8612381 | 0.9768854 |
| Ext. references | nominal | 0.3447282 | 0.4504172 | 0.5561062 |
| Ext. references | ordinal | 0.8732229 | 0.9086786 | 0.9400003 |
| Ext. references | interval | 0.5869365 | 0.8199596 | 0.9709558 |
| Int. + ext. ref. | nominal | 0.2857143 | 0.3877551 | 0.4897959 |
| Int. + ext. ref. | ordinal | 0.8454434 | 0.9039655 | 0.9489077 |
| Int. + ext. ref. | interval | 0.9607444 | 0.9770680 | 0.9891796 |


A note on comparing new legislation and 'amending' laws

Within our sample, not all legislation is created equal. While some proposals contain entirely new policies, others simply serve to amend existing laws. Additionally, some acts establish new rules to a large extent and also amend some existing ones. Within our data, the amending and and_amending variables capture the respective cases. When assessing and comparing the complexity of texts of different types, it is important to control for these variables: a simple amending law may, for example, only have one or two formal articles, but specify large chunks of provisions to be inserted into existing laws. This may inflate measures that are standardized by the number of articles, such as our relative reference counts. Since we count references within all parts of the enacting terms, including the amending text, and standardize these counts by the number of articles within a specific text, these measures may be skewed with increasing amounts of amending text vs. non-amending text. We thus consider it necessary to include the 'amending' variables in any analysis that seeks to identify differences in complexity.
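
For example, when modelling a standardized complexity measure, the two indicator variables can simply be entered as controls (a minimal sketch assuming a pandas DataFrame; the file name and the outcome/control columns other than amending and and_amending are hypothetical placeholders):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Proposal-level data; 'references_per_article' and 'word_count' are placeholder columns
df = pd.read_csv("euplex_proposals.csv")

# Control for the amending indicators when comparing complexity across proposals
model = smf.ols("references_per_article ~ word_count + amending + and_amending", data=df).fit()
print(model.summary())
```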