For any project-related inquiries, please write an e-mail to hello@euplex.org

When using the dataset, please cite

Hurka, S., Haag, M. and Kaplaner, C. (2022) Policy complexity in the European Union, 1993-today: introducing the EUPLEX dataset. Journal of European Public Policy. (see publications)

Technical documentation

This section describes our current approach to maintaining a database of EU law-making procedures and texts and to generating datasets from it.

Please note: the numbers on this page relate to the original dataset version as used in the article (see the article's replication material under publications).

Procedure and document selection

In accordance with our approach of creating a 'living' dataset or database, the information retrieved from EU sources is saved to a document database in order to avoid multiple downloads of the same information. Similarly, extracted information is stored in SQL database tables to allow for quick access and inference. Our procedure for data collection and generation is set up as follows:

  • Collect (new) procedure URLs from the EUR-Lex procedure search
  • For each procedure
    • Download XML notice
    • Parse notice and extract document identifiers contained in procedure events
    • Download documents based on identifiers
      • Extract text and generate complexity measures
  • Create procedure and document dataset
  • Merge documents to procedures by event-document identifiers
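
As a rough illustration of this loop (not the actual EUPLEX code; the XPath expression and URL template below are simplified assumptions), a single procedure could be processed along these lines:

```python
import requests
from lxml import etree

def process_procedure(notice_url: str) -> list[dict]:
    """Sketch of one iteration of the collection loop (simplified, no caching)."""
    # 1. Download the XML notice of the procedure
    notice = requests.get(notice_url).content
    tree = etree.fromstring(notice)
    # 2. Extract CELEX identifiers attached to the procedure events
    #    (placeholder XPath; the real notice schema is more involved)
    celex_ids = tree.xpath("//event//celex/text()")
    records = []
    for celex in celex_ids:
        # 3. Download the HTML version of the document via its CELEX number
        doc_url = f"https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:{celex}"
        html = requests.get(doc_url).text
        # 4. Text extraction and complexity measures would be computed here
        records.append({"celex": celex, "html": html})
    return records
```

In the actual workflow, downloaded notices and documents are additionally written to the document database so that the same information is not fetched more than once.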

Please note: Currently, the EUPLEX dataset only contains procedures for which the Commission has initiated a proposal.

Events in procedures may contain multiple documents. In order to identify and select the main document of an event (e.g. the legislative proposal document in our current use case), we first filter out documents that do not have a CELEX identifier and then check the document type denoted in the CELEX identifier. This allows us to identify the document that is usually listed as the 'main document' on EUR-Lex procedure pages.
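
As an illustration of this check (a minimal sketch; the descriptor mapping and the example identifiers are simplifications, not the actual EUPLEX code), the document type can be read directly from the CELEX number, e.g. the descriptor 'PC' in 52016PC0378 marks a Commission proposal (COM) document:

```python
import re
from typing import Optional

# CELEX numbers of preparatory documents: sector '5' + year + descriptor + number,
# e.g. 52016PC0378 (descriptor 'PC' denoting Commission proposal / COM documents)
CELEX_PATTERN = re.compile(r"^(?P<sector>\d)(?P<year>\d{4})(?P<type>[A-Z]{1,2})(?P<number>\d{4})")

def is_main_proposal(celex: Optional[str]) -> bool:
    """Heuristic check whether a CELEX number denotes a legislative proposal document."""
    if not celex:            # documents without a CELEX identifier are filtered out first
        return False
    match = CELEX_PATTERN.match(celex)
    return bool(match) and match.group("type") == "PC"

# Example: select the main document among the documents attached to an event
event_documents = ["52016PC0378", "52016SC0211", None]
main_documents = [d for d in event_documents if is_main_proposal(d)]  # ['52016PC0378']
```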

Text (pre-)processing and analysis

We currently focus only on documents available in HTML format in order to streamline the pre-processing procedure. After obtaining the HTML copy of a document, we extract the text and perform additional cleaning to remove unnecessary line-breaks and white-space as well as special characters that might hinder further parsing. Next, we try to split the text into its citations, recitals and enacting terms. These parts are then further split into individual citations, recitals, articles (for the enacting part) and (sub)paragraphs. The parsing steps are performed using a custom spaCy pipeline component. Within the component, we currently rely mostly on our own implementations for entity recognition and parsing tasks, unless noted otherwise. While the proposal texts available on EUR-Lex are usually complete regarding the main EU policy elements, annexes are sometimes missing or (partly) stored in separate documents. We therefore exclude annex texts from our analysis. We do, however, count references to annexes as part of our reference count indicators. The parts used in the generation of each indicator can also be found in table 1 of the main text.
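
As a rough sketch of what such a component can look like (illustrative only, using simple keyword anchors rather than the actual EUPLEX parsing logic, and omitting error handling):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Document-level extension attributes holding the identified parts
Doc.set_extension("citations", default=None)
Doc.set_extension("recitals", default=None)
Doc.set_extension("enacting_terms", default=None)

@Language.component("split_legal_parts")
def split_legal_parts(doc: Doc) -> Doc:
    """Split an EU legal text into citations, recitals and enacting terms (crude heuristic)."""
    text = doc.text
    # Recitals typically begin with 'Whereas', the enacting terms with 'HAVE ADOPTED ...'
    recitals_start = text.find("Whereas")
    enacting_start = text.find("HAVE ADOPTED")
    doc._.citations = text[:recitals_start]
    doc._.recitals = text[recitals_start:enacting_start]
    doc._.enacting_terms = text[enacting_start:]
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("split_legal_parts", last=True)
```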

In order to count citations, recitals and articles and to calculate the average depth measures, we then perform counting and matching procedures on the individual extracted parts. For identifying and counting references, we rely on an iterative approach of matching, splitting and parsing reference phrases, performed over the individual articles in the enacting terms of a text. In theory, this would also allow us to extract the content of a reference beyond a simple internal/external distinction.
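
A much simplified version of such a reference-matching step could look as follows (the pattern and the internal/external heuristic are illustrative assumptions, not the actual implementation):

```python
import re

# Matches phrases such as 'Article 5(2) of Regulation (EC) No 1049/2001' or 'Article 12'
REFERENCE_PATTERN = re.compile(
    r"Articles?\s+\d+[a-z]?(\(\d+\))?"                                                # cited article
    r"(\s+of\s+(Regulation|Directive|Decision)\s*\([A-Z]{2,}\)\s*(No\s*)?\d+/\d+)?",  # optional other act
    flags=re.IGNORECASE,
)

def count_references(article_text: str) -> dict:
    """Count internal vs. external references in one article (simplified heuristic)."""
    internal, external = 0, 0
    for match in REFERENCE_PATTERN.finditer(article_text):
        # A phrase naming another act is treated as external, otherwise as internal
        if match.group(2):
            external += 1
        else:
            internal += 1
    return {"internal": internal, "external": external}

print(count_references("Article 3 applies subject to Article 5(2) of Regulation (EC) No 1049/2001."))
# -> {'internal': 1, 'external': 1}
```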

For our readability indicators, we rely on a forked version of the spacy_readability pipeline component (Holtzscher 2018, fork available from https://github.com/ghxm/euplexCy_readability). For indicators that require the number of sentences, we use the default DependencyParser component implemented in spaCy version 3.0 (Honnibal et al. 2020) with some additional checks (removing obvious non-sentences, such as paragraph numbering) in order to segment the text into sentences. In order to calculate the word entropy, we use the lemmas as identified by the default spaCy Lemmatizer. The linguistic indicators are computed for the cleaned version of the document text containing only the previously identified citations, recitals and enacting terms.
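
For illustration, word entropy over the lemma distribution can be computed as standard Shannon entropy (a minimal sketch; the exact EUPLEX implementation may differ in details such as token filtering):

```python
import math
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # includes the default Lemmatizer and dependency parser

def word_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the lemma frequency distribution of a text."""
    doc = nlp(text)
    lemmas = [token.lemma_.lower() for token in doc if token.is_alpha]
    if not lemmas:
        return 0.0
    counts = Counter(lemmas)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(round(word_entropy("The Commission shall adopt implementing acts. The acts shall apply."), 2))
```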

Bad formatting

Apart from the general cleaning routine described above, some EU legal texts stored in EUR-Lex, particularly legislative proposals, exhibit irregularities that mostly seem to stem from the export of other formats into HTML by EUR-Lex or from sloppy or incomplete adherence to the EU style guide. Within our sample of downloaded legal texts, we encounter formatting problems where

  • the document text is missing line-breaks and newlines,
  • the document text contains multiple proposals,
  • the document text does not contain a formal ending phrase like 'Done at',
  • or a combination of the above.

Currently, we exclude documents with one of the identified formatting problems from the complexity analysis, with the exception of missing line-breaks, where we try to re-introduce the necessary breaks and only discard documents if the ratio of line-breaks/newlines to text lines is < 0.003. Within our sample of downloaded proposal texts, we identify the following distribution of formatting problems:

| Formatting problem | Complexity indicators available: no | Complexity indicators available: yes | Total |
|---|---|---|---|
| Missing line-breaks | 265 | 1,761 | 2,026 |
| Multiple proposals + missing line-breaks | 284 | 0 | 284 |
| Multiple proposals in document | 1,573 | 0 | 1,573 |
| No formal end | 857 | 0 | 857 |
| No formal end + missing line-breaks | 211 | 0 | 211 |
| None | 489 | 4,393 | 4,882 |
| Total | 3,679 | 6,154 | 9,833 |

For some texts, we did not detect any formatting problems but were nonetheless unable to generate the complexity indicators. This may be due to documents being entirely empty or not containing the expected texts.

Intercoder reliability

Intercoder reliability for hand coding vs. automated coding (Krippendorff's alpha)

| Element | Metric | 2.5% | Mean | 97.5% |
|---|---|---|---|---|
| Citations | nominal | 0.9353712 | 0.9741485 | 1.0000000 |
| Citations | ordinal | 0.9838406 | 0.9939187 | 1.0000000 |
| Citations | interval | 0.9554066 | 0.9840738 | 1.0000000 |
| Recitals | nominal | 0.8754734 | 0.9273595 | 0.9792456 |
| Recitals | ordinal | 0.8363162 | 0.9441479 | 0.9993540 |
| Recitals | interval | 0.8035378 | 0.9170202 | 0.9960159 |
| Articles | nominal | 0.9078703 | 0.9539352 | 0.9884838 |
| Articles | ordinal | 0.9865558 | 0.9954892 | 0.9999882 |
| Articles | interval | 0.9974955 | 0.9987478 | 0.9996869 |
| Int. references | nominal | 0.2428854 | 0.3734224 | 0.5039594 |
| Int. references | ordinal | 0.7039010 | 0.7974818 | 0.8775587 |
| Int. references | interval | 0.7062923 | 0.8612381 | 0.9768854 |
| Ext. references | nominal | 0.3447282 | 0.4504172 | 0.5561062 |
| Ext. references | ordinal | 0.8732229 | 0.9086786 | 0.9400003 |
| Ext. references | interval | 0.5869365 | 0.8199596 | 0.9709558 |
| Int. + ext. ref. | nominal | 0.2857143 | 0.3877551 | 0.4897959 |
| Int. + ext. ref. | ordinal | 0.8454434 | 0.9039655 | 0.9489077 |
| Int. + ext. ref. | interval | 0.9607444 | 0.9770680 | 0.9891796 |


A note on comparing new legislation and 'amending' laws

Within our sample, not all legislation is created equal. While some proposals contain entirely new policies, others simply serve to amend existing laws. Additionally, some acts establish new rules to a large extent and also amend some existing ones. Within our data, the amending and and_amending variables capture the respective cases. When assessing and comparing the complexity of texts of different types, it is important to control for these variables: a simple amending law may, for example, only have one or two formal articles, but specify large chunks of provisions to be inserted into existing laws. This may inflate measures that are standardized by the number of articles, such as our relative reference counts. Since we count references within all parts of the enacting terms, including the amending text, and standardize these counts by the number of articles within a specific text, these measures may be skewed with increasing amounts of amending text vs. non-amending text. We thus consider it necessary to include the 'amending' variables in any analysis that seeks to identify differences in complexity.
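
For example, when modelling a standardized complexity measure, the two indicator variables can simply be entered as controls (a minimal sketch assuming a pandas DataFrame; the file name and the outcome/control columns other than amending and and_amending are hypothetical placeholders):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Proposal-level data; 'references_per_article' and 'word_count' are placeholder columns
df = pd.read_csv("euplex_proposals.csv")

# Control for the amending indicators when comparing complexity across proposals
model = smf.ols("references_per_article ~ word_count + amending + and_amending", data=df).fit()
print(model.summary())
```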