For any project-related inquiries, please write an e-mail to hello@euplex.org
When using the dataset, please cite
Hurka, S., Haag, M. and Kaplaner, C. (2022) Policy complexity in the European Union, 1993-today: introducing the EUPLEX dataset. Journal of European Public Policy. (see publications)
This section describes our current approach to maintaining a database of EU law-making procedures and texts and to generating datasets from it.
In line with our goal of maintaining a 'living' dataset, information retrieved from EU sources is saved to a document database in order to avoid downloading the same information multiple times. Similarly, extracted information is stored in SQL database tables in order to allow for quick access and inference. Our procedure for data collection and generation is set up as follows:
Please note: Currently, the EUPLEX dataset only contains procedures for which the Commission has initiated a proposal.
Events in procedures may contain multiple documents. To identify and select the main document of an event (in our current use case, the legislative proposal document), we first filter out documents that do not have a CELEX identifier and then check the document type denoted in the CELEX identifier. This allows us to identify the document that is usually listed as the 'main document' on EUR-Lex procedure pages.
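This selection step could be sketched as follows. The dictionary structure, field names and the 'PC' type code (used here for Commission proposal documents in sector 5 of the CELEX scheme) are illustrative assumptions, not the actual EUPLEX implementation:

```python
import re

def select_main_document(documents, doc_type="PC"):
    """Pick the main document of an event by its CELEX identifier.

    Illustrative sketch: `documents` is assumed to be a list of dicts
    with an optional 'celex' key. The pattern expects a sector digit,
    a four-digit year and the document type code (e.g. 'PC').
    """
    # Step 1: drop documents without a CELEX identifier
    with_celex = [d for d in documents if d.get("celex")]
    # Step 2: keep documents whose CELEX type code matches the wanted type
    pattern = re.compile(r"^\d\d{4}" + re.escape(doc_type))
    candidates = [d for d in with_celex if pattern.match(d["celex"])]
    return candidates[0] if candidates else None

docs = [
    {"title": "Annex", "celex": None},
    {"title": "Proposal", "celex": "52022PC0068"},
    {"title": "Impact assessment", "celex": "52022SC0061"},
]
main = select_main_document(docs)  # selects the 'PC' (proposal) document
```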
We currently focus on documents available in HTML format only in order to streamline the pre-processing procedure. After obtaining the HTML copy of a document, we extract the text and perform additional cleaning to remove unnecessary line breaks and white space as well as special characters that might hinder further parsing. Next, we try to split the text into its citations, recitals and enacting terms. These parts are then further split into individual citations, recitals, articles (for the enacting terms) and (sub)paragraphs. The parsing steps are performed using a custom spaCy pipeline component. Within the component, we currently rely mostly on our own implementations for entity recognition and parsing tasks, unless noted otherwise. While the proposal texts available on EUR-Lex are usually complete with regard to the main EU policy elements, annexes are sometimes missing or (partly) stored in separate documents. We therefore exclude annex texts from our analysis. We do, however, count references to annexes as part of our reference count indicators. The parts used in the generation of each indicator can also be found in Table 1 of the main text.
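The coarse three-way split described above could be sketched as follows. This is not the actual EUPLEX parser (which is a custom spaCy component); it only illustrates the idea of splitting on the conventional textual markers that open each part of an EU legal act:

```python
import re

def split_parts(text):
    """Rough split of an EU legal text into citations, recitals and
    enacting terms (illustrative sketch only).

    Relies on conventional markers: citations open with 'Having regard
    to', recitals with 'Whereas', and the enacting terms with
    'HAVE/HAS ADOPTED ...'.
    """
    recitals_start = text.find("Whereas")
    enacting_match = re.search(r"HA(?:VE|S) ADOPTED", text)
    enacting_start = enacting_match.start() if enacting_match else len(text)
    return {
        "citations": text[:recitals_start].strip(),
        "recitals": text[recitals_start:enacting_start].strip(),
        "enacting": text[enacting_start:].strip(),
    }

sample = ("Having regard to the Treaty, "
          "Whereas: (1) A first recital. "
          "HAVE ADOPTED THIS REGULATION: Article 1 ...")
parts = split_parts(sample)
```

Real documents require far more robust handling (missing markers, multilingual variants, numbering artifacts), which is what the custom pipeline component addresses.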
To count citations, recitals and articles and to calculate the average depth measures, we then perform counting and matching procedures on the individual extracted parts. To identify and count references, we rely on an iterative approach of matching, splitting and parsing reference phrases, performed over the individual articles in the enacting terms of a text. In principle, this approach would also allow us to extract the content of a reference beyond the simple internal/external distinction.
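A much-simplified version of the reference-counting idea is sketched below. The regular expressions are illustrative assumptions standing in for the iterative matching, splitting and parsing performed by the actual EUPLEX parser:

```python
import re

# Hypothetical reference patterns (the real parser is considerably
# more involved): internal references point within the act itself,
# external references point to other named EU acts.
INTERNAL = re.compile(r"\b(?:Article|paragraph)\s+\d+[a-z]?\b", re.IGNORECASE)
EXTERNAL = re.compile(
    r"\b(?:Regulation|Directive|Decision)\s+(?:\(E[UC]\)\s+)?(?:No\s+)?\d+/\d+\b"
)

def count_references(articles):
    """Count internal and external references over a list of article texts."""
    internal = sum(len(INTERNAL.findall(a)) for a in articles)
    external = sum(len(EXTERNAL.findall(a)) for a in articles)
    return {"internal": internal, "external": external}

arts = [
    "As referred to in Article 3, and in Directive 2009/138/EC ...",
    "Paragraph 2 of Article 5 shall apply.",
]
counts = count_references(arts)  # {'internal': 3, 'external': 1}
```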
For our readability indicators, we rely on a forked version of the spacy_readability pipeline component (Holtzscher 2018, fork available from https://github.com/ghxm/euplexCy_readability). For indicators that require the number of sentences, we use the default DependencyParser component implemented in spaCy version 3.0 (Honnibal et al. 2020), with some additional checks (removing obvious non-sentences, such as paragraph numbering) to segment the text into sentences. To calculate the word entropy, we use the lemmas as identified by the default spaCy Lemmatizer. The linguistic indicators are computed on the cleaned version of the document text, containing only the previously identified citations, recitals and enacting terms.
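The word entropy itself is the Shannon entropy of the lemma frequency distribution. A minimal sketch, assuming the lemmas have already been obtained from the spaCy Lemmatizer:

```python
import math
from collections import Counter

def word_entropy(lemmas):
    """Shannon entropy (in bits) of the lemma frequency distribution.

    `lemmas` is assumed to be the list of lemmas produced by the
    default spaCy Lemmatizer for the cleaned document text.
    """
    counts = Counter(lemmas)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Four equally frequent lemmas -> log2(4) = 2 bits
h = word_entropy(["member", "state", "shall", "apply"])
```

Higher entropy indicates a more varied (and, by this measure, more complex) vocabulary; a text that repeats a single lemma has entropy 0.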
Apart from the general cleaning routine described above, some EU legal texts stored in EUR-Lex, particularly legislative proposals, exhibit irregularities that mostly seem to stem from the export of other formats into HTML by EUR-Lex, or from sloppy or incomplete adherence to the EU style guide. Within our sample of downloaded legal texts, we encounter several recurring types of formatting problems, listed below.
Currently, we exclude documents exhibiting one of the identified formatting problems from the complexity analysis, with the exception of missing line breaks, where we try to re-introduce the necessary breaks and discard a document only if its ratio of line breaks/new lines to text lines is $<0.003$. Within our sample of downloaded proposal texts, we identify the following distribution of formatting problems:
| Formatting problems | Complexity indicators available: No | Complexity indicators available: Yes | Total |
|---|---|---|---|
| Missing line-breaks | 265 | 1,761 | 2,026 |
| Multiple proposals + missing line-break | 284 | 0 | 284 |
| Multiple proposals in document | 1,573 | 0 | 1,573 |
| No formal end | 857 | 0 | 857 |
| No formal end + missing line-breaks | 211 | 0 | 211 |
| None | 489 | 4,393 | 4,882 |
| Total | 3,679 | 6,154 | 9,833 |
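The line-break heuristic mentioned above could be sketched as follows. The 0.003 cutoff comes from the text; the exact definition of "text lines" is our assumption here (we use newline characters per character of text as a proxy):

```python
def missing_linebreak_ratio(text):
    """Newline characters per character of text.

    Illustrative proxy: the paper's ratio of 'line breaks to text
    lines' may be defined slightly differently in the EUPLEX code.
    """
    return text.count("\n") / max(len(text), 1)

def keep_document(text, cutoff=0.003):
    # Discard a document if, even after re-introducing line breaks,
    # its line-break ratio stays below the cutoff.
    return missing_linebreak_ratio(text) >= cutoff
```

For reference, ordinary prose with roughly 80-character lines has a ratio of about 0.0125, well above the cutoff, so only near-total absence of line breaks triggers exclusion under this reading.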
Some texts do not exhibit any detected formatting problems, but we were nonetheless unable to generate the complexity indicators for them. This may be due to entirely empty documents or documents not containing the expected texts.
Intercoder reliability for hand coding vs. automated coding (Krippendorff's alpha):

| Element | Metric | 2.5% | Mean | 97.5% |
|---|---|---|---|---|
| Citations | nominal | 0.9353712 | 0.9741485 | 1.0000000 |
| | ordinal | 0.9838406 | 0.9939187 | 1.0000000 |
| | interval | 0.9554066 | 0.9840738 | 1.0000000 |
| Recitals | nominal | 0.8754734 | 0.9273595 | 0.9792456 |
| | ordinal | 0.8363162 | 0.9441479 | 0.9993540 |
| | interval | 0.8035378 | 0.9170202 | 0.9960159 |
| Articles | nominal | 0.9078703 | 0.9539352 | 0.9884838 |
| | ordinal | 0.9865558 | 0.9954892 | 0.9999882 |
| | interval | 0.9974955 | 0.9987478 | 0.9996869 |
| Int. references | nominal | 0.2428854 | 0.3734224 | 0.5039594 |
| | ordinal | 0.7039010 | 0.7974818 | 0.8775587 |
| | interval | 0.7062923 | 0.8612381 | 0.9768854 |
| Ext. references | nominal | 0.3447282 | 0.4504172 | 0.5561062 |
| | ordinal | 0.8732229 | 0.9086786 | 0.9400003 |
| | interval | 0.5869365 | 0.8199596 | 0.9709558 |
| Int. + ext. ref. | nominal | 0.2857143 | 0.3877551 | 0.4897959 |
| | ordinal | 0.8454434 | 0.9039655 | 0.9489077 |
| | interval | 0.9607444 | 0.9770680 | 0.9891796 |
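For intuition, the nominal-level variant of Krippendorff's alpha for two coders with no missing values can be computed compactly. This is a sketch of the statistic itself, not of the EUPLEX reliability code (which additionally reports bootstrapped 2.5%/97.5% bounds, omitted here):

```python
from collections import Counter

def krippendorff_alpha_nominal(coder1, coder2):
    """Krippendorff's alpha, nominal level, two coders, complete data.

    alpha = 1 - D_o / D_e, where D_o is the observed disagreement and
    D_e the disagreement expected from the pooled value frequencies.
    """
    values = list(coder1) + list(coder2)
    n = len(values)
    # Observed disagreement: fraction of units the coders code differently
    d_o = sum(a != b for a, b in zip(coder1, coder2)) / len(coder1)
    # Expected disagreement from pooled value frequencies
    freq = Counter(values)
    d_e = sum(freq[c] * freq[k] for c in freq for k in freq if c != k) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Perfect agreement yields alpha = 1; systematic disagreement goes negative
alpha = krippendorff_alpha_nominal([1, 1, 2, 2], [1, 1, 2, 2])  # 1.0
```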
Within our sample, not all legislation is created equal. While some proposals contain entirely new policies, others simply serve to amend existing laws. Additionally, some acts establish new rules to a large extent while also amending some existing ones. Within our data, the amending and and_amending variables capture these respective cases. When assessing and comparing the complexity of texts of different types, it is important to control for these variables: a simple amending law may, for example, have only one or two formal articles but specify large chunks of provisions to be inserted into existing laws. This may inflate measures that are standardized by the number of articles, such as our relative reference counts. Since we count references within all parts of the enacting terms, including the amending text, and standardize these counts by the number of articles within a specific text, these measures may be increasingly skewed as the amount of amending text grows relative to non-amending text. We thus consider it necessary to include the 'amending' variables in any analysis that seeks to identify differences in complexity.
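The inflation effect described above is easy to see numerically. The figures below are hypothetical, chosen only to illustrate why per-article standardization is sensitive to amending text:

```python
def refs_per_article(n_references, n_articles):
    """Relative reference count: references standardized by the
    number of formal articles in the text."""
    return n_references / n_articles

# A hypothetical new-policy proposal: 40 references over 20 articles
new_policy = refs_per_article(40, 20)  # 2.0

# A hypothetical amending act: only 2 formal articles, but they insert
# long amending passages containing the same 40 references
amending = refs_per_article(40, 2)     # 20.0
```

With identical reference counts, the amending act scores ten times higher simply because its formal article count is small, which is why the 'amending' variables belong in any comparative analysis.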