Dataset codebook

The EUPLEX dataset consists of EU legislative procedures. The variables therefore relate to the procedure. Procedures have a number of variables relating directly to them, such as their reference number or their name. Within a procedure, various events, such as the adoption of a proposal or a vote, can occur. These events may have documents attached to them, e.g. the 'adoption of proposal by Commission' event may contain the actual proposal document. Variable naming rules are used to differentiate between event- and document-related variables.

Data structure

Events

All event-related variables use the prefix e_. Every event has a legal date, which is stored in the e_legal_date variable.
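As a rough illustration of the naming rule, the sketch below selects event-related columns by their e_ prefix and parses e_legal_date. It assumes the data has been loaded into a pandas DataFrame. The inline rows are invented for illustration, and only the column names come from this codebook.

```python
import pandas as pd

# Toy rows for illustration only; the values are invented, not dataset entries.
df = pd.DataFrame(
    {
        "procedure_id": ["A", "A"],
        "event": ["proposal", "final"],
        "e_legal_date": ["2010-03-01", "2011-06-15"],
        "doc_title": ["Proposal text", "Final text"],
    }
)

# Event-related variables share the e_ prefix.
event_cols = [c for c in df.columns if c.startswith("e_")]

# In its string form, e_legal_date follows the YYYY-MM-DD pattern.
df["e_legal_date"] = pd.to_datetime(df["e_legal_date"], format="%Y-%m-%d")
```

Parsing the date up front makes it straightforward to compute durations between the events of one procedure.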

Currently only the following events are included in the dataset:

proposal: adoption of the proposal by the submitting institution, usually the Commission (ADP_byCOM); this marks the start of the legislative procedure
final: adoption / publication date of the final law (PUB_OJ, SIGN_byEP_CONSIL, ADP_FRM_byCONSIL, in that order, depending on data availability)
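
The fallback order for the final event can be read as a first-available lookup. The sketch below is one way to express that rule. The function name and the dictionary input are hypothetical, and only the three event tags and their order are taken from the description above.

```python
from typing import Mapping, Optional

# Order of preference as described for the 'final' event.
FINAL_EVENT_PRIORITY = ("PUB_OJ", "SIGN_byEP_CONSIL", "ADP_FRM_byCONSIL")


def pick_final_event(dates_by_tag: Mapping[str, Optional[str]]) -> Optional[str]:
    """Return the first tag in priority order that has a date, or None."""
    for tag in FINAL_EVENT_PRIORITY:
        if dates_by_tag.get(tag):
            return tag
    return None


# Hypothetical procedure without a PUB_OJ date (invented values).
example = {
    "PUB_OJ": None,
    "SIGN_byEP_CONSIL": "2011-06-15",
    "ADP_FRM_byCONSIL": "2011-05-30",
}
print(pick_final_event(example))  # SIGN_byEP_CONSIL
```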

Documents

Documents are always attached to an event. All document-related variables carry the doc_ prefix. To reduce data size, each event row is matched only with the document of the corresponding type. For example, a row where event==proposal has a corresponding doc entry for proposal but none for final. The following documents are included in the dataset:

Variables

Variable	Name	Type	Description
Procedure ID	procedure_id	String	Procedure ID as used in EUR-Lex urls
Procedure reference	procedure_reference	String	Complete procedure reference
Procedure notice CELLAR uri	uri__cellar	String	CELLAR uri of the procedure notice
Legislative procedure type	procedure_type	String	Type of legislative procedure used
Proposal adopted	proposal_adopted	Logical	Was the proposal adopted? (information based on ‘DOSSIER_ADOPTED-PROPOSAL’ tag in procedure notice)
Proposal pending	proposal_pending	Logical	Is the proposal still pending? (information based on ‘DOSSIER_PENDING-PROPOSAL’ tag in procedure notice)
EUROVOC domain(s)	eurovoc_…	Logical	Set of logical indicators marking whether a procedure is tagged with a EUROVOC identifier belonging to the specific EUROVOC domain
Procedure title	title	String	Title of the procedure
Events
Event Name/ID	event	String	Event identifier using ‘nice’ names. ‘proposal’ corresponds to ‘ADP_byCOM’ events in procedure notice. ‘final’ corresponds to `PUB_OJ`, `SIGN_byEP_CONSIL`, `ADP_FRM_byCONSIL` (in that order, depending on data availability) events in procedure notice.
Legal Date	e_legal_date	String (Date) / Integer	The date of an event as registered in the `EVENT_LEGAL_DATE` tag of the procedure notice (YYYY-MM-DD) / STATA: Number of days since 1960-01-01
Event-document CELEX uris	e_doc_celexs	String	The CELEX uri(s) of the main document attached to an event (used for merging documents to events)
Multiple main documents	e_multi_main_docs	Logical	Does the event have multiple main documents? If TRUE, the CELEX uri without a bracket or ‘R’ postfix is preferred for matching
Responsible institution corporate body	e_resp_inst__corp_body	String	The ‘corporate body’ name of the responsible institution for a specific event. For proposal, this is usually the abbreviation for the responsible Commission DG.
Documents
Document Name/ID	doc	String	Document identifier using ‘nice’ names, usually corresponding to an ‘event’ name.
Document CELEX uri	doc_uri_celex	String	The CELEX uri of the document (used for matching events to documents)
Document uris	doc_uris	String (JSON)	All document URIs of the document in JSON format
Legal instrument	doc_leg_instr	String	Legal instrument of the text as noted in the ‘RESOURCE-TYPE’ identifier of a document notice
Instrument subtype	doc_leg_instr_subtype	String	The subtype of legislative instrument of the text (Legislation, Recast, Codification), taken from the title of a text
Implementing act	doc_leg_instr_implementing	Logical	Is the text an implementing act? Taken from the ‘RESOURCE-TYPE’ identifier of a document notice
Amending act	doc_amending	Logical	Is the text an amending act? Based on document title (see online appendix for additional information)
‘and’ amending act	doc_and_amending	Logical	Is the text an ‘and’ amending act? Based on document title (see online appendix for additional information)
Adapting act	doc_adatping	Logical	Is the text an adapting act? Based on document title
Repealing act	doc_repealing	Logical	Is the text a repealing act? Based on document title
‘and’ repealing act	doc_and_repealing	Logical	Is the text an ‘and’ repealing act? Based on document title
Document title	doc_title	String	Title of the text
Policy complexity
Structural size	doc_struct_size	Integer	Number of structural elements in text
Number of articles	doc_articles	Integer	Number of articles in the document
Average element depth	doc_avg_depth	Float	Average element depth of a text (see main text for explanation)
Average article depth	doc_avg_article_depth	Float	Average depth of an article in the text
Word entropy	doc_word_entropy	Float	Word entropy
Word entropy (lemmatized)	doc_word_entropy_l	Float	Word entropy using lemmatized unigram tokens
Lix score	doc_lix	Float	Lix readability score
SMOG	doc_smog	Float	SMOG index for the text. Texts with fewer than 30 sentences are measured as 0
Dale-Chall formula	doc_dale_chall	Float	Dale-Chall formula score for the text
Coleman-Liau index	doc_coleman_liau_index	Float	Coleman-Liau index for the text
FORCAST	doc_forcast	Float	FORCAST formula index for the text
Flesch-Kincaid Readability Score	doc_flesch_grade_level	Float	Flesch-Kincaid grade level for the text
Flesch-Kincaid Reading Ease	doc_flesch_reading_ease	Float	Flesch-Kincaid reading ease for the text
Internal references (interdependence)	doc_ref_int_enacting	Integer	Number of internal references in the enacting text
Relative internal references (interdependence)	doc_ref_int_enacting_rel	Float	doc_ref_int_enacting / doc_articles
External references (embeddedness)	doc_ref_ext_enacting	Integer	Number of external references in the enacting text
Relative external references (embeddedness)	doc_ref_ext_enacting_rel	Float	doc_ref_ext_enacting / doc_articles
Word count (w/o annex)	doc_words_noannex	Integer	Word count for the text excluding the annex text (i.e. citations, recitals, enacting terms). Based on blank English language spaCy version 3.0.1 ‘Tokenizer’ component with some corrections for EU-specific legal identifiers
Complete complexity indicators	doc_complete_complexity	Logical	Are all complexity indicators available (non-missing) for the given document?
Technical details
Bad EUR-LEX formatting indicator	doc__bad_formatting	Logical	Indicates whether the document text formatting as provided by EUR-Lex does not allow for a precise analysis
Bad EUR-LEX formatting reason	doc__bad_formatting_reason	String	Reason for the bad formatting classification
Text source format	doc_format	String	Format of the document
Text parsing failed	doc__euplexcy_failed	Logical	Did the parsing fail?
EUR-Lex search	eurlex_search	Logical	Indicates whether or not the procedure is included in the EUR-Lex search index. See this tweet for more information.
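
A few of the variables above involve simple conversions that can be sketched in code: the Stata day count used for e_legal_date, the relative reference measures (references divided by doc_articles), and the JSON string in doc_uris. The helper names below are hypothetical and all example values are invented. Treating a document with zero articles as missing and assuming doc_uris holds a JSON list are both assumptions, not documented behaviour of the dataset.

```python
import json
from datetime import date, timedelta
from typing import Optional

STATA_EPOCH = date(1960, 1, 1)  # Stata counts daily dates from this day


def stata_days_to_date(days: int) -> date:
    """Convert a Stata day count (as in e_legal_date) to a calendar date."""
    return STATA_EPOCH + timedelta(days=days)


def relative_refs(n_refs: int, n_articles: int) -> Optional[float]:
    """References per article, e.g. doc_ref_int_enacting / doc_articles."""
    if n_articles == 0:
        return None  # assumption: treat zero articles as missing
    return n_refs / n_articles


# Invented example values, not real dataset entries.
print(stata_days_to_date(0))    # 1960-01-01
print(stata_days_to_date(366))  # 1961-01-01 (1960 is a leap year)
print(relative_refs(12, 48))    # 0.25

# doc_uris holds JSON; a list of URI strings is assumed here for illustration.
uris = json.loads('["http://example.org/doc/1", "http://example.org/doc/2"]')
print(len(uris))                # 2
```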

Questions, issues, bugs or suggestions?

The EUPLEX dataset is a living dataset. This means that it is regularly updated to include new procedures, texts and improved measures of their policy complexity.

The data is collected in an automated process, which keeps the large scope of the dataset manageable. As a result, we cannot manually check and validate every case in the dataset. If you have any questions, issues or suggestions on how to improve and expand the dataset, or come across any bugs in the data, please do not hesitate to contact us.