CBAQuest Final Report, Including Tools

CBAQuest: Exploration of society through the lens of labour market related documentation

How can text mining tools help to discover the secrets of collective agreements?


The world of work is changing rapidly. These changes present new opportunities to revisit the value of collective bargaining agreements, both as tools for protecting workers’ rights and as historical documents for understanding developments in industrial relations. Collective bargaining agreements (CBAs) are the result of negotiations between employers and unions to regulate the terms and conditions of employment.

The increasing number of agreements available online provides a rich source of information for users to access for different purposes. The WageIndicator Foundation gives visitors to its website the ability to read and compare the original texts of agreements by topic, at national and international levels. The existing database compiled by the WageIndicator Foundation is one of the few online resources improving labour market transparency by providing free and easy access to labour market information, including the texts of CBAs. Users can read existing agreements to better understand historical developments in collective bargaining and to set realistic expectations for negotiations.

But as the amount of data available on collective bargaining grows, so too does the challenge of navigating and understanding it. The utility of the agreements for interested parties depends on the availability and coherence of the texts, as well as the ease of use of the tools provided for accessing them. Digital research methods give researchers the opportunity to find new insights into the agreements, and into the affordances and limitations of digital methods for understanding society more broadly.

This project explores the feasibility of assessing the ‘worker-friendliness’ of collective bargaining agreements.

It asks:

  • How might we evaluate the ‘worker-friendliness’ of collective bargaining agreements?
  • What are the possibilities and limitations of applying text mining methods to facilitate understanding of collective bargaining agreements?

In doing so, we hope to begin exploring new ways of understanding agreements and to contribute to improving global labour market transparency by building on the existing work of the WageIndicator Foundation. We also hope to contribute to a wider conversation about the significance and utility of digital humanities projects, both within and outside academic settings.

Defining ‘worker-friendliness’

We will rate the worker-friendliness of collective labour agreements by considering three measures:

  • Equality
  • Overtime and annual leave
  • Text accessibility


The texts analysed for this project are from the WageIndicator Foundation’s collective bargaining agreement database. These agreements were collected from 58 countries in 28 languages. Their periods of validity range from the 1950s to 2024.

There are two datasets available: one with the full agreement texts and another with selected clauses, chosen by annotators in response to particular questions and organized by topic.


We measure worker-friendliness on equality and discrimination topics by evaluating four indicators addressed in the CBAs: gender equality, discrimination, sexual harassment and grievance procedure. These four indicators group together similar issues mentioned in different variables; for example, the ‘gender equality’ indicator includes 11 binds related to women workers. The specific composition of each indicator is explained below. Worker-friendliness in this context means valuable information being available to workers in the CBA.

There are many more clauses and variables that fit the topic of equality and discrimination. However, we decided to focus on the variables that fall under the geneq (gender equality) trigger; in this division we follow the selection made by WageIndicator. Our score is just a sample tool that shows which areas might need further attention when drawing up a CBA.

Gender equality

The gender equality indicator includes 11 variables under the ‘geneq’ trigger that are implicitly or explicitly directed at women workers. We measure this indicator by counting the mentions of the separate binds and dividing them by the total number of variables (11). The average score on the gender equality indicator can be seen in Figure 1.

Fig 1. Average score on gender equality indicator
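As a sketch, the indicator computation can be expressed in a few lines of Python. The (CBA id, variable) record layout below is an illustrative assumption, not the actual structure of the WageIndicator dataset:

```python
# Sketch of the gender equality score: share of the 11 'geneq' variables
# mentioned in each CBA. The (cba_id, variable) record layout is an
# illustrative assumption, not the actual WageIndicator data structure.
from collections import defaultdict

GENEQ_VARS = {
    "gender", "eqpromotion", "eqtraining", "eqofficer",
    "equalityexcludedtrigger", "equalitydifferenttrigger",
    "equalitymonitoring", "equalityotherclause", "equalitytxt",
    "violenceleave", "support_disabilities",
}

def gender_equality_scores(records):
    """Map each CBA id to the fraction of the 11 variables it mentions."""
    mentioned = defaultdict(set)
    for cba_id, variable in records:
        if variable in GENEQ_VARS:
            mentioned[cba_id].add(variable)
    return {cba: len(found) / len(GENEQ_VARS) for cba, found in mentioned.items()}
```

A CBA mentioning two of the eleven binds would score 2/11 before rescaling.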

It is important to note that different countries have different numbers of CBAs included in the WageIndicator database, ranging from 1 CBA to several dozen. For this reason, no generalizations should be made about any particular country based on the score.

The variables included in gender equality

Variable name: Asked question

gender: Does the clause make a special reference to gender?
eqpromotion: Does the CBA contain clauses on equal opportunities for promotion for women workers?
eqtraining: Does the CBA contain clauses on equal opportunities for training and retraining for women workers?
eqofficer: Does the CBA contain clauses which provide for a gender equality trade union officer at the workplace?
equalityexcludedtrigger: Are there groups of women workers (e.g. temporary workers) which are excluded from any of the above clauses?
equalitydifferenttrigger: Are there groups of women workers which are under different arrangements from those specified in the above clauses (e.g. part-time workers)?
equalitymonitoring: Does the agreement contain clauses for monitoring gender equality?
equalityotherclause: Does the CBA contain any other clause on gender equality?
equalitytxt: Comments regarding gender equality issues
violenceleave: Does the agreement provide for a special leave for workers subjected to domestic or intimate partner violence?
support_disabilities: Does the agreement provide for support for women workers with disabilities?


In the annotated dataset there is one specific variable referring to discrimination. Apart from it, we also classified the ‘equal pay’ variable under this topic. Clauses related to equal pay focus on the same payment for everyone, taking into account not only gender but also race, religion, age, etc. The average score on the discrimination indicator can be seen in Figure 2.

Fig 2. Average score on discrimination indicator

The variables included in discrimination

Variable name: Asked question

discrimination: Does the agreement contain clauses addressing discrimination at work?
eqpay: Does the agreement contain clauses on equal pay for work of equal value?

Sexual harassment

The sexual harassment indicator includes the single variable that specifically addresses sexual harassment in the workplace. It is scored on a binary scale: if the variable is mentioned, the CBA gets 1 point; if it is not, it scores 0. The average score on the sexual harassment indicator can be seen in Figure 3.

Fig 3. Average score on sexual harassment indicator.

The variables included in sexual harassment

Variable name: Asked question

sexualhar: Does the agreement contain clauses addressing sexual harassment at work?

Grievance procedure

The annotated variables mention various issues regarding equality and discrimination, but the ways to solve those issues are not explicitly marked. When the text of a CBA indicates that an issue is noticed but offers no solution, problems can persist. We feel that information about the complaint/grievance procedure is crucial for workers and thus an important measure of the worker-friendliness of CBAs. We measure this indicator by scanning the CBAs for the terms ‘grievance’, ‘procedure’ and ‘complaint’. The search words were translated into 12 languages besides English (Spanish, Polish, Greek, Italian, French, Finnish, Portuguese, Czech, Slovak, Dutch, Turkish, Chinese).

Unlike the previous three topics, our current data has no dedicated trigger describing complaint/grievance procedures, so our approach is to iterate over the text of each clause, matching terms related to complaint/grievance procedures, and to record the result as a binary variable: if any of these terms occurs in the clause text, the clause scores 1, otherwise 0. Once each clause has a binary score, we aggregate the scores by CBA id to obtain an initial score showing how informative each CBA is in this respect. The average score on the grievance procedure indicator can be seen in Figure 4.

Fig 4. Average score on grievance procedure indicator
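A minimal sketch of this term search, assuming clauses arrive as (CBA id, language, text) tuples; plain substring matching stands in here for the fuzzy matching used in the project, and only two of the 13 term lists are shown for illustration:

```python
# Sketch of the grievance/complaint term search. The term lists are
# illustrative and cover only two of the 13 languages; the project
# used fuzzy matching rather than the plain substring test shown here.
from collections import defaultdict

GRIEVANCE_TERMS = {
    "en": ["grievance", "procedure", "complaint"],
    "es": ["queja", "procedimiento", "reclamación"],
}

def clause_has_procedure(text, language):
    """Binary clause-level score: 1 if any search term occurs, else 0."""
    lowered = text.lower()
    terms = GRIEVANCE_TERMS.get(language, [])
    return int(any(term in lowered for term in terms))

def procedure_scores(clauses):
    """Aggregate clause-level scores by CBA id."""
    scores = defaultdict(int)
    for cba_id, language, text in clauses:
        scores[cba_id] += clause_has_procedure(text, language)
    return dict(scores)
```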

The variables included in grievance procedure

Variable name: Description

procedure: The procedure score is calculated in two steps. Clause level: a binary score showing whether complaint/grievance procedures exist, based on the different equality issues/triggers and CBAs. CBA level: the clause-level scores are aggregated by CBA id into a summed value for each CBA, which indicates how much information on complaint/grievance procedures the CBA provides for the different issues. In general, the higher the score, the more clauses in the CBA contain descriptions of complaint/grievance procedures.

Final score calculation formula

In order to balance the scores of the different indicators, we rescale the scores of the four indicators so that they all fall in the 1-5 range.

The rescaling formula is:

X_scaled = X_std * (max – min) + min

where X_std is the min-max standardized value calculated by the formula below:

X_std = (X – X.min) / (X.max – X.min)

The final score is produced as the weighted average of all four indicators; the default weights are all set to 0.25, which makes the final score the simple average of the four indicators.

final_score = 0.25*procedure + 0.25*gender_equality + 0.25*discrimination + 0.25*sexual_harassment
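The scaling and weighting steps can be sketched as follows (a simplified illustration, not the project’s actual code):

```python
# Min-max rescaling of raw indicator scores into the 1-5 range,
# followed by the default equally weighted final score.
def scale(values, lo=1.0, hi=5.0):
    """Min-max scale a list of raw indicator scores into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:  # all scores equal: avoid division by zero
        return [lo for _ in values]
    return [(v - vmin) / (vmax - vmin) * (hi - lo) + lo for v in values]

def final_score(procedure, gender_equality, discrimination, sexual_harassment):
    """Weighted average of the four scaled indicators (default weights 0.25)."""
    return (0.25 * procedure + 0.25 * gender_equality
            + 0.25 * discrimination + 0.25 * sexual_harassment)
```

With equal weights of 0.25, the weighted average is simply the mean of the four indicator scores; changing the weights lets a user emphasize, say, grievance procedures over the other indicators.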

The weights can be adjusted according to the user’s needs in the future, and a rescaling to the 1-5 range will also be applied to the final_score to make it comparable to topics other than equality.

Data visualization and findings:

The figure above shows the gap and relationship between the total number of CBAs, the number of CBAs with equality-related indicators, and the number of CBAs with both equality-related triggers and procedures mentioned. The dataset contains 1247 CBAs in total. Of those, 584 were found to be related to the indicators, meaning they contain the equality-related triggers. Among these 584 gender-equality-related CBAs, 101 were found to mention procedure-related terms in the 13 languages listed above.

The figure above is a stacked bar chart which illustrates the contribution of each indicator to the overall score. It also shows the difference in the overall scores between countries.

This graph shows the number of CBAs we have in the current dataset for each country. The graph clearly shows that many countries have only a small number of CBAs, which may make our current score results somewhat unrepresentative for countries with less data. This problem will be mitigated as the database continues to expand.

Our limitations:

– We focused our score on the geneq trigger, which includes variables on gender equality and discrimination. However, more variables could fall under the umbrella of gender equality and discrimination, for example binds referring to breastfeeding and maternity leave found under the ‘workfam’ trigger.

– When talking about gender equality, we are conditioned to talk about it in binary terms. This is a methodological issue, partly because of the way the CBAs were annotated, but mainly because of the way most CBAs are written.

More possibilities/future:

  • A more comprehensive scoring system, that includes all of the variables that can be connected to the topics of equality and discrimination.
  • Accounting for the complexity of time: treating it not just as a variable, but as a factor in its own right.

Code for calculating the scores, and some visualizations, can be found in the Jupyter notebook GenderEqualityScores.ipynb in the EavanXing0416/DHH21_CBA repository.


Overtime & Annual Leave

In order to measure the worker-friendliness of CBAs from the perspective of the annual leave and overtime triggers, we created a formula to score individual CBAs. We first selected related indicators and counted word frequencies to gain more insight from the clauses. Considering practicality, and after reading the related literature, we ended up with three binary indicators:

  1. whether there are regulations on overtime,
  2. whether there is travel allowance provided, and
  3. whether the number of days of annual leave after 1 year of working is above the international standard of 15 working days (ILO Holidays with Pay Convention C132, 1970).

Selection of indicators

We did preliminary research based on recent studies of annual leave and overtime work (Wooden and Warren, 2008; Skinner and Pocock, 2013; Ostoj, 2019), as well as news coverage. In their study focusing on Australia, Skinner and Pocock (2013) state that ‘there are significant work–life penalties for not taking paid annual leave – particularly for workers with parenting responsibilities and for women.’ This shows that such clauses matter for employees’ work and life, so we believe the annual leave and overtime triggers deserve further research and analysis. Word frequency analysis in Python was also conducted to help us identify interesting terms to focus on. After assessing practicability, we decided to move forward with the three binary indicators mentioned above.

Indicators and formula

For the first two indicators, we make use of the existing annotations that come with the data to check if these two conditions are met. After reviewing the dataset, we have decided to focus on two labels – ‘overtime_trigger’ and ‘annuleav_trigger’. The first one indicates that the corresponding clause concerns overtime regulations and the second one is for travel allowance. By utilizing these two labels, we can easily determine whether points should be given to a CBA based on these two criteria.

As for the third indicator, we use the label ‘holidaysdays’ to locate all clauses that concern the number of days of annual leave, and extract the numbers by regular expression.
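A sketch of that extraction step, assuming a simple clause format; real ‘holidaysdays’ clauses are more varied than this first-number pattern allows for, and whether exactly 15 days counts as meeting the standard is our assumption:

```python
# Illustrative sketch of extracting the number of annual leave days
# from a 'holidaysdays' clause with a regular expression. Real clauses
# are more varied than this simple first-number pattern assumes.
import re

def annual_leave_days(clause_text):
    """Return the first integer found in the clause, or None."""
    match = re.search(r"\d+", clause_text)
    return int(match.group()) if match else None

def meets_ilo_standard(clause_text, threshold=15):
    """Indicator 3: 1 if the extracted number of days reaches the ILO C132
    benchmark of 15 working days, else 0 (treating 15 itself as passing
    is our assumption)."""
    days = annual_leave_days(clause_text)
    return int(days is not None and days >= threshold)
```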

The value of an indicator would be one if the answer is yes and zero if the answer is no. The final score is calculated based on the following formula:

(ind1 + ind2 + ind3)/3*4 + 1
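The formula as a small Python function: each indicator is 0 or 1, so the score runs from 1 (no indicator met) to 5 (all three met).

```python
# Worker-friendliness score for the overtime & annual leave measure:
# three binary indicators mapped linearly onto the 1-5 range.
def overtime_leave_score(ind1, ind2, ind3):
    return (ind1 + ind2 + ind3) / 3 * 4 + 1
```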

Findings and limitations

We have applied the formula on all available CBAs in the dataset and generated a bar chart revealing the average worker friendliness scores of CBAs by country from the perspective of annual leave and overtime working (in ascending order).

It is clear that our current results do not seem perfect. As shown in the graph below, some countries such as the UK and Belgium get low average scores while they are often perceived as countries with sophisticated systems protecting workers. There are many possible reasons for this observation.

First of all, the absence of clauses relevant to overtime or annual leave does not always imply low worker-friendliness. There may already be clear laws and regulations in specific countries regarding these aspects, leaving no need for CBAs to have sections dedicated to them; this explains why the assumption that absence means worker-unfriendliness is not always correct. It is important to realize, however, that what we are trying to calculate here is the worker-friendliness of the CBAs themselves, without considering their contexts. In such cases the existing legislation is already sophisticated in these aspects, and the usefulness of CBAs in further improving workers’ rights is limited.

Secondly, the availability of CBAs in the dataset varies across countries. It can be difficult to collect CBAs from countries such as the UK due to reasons like privacy concerns, and the small and biased sample may also affect the accuracy of the final score.

Thirdly, there are also limitations to binary indicators. In the current scoring system, the indicators mostly concern the existence of certain clauses, but the existence of a clause says nothing about its quality and does not necessarily mean that the clause is sufficient to protect workers’ rights.

In terms of the comprehensiveness of our formula, our initial plan included five indicators. However, soon after we started working on them, we realized that two were quite difficult to achieve in such a short period of time: the amount of compensation for overtime work, and whether annual leave can be accumulated. As for the first, different CBAs state the extra pay for overtime work in different ways; for example, some CBAs use percentages of the original pay and some use a fixed amount of money for each hour of overtime work. These variations in expression make it difficult to automatically extract the numbers and evaluate different CBAs on the same scale. For the second, we attempted keyword detection, but issues such as negation made the results inaccurate, and there was not enough data to train a machine learning model, such as a Naive Bayes classifier, for automated prediction. These indicators are left for future work.

Text accessibility

The goal of this indicator is to give each collective labour agreement a score on the accessibility of the text: how easy it is for the workers to understand the contract.

We used three different measures to measure text accessibility:

  • Concreteness
  • Readability
  • Lexical Density

One of the challenges we faced when calculating each measure is that no available method accurately captures concreteness, readability and lexical density for every language in our corpus. We mitigate this limitation by using an ensemble of different methods for each measure and aggregating them. For concreteness, we did find one measure that works for every language.

Concreteness refers to the proportion of abstract versus concrete words used in a text. Texts with relatively more concrete words are more accessible than texts with relatively more abstract words. The method used to extract the proportion of abstract words involves training separate word embedding spaces for each language in the CBA dataset, and then calculating the Semantic Neighborhood Density (SND) score for each word in a language. We hypothesize that words with an SND score under 0.4 are abstract words, and then calculate the percentage of abstract words present in the CBAs per language, as shown in the figure below. For a more detailed explanation of word embeddings and the SND, follow this link.
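Assuming SND scores per word have already been computed from the trained embedding spaces, the final thresholding step can be sketched as follows (the scores in the example are made up for illustration):

```python
# Sketch of the abstractness measure: fraction of tokens whose
# precomputed SND score falls below the 0.4 threshold, our working
# hypothesis for what counts as an 'abstract' word.
def abstract_word_share(tokens, snd_scores, threshold=0.4):
    """Return the fraction of scored tokens that are abstract."""
    scored = [t for t in tokens if t in snd_scores]
    if not scored:  # no token has an SND score
        return 0.0
    abstract = [t for t in scored if snd_scores[t] < threshold]
    return len(abstract) / len(scored)
```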

While concreteness measures referentiality to concrete objects, readability measures the years of education one would need to understand a piece of text, usually operationalized through the number of long words and long sentences. We use the Flesch-Kincaid readability score, the Coleman-Liau index, the Automated Readability Index, and LIX to operationalize readability. While readability looks at language from a syntactic and morphological perspective, lexical density looks at it from a semantic perspective: it is a measure of the number of different words that are used. We use the type/token ratio, hapax legomena frequency, Simpson’s index, and Yule’s K to operationalize lexical density.
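Two of the lexical density measures can be computed directly from token counts, as sketched below; the token lists are assumed to come from a tokeniser, which is simpler than what real multilingual text requires:

```python
# Direct implementations of two lexical density measures: the
# type/token ratio and Yule's K (higher K = lower vocabulary diversity).
from collections import Counter

def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def yules_k(tokens):
    """Yule's K, computed from the frequency spectrum of the tokens."""
    n = len(tokens)
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(i * i * vi for i, vi in freq_of_freqs.items())
    return 10_000 * (s2 - n) / (n * n)
```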

We hypothesize that for a text to be considered readable, it should require fewer years of education and should have low Lexical density.

For each measure we use quintiles to assign scores, comparing each contract with all the contracts written in the same language. If a score falls in the quintile corresponding to high readability scores (more years of education needed to read the text) or high lexical density (a larger vocabulary needed to understand the text), we assign a score of 1; conversely, we assign a score of 5 if it falls in the quintile with low readability/lexical density.
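A sketch of the quintile scoring, using Python’s statistics.quantiles for the cut points (an assumption about implementation, not necessarily what the project used):

```python
# Quintile-based 1-5 scoring: contracts are binned into quintiles of the
# raw measure, and the bin index is reversed when a higher raw value
# (harder text) should yield a lower accessibility score.
from statistics import quantiles
from bisect import bisect_right

def quintile_scores(values, higher_is_harder=True):
    cuts = quantiles(values, n=5)          # four quintile cut points
    def score(v):
        q = bisect_right(cuts, v)          # quintile index 0..4
        return (5 - q) if higher_is_harder else (q + 1)
    return [score(v) for v in values]
```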

We take the average of the measures to generate a readability and lexical density score and, by further averaging these scores, we obtain a final score for text accessibility. The figure below shows the text accessibility scores for various languages.  

A more detailed explanation, including the different methods used for each measure, can be found here.



In summary, our project identified a number of ways that the ‘worker-friendliness’ of agreements might be measured, making use of text mining methods to analyse and score agreements on various indicators. By using and visualising these scores, we have been able to find new ways of evaluating agreements at a glance, in ways that might facilitate understanding of these agreements for labour market researchers and workers in general. But we have also identified a number of challenges and limitations presented by the existing data and by our chosen methods, that invite further research into the secrets of collective bargaining agreements.

We’d like to thank the organizers of #DHH21 very much, and especially Daniela and Stefio for their great leadership. Thanks to our data sponsor WageIndicator for permission to use their datasets and for giving Daniela the time to mentor us as we found our way to the end goal. Thanks to the team: everybody brought their best intentions and skills. In these 10 days, we were able to produce a prototype of a digital tool for researchers, policy makers, workers and anyone seeking useful information in CBAs for improving the lives of workers.

Now, with no further ado, please test drive our tool on its current website here.


Frege, G. (1892). Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100, 25–50.

Jarvis, S. (2013). Capturing the diversity in lexical diversity. Language Learning, 63(s1).


Reilly, M., & Desai, R. H. (2017). Effects of semantic neighborhood density in abstract and concrete words. Cognition, 169, 46-53.

Skinner, N., & Pocock, B. (2013). Paid annual leave in Australia: Who gets it, who takes it and implications for work–life interference. Journal of Industrial Relations, 55(5), 681-698.

Ostoj, I. (2019). Varying paid annual leave length in the world’s economies and its underlying causes. Ekonomicko-manazerske spektrum, 13(1), 62-71.

Wooden, M., & Warren, D. (2008). Paid annual leave and working hours: Evidence from the HILDA survey. Journal of Industrial Relations, 50(4), 664-670.

ILO Holidays with Pay Convention 1970, viewed 27 May 2021, <>.

‘Collective Bargaining Agreements Database’. WageIndicator Subsite Collection.

Published by Nadine: #CBAQuest comms #DHH21

#CBAQuest #DHH #DHH21 #dh #digitalhumanities #SSHOC
