Invisible Victims: Estimating Underreporting in the Armed Conflict


The internal armed conflict in Colombia spans a large part of the country's history. The dispute for power and territorial control between armed groups and state institutions has produced widespread human rights violations, mainly against the civilian population, which is always caught in the middle of the conflict. Within the framework of the Final Agreement for the Termination of the Conflict and the Construction of a Stable and Lasting Peace, signed in 2016, the Commission for the Clarification of Truth, Coexistence, and Non-Repetition (Truth Commission) was created in 2017 as a mechanism to establish the truth of what happened during the armed conflict and to contribute to clarifying the violations and infractions committed (Truth Commission, 2017). The Commission also seeks to provide inputs for the construction of state policy on victims.

The Truth Commission works hand in hand with the Special Jurisdiction for Peace (JEP), the justice component of the Comprehensive System of Truth, Justice, Reparation, and Non-Repetition created under the Final Agreement. The joint work of these two organizations, with the support of the Human Rights Data Analysis Group (HRDAG), resulted in the Final Report of the Truth Commission, one of whose objectives was to present findings about the armed conflict in Colombia. The construction of this report posed an enormous challenge: estimating the underreporting of victims of the conflict. María Juliana Durán and Paula Andrea Amado, HRDAG consultants, explain the methodology developed to face this challenge.

(See the seminar Truth Commission data: estimation of the underreporting of victims.)

The conflict in Colombia has been extensively documented by entities with different methodologies and scopes. HRDAG made use of 112 databases from 44 different sources, which together constituted nearly 13 million records. The project focused on a statistical analysis of patterns of violence in the period 1985-2018: eliminating duplicate records across the 112 databases, imputing missing fields in the records, and estimating underreporting. All of this was done for four types of human rights violations: homicide, kidnapping, recruitment, and disappearance. As an example, Figure 1 shows the number of victims by perpetrator, according to the Single Registry of Victims (RUV) and the National Center of Historical Memory (CNMH).

Figure 1. Comparison of RUV and CNMH homicide records, by perpetrator. Prepared by HRDAG.

To achieve these objectives, the project was divided into three stages.

  1. Deduplication or Record Linkage

To identify multiple records that may belong to the same victim, three categories of models were defined:

  • Block model: analyzes pairs of records to examine whether they correspond to the same event. For this purpose, groups of records that share certain characteristics are selected. The selection rules are not immediate (it is not easy to enumerate a set of explicit criteria that applies to all the data), so Michelle Dukich, an HRDAG researcher with expertise in record linkage and database cleaning, was asked to act as an "oracle"1 and label a sample of about two million records as duplicates or non-duplicates. A machine learning model was then trained to "learn" the rules that Michelle was implicitly using and apply them to classify the rest of the data.
  • Classification model: with the labels assigned in the previous stage, coreferent records (that is, records that possibly refer to the same event) are identified. The classification model then estimates a score between 0 and 1 that indicates the similarity of each pair of records (close to 1 if they are similar and close to 0 otherwise).
  • Group model: clusters of records are built from the pairwise classifications of the previous model. The similarity score between records is compared against a threshold above which a pair of records is assumed to correspond to the same person and the same violent event. The 'correct' threshold was estimated with training data classified by Michelle Dukich (a minimal sketch of this blocking, scoring, and thresholding logic appears after this list). Figure 2 shows that the counting error after applying this method is very low.
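
To make the pipeline concrete, here is a minimal R sketch of the blocking, pairwise scoring, and thresholding idea. It is not HRDAG's code: the records and the threshold are invented, and a plain string similarity (from the stringdist package) stands in for the trained classifier.

```r
# Minimal sketch of blocking + pairwise scoring + thresholding.
# Records, fields, and the 0.9 threshold are illustrative only.
library(stringdist)

records <- data.frame(
  id   = 1:4,
  name = c("Juan Perez", "Juan Peres", "Maria Gomez", "Rosa Diaz"),
  year = c(1995, 1995, 2002, 2002),
  muni = c("Cali", "Cali", "Pasto", "Pasto")
)

# 1. Blocking: only compare pairs that share year and municipality.
pairs <- merge(records, records, by = c("year", "muni"), suffixes = c(".a", ".b"))
pairs <- pairs[pairs$id.a < pairs$id.b, ]

# 2. A similarity feature per candidate pair (Jaro-Winkler on names).
#    In the real project, a classifier trained on the hand-labeled pairs
#    produces this 0-1 score from many such features.
pairs$score <- stringsim(pairs$name.a, pairs$name.b, method = "jw")

# 3. Thresholding: pairs above the tuned cutoff are treated as the same
#    victim and merged into clusters of coreferent records.
threshold <- 0.9
pairs$same_victim <- pairs$score >= threshold
pairs[, c("id.a", "id.b", "score", "same_victim")]
```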

Figure 2. Results of testing the cluster model in the identification of duplicate records. Prepared by HRDAG.

At the end of this process, it was concluded that, of the almost 13 million initial records, 8,775,884 are unique.

The probability of a victim being registered in several databases depends, for example, on the city in which he or she resides and whether or not he or she is a public figure. Eliminating these duplicate records is essential so that the data accurately reflect the dynamics of the conflict and do not over-represent certain groups.


1 In this context, an "oracle" is a person or tool that, like Michelle, is responsible for assigning labels of a particular category to each observation in a database.

  2. Statistical imputation of missing fields

The database of unique records generated in the previous stage contains characteristics (fields) of each documented violent event: for example, the organization to which the perpetrator belonged; the gender, age, and ethnicity of the victim; and the date and location of the event. With such a large number of records, missing fields are not a problem per se. If fields were missing at random, this would not have a major impact on the analysis. However, the fact that a field is missing is related to variables such as the date of the event (because the victim registration system has changed over time) or the location (because reporting capabilities differ across regions).

Taking these observed variables into account corrects for this bias. The researchers hypothesize that the missing values of a variable follow a pattern similar to the observed values, conditional on the other observed variables. They therefore work under the MAR (Missing at Random) assumption and use multiple imputation by chained equations: each equation has a specification conditional on the observed data, containing information on how the variables are related (for example, the gender and age of the victim) plus a random component, and is used to impute the missing fields in each record. The process is repeated multiple times to generate several completed data sets, the parameters of interest are estimated on each one, and finally the estimates are combined applying Rubin's combination rules. This methodology was implemented using the predictive mean matching (PMM) method of the R mice package.
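
As a hedged illustration of this workflow, the following R sketch uses the mice package cited above; the toy data frame and the analysis model are illustrative, not the project's actual variables.

```r
# Minimal sketch of multiple imputation with predictive mean matching (PMM).
# The data frame and the analysis model are illustrative, not the project's.
library(mice)

set.seed(19)
dat <- data.frame(
  age    = c(23, NA, 41, 35, NA, 29, 52, NA, 33, 46),
  year   = c(1998, 2002, 2002, 2007, 1998, 2007, 1995, 2010, 2002, 1998),
  female = c(0, 1, NA, 0, 1, 1, 0, 0, NA, 1)
)

# Impute m = 5 completed datasets with chained equations; "pmm" fills each
# missing value with an observed value whose prediction is closest.
imp <- mice(dat, m = 5, method = "pmm", printFlag = FALSE)

# Fit the analysis model on each completed dataset ...
fits <- with(imp, lm(age ~ year + female))

# ... and combine the five estimates using Rubin's rules.
summary(pool(fits))
```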

The different databases contain heterogeneous information, so the variables are divided into base variables (those that may have missing information and are relevant to the study) and support variables. The latter are constructed using additional information from the databases: for example, the murder weapon, the vereda (rural district) or location, the victim's profession, or the narrative account of the event. It is precisely the account of the event that, after being cleaned and lemmatized, is used to train a Long Short-Term Memory (LSTM) neural network that estimates, as a supporting variable, a probability score for each category of a violent event (e.g., that it was a homicide, or that it was committed by guerrilla groups).
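
A sketch of such a text classifier, assuming the keras R interface and a toy corpus (the post does not detail the project's actual architecture or hyperparameters):

```r
# Minimal sketch of an LSTM that scores event narratives for one category
# (e.g., "was it a homicide?"). Corpus, sizes, and hyperparameters are
# illustrative; the project's actual architecture may differ.
library(keras)

accounts <- c("the victim was shot by armed men",
              "the person was taken and never seen again")
labels   <- c(1, 0)  # toy labels: 1 = homicide narrative

# Turn the cleaned, lemmatized accounts into padded integer sequences.
tokenizer <- text_tokenizer(num_words = 20000)
tokenizer <- fit_text_tokenizer(tokenizer, accounts)
x <- pad_sequences(texts_to_sequences(tokenizer, accounts), maxlen = 100)

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 20000, output_dim = 64) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")  # probability score in [0, 1]

model %>% compile(optimizer = "adam", loss = "binary_crossentropy")
model %>% fit(x, labels, epochs = 3, batch_size = 2)

# The fitted scores become supporting variables for the imputation step.
scores <- model %>% predict(x)
```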

  3. Estimation of underreporting of victims

Again, underreporting (events that are not recorded in any database) does not occur by chance, but because of structural features of the armed conflict: the absence of entities to receive complaints, fear of reporting due to possible threats, and geographical difficulties, among others. In addition, there are ethnic, geographic, or ideological groups that are systematically less likely to be registered when they experience a human rights violation.

Given the above, the researchers employed Multiple Systems Estimation (MSE) to adequately address the problem of underreporting of victims. MSE is applicable when four assumptions are met:

  • The estimated population is closed. That is, once a person enters a state (e.g., homicide), he or she does not leave that state. This is reasonable because of the nature of the states considered.
  • The linkage of records is accurate: this is ensured by the database processing developed in step 1.
  • Independence of sources, i.e., the fact that an event is documented in one source does not affect the probability that it is also documented in other sources. Although this is rarely strictly true, the literature has shown that the problem is reduced when three or more sources are used; the study uses more than 100.
  • Homogeneous capture probability among observation units, i.e., all events have the same probability of being documented in each database used. As discussed, this is not realistic, so a stratification approach was applied: subsets of the data are defined within which the condition does hold. For example, stratifying by year gave good results, with each stratum serving as a proxy for homogenizing the capture probability. A Bayesian non-parametric type of MSE called "Multiple Latent Class Models for Capture-Recapture" is used to implement this solution.

This approach seeks to approximate the real size of a population (the number of victims of each human rights violation) from the patterns of documentation of the events: if events are documented multiple times by several sources2 and the above assumptions are met, underreporting is not very significant. On the other hand, if the overlap between databases is small, underreporting is estimated to be larger. A minimal sketch of this estimation follows.
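
To illustrate the mechanics, the R sketch below uses the LCMCR package, an implementation of Bayesian non-parametric latent class capture-recapture models of the kind named above. The three lists, capture patterns, and counts are invented for the example; in practice the model would be fitted within each stratum (for example, each year).

```r
# Minimal sketch of multiple systems estimation with the LCMCR package
# (Bayesian non-parametric latent class capture-recapture). The lists and
# counts below are invented; real runs are done per stratum (e.g., per year).
library(LCMCR)

# Each row is an observed capture pattern across three illustrative lists
# (factors with levels 0/1) and the number of victims with that pattern.
captures <- data.frame(
  listA = factor(c(1, 1, 0, 1, 0, 0, 1), levels = c(0, 1)),
  listB = factor(c(1, 0, 1, 1, 0, 1, 0), levels = c(0, 1)),
  listC = factor(c(0, 1, 1, 0, 1, 1, 1), levels = c(0, 1)),
  Freq  = c(12, 31, 27, 9, 60, 18, 25)
)

# Fit the latent-class model and draw posterior samples of the total
# population size N (documented victims plus estimated underreporting).
sampler <- lcmCR(captures = captures, tabular = TRUE, K = 5, seed = 19)
N <- lcmCR_PostSampl(sampler, burnin = 10000, samples = 1000, thinning = 20)

# Posterior median and a 95% interval for the true number of victims.
quantile(N, c(0.025, 0.5, 0.975))
```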

By applying this method, a confidence interval for the actual number of victims can be estimated for each year between 1985 and 2018. Figure 3 shows the result for the case of enforced disappearance. Among other findings, it shows that the increase in this phenomenon in the early 2000s was more pronounced than previously believed, and that underreporting peaked in 2007.

Figure 3. Victims of enforced disappearance between 1985 and 2016. Taken from the Methodological Report of the joint JEP-CEV-HRDAG data integration and statistical estimation project (2022).


2 In the case of forced disappearance, most records are provided by the RUV, which compromises the robustness of the methodology according to the National Administrative Department of Statistics (DANE, 2023).

Some implications

The results of this analysis are relevant on two fronts. On the one hand, more precise records of human rights violations allow for better quantitative analysis of these phenomena and more accurate formulation of public policy on victims. On the other hand, underreporting is in itself an interesting variable to study: it indicates which periods, places, and population groups have had the worst monitoring of human rights violations during the armed conflict and, thus, how well state and non-state capacities to document these phenomena have worked. In sum, this research makes a great contribution to clarifying the magnitude and characteristics of the violence generated by a phenomenon as complex as the Colombian armed conflict.

The results of the estimates for the different human rights violations are published by the National Administrative Department of Statistics (DANE, 2023). They can be accessed through this link. The final database, constructed with the methodology described here, is organized temporally and geographically, which will allow the development of future research on the dynamics of the conflict and its relationship with other institutional, social, and economic variables. 
