Chasing The Trajectory of Terrorism: A Machine Learning Based Approach to Achieve Open Source Intelligence

Pranav Pandya

24th July 2018

Abstract

In recent years, terrorism has taken a whole new dimension and becoming a global issue because of widespread attacks and comparatively high number of fatalities. Understanding the attack characteristics of most active groups and subsequent statistical analysis is, therefore, an important aspect toward counterterrorism support in the present situation. In this thesis, we use a variety of data mining techniques and descriptive analysis to determine, examine and characterize threat level from top ten most active and violent terrorist groups and then use machine learning algorithms to avail intelligence toward counterterrorism support. We use historical data of terrorist attacks that took place around the world between 1970 to 2016 from the open-source Global Terrorism Database and the primary objective is to translate terror incident related information into actionable intelligence. In other words, we chase the trajectory of terrorism in the present context with statistical methods and derive insights that can be useful.

A major part of this thesis is based on supervised and unsupervised machine learning techniques. We use Apriori algorithm to discover patterns in various groups. From the discovered patterns, one of the interesting patterns we find is that ISIL is more likely to attack other terrorists (non-state militia) with bombing/explosion while having resulting fatalities between 6 to 10 whereas Boko Haram is more likely to target civilians with explosives, without suicide attack and resulting fatalities more than 50. Within the supervised machine learning context, we extend the previous research in time-series forecasting and make use of TBATS, ETS, Auto Arima and Neural Network model. We predict the future number of attacks in Afghanistan and SAHEL region, and the number of fatalities in Iraq at a monthly frequency. From time-series forecasting, we prove two things; the model that works best in one time-series data may not be the best in another time-series data, and that the use of ensemble significantly improves forecasting accuracy from base models. Similarly, in the classification modeling part, previous research lacks the use of algorithms that are recently developed. We also extend the previous research in binary classification problem and make use of a cutting-edge LightGBM algorithm to predict the probability of suicide attack. Our model achieves 96% accuracy in terms of AUC and correctly classifies “Yes” instances of suicide attacks with 86.5% accuracy.

Introduction

Today, we live in the world where terrorism is becoming a primary concern because of the growing number of terrorist incidents involving civilian fatalities and infrastructure damages. The ideology and intentions behind such attacks is indeed a matter of worry. Living under the constant threat of terrorist attacks in any place is no better than living in a jungle and worrying about which animal will attack you and when. An increase in a number of radicalized attacks around the world is a clear indication that terrorism transitioning to from a place to an idea, however, the existence of specific terror group and their attack characteristics over the period of time can be vital to fight terrorism and to engage peacekeeping missions effectively. Having said that number terrorist incidents are growing these days, availability of open-source data containing information of such incidents, recent developments in machine learning algorithms and technical infrastructure to handle a large amount of data open ups variety of ways to turn information into actionable intelligence.

Definition of terrorism

Terrorism in a broader sense includes state-sponsored and non-state sponsored terrorist activities. The scope of this research is limited to non-state sponsored terrorist activities only. Non-state actors in simple words mean entities that are not affiliated, directed or funded by the government and that exercise significant economic, political or social power and influence at a national and international level up to certain extent (NIC, 2007). An example of non-state actors can be NGOs, religious organizations, multinational companies, armed groups or even an online (Internet) community. ISIL is the prime example of a non-state actor which falls under armed groups segment.

Global Terrorism Database (National Consortium for the Study of Terrorism and Responses to Terrorism (START), 2016) defines terrorist attack as a threatened or actual use of illegal force and violence by a non-state actor to attain a political, economic, religious or social goal through fear, coercion or intimidation.

This implies that three of the following attributes are always present in each event of our chosen dataset:

The incident must be intentional – the result of a conscious calculation on the part of a perpetrator.
The incident must entail some level of violence or immediate threat of violence including property violence, as well as violence against people.
The perpetrators of the incidents must be sub-national actors.

Problem statement

Nowadays, data is considered as the most valuable resource and machine learning makes it possible to interpret complex data however most use cases are seen in the business context such as music recommendation, predicting customer churn or finding a probability of having cancer. With recent development in machine learning algorithms and access to open source data and software, there are plenty of opportunities to correctly understand historical terrorist attacks and prevent the future conflicts. In the last decade, terrorist attacks have been increased significantly (data source: GTD) as shown in the plot below:

Figure .: Terrorist attacks around the world between 1970-2016

After September 2001 attacks, USA and other powerful nations have carried out major operations to neutralize the power and spread of known and most violent terrorist groups within the targeted region such as in Afghanistan, Iraq and most recently in Syria. It’s also worth mentioning that the United Nations already have ongoing peacekeeping missions in conflicted regions around the world for a long time. However number of terror attacks continues to rise and in fact, it is almost on a peak in the last 5 years. This leads to a question why terrorism is becoming unstoppable despite the continued efforts. Understanding and interpreting the attack characteristics of relevant groups in line with their motivations to do so can reflect the bigger picture. An extensive research by (Heger, 2010) supports this argument and suggests that a group’s political intentions are revealed when we examine who or what it chooses to attack.

Research design and data

This research employs a mix of qualitative and quantitative research methodology to achieve the set objective. In total, we evaluate cases of over 170,000 terrorist attacks. We start with exploratory data analysis to assess the impact on a global scale and then use a variety of data mining techniques to determine the most active and violent terrorist groups. This way, we ensure that the analysis reflects the situation in present years. We use descriptive statistics to understand the characteristics of each group over the period of time and locate the major and minor epicenters (most vulnerable regions) based on threat level. To examine whether or not chosen groups have a common link with the number of fatalities, we perform statistical hypothesis test with ANOVA and PostHoc test.

The research then makes use of a variety of machine learning algorithms with supervised and unsupervised technique.

According to (Samuel, 1959), A well-known researcher in the field of artificial intelligence who coined the term “machine learning”, defines machine learning as a “field of study that gives computers the ability to learn without being explicitly programmed”. It is a subset of artificial intelligence which enables computers to learn from experience in order to create inference over a possible outcome used later to take a decision.

With the Apriori algorithm, we discover interesting patterns through association rules for individual groups. This way, we can pinpoint the habits of specific groups. Next, we perform a time-series analysis to examine seasonal patterns and correlations. To address the broad question “when and where”, we use four time-series forecasting models namely Auto Arima, Neural Network, TBATS, and ETS to predict a future number of attacks and fatalities. We evaluate and compare the performance of each model on hold out set and use ensemble approach to further improve the accuracy of predictions. As illustrated in Literature review section, most research in time-series forecasting addresses the country and year level predictions. We extend the previous research in this field with seasonality component and make forecasts on a monthly frequency. Similarly, in the classification modeling part, previous research lacks the use of algorithms that are recently developed and that (practically) out perform traditional algorithms such as logistic regression, random forests etc. We extend the previous research in binary classification context and make use of a cutting-edge LightGBM algorithm to predict the class probability of an attack involving a suicide attempt. We illustrate the importance of feature engineering and hyperparameter optimization for modeling process and describe the reasons why standard validation techniques such as cross-validation would be a bad choice for this data. We propose an alternate strategy for validation and use AUC metric as well as confusion matrix to evaluate model performance on unseen data. From the trained model, we extract the most important features and use explainer object to further investigate the decision-making process behind our model. The scope of analysis can be further extended with a shiny app which is also an integral part to make this research handy and interactive.

Data

This research project uses historical data of terrorist attacks that took place around the world between 1970 to 2016 from open-source Global Terrorism Database (GTD) as a main source of data. It is currently the most comprehensive unclassified database on terrorist events in the world and contains information on over 170,000 terrorist attacks. It contains information on the date and location of the incident, the weapons used and the nature of the target, the number of casualties and the group or individual responsible if identifiable. The total number of variables is more than 120 in this data. One of the main reason for choosing this database is because 4,000,000 news articles and 25,000 news sources were reviewed to prepare this data from 1998 to 2016 alone (National Consortium for the Study of Terrorism and Responses to Terrorism (START), 2016).

Main data is further enriched with country and year wise socio-economical conditions, arms import/export details and migration details from World Bank Open Data to get a multi-dimensional view for some specific analysis. This additional data falls under the category of early warning indicators (short term and long term) and potentially linked to the likelihood of violent conflicts as suggested by the researcher (Walton, 2011) and (Stockholm International Peace Research Institute, 2017).

An important aspect of this research is a use of open-source data and open-source software i.e. R. The reason why media-based data source is chosen as a primary source of data is that journalists are usually the first to report and document such incidents and in this regard, first-hand information plays a significant role in the quantitative analysis. Since the source of data is from publicly available sources, the term “intelligence” refers to the open-source intelligence (OSINT) category. Intelligence categories are further explained in the next chapter.

Policy and practice implications

This research project is an endeavor to achieve actionable intelligence using a machine learning approach and contributes positively to the counterterrorism policy. The outcome of this research provides descriptive findings of most lethal groups, corresponding pattern discovery through Apriori algorithm and predictive analysis through time-series forecasting and classification algorithm. Research findings and insights will be helpful to policy makers or authorities to take necessary steps in time to prevent future terrorist incidents.

Deliverables

a report in pdf version
a report in gitbook version
Shiny app
R scripts

To ensure that the research claims are (easily) reproducible, this thesis uses rmarkdown and bookdown package which allows code execution in line with a written report. gitbook version of this report is highly recommended over pdf version because it allows interactivity for some specific findings such as network graph in pattern discovery chapter. In addition, a shiny app in R is developed to make the practical aspects of this research handy, interactive and easily accessible. This app also allows to further extending the scope of analysis. All the scripts will be publicly accessible on my GitHub profile¹ after submission.

https://github.com/pranavpandya84 ↩