Experimenting with Machine Learning (ML) Methods For Named Entity Recognition
Introduction
This blog post details my experiments with different Machine Learning (ML) approaches in an attempt to build a model capable of identifying named entities in written English text.
Named Entity Recognition (NER) is a subfield of Natural Language Processing and a method of information extraction. It involves identifying entities in text data, such as names of people (Per), geographical entities (Geo), organisations (Org), events (Eve), time-based entities like dates (Tim), and so on. Extracted entities are used in downstream tasks such as search and information retrieval, question answering, document classification, and much more.
NER solutions are typically categorized into three groups:
Rule-Based Approach: As suggested by the name, this approach employs rules in identifying entities within the text. These rules are designed to consider the pattern, the position, the text casing, or even the characters present in the words. While this approach is specific and accurate in picking out entities, it needs a comprehensive set of rules to identify all possible entities in the text. These rules can become verbose, leading to computationally expensive algorithms.
ML-Based Approach: The Machine Learning approach addresses one of the inadequacies of rule-based methods for Named Entity Recognition. It involves teaching an ML algorithm to identify entities (training) and asking it to do so on unseen data (inference), leading to fewer lines of code and a more compact solution.
While this seems more straightforward than rule-based approaches, it relies on processing and transforming text into numerical values before the algorithms can comprehend it. Additionally, several experiments are needed before arriving at a model that recognizes entities. Furthermore, ML models leave some room for inaccurate predictions, which in some cases may be detrimental to the task at hand.
Hybrid Approach: This approach builds on the strengths of the two previously mentioned approaches. A blend of these approaches can occur in different ways, including setting guardrails for the ML models with rule-based algorithms while they identify entities in production.
Structure of Experimentation
In this blog post, I write about my observations when experimenting with different ML solutions for an NER task. Below is the structure of my experimentation:
Text Transformation: Transform the text into vectors based on word frequency, contextual information, or the meaning of the words.
Training: Train an ML algorithm on a subset of the transformed data and the corresponding labels.
Evaluate: Understand the model's performance by conducting inference on a subset of the transformed data that was not seen by the model. The model's predictions are compared with the actual labels through precision (the proportion of the entities the model identified that are correct).
The model's performance is further assessed by running it on other human-written sentences and observing the predicted labels.
My objective is to develop a Machine Learning model capable of recognizing named entities in free-form, unstructured text.
Justifying the Choice of Metric
Precision is used as the metric of choice in this exercise to see how correctly the model can identify entities while being trained on imbalanced data.
Precision is therefore defined as
$$\begin{equation} \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \end{equation}$$
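As a concrete illustration, the sketch below shows how the two precision figures reported later (overall, and excluding the dominant “O” tag) might be computed with scikit-learn. The exact averaging scheme used in the experiments is not stated, so weighted averaging is assumed here, and the toy labels are purely illustrative.

```python
from sklearn.metrics import precision_score

# Toy labels for illustration only; real evaluation uses the test split.
y_true = ["O", "B-per", "I-per", "O", "B-geo", "O"]
y_pred = ["O", "B-per", "O", "O", "B-geo", "B-org"]

# Overall precision across all tags, including the dominant "O".
overall = precision_score(y_true, y_pred, average="weighted", zero_division=0)

# Precision restricted to entity tags, i.e. excluding "O".
entity_labels = sorted({t for t in y_true + y_pred if t != "O"})
without_o = precision_score(y_true, y_pred, labels=entity_labels,
                            average="weighted", zero_division=0)

print(f"Overall precision: {overall:.3f}")
print(f"Precision excluding 'O': {without_o:.3f}")
```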
Overview of the Data Used in This Exercise
The data is an annotated dataset with columns for words (tokens), their respective Parts of Speech (POS) tags, and NER tags. The tokens include alphabetic and numeric characters as well as punctuation marks. The punctuation marks are removed from the data to achieve a smaller data sample for this exercise.
When joined together, these tokens form meaningful sentences that humans can comprehend. Preliminary exploration shows that there are about 31 null values in the tokens column, which are not useful, so they are dropped from the data.
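A minimal sketch of these cleaning steps, assuming a pandas DataFrame with hypothetical column names "Word", "POS" and "Tag" (the file name and encoding are also illustrative):

```python
import string

import pandas as pd

# Load the annotated dataset; the file name and encoding are illustrative.
df = pd.read_csv("ner_dataset.csv", encoding="latin1")

# Drop rows whose token is null (the ~31 null values mentioned above).
df = df.dropna(subset=["Word"])

# Remove tokens made up solely of punctuation marks.
is_punct = df["Word"].apply(
    lambda w: all(ch in string.punctuation for ch in str(w))
)
df = df[~is_punct].reset_index(drop=True)
```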
There are 35,167 distinct tokens/words, which, when joined to form sentences (based on sentence identifiers), result in ~50,000 sentences. The distribution of NER tags is quite interesting to observe, as there are 17 distinct tags derived from eight entity types:
Geo = Geographical Entity
Org = Organization
Per = Person
Gpe = Geopolitical Entity
Tim = Time indication
Art = Artifact
Eve = Event
Nat = Natural Phenomenon
These tags conform to the BIO tagging scheme (developed by L. A. Ramshaw and M. P. Marcus in 1995), where a tag is prefixed with “B-” when the word is at the beginning of an entity chunk, or with “I-” when the word is inside an entity chunk. When the observed word is not an entity, it is tagged “O”.
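To illustrate the scheme, a short sentence such as “Jane Smith visited New York” would be tagged along these lines (the sentence and tags are illustrative, not drawn from the dataset):

```
Jane     B-per   (beginning of a person chunk)
Smith    I-per   (inside the person chunk)
visited  O       (not an entity)
New      B-geo   (beginning of a geographical chunk)
York     I-geo   (inside the geographical chunk)
```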
The tag “O” is the most frequently occurring one, which conforms to reality, as not every word in a sentence is an entity. In fact, entities make up only a portion of a sentence or a body of text.
Approaches Tried and Their Outcomes
Approach 1: TFIDF + Multinomial Naive Bayes (Overall Precision: 0.907, Precision excluding dominant tag “O”: 0.788)
Features Used: Only the words (tokens) are used in this approach
Method of splitting data into train and test sets: The data is divided by tokens to ensure an equal distribution of the tags in the train and test sets: 70% for training and 30% for testing.
Method of data transformation: Term Frequency - Inverse Document Frequency (TF-IDF) is used to transform the words into vectors, wherein the values weigh a word's frequency against how rare it is across the corpus. Higher values mark rarer words, which may be considered informative for a particular tag.
Type of Model: A Multinomial Naive Bayes model is used. The model considers the rarity or frequency of a word (as indicated by TF-IDF) and estimates the probability that the word belongs to a particular tag. A minimal sketch of this approach is shown after this list.
Performance on Test set: A precision score of 0.907 is observed when the most frequent tag “O” is included; when this tag is excluded, precision drops to 0.808.
Problems with this approach:
- The context in which a word is used determines its tag, which allows a word to have multiple tags, as seen below
- This model does not generalize well to unseen names of people. In this example, Mark and Rutte are the names of a person
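Below is a minimal sketch of this approach, reusing the hypothetical df from the cleaning step above; the vectorizer settings and random seed are assumptions, not the original configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Split by tokens, stratified on the tag to keep the distribution equal.
X_train, X_test, y_train, y_test = train_test_split(
    df["Word"].astype(str), df["Tag"], test_size=0.3,
    stratify=df["Tag"], random_state=42,
)

# Each token is treated as its own "document"; casing is kept,
# since capitalization is a useful signal for NER.
vectorizer = TfidfVectorizer(lowercase=False)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

y_pred = clf.predict(X_test_vec)
print(precision_score(y_test, y_pred, average="weighted", zero_division=0))
```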
Approach 2: Fast Text + Decision Tree (Overall Precision: 0.938, Precision excluding dominant tag “O”: 0.775)
Features Used: Only the words (tokens) are used in this approach
Method of splitting data into train and test sets: The data is split by tokens to ensure an equal distribution of the tags in the train and test sets: 70% for training and 30% for testing.
Method of data transformation: FastText is used, wherein embeddings are derived from the character n-grams of each word. Essentially, this makes embedding representations possible for unseen words, as opposed to Word2Vec. A rough sketch of this approach follows this list.
Type of Model: A Decision Tree, which recursively splits the data into subsets until it achieves maximum homogeneity in its leaf nodes.
Performance on Test set: A precision score of 0.938 is observed when the most frequent tag “O” is included (better than the previous approach); when this tag is excluded, precision drops to 0.775.
Problems with this approach: Context is not considered with FastText. Additionally, the model is not yet able to differentiate between names of countries and names of people.
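A rough sketch of this approach using gensim's FastText implementation, again reusing the hypothetical df; the "Sentence #" identifier column, embedding size, and other hyperparameters are assumptions.

```python
import numpy as np
from gensim.models import FastText
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild sentences from the hypothetical sentence-identifier column.
sentences = df.groupby("Sentence #")["Word"].apply(list).tolist()

# Character n-grams let FastText embed words it has never seen,
# which Word2Vec cannot do.
ft = FastText(sentences=sentences, vector_size=100, window=5,
              min_count=1, epochs=5)

X = np.vstack([ft.wv[str(w)] for w in df["Word"]])
y = df["Tag"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42,
)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print(precision_score(y_test, tree.predict(X_test),
                      average="weighted", zero_division=0))
```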
Approach 3: Multilingual-e5-base embeddings of Tokens and Sentences, Pre-indicated Parts of Speech Tags + Neural Network (Overall precision: 0.932, Precision excluding dominant tag “O”: 0.718)
Features Used: Tokens, Sentences in which these tokens appear, and Parts of Speech (POS) tags of these tokens.
The overall data is sampled to reduce the compute power needed for training.
Method of splitting data into train and test sets: The data is split by sentences to ensure an equal distribution of the tags in the train and test sets: 70% for training and 30% for testing.
Method of data transformation: Embeddings are created from the tokens and the sentences in which they appear. POS tags are one-hot encoded.
Type of Model: A neural network with three inputs, one for each of the abovementioned features. Within the model, the dot product of the token and sentence embeddings is computed. This gives the model a sense of which kinds of words to focus on, which may be the subject of the sentence and carry tags. Below is the architecture of the model, and a simplified sketch of it follows this list.
Performance on the test set: A precision score of 0.938 is observed when the most frequent tag “O” is included (similar to the previous approach); when this tag is excluded, precision drops to 0.718.
Problems with this approach: While this approach takes into consideration the context in which a word is used, it is not yet able to identify chunks of text that altogether refer to a single entity. However, it performs well at identifying names regardless of whether they are of European or African origin. Admittedly, it scores lower in terms of precision than the other methods; however, it generalizes better.
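A simplified Keras sketch of the three-input architecture described above; the layer sizes, the way the dot product is merged back in, and the POS vocabulary size are assumptions, and the 768-dimensional inputs assume embeddings produced with the intfloat/multilingual-e5-base model.

```python
from tensorflow.keras import Model, layers

EMB_DIM = 768  # dimensionality of multilingual-e5-base embeddings
N_POS = 42     # illustrative size of the one-hot POS vocabulary
N_TAGS = 17    # number of NER tags in the dataset

token_in = layers.Input(shape=(EMB_DIM,), name="token_embedding")
sent_in = layers.Input(shape=(EMB_DIM,), name="sentence_embedding")
pos_in = layers.Input(shape=(N_POS,), name="pos_one_hot")

# Dot product of the token and sentence embeddings: a scalar signal of
# how salient the token is within its sentence.
dot = layers.Dot(axes=1)([token_in, sent_in])

x = layers.Concatenate()([token_in, sent_in, pos_in, dot])
x = layers.Dense(256, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(N_TAGS, activation="softmax")(x)

model = Model(inputs=[token_in, sent_in, pos_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```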
Findings
The above observations indicate that contextual and/or semantic information enables ML models to better understand text and, as in our case, perform NER tasks.
The creation of embeddings often leads to high dimensionality, which is better handled by neural networks or more complex models than by statistical ML models.
More work needs to be done to improve NER, for example identifying chunks as entities: the phrase “today by 12 noon” should altogether be identified as a single time entity.
The last model generalizes better on unseen data than the previous models, so it is adopted as the model of choice.
Concluding Remarks
Named Entity Recognition models can be put to use in a variety of ways. As a standalone solution, a model is capable of deriving key information from lengthy texts, identifying locations, places, or even persons of interest. It can also be useful in search and information retrieval, text classification, and a host of other processes.
This model is made available through this link and can be used to identify entities inherent in text. The code for this exercise, as well as the code for the model's deployment on Streamlit Cloud, can be found in this GitHub repository.