Sentiment Analysis: Idioms and their Importance


Sentiment analysis (or opinion mining) aims to automatically extract and classify sentiments (the subjective part of an opinion) and/or emotions (the projection or display of a feeling) expressed in text.

There are several language features that we use to indicate sentiment within text. Features can take the form of single words (unigrams), short phrases (bigrams), and longer phrases (n-grams), emoticons (e.g. 🙂 is commonly used to represent positive sentiment), slang (e.g. chuffed, do one’s nut), abbreviations (e.g. great — GR8), onomatopoeic elements (e.g. gr, hm), as well as the use of upper case, punctuation (e.g. !!, ?!), and repetitions of letters (e.g. sweeeeet) for affective emphasis. These features are extracted from text and presented to machine learning models, which are trained to classify the sentiment a piece of text expresses based on the features it contains.

Although these features are extensively used in sentiment analysis, less attention has been paid to the effect of ignoring idioms as features. This post therefore investigates the importance of including idioms as features in sentiment analysis by comparing the performance of two state-of-the-art tools with and without idioms represented. There are two requirements to achieve this: 1) the sentiment associated with an idiom needs to be identified, and 2) idioms need to be automatically recognised in text.


Before we get technical, let’s first define what idioms are.

Idioms are often defined as multi-word expressions (expressions or phrases made up of at least two words). What makes them different from other phrases is that their overall meaning can’t be guessed from the literal meanings of the individual words that form them. For example, a fish out of water refers to someone who feels uncomfortable in a particular situation, not to anything involving actual fish or water. The following figure provides other examples of English idioms.


Examples of idioms about money and finance by Kaplan

Because of this, idioms are a challenge for language learners. It’s therefore common for idioms and their meanings to be taught and memorised whole, as opposed to being learnt through their structure.

To distinguish idioms from other phrases and sayings, the following properties can be considered:

  • Conventionality: The overall meaning of an idiom can’t (entirely) be predicted from the literal meanings of the words that form it.
  • Inflexibility: Their syntax is restricted, i.e. they don’t vary much in the way they are composed.
  • Figuration: They typically have figurative meaning stemming from metaphors, hyperboles and other types of figuration.
  • Proverbiality: They usually describe a recurrent social situation.
  • Informality: They’re associated with less formal language, such as colloquialisms.
  • Affect: They typically imply an affective stance toward something rather than a neutral one.

The last property, affect, implies that an idiom itself may be useful in determining the sentiment expressed within a piece of text. For example, “I am over the moon with how it turned out” expresses a positive sentiment.



In order to use idioms as features for sentiment analysis, we need more information about their underlying sentiment. To this end, we turned to the Learn English Today website, which organises 580 idioms by theme, many of which can be directly (e.g. happiness, sadness) or indirectly (e.g. success, failure) mapped to an emotion. We focused specifically on emotion-related idioms as they’re anticipated to have some impact on the sentiment analysis result. We selected 16 out of a total of 60 available themes, which are listed in the following table together with the number of associated idioms.


Distribution of idioms across themes (Table 1)


As well as a list of idioms, we also need examples of them being used in context. For this, we searched the British National Corpus (a large text corpus of both written and spoken English compiled from various sources) for examples of the 580 idioms used in different contexts. In total, we collected 2,521 sentences which contained an expression that could be matched to an idiom.

In most cases, these expressions carried the figurative meaning associated with the idiom, but in other cases they were used literally. In this sense, some of the sentences are false positives. For example, in the bag is used figuratively in the first sentence below but literally in the second:

“The Welsh farmer’s son had the 1988 conditional jockeys’ title already in the bag.”

“I looked in the bag, it was full of fish.”

It was necessary to include false positives so that we could evaluate how incorrectly recognised idioms may affect the results of sentiment analysis.


Whilst idioms have been extensively studied across many disciplines, so far, there isn’t a comprehensive set of idioms that have been systematically mapped to their sentiments. This is the main reason why idioms have been underrepresented as features in sentiment analysis.

At least 3 annotators were required to tag whether each example of an idiom used in context reflected a positive, negative, neutral or ambiguous sentiment. Likewise, 5 annotators were required to tag the sentiment expressed by each idiom out of context.

To measure the reliability of the annotated datasets, we calculated inter-annotator agreement using Krippendorff’s alpha coefficient. Other agreement measures exist, but this one is considered particularly reliable as it handles any number of annotators (not just two), any number of categories, and incomplete or missing data.

Krippendorff’s alpha coefficient is calculated according to the following formula:

α = 1 − Do / De

where Do is the observed disagreement (the proportion of annotation pairs that disagree on an item), and De is the disagreement expected when annotations are given at random. Krippendorff suggests α = 0.667 as the lowest acceptable value to consider a dataset reliable for training a model. A coefficient of 1 indicates perfect agreement, whereas 0 indicates agreement no better than chance; higher values therefore indicate better agreement.

The agreement on the idiom dataset was calculated as De = 0.606, Do = 0.205, α = 0.662. The agreement on the corpus of idioms used in context was calculated as De = 0.643, Do = 0.414, α = 0.355.
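As a quick sanity check of the formula, the reported alpha values can be reproduced directly from the Do and De figures (small discrepancies are down to the published values being rounded):

```python
# alpha = 1 - Do / De, using the reported disagreement values
alpha_idioms = 1 - 0.205 / 0.606    # ~0.662 (idioms out of context)
alpha_context = 1 - 0.414 / 0.643   # ~0.356 (contextual examples; reported as
                                    # 0.355 from the unrounded Do and De)
print(round(alpha_idioms, 3), round(alpha_context, 3))
```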

The agreement on idioms alone (α = 0.662) illustrates that they can, by and large, be mapped to a sentiment polarity. However, the significantly lower agreement (α = 0.355) on contextual examples of idioms illustrates how subjective the interpretation of sentiment is amongst annotators.

The values for Krippendorff’s alpha coefficient can be obtained using an online tool for calculating annotator agreement, such as ReCal, or it can be implemented in Python.
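The Python route can be a short, self-contained function. The following is a minimal sketch for nominal categories with complete annotations (the function name and input format are illustrative, not from any particular library):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal categories.
    units: one list of annotation labels per annotated item."""
    o = Counter()  # coincidence matrix of ordered label pairs
    for anns in units:
        m = len(anns)
        if m < 2:
            continue  # items with a single annotation contribute no pairs
        for a, b in permutations(anns, 2):
            o[(a, b)] += 1 / (m - 1)
    n_c = Counter()  # marginal totals per category
    for (a, _), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())
    # observed vs expected disagreement (nominal distance: 0 if equal, else 1)
    Do = sum(w for (a, b), w in o.items() if a != b) / n
    De = sum(n_c[a] * n_c[b] for a, b in permutations(n_c, 2)) / (n * (n - 1))
    return 1 - Do / De
```

With perfectly agreeing annotators this returns 1; agreement at chance level returns 0.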


Annotated contextual examples of idioms were then used to create a gold standard (the standard accepted as being the most valid) for the sentiment analysis experiments. To create it, each sentence whose annotation was agreed by a relative majority of at least 50% of the annotators was treated as the ground truth. For example, if 2 annotators agreed that “All right, do not jump down my throat” reflected a negative sentiment, whereas the third considered it positive, the ground truth for the sentence was negative.
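This majority-vote rule is simple to implement; here's a sketch (the function name is illustrative, and the tie-handling is an assumption on my part):

```python
from collections import Counter

def majority_sentiment(annotations):
    """Ground-truth label: the annotation agreed by a relative majority
    of at least 50% of the annotators (None if there is no such majority)."""
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no relative majority
    label, count = counts[0]
    return label if count >= len(annotations) / 2 else None

majority_sentiment(["negative", "negative", "positive"])  # "negative"
```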


In order to incorporate idioms as features of sentiment analysis, we required the means of automatically recognising them in text. The fact that the structure of most idioms is inflexible makes this feasible.

Lexico-syntactic patterns (string-matching patterns based on text tokens and syntactic structure) can be used to computationally model idioms and automatically recognise their occurrences in text. Many idioms are frozen phrases (their structure doesn’t change), which can be recognised by simple string matching. But syntactic changes, such as inflection (e.g. verb tense change), are also seen in idioms. These can be modelled using regular expressions (RegEx), e.g. spil(l|ls|t|led) the beans, or, for more complex idioms, lexico-syntactic patterns (e.g. put NP in PRN’s place).
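As a minimal sketch, a single RegEx can cover the inflectional variants spill / spills / spilt / spilled of this idiom:

```python
import re

# Match "spill the beans" plus its common inflectional variants.
spill_the_beans = re.compile(r"\bspil(?:l|ls|t|led)\s+the\s+beans\b", re.IGNORECASE)

for s in ["Don't spill the beans!", "She spilt the beans yesterday."]:
    print(bool(spill_the_beans.search(s)))  # both match
```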

In our experiments, idiom recognition rules were implemented as expressions in Mixup (My Information eXtraction and Understanding Package), a simple pattern-matching language. For example, the following grammar:

〈idiom〉 ::= 〈VB〉 〈PRP$〉 heart on 〈PRP$〉 sleeve

〈VB〉 ::= wear | wore | worn | wearing

〈PRP$〉 ::= my | your | his | her | its | our | their

was used to successfully recognise the idiom wear one’s heart on one’s sleeve in the following sentence:

“Rather than 〈idiom〉 wear your heart on your sleeve 〈/idiom〉, you keep it under your hat.”
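The original rules were written in Mixup, but the same grammar can be rendered as a RegEx in Python; a hedged sketch:

```python
import re

# RegEx rendering of: <VB> <PRP$> heart on <PRP$> sleeve
VB = r"(?:wear|wore|worn|wearing)"
PRP = r"(?:my|your|his|her|its|our|their)"
idiom = re.compile(rf"\b{VB}\s+{PRP}\s+heart\s+on\s+{PRP}\s+sleeve\b", re.IGNORECASE)

m = idiom.search("Rather than wear your heart on your sleeve, you keep it under your hat.")
# m.group(0) == "wear your heart on your sleeve"
```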

The pattern-matching rules were applied to the test dataset of 500 sentences (40% of the original dataset) in which an annotator marked up all idiom occurrences differentiating between the figurative and literal meaning. For example:

“Phew, that was a 〈idiom〉 close shave 〈/idiom〉.”

“He has polished shoes, a 〈nonidiom〉 close shave 〈/nonidiom〉, and too much pride for a free drink.”

Idiom recognition achieved an F1-score of 97.14%, where an idiom was considered correctly recognised if the suggested text span exactly matched the one marked up by the annotator.


Recognising idioms using Mixup


The 5 annotations collected for each idiom were used to calculate their feature vectors. Each idiom was represented as a triple: (positive, negative, other), where each value represented the percentage of annotations in the corresponding category. For example, the idiom wear one’s heart on one’s sleeve received 1 positive, 0 negative, and 4 other annotations. It was, therefore, represented as the following triple: (20, 0, 80).
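The mapping from annotation counts to a triple is straightforward; a sketch (the function name and the convention of folding every non-positive, non-negative label into "other" are my assumptions):

```python
def idiom_triple(annotations):
    """(positive, negative, other) percentages from per-idiom annotations;
    any label other than positive/negative counts towards 'other'."""
    n = len(annotations)
    pos = 100 * sum(a == "positive" for a in annotations) // n
    neg = 100 * sum(a == "negative" for a in annotations) // n
    return (pos, neg, 100 - pos - neg)

# "wear one's heart on one's sleeve": 1 positive, 0 negative, 4 other
idiom_triple(["positive", "other", "other", "other", "other"])  # (20, 0, 80)
```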

As we wanted to investigate the impact of idioms as features in sentiment analysis, we conducted two experiments in which we combined the triple representations of idioms with the results of two popular sentiment analysis methods: SentiStrength and Stanford CoreNLP’s sentiment annotator.

In the first experiment, we used SentiStrength, a bag-of-words approach which assigns sentiment polarity to a sentence by aggregating the polarity of individual words, e.g.

Input: The party is over.

Analysis: The party [1] is over [−1] .

Output: result = 0, positive = 1, negative = −1

As illustrated in the example above, the phrase party is over would be recognised as an idiom, which maps to the triple (0, 100, 0), denoting that all annotators considered it to be negative. We appended the two vectors to create a single feature vector for the given sentence.

In the second experiment, we used a sentiment annotator distributed as part of Stanford CoreNLP, a suite of core NLP tools. This method uses recursive neural networks to perform sentiment analysis at all levels of compositionality across the parse tree by classifying each sub-tree on a 5-point scale: very negative, negative, neutral, positive and very positive. In addition to classification, it also provides a probability distribution across the 5 classes, which we used as features in our method by converting them into a 5-dimensional vector. As before, the idiom party is over would be recognised and its triple appended to create a single feature vector for the given sentence.

For both experiments, the feature vectors produced for each sentence were concatenated with their ground truth class label.
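As a sketch, assembling a single labelled training instance might look like the following (the variable names and exact vector layout are illustrative, not taken from the original experiments):

```python
def make_instance(baseline_scores, idiom_triple, label):
    """Concatenate baseline sentiment features with the idiom triple and
    attach the ground-truth class label (illustrative layout)."""
    return list(baseline_scores) + list(idiom_triple), label

# SentiStrength scores for "The party is over." plus the idiom triple
# (0, 100, 0) for "party is over", labelled negative:
features, label = make_instance((1, -1), (0, 100, 0), "negative")
# features == [1, -1, 0, 100, 0]
```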


Once both idioms and their contextual examples had been represented and combined as feature vectors, we used Weka, a popular suite of machine learning software, to train a classifier and perform classification experiments. We based our choice of a machine learning method on the results of cross-validation experiments on the training dataset (60% of the original dataset). A Bayesian network classifier outperformed other methods.

The classification performance was evaluated in terms of three measures — precision (P), recall (R) and F1-score based on the numbers of true positives (TP), false positives (FP) and false negatives (FN).
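These measures follow their standard definitions; as a quick reference:

```python
def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = harmonic mean of P and R."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```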


The evaluation results using SentiStrength as the baseline method


The evaluation results using Stanford CoreNLP sentiment annotator as the baseline method


So, what did we learn from this analysis?

We demonstrated the value of idioms as features of sentiment analysis by showing that idiom-based features significantly improve sentiment classification results when idioms are present. The overall performance in terms of F1-score was improved from 45% to 64% in one experiment, and from 46% to 61% in the other.

The next steps would be to explore how idiom recognition rules can be implemented in Python using RegEx. This would make idiom recognition rules more accessible and compatible with sentiment classifiers built using Scikit-learn.

For the full datasets and Mixup rules, check out my GitHub repo below:

