Measuring the accuracy of text analytics

How should a publisher, or researcher, measure the quality of text analytics? If an automatic tool extracts hundreds of concepts from your content, how do you know whether they really are the key concepts? Probably the most widely used measure of the accuracy of text analytics is the F1 score. This is a measure of accuracy (comparing the machine's result with that of a human indexer) that attempts to reconcile two criteria, recall and precision:

• Recall: the proportion of the key terms (as identified by the human indexer) that the automatic tool has retrieved;
• Precision: the proportion of the terms retrieved by the software that are actually relevant.
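To make the arithmetic concrete, here is a minimal sketch in Python (the term lists are invented for illustration) showing how precision, recall, and the F1 score, which is their harmonic mean, would be computed from a tool's extracted terms and a human indexer's key terms.

```python
# Hypothetical example: key terms chosen by a human indexer vs. terms
# extracted automatically by a text analytics tool.
human_terms = {"gene expression", "protein folding", "CRISPR", "apoptosis"}
machine_terms = {"gene expression", "CRISPR", "apoptosis", "cell", "laboratory"}

true_positives = len(human_terms & machine_terms)   # terms both agree on: 3

precision = true_positives / len(machine_terms)     # 3 / 5 = 0.60
recall = true_positives / len(human_terms)          # 3 / 4 = 0.75

# F1 is the harmonic mean of precision and recall, weighting both equally.
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.67

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```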

Why is the F1 score not always to be relied on? There have been many criticisms of the F1 score, but one very telling comment came from an academic researcher in life science. He described to me how he uses text analytics to identify relevant content. Starting from the premise that there are too many articles published to read them all, even within a single domain (and life science is one of the most popular subjects for academic articles), he works as follows.
1. He identifies some articles that are relevant, and highlights sections and terms that interest him.
2. He uses text analytics software to find other articles containing those terms and concepts.
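As an illustration only (not the researcher's actual tooling), the sketch below shows the core idea behind step 2 in Python: scoring a collection of candidate article abstracts by how many of the highlighted terms each one contains, so the most likely matches can be read first. All article identifiers, texts, and terms are invented.

```python
# Hypothetical sketch: rank candidate articles by how many of the
# researcher's highlighted terms appear in each abstract.
highlighted_terms = ["protein folding", "chaperone", "misfolding disease"]

articles = {
    "article-001": "Chaperone proteins assist protein folding in the cell...",
    "article-002": "A survey of statistical methods for clinical trials...",
    "article-003": "Misfolding disease mechanisms and chaperone therapies...",
}

def term_score(text: str, terms: list[str]) -> int:
    """Count how many of the terms occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(term.lower() in lowered for term in terms)

# Sort articles so those matching the most highlighted terms come first.
ranked = sorted(articles.items(),
                key=lambda item: term_score(item[1], highlighted_terms),
                reverse=True)

for article_id, text in ranked:
    print(article_id, term_score(text, highlighted_terms))
```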

This is typical of the use of text analytics; most tools perform a similar service. But the interesting angle was the researcher’s comment:

“I don’t mind if the software finds 20% or 30% of articles that are not relevant, because it saves me so much time when it identifies articles that are relevant. I want to make sure it hasn’t missed anything, and I’m happy to do a bit of reading to check the automatic tool results. But it’s still a great improvement: instead of reading or skimming 100 articles by hand, I now only have to look in detail at a few to check their relevance.”

In other words, this researcher is prioritizing recall over precision. Since the F1 score strikes a fixed, equal balance between precision and recall, it is less appropriate in this case. In fact, it could be argued that a measure weighted towards recall would be a more reliable indicator for academic research use, and other versions of the F-score exist that allow exactly this. Unfortunately, the F1 score has become so widely established that it tends to be the first indicator people look for when measuring text analytics. For academic researchers, it may not be the best measure.
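To illustrate what such an alternative looks like, here is a small sketch of the general F-beta score, where a beta greater than 1 weights recall more heavily than precision (F2 is a common recall-oriented choice). The precision and recall figures reuse the invented values from the earlier example.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """General F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.60, 0.75   # illustrative figures only

print(f"F1 = {f_beta(precision, recall, beta=1.0):.2f}")  # equal weighting, ≈ 0.67
print(f"F2 = {f_beta(precision, recall, beta=2.0):.2f}")  # recall-weighted, ≈ 0.71
```

For a recall-first use case like the researcher's, the recall-weighted score rewards a tool that misses little, even at the cost of returning some irrelevant articles.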

