A common complaint about machine-learning tools is that they are black boxes, operating in ways that cannot be explained (see, for example, this recent post). We make no claims for other AI tools, but UNSILO’s automated concept extraction engine behaves in a precise and well-documented way. In this post, we show one example of that behaviour: how UNSILO’s engine works with related concepts rather than just with strings.
Take, for example, the phrase “secondary brain injury”, a medical concept used widely in academic text. A search of the UNSILO showcase, which comprises around a million medical articles (try it out for yourself on the UNSILO showcase here), reveals hundreds of hits for this concept, such as the article below.
Any term or phrase entered in the search box is automatically matched to the closest concept in the index. However, the English language being what it is, many researchers use the similar concept “secondary brain damage”, and a search for this phrase in the Showcase reveals a separate set of hits:
Most search tools operate on a string basis, albeit with a few tweaks to maximise the number of results from the search – which is not quite what is required here, in the context of academic search. For example, most search engines, including Google, will automatically expand a search syntactically, and so will search for “brains” as well as for “brain”, and “damaged” when the user keys in “damage”.
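To make the idea of string-based expansion concrete, here is a minimal sketch in Python. It is an illustration only, not how Google or any other search engine actually implements expansion: the function names `inflections` and `expand_query` are hypothetical, and the suffix rules are far cruder than a real stemmer.

```python
from typing import Dict, Set


def inflections(word: str) -> Set[str]:
    """Return a word plus two naive inflected forms (plural and past tense).

    Hypothetical rules for illustration; real engines use proper stemmers.
    """
    past = word + "d" if word.endswith("e") else word + "ed"
    return {word, word + "s", past}


def expand_query(query: str) -> Dict[str, Set[str]]:
    """Map each word in the query to the set of string variants searched for."""
    return {word: inflections(word) for word in query.lower().split()}


expanded = expand_query("secondary brain damage")

# Every string the expanded query could match, across all words.
all_terms = set().union(*expanded.values())

print(sorted(expanded["brain"]))
print(sorted(expanded["damage"]))
print("injury" in all_terms)
```

The expansion picks up “brains” and “damaged”, but because it only manipulates the characters of the words the user typed, it has no way of reaching “injury”.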
This query expansion can cause confusion rather than clarification: for a medical researcher, “secondary brain damage” is the concept they are interested in, not a set of syntactical variations on its constituent terms. The plural form “brains” will never appear in this context. However, a researcher is very interested indeed in related phrases, including terms that may have little or no syntactical relationship with the original but are clearly related in meaning. How can we tell that two phrases are related in meaning? UNSILO’s machine-learning capabilities, using statistical methods, identify that many researchers use the phrase “secondary brain injury” as largely synonymous with “secondary brain damage” in articles with a similar context. A really useful indexing tool would identify both phrases as equivalent.
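The statistical intuition can be sketched in a few lines of Python. This is not UNSILO’s actual method, which is not described in this post. It swaps in a deliberately simple technique, bag-of-words context vectors compared by cosine similarity, and the five-sentence corpus and the names `context_vector` and `cosine` are invented for the example.

```python
import math
from collections import Counter

# Invented toy corpus; a real index would hold around a million articles.
CORPUS = [
    "secondary brain injury often follows ischemia after head trauma",
    "secondary brain damage often follows ischemia after severe trauma",
    "hypoxia worsens secondary brain injury in trauma patients",
    "hypoxia worsens secondary brain damage in intensive care patients",
    "secondary school pupils sat the mathematics exam",
]


def context_vector(phrase, corpus):
    """Count the words that occur alongside the phrase in the corpus."""
    counts = Counter()
    for sentence in corpus:
        if phrase in sentence:
            # Drop the phrase itself so only its surroundings are counted.
            counts.update(sentence.replace(phrase, " ").split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


injury = context_vector("secondary brain injury", CORPUS)
damage = context_vector("secondary brain damage", CORPUS)
school = context_vector("secondary school", CORPUS)

print(cosine(injury, damage))
print(cosine(injury, school))
```

In this toy corpus the two medical phrases share most of their surrounding vocabulary and score highly, while “secondary school” shares none of it and scores zero, even though all three phrases begin with the same word. Context, not spelling, is the kind of signal that lets an indexing tool treat two phrases as equivalent.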
UNSILO does indeed do this: a glance at the concepts extracted by the UNSILO engine shows what is actually taking place. Behind the scenes, the concept extraction engine identifies synonyms and similar expressions in this context, and expands the query by concept rather than just by syntax. If you look at the concepts identified by the engine, you can see on the second line of the concept results that the engine has expanded the query to include not only “secondary brain injury” but also “secondary brain damage”:
To identify related concepts, rather than just similar strings, is the achievement of the UNSILO automated concept extraction tool, making searches more precise, and related content more rapidly discoverable.