UNSILO delivers real-time concept extraction


Real-time concept extraction has for years been the dream of AI companies working with text. But the challenge is immense: for example, identifying the significant phrases in a text requires a system that has already been trained.

UNSILO’s standard model is to extract concepts automatically from text collections. Using a continuous ingestion process, newly published content is scanned for concepts against a background corpus, typically within a few hours of publication. New content is compared with the existing corpus, and significant phrases are identified either as corresponding to existing concepts in the corpus or as entirely new phrases that have never been used before.
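As a rough illustration of comparing new content against a background corpus, the sketch below scores two-word phrases in a new document by how often they occur and how rare they are in the corpus, and flags phrases the corpus has never seen. This is not UNSILO’s method: the bigram tokenisation, the TF-IDF-style weighting and the function names `bigrams` and `score_phrases` are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Lowercase the text and return its adjacent word pairs as phrases."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]


def score_phrases(new_doc, background_docs):
    """Return (phrase, score, is_new) tuples, highest score first.

    Hypothetical scoring: phrase count in the new document multiplied by a
    smoothed inverse document frequency computed over the background corpus.
    """
    # Count, for each phrase, how many background documents contain it.
    doc_freq = Counter()
    for doc in background_docs:
        doc_freq.update(set(bigrams(doc)))

    n_docs = len(background_docs)
    counts = Counter(bigrams(new_doc))

    results = []
    for phrase, count in counts.items():
        idf = math.log((1 + n_docs) / (1 + doc_freq[phrase])) + 1
        is_new = doc_freq[phrase] == 0  # never seen in the background corpus
        results.append((phrase, count * idf, is_new))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

A production system would need far more than this (phrase detection beyond bigrams, normalisation, a trained model), but the sketch shows the basic shape of the comparison: each phrase is either matched to something already in the corpus or marked as new.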

However, even a delay of just a few hours imposes a limitation; there are many situations where extracting concepts in real time would be very valuable. Fundamentally, a key stage in the academic researcher’s journey is reading articles and chapters. Given that 4,000 new STM (science, technology and medicine) articles are published every day, a researcher needs assistance to identify which articles should be read in detail; it simply isn’t possible to scan all the relevant-sounding articles by hand. In addition, a researcher might want to look at concepts for a manuscript that has not yet been published.

It is for this reason that UNSILO is developing real-time concept extraction. Labelled “Insights”, the system is running successfully on the UNSILO development servers, and the first results are now visible. The system is trained on an existing corpus, which could be a collection of open-access content or the publisher’s own collection of published articles and chapters.

As is appropriate for a major technological innovation, the UNSILO developers chose some text about Star Wars for the first public trial – the Wikipedia article on Darth Vader, to be precise. The results are shown above. There are no surprises about the key concepts identified – you can see “Star Wars”, “Dark Side” and “Television Series” appearing at the top. Each concept has a relevance score attached to it, so the user can see at a glance the top five or top 25 concepts and compare how central they are to the text.

As the tool develops, with the aid of researchers commenting on how it aligns with their workflow, we hope to provide a solution that complements human activity, not replacing human input but enabling the researcher to concentrate on the most relevant content. It may only save the researcher a few minutes per article, but given that the average researcher reads 264 articles per year according to a 2012 survey, and a typical researcher will glance at perhaps five or ten times that number of articles before reading them, that adds up to a significant time saving.
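To put a rough number on that saving, here is a back-of-the-envelope calculation. The figure of 264 articles per year and the five-to-ten-times multiplier come from the paragraph above; the two minutes saved per article, and the choice to apply the saving to every article glanced at, are assumptions made purely for illustration.

```python
ARTICLES_READ_PER_YEAR = 264  # from the 2012 survey cited above


def hours_saved_per_year(minutes_saved_per_article, glance_multiplier):
    """Estimate yearly hours saved, assuming the saving applies to each article glanced at."""
    articles_glanced = ARTICLES_READ_PER_YEAR * glance_multiplier
    return articles_glanced * minutes_saved_per_article / 60


# Assumed saving of 2 minutes per article (illustrative, not a measured figure).
low = hours_saved_per_year(2, 5)    # 44.0 hours per year
high = hours_saved_per_year(2, 10)  # 88.0 hours per year
print(f"Estimated saving: {low:.0f} to {high:.0f} hours per year")
```

Under those assumptions the saving comes to somewhere between one and two working weeks per researcher per year.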
