In this post Mario Juric, UNSILO’s head of research, responds to a question that several of our clients have asked: does UNSILO use triple stores for its concept extraction?
Triple stores have been around now for several years, and some content owners (notably the BBC) have used triple stores in their publishing. What are the advantages and drawbacks of using triple stores, and does UNSILO use them?
One major benefit of triple stores is the ability to make inferences from a content repository. Everyone will be familiar with the simple inferencing possible from triple stores, for example:
Jose Mourinho is manager of Manchester United. Zlatan Ibrahimović plays for Manchester United. Hence Jose Mourinho is the manager of Zlatan Ibrahimović.
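The inference above can be sketched in a few lines of plain Python. This is an illustrative toy, not a real triple-store engine: facts are (subject, predicate, object) tuples, and the predicate names and the single inference rule are assumptions made for the example.

```python
# Each fact is a (subject, predicate, object) triple, as in an RDF triple store.
triples = {
    ("Jose Mourinho", "manages", "Manchester United"),
    ("Zlatan Ibrahimovic", "plays_for", "Manchester United"),
}

def infer_manager_of_player(facts):
    """Illustrative rule: if X manages club C and Y plays for C,
    then X is the manager of Y."""
    inferred = set()
    for (x, p1, club) in facts:
        if p1 != "manages":
            continue
        for (y, p2, club2) in facts:
            if p2 == "plays_for" and club2 == club:
                inferred.add((x, "is_manager_of", y))
    return inferred

print(infer_manager_of_player(triples))
# {('Jose Mourinho', 'is_manager_of', 'Zlatan Ibrahimovic')}
```

A real triple store would express the same rule declaratively (for example in OWL or as a SPARQL construct) rather than as hand-written loops, but the join over a shared club value is the essence of the inference.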
Such a trivial inference may hardly seem ground-breaking, but it represents a great step forward in machines being able to do something meaningful with the data they store. Nonetheless, triple stores are not yet in common use among publishers. Why is this? In practice, there are many difficulties in creating and maintaining metadata in triple-store form.
SPARQL, the most widely used query language for triple stores, is not simple for non-expert users. Writing SPARQL queries is too difficult for the average user, so a natural-language interface needs to be added. Most users, however, are used to keyword search, which can be difficult or even impossible to translate into a meaningful SPARQL query, and there is currently no standard full-text SPARQL interface for simple keyword search.
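To make the usability gap concrete, here is roughly what a keyword search for, say, "machine learning" over article titles might look like in SPARQL. The predicate is illustrative (Dublin Core is a common choice), and the string-matching FILTER is the kind of workaround needed in the absence of a standard full-text operator:

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?article ?title
WHERE {
  ?article dcterms:title ?title .
  FILTER(CONTAINS(LCASE(STR(?title)), "machine learning"))
}
```

A query like this forces a full scan of the matching literals, so it is also typically far slower than a dedicated full-text index, which is one reason vendors bolt on proprietary search extensions.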
A further problem is scalability – the ability of a system to handle not just thousands but millions or even billions of queries, as well as terabytes or petabytes of data. Triple stores are complex to scale, so most rely on large replicated nodes that each hold all the data in memory to perform well, which limits the amount of data that can be indexed without sacrificing query performance.
But perhaps the biggest single problem is the need for highly skilled staff to create and manipulate the triple stores. It is something of a paradox that introducing new technology in publishing has often increased headcount rather than reduced it: someone must manage the complexities of the very software that was supposed to save labour. Managing a triple store and querying it via SPARQL requires skilled staff as well as considerable computing power.
The UNSILO approach
In contrast, UNSILO uses a hybrid approach, combining machine learning with NLP tools to perform complex natural-language parsing and corpus-wide semantic analysis. It is this hybrid approach that gives UNSILO the power to make more accurate and precise links between content objects, which means a more effective recommender engine, or a peer-review system that matches authors and reviewers more precisely than a human can, and in far less time.
UNSILO recognises that creating highly structured data for the semantic web requires a lot of human involvement. It is more efficient to extract concepts and relations automatically, without relying solely on human-created ontologies – and this is what UNSILO does. Machines already equal or surpass humans in many types of tasks within text, sound and image analysis, and machine-based analytics will increasingly outpace human capabilities in the years ahead.