One of the world’s leading university presses, and arguably the oldest, began offering recommended links on the Cambridge Core site (https://www.cambridge.org/core/) this week. The recommender links related content across around one million journal articles and book chapters.
The Cambridge corpus poses a problem for any indexing and recommender tool. As a major university press, Cambridge University Press publishes content across a very wide range of subjects, reflecting a publishing heritage of several hundred years. Major traditional strengths of the list include history, law, and social science, and they have been joined by more recent topics such as neuroscience, anaesthesia, and artificial intelligence. Trying to find a single taxonomy that covers all these subject areas would be a challenge indeed. While there are plenty of single-subject classifications available within each discipline, none of them is broad enough to cover all Cambridge content.
The solution is to use UNSILO’s corpus-based indexing. Because the concepts are derived from an analysis of the corpus itself rather than from an external taxonomy, they can span disciplines, which facilitates cross-domain discovery. An extensive trial of over 100,000 linked documents on the Cambridge site during 2019 demonstrated significant interest among researchers in identifying related content in this way.
How the recommender works
The Cambridge corpus is pre-indexed by UNSILO servers running in the cloud. Individual concepts are identified by assessing each document against others in the corpus and identifying significant words and phrases.
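To make the idea of assessing each document against the rest of the corpus more concrete, here is a minimal sketch. It uses TF-IDF weighting over single words as a stand-in, because the actual UNSILO method is not described in this post; the function names, the sample texts, and the restriction to single words rather than phrases are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def significant_terms(corpus, top_n=3):
    """Rank the words in each document by TF-IDF weight.

    corpus: dict mapping a document id to its text.
    Returns a dict mapping each document id to a list of (term, weight)
    pairs, highest weight first. TF-IDF is an illustrative stand-in here,
    not the UNSILO algorithm.
    """
    tokens = {doc_id: tokenize(text) for doc_id, text in corpus.items()}

    # Count how many documents contain each term.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    n_docs = len(corpus)
    result = {}
    for doc_id, words in tokens.items():
        if not words:
            result[doc_id] = []
            continue
        counts = Counter(words)
        # A term scores highly when it is frequent in this document
        # but rare across the rest of the corpus.
        weights = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        result[doc_id] = ranked[:top_n]
    return result


if __name__ == "__main__":
    sample = {
        "a": "renal disease affects the kidney",
        "b": "the festival celebrates cultural heritage",
        "c": "the seizure disease epilepsy",
    }
    for doc_id, terms in significant_terms(sample).items():
        print(doc_id, [term for term, _ in terms])
```

In this toy corpus a word such as "the", which occurs in every document, receives a weight of zero, while words found in only one document rise to the top. A production system would also need to handle multi-word phrases, which this sketch does not attempt.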
Each night, a background process of continuous ingestion updates the index: concepts are extracted from the latest Cambridge publications and either added to existing index terms or identified as new ones. In other words, the index expands every 24 hours as new content is published. Documents are identified as related when they share a large number of concepts, and the linked articles are then ranked by relevance (based on the significance of each shared concept to the article). The relations between concepts are semantic as well as syntactic, and are based entirely on the words in the content, not on usage, citations, or any other non-intrinsic metadata. The Cambridge index is created specifically for the Press, and is unique to the Cambridge corpus.
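The matching and ranking step described above can be sketched as follows. This is an illustration under stated assumptions, not the UNSILO implementation: the index layout (a concept-to-weight mapping per document), the `min_shared` threshold, the scoring formula (sum of the products of the two weights), and the sample data are all hypothetical.

```python
def related_documents(source_id, index, min_shared=2):
    """Return documents related to source_id, most relevant first.

    index: dict mapping a document id to a dict of {concept: weight},
    where the weight expresses how significant the concept is to that
    document. Both the layout and the scoring are assumptions.
    """
    source = index[source_id]
    scored = []
    for doc_id, concepts in index.items():
        if doc_id == source_id:
            continue
        # Concepts that both documents have in common.
        shared = source.keys() & concepts.keys()
        if len(shared) < min_shared:
            continue  # too little overlap to count as related
        # Weight each shared concept by its significance to both documents.
        score = sum(source[c] * concepts[c] for c in shared)
        scored.append((doc_id, score))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


if __name__ == "__main__":
    sample_index = {
        "article": {"climate change": 0.9, "extinction": 0.8, "biodiversity": 0.4},
        "chapter1": {"climate change": 0.7, "extinction": 0.6},
        "chapter2": {"climate change": 0.2, "biodiversity": 0.3, "policy": 0.9},
        "chapter3": {"arts festivals": 0.9, "climate change": 0.1},
    }
    for doc_id, score in related_documents("article", sample_index):
        print(doc_id, round(score, 2))
```

Note that the score uses only the concepts found in the text, in keeping with the point that usage and citation data play no part. In the sample data, "chapter3" shares a single concept with the article and so falls below the overlap threshold.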
The UNSILO recommender identifies links between individual chapters, as well as articles. A search on Cambridge Core for “climate change and extinction” reveals an article on the subject, which is then linked to a number of chapters and articles on the same subject.
Concepts are related semantically, not only by syntax. For example, a chapter on “epilepsy” is linked to a chapter on “seizures”.
Similarly, the terms “kidney” and “renal” are linked, as they are closely synonymous. A chapter on “kidney disease” is linked to one on “renal disease”.
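As a rough illustration of why synonym handling matters for linking, the sketch below maps "renal" onto "kidney" before comparing concepts. The hand-written `CANONICAL` table is purely hypothetical: according to this post the real relations are derived from the corpus, not from a fixed lookup table, and the function names are assumptions.

```python
# Hypothetical hand-written synonym table; the real index derives such
# relations from the corpus rather than from a fixed list.
CANONICAL = {"renal": "kidney"}


def normalize(concepts):
    """Rewrite each concept so that synonymous words share one form."""
    return {
        " ".join(CANONICAL.get(word, word) for word in concept.lower().split())
        for concept in concepts
    }


def shares_concept(concepts_a, concepts_b):
    """True if the two concept sets overlap once synonyms are merged."""
    return bool(normalize(concepts_a) & normalize(concepts_b))


if __name__ == "__main__":
    print(shares_concept({"kidney disease"}, {"renal disease"}))
    print(shares_concept({"kidney disease"}, {"arts festivals"}))
```

A purely syntactic comparison would treat "kidney disease" and "renal disease" as different strings and miss the link; merging the synonyms first lets the two chapters match.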
Within the arts, a chapter on “arts festivals” is linked to chapters on “cultural heritage” and “cultural expressions” (see the screenshot at the top of this post).
Over the coming months, Cambridge will track usage of the recommender system, and we hope to report the findings in a subsequent post.