Tagging content to UN Sustainable Development Goals

The United Nations Sustainable Development Goals (SDGs) have been one of the most impressive innovations of the UN in recent years. First introduced in 2015, the 17 SDGs (actually 16, with the 17th being partnerships for the goals) were designed to be an easily understood collection of topics, so it is not surprising that publishers and institutions are using them to evaluate their research and to compare their output with that of similar organizations. For example, Times Higher Education (THE) runs an annual ranking that measures universities’ research output against each of the goals. The Princess Nourah bint Abdulrahman University in Saudi Arabia was ranked first for SDG 5, gender equality, based on a number of criteria, including its research output in this area. But how exactly can a university identify the number of publications in one year on gender equality?

Currently, this is estimated using a Boolean search carried out on Scopus, the Elsevier-owned abstracting and indexing (A&I) database. A lengthy Boolean search (available for examination on Mendeley here) attempts to identify all the relevant articles by chaining together OR statements containing words and phrases relating to gender equality. Let’s look at the final version of the Boolean search used for SDG 5, “Achieve gender equality and empower all women and girls”:

Updated search string (third update): 30,458 document results

TITLE-ABS-KEY ( ( {gender inequality} OR {gender equality} OR {employment equity} OR {gender wage gap} OR {female labor force participation} OR {female labour force participation} OR {women labor force participation} OR {women labour force participation} OR {women’s’ employment} OR {female employment} OR {women’s unemployment} OR {female unemployment} OR ( access AND {family planning services} ) OR {forced marriage} OR {child marriage} OR {forced marriages} OR {child marriages} OR {occupational segregation} OR {women’s empowerment} OR {girls’ empowerment} OR {female empowerment} OR {female genital mutilation} OR {female genital cutting} OR {domestic violence} OR {women AND violence} OR {girl* AND violence} OR {sexual violence} OR ( {unpaid work} AND {gender inequality} ) OR ( {unpaid care work} AND {gender inequality} ) OR {women’s political participation} OR {female political participation} OR {female managers} OR {women in leadership} OR {female leadership} OR {intra-household allocation} OR ( access AND {reproductive healthcare} ) OR {honour killing} OR {honor killing} OR {honour killings} OR {honor killings} OR {antiwomen} OR {anti-women} OR {feminism} OR {misogyny} OR {female infanticide} OR {female infanticides} OR {human trafficking} OR {forced prostitution} OR ( equality AND ( {sexual rights} OR {reproductive rights} OR {divorce rights} ) ) OR {women’s rights} OR {gender injustice} OR {gender injustices} OR {gender discrimination} OR {gender disparities} OR {gender gap} OR {female exploitation} OR {household equity} OR {female political participation} OR {women’s underrepresentation} OR {female entrepreneurship} OR {female ownership} OR {women’s economic development} OR {women’s power} OR {gender-responsive budgeting} OR {gender quota} OR ( {foreign aid} AND {women’s empowerment} ) OR {gender segregation} OR {gender-based violence} OR {gender participation} OR {female politician} OR {female leader} OR {contraceptive behaviour} OR {women’s autonomy} OR {agrarian feminism} OR {microfinance} OR {women’s livelihood} OR {women’s ownership} OR {female smallholder} OR {gender mainstreaming} ) ) AND PUBYEAR < 2018 AND PUBYEAR > 2012
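
To make the shape of that string easier to see, here is a minimal Python sketch that assembles a query of the same form from a list of phrases. The function name build_query and the two-phrase list are illustrative assumptions, not part of the original search; only the TITLE-ABS-KEY field code, the braces around exact phrases, and the PUBYEAR filters are taken from the string above.

```python
def build_query(phrases, after_year, before_year):
    """Assemble a Scopus-style query: exact phrases OR-ed together, limited by year.

    Hypothetical helper for illustration; it only builds a string and does not
    contact Scopus.
    """
    # Braces request an exact-phrase match, as in the search string above.
    terms = " OR ".join("{" + phrase + "}" for phrase in phrases)
    return (
        f"TITLE-ABS-KEY ( ( {terms} ) ) "
        f"AND PUBYEAR < {before_year} AND PUBYEAR > {after_year}"
    )


query = build_query(["gender inequality", "gender equality"], 2012, 2018)
print(query)
```

Every additional synonym or spelling variant has to be added to the phrase list by hand, which is why the real string runs to around 85 terms.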

There are around 85 search terms included here. A search of this kind has many limitations:

  • Boolean string searches have no awareness of significance. Any research article that mentions “male unemployment”, for example, will be counted as part of SDG 5, according to the above criteria, even if it makes only a passing reference.
  • In an effort to be comprehensive, the authors of the search have included many terms that may or may not be relevant to the SDG, for example “same sex marriage”, “male unemployment”, “menstrual”, and “unpaid work”.
  • The search only covers titles, abstracts and keywords (the TITLE-ABS-KEY field), so it is limited to articles that mention the respective topics there, rather than only in the full text.
  • Boolean search only looks for strings of characters; it is not semantic. A Boolean search often misses many valid expressions. This one includes “men and violence” and “women and violence”, but not “male violence”, for example.
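
Two of the limitations above can be reproduced with a few lines of Python. This is a simplified sketch of phrase matching, not how Scopus is implemented; the function boolean_match and both sample abstracts are invented for illustration.

```python
def boolean_match(text, phrases):
    """Return True if any phrase occurs verbatim (ignoring case) in the text."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


# Two phrases taken from the search string; the abstracts are made up.
phrases = ["gender inequality", "domestic violence"]

passing_mention = (
    "We model regional house prices; gender inequality is noted "
    "only as a possible confounder."
)
missed_synonym = "A survey of intimate partner violence among rural households."

print(boolean_match(passing_mention, phrases))  # matched despite the passing mention
print(boolean_match(missed_synonym, phrases))   # not matched: the wording differs
```

The first abstract is counted even though gender inequality is incidental to it, and the second is missed because it uses a phrase that is not in the list.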

Is there a better way of achieving the result? The UNSILO concept extraction tool, as used by the OECD to tag all their content to the SDGs, provides, we believe, a much more precise way of selecting content:

  • UNSILO works with concepts, and uses machine learning to identify all the synonyms and closely related expressions for any term within a corpus. The corpus can be as big as you wish. Thus, for example, running the UNSILO concept tool across the Medline (biomedical) corpus reveals the following related terms for “domestic violence”: “family violence”, “partner violence”, “spousal violence”, and “intimate partner violence”. These are not hypothetical phrases, but actual phrases from published articles in the Medline corpus. Despite the authors of the Boolean search revising their search no fewer than four times, the resulting set of terms still has gaps. For example, one of the phrases searched for is “female labour force participation”. Using UNSILO, one sees immediately that the Medline corpus contains examples of a close synonym, “workforce participation”, for “labour force participation”. “Family planning services” has the close synonyms “reproductive health services” and “contraception counseling”, neither of which is in the Boolean set.
  • UNSILO uses clusters of concepts to identify the most relevant content: a considerably more sophisticated approach than string matching. Behind the scenes, each article in the corpus has hundreds of concepts identified. Rather than 85 terms for the Boolean search, the UNSILO engine uses hundreds of terms to identify the most relevant matches. Humans can then assess the cut-off point for what constitutes relevance.
  • UNSILO identifies minor spelling variations, such as British versus American English, by default, as well as common abbreviations (UN for United Nations, for example).
  • Just as with a Boolean search string, the set of concepts identified for the search by UNSILO can be documented, and so provides evidence for the results obtained.
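
The general idea of grouping variants under a concept and applying a human-chosen cut-off can be sketched as follows. This is not UNSILO's algorithm, which is not described here and which learns related expressions from the corpus; in this toy version the variant lists are hard-coded from the examples above, and the names CONCEPTS, concept_score, is_relevant and the cut-off of 2 are assumptions made for illustration.

```python
# Variants grouped under one concept each, using the examples quoted above.
CONCEPTS = {
    "domestic violence": [
        "domestic violence", "family violence", "partner violence",
        "spousal violence", "intimate partner violence",
    ],
    "labour force participation": [
        "labour force participation", "labor force participation",
        "workforce participation",
    ],
    "family planning services": [
        "family planning services", "reproductive health services",
        "contraception counseling",
    ],
}


def concept_score(text, concepts=CONCEPTS):
    """Count distinct concepts present in the text, whichever variant is used."""
    lowered = text.lower()
    found = {
        name
        for name, variants in concepts.items()
        if any(variant in lowered for variant in variants)
    }
    return len(found), sorted(found)


def is_relevant(text, cutoff=2):
    """Apply a human-chosen cut-off to the concept count (assumed value: 2)."""
    score, _ = concept_score(text)
    return score >= cutoff


strong = "Intimate partner violence and women's workforce participation in rural districts."
weak = "We model house prices; family violence is noted as a possible confounder."

print(concept_score(strong))
print(is_relevant(strong), is_relevant(weak))
```

Because the score counts concepts rather than raw string hits, an article that uses a synonym is still found, while one with a single passing mention falls below the cut-off. The concept table itself can be published alongside the results, much as a Boolean string can.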

It would be an interesting exercise to compare what is found, and what is missed, by the two approaches. In the meantime, the results of using UNSILO for SDGs can be seen on the dedicated OECD website, and on the Taylor and Francis SDG website.

