The Semantic Measures Library

Today I’ll write a bit about the Semantic Measures Library. My work focuses on semantic applications, and I used this library extensively for my first prototypes, where it has been a great help. Let’s get started!

Semantics vs terminology

First, I need to define a few terms that I may use throughout this post: “semantics”, “concepts/concept-based”, “semantic web” and “ontology”. You may have heard of them; I believe they are something of a buzzword nowadays. Anyway, here is a simple explanation of semantics from the computer science point of view.

When you type a Google query, you express an information need to the information retrieval system (IRS) using keywords (e.g. “smartphones review”). Google gives you its best matches for the keywords you entered. There might be some very relevant web pages that don’t contain the exact keywords: some may talk about “smartphones comparison” or “smartphone tests”. The thing is, with purely keyword-based matching, if the keyword “review” is missing from a web page, that page will not be returned*.

The idea behind semantics is to work with concepts, defined by their meaning, instead of only the words that label them. In semantic applications (information retrieval, indexing, etc.), we don’t use terms, we use concepts. For example, instead of the keyword “java”, we would have a unique identifier referring to the Java programming language. What’s the difference? Well, if you annotate a document with the keyword “java”, the reader cannot tell whether the document refers to the programming language or to the island of Java. If you annotate it with the unique ID of the Java programming language, you annotate it with the actual meaning, not a keyword.

Semantics brings something else that is useful: structure. The concepts are generally organised as a graph called an ontology. This means that besides providing a controlled vocabulary (unique IDs, no synonymy, no polysemy), the ontology gives links between pairs of concepts, such as specialisation/generalisation, e.g., a dog is an animal.

*Well, OK. It appears that Google actually uses some semantics. Apparently, even though it doesn’t consider structure, the IRS handles synonymy well. Try typing “car image”: you will get results with the keywords “car picture” highlighted.
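To make the difference between a keyword and a concept more concrete, here is a minimal sketch of a tiny ontology in plain Python. The concept IDs (C1 to C5), the labels and the is-a links are all made up for this illustration; they do not come from any real ontology or from the Semantic Measures Library.

```python
# Hypothetical concept IDs and labels, invented for illustration only.
LABELS = {
    "C1": "thing",
    "C2": "programming language",
    "C3": "island",
    "C4": "Java",  # the programming language
    "C5": "Java",  # the island
}

# Specialisation links (is-a): child concept -> parent concepts.
IS_A = {
    "C2": ["C1"],
    "C3": ["C1"],
    "C4": ["C2"],
    "C5": ["C3"],
}


def concepts_for_label(label):
    """Return every concept ID whose label matches the keyword (case-insensitive)."""
    return sorted(cid for cid, text in LABELS.items() if text.lower() == label.lower())


def ancestors(concept):
    """Return all concepts reachable by following is-a links upward."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# The keyword "java" is ambiguous: it labels two distinct concepts.
print(concepts_for_label("java"))
# The concept ID is not: following is-a links from C4 leads to "programming language".
print(sorted(LABELS[c] for c in ancestors("C4")))
```

Annotating a document with C4 rather than with the string “java” removes the ambiguity, and the is-a links give you the structure to reason with.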

Why do we need such a library?

OK. Now you’re convinced that concept-based applications are definitely what you need. I said that knowledge can be represented as a graph. This opens up a lot of possibilities, especially computing the similarity between pairs of concepts. Without an ontology, how can you assess the similarity between the concept of a dog and that of a cat? By using the graph structure, you know that the concept referring to the dog and the one referring to the cat are both specialisations of the concept “animal”. They have a common ancestor, there is a path from one to the other through this ancestor, and some measures rely on this structure to assess the similarity of pairs of concepts. Many such pairwise measures have been defined. By combining pairwise measures, one can assess the similarity of two groups of concepts, for instance by taking the minimum pairwise similarity between the groups. There is an extensive literature on this subject, which I won’t cover in this post. If you want more on this, read this comprehensive survey.
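Here is a rough sketch of the dog/cat idea, written in plain Python rather than with the library itself. It assumes a toy taxonomy that I made up, and it uses one simple path-based measure among the many that exist: 1 / (1 + length of the shortest path through a common ancestor). The function names are mine, not the library’s.

```python
from collections import deque

# Hypothetical toy taxonomy: child -> parents (is-a links).
IS_A = {
    "dog": ["mammal"],
    "cat": ["mammal"],
    "mammal": ["animal"],
    "sparrow": ["bird"],
    "bird": ["animal"],
    "animal": [],
}


def distances_to_ancestors(concept):
    """Map the concept and each of its ancestors to its distance in is-a links."""
    dist = {concept: 0}
    queue = deque([concept])
    while queue:
        node = queue.popleft()
        for parent in IS_A.get(node, []):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def path_similarity(a, b):
    """Pairwise measure: 1 / (1 + shortest path length through a common ancestor)."""
    da, db = distances_to_ancestors(a), distances_to_ancestors(b)
    common = da.keys() & db.keys()
    if not common:
        return 0.0  # no common ancestor, so no path between the two concepts
    shortest = min(da[c] + db[c] for c in common)
    return 1.0 / (1.0 + shortest)


def min_pairwise_similarity(group_a, group_b):
    """Groupwise measure: the minimum pairwise similarity between two groups."""
    return min(path_similarity(a, b) for a in group_a for b in group_b)


print(path_similarity("dog", "cat"))      # path goes through "mammal"
print(path_similarity("dog", "sparrow"))  # path goes through "animal"
print(min_pairwise_similarity({"dog", "cat"}, {"cat", "sparrow"}))
```

With this toy taxonomy, the dog ends up closer to the cat than to the sparrow, because the path through “mammal” is shorter than the one through “animal”. Taking the minimum is only one way to combine pairwise scores; the average or the best match per concept are other common choices.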

So how can this library be useful? Well, I use it all the time in my prototypes to compute semantic similarities. For example, biomedical papers are often annotated with MeSH (an ontology) descriptors (basically, concepts). Check out this one. There is a list of MeSH terms for this paper. When we need the similarity between two documents, one way is to compute the semantic similarity of the concepts annotating them. With the rise of linked data, computing semantic similarities can be really useful. Take for instance Freebase, which offers a lot of interrelated entities. You can find movies associated with genres, and the genres are themselves organised in a hierarchy. So on the one hand you have semantic annotations of movies (the genres) and on the other hand you have the structure of the genres. By computing the similarity between films or groups of films, you can easily cluster them, build a graph of them, recommend them, etc.
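The movie example can be sketched in a few lines of plain Python. Everything here is invented for the illustration: the genre hierarchy, the movie names and their annotations are not taken from Freebase, and the measure (a Jaccard index over each movie’s genres extended with their ancestors) is just one simple groupwise option, not necessarily what you would pick in a real project.

```python
# Hypothetical genre hierarchy: genre -> parent genres.
GENRE_PARENTS = {
    "slasher": ["horror"],
    "zombie": ["horror"],
    "romantic comedy": ["comedy"],
    "horror": ["fiction"],
    "comedy": ["fiction"],
    "fiction": [],
}

# Hypothetical semantic annotations: movie -> genres.
MOVIES = {
    "Movie A": {"slasher"},
    "Movie B": {"zombie"},
    "Movie C": {"romantic comedy"},
}


def with_ancestors(genres):
    """Return the genres together with all of their ancestors in the hierarchy."""
    closed = set()
    stack = list(genres)
    while stack:
        genre = stack.pop()
        if genre not in closed:
            closed.add(genre)
            stack.extend(GENRE_PARENTS.get(genre, []))
    return closed


def movie_similarity(m1, m2):
    """Jaccard index over the two movies' genre sets, extended with ancestors."""
    a, b = with_ancestors(MOVIES[m1]), with_ancestors(MOVIES[m2])
    return len(a & b) / len(a | b)


def most_similar(movie):
    """Rank the other movies from most to least similar (a naive recommender)."""
    others = [m for m in MOVIES if m != movie]
    return sorted(others, key=lambda m: movie_similarity(movie, m), reverse=True)


print(most_similar("Movie A"))
```

Movie A and Movie B share no genre directly, yet they come out as similar because both genres specialise “horror”. A plain keyword comparison of the annotations would have missed that.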

So now, check this library out. There are some code snippets that may help you get started on a project. Soon, I’ll propose a simple application of this library to make things even clearer.
