Semantic clustering benchmark

The benchmark project for evaluating semantic clustering was initiated at the LGI2P research center during Nicolas Fiorini's PhD. The main motivation behind this work is to provide several datasets of semantically annotated documents.

This benchmark contains 8 datasets, each containing about 70 bookmarks annotated with WordNet 3.0. One dataset is meant for tuning your method; the other seven are meant for evaluating it. Each dataset comes with a set of expert trees that were created manually. To evaluate your results, compare your tree for each dataset with the expert trees using the provided Python script. The average distance to the expert trees gives the score of your output tree for that dataset.
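The scoring step described above can be sketched as follows. This is only an illustration of averaging a tree distance over several expert trees, not the provided script: the nested-tuple tree format, the function names, and the cluster-based `toy_distance` are all assumptions made for this example, and the benchmark's own distance measure may differ.

```python
from statistics import mean


def clusters(tree):
    """Collect the leaf set under every internal node of a nested-tuple tree.

    Assumed format (hypothetical): a leaf is any non-tuple value such as a
    bookmark id, and an internal node is a tuple of child nodes.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):  # leaf
            return frozenset([node])
        under = frozenset().union(*(leaves(child) for child in node))
        found.add(under)
        return under

    leaves(tree)
    return found


def toy_distance(tree_a, tree_b):
    """Stand-in distance: number of clusters present in only one of the trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))


def average_distance(candidate, expert_trees, distance=toy_distance):
    """Score a candidate tree as its mean distance to the expert trees."""
    return mean(distance(candidate, expert) for expert in expert_trees)


# Toy data: one candidate tree and two expert trees over four bookmarks.
candidate = (("a", "b"), ("c", "d"))
experts = [
    (("a", "b"), ("c", "d")),  # identical to the candidate
    (("a", "c"), ("b", "d")),  # groups the bookmarks differently
]
score = average_distance(candidate, experts)
```

A lower score means the candidate tree is closer to the expert trees on average. Passing the distance as a parameter keeps the averaging logic separate from whichever measure is actually used.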

This benchmark was built by curating linked open data (LOD) that contains users and bookmarks associated with synsets. The synset descriptions in that data provide the correspondence with WordNet. We mapped the bookmarks to the corresponding WordNet synsets and pruned the LOD graph, yielding a clean dataset of bookmarks associated with WordNet URIs.

For more information, please visit
