Francine Chen, Palo Alto Research Center, "Multiple Similarity Measures and Source Pair Information for Improving Story Link Detection"
Abstract:
State-of-the-art story link detection systems, that is, systems that determine whether two stories are about the same news event, are usually based on the cosine similarity measured between two stories. I will first present an overview of a cosine-similarity-based link detection system and then describe a method for improving the performance of a link detection system by using a variety of similarity measures and source-pair specific statistical information. The utility of a number of different similarity measures, including cosine, Hellinger, Tanimoto, and clarity, is examined, both alone and in combination, and several machine learning techniques for combining the different types of information are compared. The techniques investigated are SVMs, voting, and decision trees, each of which makes use of similarity and statistical information differently. Experimental results indicate that combining similarity measures and source-pair specific statistical information using an SVM provides the largest improvement in estimating whether two stories are linked.
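
As a rough illustration of three of the similarity measures named above, the following minimal sketch computes cosine, Tanimoto, and Hellinger similarity between two stories represented as raw term-frequency vectors. It is an assumption-laden sketch, not the system described in the talk: the whitespace tokenizer, the unweighted term counts, the function names, and the particular form of the Hellinger measure (the sum of square roots of products of term probabilities) are all choices made here for illustration. The clarity measure is omitted because it depends on a background language model that the abstract does not specify.

```python
import math
from collections import Counter


def term_freqs(text):
    """Hypothetical tokenizer: lowercase, whitespace-split term counts."""
    return Counter(text.lower().split())


def _dot(a, b):
    # Only terms present in both stories contribute to the dot product.
    return sum(a[t] * b[t] for t in a.keys() & b.keys())


def cosine(a, b):
    """Cosine of the angle between the two term-frequency vectors."""
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return _dot(a, b) / (norm_a * norm_b)


def tanimoto(a, b):
    """Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = _dot(a, b)
    denom = sum(v * v for v in a.values()) + sum(v * v for v in b.values()) - dot
    return dot / denom if denom else 0.0


def hellinger(a, b):
    """Hellinger-style similarity (assumed form): sum of sqrt(p(t) * q(t)),
    where p and q are the term distributions of the two stories."""
    total_a = sum(a.values())
    total_b = sum(b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    return sum(
        math.sqrt((a[t] / total_a) * (b[t] / total_b))
        for t in a.keys() & b.keys()
    )


if __name__ == "__main__":
    s1 = term_freqs("earthquake hits coastal city")
    s2 = term_freqs("strong earthquake hits city overnight")
    for name, fn in [("cosine", cosine), ("tanimoto", tanimoto), ("hellinger", hellinger)]:
        print(name, round(fn(s1, s2), 3))
```

In a combination scheme such as the one the abstract describes, scores like these, together with source-pair statistics, could serve as input features to a classifier; the feature layout for that step is not given in the abstract and is left out here.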