Publication

Reducing the Loss of Information through Annealing Text Distortion

Sept. 23, 2010

People

Manuel Cebrian

Former Research Scientist

Ana Granados, Manuel Cebrian, David Camacho, Francisco de Borja Rodriguez

Abstract

Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper we take a step towards understanding compression distances by experimentally evaluating the impact of several kinds of information distortion on compression-based text clustering. We show that progressively removing words, in such a way that the complexity of a document is slowly reduced, helps compression-based text clustering and improves its accuracy. In fact, we show that clustering of the non-distorted text can be improved by means of annealing text distortion. The experimental results shown in this paper are consistent across different data sets and across compression algorithms belonging to the most important compression families: Lempel-Ziv, statistical, and block-sorting.
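
The abstract does not give a formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the widely used normalized compression distance (NCD) as the compression distance, uses the Python standard-library compressors `zlib` (a Lempel-Ziv style compressor) and `bz2` (a block-sorting compressor) as stand-ins, and introduces a hypothetical helper, `remove_frequent_words`, as one simple way to distort a text by removing words. The distortion techniques evaluated in the paper may differ.

```python
import bz2
import re
import zlib
from collections import Counter


def compressed_size(data: bytes, compressor=zlib.compress) -> int:
    """Length in bytes of the compressed data."""
    return len(compressor(data))


def ncd(x: str, y: str, compressor=zlib.compress) -> float:
    """Normalized compression distance (assumed here, not named in the abstract).

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx = compressed_size(bx, compressor)
    cy = compressed_size(by, compressor)
    cxy = compressed_size(bx + by, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)


def remove_frequent_words(text: str, k: int) -> str:
    """Hypothetical distortion: drop the k most frequent words of the text."""
    words = re.findall(r"\w+", text.lower())
    most_frequent = {word for word, _ in Counter(words).most_common(k)}
    return " ".join(word for word in words if word not in most_frequent)


if __name__ == "__main__":
    doc_a = "the cat sat on the mat and the dog sat on the rug " * 10
    doc_b = "stock markets fell as investors weighed the interest rate news " * 10
    for k in (0, 1, 2, 4):  # progressively stronger distortion
        da, db = remove_frequent_words(doc_a, k), remove_frequent_words(doc_b, k)
        print(k, round(ncd(da, db), 3), round(ncd(da, db, bz2.compress), 3))
```

A clustering experiment along these lines would compute the pairwise distance matrix at each distortion level and feed it to a clustering algorithm. The toy documents above only illustrate the calls and say nothing about the accuracy results reported in the paper.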

TR-642.pdf

Word-of-mouth algorithms: What you do not know will hurt you

Introducing Causality and Traceability in Word-of-Mouth Algorithms

The genetic algorithm as a general diffusion model for social networks

Measuring the Collective Potential of Populations from Dynamic Interaction Data
