Importance Weighted Feature Selection Strategy for Text Classification

Baoli Li
27 November 2016
Feature selection, which aims to obtain a compact and effective feature subset for better performance and higher efficiency, has been studied for decades. Traditional feature selection metrics, such as Chi-square and information gain, fail to consider how important a feature is within a document: every feature that occurs in a document is treated equally, no matter how much effective semantic information it carries. Intuitively, metrics calculated this way are very likely to introduce considerable noise. In this study, we therefore extend the work of Li et al. [1] on the document frequency metric and propose a general importance weighted feature selection strategy for text classification, in which the importance value of a feature in a document is derived from its relative frequency in that document. Extensive experiments with two state-of-the-art feature selection metrics (Chi-square and information gain) on three text classification datasets demonstrate the effectiveness of the proposed strategy.
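To make the idea concrete, the following is a minimal Python sketch of how an importance weighted Chi-square score might be computed. It is an illustration under stated assumptions, not the paper's exact formulation: the abstract only says that importance is derived from relative frequency, so the sketch assumes the importance value is simply the term count divided by the document length, and that the binary presence counts in the usual 2x2 contingency table are replaced by sums of these values. The function names `relative_frequencies` and `weighted_chi_square` are hypothetical.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each term to count / document length (the assumed importance value)."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: n / total for term, n in Counter(tokens).items()}


def weighted_chi_square(docs, labels, term, target):
    """Chi-square for (term, target) using fractional, importance weighted counts.

    Assumption: each document contributes its relative frequency of `term`
    to the "term present" cell and the remainder to the "term absent" cell,
    instead of contributing a hard 0 or 1.
    """
    a = b = 0.0        # weighted presence inside / outside the target class
    n_pos = n_neg = 0  # number of documents inside / outside the target class
    for tokens, label in zip(docs, labels):
        w = relative_frequencies(tokens).get(term, 0.0)
        if label == target:
            a += w
            n_pos += 1
        else:
            b += w
            n_neg += 1
    c = n_pos - a      # weighted absence inside the target class
    d = n_neg - b      # weighted absence outside the target class
    n = n_pos + n_neg
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0     # term never occurs, or one side has no documents
    return n * (a * d - b * c) ** 2 / denom


# Toy corpus (made up for illustration only).
docs = [
    ["goal", "goal", "match"],
    ["goal", "team"],
    ["vote", "party"],
    ["vote", "goal", "law", "tax"],
]
labels = ["sport", "sport", "politics", "politics"]

for term in ("goal", "match", "vote"):
    print(term, weighted_chi_square(docs, labels, term, "sport"))
```

In this sketch a term that dominates the documents of one class accumulates more weight than a term that merely appears once in a long document, which is the intuition the abstract describes. The same substitution of fractional counts could in principle be applied to information gain, though the details would again be an assumption.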

