Abstract:
Sentiment texts are typically short, which leads to limited information and sparse features. Moreover, the conventional n-gram approach ignores the redundancy and relevance between words. This paper proposes an n-gram feature selection method based on the chi-square statistic. First, each feature is evaluated by taking into account both the simultaneous and the individual occurrence of features within the feature set, based on the idea that the occurrence of one feature but not the other may also convey valuable information for discrimination. Then, the chi-square statistic is used to calculate the relevance between features and categories, which reduces the redundancy between words, so that n-gram features with high category relevance and low redundancy can be extracted. Finally, a Support Vector Machine classifier is used to identify text orientation on different corpora; the experimental results show that this method improves the accuracy of text classification.
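
As an illustration of the chi-square relevance step only, the following minimal sketch scores word n-grams against a binary sentiment category using the standard 2x2 contingency form of the chi-square statistic. It is not the paper's implementation: the function names (`ngrams`, `chi_square`, `select_features`), the whitespace tokenization, the binary positive/negative setting, and the toy data are all assumptions made for the example, and the co-occurrence-based redundancy evaluation and the SVM classification stage are not reproduced here.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) found in a token list."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 feature/category contingency table.

    a: documents in the category that contain the feature
    b: documents outside the category that contain the feature
    c: documents in the category that lack the feature
    d: documents outside the category that lack the feature
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, positive, k, n_max=2):
    """Rank n-gram features by chi-square score and keep the top k.

    Assumes a binary task: `positive` names one category and every
    other label is treated as the complementary category.
    """
    doc_feats = [ngrams(text.lower().split(), n_max) for text in docs]
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos

    # Document frequency of each n-gram inside and outside the category.
    pos_df, neg_df = Counter(), Counter()
    for feats, y in zip(doc_feats, labels):
        (pos_df if y == positive else neg_df).update(feats)

    scores = {}
    for feat in set(pos_df) | set(neg_df):
        a, b = pos_df[feat], neg_df[feat]
        scores[feat] = chi_square(a, b, n_pos - a, n_neg - b)

    # Sort by descending score; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    docs = ["good movie", "good plot", "bad movie", "bad plot"]
    labels = ["pos", "pos", "neg", "neg"]
    for feat, score in select_features(docs, labels, "pos", k=4):
        print(f"{feat}\t{score:.3f}")
```

In this toy corpus the polarity words score highest because each occurs in only one category, while words such as "movie" that are spread evenly across both categories score zero. The retained n-grams could then be used as the feature set for a classifier such as a Support Vector Machine.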