Article title (translated from Persian)

A logistic regression-based smoothing method for Chinese text classification

English title

A logistic regression-based smoothing method for Chinese text categorization

Article code	Publication year	Length of the English article
1409	2011	10 pages (PDF)

Source

Publisher: Elsevier - ScienceDirect

Journal: Expert Systems with Applications, Volume 38, Issue 9, September 2011, Pages 11581–11590

Translated keywords

Text classification - N-gram-based classification - Feature selection - Word segmentation - Logistic regression

English keywords

Text classification, N-gram-based classification, Feature selection, Word segmentation, Logistic regression


Article preview

English abstract

Automatic Chinese text classification is an important and well-known technology in the field of machine learning. The first step in solving Chinese text categorization problems is to tokenize the Chinese words from a sequence of non-segmented sentences. Previous work often employs a Chinese word tokenizer that was trained on different sources and then applies conventional text classification approaches. However, these tokenizers are not perfect and often provide incorrect word boundary information. In this paper, we propose an N-gram-based language model for Chinese text categorization that takes word relations into account and requires no Chinese word tokenizer. To handle out-of-vocabulary terms, we also propose a novel smoothing approach based on logistic regression to improve accuracy. The experimental results show that our approach outperforms traditional methods by at least 11% in micro-average F-measure.
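
To make the idea of classifying without a word tokenizer concrete, the following is a minimal sketch of a per-class character N-gram language model. It is an illustration under stated assumptions, not the paper's method: the class name `CharNgramClassifier`, the toy training texts, and equal class priors are all hypothetical, and add-one smoothing is swapped in for the logistic regression-based smoothing that the paper proposes.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams; no word segmentation is needed."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramClassifier:
    """One character n-gram model per class, with add-one smoothing.

    Add-one smoothing is a stand-in chosen for brevity; it is NOT the
    logistic regression-based smoothing proposed in the paper.
    Class priors are assumed equal and therefore omitted.
    """

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram counts
        self.context_counts = defaultdict(Counter)  # class -> (n-1)-gram counts
        self.vocab = set()                          # characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        """Sum of log P(last char | preceding n-1 chars) under one class."""
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[label][gram] + 1
            den = self.context_counts[label][gram[:-1]] + v
            total += math.log(num / den)
        return total

    def predict(self, text):
        """Return the class whose n-gram model scores the text highest."""
        return max(self.ngram_counts, key=lambda c: self.log_score(text, c))


# Toy, made-up training data: one short text per class.
clf = CharNgramClassifier(n=2).fit(
    ["股市上涨 投资者乐观", "球队赢得比赛 冠军"],
    ["finance", "sports"],
)
print(clf.predict("股市投资"))
```

Because the features are raw character sequences, an incorrect word boundary can never be introduced; the price is that unseen N-grams must be handled by the smoothing step, which is where the paper's contribution lies.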

English introduction

In recent years, with the rapid growth of the World Wide Web, more and more digital text services have appeared on the internet, for example Google and Yahoo. A search engine not only provides users with information but also manages that information effectively. To make this easier, many search engine systems classify texts in advance to organize their taxonomy. For example, Lam, Ruiz, and Srinivasan (1999) note that “One useful application for automatic categorization is to support effective text retrieval”. Owing to the huge amount of data, it is quite difficult to classify all texts manually, and manual selection is also arbitrary. Automatic text classification is a well-studied technique in the machine learning and data mining domains, with many applications such as web classification, information retrieval, and information filtering. For example, Jiang (2006) indicates that “In regard to spam filtering, it can be use to classify incoming messages into either legitimate or spam category”. Automatic text classification methods rely on machine learning techniques, of which there are two typical types: supervised learning and unsupervised learning. Unsupervised learning is a kind of two-step training: it first clusters all texts according to their features and then assigns each cluster to a class to build the classification model. In contrast, supervised learning trains the model on data that has been assigned to classes beforehand. The difference is that the former does not need labeled training data. We follow the supervised learning paradigm, since unsupervised learning is known to remain far from the state of the art.
Over the past decade, many supervised machine learning techniques have been applied to text categorization problems, such as Naive Bayes classifiers (Sebastian, 2002 and Yen et al., 2006), support vector machines (Sebastian, 2002 and Yen et al., 2006), linear least squares models, neural networks (Sebastian, 2002 and Yen et al., 2006), and k-nearest neighbor classifiers (Sebastian, 2002 and Yen et al., 2006). Yang and Liu experimented on Reuters news text (Yang, 1999 and Yang and Liu, 1999). They reported that the support vector machine and the k-nearest neighbor classifier achieved the best accuracy in comparison to the other four methods. Sebastian (2002) pointed out that the SVM performs better in the general case. However, support vector machine classifiers still have several problems. First, they are designed to solve binary classification problems and must therefore be adapted to handle multi-class classification. Second, support vector machine classifiers generally cannot produce probabilities. Most of them represent the similarity between a class and a text using cosine similarity which, unlike a probability, is a rather abstract measure. These problems were addressed by Tipping (2001), who combined logistic regression with support vector machine classifiers to generate probabilities. Meanwhile, the multi-class classification problem can be transformed into several binary classification problems. Nevertheless, a known problem with Tipping's approach is its relative inability to scale to large problems such as text classification. Fortunately, Silva and Ribeiro (2006) solved this problem by means of dimension reduction.
Nonetheless, these “standard” approaches to Chinese text categorization have so far used a document representation in a word-based ‘input space’ (He, Tan, & Tan, 2003), i.e. a vector in some high (or trimmed) dimensional Euclidean space where each dimension corresponds to a word. This representation is effective and highly feasible, but it assumes that words are independent of each other, and for Chinese text it requires word segmentation to be solved first. Most text classifiers treat features as mutually independent, yet word segmentation in Chinese is a difficult problem, and we cannot ensure that the words produced by a word segmentation process are mutually independent. To address this, character-level N-gram models (Cavnar and Trenkle, 1994, Damashek, 1995, Peng and Schuurmans, 2003, Peng et al., 2003 and Teahan and Harper, 2001) can be applied. These approaches generally outperform traditional word-based methods in terms of accuracy. Moreover, they avoid the word segmentation step and model word-by-word relations (Dumais et al., 1998 and Peng et al., 2002).
Following this line, in this paper we present a novel N-gram-based smoothing estimator that uses logistic regression for Chinese text categorization. One main feature is that the method captures the dependence relations between Chinese characters from the given training data and can be generalized to handle unknown words in the testing text. With the proposed logistic regression-based smoothing algorithm, it achieves better empirical accuracy than the other smoothing methods. Our method can also be combined with conventional feature selection criteria such as chi-square and information gain. The remainder of the paper is organized as follows: Section 2 introduces related work, Section 3 presents the proposed text categorization method, Section 4 reports the experiments, and Section 5 draws conclusions and outlines future work.
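
As an illustration of the conventional chi-square criterion mentioned above, here is a minimal sketch that scores each N-gram against a class from a 2x2 document-count table and keeps the top-ranked terms. The function names, the toy documents, and the tie-breaking rule are assumptions made for this example; this is the textbook chi-square statistic, not the novel feature selection method the paper proposes.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one (term, class) pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class carries no signal
    return n * (a * d - c * b) ** 2 / denom


def top_k_terms(doc_terms, labels, target, k):
    """Rank terms by chi-square against `target` and return the best k.

    doc_terms: list of sets of terms (e.g. character n-grams), one per document
    labels:    list of class labels, aligned with doc_terms
    Ties are broken alphabetically so the result is deterministic.
    """
    vocabulary = set().union(*doc_terms)
    scores = {}
    for term in vocabulary:
        a = b = c = d = 0
        for terms, label in zip(doc_terms, labels):
            in_class = label == target
            has_term = term in terms
            if has_term and in_class:
                a += 1
            elif has_term:
                b += 1
            elif in_class:
                c += 1
            else:
                d += 1
        scores[term] = chi_square(a, b, c, d)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy, made-up corpus of bigram sets.
docs = [{"股市", "投资"}, {"股市", "上涨"}, {"比赛", "冠军"}, {"比赛", "投资"}]
labels = ["finance", "finance", "sports", "sports"]
print(top_k_terms(docs, labels, "finance", 2))
```

Note that the statistic is symmetric: a term that appears only outside the target class scores as high as one that appears only inside it, since both are equally informative about class membership.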

English conclusion

This paper presents an n-gram-based model for Chinese text classification. We use logistic regression to smooth the n-gram probabilities. In our experiments, logistic regression smoothing outperforms traditional back-off smoothing, because logistic regression can handle unknown terms and does not overestimate conditional probabilities that were originally zero. In addition, we propose a novel feature selection method suited to N-gram-based models; our experiments show that it improves the F-measure in most cases. Third, we treat a text as a set of sentences. According to our experimental results, this improves system performance, especially when a tri-gram model is used or when the training data set is large enough. In the future, we plan to find a method for evaluating the relationships between sentences, since we believe that treating a text as a set of sentences is not an optimal assumption and that some factors may reflect the relationships between sentences. We will also try to reduce the memory consumption and CPU time of feature selection.
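
The abstract reports results as micro-average F-measure. For readers unfamiliar with the metric, the following is a minimal sketch of how it is commonly computed: true positives, false positives, and false negatives are pooled over all classes before precision and recall are taken. The function name `micro_f1` and the toy labels are assumptions for this example, and the paper's exact evaluation protocol may differ.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over all classes.

    gold, predicted: lists of sets of class labels, one set per document,
    so both single-label and multi-label assignments are supported.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels assigned correctly
        fp += len(p - g)   # labels assigned but not in the gold standard
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up labels: one of four documents is misclassified.
gold = [{"a"}, {"b"}, {"a"}, {"c"}]
pred = [{"a"}, {"a"}, {"a"}, {"c"}]
print(micro_f1(gold, pred))
```

Because counts are pooled before averaging, frequent classes dominate the score; in the single-label case, where every document receives exactly one predicted label, micro-averaged F1 equals plain accuracy.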