Efficient feature selection for complex text

Abstract

This paper presents a new feature selection and feature contrasting approach for the classification of highly imbalanced textual data with a high degree of similarity between the associated classes. An example of such a classification context is the task of classifying bibliographic references into a patent classification scheme. This task is one of the domains of investigation of the QUAERO project, whose final goal is to help experts evaluate upcoming patents through the use of related research.

1 Introduction

Text categorization is a machine learning task which aims at automatically assigning predefined category labels to new incoming free-text documents with related characteristics [COH 05]. Because of its numerous applications, such as mail or news filtering [COR 07][WAN 08], emotion detection [PAN 08], text genre analysis [BHA 93], or patent categorization [KOS 01][KOS 10], text classification has been one of the most studied branches of machine learning [HIL 07]. However, a number of classification problems raise new challenges in the domain, especially those which involve imbalanced data and highly similar classes. In the context of text categorization, patent validation assistance belongs to that class of problems. It consists in helping experts in their task of evaluating the novelty of a patent, based on the automatic assignment of the most relevant scientific papers to the classification codes of the said patent. Since learning is based on citations extracted from the patents, which are usually associated with a hierarchy of classification codes of different levels of generality, first, there is no guarantee of a homogeneous distribution of the citations (i.e. learning samples) among the codes, and second, there is a high chance of finding similar citations in different classes.
We show in this paper that standard classification and preprocessing strategies lead to unusable results in the above-mentioned context, and we therefore propose a new preprocessing alternative. The remainder of the paper is structured as follows. Section 2 presents the process we used to generate our experimental dataset. Section 3 presents a new feature selection approach suited to the unsolved class imbalance and class similarity problems. Section 4 compares the results obtained with and without the proposed approach. Section 5 draws our conclusions and perspectives.

2 Dataset generation

Our main experimental resource comes from the QUAERO project (http://www.quaero.org). It is a collection of patents related to the pharmacology domain, completed with bibliographic references from the Medline database (http://www.ncbi.nlm.nih.gov/pubmed/). The source data contains 6387 patents in XML format, grouped into 15 subclasses of the A61K class (medical preparation). To obtain the bibliographic references, 25887 citations are first extracted from the patents. The Medline database is then queried with the extracted citations for related references. The querying results in 7501 bibliographic references (a Medline recall of 90% relative to the unique references), which are then labeled with the class code of their citing patent. The set of labeled references constitutes the final document set on which training is performed. It is converted to a bag-of-words model [SAL 71] using the TreeTagger tool [SCH 94] developed by the Institute for Computational Linguistics of the University of Stuttgart. In our case, the references are first lemmatized and the tagging process is performed on the lemmatized items (when a word is unknown to the lemmatizer, its original form is kept). The punctuation signs and the numbers identified by the tagger are deleted.
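This preprocessing step can be sketched as follows. The sketch is a minimal stand-in: a toy lemma dictionary replaces TreeTagger (which is not reproduced here), and the tokenizer is a simple regular expression; unknown words keep their surface form and punctuation and numbers are dropped, as described above.

```python
import re
from collections import Counter

# Hypothetical stand-in for the TreeTagger lemmatizer: a small lemma
# dictionary; unknown words keep their original form, as in the paper.
LEMMAS = {"studies": "study", "proteins": "protein", "binding": "bind"}

def preprocess(reference_text):
    """Tokenize a bibliographic reference, lemmatize each token (falling
    back to the surface form for unknown words), drop punctuation signs
    and numbers, and return a bag-of-words term-frequency vector."""
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\w\s]", reference_text.lower())
    kept = [LEMMAS.get(t, t) for t in tokens if t.isalpha()]
    return Counter(kept)

vector = preprocess("Protein binding studies, 2001: 12 proteins.")
```

The resulting counters are then assembled into the term-frequency vectors described below.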
Every reference is finally represented as a term vector filled with term frequencies. The description space generated by the tagger has dimensionality 31214. To reduce the generated noise, a frequency threshold of 45 (i.e. an average threshold of 3 per class) is applied to the extracted descriptors, resulting in a thresholded description space of dimensionality 1804. Finally, the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme [SAL 88] is applied to the thresholded space to obtain a sparse representation of the data.

Fig. 1. Distribution of data in the patent classes (patents, cited articles, descriptors).

Figure 1 highlights the highly imbalanced distribution of the patents, the extracted references, and the keywords attached to the references over the different class codes. As an example, the smallest class contains only 22 extracted references (class A61K41) whilst the biggest one has more than 2500 (class A61K31). The exploitation of resampling techniques [GOO 06] as well as of standard feature selection techniques [FOR 03] could be envisaged to compensate for the influence of the bigger classes. However, in our context, the ability of such techniques to precisely detect the right class is curtailed by the high class-to-class similarity resulting from the association of the initial patents with a specialized branch of the patent classification: the inter-class similarity, computed as the cosine correlation between class profiles built from the descriptors of the extracted bibliographic references, indicates that more than 70% of class pairs have a similarity between 0.5 and 0.9. As an alternative, we thus propose a new filter approach which relies on a class-based quality measure grounded on the feature maximization metric. This metric has been formerly exploited by Falk et al.
in the unsupervised context for clustering French verbs on the basis of syntactic and semantic features [FAL 12], and the said authors demonstrated both its intrinsic efficiency for the clustering task and its generic advantages for cluster labeling.

3 Feature maximization based selection

Let us consider a set of clusters C resulting from a clustering method applied to a set of data D described by a set of features F. Feature maximization, introduced by Falk et al. in [FAL 12], is a metric which favors clusters with maximal Feature F-measure. The Feature F-measure FF_c(f) of a feature f associated with a cluster c is defined as the harmonic mean of the Feature Recall and Feature Precision indexes, which are in turn defined as:

FR_c(f) = \frac{\sum_{d \in c} W_d^f}{\sum_{c' \in C} \sum_{d \in c'} W_d^f}, \qquad FP_c(f) = \frac{\sum_{d \in c} W_d^f}{\sum_{f' \in F_c} \sum_{d \in c} W_d^{f'}}

FF_c(f) = \frac{2 \, FR_c(f) \, FP_c(f)}{FR_c(f) + FP_c(f)}

where W_d^f represents the weight of the feature f for the data d and F_c represents the set of features occurring in the data associated with the cluster c.

Given this definition of the feature maximization metric, the feature maximization-based feature selection process can be defined as a parameter-free and class-based process in which a class feature is characterized by both its capacity to discriminate a given class from the others (the FR_c(f) index) and its capacity to accurately represent the class data (the FP_c(f) index). The set S_c of features that are characteristic of a given class c belonging to an overall class set C is obtained as:

S_c = \{ f \in F_c \mid FF_c(f) > \overline{FF}(f) \ \text{and} \ FF_c(f) > \overline{FF}_D \}

where \overline{FF}(f) = \sum_{c \in C_{/f}} FF_c(f) / |C_{/f}| and \overline{FF}_D = \sum_{f \in F} \overline{FF}(f) / |F|,

with C_{/f} representing the restriction of the set C to the classes in which the feature f is present. The features judged relevant for a given class are thus the features whose representation in that class is both better than their average representation in all the classes containing them and better than the average representation of all the features, with regard to the Feature F-measure metric.
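A minimal sketch of this selection rule follows, assuming a toy input where W[d][f] holds the weight of feature f for document d and label[d] gives d's class. The sketch also computes, for each selected feature, the ratio of its Feature F-measure in the class to its average F-measure, which the contrast enhancement step described next relies on.

```python
from collections import defaultdict

def fmax_select(W, label):
    """Feature-maximization selection sketch: returns, per class, the
    selected features mapped to their contrast ratio FF_c(f)/mean FF(f)."""
    classes = set(label.values())
    # Aggregate feature weights per class: wc[c][f] = sum of W_d^f over d in c.
    wc = defaultdict(lambda: defaultdict(float))
    for d, feats in W.items():
        for f, w in feats.items():
            wc[label[d]][f] += w
    tot_f = defaultdict(float)   # weight of f summed over all classes
    tot_c = defaultdict(float)   # weight of all features of class c
    for c in classes:
        for f, w in wc[c].items():
            tot_f[f] += w
            tot_c[c] += w
    # Feature F-measure: harmonic mean of Feature Recall and Precision.
    FF = defaultdict(dict)
    for c in classes:
        for f, w in wc[c].items():
            fr, fp = w / tot_f[f], w / tot_c[c]
            if fr + fp > 0:
                FF[c][f] = 2 * fr * fp / (fr + fp)
    # Average FF of f over the classes containing it, and the global average.
    ff_bar = {f: sum(FF[c][f] for c in classes if f in FF[c]) /
                 sum(1 for c in classes if f in FF[c]) for f in tot_f}
    ff_bar_D = sum(ff_bar.values()) / len(ff_bar)
    # S_c: features beating both averages, with their contrast ratio.
    return {c: {f: FF[c][f] / ff_bar[f] for f in FF[c]
                if FF[c][f] > ff_bar[f] and FF[c][f] > ff_bar_D}
            for c in classes}
```

Note that with the strict inequalities of the rule, a feature occurring in a single class ties with its own average and is not selected by this sketch.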
In the specific framework of the feature maximization process, a contrast enhancement step can be exploited as a complement to the former feature selection step. The role of this step is to fit the description of each data item to the specific characteristics of its associated class, as highlighted by the feature selection step. In the case of our metric, it consists in modifying the weighting scheme of the data, specifically for each class, by taking into consideration the "information gain" provided by the Feature F-measures of the features, locally to that class. For a feature f belonging to the set S_c of selected features of a class c, the gain G_c(f) can be defined as:

G_c(f) = FF_c(f) / \overline{FF}(f)

4 Experiments and results

To perform our experiments we first consider different classification algorithms implemented in the Weka toolkit (http://www.cs.waikato.ac.nz/ml/weka/): the J48 decision tree algorithm [QUI 93], the Random Forest algorithm [BRE 01] (RF), the KNN algorithm [AHA 91], the DMNBtext Bayesian network algorithm [SU 08] (DMT) and the SMO-SVM algorithm [PLA 98] (SMO). Most of these algorithms are general-purpose classification algorithms, except for DMNBtext, which is a Discriminative Multinomial Naive Bayes classifier especially developed for text classification. Compared to the classical Multinomial Naive Bayes classifier, this algorithm combines the computational efficiency of Naive Bayes approaches with the accuracy of discriminative approaches by taking into account both the likelihood and the classification objectives during frequency counting. Other general-purpose algorithms whose accuracy has especially been reported for text classification are SMO and KNN [ZHA 01]. Default parameters are used when executing these algorithms, except for KNN, for which the number of neighbors is optimized based on the resulting accuracy.
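The neighbor-count optimization mentioned for KNN can be sketched as a simple grid search over candidate values of k, keeping the one with the best accuracy on held-out data. The sketch uses hypothetical one-dimensional toy data and does not reproduce the actual Weka setup.

```python
def knn_predict(train, k, x):
    """1-D KNN sketch: the label of x is the majority label among the
    k nearest training points (train is a list of (value, label) pairs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

def tune_k(train, held_out, candidates=(1, 3, 5, 7)):
    """Pick the neighbor count with the best held-out accuracy."""
    def acc(k):
        return sum(knn_predict(train, k, x) == y
                   for x, y in held_out) / len(held_out)
    return max(candidates, key=acc)
```

In the experiments reported below, this tuning is performed on the cross-validated accuracy rather than on a single held-out split.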
We then focus more specifically on testing the efficiency of feature selection approaches, including our new proposal (FMC). We include in our tests a panel of filter approaches which are computationally tractable on high-dimensional data (an exhaustive presentation and comparison of the usual feature selection methods can be found in [FOR 03]), again making use of their Weka toolkit implementations. The panel of tested methods includes: the Chi-square selector [LAD 11], the Information Gain selector [HAL 99b], the CBF subset selector [DAS 03] (CBF), the Symmetrical Uncertainty selector [YUL 03], the ReliefF selector [KON 94] (RF) and the Principal Component Analysis selector [PER 01] (PCA). Default parameters are also used for most of these methods, except for PCA, for which the percentage of explained variance is tuned based on the resulting accuracy. 10-fold cross validation is used in all our experiments.

The different results are reported in Tables 1 to 4 and in Figures 2 and 3. Tables and figures present standard performance measures (True Positive rate (TP) or Precision, False Positive rate (FP), Recall (R), F-measure (F) and ROC), weighted by class sizes and averaged over all classes. For each table and each combination of selection and classification methods, a performance increase indicator is computed using the DMNBtext True Positive results on the original data as the reference. Finally, since the results are identical for Chi-square, Information Gain and Symmetrical Uncertainty, they are reported only once in the tables, as Chi-square results (noted CHI+).

Table 1 highlights that the performance of all classification methods is low on the considered dataset if no feature selection is performed. It also confirms the superiority of the DMNBtext, SMO and KNN methods over the two tree-based methods in this context. Additionally, DMNBtext provides the best overall performance in terms of discrimination, as illustrated by its highest ROC value.
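The 10-fold cross-validation protocol used in these experiments can be sketched as follows. The classifier here is a hypothetical majority-class baseline standing in for the Weka learners, which are not reproduced.

```python
import random

def kfold_accuracy(data, labels, fit, predict, k=10, seed=0):
    """Shuffle the indices, split them into k folds, train on k-1 folds,
    test on the held-out fold, and average the fold accuracies."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        model = fit([data[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, data[i]) == labels[i] for i in fold)
        accs.append(hits / len(fold))
    return sum(accs) / k

# Hypothetical stand-in learner: always predict the majority training class.
fit_majority = lambda X, y: max(set(y), key=y.count)
predict_majority = lambda model, x: model
```

On imbalanced data such as ours, this baseline makes the majority-class attraction effect discussed below easy to see: its averaged accuracy equals the majority-class proportion while every minority class gets a zero TP rate.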
However, as also shown by the confusion matrix of Figure 2, the method is clearly not exploitable in an operational patent evaluation context because of the high resulting confusion between classes. This highlights its intrinsic incapacity to cope with the very strong data attraction effect of the biggest class.

Whenever a usual feature selection process is performed in combination with the best method, namely DMNBtext, the usual feature selection strategies slightly degrade the quality of the results instead of bringing added value, as shown in Table 2. Alternatively, the same table highlights that the feature reduction achieved by the F-max selection method is similar to that of Chi-square, and that its combination with F-max data description contrasting boosts the performance of the classification method (+81%), leading to excellent classification results (accuracy of 0.96) in a very complex classification context.

      TP    FP    P     R     F     ROC    TP Incr.
J48   0.42  0.16  0.40  0.42  0.40  0.63   -23%
RF    0.45  0.23  0.46  0.45  0.38  0.72   -17%
SMO   0.54  0.14  0.53  0.54  0.52  0.80   0%
DMT   0.54  0.15  0.53  0.54  0.50  0.82   0% (Ref)
KNN   0.53  0.16  0.53  0.53  0.51  0.77   -2%

Table 1. Classification results on the initial data.

      TP    FP    R     F     ROC    Nbr Feat.  TP Incr.
CHI+  0.52  0.17  0.52  0.47  0.80   282        -4%
CBF   0.47  0.21  0.47  0.41  0.75   37         -13%
PCA   0.47  0.18  0.47  0.44  0.77   483        -13%
RF    0.52  0.16  0.52  0.48  0.81   937        -4%
FMC   0.96  0.01  0.96  0.96  0.999  262/cl     +81%

Table 2. Classification results after feature selection (DMT classification).

      TP    FP    R     F     ROC    TP Incr.
J48   0.80  0.05  0.80  0.79  0.92   +48%
RF    0.76  0.09  0.76  0.73  0.96   +40%
SMO   0.92  0.03  0.92  0.91  0.98   +70%
DMT   0.96  0.01  0.96  0.96  0.999  +81%
KNN   0.66  0.14  0.66  0.63  0.85   +22%

Table 3. Classification results after FMC feature selection (all classification methods).
Class    Class  Selected  TP rate    TP rate
label    size   features  with FMC   before
a61k31   2533   223       0.999      0.79
a61k33   60     276       0.77       0.02
a61k35   459    262       0.97       0.31
a61k36   212    278       0.89       0.43
a61k38   1110   237       0.99       0.24
a61k39   1141   240       0.99       0.65
a61k41   22     225       0.14       0
a61k45   304    275       0.83       0.09
a61k47   304    278       0.91       0.21
a61k48   140    265       0.76       0.12
a61k49   90     302       0.76       0.26
a61k51   78     251       0.90       0.26
a61k6    47     270       0.55       0.04
a61k8    87     292       0.74       0.02
a61k9    759    250       0.97       0.45

Table 4. Class data and FMC selected features per class.

Table 2 and Figures 2 and 3 illustrate the capability of the F-max approach to efficiently cope with the class imbalance problem. The examination of both the TP rate changes (especially in the small classes) in Table 4 and the confusion matrices of Figures 2 and 3 shows that the data attraction effect of the majority class, which occurs at a high level when the original data are exploited (Figure 2), is almost completely overcome whenever the F-max approach is exploited (Figure 3). The capability of the approach to correct class imbalance is also clearly highlighted by the homogeneous distribution of the selected features over the classes, despite their very different sizes (Table 4).

Figure 2. Confusion matrix of the optimal results before feature selection (DMT classification).

Figure 3. Confusion matrix of the optimal results after FMC feature selection (DMT classification).

5 Conclusion

Feature maximization is an efficient cluster quality metric which favors clusters with maximal feature representation with regard to their associated data. In this paper, we have proposed a specific adaptation of this metric, which has already been exploited by other authors with generic advantages in the framework of unsupervised learning, to the context of supervised classification.
Our main goal was to build an efficient feature selection and feature contrasting model that could overcome the usual problems arising in the supervised classification of large volumes of data, and more especially of large full-text data. These problems relate to class imbalance, high dimensionality, noise, and a high degree of similarity between classes. Through our experiments on a large dataset constituted of bibliographic records extracted from a patent classification, we showed that our approach naturally copes with the said handicaps, significantly enhancing the performance of the classification methods (+81%). Another main advantage of this technique is that it is a parameter-free approach; it can thus be used in a larger scope, such as semi-supervised learning.

References

[AHA 91] Aha, D. & Kibler, D. (1991). Instance-based learning algorithms. Machine Learning, 6:37-66.

[BHA 93] Bhatia, V. K. (1993). Analysing Genre - Language Use in Professional Settings. London, Longman, Applied Linguistics and Language Study Series.

[BRE 01] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5-32.

[COH 05] Cohen, A.M. & Hersh, W.R. (2005). A survey of current work in biomedical text mining. Briefings in Bioinformatics, 6:57-71.

[COR 07] Cormack, G.V. & Lynam, T.R. (2007). Online supervised spam filter evaluation. ACM Transactions on Information Systems, 25(3):11.

[DAS 03] Dash, M. & Liu, H. (2003). Consistency-based search in feature selection. Artificial Intelligence, 151(1):155-176.

[EVA 07] Evans, M., McIntosh, W., Lin, J. & Cates, C. (2007). Recounting the courts? Applying automated content analysis to enhance empirical legal research. Journal of Empirical Legal Studies, 4(4):1007-1039.

[FAL 12] Falk, I., Gardent, C. & Lamirel, J.-C. (2012). Classifying French verbs using French and English lexical resources. Proceedings of ACL 2012, Jeju Island, Korea.

[FOR 03] Forman, G. (2003).
An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3:1289-1305.

[GOO 06] Good, P. (2006). Resampling Methods, 3rd Ed. Birkhauser.

[HAL 99] Hall, M.A. (1999). Correlation-based Feature Selection for Machine Learning. PhD thesis, University of Waikato.

[HAL 99b] Hall, M.A. & Smith, L.A. (1999). Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, pp. 235-239. AAAI Press.

[HIL 07] Hillard, D., Purpura, S. & Wilkerson, J. (2007). An active learning framework for classifying political text. In Annual Meeting of the Midwest Political Science Association, Chicago.

[KON 94] Kononenko, I. (1994). Estimating attributes: analysis and extensions of RELIEF. In European Conference on Machine Learning, pp. 171-182.

[KOS 01] Koster, C.H.A., Seutter, M. & Beney, J. (2001). Classifying patent applications with Winnow. In Proceedings of Benelearn 2001, Antwerpen, Belgium.

[KOS 10] Koster, C.H.A., Seutter, M. & Beney, J. (2010). Multi-classification of patent applications with Winnow. In Ershov Memorial Conference, pp. 546-555.

[LAD 11] Ladha, L. & Deepa, T. (2011). Feature selection methods and algorithms. International Journal on Computer Science and Engineering, 3(5):1787-1797.

[PAN 08] Pang, B. & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135.

[PER 01] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559-572.

[PLA 98] Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning, Schoelkopf, B., Burges, C. & Smola, A., editors. MIT Press.

[QUI 93] Quinlan, R. (1993). C4.5: Programs for Machine Learning.
San Mateo, CA: Morgan Kaufmann.

[SAL 71] Salton, G. (1971). Automatic Processing of Foreign Language Documents. Prentice-Hall, Englewood Cliffs, NJ.

[SAL 88] Salton, G. & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523.

[SCH 94] Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing.

[SCH 98] Schoelkopf, B., Burges, C. & Smola, A.J., editors (1998). Advances in Kernel Methods - Support Vector Learning. MIT Press.

[SU 08] Su, J., Zhang, H., Ling, C. & Matwin, S. (2008). Discriminative parameter learning for Bayesian networks. ICML 2008.

[WAN 98] Wan, H. (1998). An automatic news article filtering engine. PhD thesis, Acadia University, Wolfville, Nova Scotia, Canada.

[YUL 03] Yu, L. & Liu, H. (2003). Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML 2003, pp. 856-863, Washington DC, USA.

[ZHA 01] Zhang, T. & Oles, F.J. (2001). Text categorization based on regularized linear classification methods. Information Retrieval, 4(1):5-31.