Efficient feature selection for complex text

Abstract

This paper deals with a new feature selection and feature contrasting approach for the classification of highly unbalanced textual data with a high degree of similarity between the associated classes. An example of such a classification context is illustrated by the task of classifying bibliographic references into a patent classification scheme. This task represents one of the domains of investigation of the QUAERO project, whose final goal is to help experts evaluate upcoming patents through the use of related research.

1 Introduction

Text categorization is a machine learning task which aims at automatically assigning predefined category labels to new upcoming free-text documents with related characteristics [COH 05]. Because of its numerous applications, like mail or news filtering [COR 07][WAN 08], emotion detection [PAN 08], text genre analysis [BHA 93], or patent categorization [KOS 01][KOS 10], text classification has been one of the most studied branches within the field of machine learning [HIL 07]. However, a number of classification problems raise new challenges in the domain, especially those which imply dealing with imbalanced data and highly similar classes. In the context of text categorization, patent validation assistance belongs to that class of problems. It consists in helping experts in their task of evaluating the novelty of a patent, based on the automatic assignment of the most relevant scientific papers related to the classification codes of the said patent. Since learning is based on citations extracted from the patents, which are usually associated with a hierarchy of classification codes having different levels of generality, first, there is no guarantee of a homogeneous distribution of the citations (i.e. learning samples) among the codes, and second, there is a high chance of finding similar citations in different classes.

We illustrate in this paper that the exploitation of standard strategies for classification and preprocessing leads to non-exploitable results in the above-mentioned context. We thus propose a new preprocessing alternative as well. The remainder of the paper is structured as follows. Section 2 presents the process we used to generate our experimental dataset. Section 3 presents a new feature selection approach suitable for dealing with the unsolved class imbalance and class similarity problems. Section 4 compares the results with and without the use of the proposed approach. Section 5 draws our conclusions and perspectives.

2 Dataset generation

Our main experimental resource comes from the QUAERO1 project. It is a collection of patents related to the pharmacology domain, completed with bibliographic references taken from the Medline2 database. The source data contains 6387 patents in XML format, grouped into 15 subclasses of the A61K class (medical preparation). To obtain the bibliographical references, 25887 citations are first extracted from the patents. Then the Medline database is queried with the extracted citations for related references. The querying results in 7501 bibliographical references3 which are then labeled with the class code of their citing patent.
1 http://www.quaero.org
2 http://www.ncbi.nlm.nih.gov/pubmed/
3 Medline recall is 90% relative to unique references
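For illustration only, the reference-collection step described above can be sketched as follows. The paper does not state how Medline was queried; the sketch assumes the NCBI E-utilities accessed through Biopython, and the citation strings and class codes used below are placeholders rather than the project's actual data.

```python
# Hypothetical sketch of the reference-collection step: resolve citation strings
# against PubMed (Medline) and label each hit with the class code of the citing
# patent. Requires Biopython; the data shown is illustrative only.
from Bio import Entrez

Entrez.email = "someone@example.org"  # required by the NCBI E-utilities

def find_pubmed_id(citation_text):
    """Search PubMed for a citation string; return the first PMID found, or None."""
    handle = Entrez.esearch(db="pubmed", term=citation_text, retmax=1)
    record = Entrez.read(handle)
    handle.close()
    ids = record.get("IdList", [])
    return ids[0] if ids else None

# citations extracted from the patents: (citation text, A61K subclass of the citing patent)
citations = [
    ("Smith J et al. Controlled release formulations ... 1998", "A61K9"),
    ("Doe A et al. Monoclonal antibody adjuvants ... 2001", "A61K39"),
]

# label each resolved reference with the class code(s) of its citing patent(s)
labeled_references = {}
for text, class_code in citations:
    pmid = find_pubmed_id(text)
    if pmid is not None:
        labeled_references.setdefault(pmid, set()).add(class_code)

print(labeled_references)
```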
The set of labeled references represents the final document set on which the training is performed. It is converted to a bag-of-words model [SAL 71] using the TreeTagger tool [SCH 94] developed by the Institute for Computational Linguistics of the University of Stuttgart. In our case, the references are first lemmatized and the tagging process is performed on the lemmatized items (when a word is unknown to the lemmatizer, its original form is kept). The punctuation signs and the numbers identified by the tagger are deleted. Every reference is finally represented as a term vector filled with term frequencies. The description space generated by the tagger has dimensionality 31214. To reduce the generated noise, a frequency threshold of 45 (i.e. an average threshold of 3 per class) is applied to the extracted descriptors. This results in a thresholded description space of dimensionality 1804. Finally, the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme [SAL 88] is applied to the thresholded space to obtain a sparse representation of the data.

Fig. 1. Distribution of data in patent classes (patents, cited articles, descriptors).

Figure 1 highlights the highly imbalanced distribution of patents, extracted references and keywords attached to the references relative to the different class codes. As an example, the smallest class contains only 22 extracted references (A61K41 class) whilst the biggest one has more than 2500 (A61K31 class).

The exploitation of resampling techniques [GOO 06], as well as of standard feature selection techniques [FOR 03], could be envisaged to compensate for the influence of the bigger classes. However, in our context, the ability of such techniques to precisely detect the right class is curtailed by the high class-to-class similarity due to the association of the initial patents with a specialized branch of the patent classification: the inter-class similarity, computed using the cosine correlation between class profiles generated from the descriptors of the extracted bibliographical references, indicates that more than 70% of class pairs have a similarity between 0.5 and 0.9.

As an alternative, we thus propose a new filter approach which relies on the exploitation of a class-based quality measure grounded on the feature maximization metric. This metric has formerly been exploited by Falk et al. in the unsupervised context for clustering French verbs relying on syntactic and semantic features [FAL 12], and the said authors demonstrated both its intrinsic efficiency for the clustering task and its generic advantages for cluster labeling.
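Before detailing the approach, the preprocessing and class-similarity computations of this section can be approximated as follows. This is a minimal scikit-learn sketch, not the paper's exact pipeline: TreeTagger lemmatization is assumed to have been applied upstream, the inputs (lemmatized_texts, labels) are toy placeholders, and class profiles are taken here as mean TF-IDF vectors of each class, which is one possible reading of the paper's "class profiles".

```python
# Rough scikit-learn approximation of the preprocessing and inter-class similarity.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder inputs; in the real setting these would be the 7501 lemmatized
# references and their A61K subclass codes.
lemmatized_texts = ["controlled release tablet formulation",
                    "monoclonal antibody vaccine adjuvant",
                    "controlled release capsule coating"]
labels = ["A61K9", "A61K39", "A61K9"]

min_total_freq = 1  # the paper uses 45 on the full corpus; 1 keeps the toy data alive

counts_vec = CountVectorizer()
X_counts = counts_vec.fit_transform(lemmatized_texts)            # references x terms

# Absolute corpus-frequency threshold on the extracted descriptors.
total_freq = np.asarray(X_counts.sum(axis=0)).ravel()
X_counts = X_counts[:, total_freq >= min_total_freq]

# TF-IDF weighting on the thresholded description space.
X_tfidf = TfidfTransformer().fit_transform(X_counts)

# Class profiles: mean TF-IDF vector of the references labeled with each class.
classes = sorted(set(labels))
labels = np.asarray(labels)
profiles = np.vstack([np.asarray(X_tfidf[labels == c].mean(axis=0)).ravel()
                      for c in classes])

# Inter-class cosine similarities between class profiles.
sim = cosine_similarity(profiles)
pairs = [(classes[i], classes[j], sim[i, j])
         for i in range(len(classes)) for j in range(i + 1, len(classes))]
high = sum(0.5 <= s <= 0.9 for _, _, s in pairs)
print(f"{high}/{len(pairs)} class pairs have cosine similarity between 0.5 and 0.9")
```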
3 Feature maximization based selection

Let us consider a set of clusters C resulting from a clustering method applied on a set of data D represented with a set of descriptive features F. Feature maximization, introduced by Falk et al. in [FAL 12], is a metric which favors clusters with maximum Feature F-measure. The Feature F-measure $FF_c(f)$ of a feature $f$ associated to a cluster $c$ is defined as the harmonic mean of the Feature Recall and Feature Precision indexes, which are in turn defined as:

$FR_c(f) = \dfrac{\sum_{d \in c} W_d^f}{\sum_{c' \in C}\sum_{d \in c'} W_d^f}$,  $FP_c(f) = \dfrac{\sum_{d \in c} W_d^f}{\sum_{f' \in F_c,\, d \in c} W_d^{f'}}$,  $FF_c(f) = 2\,\dfrac{FR_c(f) \times FP_c(f)}{FR_c(f) + FP_c(f)}$,

where $W_d^f$ represents the weight of the feature $f$ for the data $d$ and $F_c$ represents the set of features occurring in the data associated to the cluster $c$.

Taking into consideration the basic definition of the feature maximization metric presented above, the feature maximization-based feature selection process can thus be defined as a parameter-free and class-based process in which a class feature is characterized using both its capacity to discriminate a given class from the others ($FR_c(f)$ index) and its capacity to accurately represent the class data ($FP_c(f)$ index). The set $S_c$ of features that are characteristic of a given class $c$ belonging to an overall class set $C$ results in:

$S_c = \{\, f \in F_c \mid FF_c(f) > \overline{FF}(f) \ \text{and} \ FF_c(f) > \overline{FF}_D \,\}$,

where $\overline{FF}(f) = \dfrac{\sum_{c' \in C/f} FF_{c'}(f)}{|C/f|}$ and $\overline{FF}_D = \dfrac{\sum_{f' \in F} \overline{FF}(f')}{|F|}$,

with $C/f$ representing the restriction of the set $C$ to the classes in which the feature $f$ is present. Features that are judged relevant for a given class are thus the features whose representation is altogether better than their average representation in all the classes containing them, and better than the average representation of all the features, with regard to the Feature F-measure metric.
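As an illustration, the selection rule can be sketched in NumPy as follows. This is a minimal sketch, not the implementation used in the paper: it assumes a dense non-negative weight matrix W (documents by features, e.g. TF-IDF weights) and a label vector y, and the function and variable names are ours.

```python
import numpy as np

def feature_f_measures(W, y):
    """Feature F-measure FF_c(f) for every class c and feature f.

    W: (n_docs, n_features) non-negative weights (e.g. TF-IDF).
    y: (n_docs,) class labels.
    Returns (classes, FF) with FF of shape (n_classes, n_features);
    FF is 0 where a feature is absent from a class.
    """
    classes = np.unique(y)
    # S[c, f] = sum of the weights of feature f over the documents of class c
    S = np.vstack([W[y == c].sum(axis=0) for c in classes])
    # Feature Recall: share of the total weight of f that falls into class c
    FR = S / np.maximum(S.sum(axis=0, keepdims=True), 1e-12)
    # Feature Precision: share of the weight of class c that is carried by f
    FP = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)
    FF = np.where(S > 0, 2 * FR * FP / np.maximum(FR + FP, 1e-12), 0.0)
    return classes, FF

def select_features(W, y):
    """Class-wise feature sets S_c selected with the feature-maximization rule."""
    classes, FF = feature_f_measures(W, y)
    present = FF > 0
    # FF-bar(f): mean FF of a feature over the classes where it is present
    mean_per_feature = FF.sum(axis=0) / np.maximum(present.sum(axis=0), 1)
    # FF-bar_D: mean of the per-feature averages over all observed features
    overall_mean = mean_per_feature[present.any(axis=0)].mean()
    selected = {}
    for i, c in enumerate(classes):
        mask = present[i] & (FF[i] > mean_per_feature) & (FF[i] > overall_mean)
        selected[c] = np.flatnonzero(mask)
    return classes, FF, mean_per_feature, selected
```

On the dataset of Section 2 this kind of rule yields the per-class feature counts reported later in Table 4 (roughly 220 to 300 selected features per class).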
In the specific framework of the feature maximization process, a contrast enhancement step can be exploited as a complement to the former feature selection step. The role of this step is to fit the description of each data item to the specific characteristics of its associated class, which have formerly been highlighted by the feature selection step. In the case of our metric, it consists in modifying the weighting scheme of the data specifically for each class, by taking into consideration the "information gain" provided by the Feature F-measures of the features, locally to that class. For a feature $f$ belonging to the set of selected features $S_c$ of a class $c$, the gain $G_c(f)$ can be defined as:

$G_c(f) = \dfrac{FF_c(f)}{\overline{FF}(f)}$
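A sketch of this contrast step, under the same assumptions and variable names as the selection sketch above, is given below. It only covers the class-specific reweighting of the training data described here; how contrasted descriptions are built for unseen documents is not addressed in this sketch.

```python
import numpy as np

def apply_contrast(W, y, classes, FF, mean_per_feature, selected):
    """Reweight each training document by the gain G_c(f) = FF_c(f) / FF-bar(f)
    of the features selected for its class."""
    W_contrasted = np.array(W, dtype=float)
    for i, c in enumerate(classes):
        feats = selected[c]
        if feats.size == 0:
            continue
        gain = FF[i, feats] / np.maximum(mean_per_feature[feats], 1e-12)
        rows = np.flatnonzero(y == c)
        # scale only the selected features of the documents of class c
        W_contrasted[np.ix_(rows, feats)] *= gain
    return W_contrasted

# Example wiring with the selection sketch above (hypothetical data):
# classes, FF, mean_ff, selected = select_features(W_train, y_train)
# W_train_fmc = apply_contrast(W_train, y_train, classes, FF, mean_ff, selected)
```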
4 Experiments and results

To perform our experiments, we first consider different classification algorithms implemented in the Weka toolkit4: the J48 decision tree algorithm [QUI 93], the Random Forest algorithm [BRE 01] (RF), the KNN algorithm [AHA 91], the DMNBtext Bayesian network algorithm [SU 08] (DMT) and the SMO-SVM algorithm [PLA 98] (SMO). Most of these algorithms are general-purpose classification algorithms, except for DMNBtext, which is a Discriminative Multinomial Naive Bayes classifier especially developed for text classification. As compared to the classical Multinomial Naive Bayes classifier, this algorithm combines the computational efficiency of Naive Bayes approaches with the accuracy of discriminative approaches, by taking into account both the likelihood and the classification objectives during frequency counting. Other general-purpose algorithms whose accuracy has especially been reported for text classification are SMO and KNN [ZHA 01]. Default parameters are used when executing these algorithms, except for KNN, for which the number of neighbors is optimized based on the resulting accuracy.

We then focus more especially on testing the efficiency of feature selection approaches, including our new proposal (FMC). We include in our test a panel of filter approaches which are computationally tractable with high-dimensional data5, again making use of their Weka toolkit implementation. The panel of tested methods includes: the Chi-square selector [LAD 11], the Information Gain selector [HAL 99b], the CBF subset selector [DAS 03] (CBF), the Symmetrical Uncertainty selector [YUL 03], the ReliefF selector [KON 94] (RF) and the Principal Component Analysis selector [PER 01] (PCA). Default parameters are also used for most of these methods, except for PCA, for which the percentage of explained variance is tuned based on the resulting accuracy. 10-fold cross-validation is used in all our experiments.

4 http://www.cs.waikato.ac.nz/ml/weka/
5 An exhaustive presentation and comparison of usual feature selection methods can be found in [FOR 03].
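For readers who want to reproduce the protocol in spirit, the evaluation loop can be mirrored with scikit-learn as sketched below. The estimators are stand-ins for the Weka implementations used here (e.g. MultinomialNB for DMNBtext, LinearSVC for SMO, DecisionTreeClassifier for J48), the data is synthetic, and figures obtained this way are not expected to match the reported results.

```python
# Hedged scikit-learn analogue of the evaluation protocol: 10-fold cross-validation,
# class-size-weighted measures. Replace X and y with the TF-IDF (or FMC-contrasted)
# matrix and the A61K class codes.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import precision_recall_fscore_support
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(300, 50)).astype(float)   # stand-in document-term weights
y = rng.integers(0, 3, size=300)                     # stand-in class labels

classifiers = {
    "J48-like tree": DecisionTreeClassifier(random_state=0),
    "RF":            RandomForestClassifier(random_state=0),
    "KNN":           KNeighborsClassifier(n_neighbors=5),  # k tuned on accuracy in the paper
    "DMT-like NB":   MultinomialNB(),
    "SMO-like SVM":  LinearSVC(),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    y_pred = cross_val_predict(clf, X, y, cv=cv)
    p, r, f, _ = precision_recall_fscore_support(y, y_pred, average="weighted",
                                                 zero_division=0)
    print(f"{name:14s} P={p:.2f} R={r:.2f} F={f:.2f}")
```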
The different results are reported in Tables 1 to 4 and in Figures 2 and 3. Tables and figures present standard performance measures (True Positive rate (TP), Precision (P), False Positive rate (FP), Recall (R), F-measure (F) and ROC area) weighted by class sizes and averaged over all classes. For each table, and for each combination of selection and classification methods, a performance increase indicator is computed using the DMNBtext True Positive results on the original data as the reference. Finally, since the results are identical for Chi-square, Information Gain and Symmetrical Uncertainty, they are reported only once in the tables, as Chi-square results (noted CHI+).

Table 1 highlights that the performance of all classification methods is low on the considered dataset if no feature selection process is performed. It also confirms the superiority of the DMNBtext, SMO and KNN methods over the two tree-based methods in that context. Additionally, DMNBtext provides the best overall performance in terms of discrimination, as illustrated by its highest ROC value. However, as also shown by the confusion matrix of Figure 2, the method is clearly not exploitable in an operational patent evaluation context because of the high resulting confusion between classes. This highlights its intrinsic incapacity to cope with the very strong data attraction effect of the biggest class.

Whenever a usual feature selection process is performed in combination with the best method, that is the DMNBtext method, the exploitation of the usual feature selection strategies slightly degrades the quality of the results instead of bringing added value, as shown in Table 2. Alternatively, the same table highlights that the feature reduction achieved by the F-max selection method is similar to that of Chi-square, and that its combination with F-max data description contrasting boosts the performance of the classification method (+81%), leading to excellent classification results (accuracy of 0.96) in a very complex classification context.

       TP    FP    P     R     F     ROC    TP Incr.
J48    0.42  0.16  0.40  0.42  0.40  0.63   -23%
RF     0.45  0.23  0.46  0.45  0.38  0.72   -17%
SMO    0.54  0.14  0.53  0.54  0.52  0.80   0%
DMT    0.54  0.15  0.53  0.54  0.50  0.82   0% (Ref)
KNN    0.53  0.16  0.53  0.53  0.51  0.77   -2%
Table 1. Classification results on initial data.

       TP    FP    R     F     ROC    Nbr Feat.  TP Incr.
CHI+   0.52  0.17  0.52  0.47  0.80   282        -4%
CBF    0.47  0.21  0.47  0.41  0.75   37         -13%
PCA    0.47  0.18  0.47  0.44  0.77   483        -13%
RF     0.52  0.16  0.52  0.48  0.81   937        -4%
FMC    0.96  0.01  0.96  0.96  0.999  262/cl     +81%
Table 2. Classification results after feature selection (DMT classification).

       TP    FP    R     F     ROC    TP Incr.
J48    0.80  0.05  0.80  0.79  0.92   +48%
RF     0.76  0.09  0.76  0.73  0.96   +40%
SMO    0.92  0.03  0.92  0.91  0.98   +70%
DMT    0.96  0.01  0.96  0.96  0.999  +81%
KNN    0.66  0.14  0.66  0.63  0.85   +22%
Table 3. Classification results after FMC feature selection (all classification methods).

Class label  Class size  Selected features  TP rate with FMC  TP rate before
a61k31       2533        223                0.999             0.79
a61k33       60          276                0.77              0.02
a61k35       459         262                0.97              0.31
a61k36       212         278                0.89              0.43
a61k38       1110        237                0.99              0.24
a61k39       1141        240                0.99              0.65
a61k41       22          225                0.14              0
a61k45       304         275                0.83              0.09
a61k47       304         278                0.91              0.21
a61k48       140         265                0.76              0.12
a61k49       90          302                0.76              0.26
a61k51       78          251                0.90              0.26
a61k6        47          270                0.55              0.04
a61k8        87          292                0.74              0.02
a61k9        759         250                0.97              0.45
Table 4. Class data and FMC selected features per class.

Tables 2 and 4 and Figures 2 and 3 illustrate the capability of the F-max approach to efficiently cope with the class imbalance problem. Hence, the examination of both the TP rate changes (especially in the small classes) in Table 4 and the confusion matrices of Figures 2 and 3 shows that the data attraction effect of the majority class, which occurs at a high level when the original data are exploited (Figure 2), is almost completely overcome whenever the F-max approach is exploited (Figure 3). The capability of the approach to correct class imbalance is also clearly highlighted by the homogeneous distribution of the selected features among the classes, despite their very different sizes (Table 4).

Figure 2. Confusion matrix of the optimal results before feature selection (DMT classification).

Figure 3. Confusion matrix of the optimal results after FMC feature selection (DMT classification).

5 Conclusion

Feature maximization is an efficient cluster quality metric which favors clusters with maximum feature representation with regard to their associated data. In this paper, we have proposed a specific adaptation of this metric, which has already been exploited by other authors with generic advantages in the framework of unsupervised learning, to the context of supervised classification. Our main goal was to build an efficient feature selection and feature contrasting model that could overcome the usual problems arising in the supervised classification of large volumes of data, and more especially of large full-text data.
These problems relate to class imbalance, high dimensionality, noise, and a high degree of similarity between classes. Through our experiments on a large dataset constituted of bibliographical records extracted from a patent classification, we have more especially shown that our approach can naturally cope with the said handicaps, significantly enhancing the performance of classification methods (+80%). Another main advantage of this technique is that it is a parameter-free approach; it can thus be used in a larger scope, such as semi-supervised learning.
References

[AHA 91] Aha, D. & Kibler, D. (1991). Instance-based learning algorithms. Machine Learning, 6:37-66.
[BHA 93] Bhatia, V. K. (1993). Analysing Genre - Language Use in Professional Settings. London, Longman, Applied Linguistics and Language Study Series.
[BRE 01] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5-32.
[COH 05] Cohen, A.M. & Hersh, W.R. (2005). A survey of current work in biomedical text mining. Briefings in Bioinformatics, 6:57-71.
[COR 07] Cormack, G.V. & Lynam, T.R. (2007). Online supervised spam filter evaluation. ACM Transactions on Information Systems, 25(3):11.
[DAS 03] Dash, M. & Liu, H. (2003). Consistency-based search in feature selection. Artificial Intelligence, 151(1):155-176.
[EVA 07] Evans, M., McIntosh, W., Lin, J. & Cates, C. (2007). Recounting the courts? Applying automated content analysis to enhance empirical legal research. Journal of Empirical Legal Studies, 4(4):1007-1039.
[FAL 12] Falk, I., Gardent, C. & Lamirel, J.-C. (2012). Classifying French verbs using French and English lexical resources. Proceedings of ACL 2012. Jeju Island, Korea.
[FOR 03] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3:1289-1305.
[GOO 06] Good, P. (2006). Resampling Methods. 3rd Ed. Birkhauser.
[HAL 99] Hall, M.A. (1999). Correlation-based Feature Selection for Machine Learning. PhD thesis, University of Waikato.
[HAL 99b] Hall, M.A. & Smith, L.A. (1999). Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In: Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, pp. 235-239. AAAI Press.
[HIL 07] Hillard, D., Purpura, S. & Wilkerson, J. (2007). An active learning framework for classifying political text. In: Annual Meeting of the Midwest Political Science Association. Chicago.
[KON 94] Kononenko, I. (1994). Estimating attributes: analysis and extensions of RELIEF. In: European Conference on Machine Learning, pp. 171-182.
[KOS 01] Koster, C.H.A., Seutter, M. & Beney, J. (2001). Classifying patent applications with Winnow. In: Proceedings of Benelearn 2001. Antwerpen, Belgium.
[KOS 10] Koster, C.H.A., Seutter, M. & Beney, J. (2010). Multi-classification of patent applications with Winnow. In: Ershov Memorial Conference, pp. 546-555.
[LAD 11] Ladha, L. & Deepa, T. (2011). Feature selection methods and algorithms. International Journal on Computer Science and Engineering, 3(5):1787-1797.
[PAN 08] Pang, B. & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135.
[PER 01] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11):559-572.
[PLA 98] Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods - Support Vector Learning, Schoelkopf, B., Burges, C. & Smola, A., editors. MIT Press.
[QUI 93] Quinlan, R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
[SAL 71] Salton, G. (1971). Automatic Processing of Foreign Language Documents. Prentice-Hall, Englewood Cliffs, NJ.
[SAL 88] Salton, G. & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523.
[SCH 94] Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing.
[SCH 98] Schoelkopf, B., Burges, C. & Smola, A.J., editors (1998). Advances in Kernel Methods - Support Vector Learning. MIT Press.
[SU 08] Su, J., Zhang, H., Ling, C. & Matwin, S. (2008). Discriminative parameter learning for Bayesian networks. ICML 2008.
[WAN 98] Wan, H. (1998). An automatic news article filtering engine. PhD thesis, Acadia University, Wolfville, Nova Scotia, Canada.
[YUL 03] Yu, L. & Liu, H. (2003). Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML 2003, pp. 856-863, August 21-24, 2003. Washington DC, USA.
[ZHA 01] Zhang, T. & Oles, F.J. (2001). Text categorization based on regularized linear classification methods. Information Retrieval, 4(1):5-31.