A new feature selection and feature contrasting approach based on quality metric: application to efficient classification of complex textual data

Feature maximization is a cluster quality metric which favors clusters with maximum feature representation with regard to their associated data. This metric has already been successfully exploited for defining unbiased clustering quality indexes, for efficient cluster labeling, and as a substitute for distance in the clustering process, as in the IGNGF incremental clustering method. In this paper we go one step further, showing that a straightforward adaptation of this metric can provide a highly efficient feature selection and feature contrasting model in the context of supervised classification. More specifically, we show that this technique can enhance the performance of classification methods whilst very significantly outperforming (+80%) the state-of-the-art variable selection techniques in the case of the classification of unbalanced, highly multidimensional and noisy textual data, with a high degree of similarity between the classes. Our experimental dataset is a reference dataset of 7000 publications related to patent classes, issued from a reference classification in the domain of pharmacology.


Introduction
Since the 1990s, advances in computing and storage capacity have allowed the manipulation of very large data. Whether in bioinformatics or in text mining, it is not uncommon to have a description space of several thousand or even tens of thousands of variables. One might think that classification algorithms are more efficient when a large number of variables is available. However, the situation is not as simple as this. The first problem that arises is the increase in computation time. Moreover, the fact that a significant number of variables are redundant or irrelevant to the classification task significantly perturbs the operation of the classifiers. In addition, since most learning algorithms exploit probabilities, probability distributions can be difficult to estimate in the presence of a very high number of variables. The integration of a variable selection process in the framework of the classification of high dimensional data thus becomes a central challenge.
In the literature, three main types of approaches for variable selection are proposed: the integrated (embedded) approaches, the "wrapper" methods and the filter approaches. An exhaustive overview of the state-of-the-art techniques in this domain has been provided by many authors, such as Ladha.
The integrated (embedded) approaches incorporate the selection of the variables in the learning process [BRE 84]. The most popular methods of this category are the SVM-based methods and the neural-based methods. SVM-RFE (Recursive Feature Elimination for Support Vector Machines) [GUY 02] is an integrated process that performs the selection of variables on an iterative basis using an SVM classifier. The process starts with the complete feature set and removes the variables given as the least important by the SVM. The original version uses a linear kernel. However, some extensions using non-linear kernels have been proposed to consider potential nonlinear dependencies between variables. Alternatively, the basic idea of the approaches of the FS-P (Feature Selection-Perceptron) family is to perform supervised learning based on a perceptron neural model and to exploit the resulting interconnection weights between neurons as indicators of the features that may be relevant, which provides a ranking [MEJ 06].
For their part, "wrapper" methods explicitly use a performance criterion to search for a subset of relevant predictors. Most often this criterion is the error rate (but it can also be a prediction cost or the area under the ROC curve). As an example, the WrapperSubsetEval method evaluates attribute sets using a learning approach. Cross-validation is used to estimate the accuracy of the learning for a given set of attributes. The algorithm starts with the empty set of attributes and continues until adding attributes no longer improves performance [WIT 05].
Forman presents a remarkable comparison of methods in [FOR 03]. Like other similar works, this comparison clearly highlights that, regardless of their efficiency, one of the main drawbacks of the embedded and wrapper methods is that they are very computationally intensive. This prohibits their use in the case of a highly multidimensional data description space. A potential alternative is thus to exploit filter-based methods in such a context.
Filter approaches are selection procedures that are used prior to, and independently of, the learning algorithm. They are based on statistical tests. They are thus lighter in terms of computation time than the other approaches, and the obtained features can generally be ranked according to the results of the tests.
The Chi-square method exploits a usual statistical test that measures the discrepancy from the distribution expected under the assumption that a variable is independent of a class label. Like any statistical test, it is known to behave erratically for very low expected frequencies (which is a common case in text classification) [LAD 11].
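As an illustration of the test described above, the following minimal sketch computes the Chi-square statistic of a term against one class from a 2x2 contingency table of document counts. The function name `chi_square` and the count names are hypothetical, and the one-class-versus-rest layout is an assumption made for the example. The smallest expected count is returned as well, since very low values are the situation in which the statistic becomes unreliable.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class contingency table.

    n11: documents of the class that contain the term
    n10: documents of the other classes that contain the term
    n01: documents of the class that do not contain the term
    n00: documents of the other classes that do not contain the term
    Returns (statistic, smallest expected count).
    """
    n = n11 + n10 + n01 + n00
    rows = (n11 + n10, n01 + n00)   # term present / term absent
    cols = (n11 + n01, n10 + n00)   # in the class / in the other classes
    observed = ((n11, n10), (n01, n00))
    stat = 0.0
    min_expected = float("inf")
    for i in range(2):
        for j in range(2):
            # Count expected if term and class were independent.
            expected = rows[i] * cols[j] / n
            min_expected = min(min_expected, expected)
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat, min_expected


# A term found once, in a class of 22 documents out of 7000 (illustrative counts).
stat, min_expected = chi_square(1, 0, 21, 6978)
```

In this illustrative call the smallest expected count is far below 1, which is the kind of low-frequency case the text warns about.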
Information gain is also one of the most common methods for evaluating attributes. This univariate filter provides an ordered ranking of all variables. With this approach, the selected variables are those which obtain a positive value of information gain [HAL 99b].
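The information gain filter described above can be sketched as follows for a binary "term present / term absent" variable. This is a minimal illustration, assuming per-class document counts as input; the function names `entropy` and `information_gain` are hypothetical.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(present, absent):
    """Information gain of a binary term with respect to the class.

    present[k]: number of documents of class k that contain the term
    absent[k]:  number of documents of class k that do not contain it
    """
    class_counts = [p + a for p, a in zip(present, absent)]
    n = sum(class_counts)
    gain = entropy(class_counts)
    # Subtract the class entropy remaining inside each part of the split.
    for part in (present, absent):
        size = sum(part)
        if size > 0:
            gain -= size / n * entropy(part)
    return gain
```

A term that perfectly separates two equally sized classes obtains a gain of one bit, whereas a term distributed identically in both classes obtains a gain of zero and would not be selected.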
In the MIFS (Mutual Information Feature Selection) method, a variable X is added to the subset M (of cardinality m) of already selected variables if it maximizes the quantity:
$$ I(Y, X) - \frac{1}{m} \sum_{X_j \in M} I(X, X_j) $$
where I denotes the mutual information and Y the target. Thus, a variable is considered to be interesting if its link with the target Y surpasses its average connection with the already selected predictors. The method takes into account both relevance and redundancy. The search stops when the best candidate variable X is such that I(Y, X) no longer surpasses this average connection.
In this paper, we show that, despite their diversity, all the existing filter-based approaches fail to successfully solve the variable selection task when they are faced with highly unbalanced, highly multidimensional and noisy textual data, with a high degree of similarity between the classes. We thus propose a new filter-based variable selection approach which relies on the exploitation of a class quality measure based on a specific feature maximization metric. This metric has already demonstrated high potential in the framework of unsupervised learning.
The paper is structured as follows. The first section presents the feature maximization principle along with the newly proposed technique. The second section describes our experiment and our dataset, which is a reference dataset of 7000 publications related to patent classes issued from a reference classification in the domain of pharmacology. The last section draws our conclusion and our perspectives.

Feature maximization metric principles in unsupervised learning
Feature maximization is an unbiased cluster quality metric that exploits the properties of the data associated with each cluster without prior consideration of cluster profiles. This metric was initially proposed in [LAM 04]. Its main advantage is to be independent both of the clustering methods and of their operating mode. When it is used during the clustering process, it can substitute for distance [LAM 11b]. In a complementary way, when it is used after learning, it can be exploited to set up overall clustering quality indexes [LAM 10][GHR 10] or for cluster labeling [LAM 08].
Let us consider a set of clusters C resulting from a clustering method applied to a set of data D represented with a set of descriptive features F. Feature maximization is a metric which favors clusters with maximum Feature F-measure. The Feature F-measure $FF_c(f)$ of a feature f associated with a cluster c is defined as the harmonic mean of the Feature Recall $FR_c(f)$ and Feature Precision $FP_c(f)$ indexes, which in turn are defined as:
$$ FR_c(f) = \frac{\sum_{d \in c} W_d^f}{\sum_{c' \in C} \sum_{d \in c'} W_d^f}, \qquad FP_c(f) = \frac{\sum_{d \in c} W_d^f}{\sum_{f' \in F_c,\, d \in c} W_d^{f'}} $$
where $W_d^f$ represents the weight of the feature f for data d and $F_c$ represents the set of features occurring in the data associated with the cluster c.
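The definitions above can be sketched in a few lines of code. This is a minimal illustration rather than a reference implementation: it assumes that each data point is given as a dictionary mapping features to non-negative weights, and the function name `feature_f_measures` is hypothetical.

```python
from collections import defaultdict


def feature_f_measures(data, labels):
    """Return {cluster: {feature: (FR, FP, FF)}}.

    data:   list of {feature: weight} dictionaries, one per data point
    labels: cluster (or class) label of each data point
    """
    # weight[c][f] = sum of the weights of feature f over the data of cluster c
    weight = defaultdict(lambda: defaultdict(float))
    for d, c in zip(data, labels):
        for f, w in d.items():
            weight[c][f] += w

    # feature_total[f] = sum of the weights of feature f over all clusters
    feature_total = defaultdict(float)
    for c in weight:
        for f, w in weight[c].items():
            feature_total[f] += w

    result = {}
    for c in weight:
        # Sum of the weights of all features occurring in cluster c.
        cluster_total = sum(weight[c].values())
        result[c] = {}
        for f, w in weight[c].items():
            if w <= 0:
                continue
            fr = w / feature_total[f]   # Feature Recall
            fp = w / cluster_total      # Feature Precision
            result[c][f] = (fr, fp, 2 * fr * fp / (fr + fp))
    return result


data = [{"a": 2, "b": 1}, {"a": 1}, {"b": 3, "c": 1}]
labels = ["c1", "c1", "c2"]
scores = feature_f_measures(data, labels)
```

In this toy example, feature "a" only occurs in cluster "c1", so its Feature Recall there is 1, while feature "b" is shared by both clusters and obtains a higher Feature F-measure in "c2".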
An important application of the feature maximization metric is related to the estimation of the overall clustering quality. For that purpose, averaged Macro-Recall (MR) and Macro-Precision (MP) indexes can be directly derived from the former indexes. Macro-Recall and Macro-Precision indexes have opposite behaviors according to the number of clusters. Thus, these indexes make it possible to estimate, in a global way, an optimal number of clusters for a given method and a given dataset. The best data partition, or clustering result, is in this case the one which minimizes the difference between their values [LAM 04]. Unlike classical distance-based indexes, they are independent of the clustering process. Moreover, it has been demonstrated in [LAM 11] that straightforward adaptations of these indexes make it possible to detect degenerated clustering results, i.e. those that jointly include a small number of large heterogeneous or "garbage" clusters and a large number of very small "chunk" clusters.
With regard to the cluster labeling approach based on feature maximization, a feature is said to be maximal or prevalent for a given cluster iff its Feature F-measure is higher for that cluster than for any other cluster. Thus the set $L_c$ of prevalent features of a cluster c can be defined as:
$$ L_c = \{ f \in F_c \mid FF_c(f) = \max_{c' \in C} FF_{c'}(f) \} $$
where $FF_c(f)$ denotes the Feature F-measure of the feature f in the cluster c. When exploited in combination with hypertree representation, this technique has shown promising results, as compared to state-of-the-art labeling techniques such as Chi-square labeling, for synthesizing complex clustering output issued from the management of highly multidimensional data [LAM 08]. Additionally, the combination of this technique with unsupervised Bayesian reasoning resulted in the proposal of the first parameter-free, fully unsupervised approach for analyzing textual information evolving over time [LAM 10b].
Exhaustive experiments on large reference datasets of bibliographic records have shown that the approach is reliable and likely to produce accurate and meaningful results for diachronic scientometrics studies [LAM 12].
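The prevalent-feature labeling rule described above can be illustrated with a short sketch. It assumes that the Feature F-measures have already been computed and are given as a nested dictionary; the function name `prevalent_features` is hypothetical, and the handling of ties (the first cluster encountered keeps the feature) is an assumption of this example, not something specified in the text.

```python
def prevalent_features(ff):
    """Assign each feature to the cluster where its Feature F-measure is maximal.

    ff[c][f]: Feature F-measure of feature f in cluster c
    Returns {cluster: set of prevalent features}.
    """
    best = {}  # feature -> (best F-measure seen so far, cluster)
    for c, values in ff.items():
        for f, v in values.items():
            # Strict comparison: on a tie the first cluster keeps the feature.
            if f not in best or v > best[f][0]:
                best[f] = (v, c)
    labels = {c: set() for c in ff}
    for f, (_, c) in best.items():
        labels[c].add(f)
    return labels


ff = {"c1": {"a": 0.86, "b": 0.25}, "c2": {"b": 0.75, "c": 0.40}}
labels = prevalent_features(ff)
```

Here "b" occurs in both clusters but is prevalent only for "c2", where its Feature F-measure is the highest.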
Last but not least, a central application of the feature maximization metric is related to incremental clustering. The IGNGF (Incremental Neural Gas with Feature Maximization) clustering method is a neural-based, parameter-free incremental clustering algorithm that substitutes feature maximization for the usual distance in the clustering process. The IGNGF clustering process is roughly the following. During learning, an incoming data point d is temporarily added to every existing cluster; the feature profile of the cluster is updated (i.e. each cluster is associated with its maximal features) and its average Feature F-measure is computed. The winning cluster is then the cluster which maximizes the Kappa criterion given by:
$$ \kappa(c) = \Delta(\overline{FF}_c) \times |F_c \cap F_d| $$
where $\Delta(\overline{FF}_c)$ represents the gain in Feature F-measure for the new cluster and $F_c \cap F_d$ are the features shared by cluster c and the data point d. In this way, clusters which share more features with the new data point are preferred, and clusters which do not have any feature in common with the data point are ignored. The gain in Feature F-measure multiplied by the number of shared features can be optionally adjusted by the Euclidean distance of the new data point d to the cluster centroid vector. Clusters with a negative Kappa score are ignored. The data point is then added to the cluster c with maximal Kappa, and the Hebbian connections between the winner and its neighbors are updated. If no such cluster is found, a new cluster is created.
The IGNGF method was shown to outperform other usual neural and non-neural methods for clustering tasks on relatively clean data, especially if the data are sparse and/or highly multidimensional [LAM 11]. The first applications of the IGNGF method to the clustering of textual data revealed very promising results. In particular, this method was exploited for the automatic classification of French verbs using syntactic and semantic features issued from several reference lexicons. The method showed significantly better performance (+20%) than the best state-of-the-art methods of the field, including the reference methods based on spectral clustering [FAL 12]. In the context of website classification, it has also been shown that the IGNGF method made it possible to obtain, in an unsupervised way, better results (in terms of sensitivity and purity) than those provided by supervised methods, by automatically isolating latent classes that were not originally labeled [LAM 12b].

Adaptation of feature maximization metric for feature selection in supervised learning
Taking into consideration the basic definition of the feature maximization metric presented in the former section, its exploitation for the task of feature selection in the context of supervised learning becomes a straightforward process, since this generic metric can apply to data associated with a class as well as to data associated with a cluster. The feature maximization-based selection process can thus be defined as a class-based process in which a class feature is characterized using both its capacity to discriminate a given class from the others ($FP_c(f)$ index) and its capacity to accurately represent the class data ($FR_c(f)$ index). Finally, the set of all the selected features $S_C$ is the subset of F defined as:
$$ S_C = \bigcup_{c \in C} S_c $$
where $S_c$ is the set of features that are characteristic of the class c. Features that are judged relevant for a given class are the features whose representation is both better than their average representation in all the classes including those features and better than the average representation of all the features, with regard to the Feature F-measure metric.
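The selection rule stated above can be sketched as follows. This is a minimal illustration under stated assumptions: the Feature F-measures are supplied as a nested dictionary, "better than" is read as a strict inequality, and the names `select_features`, `mean_f` and `mean_all` are hypothetical.

```python
def select_features(ff):
    """Feature maximization-based selection.

    ff[c][f]: Feature F-measure of feature f in class c
    Returns (selected, mean_f, mean_all) where selected[c] is the set of
    features judged relevant for class c.
    """
    # Average F-measure of each feature over the classes in which it occurs.
    per_feature = {}
    for values in ff.values():
        for f, v in values.items():
            per_feature.setdefault(f, []).append(v)
    mean_f = {f: sum(v) / len(v) for f, v in per_feature.items()}

    # Average, over all features, of the per-feature averages.
    mean_all = sum(mean_f.values()) / len(mean_f)

    selected = {
        c: {f for f, v in values.items() if v > mean_f[f] and v > mean_all}
        for c, values in ff.items()
    }
    return selected, mean_f, mean_all


ff = {"c1": {"a": 0.8, "b": 0.2, "c": 0.3},
      "c2": {"a": 0.2, "b": 0.7, "c": 0.3}}
selected, mean_f, mean_all = select_features(ff)
all_selected = set().union(*selected.values())   # the overall set of selected features
```

In this toy example "a" is kept for "c1" and "b" for "c2", while "c", which is equally represented in both classes, is discarded. With the strict reading used here, a feature occurring in a single class equals its own average and is therefore not selected; whether the original method treats that case differently is not stated in the text.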
In the specific framework of the feature maximization process, a contrast enhancement step can be exploited in complement to the former feature selection step. The role of this step is to fit the description of each data to the specific characteristics of its associated class, which have been formerly highlighted by the feature selection step. In the case of our metric, it consists in modifying the weighting scheme of the data specifically for each class by taking into consideration the information gain provided by the Feature F-measures of the features, locally to that class. This step more precisely operates as described in Algorithm 1.
With the former strategy, the information gain provided by a feature in a given class is proportional to the ratio between the value of the Feature F-measure of this feature in the class and the average value of the Feature F-measure of the said feature over the whole partition. For a given data and a given feature describing this data, the resulting gain acts as a contrast weight that multiplies any existing feature weight issued from data preprocessing. Moreover, each data description can optionally be reduced to the features which are characteristic of its associated class. If a normalization of the data description is present, it is discarded by those operations. Optional renormalization can also be performed in the course of the algorithm.
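Since the listing of Algorithm 1 is not reproduced in this text, the following sketch is reconstructed from the prose description only. It assumes that the gain is the plain ratio between the Feature F-measure of a feature in the class and its average over the classes where it occurs (proportionality constant of 1, no further scaling), and that features without an F-measure keep their weight. The name `contrast` and its parameters are hypothetical.

```python
def contrast(data, labels, ff, keep=None):
    """Multiply each feature weight by a class-specific contrast gain.

    data:   list of {feature: weight} dictionaries
    labels: class of each data item
    ff:     ff[c][f] = Feature F-measure of feature f in class c
    keep:   optional {class: set of features}; when given, each description
            is reduced to the features listed for its class
    """
    # Average F-measure of each feature over the classes where it occurs.
    per_feature = {}
    for values in ff.values():
        for f, v in values.items():
            per_feature.setdefault(f, []).append(v)
    mean_f = {f: sum(v) / len(v) for f, v in per_feature.items()}

    contrasted = []
    for d, c in zip(data, labels):
        new_d = {}
        for f, w in d.items():
            if keep is not None and f not in keep.get(c, set()):
                continue  # optional reduction to the features of the class
            if f in ff.get(c, {}) and mean_f[f] > 0:
                gain = ff[c][f] / mean_f[f]
            else:
                gain = 1.0  # assumption: weight left unchanged
            new_d[f] = w * gain
        contrasted.append(new_d)
    return contrasted


ff = {"c1": {"a": 0.8, "b": 0.2}, "c2": {"a": 0.2, "b": 0.7}}
contrasted = contrast([{"a": 1.0, "b": 2.0}], ["c1"], ff)
```

In this example the weight of "a" is amplified for the item of class "c1" (gain above 1) whereas the weight of "b" is attenuated (gain below 1). The sketch does not renormalize the resulting vectors.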

Data extraction and preprocessing
The data is a collection of patent documents related to the pharmacology domain. The bibliographic citations in the patents are extracted from the Medline database. The source data contains 6387 patents in XML format, grouped into 15 subclasses of the A61K class (medical preparation). 25887 citations have been extracted from these 6387 patents [HAL 12]. The Medline database is then queried with the extracted citations for related scientific articles. The querying gives 7501 articles with 90% recall. Each article is then labeled with the class code of the citing patent. The set of labeled articles represents the final document set on which the training is performed. The final document set is unbalanced, with the smallest class containing 22 articles (A61K41 class) and the largest class containing 2500 articles (A61K31 class). Inter-class similarity computed using cosine correlation indicates that more than 70% of the class pairs have a similarity between 0.5 and 0.9. The ability of any classification model to precisely detect the right class is thus curtailed. A common solution to deal with imbalance in a dataset is to undersample majority classes and oversample minority classes. However, sampling, which introduces redundancy in the dataset, does not improve the performance on this dataset, as has been shown in [HAL 12]. Bootstrapping of training and test data may thus solve problems of classification sensitivity, stability, scalability and dimensionality, but it does not improve accuracy computation over the sampled correlations. Conversely, pruning irrelevant features and contrasting the relevant ones, as we propose hereafter, thus seems to be a good alternative.
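The inter-class similarity figure quoted above can be illustrated with a small sketch. How class profiles were built is not stated in the text; the example assumes that a class profile is the sum of the term vectors of its documents, and the function names `cosine`, `class_profiles` and `share_of_similar_pairs` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse {feature: weight} vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_profiles(data, labels):
    """Sum the document vectors of each class (assumed profile definition)."""
    profiles = {}
    for d, c in zip(data, labels):
        profile = profiles.setdefault(c, {})
        for f, w in d.items():
            profile[f] = profile.get(f, 0.0) + w
    return profiles


def share_of_similar_pairs(profiles, low=0.5, high=0.9):
    """Fraction of class pairs whose cosine similarity lies in [low, high].

    Requires at least two classes.
    """
    pairs = list(combinations(sorted(profiles), 2))
    hits = sum(1 for a, b in pairs
               if low <= cosine(profiles[a], profiles[b]) <= high)
    return hits / len(pairs)
```

Applied to the 15 class profiles of the collection, a function of this kind would return the proportion of the 105 class pairs falling in the chosen similarity range.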
The document set is converted to a bag-of-words model [SAL 71] using the TreeTagger tool [SCH 94] developed by the Institute for Computational Linguistics of the University of Stuttgart. This tool is both a lemmatizer and a tagger. A lemmatizer associates a lemma, or syntactic root, with each word in the text, and a tagger automatically annotates text with morpho-syntactic information. In our case, the documents are first lemmatized and the tagging process is performed on the lemmatized items (when a word is unknown to the lemmatizer, its original form is kept). The punctuation signs and the numbers identified by the tagger are deleted. Feature selection according to grammatical categories makes it possible to identify salient features for the classification of documents according to document types or opinions.
Every document is represented as a term vector filled with keyword frequencies. The description space generated by the tagger has dimensionality 31214. To reduce the noise generated by the TreeTagger tool, a frequency threshold of 45 (i.e. an average threshold of 3 per class) is applied to the extracted descriptors. This results in a thresholded description space of dimensionality 1804. The whole text collection is then represented as a J x (N+1) matrix, where J is the number of articles in the collection and N is the dimensionality of the description space. Each line j of this matrix is an N-dimensional bag-of-words vector for article j, plus its class label. The Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme [SAL 88] gives a sparse matrix representation of the text collection.
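The thresholding and weighting steps described above can be sketched as follows. Two points are assumptions of this example, since the text does not specify them: the threshold is applied to the total frequency of a term in the collection, and the TF-IDF variant is raw term frequency multiplied by log(J / document frequency). The function name `threshold_and_tfidf` is hypothetical.

```python
from math import log


def threshold_and_tfidf(docs, min_freq):
    """Filter rare terms, then apply a TF-IDF weighting.

    docs:     list of {term: raw frequency} dictionaries, one per document
    min_freq: minimal total frequency in the collection for a term to be kept
    Returns a list of sparse {term: weight} dictionaries.
    """
    total = {}      # total frequency of each term in the collection
    doc_freq = {}   # number of documents containing each term
    for d in docs:
        for t, n in d.items():
            total[t] = total.get(t, 0) + n
            doc_freq[t] = doc_freq.get(t, 0) + 1

    kept = {t for t, n in total.items() if n >= min_freq}
    n_docs = len(docs)
    return [
        {t: n * log(n_docs / doc_freq[t]) for t, n in d.items() if t in kept}
        for d in docs
    ]


docs = [{"x": 3, "y": 1}, {"x": 2}, {"z": 1}]
weighted = threshold_and_tfidf(docs, min_freq=2)
```

In this toy collection only "x" passes the threshold, so the third document ends up with an empty description. With the figures of the paper, the call would use `min_freq=45`.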

Testing process
To perform our experiments, we first take into consideration different classification algorithms implemented in the Weka toolkit. Most of these algorithms are general-purpose classification algorithms, except for DMNBtext, which is a Discriminative Multinomial Naïve Bayes classifier especially developed for text classification. As compared to the classical Multinomial Naïve Bayes classifier, this algorithm combines the computational efficiency of Naïve Bayes approaches and the accuracy of discriminative approaches by taking into account both the likelihood and the classification objectives during the frequency counting. Other general-purpose algorithms whose accuracy has especially been reported for text classification are SMO and KNN [ZHA 02]. Default parameters are used when executing these algorithms, except for KNN, for which the number of neighbors is optimized.
Default parameters are also used for most of the feature selection methods, except for PCA, for which the percentage of explained variance is tuned for optimization.
We first experiment with the methods separately. In a second phase, we combine the feature selection provided by each method with the feature contrasting technique we have proposed. 10-fold cross-validation is used in all our experiments.
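The 10-fold cross-validation protocol mentioned above can be illustrated with a minimal index splitter. This sketch performs a plain shuffled split; whether the folds used in the experiments were stratified by class is not stated in the text. The function name `k_fold_indices` and its `seed` parameter are hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Each item is used exactly once for testing over the k rounds.
splits = list(k_fold_indices(25, k=10))
```

In such a protocol, feature selection and contrasting would normally be fitted on the training indices of each round before the classifier is evaluated on the test indices.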

Results
The different results are reported in tables 1 to 5 and in figure 1. The tables present standard performance measures (True Positive, False Positive, Precision, Recall, F-measure and ROC) weighted and averaged over all classes. For each table, and each combination of selection and classification methods, a performance increase indicator is computed using the DMNBtext True Positive results on the original data as the reference. Finally, since the results are identical for Chi-square, Information Gain and Symmetrical Uncertainty, they are reported only once in the tables, as Chi-square results.
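The class-weighted averaging of the measures mentioned above can be sketched from a confusion matrix as follows. The weighting by class size follows the usual convention for weighted averages, and the reading of the performance increase indicator as a relative difference with respect to the reference value is an assumption of this example. The function names `weighted_scores` and `increase` are hypothetical.

```python
def weighted_scores(confusion):
    """Class-size-weighted precision, recall and F-measure.

    confusion[i][j]: number of items of class i predicted as class j
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    precision = recall = f_measure = 0.0
    for c in range(k):
        tp = confusion[c][c]
        actual = sum(confusion[c])                          # items of class c
        predicted = sum(confusion[r][c] for r in range(k))  # items assigned to c
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = actual / total
        precision += weight * p
        recall += weight * r
        f_measure += weight * f
    return precision, recall, f_measure


def increase(value, reference):
    """Relative performance increase with respect to a reference result."""
    return (value - reference) / reference


precision, recall, f_measure = weighted_scores([[8, 2], [1, 9]])
```

With this weighting, the averaged recall equals the overall proportion of correctly classified items, which is why a dominant class can mask poor results on small classes.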
Table 1 highlights that the performance of all classification methods is low on the considered dataset if no feature selection process is performed. It also confirms the superiority of the DMNBtext, SMO and KNN methods over the two other, tree-based, methods in that context. Additionally, DMNBtext provides the best overall performance in terms of discrimination, as illustrated by its highest ROC value.
When a usual feature selection process is performed in combination with the best method, that is the DMNBtext method, the usual feature selection strategies slightly degrade the quality of the results instead of bringing added value, as shown in table 2. Conversely, the same table highlights that even if the feature reduction effect is smaller with the F-max selection method, its combination with F-max data description contrasting boosts the performance of the method (+81%), leading to excellent classification results (accuracy of 0.96) in a very complex classification context.
Even if the benefit of the F-max selection and contrasting approach is very high with the DMNBtext method, table 3 shows that the added value provided by this preprocessing approach also concerns, to a lesser extent, all the other classifiers, leading to an average increase of their performance of 45% as compared to the reference result. Another interesting phenomenon is that, with such help, tree-based classification methods significantly, and unusually, outperform the KNN method on text.
The results presented in table 4 more specifically illustrate the efficiency of the F-max contrasting procedure, which acts on the data descriptions. In the experiments related to that table, F-max contrasting is performed individually on the features extracted by each selection method and, in a second step, the DMNBtext classifier is applied to the resulting contrasted data (see Algorithm 1). The results show that, whatever feature selection technique is used, the resulting classification performance is enhanced whenever a former step of F-max data description contrasting is performed. The average performance increase is 44%.
Table 5 and figure 1 illustrate the capability of the F-max approach to efficiently cope with the class imbalance problem. Examination of the confusion matrices of figure 1 shows that the data attraction effect of the majority class, which occurs at a high level when the original data are exploited (figure 1(a)), is almost completely overcome when the F-max approach is exploited (figure 1(b)). The capability of the approach to correct class imbalance is also clearly highlighted by the homogeneous distribution of the selected variables across the classes, despite their very different sizes (table 5).

Conclusion
Feature maximization is a cluster quality metric which favors clusters with maximum feature representation with regard to their associated data. In this paper, we have proposed a straightforward adaptation of this metric, which has already demonstrated several generic advantages in the framework of unsupervised learning, to the context of supervised classification. Our main goal was to build an efficient feature selection and feature contrasting model that could overcome the usual problems arising in the supervised classification of large volumes of data, and more especially of large full-text data. These problems relate to class imbalance, high dimensionality, noise, and a high degree of similarity between classes. Through our experiments on a large dataset of bibliographic records extracted from a patent classification, we showed in particular that our approach can naturally cope with these handicaps. Hence, in such a context, whereas the state-of-the-art variable selection techniques remain inoperative, feature maximization-based variable selection and contrasting can very significantly enhance the performance of classification methods (+80%). Another important advantage of this technique is that it is a parameter-free approach, and it can thus be used in a larger scope, such as semi-supervised learning.
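The set $S_c$ of features that are characteristic of a given class c belonging to an overall class set C results in:
$$ S_c = \{ f \in F_c \mid FF_c(f) > \overline{FF}(f) \ \text{and} \ FF_c(f) > \overline{FF}_D \} $$
where
$$ \overline{FF}(f) = \frac{\sum_{c' \in C} FF_{c'}(f)}{|C_{/f}|}, \qquad \overline{FF}_D = \frac{\sum_{f \in F} \overline{FF}(f)}{|F|} $$
and $C_{/f}$ represents the restriction of the set C to the classes in which the feature f is represented.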
An important application of the feature maximization metric is related to cluster labeling, whose role is to highlight the prevalent features of the clusters associated with a clustering model at a given time. Labeling can thus be used both for visualizing or synthesizing clustering results and for optimizing the learning process of a clustering method [ATT 06]. It can rely on endogenous data properties or on exogenous ones. Endogenous data properties are those used during the clustering process. Exogenous data properties are either complementary properties or specific validation properties. Exploiting the feature maximization metric for cluster labeling results in a parameter-free labeling technique [LAM 08].

Table 1: Classification results on initial data.

Table 3: Classification results after F-max + contrast feature selection (all classification methods).

Table 4: Classification results after feature selection by all methods and F-max contrasting (DMNBtext classification).

Table 5: Class data and F-max selected features per class.