Evolutionary subspace clustering using variable genome length

Subspace clustering is a data-mining task that groups similar data objects and at the same time searches the subspaces where similarities appear. For this reason, subspace clustering is recognized as more general and complicated than standard clustering. In this article, we present ChameleoClust+, a bioinspired evolutionary subspace clustering algorithm that takes advantage of an evolvable genome structure to detect various numbers of clusters located in different subspaces. ChameleoClust+ incorporates several biolike features such as a variable genome length, both functional and nonfunctional elements, and mutation operators including large rearrangements. It was assessed and compared with state-of-the-art methods on a reference benchmark using both real-world and synthetic data sets. Although other algorithms may need complex parameter settings, ChameleoClust+ requires only one intuitive parameter specific to subspace clustering: the maximal number of clusters. The remaining parameters of ChameleoClust+ are related to the evolution strategy (e.g., population size, mutation rate), and a single setting for all of them turned out to be effective for all the benchmark data sets. A sensitivity analysis has also been carried out to study the impact of each parameter on the subspace clustering quality.


Introduction
Clustering is a data mining task that aims to group objects sharing similar characteristics into sets (i.e., the clusters) over the whole data space. A related problem is subspace clustering, whose purpose is not only to identify groups of similar objects, but also to detect the subspaces where these similarities occur. Retrieving such subspaces turns out to be particularly useful when dealing with high-dimensional data (Kriegel et al., 2009). Subspace clustering can be conceived as "similarity examined under different representations" (Patrikainen and Meila, 2006). For this reason it is recognized as a more complicated and general task than standard clustering.
Several evolutionary clustering approaches have been proposed (Hruschka et al., 2009); however, very few of them address the subspace clustering task. As described in Section 5, these earlier approaches require non-evolutionary steps to tackle this problem. In order to address the subspace clustering task, we decided not to rely on non-evolutionary stages, but rather to take advantage of an evolvable genome structure. According to (Banzhaf et al., 2006), knowledge from evolutionary and molecular biology should be taken into account in order to design better bio-inspired optimization algorithms. Among important phenomena in evolutionary biology, the dynamic evolution of the genome structure appears as a promising source of advances for bio-inspired optimization. Phenomena such as the variable genome length or the variable percentage of coding or functional elements within the genome are related to the evolution of genome structure (Knibbe et al., 2007). Several studies have shown, for instance, that an evolvable genome structure allows evolution to shape the effects of the evolution principles themselves (e.g., mutations), a phenomenon known as evolution of evolution (EvoEvo) (Hindré et al., 2012). Among the state-of-the-art formalisms used for in silico experimental evolution reviewed in (Hindré et al., 2012), two models enable genome structure evolution: (Knibbe et al., 2007) and (Crombach and Hogeweg, 2007). Both formalisms have inspired key aspects of our work.
In this paper, we present ChameleoClust+, an evolutionary algorithm that takes advantage of a genome having an evolvable structure to tackle the subspace clustering problem. The ChameleoClust+ genome is a coarse-grained genome, inspired by (Crombach and Hogeweg, 2007), and is defined as a list of tuples of numbers. The genome is mapped to the phenotype level by using the genome tuples to denote core point locations in different dimensions, and to build the subspace clusters around these core points. Furthermore, the genome also contains a variable proportion of non-functional elements as in (Knibbe et al., 2007). During replication the genome undergoes both local mutations and large random rearrangements similar to those used in (Knibbe et al., 2007) and (Crombach and Hogeweg, 2007), namely large deletions and duplications. Local mutations modify the genome elements, while rearrangements modify the genome length, and both can change the proportion of non-functional elements. The key intuition in the design of the ChameleoClust+ algorithm is to take advantage of such an evolvable structure to detect various numbers of clusters in subspaces of various dimensions. In order to assess the algorithm, we used the reference subspace clustering evaluation framework presented in (Müller et al., 2009), and compared it to state-of-the-art algorithms on both real and synthetic datasets. The experiments show that ChameleoClust+ obtains competitive results. Moreover, these results can be achieved with a single parameter related to the domain: the maximal number of clusters. In addition, for each generation the computational complexity is linear with respect to the number of objects and to the number of dimensions, and a fast overall evolutionary convergence is observed. This keeps the time consumption low on the datasets of the evaluation framework.
The main contribution of this paper is to show that, using an evolvable genome structure, a single stage fully evolutionary approach can consistently deliver subspace clusters of very good quality, requiring only an easy parameter setting and limited time resources.
The rest of the paper is organized as follows. The next section introduces the proposed algorithm, and Sections 3 and 4 describe the evaluation method and the results, respectively. Section 5 presents the related work and we conclude with a summary in Section 6.

ChameleoClust+
ChameleoClust+ includes several bio-like features such as a variable genome length and organization, the presence of both functional and non-functional tuples, and variation operators including large chromosomal rearrangements. These features, inspired by the in silico experimental evolution formalisms of (Knibbe et al., 2007) and (Crombach and Hogeweg, 2007), give the algorithm a large degree of freedom by making the genome structure evolvable. ChameleoClust+ takes advantage of this structural flexibility to build subspace clusterings with various numbers of clusters using subspaces having different numbers of dimensions.

Dataset and clusters
A dataset $S = \{s_1, s_2, \ldots\}$ is a set of objects. Each object has a unique identifier and is described in $\mathbb{R}^D$ by $D$ features (the coordinates of the object). The size of $S$ is the number of objects in $S$, and $D$ is the number of dimensions (i.e., the dimensionality) of $S$. Each dimension is represented by a number from 1 to $D$ and the set of all dimensions of the dataset is denoted $\mathcal{D} = \{1, \ldots, D\}$. The algorithm takes as input a dataset $S$ and a parameter $c_{max}$ that is the maximal number of desired clusters. The algorithm outputs a subspace clustering in the form of a set of disjoint clusters, where each cluster is defined as a set of objects and a set of dimensions.

Overall clustering principle
Each individual encodes a subspace clustering in its genome. More precisely, a genome defines a set of so-called core points located in various subspaces that may have fewer than $D$ dimensions. If the objects of the dataset tend to form groups around these core points, then a high fitness is associated with the corresponding individual. The reproduction (including selection and mutations) is performed for a whole generation in a synchronized way. After a given number of generations the process is stopped and the subspace clustering corresponding to the individual having the highest fitness is retained.

Preprocessing
As in many typical clustering problems, the first step is to standardize the dataset to ensure that all features can have a similar impact on the distance computation during the clustering. Thus each feature value $x$ is replaced by its z-score $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the dataset mean and $\sigma$ is the dataset standard deviation for the given feature. After standardization, data values in different dimensions are independent of the original offset and scale, and all features have a unit standard deviation and a zero mean (i.e., the entire dataset is centered around the origin $O$). Finally, the maximal absolute value among the z-scores of all features is computed and is denoted $x_{max}$ in the rest of the paper.
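The preprocessing step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `standardize`, the use of the population standard deviation, and the handling of constant features are assumptions that the text does not specify.

```python
import math

def standardize(dataset):
    """Return the z-scored dataset and x_max (largest absolute z-score)."""
    n, dims = len(dataset), len(dataset[0])
    # Per-feature mean and standard deviation (population form: an assumption).
    means = [sum(row[d] for row in dataset) / n for d in range(dims)]
    stds = [math.sqrt(sum((row[d] - means[d]) ** 2 for row in dataset) / n)
            for d in range(dims)]
    # Assumption: a constant feature (std == 0) is mapped to 0 everywhere.
    z = [[(row[d] - means[d]) / stds[d] if stds[d] > 0 else 0.0
          for d in range(dims)] for row in dataset]
    # x_max bounds the coordinate values that core points may take.
    x_max = max(abs(v) for row in z for v in row)
    return z, x_max

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
z, x_max = standardize(data)
```

After this call every column of `z` has zero mean and unit standard deviation, and `x_max` is the value later used to bound the coordinate domain.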

Genome structure
A genome $\Gamma$ is a list $[\gamma_1, \ldots, \gamma_i, \ldots, \gamma_n]$ of tuples of the form $\gamma_i = \langle g, c, d, x \rangle$, where $g \in \{0, 1\}$ indicates whether $\gamma_i$ is a functional tuple of the genome ($g = 1$) or not ($g = 0$), and $c$, $d$, $x$ are used to define the phenotype only if $g = 1$. These elements have the following domains: $c \in \{1, \ldots, c_{max}\}$, $d \in \{1, \ldots, D\}$ and $x \in ValCoord$, with $ValCoord = \{j \times x_{max}/1000 \mid j \in \{-1000, \ldots, 1000\}\}$, i.e., all values from $-x_{max}$ to $x_{max}$ with step $x_{max}/1000$. The genome structure defined above is evolvable: the number of functional and non-functional elements and their respective positions in the genome may change. In Section 4.1 we show the adaptation of the genome size and of the number of functional elements for different datasets.

Phenotype
A phenotype $\Phi$ is simply a set of core points. Informally, a core point is a specific point around which objects can be grouped to form a subspace cluster. The number of core points cannot exceed the maximal number of desired clusters specified by the parameter $c_{max}$. Each core point is identified by a number $c \in [1, c_{max}]$ and is denoted $p_c$. The intuition of the genotype-phenotype mapping is that each functional element $\langle 1, c, d, x \rangle$ of the genome is a contribution of value $x$ to the location of core point $p_c$ in dimension $d$. More precisely, let $x_d$ be the coordinate of $p_c$ for dimension $d$; then $x_d$ is the sum of all the values $x$ contained in tuples of the form $\langle 1, c, d, x \rangle$ in the genome $\Gamma$. For a given core point index $c \in [1, c_{max}]$, the subspace associated with core point $p_c$ is the set $\mathcal{D}_{p_c}$ containing the dimensions that contribute to the location of $p_c$, i.e., the set of all the dimensions $d$ in $\mathcal{D}$ such that there exists at least one functional element of the form $\langle 1, c, d, x \rangle$ in $\Gamma$, where $x$ can be any coordinate value.
For a given dataset $S$, a phenotype $\Phi$ defines a subspace clustering of $S$ by associating each object of $S$ with the best matching core point in $\Phi$. A non-empty set of objects associated with a core point $p_c$ forms a cluster in subspace $\mathcal{D}_{p_c}$. The precise definition of the notion of best match is given in Section 2.7 hereafter.
Notice that the length of the genome can differ among individuals, leading to phenotypes containing different numbers of core points in various subspaces and thus defining subspace clustering models with different numbers of clusters in subspaces having different numbers of dimensions. Notice also that the genotype-to-phenotype mapping is not bijective: the same phenotype can be obtained from different genotypes containing different functional or non-functional elements.
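The genotype-to-phenotype mapping described above can be sketched as follows. The representation chosen here (a genome as a list of `(g, c, d, x)` tuples, a phenotype as a nested dictionary) and the name `build_phenotype` are illustrative assumptions, not the authors' data structures.

```python
from collections import defaultdict

def build_phenotype(genome):
    """Map a genome (list of (g, c, d, x) tuples) to its core points.

    Returns {c: {d: coordinate}}; the keys of each inner dict form the
    subspace of core point p_c.
    """
    phenotype = defaultdict(lambda: defaultdict(float))
    for g, c, d, x in genome:
        if g == 1:  # non-functional tuples (g == 0) are ignored
            phenotype[c][d] += x  # contributions to the same (c, d) add up
    return {c: dict(coords) for c, coords in phenotype.items()}

# Two functional tuples contribute to core point 1 in dimension 2;
# the non-functional tuple (g == 0) has no effect on the phenotype.
genome = [(1, 1, 2, 0.5), (0, 1, 3, 9.0), (1, 1, 2, 0.25), (1, 2, 1, -1.0)]
phenotype = build_phenotype(genome)
```

Because a dimension enters the subspace as soon as one functional tuple mentions it, a dimension whose contributions happen to sum to zero still belongs to the subspace of its core point in this sketch.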

Mutation operators
Each new genome is copied from a parent and modified by biologically inspired mutation operators of two kinds: global rearrangements and point mutations. These operators are general mutation operators; they are not guided by any criterion related to the subspace clustering task, and both functional and non-functional elements can be affected by mutations.
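For a genome $\Gamma$, an application of the point mutation operator is defined as follows.
- Point substitution: Let $\gamma_i \in \Gamma$ of the form $\gamma_i = \langle g, c, d, x \rangle$ be an element uniformly drawn in the genome, and let $k \in \{1, 2, 3, 4\}$ be a value chosen uniformly. The point substitution operator modifies the $k$-th element of the tuple $\gamma_i$ and replaces it with a new random number drawn uniformly in its associated range:
$$ \gamma_i \leftarrow \begin{cases} \langle \mathcal{U}(\{0, 1\}), c, d, x \rangle & \text{if } k = 1 \\ \langle g, \mathcal{U}(\{1, \ldots, c_{max}\}), d, x \rangle & \text{if } k = 2 \\ \langle g, c, \mathcal{U}(\{1, \ldots, D\}), x \rangle & \text{if } k = 3 \\ \langle g, c, d, \mathcal{U}(ValCoord) \rangle & \text{if } k = 4 \end{cases} $$
where $\mathcal{U}$ denotes the uniform random selection of an element in a set.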
For the rearrangements, $\Gamma$ is considered to be circular (like bacterial genomes). This means that the tuple $\gamma_n$ is adjacent to the tuple $\gamma_1$. In order to define the possible rearrangements let us define two basic operators.
- Sublist extraction operator:
Rearrangements are responsible for increasing or decreasing the genome length. The model uses two kinds of rearrangements: large deletions and large duplications. For one application of a rearrangement operation on a genome $\Gamma = [\gamma_1, \ldots, \gamma_n]$, a portion of $\Gamma$ bounded by two tuples $\gamma_i, \gamma_j \in \Gamma$ is considered, where $i$ and $j$ are uniformly chosen in $\{1, \ldots, n\}$. The two rearrangement operators can then be defined as follows:
- Large deletions: The segment between tuples $\gamma_i$ and $\gamma_j$ is excised.
- Large duplications: The segment between tuples $\gamma_i$ and $\gamma_j$ is copied and inserted at the location of a third tuple $\gamma_p$ (uniformly chosen).
During the reproduction of an individual, the whole mutation stage is defined as follows. For each of the two kinds of rearrangement operations, the total number of rearrangements is drawn from a binomial law $\mathcal{B}(L, u_m)$, where $L$ is the genome size and $u_m$ is the mutation rate (the same rate is used for all mutation operators). Then the corresponding numbers of large deletions and large duplications are performed in a random order. Once all rearrangements have been applied, the number of point substitutions is drawn from a binomial law $\mathcal{B}(L', u_m)$, where $L'$ is the genome size after applying the rearrangement operations. Then all these point substitutions are carried out.
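The mutation stage described above can be sketched as follows. Several details are assumptions made for illustration, since the text leaves them open: the segment is taken to run circularly from position i to position j with both ends included, a deletion that would empty the genome is skipped, and all function names are hypothetical.

```python
import random

def binomial(n, p, rng):
    """Draw from B(n, p) as the number of successes in n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))

def segment(n, i, j):
    """Indices of the circular segment from i to j (both included, 0-based)."""
    return [(i + k) % n for k in range((j - i) % n + 1)]

def large_deletion(genome, rng):
    n = len(genome)
    removed = set(segment(n, rng.randrange(n), rng.randrange(n)))
    kept = [t for k, t in enumerate(genome) if k not in removed]
    # Assumption: a deletion that would empty the genome is skipped.
    return kept or genome

def large_duplication(genome, rng):
    n = len(genome)
    copy = [genome[k] for k in segment(n, rng.randrange(n), rng.randrange(n))]
    p = rng.randrange(n)  # insertion point, uniformly chosen
    return genome[:p] + copy + genome[p:]

def point_substitution(genome, domains, rng):
    i, k = rng.randrange(len(genome)), rng.randrange(4)
    fields = list(genome[i])
    fields[k] = rng.choice(domains[k])  # uniform draw in the k-th field's range
    return genome[:i] + [tuple(fields)] + genome[i + 1:]

def mutate(genome, u_m, domains, rng):
    """One mutation stage: rearrangements in random order, then substitutions."""
    genome = list(genome)
    size = len(genome)
    ops = ([large_deletion] * binomial(size, u_m, rng)
           + [large_duplication] * binomial(size, u_m, rng))
    rng.shuffle(ops)
    for op in ops:
        genome = op(genome, rng)
    # The number of substitutions depends on the size after rearrangements.
    for _ in range(binomial(len(genome), u_m, rng)):
        genome = point_substitution(genome, domains, rng)
    return genome

rng = random.Random(0)
c_max, n_dims, x_max = 3, 4, 2.0
domains = [[0, 1], range(1, c_max + 1), range(1, n_dims + 1),
           [j * x_max / 1000 for j in range(-1000, 1001)]]
parent = [(0, 1, 1, 0.0)] * 20
child = mutate(parent, 0.05, domains, rng)
```

Note that none of the operators looks at the clustering quality: both functional and non-functional tuples are mutated blindly, as stated in the text.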

Fitness
The fitness of an individual of phenotype $\Phi$ is related to the quality of the subspace clustering defined by $\Phi$ over a given dataset. This quality measure is a distance-based measure reflecting how the objects in the dataset tend to form groups around the core points of $\Phi$. It has been shown in (Beyer et al., 1999) and (Aggarwal et al., 2001) that distance comparisons become less meaningful as dimensionality increases; this effect is called the concentration effect of the distances. It has also been shown in (Aggarwal et al., 2001) that the Manhattan distance is robust to this effect. In the ChameleoClust+ algorithm, the distance used is the Manhattan segmental distance introduced in (Aggarwal et al., 1999) for the well-known subspace clustering algorithm PROCLUS. It is a normalized version of the classic Manhattan distance that allows distances to be compared across subspaces with different numbers of dimensions. Let $y_1$ and $y_2$ be two points in a space over the set of dimensions $\mathcal{D}$, and let $y_{1,i}$ (resp. $y_{2,i}$) denote the coordinate of $y_1$ (resp. $y_2$) in dimension $i$ of $\mathcal{D}$. Then, the Manhattan segmental distance is:
$$ d_{\mathcal{D}}(y_1, y_2) = \frac{\sum_{i \in \mathcal{D}} |y_{1,i} - y_{2,i}|}{|\mathcal{D}|} $$
This distance is used here to define a function $E(x, p_c)$ that assesses the mismatch of the assignment of an object $x \in S$ in space $\mathcal{D}$ to a core point $p_c$ in subspace $\mathcal{D}_{p_c}$. The higher $E(x, p_c)$ is, the worse the association of $x$ to $p_c$. This function is defined by:
$$ E(x, p_c) = \frac{|\mathcal{D}_{p_c}| \times d_{\mathcal{D}_{p_c}}(x, p_c) + |\mathcal{D} \setminus \mathcal{D}_{p_c}| \times d_{\mathcal{D} \setminus \mathcal{D}_{p_c}}(x, O)}{|\mathcal{D}|} $$
where $O$ is the origin of the entire space. The mismatch evaluation $E(x, p_c)$ increases with the distance between the core point $p_c$ and the object $x$ (term $d_{\mathcal{D}_{p_c}}(x, p_c)$). It also increases if the subspace $\mathcal{D}_{p_c}$ does not have enough dimensions to explain the shift of $x$ with respect to $O$ (term $d_{\mathcal{D} \setminus \mathcal{D}_{p_c}}(x, O)$). The value $E(x, p_c)$ is then simply the average of $d_{\mathcal{D}_{p_c}}(x, p_c)$ and $d_{\mathcal{D} \setminus \mathcal{D}_{p_c}}(x, O)$ weighted by their respective subspace dimensionalities.
To evaluate the fitness of an individual with phenotype $\Phi$, each object $x$ in the dataset $S$ is assigned to the core point $p_c \in \Phi$ for which $E(x, p_c)$ is minimal (in the rare cases where several core points lead to the same minimal value, one of them is chosen nondeterministically). Let $S_{p_c}$ be the set of objects associated with $p_c$; if $S_{p_c}$ is not empty, the core point $p_c$ defines the subspace cluster $\langle S_{p_c}, \mathcal{D}_{p_c} \rangle$, otherwise $p_c$ defines no cluster.
The fitness $F$ is then defined as the opposite of the average of the mismatches computed for the best possible assignments of the dataset objects:
$$ F(\Phi, S) = -\frac{\sum_{x \in S} \min_{p_c \in \Phi} E(x, p_c)}{|S|} $$
The fitness function $F(\Phi, S)$ goes to 0 when the mismatches between objects and core points tend to 0 (perfect match), and is strongly negative when objects and core points are poorly related. Notice that a core point $p_c$ with no associated object ($S_{p_c} = \emptyset$) is not penalized, and its corresponding functional elements in the genome may then be preserved for further exploration during evolution.
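The mismatch and fitness computations can be sketched as follows, assuming that the mismatch is the dimensionality-weighted average of the two segmental distances described in the text. Objects and core points are represented as dictionaries from dimension index to value, which is an illustrative choice. The treatment of a phenotype without any core point (each object is compared with the origin over all dimensions) is an assumption, since the text does not define the fitness in that case.

```python
def segmental_distance(a, b, dims):
    """Manhattan segmental distance between a and b over the dimensions dims."""
    if not dims:
        return 0.0  # an empty subspace contributes nothing to the mismatch
    return sum(abs(a[d] - b[d]) for d in dims) / len(dims)

def mismatch(x, core, all_dims):
    """E(x, p_c): x covers all_dims, core covers only its own subspace."""
    sub = set(core)
    rest = set(all_dims) - sub
    origin = {d: 0.0 for d in all_dims}
    return (len(sub) * segmental_distance(x, core, sub)
            + len(rest) * segmental_distance(x, origin, rest)) / len(all_dims)

def fitness(phenotype, sample, all_dims):
    """Opposite of the mean of the minimal mismatches over the sample."""
    # Assumption: without core points, objects are matched against the origin.
    cores = list(phenotype.values()) or [{}]
    total = sum(min(mismatch(x, core, all_dims) for core in cores)
                for x in sample)
    return -total / len(sample)

all_dims = [1, 2]
sample = [{1: 1.0, 2: 0.0}, {1: -1.0, 2: 0.5}]
phenotype = {1: {1: 1.0}}  # one core point located at 1.0 in dimension 1
score = fitness(phenotype, sample, all_dims)
```

In this toy case the first object matches the core point perfectly, while the second is penalized both for its distance to the core point in dimension 1 and for its unexplained offset from the origin in dimension 2.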
The computation cost of $F(\Phi, S)$ is proportional to the size of the dataset $S$, but to guide the search it is not necessary to evaluate the fitness over the whole input dataset $S$, and it can be sufficient to evaluate it over a sample. This strategy is used in ChameleoClust+, with an incrementally changing sample to avoid the possibly misleading consequences of a single poor sample selection (i.e., a sample not very representative of $S$). This is defined as follows. Let $t$ be the index of the current generation (starting at $t = 0$). Let $L = [x_0, x_1, \ldots]$ be a list containing all the dataset objects in a random order. For generation $t$, the fitness value of an individual is then $F(\Phi, S_t)$, where $S_t$ is a subset of $S$ of size $\omega$: $S_t$ is simply the set of objects in $L$ from index $t \times \omega$ to index $t \times \omega + \omega - 1$, restarting from the beginning of $L$ when the last element is reached.
As shown in Section 4, for reasonable sizes of $S_t$, this leads to a substantial reduction of the execution time, without noticeable degradation of the clustering quality.
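The sliding sample can be sketched with modular indexing, which is one straightforward way to restart from the beginning of the list. The function name `sliding_sample` is hypothetical.

```python
def sliding_sample(shuffled, t, omega):
    """Objects used at generation t: omega consecutive items, wrapping around."""
    n = len(shuffled)
    start = (t * omega) % n
    return [shuffled[(start + k) % n] for k in range(omega)]

# The list is shuffled once before evolution starts; here it holds 10 objects.
objects = list(range(10))
first = sliding_sample(objects, t=0, omega=4)
third = sliding_sample(objects, t=2, omega=4)  # wraps past the end of the list
```

Successive generations therefore see different parts of the dataset, so that no single unrepresentative sample drives the whole evolution.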

Population
Each individual can be perceived as an asexual artificial organism containing a single chromosome. The population evolves during $T$ generations. At each generation the population is completely renewed but its size $N$ remains constant over time. As in the evolution simulation model of (Knibbe et al., 2007), we rely on an exponential ranking selection (Blickle and Thiele, 1996) in order to use the same distribution for the selection of the individuals throughout the evolution (i.e., the selection is not directly related to fitness values but to ranks). In this selection scheme, the individuals of the current generation are ranked according to their fitness, in increasing order of performance (the worst has rank 1 and the best rank $N$). Then, for each of the $N$ individuals of the offspring generation, the parent of this individual is determined by a trial over an $N$-class multinomial law, where each class is associated with an individual of the current generation. For this multinomial law, an individual $\alpha$ has a success probability
$$ p_\alpha = (s - 1) \frac{s^{N - r_\alpha}}{s^N - 1} $$
where $r_\alpha$ is the rank of the individual $\alpha$ and $s$ is the selection pressure parameter. This procedure is sketched in Figure 1. In order to avoid a possible decrease of the best fitness during the evolution, the algorithm uses an elitist selection method. More precisely, it always adds to the next generation an unchanged copy of the best current individual, and performs the random reproduction using only $N - 1$ trials.
For the initialization of the population, the $N$ individuals of the first generation have genomes of the same size, denoted $|\Gamma_{init}|$, containing only non-functional elements. These $N$ genomes are drawn independently, and filled with random tuples of the form $\langle 0, \mathcal{U}(\{1, \ldots, c_{max}\}), \mathcal{U}(\{1, \ldots, D\}), \mathcal{U}(ValCoord) \rangle$.
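A minimal sketch of exponential ranking selection with elitism is given below. It assumes the success probability is $(s - 1)\, s^{N - r} / (s^N - 1)$ for rank $r$, which is the form that also appears in the parameter setting discussion. The function names are hypothetical.

```python
import random

def selection_probabilities(n, s):
    """Success probability for ranks 1..n (rank n is the best individual)."""
    return [(s - 1) * s ** (n - r) / (s ** n - 1) for r in range(1, n + 1)]

def choose_parents(fitnesses, s, rng):
    """Parent index for each of the n offspring; index 0 is the elitist copy."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])  # worst ... best
    probs = selection_probabilities(n, s)
    best = order[-1]
    # N - 1 multinomial trials; the elitist copy is kept unchanged.
    return [best] + rng.choices(order, weights=probs, k=n - 1)

probs = selection_probabilities(300, 0.5)
parents = choose_parents([-3.0, -0.5, -2.0, -1.0], 0.5, random.Random(0))
```

Only the ranks matter here: rescaling all fitness values by a positive factor leaves the parent distribution unchanged.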
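The coordinate domain and the initial genomes can be sketched as follows. The function names `val_coord` and `initial_genome` are hypothetical, and the numeric values in the usage lines are examples only.

```python
import random

def val_coord(x_max, steps=1000):
    """All admissible coordinates: -x_max .. x_max with step x_max / steps."""
    return [j * x_max / steps for j in range(-steps, steps + 1)]

def initial_genome(size, c_max, n_dims, coords, rng):
    """A genome made only of random non-functional tuples (g == 0)."""
    return [(0, rng.randint(1, c_max), rng.randint(1, n_dims), rng.choice(coords))
            for _ in range(size)]

rng = random.Random(42)
coords = val_coord(x_max=3.0)
population = [initial_genome(200, 10, 20, coords, rng) for _ in range(5)]
```

Since every initial tuple is non-functional, the first phenotypes contain no core point; functional elements can only appear through mutations of the first tuple field.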

Time complexity
Let $|\Gamma|$ be the maximal genome size among all individuals in the current generation. In this section we distinguish the time complexity related to the fitness computation from the complexity related to the reproduction operations.

Fitness computation
In order to compute the fitness of an individual, the algorithm first needs to build the phenotype of this individual from its genome. Since only the set of functional tuples, denoted $\Gamma_f$, contributes to the phenotype, only these tuples have to be selected, and this search has a complexity of $O(|\Gamma|)$. Once the functional tuples have been retrieved, ChameleoClust+ sorts them by cluster and dimension to obtain the phenotype; this operation has a complexity of $O(|\Gamma_f| \times \ln(|\Gamma_f|))$. Once the phenotype has been built, the algorithm associates each one of the $\omega$ objects in $S_t$ with the core point it matches best. Since, in the worst case, each element in $\Gamma_f$ can define a core point, this operation has a complexity of $O(|\Gamma_f| \times D \times \omega)$. Thus, for each individual, the time complexity related to the fitness computation is
$$ O(|\Gamma| + |\Gamma_f| \times \ln(|\Gamma_f|) + |\Gamma_f| \times D \times \omega) $$
Then, the complexity of the fitness computation over the whole population is given by:
$$ O(N \times (|\Gamma| + |\Gamma_f| \times \ln(|\Gamma_f|) + |\Gamma_f| \times D \times \omega)) $$

Reproduction operations
When the individual fitnesses have been computed, ChameleoClust+ ranks the individuals in order to give them a reproduction probability. This operation has a complexity of $O(N \times \ln(N))$. The genomes of the individuals of the new generation are initialized by copying the genomes of their parents, this operation having a complexity of $O(N \times |\Gamma|)$. Then, these genomes are modified by rearrangements (large duplications and large deletions) followed by point mutations. Let $L_m$ be the maximal genome size reached during the rearrangement steps for all individuals. For one genome, the numbers of duplications and of deletions are drawn from a binomial law and cannot be greater than $|\Gamma|$, leading for their application to a complexity of $O(|\Gamma| \times L_m)$. In a similar way, the number of point mutations is bounded by $L_m$, and the complexity of the application of this operator is $O(L_m)$. The complexity of the reproduction operations for the whole population is then
$$ O(N \times \ln(N) + N \times |\Gamma| \times L_m) $$
It should be noticed that this worst case is not reached in the experiments. Indeed, for a genome of size $|\Gamma|$, the number of rearrangements (for both kinds) is not $|\Gamma|$, but only $|\Gamma| \times u_m$ on average, where effective parameter settings correspond to low values of $u_m$ ($u_m \ll 1$), as shown in the next section.

Experimental setup

Experimental protocol
In order to evaluate ChameleoClust+ and compare it to state-of-the-art algorithms, we used the reference evaluation framework designed for subspace clustering and described in (Müller et al., 2009). This evaluation framework relies on a systematic approach to compare the results of representative algorithms that address the major subspace clustering paradigms. The comparison detailed in (Müller et al., 2009) was made using different evaluation measures on both real and synthetic datasets. We clustered the same datasets with ChameleoClust+ and computed the same quality measures.
In the framework of (Müller et al., 2009), as each algorithm requires several parameters (from 2 to 9), the algorithms are executed with many different parameter settings to explore the parameter space. Then, using an external labeling of the objects, only the subspace clusterings that are among the best (with respect to the external labeling) are retained. So, the results reported for these algorithms are in some sense the best possible subspace clusterings that could be achieved if we were able to find the most appropriate parameter values. Since generally no external labeling is available when we search for clusters, parameter tuning is most of the time a difficult task and these high-quality subspace clusterings are likely to be hard to obtain.
An important point to notice is that for ChameleoClust+ we did not perform any parameter optimization using external information, but simply followed the parameter setting guideline presented in Section 3.3. Then, we ran ChameleoClust+ and took the subspace clustering defined by the individual of the last generation having the best fitness. Since the algorithm is nondeterministic, we ran it 10 times under the same conditions and report the minimal, maximal and mean values of the measures over these 10 runs. So, we compare clusterings effectively found by ChameleoClust+ to the best clusterings that could potentially be found by the other algorithms. All experiments were run on a quad-core Intel 2.67 GHz CPU running Linux Ubuntu 14.04, using a single core and less than 250 MB of RAM.

Datasets
We studied the performance of ChameleoClust+ on real-world data using the seven benchmark datasets selected in (Müller et al., 2009) for their representativeness: breast, diabetes, liver, glass, shape, pendigits and vowel (most of them coming from the UCI archive (Bache and Lichman, 2013)). These datasets have different dimensionalities and contain different numbers of objects. These objects are already structured in classes, and the class membership is used by quality measures to assess the cluster purity. However, the number of classes does not necessarily reflect the number of subspace clusters, since even within a class the objects can form several clusters in different subspaces.
We also ran ChameleoClust+ on the 16 synthetic benchmark datasets provided by (Müller et al., 2009). These datasets are particularly useful to study the algorithm performance, as the true clusters and their subspaces are known. Each dataset contains 10 hidden subspace clusters lying in subspaces having 50%, 60% and 80% of the total dimensions of the dataset. Seven synthetic datasets were generated in (Müller et al., 2009) to study scalability with respect to the dataset dimensionality: D05, D10, D15, D20, D25, D50 and D75 with 5, 10, 15, 20, 25, 50 and 75 dimensions respectively. These datasets have about 1500 objects each and about 10% of noise objects. In addition to the previous datasets, five synthetic datasets were built to analyze scalability with respect to the dataset size: S1500, S2500, S3500, S4500 and S5500 with 1500, 2500, 3500, 4500 and 5500 objects respectively. For these datasets, the number of dimensions was set to 20 and the percentage of noise objects is close to 10%. Finally, four datasets were generated to study the capacity to cope with noise: N10, N30, N50 and N70 with 10%, 30%, 50% and 70% of noise objects in the dataset respectively. These datasets were built by adding noise points to the dataset D20.
All datasets and additional description are made available by the authors of (Müller et al., 2009) at http://dme.rwth-aachen.de/openSubspace/evaluation.

Parameter setting
Sliding sample size
The dataset sample used to compute the fitness at each generation should contain enough objects to be representative of the entire dataset, but needs to be small enough to reduce the runtime. The sliding sample size $\omega$ was set to 10% of the dataset size and this setting turned out to be an interesting trade-off, as shown in Section 4.3, Figure 8. Of course, while the fitness is computed only on this sample, the final association of objects to clusters (using the core points defined by the best individual) and the evaluation of this clustering are still performed on the whole dataset.

Selection pressure
Let $\alpha$ be an individual of the current generation and $\beta$ be an individual of the next generation. According to Section 2.8, $\alpha$ has the probability $p_\alpha = (s - 1) \frac{s^{N - r_\alpha}}{s^N - 1}$ to be the parent of $\beta$. For the best individual ($r_\alpha = N$), the previous expression simplifies to $p_\alpha = \frac{s - 1}{s^N - 1}$. The selection pressure was set to $s = 0.5$ so that with a large population ($N \gg 1$, hence $s^N \approx 0$) the best individual has a success probability close to $p_\alpha \approx 0.5$. Therefore each individual of the next generation has one chance out of two to explore the neighborhood of the best current individual, and the same chance to descend from another one, then potentially exploring different solutions.

Initial genome size
The genomes are initialized with random tuples denoting non-functional elements (see Section 2.8) and the size of these initial genomes was set to $|\Gamma_{init}| = 200$. This genome size matches the number of tuples required to build a typical subspace clustering model, e.g., 10 clusters in 20 dimensions or 20 clusters in 10 dimensions. As the genome size and the genome structure are not constrained and are able to evolve (as illustrated in Figures 4b and 4c), the initial genome size is not a determining choice for the algorithm.
A sensitivity analysis performed in Section 4.3 shows that the result quality is not substantially modified for a large range of the three previous parameters.

Mutation rate
The mutation rate was set according to its impact on the number of replications that actually produce genomes that are different from or identical to the parental genome. Let $\varphi$ be the probability that no mutation of any type (substitution, deletion, duplication) occurs during one replication of an individual. As defined in Section 2.6, the number of mutations of a given type that take place during one replication follows a binomial distribution $\mathcal{B}(|\Gamma|, u_m)$. Thus the probability that no mutation of one type occurs is equal to $(1 - u_m)^{|\Gamma|}$, and for the three types together $\varphi = (1 - u_m)^{3|\Gamma|}$. $\varphi$ depends strongly on the mutation rate and the genome length, as illustrated in Figure 2. Indeed, when the mutation rate is too low, genomes are extremely invariable regardless of their respective lengths, i.e., $\varphi \approx 1$. Consequently, when the mutation rate is too low, genomes are likely to evolve too slowly. On the contrary, when the mutation rate is too large, genomes are extremely variable regardless of their respective lengths, i.e., $\varphi \approx 0$. Consequently, when the mutation rate is too high, genomes are likely to evolve improperly because of drastic changes. Besides the previous effect, for intermediate mutation rates Figure 2 illustrates that the genome variability estimated by the mutation probability increases together with the genome length, longer genomes being more variable than shorter ones.
In order to tune the mutation rate properly, we consider a range of plausible genome sizes that individuals could reach in order to tackle the subspace clustering problem. Let us take $|\Gamma_{min}| = 50$ as a minimal reasonable genome length (e.g., $\Gamma_{min}$ can encode 10 clusters in subspaces having 5 dimensions or 5 clusters in subspaces having 10 dimensions, which is quite a small clustering model). Let us take $|\Gamma_{max}| = 400$ as a maximal reasonable genome length (e.g., $\Gamma_{max}$ can encode 20 clusters in subspaces having 20 dimensions on average). $\Gamma_{max}$ is also the most variable genome we consider.
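The dependence of the no-mutation probability on the mutation rate and on the genome length can be explored with the short sketch below. It assumes that the three mutation counts are independent binomial draws over the same genome length, which gives $(1 - u_m)^{3|\Gamma|}$; the change of genome length caused by rearrangements is ignored. The function names are hypothetical.

```python
def prob_no_mutation(u_m, genome_len, n_types=3):
    """Probability that a replication leaves the genome unchanged."""
    return (1.0 - u_m) ** (n_types * genome_len)

def rate_for(phi, genome_len, n_types=3):
    """Mutation rate giving a no-mutation probability phi (inverse relation)."""
    return 1.0 - phi ** (1.0 / (n_types * genome_len))

# Compare the two reference genome lengths over a few mutation rates.
for u_m in (1e-5, 1e-3, 1e-1):
    print(u_m, prob_no_mutation(u_m, 50), prob_no_mutation(u_m, 400))
```

For a fixed mutation rate, the longer genome always has the lower probability of being copied unchanged, which is the length effect described for intermediate rates.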
A suitable range of mutation rates should allow the less variable genomes to evolve fast enough and should not lead the more variable genomes to jump too far in the genome space. We decided to work with mutation rates such that $\Gamma_{min}$ has at most a 95% chance of avoiding mutations and $\Gamma_{max}$ has at least a 5% chance of avoiding mutations. From the expression of $\varphi$, we have $u_m = 1 - \varphi^{1/(3|\Gamma|)}$.
A sensitivity analysis performed in Section 4.3 shows that the quality of the results is not substantially modified for mutation rates close to the range defined previously. However, mutation rates chosen far outside this range lead to poorer results.

Population size and number of generations
In order to adjust these parameters, we analyzed the fitness value and its convergence curves on three datasets: shape and pendigits, which are respectively the smallest and the largest of the real datasets, and D20, a typical synthetic dataset of the framework (20 dimensions and 10% of noise points).
For the population size, Figure 3 illustrates that the larger the population is, the higher the fitness values are. Indeed, a larger population has a higher exploration power, and is more likely to reach optimal solutions. However, these improvements reach a plateau and then tend to be less significant. Figure 3 illustrates that an appropriate fitness convergence is reached with a population size set to $N = 300$ or greater.
For the number of generations, at 5000 the algorithm achieved a good fitness convergence. This is illustrated in Figure 4a, where this convergence seems complete for the shape and D20 datasets, and nearly complete for the pendigits dataset. A careful setting of the number of generations is not required before performing the subspace clustering, because the user can monitor the fitness curve during the process in order to stop it when the fitness reaches a plateau. However, as detecting such plateaus is somewhat subjective, here we decided to evaluate ChameleoClust+ with an early stopping at 5000 generations for all the experiments. Notice that, as could be expected and as shown by the sensitivity analysis carried out in Section 4.3, allowing the algorithm to evolve for more generations does not have a negative impact on the clustering quality and can still slightly improve it. Figures 4b and 4c illustrate that the early generations are characterized by a fast evolution of the genome structure, and particularly of the number of functional tuples and of the fraction of functional tuples. At 5000 generations, the algorithm has already been able to take advantage of the genome structure evolution to encode a subspace clustering model having a fitness value close to the maximum reached in Figure 4a. Readers may notice that the convergence of the genome structure may be slower than the fitness convergence. However, the main point with regard to the subspace clustering problem is to obtain well-positioned core points (i.e., to have an optimized phenotype). Consequently, it is not necessary to run the algorithm until a stable genome structure is reached: the algorithm can be stopped earlier, as soon as a stable fitness is obtained.
Maximal number of subspace clusters
c_max is the maximal number of subspace clusters that can be built, and it was the only parameter that needed tuning. The other parameters are related to the evolution strategy (population size, mutation rate, ...), and for all of them the single setting established previously in this section turned out to be effective for all the benchmark datasets. Moreover, c_max does not require fine tuning, since ChameleoClust+ can adapt the number of subspace clusters between 1 and c_max. In order to set this parameter, we first executed ChameleoClust+ with c_max = 10. When the algorithm outputs exactly c_max clusters, it is likely to have been limited by too low a value of c_max. In this case, the clustering was repeated with increasing values of c_max, in increments of 10, until ChameleoClust+ output fewer than c_max clusters. Only the last value of c_max is retained, which then allows ChameleoClust+ to regulate the number of clusters built. Using this procedure, for the real world datasets the c_max parameter was set to 10 for breast and glass, to 20 for shape and pendigits, to 30 for liver and diabetes, and finally to 40 for vowel. For the synthetic datasets, the same procedure led to setting c_max to 30 for D05, the dataset having 5 dimensions, and to 20 for the fifteen other datasets.
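The tuning procedure described above can be sketched as a short loop. The callable `run_clustering` is a hypothetical stand-in for one execution of the algorithm that returns the number of clusters produced, and the `limit` safeguard is an addition made for this illustration only.

```python
def tune_c_max(run_clustering, start=10, increment=10, limit=200):
    """Increase c_max until the algorithm outputs fewer than c_max clusters.

    `run_clustering(c_max)` is a hypothetical callable that runs the
    clustering with the given bound and returns the number of clusters found.
    `limit` is a safety bound that is not part of the described procedure.
    """
    c_max = start
    while c_max <= limit:
        n_clusters = run_clustering(c_max)
        if n_clusters < c_max:
            # The bound was not reached, so it no longer constrains the result.
            return c_max, n_clusters
        c_max += increment
    raise RuntimeError("no suitable c_max found below the safety limit")
```

For example, if the data would naturally yield 17 clusters, the loop tries c_max = 10 (bound reached), then c_max = 20, and stops there.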

Evaluation measures
In order to compare our algorithm to the others, we used the same standard evaluation measures for clusters and subspace clusters as in (Müller et al., 2009): entropy, accuracy, F1, RNIA and CE (an extension of Clustering Error to subspace clustering). We also applied the same simple transformation of entropy and RNIA, replacing RNIA by 1 − RNIA and entropy by 1 − entropy, so that all evaluation measures range from 0 (low quality) to 1 (high quality). The first three measures (entropy, accuracy and F1) reflect how well objects that should have been grouped together were effectively grouped. The last two measures, RNIA and CE, introduced in (Patrikainen and Meila, 2006), take into account both the way the objects are grouped and the relevance of the subspaces found by the algorithm. For these measures, when the true dimensions of the subspace clusters are not known (for real datasets), all dimensions have been considered as relevant, as in (Müller et al., 2009). The interpretation of these measures should then remain cautious, since the true sets of dimensions are likely to be smaller. Of course, this does not apply to the synthetic datasets, since for them the reference clusters and their dimensions are known. We refer the reader to (Müller et al., 2009) for a detailed presentation of the evaluation measures.

Real datasets
Fig. 4: Evolution of the mean ± standard deviation (dashed lines) of different measures for the best individuals of each generation for 10 runs over the real world datasets shape (red) and pendigits (blue) and the synthetic dataset D20 (green).

We computed the minimum, the maximum and the mean of the evaluation measures over 10 standard runs of ChameleoClust+ using the same parameter setting for all datasets, as justified and given in Section 3.3, except of course for the parameter specifying the maximum number of clusters (c_max), which was tuned according to the simple procedure also given in Section 3.3. As explained in Section 3.1, these results are compared to the ones provided by (Müller et al., 2009), which represent the best possible outputs that could be produced by the main subspace clustering approaches over their respective parameter spaces. More precisely, for these other algorithms, only two outputs were retained on each real dataset: 1) the one computed for the parameter setting that maximizes the F1 measure, and 2) the one obtained when maximizing the accuracy. These two outputs led to two values for each measure in the result tables, the smaller of the two being called best min and the other best max. For all datasets we also give the number of subspace clusters found, the average dimensionality of these clusters, and their coverage. The coverage is here the percentage of objects of the dataset that were associated to clusters, and could be less than 100%. This is the case for algorithms that identified some objects as outliers or as reflecting noise, and also for algorithms that were not able to identify a cluster for these objects. Finally, even though ChameleoClust+ was executed on a computer (2.67GHz CPU) different from the one used by (Müller et al., 2009) (2.3GHz CPU), we report the runtimes, since at least their orders of magnitude can still be compared.
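As a small illustration of the coverage defined above, the following sketch counts the objects that belong to at least one cluster. The function name and the representation of clusters as collections of object indices are assumptions made for this example.

```python
def coverage(clusters, n_objects):
    """Percentage of objects assigned to at least one cluster.

    `clusters` is an iterable of collections of object indices; clusters may
    overlap, in which case a shared object is counted only once.
    """
    covered = set()
    for members in clusters:
        covered.update(members)
    return 100.0 * len(covered) / n_objects


# Two overlapping clusters covering 4 of 8 objects.
print(coverage([{0, 1, 2}, {2, 3}], n_objects=8))  # 50.0
```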
In order to illustrate the performance of ChameleoClust+, we focus on the dataset shape in Table 1, which reproduces the results obtained in (Müller et al., 2009), completed by the results of ChameleoClust+. For the sake of completeness, the detailed evaluation on the other datasets is given in Appendix 7.3. In Table 1, when an algorithm has a best possible run with a higher evaluation than ChameleoClust+, the result is highlighted in gray, and if the evaluation is similar to that of ChameleoClust+, the result is simply emphasized in bold.
For Accuracy and CE, ChameleoClust+ (together with DOC and MINECLUS) is among the algorithms with the best results, even though its parameters were not optimized using the class labels to maximize the Accuracy.
For F1 and RNIA, the best possible runs of DOC and MINECLUS give better results than standard runs of ChameleoClust+, but they tend to split the dataset into more clusters (the same behavior is observed on the synthetic datasets) and have runtimes considerably higher than ChameleoClust+. The best possible runs of PROCLUS achieve better results than ChameleoClust+ for F1, but their coverage falls to about 80% to 90%, leaving an important part of the dataset outside of the clusters.
Looking at entropy, many algorithms have best possible runs leading to a better entropy than ChameleoClust+. However, in clustering tasks, the entropy cannot be interpreted regardless of the number of clusters, because the entropy quality measure usually tends to improve when the number of clusters increases. Indeed, by definition of the entropy measure, the best entropy is obtained in the extreme case of one cluster per object. ChameleoClust+ and three other algorithms (FIRES, P3C, STATPC) are able to avoid spreading the data over too many clusters, but at the cost of a degradation of the entropy measure. Notice that among them, ChameleoClust+ is the only one to obtain such a reasonable number of clusters with a 100% coverage.
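The dependence of entropy on the number of clusters can be checked on a toy example. The sketch below uses a common size-weighted cluster entropy, normalized by the logarithm of the number of classes and reported as 1 − entropy; the exact normalization used in (Müller et al., 2009) may differ, so this is an assumption for illustration only.

```python
import math
from collections import Counter


def entropy_quality(clusters, labels):
    """Return 1 - normalized entropy of the class labels inside the clusters.

    Assumed definition (may differ from the benchmark's exact formula):
    size-weighted average of per-cluster label entropy, divided by
    log(number of classes).
    """
    n_classes = len(set(labels))
    total = sum(len(members) for members in clusters)
    if n_classes < 2 or total == 0:
        return 1.0
    weighted = 0.0
    for members in clusters:
        size = len(members)
        counts = Counter(labels[i] for i in members)
        h = -sum((k / size) * math.log(k / size) for k in counts.values())
        weighted += (size / total) * h
    return 1.0 - weighted / math.log(n_classes)


labels = [0, 0, 1, 1]
one_big_cluster = [[0, 1, 2, 3]]
singletons = [[0], [1], [2], [3]]
```

With these toy labels, a single cluster mixing both classes gets the worst score, while one cluster per object gets the best score although it carries no grouping information.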

Regulation of the subspace clustering
The mutational operators defined in Sections 2.4 and 2.6 allow the ChameleoClust+ genome structure to evolve, potentially reaching different genome sizes and different percentages of functional tuples for each dataset. This allows ChameleoClust+ to adapt, for each dataset, the amount of information encoded within its genome. In addition, the genotype-phenotype mapping, detailed in Section 2.5, permits ChameleoClust+ to encode different numbers of clusters described in subspaces with different dimensionalities. Let us analyze more precisely to what extent ChameleoClust+ takes advantage of these degrees of freedom.
Before describing the results obtained by ChameleoClust+, it should be noticed that most of the time the number of classes within a dataset does not correspond to the number of clusters found by the algorithms. Indeed, there is no constraint requiring that the objects of a class form a single group in their feature space, and consequently it is not surprising to obtain more clusters than classes. Moreover, in some cases, a few algorithms found a very large number of clusters (sometimes even more clusters than objects), this behavior being due to their ability to output overlapping clusters. Table 2 summarizes (over 10 runs) the average number of clusters, their average dimensionalities, the average genome length and the average number of functional tuples in the genome, for each of the seven real world datasets. The subspace clustering models produced by ChameleoClust+ are very different for each dataset: the average number of clusters produced varies from 5.1 for the breast dataset to 28.0 for the vowel dataset, and the average dimensionality of the subspaces found varies from 2.06 for liver to 12.15 for breast. Similarly, the average genome length varies from 172.6 for liver to 1093.9 for pendigits, and the average number of functional tuples goes from 98.8 for liver up to 409.7 for shape. For all datasets, the number of clusters and the average dimensionalities of the subspaces found by ChameleoClust+ are coherent with those found by the other algorithms (see Table 1 for dataset shape and Appendix 7.3 for the others).
Broader comparison
For almost every dataset, the performance of ChameleoClust+ is competitive with respect to the best possible runs of the other algorithms. To compare these approaches in a broader way, we ranked them according to the following eight evaluation criteria: the coverage, the number of clusters found, the quality measures (F1, Accuracy, CE, RNIA and Entropy), and the runtime. For each real world dataset and each criterion, we ranked the eleven algorithms with respect to the column best max, from rank 1 for the lowest performance to rank 11 for the highest one. The ranking for the coverage and for the number of clusters requires further explanation. For the coverage, a method that built less representative models (excluding too many points) had a lower rank than a method that covered a larger part of the dataset. For the number of clusters, the fewer the clusters in the clustering model, the easier their interpretation, so methods that built a reduced number of clusters had a higher rank.
Then, for each of the eight criteria, we computed the average rank over the seven datasets, obtaining eight average ranks for each algorithm. The same was also done with the column best min. The average ranks of the different algorithms are given in Figure 5 (colored dots). The figure also reports the mean of the average ranks of each method (red stars), showing that ChameleoClust+ has the second best mean. However, it should be noticed that no algorithm always wins.
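The ranking scheme described above can be sketched for a single criterion as follows. The data layout, the function name and the tie handling (ties are broken by insertion order) are assumptions made for this example, and the numbers are illustrative rather than taken from the result tables.

```python
def average_ranks(scores, higher_is_better=True):
    """Average rank of each algorithm over datasets for one criterion.

    `scores` maps dataset -> {algorithm: value}. Rank 1 is the lowest
    performance, as in the text. For criteria such as runtime or number of
    clusters, pass higher_is_better=False so that small values rank highest.
    Ties are broken by insertion order (an assumption of this sketch).
    """
    ranks = {}
    for per_algorithm in scores.values():
        ordered = sorted(per_algorithm, key=per_algorithm.get,
                         reverse=not higher_is_better)
        for rank, algorithm in enumerate(ordered, start=1):
            ranks.setdefault(algorithm, []).append(rank)
    return {algorithm: sum(r) / len(r) for algorithm, r in ranks.items()}


# Illustrative values only.
toy = {
    "d1": {"A": 0.9, "B": 0.5, "C": 0.7},
    "d2": {"A": 0.6, "B": 0.8, "C": 0.1},
}
print(average_ranks(toy))  # {'B': 2.0, 'C': 1.5, 'A': 2.5}
```

Averaging the per-criterion results of each algorithm then gives the overall mean shown as red stars in Figure 5.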
For the four algorithms having the best means (i.e., MINECLUS, ChameleoClust+, DOC and PROCLUS), we compared more precisely the number of clusters they produced, their coverage and their runtimes. Table 3 summarizes for each algorithm: (1) the number of datasets where the highest and lowest numbers of clusters found remain interpretable (100 clusters or less); (2) the number of datasets where the highest and lowest coverage are acceptable, i.e., the number of excluded data points is not too high (coverage of at least 95%); and (3) the number of datasets where the shortest and longest executions last for a reasonable time (one hour or less). The results obtained by the other algorithms are also presented for the sake of completeness. MINECLUS, ChameleoClust+, DOC and PROCLUS produced an interpretable number of clusters for each dataset, but PROCLUS and DOC usually produced lower coverage. MINECLUS and DOC had higher runtimes and ran for more than one hour on several datasets. ChameleoClust+ produced good quality results together with low runtimes and high coverage.
However, the different approaches have different characteristics (e.g., global cluster shapes, distance-based/density-based, 100% coverage or not) and, as observed previously, no method leads to the best results on all datasets. For exploratory analysis of the data, a good strategy is to apply several methods from different families, and in particular to use at least one of the approaches that tend to build hyper-spherical shaped clusters. In the comparison, this corresponds to the algorithms PROCLUS, P3C, STATPC and ChameleoClust+. Among these four methods, ChameleoClust+ always reaches a 100% coverage, while the others, on most datasets, do not cluster more than 95% of the objects, as shown in Table 3. Thus, beyond an easy parameter setting and good performance, ChameleoClust+ is an interesting choice to find hyper-spherical shaped clusters in subspaces with a 100% coverage of the data.

Synthetic data
ChameleoClust+ was executed 10 times on each of the 16 synthetic datasets. For each dataset, we retained the run reaching the highest fitness (for the best individual) among the 10 runs (notice that this selection is made without using any external labeling, but only the fitness values). Then, for each evaluation measure, we plotted the measure value obtained with respect to the number of clusters found by each of the 16 selected runs. The results are shown in Figure 6 as red dots. For each evaluation measure, we also plotted in blue the shape of the area where the results of the other algorithms lie (as reported in (Müller et al., 2009)).
Again, for these other algorithms, the results correspond to their best runs over the parameter space. More precisely, for each quality measure, the results were collected as follows. For an algorithm and a given dataset, the parameter space of the algorithm was explored and, using the external true labels, only the execution leading to the highest value of the measure was retained.
In the plots of Figure 6, good performances correspond to regions where the outputs contain about 10 clusters (the real number of clusters) and reach a high value for the quality measures. For almost every synthetic dataset, the number of clusters found by ChameleoClust+ is very close to the real number: ChameleoClust+ always found between 6 and 25 clusters. As reported in (Müller et al., 2009), the other algorithms found between 5 and 50 clusters, except for a few cases where many more clusters were found (up to several thousand or more). Using the default parameter setting method of Section 3.3, most of the evaluation measures for ChameleoClust+ are comparable to the highest evaluations obtained by (Müller et al., 2009) when exploring the parameter space of the other algorithms using the true clusters to guide the search.

Figure 7a and Figure 7b give the runtime of ChameleoClust+ with respect to the number of dimensions and to the number of objects of the synthetic datasets. These curves show that the algorithm scales roughly linearly in both cases, which is consistent with the time complexity obtained in Section 2.9. These confirmations are important in order to infer the sizes of real datasets that could be processed. For example, on the largest real dataset, pendigits, having about 7500 objects and 16 dimensions, the runtime of ChameleoClust+ is less than 4500 seconds (Table 5). This runtime corresponds to a single-threaded version of ChameleoClust+, and it could be reduced by handling individuals in parallel, since each member of the new generation can be obtained independently. For instance, the computation of the offspring population can be distributed over the 32 cores of a typical workstation, in order to decrease the execution time by a factor of about 32. Thus, according to the time complexity given in Section 2.9, showing that the runtime increases linearly with respect to the number of objects (as confirmed by the experiments reported in Figure 7b), it is possible to process a 32 times larger dataset on 32-core hardware with similar execution times. This means that in a reasonable amount of time of about 4500 seconds, ChameleoClust+ could obtain a subspace clustering for a dataset of 7500 × 32 = 240000 objects.
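The extrapolation above relies on two assumptions: a runtime linear in the number of objects and a near-ideal parallel speedup. A minimal sketch of this back-of-the-envelope model is given below; the function names and the `efficiency` parameter (1.0 meaning perfect scaling over the cores) are hypothetical, and real speedups are usually lower than ideal.

```python
def estimated_runtime(base_runtime, base_objects, n_objects, n_cores=1,
                      efficiency=1.0):
    """Runtime estimate under a linear-in-objects model with parallel speedup.

    `efficiency` is a hypothetical parallel efficiency in (0, 1];
    1.0 corresponds to the ideal case assumed in the text.
    """
    speedup = n_cores * efficiency
    return base_runtime * (n_objects / base_objects) / speedup


def max_objects_same_runtime(base_objects, n_cores, efficiency=1.0):
    """Largest dataset processed in the same time as the single-threaded run."""
    return int(base_objects * n_cores * efficiency)


print(max_objects_same_runtime(7500, 32))            # 240000
print(estimated_runtime(4500, 7500, 240000, 32))     # 4500.0
```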

Sensitivity analysis
In order to study the impact of the different parameters on the quality of the subspace clustering models obtained, a sensitivity analysis has been carried out by varying the parameter values one at a time. For each parameter setting, the execution was repeated 10 times, and the average and standard deviation of the two measures that reflect the relevance of the subspaces (RNIA and CE) were computed. As in Section 3.3, we consider the three representative datasets shape, pendigits and D20 to carry out this sensitivity analysis.
A parameter setting was used as a reference: size ω of the sliding sample S_t set to 10% of the dataset size, selection pressure s = 0.5, initial genome size |Γ_init| = 200 elements, mutation rate u_m = 0.00142, population size N = 300 individuals, number of generations set to 5000, and maximal number of subspace clusters c_max = 20. This corresponds to the default values specified in Section 3.3. For each parameter, the effects of changing its value were observed and are discussed in the following.
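The one-at-a-time design can be sketched as follows, using the reference setting above and the value grids listed in the following paragraphs. The class name, the field names and the helper `one_at_a_time` are hypothetical and chosen only for this illustration; the number of generations is handled separately in the text and is therefore not swept here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    """Reference parameter setting (field names are hypothetical)."""
    sample_fraction: float = 0.10      # size of the sliding sample S_t
    selection_pressure: float = 0.5    # s
    initial_genome_size: int = 200     # |Gamma_init|
    mutation_rate: float = 0.00142     # u_m
    population_size: int = 300         # N
    generations: int = 5000
    c_max: int = 20


# Value grids reported in the sensitivity analysis.
GRIDS = {
    "sample_fraction": [0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 1.00],
    "selection_pressure": [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 0.999],
    "initial_genome_size": [10, 50, 100, 200, 300, 400, 500],
    "population_size": [10, 50, 100, 300, 500, 1000],
    "mutation_rate": [0.0001, 0.00034, 0.00142, 0.00249, 0.01],
}


def one_at_a_time(reference, grids):
    """Yield (parameter name, setting) pairs that differ from the reference
    in at most one parameter."""
    for name, values in grids.items():
        for value in values:
            yield name, replace(reference, **{name: value})


settings = list(one_at_a_time(Settings(), GRIDS))
print(len(settings))  # 32 settings, each repeated 10 times in the study
```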

Sliding sample
The results obtained on the three datasets for sample sizes of 5%, 10%, 30%, 50%, 70%, 90% and 100% of the dataset size are given in Figure 8a and Figure 8b. These curves show that the impact of the sample size on the subspace cluster quality is low when the sliding sample used to compute the fitness is about 10% of the dataset size or more. As could be expected, using a small ratio on a small dataset leads to the most important degradations. This is the case for the smallest dataset, shape, which contains only 160 objects, and for which a 5% sample contains only 8 objects. However, for reasonable sample sizes, the samples are representative enough of the whole dataset and good quality clusterings are obtained, as shown in Figure 8. Disjoint, but more representative, samples still create a small instability in the fitness landscape, and even if an elitist selection strategy is used, as described in Section 2.8, this can lead to local decreases of the fitness of the best individuals, as can be observed in Figure 4a. Despite this instability, this figure also shows that there is still a global improvement and convergence of the fitness (over the samples), and the same holds for the clustering quality (over the whole dataset), as depicted in Figure 13.
Selection pressure
Figure 9a and Figure 9b present the results obtained when varying the selection pressure parameter (values 0, 0.1, 0.3, 0.5, 0.7, 0.9 and 0.999). This change has a weak impact on the subspace cluster quality for s in [0.1, 0.9]. This is not the case when the selection pressure is very low (s > 0.9), since according to Section 2.8 almost the same reproduction probabilities are assigned to each individual, and thus promising individuals have almost the same number of children as unadapted ones. This is consistent with the degradation of the clustering quality observed in Figures 9a and 9b. When the selection pressure is very high (s < 0.1), almost the complete future generation comes from the best individual of the current generation (the individual having a very high reproduction probability). In this case, the genetic variability within the new generation is likely to be reduced, and Figure 9 shows a decrease of the cluster quality measures. For the smallest dataset, shape, the current default sample is also small, and is thus unlikely to be very representative. Generating offspring using only the individual that has the best fitness on this sample could then be the cause of the important quality degradation observed for shape when s is below 0.1.
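The qualitative behavior described above can be reproduced with a rank-based selection scheme. Section 2.8 is not reproduced here, so the formula below is an assumption: an exponential ranking in which the individual of rank α (1 = worst, N = best) receives a probability proportional to s^(N − α). It is only meant to show why s close to 1 gives almost uniform probabilities and s close to 0 concentrates reproduction on the best individual.

```python
def reproduction_probabilities(n, s):
    """Rank-based reproduction probabilities p_1 <= ... <= p_n.

    Assumed exponential ranking (not taken from the paper's Section 2.8):
    the weight of rank alpha is s ** (n - alpha), with 0 <= s < 1.
    """
    if not 0 <= s < 1:
        raise ValueError("s must satisfy 0 <= s < 1")
    weights = [s ** (n - alpha) for alpha in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


for s in (0.1, 0.5, 0.999):
    p = reproduction_probabilities(300, s)
    # Probability of the best individual, and best-to-worst ratio.
    print(s, round(p[-1], 4), p[-1] / p[0])
```

Under this assumed scheme with N = 300, the best individual is selected about 90% of the time for s = 0.1, whereas for s = 0.999 the best individual is only about 1.35 times more likely to reproduce than the worst one.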

Initial genome length
Set of parameter values: 10, 50, 100, 200, 300, 400, 500. As illustrated in Figure 10a and Figure 10b, the impact of the initial genome size is minor when the initial size is at least 50. Indeed, the ChameleoClust+ genome size is evolvable and can be modified by large deletions and large duplications; consequently, the initial size does not have a considerable impact on the quality of the results. However, very small initial genomes have a high probability of staying unchanged (no mutation), as shown in Figure 2, and consequently evolution tends to be slower and results tend to be poorer (for the same total number of generations).
Population size
Set of parameter values: 10, 50, 100, 300, 500, 1000. As illustrated in Figure 11a and Figure 11b, the larger the population, the better the results. Indeed, smaller populations may only explore a small portion of the solution space at each generation and also tend to have a smaller genetic variability. This leads to a slower evolution and poorer results. At the other end of the parameter range, the gain induced by having more individuals tends to become smaller as the population size increases.
Mutation rate
Set of parameter values: 0.0001, 0.00034, 0.00142, 0.00249, 0.01. We decided to test the mutation rates delimiting the suitable mutation rate range defined in Section 3.3 (u_m = 0.00034 and u_m = 0.00249), the default mutation rate (u_m = 0.00142) and two values outside the suitable range (u_m = 0.01 and u_m = 0.0001). If the mutation rate is chosen inside the range defined in Section 3.3, it does not have a major impact on the subspace cluster quality, as shown in Figure 12. If we choose a mutation rate far outside this range, the subspace cluster quality decreases. This is consistent with Figure 2: when the mutation rate is too low, the evolution process becomes very slow, as most of the individuals do not mutate. Conversely, when the mutation rate is too high, it becomes harder for the organisms to converge towards a suitable subspace clustering.

Number of generations
We ran ChameleoClust+ 10 times for each chosen dataset over 120000 generations. The different evaluation measures were computed every 100 generations. As illustrated in Figures 13a and 13b, the more generations we let the algorithm evolve, the better the results. However, the improvements tend to become less significant and the results finally reach a plateau. As discussed in the paragraph related to the number of generations in Section 3.3, the early generations are characterized by a fast evolution of the genome structure and of the subspace cluster quality. Well-positioned core points are rapidly found, and it is not necessary to wait for too many generations to get good results.

In this section, the impact of the choice of the parameter values on the cluster quality has been discussed. It should be noticed that similar effects can be observed on the fitness itself, as shown for the population size in Figure 3 and for the number of generations in Figure 4a. For the sake of completeness, the fitness curves obtained when modifying the other parameters are given in Appendix 7.2 (Figures 17a, 17b, 17c and 17d).

Possible alternative models
Aside from its evolvable genome size driven by large duplications and deletions, the ChameleoClust+ approach relies on two other choices: an elitist reproduction method and the presence of non-functional elements. In this section, their effects on the quality measures RNIA and CE are reported using the datasets pendigits, shape and D20 (similar trends were observed on the fitness values, and the corresponding figures are given in Appendix 7.1).
For the synthetic dataset D20, the parameter c_max was set to the true number of groups (i.e., c_max = 10). For the real datasets, this parameter was set to the number of classes (c_max = 9 for shape and c_max = 10 for pendigits), and other runs were performed using twice the number of classes (c_max = 18 for shape and c_max = 20 for pendigits), since the real number of groups is not necessarily equal to the number of classes.
ChameleoClust+ was executed 10 times for each dataset and each value of c_max, using the setting described in Section 3.3 for the other parameters. Figures 14a and 14b show the impact of elitism on CE and RNIA. In these experiments, elitism does not seem to have a significant positive or negative effect. However, it is still a way to avoid the possible loss of a good current solution during the search. Since, according to the complexity given in Section 2.9, it does not increase the cost of generating a new population, there is no advantage in removing it from ChameleoClust+. To test an alternative model without non-functional elements, all elements in the initial genomes were set to be functional, and the point mutations that could transform them into non-functional elements during evolution were simply discarded. Figures 15a and 15b report the impact of these non-functional elements on the quality measures, showing a positive effect that turns out to be significant for pendigits and D20.

Related work
Many approaches have been investigated for subspace clustering in the literature using various clustering paradigms. The reader is referred for instance to (Kriegel et al., 2009), (Müller et al., 2009), and (Parsons et al., 2004) for detailed reviews and comparisons of the best methods and main categories:
- The cell-based approach, which defines clusters as hyper-rectangles lying in specific subspaces and containing more than a given number of objects. Clusters are usually constructed by discretizing the data space into axis-parallel cells and then aggregating promising cells. These selected cells are commonly the ones containing more objects than a threshold given as a parameter. Other typical parameters are the number or the size of the cells.
- The density-based approach, in which clusters are dense groups of objects in space. A cluster can have an arbitrary shape, but must be separated from the other clusters by low density regions. This approach defines dense regions as regions where, within a given radius, a number of objects exceeding a minimum threshold can be found. Clusters are built by joining together the objects from adjacent dense regions.
- The clustering-oriented approach, which usually defines properties of the targeted clustering such as the expected number of clusters or the average cluster dimensionality. According to these constraints, the objects are grouped together mainly using distance-based similarity. Most of these methods tend to build hyper-spherical shaped clusters in particular subspaces.
It should be noticed that subspace clustering is also related to paradigms known as co-clustering, bi-clustering and pattern-based clustering. According to Sim et al. (2013), the main difference with subspace clustering is that these approaches consider the objects and the features of the dataset interchangeably, and simultaneously cluster objects and features exhibiting common patterns. Several survey articles have been dedicated to these paradigms; we refer the reader to Charrad and Ahmed (2011), Sim et al. (2013) and Mounir and Hamdy (2015) for a detailed presentation.
Even if many evolutionary clustering approaches exist (Hruschka et al., 2009), very few of them address the subspace clustering problem. An early approach was presented in (Sarafis et al., 2003), introducing a subspace clustering evolutionary algorithm that uses a rule-based representation to encode axis-parallel hyper-rectangular disjoint clusters. This algorithm is a member of the cell-based subspace clustering family. It uses task-specific mutation and recombination operators, and requires a non-evolutionary first stage to find promising clusters in 2D subspaces. More recently, a different evolutionary approach has been presented in (Vahdat et al., 2010). It is also based on a first non-evolutionary clustering stage, used here to find a set of candidate cluster positions in each dimension. Next, it uses a genetic algorithm to produce subspace clusters by combining the candidate positions found at the previous step. The final stage is then to run a second genetic algorithm to find the best combination of subspace clusters to form the whole clustering of the data. This approach is related to the clustering-oriented family. The ChameleoClust+ algorithm presented in this paper also falls into the clustering-oriented category, but it is a single-stage and fully evolutionary approach, without any preliminary stage to identify clusters in lower dimensional spaces. In addition, it relies on generic bio-like mutation operations that are not specific to the subspace clustering task. Moreover, ChameleoClust+ has been shown to perform well when compared to state-of-the-art subspace clustering algorithms using a reference evaluation framework.
More recently, different extensions and variants of the subspace clustering problem have been investigated. For instance, in (Aksehirli et al., 2013) and (Aksehirli et al., 2015), the authors introduced a grouping of objects based on the sharing of similar neighborhoods, instead of relying on traditional distance measures. The handling of noise has also received increasing attention, as in (Wang and Xu, 2016), (Vidal and Favaro, 2014) and (Soltanolkotabi et al., 2014), which tackled subspace clustering in the context of very noisy datasets. Another useful aspect in a clustering process is the integration of user knowledge by means of constraints. In (Hu et al., 2015), the authors proposed such a constraint-based subspace clustering method to guide the search for the cluster content and their subspaces.

Conclusion
In this paper, we presented ChameleoClust+, an evolutionary algorithm for subspace clustering. Its key underlying principle is to use an evolvable genome structure to find various numbers of clusters in subspaces of different dimensionality. The genome undergoes local point mutations and is shaped by two kinds of global rearrangements: large deletions and large duplications. Beyond cluster locations, this enables the evolution of the number of clusters and of the number of dimensions used by each cluster.
ChameleoClust+ was shown to be very competitive with respect to state-of-the-art algorithms using a reference evaluation framework that includes both real and synthetic datasets (varying size, number of dimensions and proportion of noise). A parameter setting method has been described, and was effective for all the datasets of the framework. In addition, a sensitivity analysis showed that the impact of the parameters related to the evolution strategy (population size, mutation rate, ...) is low for a large portion of the parameter space. The only parameter not related to evolution is the maximum number of desired clusters. To set its value, a simple procedure was given and adopted for all the datasets used in the evaluation.
Directions for future work include investigating the impact of more complex transfers, such as crossover in vertical transfer or bio-inspired horizontal transfer operations, in this evolutionary subspace clustering approach. Another promising direction is to extend ChameleoClust+ to handle data incrementally (e.g., streaming data), a context that requires adapting the standardization of the data. This could be handled by periodically recomputing the needed statistics over a recent part of the data, and then modifying the current core point locations by taking into account the shift and scaling induced by the new standardization parameters.

0.08 0.05 0.17 0.16 0.12 0.08 0.69 0.43 0.13 0.12 0.98 0.95 3 2 7.0 4.7 1610 625
STATPC 0.22 0.22 0.56 0.56 0.06 0.06 0.12 0.12 0.14 0.14 1.00 1.00 39 39 10.0 10.0 18485 16671
ChameleoClust+ 0.41 0.37 0.42 0.38 0.17 0.13 0.65 0.54 0.45 0.

Fig. 1: Parent of the new individual I_k drawn from the current population according to this population ranking, and with probabilities p_1 < ... < p_α < ... < p_N.

Fig. 2: ϕ value computed as a function of the mutation rate u_m for different genome sizes. The suitable chosen range of genomic variability and its related mutation rate range are delimited by dashed lines. The retained mutation rate is marked by a solid vertical line.

Fig. 3: Mean fitness values ± standard deviation for the best individual of the last generation for each one of the 10 runs on shape (red), pendigits (blue) and D20 (green) as a function of the population size.
Evolution of the percentage of functional tuples.

Fig. 5: Mean over the different datasets of the ranking of each algorithm for the maximum and the minimum value obtained for each evaluation measure: Accuracy, Entropy, F1, CE, RNIA, Number of clusters, Coverage, Runtime (colored dots) and average ranking for each method (red stars).

Fig. 6: Accuracy, F1, RNIA, Entropy and CE as a function of the number of clusters for the subspace clustering having the best fitness among 10 runs for the synthetic datasets (red dots), and region where the results of the state-of-the-art algorithms lie.

Fig. 8: Mean ± standard deviation of quality measures for the best individual of the last generation for each one of the 10 runs on shape (red), pendigits (blue) and D20 (green) as a function of the dataset sample size relative to the dataset size, $|S_t| / |S|$ (percentage of the dataset size).

Fig. 10: Mean ± standard deviation of quality measures for the best individual of the last generation for each one of the 10 runs on shape (red), pendigits (blue) and D20 (green) as a function of the initial genome size.

Fig. 11: Mean ± standard deviation of quality measures for the best individual of the last generation for each one of the 10 runs on shape (red), pendigits (blue) and D20 (green) as a function of the population size $N$.
of RNIA quality measure.

Fig. 13: Evolution of the mean ± standard deviation (dashed lines) of quality measures for the best individual of each generation over 10 runs of ChameleoClust+ for shape (red), pendigits (blue) and D20 (green).
Fitness with (red) and without (blue) non-functional elements.

Fig. 16: Mean ± standard deviation of the fitness of the best individual of the last generation for 10 runs on shape, pendigits and D20 under different conditions.

Fig. 17: Mean ± standard deviation of the fitness of the best individual of the last generation for each one of the 10 runs on shape (red), pendigits (blue) and D20 under different conditions.

Table 2: Average number of clusters and average dimensionality per cluster found for each dataset.

Table 3: Number of datasets where the conditions on runtime (less than one hour), coverage (more than 0.95%) and number of clusters (less than 100) were fulfilled.

Mean ± standard deviation of quality measures for the best individual of the last generation for each one of the 10 runs on shape (red), pendigits (blue) and D20 (green) as a function of the selection pressure parameter $s$.