Saint Petersburg State University
Department of Mathematical Game Theory and Statistical Decisions
Lozkins Aleksejs
Master's thesis
The stability-based approach to cluster number determination
Specialization 01.04.02
Applied Mathematics and Informatics
Master's Program Game Theory and Operations Research
Research advisor
D.Sc., Professor
Bure V. M.
Saint Petersburg
2016
Contents

Introduction
1 The formal description of existing methods
1.1 The common existing approaches
1.2 Literature review
2 Formalization of the approach
2.1 Methodology description
2.2 An algorithm based on data inaccuracies
2.3 Algorithm based on extended set
3 Computational experiments on artificial data
3.1 Artificial data description
3.2 Algorithms tuning and application on the artificial data
4 Applications of algorithms for applied problems
4.1 The cluster analysis of European countries by unemployment rates
4.2 The clustering analysis of European Union countries by demographic parameters
Bibliography
Introduction
Cluster analysis is a data learning tool. Data mining tasks are commonly differentiated into clustering, classification, regression, association rule learning, etc. Each of these data analysis tools has its own specificity. Clustering is the subset of data mining tools that solves the automatic grouping of data without a supervisor. Cluster analysis has two main problems: the choice of the clustering algorithm and the choice of the number of clusters.

Data structures have different representations, and there is no common clustering algorithm for every kind of dataset. Many clustering algorithms exist: hierarchical clustering, k-means algorithms, the HCS clustering algorithm, biclustering, etc. For this reason, sets of clustering algorithms have been developed for different cases of data.

Supervised learning (classification) has information about the groups and their properties; clustering, in turn, is the grouping of data without information about the properties of the clusters. For this purpose different criteria are introduced [1]. There is no common criterion for every type of dataset, and only a local optimum of the clustering can be found. The variety of criteria depends on the clustering application area. Cluster analysis is applied in medicine, computational biology, computer vision, economics, genetics, etc.

The proposed methods are related to an additional problem in data analysis: data inaccuracy. Both the clustering problem and the data clearness problem are employed in the development of a criterion for cluster number estimation. The goal of the dissertation is to produce a cluster number validation criterion and to address the data imprecision problem at the same time. The current work proposes new algorithms of cluster number determination which, in a certain sense, improve the existing algorithm.
The materials are published in 7 printed works: 2 of them are articles [2, 3], one paper [4] is indexed in Scopus, 4 are conference abstracts [5, 6, 7, 8], and 1 paper, applying the developed method to real data, is under publication in the geographical section of the Molodoy Uchenyi (Young Scientist) journal.

The dissertation consists of an introduction, four sections, a conclusion and a bibliography.

The introduction presents the relevance of the work, an introduction to the problems of the considered topics, and a short presentation of the goals of the research.

Section 1 presents the existing approaches to the considered problem and a description of the methodologies used; the basic notation is introduced and the literature review is conducted.

Section 2 presents the additional specific notation for the proposed methods; the similarity measure between clusterings is introduced and its metric properties are proved; the two algorithms and their descriptions are exposed.

Section 3 shows the algorithms' work on the artificial data and tests their operability.

Section 4 presents two results of applying the proposed algorithms to real data, together with the interpretation of the results.

In the Conclusion the research results are summed up and the main conclusions are formulated.
Chapter 1
The formal description of existing methods

1.1 The common existing approaches
Nowadays, the problem of true cluster number estimation is unsolved. This does not mean, however, that there are no alternatives for estimating the number of groups in cluster analysis. A wide spectrum of methodologies exists, comprising a large number of criteria for estimating the number of natural groups in the data. The criteria are not optimal in the general case, as they give only a local optimum.

The stability methodology in cluster analysis has a direct connection with the current work, which is the main reason for our interest. In contrast to mathematical analysis, the stability concept in clustering analysis does not have a strict mathematical interpretation. This concept depends on logical beliefs, and in most cases in clustering analysis it takes the form of an estimation of cluster variety. The basis of this variety estimation is considered to have the inverse property: the similarity between the intermediate results, which are generated under an assumption.

Before stating the central concepts, the basic notation should be introduced. A common theoretical notation does not exist, because clustering analysis has been developed in multiple scientific areas (biology, medicine, computer science, statistics, etc.). The notation presented here will be simple to use in mathematical descriptions.
Let us denote by X = {x_1, x_2, ..., x_n} the finite set of elements x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) of the d-dimensional Euclidean space R^d; further, the elements from X will be considered as d-dimensional vectors without an additional second index. The clustering method will be denoted as a function α(X). The data partition, the result of clustering, is α(X) = {S^1, S^2, ..., S^k}, where k is the number of partitions or clusters, ∀S^j ⊂ X, and S^j ∩ S^l = ∅ for all j ≠ l. Additional notation for each of the concept descriptions will be introduced directly where the concepts are presented.
Most stability-based concepts use distances between clusterings. A variety of distance measures can be found in the literature; for them, the following representation of a clustering needs to be introduced:

$$c_{ij} = \begin{cases} 1, & \text{if } x_i \text{ and } x_j \text{ are in the same cluster and } x_i \neq x_j; \\ 0, & \text{otherwise.} \end{cases}$$

Each clustering α_q(X) has its own matrix C^q. The dot product of the matrices C^1, C^2 is defined by the formula

$$\langle C^1, C^2 \rangle = \sum_{i,j} c^1_{ij} c^2_{ij}.$$

The dot product computes the number of pairs of elements which are clustered together in both clusterings. The dot product has the following property:

$$\langle C^1, C^2 \rangle \le \sqrt{\langle C^1, C^1 \rangle \langle C^2, C^2 \rangle}.$$

This property is the Cauchy–Schwarz inequality. Based on it, the cosine similarity measure was introduced by Fowlkes and Mallows in [9]:

$$cor(C^1, C^2) = \frac{\langle C^1, C^2 \rangle}{\sqrt{\langle C^1, C^1 \rangle \langle C^2, C^2 \rangle}}.$$

This is a normalized correlation between two clusterings, which can be considered as a distance between clusterings or a cluster similarity measure.
A significant number of similarity measures between clusterings have been introduced in cluster analysis; the commonly used ones are presented in this part of the work. The dot product helps us to express the following measures:

$$J(C^1, C^2) = \frac{\langle C^1, C^2 \rangle}{\langle C^1, C^1 \rangle + \langle C^2, C^2 \rangle - \langle C^1, C^2 \rangle},$$

the Jaccard coefficient [10], a similarity measure where only non-negative matches are considered. The squared norm of a matrix, ||C||^2 = ⟨C, C⟩, is proposed in [11] for the representation of the matching coefficient,

$$M(C^1, C^2) = 1 - \frac{1}{n^2} \| C^1 - C^2 \|^2,$$

a way of using a weighted penalty for non-matches in the similarity estimation.
The choice of similarity measure depends on the aim of the analysis. In most cases precise measures are excessive and impose an additional computational load, which is undesirable.
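To make the measures above concrete, the following minimal R sketch (not part of the original thesis; the names comembership, dotprod, corC and jaccard are illustrative) builds the co-membership matrix C from a vector of cluster labels and evaluates the cosine and Jaccard measures:

#Illustrative sketch: co-membership matrix and dot-product measures
comembership = function(labels) {
  C = outer(labels, labels, "==") * 1  #c_ij = 1 if same cluster
  diag(C) = 0                          #the definition requires x_i != x_j
  return(C)
}
dotprod = function(C1, C2) sum(C1 * C2)
#Fowlkes-Mallows cosine similarity cor(C^1, C^2)
corC = function(C1, C2) dotprod(C1, C2) /
  sqrt(dotprod(C1, C1) * dotprod(C2, C2))
#Jaccard coefficient J(C^1, C^2)
jaccard = function(C1, C2) dotprod(C1, C2) /
  (dotprod(C1, C1) + dotprod(C2, C2) - dotprod(C1, C2))
#Example: two clusterings of six elements
C1 = comembership(c(1, 1, 1, 2, 2, 2))
C2 = comembership(c(1, 1, 2, 2, 2, 2))
corC(C1, C2); jaccard(C1, C2)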
The first concept of true cluster number determination which we consider is presented in [12]. The idea consists of generating a random subsample X_1 from the general set X. The condition on X_1 is a fixed number of elements from X. This number of elements is characterized by the sampling ratio f, which describes the proportion of the number of elements in the subsample X_1 to the whole number of elements in X. The fraction of points sampled is f > 0.5, and it should be close, but not equal, to 1. This condition provides the following property for two different subsamples X_1 and X_2 with the same f: X_1 ∩ X_2 ≠ ∅.
The main assumption of this concept is that the clustering structure of X should be inherited by the subsamples. Under this assumption, the clusterings of two different subsamples with the same ratio should produce the same clusterings on the intersection of these subsamples. The comparison of two different clusterings is performed via similarity measures. The true clustering is stable with respect to sub-sampling.
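A rough R sketch of this sub-sampling idea follows (an illustration under our own choices of f, k and k-means, not the exact procedure of [12]; comembership() and corC() come from the previous sketch):

set.seed(1)
X = as.data.frame(matrix(rnorm(200), ncol = 2))
f = 0.8; n = nrow(X); k = 3
#two random subsamples with the same sampling ratio f > 0.5
idx1 = sample(n, round(f * n)); idx2 = sample(n, round(f * n))
cl1 = kmeans(X[idx1, ], centers = k)$cluster
cl2 = kmeans(X[idx2, ], centers = k)$cluster
common = intersect(idx1, idx2)  #nonempty because f > 0.5
#labels of the common points inside each subsample clustering
lab1 = cl1[match(common, idx1)]
lab2 = cl2[match(common, idx2)]
#similarity of the two clusterings on the intersection
corC(comembership(lab1), comembership(lab2))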
The second stability concept for cluster number validation is based on supervised learning. The idea is formulated in [13], and additional modifications (modifications of the subsample generation) are added by Lozkins and Bure.

The solution of a clustering is presented as a set of labels Y, where y_i ∈ Y represents the number of the cluster containing the element x_i ∈ X; that is, the cluster label of x_i is y_i. Supervised learning defines a classifier function φ which, given a training set, draws a conclusion about an element's label. Many classification approaches exist, each with its own classification function; in this work the theory of classification is omitted.
The work [13] does not provide methods for choosing the training set. Therefore the subsample idea of the previous concept is used. The training set is the subsample X_1 ⊂ X with fraction level f, and the inheritance of the cluster structure is implied.
Let us assume the classifier φ_1 is trained on the subsample (X_1, Y_1), where Y_1 is the clustering result of X_1. The classification result Y = φ_1(X) is transformed to the matrix C^1. The clustering result of X is represented by C.
In this idea the stability estimation is based on the average similarity measures for each considered cluster number. If the variety of the clusterings produced by different classifiers, with the same fraction ratio and for a fixed number of clusters, is low, then the clustering is stable with respect to the other numbers of clusters considered.

The challenge of the presented concepts is the fraction ratio and, in the latter case, the classifier function. In some cases this can cause problems (for example, f is too small and the subsample does not include a whole cluster) and the result can be inaccurate. Thus a large number of variants of the algorithms needs to be tested.
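As an illustration of the classifier-based concept, here is a hedged R sketch (1-nearest-neighbour is our own stand-in for φ, since [13] leaves the classifier free; comembership() and corC() are from the earlier sketch):

library(class)  #for knn()
set.seed(2)
X = as.data.frame(matrix(rnorm(200), ncol = 2))
f = 0.8; n = nrow(X); k = 3
idx1 = sample(n, round(f * n))
Y1 = kmeans(X[idx1, ], centers = k)$cluster       #clustering of X_1
Yhat = knn(train = X[idx1, ], test = X, cl = Y1)  #Y = phi_1(X)
Y = kmeans(X, centers = k)$cluster                #direct clustering of X
#compare C^1 (from the classifier) with C (from the clustering)
corC(comembership(as.integer(Yhat)), comembership(Y))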
1.2 Literature review

Many methodologies for correct cluster number estimation can be found. Different points of view on data interpretation create room for disjoint ideas for solving clustering problems. The main existing methodologies for cluster number validation in clustering analysis are described in this section.
The methodology of considering intra- and inter-cluster variances as functions of the number of clusters is the basis of the criteria represented in the following research: [14, 15, 16, 17, 18, 19, 20], [21] (the C-index) and [1] (the Gap Statistic method). The methods described in the works mentioned above use a geometrical perspective.
The gap statistic considers the intra-cluster dispersion, also called the error measure, W_k versus the number of clusters k. The discrete curve W_k monotonically decreases as the number of clusters increases. A sudden jump down of the curve at a certain value of k indicates an appropriate number of clusters (the elbow criterion).

The C-index considers the sum S of distances over all pairs from the same cluster; S_min is the sum of the l shortest distances between pairs, and S_max is the sum of the l largest distances in the whole set, where l is the number of within-cluster pairs. The index is defined by the formula

$$C = \frac{S - S_{min}}{S_{max} - S_{min}}.$$

The better clustering is determined by a small C-index value.
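A small R sketch of the C-index as defined above (our own implementation, for illustration only):

c_index = function(X, labels) {
  D = as.matrix(dist(X))
  up = upper.tri(D)
  d = D[up]                               #all pairwise distances
  same = outer(labels, labels, "==")[up]  #pairs from the same cluster
  S = sum(d[same])                        #sum over within-cluster pairs
  l = sum(same)                           #number of within-cluster pairs
  ds = sort(d)
  Smin = sum(ds[1:l])                                #l shortest distances
  Smax = sum(ds[(length(ds) - l + 1):length(ds)])    #l largest distances
  (S - Smin) / (Smax - Smin)
}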
The nonparametric approach defines the true number of clusters as the number of areas with high data density. Each cluster corresponds to a domain of attraction, i.e. an area with high data concentration. Wishart in [22] was the first to suggest density modes for the research of cluster structures. Later the idea was developed by Hartigan in [23].
Multiple methodologies exist apart from the ones mentioned. However, this work is focused on the methodology described below, as it is the closest to the one proposed by the author of the work. Methods involving the cluster stability concept are based on the consideration of two clustered selections at a time, taken from the initial dataset. At a high stability level, the composition of the clusters can be interpreted as their reliability [24]. The idea described is the basis for the criteria for the determination of the optimal number of clusters used in the work of Levine and Domany [25] and of Ben-Hur, Elisseeff and Guyon [12, 26, 27].
Another methodology uses goodness-of-fit tests. The paper [28] introduced the X-means algorithm for optimal cluster number determination via the Bayesian information criterion. The mean of Hotelling's T-square p-values is considered in [29] for distance estimation between samples. The earlier works by Volkovich, Brazly and Morozensky [30] describe a way of cluster number determination through a statistical criterion.
The listed circumstances suggest that a correct clustering should be stable with respect to random perturbations of the initial data, at least at a low variance. While the value of the variance can be varied, the mathematical expectation of the random perturbation is always equal to zero, suggesting the absence of regular errors in the initial data. In the papers [31] and [32] similar approaches are used; these works are focused on cluster similarity measures and consider normal and binomial noise distributions respectively.

Data perturbations of an initial dataset X as a grouping stability estimation are considered in [2, 3, 4]. The main assumption of these methods is the inaccuracy of the studied data: if a low perturbation is added to the data, then the clustering should remain the same. The variances of the perturbations are the control parameters for the stability validation. The acceptable clustering is the grouping of the data sample that is robust to random perturbations of the investigated data compared to the other considered variants. The use of these methods with real data is presented in [3, 7] and will be described in later sections.
Chapter 2
Formalization of the approach
2.1 Methodology description
The idea of the approach comes from the studied data clearness problems. Explored initial data always has inaccuracies; these errors have different origins and rates and a casual character. Erroneous data arise due to measurement and/or rounding errors. There are also errors in mathematical models. When analyzing socioeconomic information, situations can be observed where the initial numerical data contain various kinds of inaccuracies. Occasionally, data are intentionally misrepresented, e.g. economic data relating to the income of legal entities or natural persons, or socio-economic data collected as a result of selective surveys of various social groups. Apart from those mentioned, other possible sources of observational inaccuracy exist. Consequently, the analyzed data is imprecise.

In this kind of clustering task the result cannot rely on the truth of the existing data. Thus, the correct value of an observation can differ from the data investigated. These circumstances suggest that a correct clustering should be stable with respect to random perturbations of the initial data, at least at a low variance. While the value of the variance can be varied, the mathematical expectation of the random perturbation is always equal to zero, suggesting the absence of regular errors in the initial data [2, 4].
Two versions of algorithms, which determine the acceptable clustering that is robust to random perturbations of the underlying data, are presented below. Each version has its own right to exist. Additional notation has to be introduced.
The new dataset, generated by adding noise to each element of X, will be denoted as

$$X_\sigma = \{x_{i,\sigma} \mid x_i \in X,\; x_{i,\sigma} = x_i + \varepsilon_{i,\sigma},\; i = 1, \dots, n\},$$

where ε_{i,σ} = (ε^1_{i,σ}, ..., ε^d_{i,σ})^T and the ε^j_{i,σ} are mutually independent random elements from the uniform distribution U(−δ, δ) or, in a more general case, from a beta distribution [33] defined on (−σ, σ). The noise generation can be performed from other distributions, for example, from a truncated normal distribution (i.e. the initial normal distribution has a zero mathematical expectation and a variance equal to σ², N(0, σ²)) within a given truncation range (−δ, δ). The choice of distribution is contingent on the initial supposition about the nature of the initial data errors. In the absence of information about the distribution of the initial data inaccuracies, the uniform distribution is taken as the benchmark.
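A minimal R sketch of the generation of X_σ in the benchmark uniform case (perturb is an illustrative name, not a function from the thesis):

#Add independent U(-delta, delta) noise to every coordinate
perturb = function(X, delta) {
  noise = matrix(runif(nrow(X) * ncol(X), min = -delta, max = delta),
                 nrow = nrow(X))
  X + noise
}
#Example: Xsigma = perturb(X, delta = 0.1)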
The result of a partition of the set X into k disjoint subsets is denoted as X = ∪_{i=1}^k S_i, where S_i ∩ S_j = ∅ for all i ≠ j, i = 1, ..., k, j = 1, ..., k. The partition algorithm, or the approach, lets us denote α_k(X) = {S_1, ..., S_k}.

One of the approaches uses the perturbed set X_σ. The perturbation level σ should be estimated by an expert, or a set {σ} must be considered; in the second case the complexity of the stability estimation grows. The assumption of data inaccuracy is represented by the perturbed data. The grouping of X_σ into k clusters will be denoted as S_1^σ, S_2^σ, ..., S_k^σ, which are k disjoint sets whose union is ∪_{i=1}^k S_i^σ = X_σ.
The second approach uses the extended perturbed set X_σ^+ = X ∪ X_σ with not more than 2n elements. The partition of X_σ^+ into k disjoint groups will be denoted as S_1^{σ+}, S_2^{σ+}, ..., S_k^{σ+}. Let us define the sets S_i^{σ−} = S_i^{σ+} \ X_σ, i = 1, ..., k, which are k disjoint sets whose union is the set X. This approach to partitioning is defined as α_k^{σ−}(X) = {S_1^{σ−}, ..., S_k^{σ−}}.
In this case the stability concept is viewed as the group variability under the extension of the data by the perturbed data. The data extension does not end with the addition of one perturbed initial set: the number of perturbed sets can differ, and the noise for each set can have its own distribution parameters. For example, let us consider r perturbed sets, X_σ^+ = X ∪ X_{σ1} ∪ ... ∪ X_{σr}, with no more than (r+1)n elements. In most cases r = 1 is enough; a large number of perturbed sets adds complexity but does not give additional precision.
The cluster stability concept implies the comparison of clusterings. A concept of sufficiently rough estimation for the comparison of cluster consistency is proposed. Consider the partitions S_1, S_2, ..., S_k and S_1^σ, S_2^σ, ..., S_k^σ. If the relevant matrices C and C^σ of the partitions match, then the distance between the partitions is 0, and 1 otherwise. Using the dot product, the rough partition similarity measure introduced by Lozkins and Bure can be represented by the following formula:

$$lb(\alpha_k(X), \alpha_k(X_\sigma)) = \begin{cases} 0, & \text{if } \dfrac{\langle C, C^\sigma \rangle}{\sqrt{\langle C, C \rangle \langle C^\sigma, C^\sigma \rangle}} = 1; \\ 1, & \text{otherwise.} \end{cases}$$

Lemma. The lb(α_k(X), α_k(X_σ)) similarity measure has distance properties.

Proof. The proposed similarity measure satisfies the non-negativity axiom: lb(α_k(X), α_k(X_σ)) ≥ 0; the set of its possible values is {0, 1}.
Let us show that the identity of indiscernibles axiom holds. The condition lb(α_k(X), α_k(X_σ)) = 0 holds if and only if ⟨C, C^σ⟩ / √(⟨C, C⟩⟨C^σ, C^σ⟩) = 1. If C = C^σ, then ⟨C, C^σ⟩ = ⟨C, C⟩ = ⟨C^σ, C^σ⟩ and the ratio equals 1. The Cauchy–Schwarz inequality ⟨C, C^σ⟩ ≤ √(⟨C, C⟩⟨C^σ, C^σ⟩) gives ⟨C, C^σ⟩ / √(⟨C, C⟩⟨C^σ, C^σ⟩) ≤ 1. If C ≠ C^σ, then ⟨C, C^σ⟩ < ⟨C, C⟩ or ⟨C, C^σ⟩ < ⟨C^σ, C^σ⟩, because ⟨C, C⟩ and ⟨C^σ, C^σ⟩ contain the maximal numbers of matches (by the definition of the matrix C in Section 1); consequently ⟨C, C^σ⟩ / √(⟨C, C⟩⟨C^σ, C^σ⟩) < 1 and lb(α_k(X), α_k(X_σ)) = 1.
The proposed similarity measure has the symmetry property, inherited from the symmetry property of the dot product: lb(α_k(X), α_k(X_σ)) = lb(α_k(X_σ), α_k(X)).
Let us show that the triangle inequality holds. If lb(α_k(X), α_k(X_σ)) = 0, then lb(α_k(X), α_k(X_σ)) ≤ lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) holds, since lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) ≥ 0 by the non-negativity condition. The second case is lb(α_k(X), α_k(X_σ)) = 1, which means C ≠ C^σ. Let us show that lb(α_k(X), α_k(X_σ)) ≤ lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) holds here as well. Without loss of generality, let C^β = C; consequently C^β ≠ C^σ. In this case lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) = 1 and the inequality holds. If C^β ≠ C and C^β ≠ C^σ, then lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) = 2 and the inequality holds too.

Proved.
The rough similarity measure is chosen to save computational costs. Other, more precise similarity measures can be used, but this would significantly increase the computational complexity of the method. In most cases it is excessive and unjustified.
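For illustration, the lb distance can be sketched in R through the co-membership matrices from Section 1 (comembership() is from the earlier sketch; checking C = C^σ directly is equivalent to checking that the cosine similarity equals 1):

lb = function(labels1, labels2) {
  C1 = comembership(labels1)
  C2 = comembership(labels2)
  as.integer(!all(C1 == C2))  #0 if the partitions match, 1 otherwise
}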
The proposed distance between clusterings is adapted to the algorithms in the later sections. The main attention is paid to the determination of the perturbation parameter using this metric, which allows the number of settings to be reduced.

The meaning of clustering stability, in our understanding, is defined by the level of variety of the clusterings under random perturbations of the data, compared with the initial clustering. The superior, precise and proper clustering consists of the clusters which have the minimal level of dependence on the perturbation level.
Definition. The variability frequency ν is the ratio of the number of observed mismatches to the total number of repetitions.

The variability frequency is the result of the algorithm's work; it shows the stability level for each considered number of clusters. This value represents the normalized number of mismatches between the clusterings of the initial and perturbed data and, consequently, the clustering reliability.
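In R terms (with lbOutcomes as an illustrative name for the 0/1 vector of lb values collected over the repetitions), the variability frequency is simply:

nu = mean(lbOutcomes)  #share of mismatches among the repetitions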
2.2 An algorithm based on data inaccuracies
In Section 2.1 the data inaccuracy problem was mentioned, as well as some ways of interpreting the data in simulation studies for its solution. The determination of the true clustering is considered the main problem which the algorithm tries to solve. The clustering algorithm and the choice of the number of clusters are important problems in cluster analysis. The algorithm described in this section finds a local optimum of the cluster number and partially solves the clustering algorithm problem (which clustering algorithm approach is better). Cluster robustness is used for the validation of the number of groups and is the basic assumption of the approach.
The algorithm is based on the stability concept. The meaning of stability in this approach is connected to errors in the data, or rather to the representation of errors in the data. In mathematical science this idea is used in stochastic programming.

The algorithm gives a local optimum, because clustering analysis has a lot of clustering methods, and there are methods whose solutions are not precise (fuzzy). The different natures of the algorithms add complexity to the cluster analysis problem. Using expert opinion, a set of clustering algorithms can be chosen; this set is a parameter of the proposed algorithm, and the local optimum is relative to this set, since in a real situation not all clustering methods can be considered. In most cases this is not necessary, and one or two clustering approaches can be considered (expert opinion).
There are a lot of approaches for cluster number determination (Section 1.2); the majority uses an interval of admissible cluster number values, and this holds in the considered algorithm as well. A sequence of integers or an integer interval is enough; expert opinion is not necessary here. Let us denote this set as [kmin, kmax].
The perturbed set X_σ is the crown (the most important part) of the algorithm setting. If the level of the data errors is known, then considering a fixed σ is enough; otherwise a set {σ} should be considered. An important question is how to choose the set of noise distribution variances. This set can be estimated experimentally, starting from small variance values: if the variability frequencies are relatively low for all numbers of clusters, then the variances are increased; if the variability frequencies are near 1, then the variances are decreased. Let us denote the variances set Ω = {σi}, i = 1, ..., s.

The algorithm works with the two sets X and X_σ. The clusterings of these sets into k clusters should give the same results in most cases of the noise addition. There is a precision control parameter, which reflects the number of generations of X_σ. In each generation the clusterings are compared. The comparison uses the lb distance between clusterings (another similarity measure can be used, but the lb distance is more suitable for this algorithm).
The algorithm scheme is presented below:

Input: X, [kmin, kmax], Ω = {σi}, i = 1..s;
Output: νσi,k;
Requires: a clustering algorithm αk(X), a similarity measure between clusterings;

for (k in kmin..kmax) do
    αk(X);
    for (σi in Ω) do
        NumOfDiffers = 0;
        for (i in 1..NumOfRepetitions) do
            Xσi generation;
            NumOfDiffers = NumOfDiffers + lb(αk(X), αk(Xσi));
        end for
        νσi,k = NumOfDiffers / NumOfRepetitions;
    end for
end for

Here NumOfRepetitions is the precision control parameter mentioned before; the counter NumOfDiffers is reset for every pair (σi, k).
The result of the algorithm's work is the frequency matrix {νσi,k}. This representation of the result is complex and hard to interpret. Introducing statistics into the result is a way of simplifying the result matrix: for example, the median or the mean over the variances transforms the matrix into a vector. The vector represents the variability frequency νk for each considered number of clusters.
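For example, if the frequencies are stored in a matrix whose rows are indexed by σi and whose columns are indexed by k (freqMatrix is an illustrative name), the simplification is one line of R:

nuK = apply(freqMatrix, 2, median)  #or mean; one value per number of clusters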
The first version of the proposed algorithm is presented in the work [2]; the practical usage of this approach is considered in [3] and will be discussed in Sections 3 and 4.1.
2.3 Algorithm based on extended set
In Section 2.2 an algorithm based on the data errors was considered. In this section another algorithm uses a similar assumption, with some differences and significant modifications.

The work [23] describes clusters as islands in the sea, where the heights above sea level represent the group densities. There is a similarity between the concept described above and the algorithm proposed in this section. The idea offered represents an artificial uplift of the islands. If the heights of the islands above sea level increase under the uplift, then the clusters are more stable; if the area of the islands grows, then the clusters are less stable. The idea is presented in [4], and its application is being published in the Molodoy Uchenyi journal.
Input: X, [kmin, kmax], Ω = {σi}, i = 1..s;
Output: νσi,k;
Requires: a clustering algorithm αk(X), a similarity measure between clusterings;

for (k in kmin..kmax) do
    αk(X);
    for (σi in Ω) do
        NumOfDiffers = 0;
        for (i in 1..NumOfRepetitions) do
            Xσi+ = X ∪ Xσi;
            αk^(σi−)(X);
            NumOfDiffers = NumOfDiffers + lb(αk(X), αk^(σi−)(X));
        end for
        νσi,k = NumOfDiffers / NumOfRepetitions;
    end for
end for
The islands analogy helps to understand the concept of the proposed method. The islands uplift procedure is represented by the extended dataset X_σ^+. The σ has the same meaning as in the previous section. The union X_σ^+ of the initial and perturbed datasets makes stable clusters more unchangeable, while unstable clusters have a high variety level compared with the initial clustering. The extended set has an adjustable size (2n, 3n, ...). A larger size helps to separate the stable clusterings from the unstable ones more accurately, but the computational complexity increases significantly.

The simplification of the result using statistics takes place in this approach as well as in the previous section. The similarity measure is free to be selected and should be computationally simple.
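One repetition of this extended-set approach may be sketched in R as follows (an illustration assuming hierarchical clustering for αk; perturb() and lb() come from the earlier sketches):

one_repetition = function(X, k, delta) {
  base = cutree(hclust(dist(X)), k = k)  #alpha_k(X)
  Xext = rbind(X, perturb(X, delta))     #X_sigma^+ = X union X_sigma
  ext = cutree(hclust(dist(Xext)), k = k)
  extOnX = ext[1:nrow(X)]                #restriction to X: the sets S_i^(sigma-)
  lb(base, extOnX)                       #0 if the clusterings coincide
}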
Chapter 3
Computational experiments on artificial data

3.1 Artificial data description
The algorithms mentioned in the previous sections are tested on articial dataset.
This experiment shows the correct work of the proposed
approaches and helps to describe parameters of the algorithms.
The articial dataset consists of ve samples generated from multivariate normal distribution with 100 elements in each sample with dierent
means and variances.
The parameters of the samples are presented in the
Table 3.1.
Table 3.1: Multivariate normal distribution parameters

Sample number   Mean vector   Covariance matrix
Sample 1        (1, 1)        diag(0.15, 0.15)
Sample 2        (1, 4)        diag(0.15, 0.15)
Sample 3        (3.5, 2.5)    diag(0.2, 0.2)
Sample 4        (6, 1)        diag(0.15, 0.15)
Sample 5        (6, 4)        diag(0.15, 0.15)
The data shown represent five samples, each of which can be recognized as a cluster without special research. In this case, the result of the cluster number validation is known. Figure 3.1 displays the artificial data. The two-dimensional dataset is denoted as X.

Figure 3.1: Union of 5 samples from the multivariate Gaussian distribution
3.2 Algorithms tuning and application on the artificial data

The presented algorithms use similar input parameters. For comparison of the results the same parameters will be used; the parameters of the algorithms are presented in Table 3.2. The approach for cluster number determination which uses the extended set is applied with the union of the initial dataset and one perturbed dataset (only one perturbed dataset).
Table 3.2: Algorithm parameters

Parameter                            Value
Cluster number set [kmin, kmax]      [2, 10]
Variances sequence                   σ1 = 0.02; σi = σi−1 + 0.02, i = 2..20
Clustering algorithms                k-means algorithm
Distance between clusterings         lb(αk(X), αk(Xσi))
Number of repetitions/simulations    50
Perturbation distributions           Uniform distribution, Gaussian distribution
The set of perturbation distribution parameters is the most important. The correct estimation of the variances set plays the main role in the variability frequency calculation. This set depends on the average intra-data standard deviation level.

Figure 3.2 (a) and (b) present the variability frequencies for 2, 3, 5 and 6 clusters depending on the perturbation variance values. If the frequencies for each variance value are low (less than 0.5), then the variance level is not correct and should be increased. If the frequencies for each variance value are near one, then the variance level is not correct either and should be reduced.

Figure 3.2 (a) and (b) are generated on a correct variances set, because there exists a cluster number with a low level of frequencies (the black line corresponds to the number of clusters equal to 5), which is interpreted as more stable, while the other cluster numbers have high variability frequencies and are not stable.

In both cases the results do not have significant differences. This has to do with the shared main methodology of the algorithms and the simple case of the studied data.
Figure 3.2: The frequencies for both algorithms
(a) The frequencies resulting from the algorithm described in Section 2.2
(b) The frequencies resulting from the algorithm described in Section 2.3
Figure 3.2 (a) and (b) represent the results where the variability frequencies νσi,k depend on two parameters: the number of clusters and the perturbation level. In the case when the set of numbers of clusters consists of 2 to 5 numbers, this presentation of the results is simple to analyze and draw conclusions from.
When the number of clusters is greater than 5, as in our case, the mean and the median are used for simplification of the results. The elbow criterion is used in cluster number validation theory [1]. The introduction of statistics into our algorithms makes it possible to use this criterion.

The experimental results of the k-means clustering algorithm for the first cluster validation method are presented in Figure 3.4 (a), (b), (c) and (d). Four different cases are considered there, where two perturbation distributions and two statistics are taken.

Figures 3.4 (a), (c) present the medians of the frequencies, and Figures 3.4 (b), (d) represent the means of the frequencies for the different distributions. The two different statistics show similar results and confirm their truthfulness. Averaging the frequencies allows the curves in Figure 3.2 to be represented as points without loss of information.
Figures 3.4 (a), (b) are generated using uniformly distributed perturbations, and Figures 3.4 (c), (d) using normally distributed ones. The results for the uniform case are better, with a minimal frequency of about 0.3, while in the case of the normal distribution the minimal frequency is 0.4. The variances set for both distributions was the same when the experiment was carried out.
Figure 3.6 has the same interpretation as Figure 3.4. The results of the proposed algorithms on the artificial dataset are clearly similar and differ insignificantly at the non-stable numbers of clusters.

The goal of the cluster number validation algorithms is to obtain sharp dips of the curve at the stable points (the elbow criterion). In Figure 3.4 and Figure 3.6 this number of clusters is equal to 5. This is the expected solution, thereby providing confirmation of the proposed algorithms. The results cannot be interpreted as an optimal solution, because a limited set of cluster numbers and a limited number of clustering algorithms were considered.
Figure 3.4: The k-means clustering algorithm is considered for the cluster number validation algorithm described in Section 2.2.
(a) The median of the frequencies for each considered number of clusters; perturbations from the uniform distribution
(b) The mean of the frequencies for each considered number of clusters; perturbations from the uniform distribution
(c) The median of the frequencies for each considered number of clusters; perturbations from the normal distribution
(d) The mean of the frequencies for each considered number of clusters; perturbations from the normal distribution
Figure 3.6: The k-means clustering algorithm is considered for the cluster number validation algorithm described in Section 2.3.
(a) The median of the frequencies for each considered number of clusters; perturbations from the uniform distribution
(b) The mean of the frequencies for each considered number of clusters; perturbations from the uniform distribution
(c) The median of the frequencies for each considered number of clusters; perturbations from the normal distribution
(d) The mean of the frequencies for each considered number of clusters; perturbations from the normal distribution
Chapter 4
Applications of algorithms for applied problems

4.1 The cluster analysis of European countries by unemployment rates
The application of cluster analysis to data exploration helps in finding dependences or groups of elements with similar properties. A dependence may occur under a large number of parameters, where simple tools are insufficient. The results of cluster analysis provide an answer whose interpretation may yield new information or a discovery.
This section describes the practical application of the algorithm from Section 2.2 to an economic dataset. The unemployment rates of European countries are considered as the underlying dataset. The data consist of 3 parameters, or features: the unemployment rates of European countries in the I, II and III quarters of 2014. An analogous presentation of the results can be found in [7], where inflation rates were considered instead of unemployment rates. The first real application of the discussed approaches started on this dataset and is presented in [3]. The dataset of unemployment rates is enclosed in Appendix 1. The parameters of the algorithm are presented in Table 4.1.
The number of elements in the investigated data is equal to 42. The maximal tested number of clusters is 10; consequently, the average content of a cluster is about 4 elements. Such groups are finely divided and may add
Table 4.1: Algorithm parameters

Parameter                            Value
Cluster number set [kmin, kmax]      [2, 10]
Variances sequence                   σ1 = 0.07; σi = σi−1 + 0.07, i = 2..20
Clustering algorithms                k-means algorithm, hierarchical clustering
Distance between clusterings         lb(αk(X), αk(Xσi))
Number of repetitions/simulations    50
Perturbation distributions           Uniform distribution
noise into the research (a clustering into 42 clusters would be more stable). Only the uniform distribution U(−δ, δ) is considered; some practical advantages of this distribution are listed in Section 3.
Figure 4.1 displays the variability frequencies νσi,k for the cluster number set [2, 5] for the k-means clustering method. The variances set is correct: each group number contains frequencies less than 0.1 and greater than 0.5. There are two stable points, 2 and 3, with average frequencies 0.15 and 0.3 respectively (Figure 4.1 (b) and (c)).
Figure 4.3 represents the same results as Figure 4.1, but with a different clustering algorithm. The other clustering algorithm gives closer frequency curves for the stable grouping numbers. On average, the cluster numbers 2 and 3 continue to be stable compared to the other cluster number cases.

The clusterings of the underlying set into 2 and 3 clusters are respectively the same for both clustering algorithms. In the first case, the clusters consist of the countries presented in Table 4.2.
In the second cluster, countries resulting from the collapse of Yugoslavia prevail. Bosnia and Herzegovina, Croatia, Macedonia and Serbia are post-communist countries. All countries of the second cluster have close geographical locations. Greece in 2014 was in an economic crisis and had problems with its unemployment rate in that year.

The clustering results for 3 clusters have the composition presented in Table 4.3.
Most of the countries from the second cluster are countries which participated in the Soviet Union. After the Union's collapse, the economies of most
Figure 4.1: The k-means algorithm is considered as the clustering algorithm for the variability frequency calculation
(a) The frequencies for different numbers of clusters (black: 2, red: 3, green: 4, blue: 5 clusters)
(b) The median of the frequencies for each considered number of clusters; perturbations are uniform
(c) The mean of the frequencies for each considered number of clusters; perturbations are uniform
Table 4.2: The clustering of European countries by unemployment rate for two clusters

Cluster Nr.1: Albania, Armenia, Austria, Belgium, Bulgaria, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Hungary, Iceland, Ireland, Italy, Kazakhstan, Latvia, Lithuania, Luxembourg, Malta, Moldova, Montenegro, Netherlands, Norway, Poland, Portugal, Romania, Russia, Slovakia, Slovenia, Sweden, Switzerland, Turkey, Ukraine, United Kingdom

Cluster Nr.2: Bosnia and Herzegovina, Croatia, Greece, Macedonia, Serbia, Spain
Figure 4.3: The hierarchical clustering algorithm is considered for the variability frequency calculation
(a) The frequencies for different numbers of clusters (black: 2, red: 3, green: 4, blue: 5 clusters)
(b) The median of the frequencies for each considered number of clusters; perturbations are uniform
(c) The mean of the frequencies for each considered number of clusters; perturbations are uniform
Table 4.3: The clustering of European countries by unemployment rate for three clusters

Cluster Nr.1: Austria, Belgium, Czech Republic, Denmark, Estonia, Finland, France, Germany, Iceland, Ireland, Kazakhstan, Luxembourg, Malta, Moldova, Netherlands, Norway, Romania, Russia, Sweden, Switzerland, Turkey, Ukraine, United Kingdom

Cluster Nr.2: Albania, Armenia, Bulgaria, Croatia, Cyprus, Ireland, Italy, Latvia, Lithuania, Montenegro, Poland, Portugal, Slovakia, Slovenia

Cluster Nr.3: Bosnia and Herzegovina, Greece, Macedonia, Spain
of the countries had an equal level of development, and this affected today's economic level.

The interpretation of the result admits many points of view; additional background or an analysis of the results by an expert in the economic area would make this outcome more valuable.
The cluster analysis of European countries by unemployment rates has been carried out. Two clustering algorithms were tested with uniformly distributed perturbations. Two stable clusterings were obtained and interpreted in this section.
4.2 The clustering analysis of European Union countries by demographic parameters

Current demographic trends in Europe are characterized by a fairly high mortality rate, a low birth rate, low natural population growth and high migration activity.
Cluster analysis reveals significant natural groupings of data. Abstracting from the physical sense, it is possible to determine divisions and connections in the tested data at the technical level. This analysis provides non-trivial results.
The heterogeneity of the measurement scales by which the demographic indicators are assessed requires additional analysis tools. According to the initial values, local normalization has been applied.
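The thesis does not spell out the normalization formula; as one plausible reading, a per-indicator (local) min-max scaling in R could look like this (normalize is an illustrative name):

#Scale every indicator (column) to [0, 1] independently
normalize = function(X) {
  as.data.frame(lapply(X, function(v) (v - min(v)) / (max(v) - min(v))))
}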
The following demographic indicators were the basis of the analysis: birth rate, mortality rate, migration, natural population increase and the proportion of the urban population in the total population. The indicator values are enclosed in Appendix 2 [35].
The clustering was carried out on different sets of parameters, and the most interesting results are presented in this section. The second algorithm, described in Section 2.3, is used for the stable clustering estimation. The settings of the algorithm are presented in Table 4.4.
Table 4.4: Algorithm parameters

Parameter                            Value
Cluster number set [kmin, kmax]      [2, 10]
Variances sequence                   σ1 = 0.005; σi = σi−1 + 0.005, i = 2..10
Clustering algorithms                hierarchical clustering
Distance between clusterings         lb(αk(X), αk(Xσi))
Number of repetitions/simulations    50
Perturbation distributions           Uniform distribution
Figure 4.5: The hierarchical clustering algorithm is considered for the calculation of the variability frequencies
(a) Five demographic indicators are studied (birth rate, mortality rate, migration, natural population increase and the proportion of the urban population in the total population)
(b) Two demographic indicators are studied (birth rate, natural population increase)
(c) Three demographic indicators are studied (natural growth rate, migration rate and the proportion of the urban population in the total population)
There are five indicators for each country. Clusterings over all subsets of these indicators were carried out in this experiment. The outcomes with stable points are presented below.

Figure 4.5 (c) displays the cluster number five as the more stable one. The data have the groups presented in Table 4.5.
Table 4.5: Clusters for three demographic indicators

Cluster Nr.1: Austria, Germany, Netherlands, Finland, Greece, Spain, Italy, Portugal, Bulgaria, Bosnia and Herzegovina, Hungary, Latvia, Lithuania, Macedonia, Poland, Romania, Serbia, Slovakia, Slovenia, Croatia, Montenegro, Czech Republic, Estonia

Cluster Nr.2: Belgium, United Kingdom, Luxembourg, France, Switzerland, Denmark, Norway, Sweden, Andorra, Cyprus, Malta, Monaco, San Marino

Cluster Nr.3: Ireland, Albania

Cluster Nr.4: Liechtenstein

Cluster Nr.5: Iceland
The first group, which includes Western European and Eastern European countries with zero or close-to-zero values of natural increase, a mostly positive migration rate and a rather high proportion of urban population, is the largest. The second cluster includes the developed countries of Western and Northern Europe, with positive values of natural growth, a high migration increase and high values of the urban population ratio.

The third cluster includes only two countries, with a high natural population increase, a very low migration increase and low average values of the proportion of urban population. It comprises an economically moderately developed country of Europe, Ireland, and a country with a relatively low level of economic development among the other European countries, Albania. In Albania, the highest natural population growth is achieved mainly due to the high proportion of the Muslim population. These countries are less attractive to migrants than the developed countries of Western and Northern Europe with higher standards of living. Liechtenstein forms the fourth cluster, with approximately equal low values of natural population growth, migration and the proportion of urban population.
The fifth cluster includes only one country, Iceland, with a fairly good indicator of natural population increase among the other European countries, a negative migration growth and a very high proportion of urban population.
The five clusters are built into a rating of the level of demographic development. According to the three indicators, the best demographic situation is observed in the countries that make up the second cluster. The fourth cluster is in second place and includes a country with positive indicators affecting the changes in the population, but with a low proportion of urban population. The fifth cluster takes third place in the ranking and includes a country with high rates of natural population increase and of the proportion of urban population, but with a low migration increase. The first cluster is in fourth place in the ranking; the countries belonging to it have small positive or negative natural population growth and migration, that is, low population dynamics. At the same time, a sufficiently high proportion of urban population is observed in the first cluster. The third cluster includes countries with a very uneven distribution of the demographic indicator values, which distinguishes it from the other clusters.
Figure 4.5 (a) represents the division into four clusters according to five indicators and demonstrates trends similar to the previous clustering: a favorable demographic situation in the economically developed countries of Western and Northern Europe and a less favorable one in the countries of Eastern and South-Eastern Europe. The clusters are presented in Table 4.6.
In the region of Central and Eastern Europe, a major factor in the decline of the birth rate is the change in the socioeconomic order that followed the collapse of the world socialist system. Earlier, the Central and Eastern European socialist countries belonged to one group; now they belong to the countries whose economies are in transition (from socialist to market). Under socialism there were many conditions that had a positive effect on the birth rate: the availability of jobs, better housing, benefits for children. As a result of the transformation of the economic system from the early 1990s to the beginning of the XXI century, the countries of Central and Eastern
Table 4.6: Clusters for five demographic indicators and partition into 4 groups

Cluster Nr.1: Austria, Germany, Finland, Greece, Italy, Portugal, Bulgaria, Bosnia and Herzegovina, Hungary, Latvia, Lithuania, Macedonia, Poland, Romania, Serbia, Slovakia, Slovenia, Croatia, Montenegro, Czech Republic, Estonia

Cluster Nr.2: Belgium, United Kingdom, Luxembourg, Netherlands, France, Switzerland, Denmark, Norway, Sweden, Andorra, Spain, Cyprus, Malta, Monaco, San Marino

Cluster Nr.3: Ireland, Albania, Iceland

Cluster Nr.4: Liechtenstein
Europe lost their social security; the social assistance system was not yet as highly developed as in Western and Northern Europe, and people had to seek employment under difficult social and economic conditions. All this did little to increase the birth rate [36].
Table 4.7: Clusters for five demographic indicators and partition into 2 groups

Cluster Nr.1: Austria, Germany, Finland, Greece, Italy, Portugal, Bulgaria, Bosnia and Herzegovina, Hungary, Latvia, Lithuania, Macedonia, Poland, Romania, Serbia, Slovakia, Slovenia, Croatia, Montenegro, Czech Republic, Estonia, Belgium, United Kingdom, Luxembourg, Netherlands, France, Switzerland, Denmark, Norway, Sweden, Andorra, Spain, Cyprus, Malta, Monaco, San Marino

Cluster Nr.2: Ireland, Albania, Iceland, Liechtenstein
Table 4.7 presents the second variant of the grouping of European countries by five indicators. It is virtually identical to the previous result (Table 4.6): the first and second clusters are merged into one (the first), while the third and fourth clusters are combined into the second cluster. The first cluster is characterized by birth and mortality rates that are average for the European region, low positive, zero or negative values of natural population and migration growth, and relatively high values of the proportion of urban population. In the second cluster there are countries with a positive natural population growth and negative or low values of the migration
population growth. The second cluster is composed of small countries (by area), including those with a moderate level of economic development.
Clustering in terms of the birth rate and the coefficient of migration growth allows a grouping of the countries based on the migration gain and on one of the components of natural population increase, births (Table 4.8), without taking into account mortality rates and other demographic indicators. The clustering result differs from the two previous clustering outcomes. Luxembourg, Cyprus and Monaco, small states (micro-states) of Europe, constitute the third cluster. They have average European migration growth values and a high birth rate compared to other European countries. The first cluster includes the bulk of the studied European countries, with average European birth rates and relatively low or negative growth in migration. This demographic situation can be described as stagnant. The second cluster includes only two countries with a relatively high birth rate but a low negative migration increase, so the overall population growth is negative or slightly positive.
Table 4.8: Clusters for two demographic indicators and partition into three groups

Cluster Nr.1: Austria, Germany, Finland, Greece, Italy, Portugal, Bulgaria, Bosnia and Herzegovina, Hungary, Latvia, Lithuania, Macedonia, Poland, Romania, Serbia, Slovakia, Slovenia, Croatia, Montenegro, Czech Republic, Estonia, Belgium, United Kingdom, Luxembourg, Netherlands, France, Switzerland, Denmark, Norway, Sweden, Andorra, Spain, Malta, San Marino, Iceland

Cluster Nr.2: Ireland, Albania

Cluster Nr.3: Cyprus, Monaco, Liechtenstein
Clustering carried out on different groups of indicators leads to the conclusion that the European countries with a higher level of economic development form a cluster of countries with a more stable demographic situation, in which a higher proportion of urban population is observed and the fluctuations in the values of the different indicators are small. Most of the countries of Europe make up, as a rule, large clusters, which contain the countries with a stagnant but more favorable demographic situation compared with the rest of Europe. Small countries with a moderate level of economic development form small clusters.
Thus, it is possible to identify the following demographic trends as a result of the clustering of European countries. Major countries in Western and Northern Europe with a high level of economic development are often grouped into one cluster. The small countries of Western Europe and the countries of Central, Eastern and Southern Europe, as a rule, are grouped into another cluster. Less economically developed European countries, moderately developed Western European countries and the small countries of Western and Northern Europe are grouped in a separate cluster.
Conclusion
In this dissertation we have investigated new approaches to cluster number determination under the clustering stability concept. The stability concept is based on the assumption of data inaccuracies, which gives us a treatment of the data inaccuracy problem itself as well as of the cluster number estimation. The existing notation is complemented by additional notation.

Two different algorithms, which use the same assumptions but different interpretations of the data inaccuracies, are proposed. The algorithms are tested on an artificial dataset, and the results of the algorithms' work are compared with the expected result. The adjustment of the algorithms' parameters has been demonstrated, and an interpretation of the parameters is given in Section 3.

Two experiments on real data are carried out. A stable clustering of the European countries' unemployment rates under the proposed criterion is found, and a historical and geographical interpretation of the result is provided. A cluster analysis of the European Union countries' demographic indices is carried out.
Bibliography
[1] Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters via the gap statistic. J. Royal Statist. Soc. B, Vol. 63(2), pp. 411–423.

[2] Lozkins A., Bure V. M. (2016). A probabilistic approach to the determination of the locally optimal number of clusters. Vestnik Sankt-Peterburgskogo Universiteta, Ser. 10: Applied Mathematics, Informatics, Control Processes, Issue 1, pp. 28–38. (In Russian)

[3] Lozkins A. (2015). Cluster analysis of European countries by unemployment rate. In: Proceedings of the XLVI International Scientific Conference of Postgraduates and Students "Control Processes and Stability". St. Petersburg: Fedorova G. V. Publishing House, Vol. 2(18), No. 1, pp. 641–646. (In Russian)

[4] Lozkins A., Bure V. M. (2015). The method of clusters stability assessing. In: "Stability and Control Processes" in Memory of V. I. Zubov (SCP), 2015 International Conference. IEEE, pp. 479–482.

[5] Lozkins A. (2015). Cluster analysis of European countries by unemployment rate. In: Book of Abstracts of the Conference "Control Processes and Stability" 2015. (In Russian)
[6] Lozkins A., Bure V. M. (2015). An empirical approach to assessing the stability of clustering methods. In: Proceedings of the III International Conference "Stability and Control Processes", dedicated to the 85th anniversary of the birth of Professor V. I. Zubov, Corresponding Member of the RAS. St. Petersburg: Fedorova G. V. Publishing House, pp. 431–433. (In Russian)
[7] Lozkins A. (2015). Clustering of European Countries by an Inflation Rate and Clusters Research. Abstracts of the 20th International Conference of Mathematical Modelling and Analysis, p. 56.

[8] Lozkins A., Bure V. M. (2016). The approach for estimation of clustering robustness. 12th German Probability and Statistics Days, Book of Abstracts, pp. 197–198.

[9] Fowlkes, E. B., Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, Vol. 78, pp. 553–584.

[10] Jaccard, P. (1901). Distribution de la Flore Alpine: dans le Bassin des Dranses et dans quelques regions voisines. Rouge.

[11] Jain, A., Dubes, A. (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, New Jersey.

[12] Ben-Hur, A., Elisseeff, A., Guyon, I. (2002). A stability based method for discovering structure in clustered data. Proceedings of the Pacific Symposium on Biocomputing, pp. 6–17.

[13] Lange, T. et al. (2004). Stability-based validation of clustering solutions. Neural Computation, Vol. 16, No. 6, pp. 1299–1323.
[14] Dunn, J. C. (1974). Well Separated Clusters and Optimal Fuzzy Partitions. Journal of Cybernetics, Vol. 4, pp. 95–104.

[15] Calinski, R., Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, Vol. 3, pp. 1–27.

[16] Hartigan, J. A. (1985). Statistical theory in clustering. Journal of Classification, Vol. 2, pp. 63–76.

[17] Krzanowski, W., Lai, Y. (1985). A criterion for determining the number of groups in a dataset using sum of squares clustering. Biometrics, Vol. 44, pp. 23–34.

[18] Sugar, C., James, G. (2003). Finding the Number of Clusters in a Data Set: An Information Theoretic Approach. Journal of the American Statistical Association, Vol. 98, pp. 750–763.

[19] Gordon, A. D. (1994). Identifying genuine clusters in a classification. Computational Statistics and Data Analysis, Vol. 18, pp. 561–581.

[20] Milligan, G., Cooper, M. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, Vol. 50, pp. 159–179.

[21] Hubert, L., Schultz, J. (1976). Quadratic assignment as a general data-analysis strategy. British Journal of Mathematical and Statistical Psychology, Vol. 29, pp. 190–241.

[22] Wishart, D. (1969). Mode analysis: A generalization of nearest neighbor which reduces chaining effects. Numerical Taxonomy, pp. 282–311.

[23] Hartigan, J. (1975). Clustering Algorithms. New York: John Wiley.
[24] Cheng, R., Milligan, G. W. (1996). Measuring the influence of individual data points in a cluster analysis. Journal of Classification, Vol. 13, pp. 315–335.

[25] Levine, E., Domany, E. (2001). Resampling Method for Unsupervised Estimation of Cluster Validity. Neural Computation, Vol. 13, pp. 2573–2593.

[26] Ben-Hur, A., Elisseeff, A., Guyon, I. (1998). Statistical learning theory and randomized algorithms for control. IEEE Control Systems, Vol. 12, pp. 69–85.

[27] Ben-Hur, A., Guyon, I. (2003). Methods in Molecular Biology / M. J. Brownstein and A. Khodursky (Eds.). Humana Press, pp. 159–182.

[28] Pelleg, D., Moore, A. (2000). X-means: Extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conf. on Machine Learning. San Francisco: Morgan Kaufmann, pp. 727–734.

[29] Volkovich, Z., Brazly, Z., Toledano-Kitai, D., Avros, R. (2010). The Hotelling's metric as a cluster stability measure. Computer Modelling and New Technologies, Vol. 14 (4), pp. 65–72.

[30] Volkovich, Z., Brazly, Z., Morozensky, L. (2008). A statistical model of cluster stability. Pattern Recognition, Vol. 41 (7), pp. 2174–2188.

[31] Barzily, Z., Golani, M., Volkovich, Z. (2008). On a simulation approach to cluster stability validation. Special Issue, Mathematical and Computer Modeling in Applied Problems, Institute Informatics Problems, RAS, pp. 86–112.

[32] Toledano-Kitai, D., Avros, R., Volkovich, Z., Weber, G.-W., Yahalom, O. (2013). A binomial noised model for cluster validation. Journal of Intelligent and Fuzzy Systems, Special Issue: Recent Advances in Intelligent & Fuzzy Systems, pp. 417–427.
[33] Bure V. M., Parilina E. M. Probability Theory and Mathematical Statistics. St. Petersburg: Lan, 416 pp. (In Russian)

[34] Trading Economics [electronic resource]. URL: http://tradingeconomics.com/country-list/unemployment-rate (date of access: 28.01.2015).

[35] Starkova N. V. (2015). Economic and Social Geography of European Countries: A Study Guide. St. Petersburg: SOLO, 119 pp. (In Russian)

[36] Klupt M. (2008). Demography of the World's Regions. St. Petersburg: Piter, 347 pp. (In Russian)
Appendix
Appendix 1: The unemployment rates of European countries for 2014.
Country                  First quarter   Second quarter   Third quarter
Albania                  13.86           13.45            12.91
Armenia                  17.80           17.50            17.10
Austria                  8.90            7.73             9.00
Belgium                  8.50            8.50             8.60
Bosnia and Herzegovina   44.18           43.86            43.66
Bulgaria                 13.00           11.40            10.80
Croatia                  22.47           19.67            17.67
Cyprus                   15.87           16.07            16.27
Czech Republic           8.46            7.60             7.36
Denmark                  4.17            3.97             4.00
Estonia                  8.50            6.90             7.50
Finland                  9.03            9.63             7.53
France                   10.10           10.10            10.40
Germany                  5.07            5.00             5.00
Greece                   27.17           26.83            26.03
Hungary                  8.60            8.03             7.63
Iceland                  5.43            5.03             4.90
Ireland                  11.85           11.70            11.20
Italy                    12.67           12.53            12.87
Kazakhstan               5.10            5.03             5.07
Latvia                   11.90           10.70            10.60
Lithuania                12.40           11.20            9.10
Luxembourg               7.10            7.23             7.23
Macedonia                28.39           28.22            27.90
Malta                    6.00            5.80             5.84
Moldova                  5.10            3.60             3.30
Montenegro               14.94           14.11            13.48
Netherlands              8.70            8.57             8.07
Norway                   2.77            3.27             3.60
Poland                   13.70           12.50            11.67
Portugal                 15.10           13.90            13.10
Romania                  7.00            6.93             6.77
Serbia                   20.80           20.30            17.60
Russia                   5.53            5.03             4.87
Slovakia                 13.46           12.81            12.57
Slovenia                 14.10           13.07            12.50
Spain                    25.93           24.47            23.67
Sweden                   8.55            8.63             7.23
Switzerland              3.40            3.03             2.97
Turkey                   10.00           8.97             10.13
Ukraine                  9.40            8.60             8.90
UK                       6.97            6.50             6.07
Appendix 2: The demography indicators of European Union countries.
Country                  Birth rate   Mortality rate   Natural population   Migration population   Proportion of urban
                                                       growth rate          growth rate            population
Austria                  9            9                0                    5                      67
Albania                  13           7                6                    -15                    54
Andorra                  9            4                5                    4                      90
Belgium                  12           9                3                    6                      99
Bulgaria                 9            15               -6                   -1                     73
Bosnia and Herzegovina   8            9                -1                   0                      46
UK                       13           9                4                    2                      80
Hungary                  9            13               -4                   1                      69
Germany                  8            11               -3                   5                      73
Greece                   9            10               -1                   -1                     73
Denmark                  10           9                1                    4                      87
Ireland                  16           6                10                   -7                     60
Iceland                  14           6                8                    -1                     95
Spain                    10           8                2                    -3                     77
Italy                    9            10               -1                   4                      68
Cyprus                   12           7                5                    14                     62
Latvia                   10           14               -4                   -2                     68
Lithuania                10           14               -4                   -7                     67
Liechtenstein            10           6                4                    3                      15
Luxembourg               11           7                4                    19                     83
Macedonia                11           10               1                    0                      65
Malta                    10           8                2                    3                      100
Monaco                   6            6                0                    13                     100
Netherlands              10           8                2                    1                      66
Norway                   12           8                4                    9                      80
Poland                   10           10               0                    0                      61
Portugal                 9            10               -1                   -4                     38
Romania                  9            12               -3                   0                      55
San Marino               9            7                2                    7                      84
Serbia                   9            14               -5                   1                      59
Slovakia                 10           10               0                    1                      54
Slovenia                 11           9                2                    0                      50
Finland                  11           10               1                    3                      68
France                   13           9                4                    1                      78
Croatia                  10           12               -2                   -1                     56
Montenegro               12           10               2                    0                      64
Czech Republic           10           10               0                    1                      74
Switzerland              10           8                2                    8                      74
Sweden                   12           10               2                    5                      84
Estonia                  11           12               -1                   -5                     69
Appendix 3: The artificial data generation program code in R.
#Open the library
library("mvtnorm", lib.loc="~/R/win-library/3.1")
# Covariance matrix
varcov = matrix(c(0.15, 0, 0, 0.15), 2)
#Generation of five two-dimensional samples from
#the multivariate normal distribution
y11 = rmvnorm(100, mean=c(1,1), varcov)
y21 = rmvnorm(100, mean=c(1,4), varcov)
y31 = rmvnorm(100, mean=c(3.5,2.5), matrix(c(0.2, 0, 0, 0.2), 2))
y41 = rmvnorm(100, mean=c(6,1), varcov)
y51 = rmvnorm(100, mean=c(6,4), varcov)
#Union of samples
y1 = c(y11[,1], y21[,1], y31[,1], y41[,1], y51[,1])
y2 = c(y11[,2], y21[,2], y31[,2], y41[,2], y51[,2])
#Making the dataframe
y = data.frame(y1, y2)
#Plot the dataset
plot(y$y1, y$y2, xlab = "x", ylab = "y",
main = "Artificial data set")
Appendix 4: The implementation in R of the algorithm described in Section 2.2.
#Initial clustering
vectOfInitClust = function (numClust, dataSet) {
  kmeanInit = kmeans(dataSet, centers = numClust)
  return (kmeanInit)
}
#Frequencies generation for uniformly distributed noises
kDataClustUnif = function(numClust, lenghts = 500, nRepeat = 50,
                          dataSet = y, variance, kmeanInit){
  #initial (unperturbed) hierarchical clustering
  MatrixOfDistance = dist(dataSet, method = "euclidean",
                          diag = FALSE)
  hClust1 = hclust(MatrixOfDistance)
  initClust = cutree(hClust1, k = numClust)
  k = 0
  for (i in 1:nRepeat){
    #extended set: the data plus a uniformly perturbed copy,
    #U(-variance, variance) noise in each coordinate
    newDataAdding1 = c(dataSet[,1], dataSet[,1] +
      runif(lenghts, min = -variance, max = variance))
    newDataAdding2 = c(dataSet[,2], dataSet[,2] +
      runif(lenghts, min = -variance, max = variance))
    newDataAdding = data.frame(newDataAdding1, newDataAdding2)
    Distance = dist(newDataAdding, method = "euclidean",
                    diag = FALSE)
    TreeGeneration = hclust(Distance)
    vect = cutree(TreeGeneration, k = numClust)
    #count a mismatch when the clustering of the extended set,
    #restricted to the original points, differs from initClust
    if (mean(abs(initClust - vect[1:lenghts])) != 0)
    {
      k = k + 1
    }
  }
  return(k/nRepeat)
}
#Set of frequencies for a fixed number of clusters (for uniform noise)
frequnceUnifSet = function(nClust, variances, nRepeat = 50,
                           dataSet = y){
  kmeanInit = kmeans(dataSet, centers = nClust)
  i = 0
  frequences = 0
  for (sigma in variances){
    i = i + 1
    frequences[i] = kDataClustUnif(numClust = nClust,
                                   dataSet = dataSet,
                                   variance = sigma,
                                   nRepeat = nRepeat,
                                   kmeanInit = kmeanInit)
  }
  return (frequences)
}
# Iteration through the numbers of clusters (median over variances)
frequenceNumClustUnifMedian = function(setNumClust = 2:10,
                                       nRepeat = 50, varSet, dataSet = y){
  i = 0
  medFrequence = 0
  for (k in setNumClust){
    i = i + 1
    medFrequence[i] = median(frequnceUnifSet(nClust = k,
      variances = varSet, nRepeat = nRepeat, dataSet = dataSet))
  }
  return (medFrequence)
}