Saint Petersburg State University
Department of Mathematical Game Theory and Statistical Decisions
Lozkins Aleksejs
Master's thesis
The stability-based approach to cluster number determination
Specialization 01.04.02
Applied Mathematics and Informatics
Master's Program Game Theory and Operations Research
Research advisor
D.Sc., Professor
Bure V. M.
Saint Petersburg
2016
Contents

Introduction
1 The formal description of existing methods
1.1 The common existing approaches
1.2 Literature review
2 Formalization of the approach
2.1 Methodology description
2.2 An algorithm based on data inaccuracies
2.3 Algorithm based on extended set
3 Computational experiments on artificial data
3.1 Artificial data description
3.2 Algorithms tuning and application on the artificial data
4 Applications of algorithms for applied problems
4.1 The cluster analysis of European countries by unemployment rates
4.2 The clustering analysis of European Union countries by demographic parameters
Bibliography
Introduction
Cluster analysis is a data learning tool. Data mining tasks are commonly differentiated into clustering, classification, regression, association rule learning, etc. Each of these data analysis tools has its own specificity. Clustering is the subset of data mining tools that solves the automatic grouping of data without a supervisor. Cluster analysis has two main problems: the choice of the clustering algorithm and the choice of the number of clusters.

Data structures have different representations, and there is no common clustering algorithm for every kind of dataset. Many clustering algorithms exist: hierarchical clustering, k-means algorithms, the HCS clustering algorithm, biclustering, etc. For this reason, sets of clustering algorithms have been developed for different cases of data.

Supervised learning (classification) has information about the groups and their properties; clustering, in turn, is the grouping of data without information about the properties of the clusters. For this purpose different criteria are introduced [1]. There is no common criterion for every type of dataset, and only a local optimum of the clustering can be found. The variety of criteria depends on the clustering application area. Cluster analysis is applied in medicine, computational biology, computer vision, economics, genetics, etc.

The proposed methods are related to an additional problem in data analysis: data inaccuracy. Both the clustering problem and the data clearness problem are employed in the development of a criterion for cluster number estimation. The goal of the dissertation is to produce a cluster number validation criterion and to address the data imprecision problem at the same time. The current work proposes new algorithms of cluster number determination which, in a certain sense, improve the existing algorithm.
The materials are published in 7 printed works: 2 of them are articles [2, 3], one paper [4] is indexed in Scopus, 4 are conference abstracts [5, 6, 7, 8], and 1 paper, applying the developed method to real data, is under publication in the geographical section of the Molodoy Uchenyi (Young Scientist) journal.

The dissertation consists of an introduction, four sections, a conclusion and a bibliography.

The introduction presents the relevance of the work, an introduction to the problems of the considered topics, and a short presentation of the goals of the research.

Section 1 presents the existing approaches to the considered problem and a description of the methodologies used; the basic notation is introduced and the literature review is conducted.

Section 2 presents the additional specific notation for the proposed methods; the similarity measure between clusterings is introduced and its metric properties are proved; the two algorithms and their descriptions are exposed.

Section 3 shows the algorithms' work on the artificial data and tests their operability.

Section 4 presents two results of applying the proposed algorithms to real data, together with the interpretation of the results.

In the Conclusion the research results are summed up and the main conclusions are formulated.
Chapter 1
The formal description of existing methods

1.1 The common existing approaches
Nowadays, the problem of true cluster number estimation is unsolved. This does not mean, however, that there are no alternatives for estimating the number of groups in cluster analysis. A wide spectrum of methodologies exists, comprising a large number of criteria for estimating the number of natural groups in the data. The criteria are not optimal in the general case, as they give only a local optimum.

The stability methodology in cluster analysis has a direct connection with the current work, which is the main reason for our interest. In contrast to mathematical analysis, the stability concept in clustering analysis does not have a strict mathematical interpretation. This concept depends on logical beliefs, and in most cases in clustering analysis it takes the form of an estimation of cluster variety. The basis of this variety estimation is considered to have the inverse property: the similarity between the intermediate results, which are generated under an assumption.

Before stating the central concepts, the basic notation should be introduced. A common theoretical notation does not exist, because clustering analysis has been developed in multiple scientific areas (biology, medicine, computer science, statistics, etc.). The notation presented here will be simple to use in mathematical descriptions.
Let us denote by X = {x_1, x_2, ..., x_n} the finite set of elements x_i = (x_{i,1}, x_{i,2}, ..., x_{i,d}) of the d-dimensional Euclidean space R^d; further, the elements from X will be considered as d-dimensional vectors without an additional second index. The clustering method will be denoted as a function α(X). The data partition, the result of clustering, is α(X) = {S^1, S^2, ..., S^k}, where k is the number of partitions or clusters, ∀S^j ⊂ X, and S^j ∩ S^l = ∅ for all j ≠ l. Additional notation for each of the concept descriptions will be introduced directly where the concepts are presented.
Most stability-based concepts use distances between clusterings. A variety of distance measures can be found in the literature; for them, the following representation of a clustering needs to be introduced:

$$c_{ij} = \begin{cases} 1, & \text{if } x_i \text{ and } x_j \text{ are in the same cluster and } x_i \neq x_j; \\ 0, & \text{otherwise.} \end{cases}$$

Each clustering α_q(X) has its own matrix C^q. The dot product of the matrices C^1, C^2 is defined by the formula

$$\langle C^1, C^2 \rangle = \sum_{i,j} c^1_{ij} c^2_{ij}.$$

The dot product computes the number of pairs of elements which are clustered together in both clusterings. The dot product has the following property:

$$\langle C^1, C^2 \rangle \le \sqrt{\langle C^1, C^1 \rangle \langle C^2, C^2 \rangle}.$$

This property is the Cauchy–Schwarz inequality. Based on it, the cosine similarity measure was introduced by Fowlkes and Mallows in [9]:

$$cor(C^1, C^2) = \frac{\langle C^1, C^2 \rangle}{\sqrt{\langle C^1, C^1 \rangle \langle C^2, C^2 \rangle}}.$$

This is a normalized correlation between two clusterings, which can be considered as a distance between clusterings or a cluster similarity measure.
A significant number of similarity measures between clusterings have been introduced in cluster analysis; the commonly used ones are presented in this part of the work. The dot product helps us to express the following measures:

$$J(C^1, C^2) = \frac{\langle C^1, C^2 \rangle}{\langle C^1, C^1 \rangle + \langle C^2, C^2 \rangle - \langle C^1, C^2 \rangle},$$

the Jaccard coefficient [10], a similarity measure where only non-negative matches are considered. The squared norm of a matrix, ||C||^2 = ⟨C, C⟩, is proposed in [11] for the representation of the matching coefficient,

$$M(C^1, C^2) = 1 - \frac{1}{n^2} \| C^1 - C^2 \|^2,$$

a way of using a weighted penalty for non-matches in the similarity estimation.
The choice of similarity measure depends on the aim of the analysis. In most cases precise measures are excessive and impose an additional computational load, which is undesirable.
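To make the measures above concrete, the following minimal R sketch (not part of the original thesis; the names comembership, dotprod, corC and jaccard are illustrative) builds the co-membership matrix C from a vector of cluster labels and evaluates the cosine and Jaccard measures:

#Illustrative sketch: co-membership matrix and dot-product measures
comembership = function(labels) {
  C = outer(labels, labels, "==") * 1  #c_ij = 1 if same cluster
  diag(C) = 0                          #the definition requires x_i != x_j
  return(C)
}
dotprod = function(C1, C2) sum(C1 * C2)
#Fowlkes-Mallows cosine similarity cor(C^1, C^2)
corC = function(C1, C2) dotprod(C1, C2) /
  sqrt(dotprod(C1, C1) * dotprod(C2, C2))
#Jaccard coefficient J(C^1, C^2)
jaccard = function(C1, C2) dotprod(C1, C2) /
  (dotprod(C1, C1) + dotprod(C2, C2) - dotprod(C1, C2))
#Example: two clusterings of six elements
C1 = comembership(c(1, 1, 1, 2, 2, 2))
C2 = comembership(c(1, 1, 2, 2, 2, 2))
corC(C1, C2); jaccard(C1, C2)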
The first concept of true cluster number determination which we consider is presented in [12]. The idea consists of generating a random subsample X_1 from the general set X. The condition on X_1 is a fixed number of elements from X. This number of elements is characterized by the sampling ratio f, which describes the proportion of the number of elements in the subsample X_1 to the whole number of elements in X. The fraction of points sampled is f > 0.5, and it should be close, but not equal, to 1. This condition provides the following property for two different subsamples X_1 and X_2 with the same f: X_1 ∩ X_2 ≠ ∅.
The main assumption of this concept is that the clustering structure of X should be inherited by the subsamples. Under this assumption, the clusterings of two different subsamples with the same ratio should produce the same clusterings on the intersection of these subsamples. The comparison of two different clusterings is performed via similarity measures. The true clustering is stable with respect to sub-sampling.
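A rough R sketch of this sub-sampling idea follows (an illustration under our own choices of f, k and k-means, not the exact procedure of [12]; comembership() and corC() come from the previous sketch):

set.seed(1)
X = as.data.frame(matrix(rnorm(200), ncol = 2))
f = 0.8; n = nrow(X); k = 3
#two random subsamples with the same sampling ratio f > 0.5
idx1 = sample(n, round(f * n)); idx2 = sample(n, round(f * n))
cl1 = kmeans(X[idx1, ], centers = k)$cluster
cl2 = kmeans(X[idx2, ], centers = k)$cluster
common = intersect(idx1, idx2)  #nonempty because f > 0.5
#labels of the common points inside each subsample clustering
lab1 = cl1[match(common, idx1)]
lab2 = cl2[match(common, idx2)]
#similarity of the two clusterings on the intersection
corC(comembership(lab1), comembership(lab2))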
The second stability concept for cluster number validation is based on supervised learning. The idea is formulated in [13], and additional modifications (modifications of the subsample generation) are added by Lozkins and Bure.

The solution of a clustering is presented as a set of labels Y, where y_i ∈ Y represents the number of the cluster containing the element x_i ∈ X; that is, the cluster label of x_i is y_i. Supervised learning defines a classifier function φ which, given a training set, draws a conclusion about an element's label. Many classification approaches exist, each with its own classification function; in this work the theory of classification is omitted.
The work [13] does not provide methods for choosing the training set. Therefore the subsample idea of the previous concept is used. The training set is the subsample X_1 ⊂ X with fraction level f, and the inheritance of the cluster structure is implied.
Let us assume the classifier φ_1 is trained on the subsample (X_1, Y_1), where Y_1 is the clustering result of X_1. The classification result Y = φ_1(X) is transformed to the matrix C^1. The clustering result of X is represented by C.
In this idea the stability estimation is based on the average similarity measures for each considered cluster number. If the variety of the clusterings produced by different classifiers, with the same fraction ratio and for a fixed number of clusters, is low, then the clustering is stable with respect to the other numbers of clusters considered.

The challenge of the presented concepts is the fraction ratio and, in the latter case, the classifier function. In some cases this can cause problems (for example, f is too small and the subsample does not include a whole cluster) and the result can be inaccurate. Thus a large number of variants of the algorithms needs to be tested.
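As an illustration of the classifier-based concept, here is a hedged R sketch (1-nearest-neighbour is our own stand-in for φ, since [13] leaves the classifier free; comembership() and corC() are from the earlier sketch):

library(class)  #for knn()
set.seed(2)
X = as.data.frame(matrix(rnorm(200), ncol = 2))
f = 0.8; n = nrow(X); k = 3
idx1 = sample(n, round(f * n))
Y1 = kmeans(X[idx1, ], centers = k)$cluster       #clustering of X_1
Yhat = knn(train = X[idx1, ], test = X, cl = Y1)  #Y = phi_1(X)
Y = kmeans(X, centers = k)$cluster                #direct clustering of X
#compare C^1 (from the classifier) with C (from the clustering)
corC(comembership(as.integer(Yhat)), comembership(Y))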
1.2 Literature review

Many methodologies for correct cluster number estimation can be found. Different points of view on data interpretation create room for disjoint ideas for solving clustering problems. The main existing methodologies for cluster number validation in clustering analysis are described in this section.
The methodology of considering intra- and inter-cluster variances as functions of the number of clusters is the basis of the criteria represented in the following research: [14, 15, 16, 17, 18, 19, 20], [21] (the C-index) and [1] (the Gap Statistic method). The methods described in the works mentioned above use a geometrical perspective.
The gap statistic considers the intra-cluster dispersion, also called the error measure, W_k versus the number of clusters k. The discrete curve W_k monotonically decreases as the number of clusters increases. A sudden jump down of the curve at a certain value of k indicates an appropriate number of clusters (the elbow criterion).

The C-index considers the sum S of distances over all pairs from the same cluster; S_min is the sum of the l shortest distances between pairs, and S_max is the sum of the l largest distances in the whole set, where l is the number of within-cluster pairs. The index is defined by the formula

$$C = \frac{S - S_{min}}{S_{max} - S_{min}}.$$

The better clustering is determined by a small C-index value.
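A small R sketch of the C-index as defined above (our own implementation, for illustration only):

c_index = function(X, labels) {
  D = as.matrix(dist(X))
  up = upper.tri(D)
  d = D[up]                               #all pairwise distances
  same = outer(labels, labels, "==")[up]  #pairs from the same cluster
  S = sum(d[same])                        #sum over within-cluster pairs
  l = sum(same)                           #number of within-cluster pairs
  ds = sort(d)
  Smin = sum(ds[1:l])                                #l shortest distances
  Smax = sum(ds[(length(ds) - l + 1):length(ds)])    #l largest distances
  (S - Smin) / (Smax - Smin)
}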
The nonparametric approach defines the true number of clusters as the number of areas with high data density. Each cluster corresponds to a domain of attraction, i.e. an area with high data concentration. Wishart in [22] was the first to suggest density modes for the research of cluster structures. Later the idea was developed by Hartigan in [23].
Multiple methodologies exist apart from the ones mentioned. However, this work is focused on the methodology described below, as it is the closest to the one proposed by the author of the work. Methods involving the cluster stability concept are based on the consideration of two clustered selections at a time, taken from the initial dataset. At a high stability level, the composition of the clusters can be interpreted as their reliability [24]. The idea described is the basis for the criteria for the determination of the optimal number of clusters used in the work of Levine and Domany [25] and of Ben-Hur, Elisseeff and Guyon [12, 26, 27].
Another methodology uses goodness-of-fit tests. The paper [28] introduced the X-means algorithm for optimal cluster number determination via the Bayesian information criterion. The mean of Hotelling's T-square p-values is considered in [29] for distance estimation between samples. The earlier works by Volkovich, Brazly and Morozensky [30] describe a way of cluster number determination through a statistical criterion.
The listed circumstances suggest that a correct clustering should be stable with respect to random perturbations of the initial data, at least at a low variance. While the value of the variance can be varied, the mathematical expectation of the random perturbation is always equal to zero, suggesting the absence of regular errors in the initial data. In the papers [31] and [32] similar approaches are used; these works are focused on cluster similarity measures and consider normal and binomial noise distributions respectively.

Data perturbations of an initial dataset X as a grouping stability estimation are considered in [2, 3, 4]. The main assumption of these methods is the inaccuracy of the studied data: if a low perturbation is added to the data, then the clustering should remain the same. The variances of the perturbations are the control parameters for the stability validation. The acceptable clustering is the grouping of the data sample that is robust to random perturbations of the investigated data compared to the other considered variants. The use of these methods with real data is presented in [3, 7] and will be described in later sections.
Chapter 2
Formalization of the approach
2.1 Methodology description
The idea of the approach comes from the studied data clearness problems. Explored initial data always has inaccuracies; these errors have different origins and rates and a casual character. Erroneous data arise due to measurement and/or rounding errors. There are also errors in mathematical models. When analyzing socioeconomic information, situations can be observed where the initial numerical data contain various kinds of inaccuracies. Occasionally, data are intentionally misrepresented, e.g. economic data relating to the income of legal entities or natural persons, or socio-economic data collected as a result of selective surveys of various social groups. Apart from those mentioned, other possible sources of observational inaccuracy exist. Consequently, the analyzed data is imprecise.

In this kind of clustering task the result cannot rely on the truth of the existing data. Thus, the correct value of an observation can differ from the data investigated. These circumstances suggest that a correct clustering should be stable with respect to random perturbations of the initial data, at least at a low variance. While the value of the variance can be varied, the mathematical expectation of the random perturbation is always equal to zero, suggesting the absence of regular errors in the initial data [2, 4].
Two versions of algorithms, which determine the acceptable clustering that is robust to random perturbations of the underlying data, are presented below. Each version has its own right to exist. Additional notation has to be introduced.
The new dataset, generated by adding noise to each element of X, will be denoted as

$$X_\sigma = \{x_{i,\sigma} \mid x_i \in X,\; x_{i,\sigma} = x_i + \varepsilon_{i,\sigma},\; i = 1, \dots, n\},$$

where ε_{i,σ} = (ε^1_{i,σ}, ..., ε^d_{i,σ})^T and the ε^j_{i,σ} are mutually independent random elements from the uniform distribution U(−δ, δ) or, in a more general case, from a beta distribution [33] defined on (−σ, σ). The noise generation can be performed from other distributions, for example, from a truncated normal distribution (i.e. the initial normal distribution has a zero mathematical expectation and a variance equal to σ², N(0, σ²)) within a given truncation range (−δ, δ). The choice of distribution is contingent on the initial supposition about the nature of the initial data errors. In the absence of information about the distribution of the initial data inaccuracies, the uniform distribution is taken as the benchmark.
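A minimal R sketch of the generation of X_σ in the benchmark uniform case (perturb is an illustrative name, not a function from the thesis):

#Add independent U(-delta, delta) noise to every coordinate
perturb = function(X, delta) {
  noise = matrix(runif(nrow(X) * ncol(X), min = -delta, max = delta),
                 nrow = nrow(X))
  X + noise
}
#Example: Xsigma = perturb(X, delta = 0.1)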
The result of a partition of the set X into k disjoint subsets is denoted as X = ∪_{i=1}^k S_i, where S_i ∩ S_j = ∅ for all i ≠ j, i = 1, ..., k, j = 1, ..., k. The partition algorithm, or the approach, lets us denote α_k(X) = {S_1, ..., S_k}.

One of the approaches uses the perturbed set X_σ. The perturbation level σ should be estimated by an expert, or a set {σ} must be considered; in the second case the complexity of the stability estimation grows. The assumption of data inaccuracy is represented by the perturbed data. The grouping of X_σ into k clusters will be denoted as S_1^σ, S_2^σ, ..., S_k^σ, which are k disjoint sets whose union is ∪_{i=1}^k S_i^σ = X_σ.
The second approach uses the extended perturbed set X_σ^+ = X ∪ X_σ with not more than 2n elements. The partition of X_σ^+ into k disjoint groups will be denoted as S_1^{σ+}, S_2^{σ+}, ..., S_k^{σ+}. Let us define the sets S_i^{σ−} = S_i^{σ+} \ X_σ, i = 1, ..., k, which are k disjoint sets whose union is the set X. This approach to partitioning is defined as α_k^{σ−}(X) = {S_1^{σ−}, ..., S_k^{σ−}}.
In this case the stability concept is viewed as the group variability under the extension of the data by the perturbed data. The data extension does not end with the addition of one perturbed initial set: the number of perturbed sets can differ, and the noise for each set can have its own distribution parameters. For example, let us consider r perturbed sets, X_σ^+ = X ∪ X_{σ1} ∪ ... ∪ X_{σr}, with no more than (r+1)n elements. In most cases r = 1 is enough; a large number of perturbed sets adds complexity but does not give additional precision.
The cluster stability concept implies the comparison of clusterings. A concept of sufficiently rough estimation for the comparison of cluster consistency is proposed. Consider the partitions S_1, S_2, ..., S_k and S_1^σ, S_2^σ, ..., S_k^σ. If the relevant matrices C and C^σ of the partitions match, then the distance between the partitions is 0, and 1 otherwise. Using the dot product, the rough partition similarity measure introduced by Lozkins and Bure can be represented by the following formula:

$$lb(\alpha_k(X), \alpha_k(X_\sigma)) = \begin{cases} 0, & \text{if } \dfrac{\langle C, C^\sigma \rangle}{\sqrt{\langle C, C \rangle \langle C^\sigma, C^\sigma \rangle}} = 1; \\ 1, & \text{otherwise.} \end{cases}$$

Lemma. The lb(α_k(X), α_k(X_σ)) similarity measure has distance properties.

Proof. The proposed similarity measure satisfies the non-negativity axiom: lb(α_k(X), α_k(X_σ)) ≥ 0; the set of its possible values is {0, 1}.
Let us show that the identity of indiscernibles axiom holds. The condition lb(α_k(X), α_k(X_σ)) = 0 holds if and only if ⟨C, C^σ⟩ / √(⟨C, C⟩⟨C^σ, C^σ⟩) = 1. If C = C^σ, then ⟨C, C^σ⟩ = ⟨C, C⟩ = ⟨C^σ, C^σ⟩ and the ratio equals 1. The Cauchy–Schwarz inequality ⟨C, C^σ⟩ ≤ √(⟨C, C⟩⟨C^σ, C^σ⟩) gives ⟨C, C^σ⟩ / √(⟨C, C⟩⟨C^σ, C^σ⟩) ≤ 1. If C ≠ C^σ, then ⟨C, C^σ⟩ < ⟨C, C⟩ or ⟨C, C^σ⟩ < ⟨C^σ, C^σ⟩, because ⟨C, C⟩ and ⟨C^σ, C^σ⟩ contain the maximal numbers of matches (by the definition of the matrix C in Section 1); consequently ⟨C, C^σ⟩ / √(⟨C, C⟩⟨C^σ, C^σ⟩) < 1 and lb(α_k(X), α_k(X_σ)) = 1.
The proposed similarity measure has the symmetry property, inherited from the symmetry property of the dot product: lb(α_k(X), α_k(X_σ)) = lb(α_k(X_σ), α_k(X)).
Let us show that the triangle inequality holds. If lb(α_k(X), α_k(X_σ)) = 0, then lb(α_k(X), α_k(X_σ)) ≤ lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) holds, since lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) ≥ 0 by the non-negativity condition. The second case is lb(α_k(X), α_k(X_σ)) = 1, which means C ≠ C^σ. Let us show that lb(α_k(X), α_k(X_σ)) ≤ lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) holds here as well. Without loss of generality, let C^β = C; consequently C^β ≠ C^σ. In this case lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) = 1 and the inequality holds. If C^β ≠ C and C^β ≠ C^σ, then lb(α_k(X), α_k(X_β)) + lb(α_k(X_σ), α_k(X_β)) = 2 and the inequality holds too.

Proved.
The rough similarity measure is chosen to save computational costs. Other, more precise similarity measures can be used, but this would significantly increase the computational complexity of the method. In most cases it is excessive and unjustified.
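For illustration, the lb distance can be sketched in R through the co-membership matrices from Section 1 (comembership() is from the earlier sketch; checking C = C^σ directly is equivalent to checking that the cosine similarity equals 1):

lb = function(labels1, labels2) {
  C1 = comembership(labels1)
  C2 = comembership(labels2)
  as.integer(!all(C1 == C2))  #0 if the partitions match, 1 otherwise
}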
The proposed distance between clusterings is adapted to the algorithms in the later sections. The main attention is paid to the determination of the perturbation parameter using this metric, which allows the number of settings to be reduced.

The meaning of clustering stability, in our understanding, is defined by the level of variety of the clusterings under random perturbations of the data, compared with the initial clustering. The superior, precise and proper clustering consists of the clusters which have the minimal level of dependence on the perturbation level.
Definition. The variability frequency ν is the ratio of the number of observed mismatches to the total number of repetitions.

The variability frequency is the result of the algorithm's work; it shows the stability level for each considered number of clusters. This value represents the normalized number of mismatches between the clusterings of the initial and perturbed data and, consequently, the clustering reliability.
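In R terms (with lbOutcomes as an illustrative name for the 0/1 vector of lb values collected over the repetitions), the variability frequency is simply:

nu = mean(lbOutcomes)  #share of mismatches among the repetitions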
2.2 An algorithm based on data inaccuracies
In Section 2.1 the data inaccuracy problem was mentioned, as well as some ways of interpreting the data in simulation studies for its solution. The determination of the true clustering is considered the main problem which the algorithm tries to solve. The clustering algorithm and the choice of the number of clusters are important problems in cluster analysis. The algorithm described in this section finds a local optimum of the cluster number and partially solves the clustering algorithm problem (which clustering algorithm approach is better). Cluster robustness is used for the validation of the number of groups and is the basic assumption of the approach.
The algorithm is based on the stability concept. The meaning of stability in this approach is connected to errors in the data, or rather to the representation of errors in the data. In mathematical science this idea is used in stochastic programming.

The algorithm gives a local optimum, because clustering analysis has a lot of clustering methods, and there are methods whose solutions are not precise (fuzzy). The different natures of the algorithms add complexity to the cluster analysis problem. Using expert opinion, a set of clustering algorithms can be chosen; this set is a parameter of the proposed algorithm, and the local optimum is relative to this set, since in a real situation not all clustering methods can be considered. In most cases this is not necessary, and one or two clustering approaches can be considered (expert opinion).
There are a lot of approaches for cluster number determination (Section 1.2); the majority uses an interval of admissible cluster number values, and this holds in the considered algorithm as well. A sequence of integers or an integer interval is enough; expert opinion is not necessary here. Let us denote this set as [kmin, kmax].
The perturbed set X_σ is the crown (the most important part) of the algorithm setting. If the level of the data errors is known, then considering a fixed σ is enough; otherwise a set {σ} should be considered. An important question is how to choose the set of noise distribution variances. This set can be estimated experimentally, starting from small variance values: if the variability frequencies are relatively low for all numbers of clusters, then the variances are increased; if the variability frequencies are near 1, then the variances are decreased. Let us denote the variances set Ω = {σi}, i = 1, ..., s.

The algorithm works with the two sets X and X_σ. The clusterings of these sets into k clusters should give the same results in most cases of the noise addition. There is a precision control parameter, which reflects the number of generations of X_σ. In each generation the clusterings are compared. The comparison uses the lb distance between clusterings (another similarity measure can be used, but the lb distance is more suitable for this algorithm).
The algorithm scheme is presented below:

Input: X, [kmin, kmax], Ω = {σi}, i = 1..s;
Output: νσi,k;
Requires: a clustering algorithm αk(X), a similarity measure between clusterings;

for (k in kmin..kmax) do
    αk(X);
    for (σi in Ω) do
        NumOfDiffers = 0;
        for (i in 1..NumOfRepetitions) do
            Xσi generation;
            NumOfDiffers = NumOfDiffers + lb(αk(X), αk(Xσi));
        end for
        νσi,k = NumOfDiffers / NumOfRepetitions;
    end for
end for

Here NumOfRepetitions is the precision control parameter mentioned before; the counter NumOfDiffers is reset for every pair (σi, k).
The result of the algorithm's work is the frequency matrix {νσi,k}. This representation of the result is complex and hard to interpret. Introducing statistics into the result is a way of simplifying the result matrix: for example, the median or the mean over the variances transforms the matrix into a vector. The vector represents the variability frequency νk for each considered number of clusters.
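For example, if the frequencies are stored in a matrix whose rows are indexed by σi and whose columns are indexed by k (freqMatrix is an illustrative name), the simplification is one line of R:

nuK = apply(freqMatrix, 2, median)  #or mean; one value per number of clusters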
The first version of the proposed algorithm is presented in the work [2]; the practical usage of this approach is considered in [3] and will be discussed in Sections 3 and 4.1.
2.3 Algorithm based on extended set
In Section 2.2 an algorithm based on the data errors was considered. In this section another algorithm uses a similar assumption, with some differences and significant modifications.

The work [23] describes clusters as islands in the sea, where the heights above sea level represent the group densities. There is a similarity between the concept described above and the algorithm proposed in this section. The idea offered represents an artificial uplift of the islands. If the heights of the islands above sea level increase under the uplift, then the clusters are more stable; if the area of the islands grows, then the clusters are less stable. The idea is presented in [4], and its application is being published in the Molodoy Uchenyi journal.
Input: X, [kmin, kmax], Ω = {σi}, i = 1..s;
Output: νσi,k;
Requires: a clustering algorithm αk(X), a similarity measure between clusterings;

for (k in kmin..kmax) do
    αk(X);
    for (σi in Ω) do
        NumOfDiffers = 0;
        for (i in 1..NumOfRepetitions) do
            Xσi+ = X ∪ Xσi;
            αk^(σi−)(X);
            NumOfDiffers = NumOfDiffers + lb(αk(X), αk^(σi−)(X));
        end for
        νσi,k = NumOfDiffers / NumOfRepetitions;
    end for
end for
The islands analogy helps to understand the concept of the proposed method. The islands uplift procedure is represented by the extended dataset X_σ^+. The σ has the same meaning as in the previous section. The union X_σ^+ of the initial and perturbed datasets makes stable clusters more unchangeable, while unstable clusters have a high variety level compared with the initial clustering. The extended set has an adjustable size (2n, 3n, ...). A larger size helps to separate the stable clusterings from the unstable ones more accurately, but the computational complexity increases significantly.

The simplification of the result using statistics takes place in this approach as well as in the previous section. The similarity measure is free to be selected and should be computationally simple.
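One repetition of this extended-set approach may be sketched in R as follows (an illustration assuming hierarchical clustering for αk; perturb() and lb() come from the earlier sketches):

one_repetition = function(X, k, delta) {
  base = cutree(hclust(dist(X)), k = k)  #alpha_k(X)
  Xext = rbind(X, perturb(X, delta))     #X_sigma^+ = X union X_sigma
  ext = cutree(hclust(dist(Xext)), k = k)
  extOnX = ext[1:nrow(X)]                #restriction to X: the sets S_i^(sigma-)
  lb(base, extOnX)                       #0 if the clusterings coincide
}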
Chapter 3
Computational experiments on artificial data

3.1 Artificial data description
The algorithms mentioned in the previous sections are tested on articial dataset.
This experiment shows the correct work of the proposed
approaches and helps to describe parameters of the algorithms.
The articial dataset consists of ve samples generated from multivariate normal distribution with 100 elements in each sample with dierent
means and variances.
The parameters of the samples are presented in the
Table 3.1.
Table 3.1: Multivariate normal distribution parameters

Sample number   Mean vector   Covariance matrix
Sample 1        (1, 1)        diag(0.15, 0.15)
Sample 2        (1, 4)        diag(0.15, 0.15)
Sample 3        (3.5, 2.5)    diag(0.2, 0.2)
Sample 4        (6, 1)        diag(0.15, 0.15)
Sample 5        (6, 4)        diag(0.15, 0.15)
The data shown represent five samples, each of which can be recognized as a cluster without special research. In this case, the result of the cluster number validation is known. Figure 3.1 displays the artificial data. The two-dimensional dataset is denoted as X.

Figure 3.1: Union of 5 samples from the multivariate Gaussian distribution
3.2 Algorithms tuning and application on the artificial data

The presented algorithms use similar input parameters. For comparison of the results the same parameters will be used; the parameters of the algorithms are presented in Table 3.2. The approach for cluster number determination which uses the extended set is applied with the union of the initial dataset and one perturbed dataset (only one perturbed dataset).
Table 3.2: Algorithm parameters

Parameter                            Value
Cluster number set [kmin, kmax]      [2, 10]
Variances sequence                   σ1 = 0.02; σi = σi−1 + 0.02, i = 2..20
Clustering algorithms                k-means algorithm
Distance between clusterings         lb(αk(X), αk(Xσi))
Number of repetitions/simulations    50
Perturbation distributions           Uniform distribution, Gaussian distribution
The set of perturbation distribution parameters is the most important. The correct estimation of the variances set plays the main role in the variability frequency calculation. This set depends on the average intra-data standard deviation level.

Figure 3.2 (a) and (b) present the variability frequencies for 2, 3, 5 and 6 clusters depending on the perturbation variance values. If the frequencies for each variance value are low (less than 0.5), then the variance level is not correct and should be increased. If the frequencies for each variance value are near one, then the variance level is not correct either and should be reduced.

Figure 3.2 (a) and (b) are generated on a correct variances set, because there exists a cluster number with a low level of frequencies (the black line corresponds to the number of clusters equal to 5), which is interpreted as more stable, while the other cluster numbers have high variability frequencies and are not stable.

In both cases the results do not have significant differences. This has to do with the shared main methodology of the algorithms and the simple case of the studied data.
Figure 3.2: The frequencies for both algorithms
(a) The frequencies resulting from the algorithm described in Section 2.2
(b) The frequencies resulting from the algorithm described in Section 2.3
Figure 3.2 (a) and (b) represent the results where the variability frequencies νσi,k depend on two parameters: the number of clusters and the perturbation level. In the case when the set of numbers of clusters consists of 2 to 5 numbers, this presentation of the results is simple to analyze and draw conclusions from.
When the number of clusters is greater than 5, as in our case, the mean and the median are used for simplification of the results. The elbow criterion is used in cluster number validation theory [1]. The introduction of statistics into our algorithms makes it possible to use this criterion.

The experimental results of the k-means clustering algorithm for the first cluster validation method are presented in Figure 3.4 (a), (b), (c) and (d). Four different cases are considered there, where two perturbation distributions and two statistics are taken.

Figures 3.4 (a), (c) present the medians of the frequencies, and Figures 3.4 (b), (d) represent the means of the frequencies for the different distributions. The two different statistics show similar results and confirm their truthfulness. Averaging the frequencies allows the curves in Figure 3.2 to be represented as points without loss of information.
Figures 3.4 (a), (b) are generated using uniformly distributed perturbations, and Figures 3.4 (c), (d) using normally distributed ones. The results for the uniform case are better, with a minimal frequency of about 0.3, while in the case of the normal distribution the minimal frequency is 0.4. The variances set for both distributions was the same when the experiment was carried out.
Figure 3.6 has the same interpretation as Figure 3.4. The results of the proposed algorithms on the artificial dataset are clearly similar and differ insignificantly at the non-stable numbers of clusters.

The goal of the cluster number validation algorithms is to obtain sharp dips of the curve at the stable points (the elbow criterion). In Figure 3.4 and Figure 3.6 this number of clusters is equal to 5. This is the expected solution, thereby providing confirmation of the proposed algorithms. The results cannot be interpreted as an optimal solution, because a limited set of cluster numbers and a limited number of clustering algorithms were considered.
Figure 3.4: The k-means clustering algorithm is considered for the cluster number validation algorithm described in Section 2.2.
(a) The median of the frequencies for each considered number of clusters; perturbations from the uniform distribution
(b) The mean of the frequencies for each considered number of clusters; perturbations from the uniform distribution
(c) The median of the frequencies for each considered number of clusters; perturbations from the normal distribution
(d) The mean of the frequencies for each considered number of clusters; perturbations from the normal distribution
Figure 3.6: The k-means clustering algorithm is considered for the cluster number validation algorithm described in Section 2.3.
(a) The median of the frequencies for each considered number of clusters; perturbations from the uniform distribution
(b) The mean of the frequencies for each considered number of clusters; perturbations from the uniform distribution
(c) The median of the frequencies for each considered number of clusters; perturbations from the normal distribution
(d) The mean of the frequencies for each considered number of clusters; perturbations from the normal distribution
Chapter 4
Applications of algorithms for applied problems

4.1 The cluster analysis of European countries by unemployment rates
The application of cluster analysis to data exploration helps in finding dependences or groups of elements with similar properties. A dependence may occur under a large number of parameters, where simple tools are insufficient. The results of cluster analysis provide an answer whose interpretation may yield new information or a discovery.
This section describes the practical application of the algorithm from Section 2.2 to an economic dataset. The unemployment rates of European countries are considered as the underlying dataset. The data consist of 3 parameters, or features: the unemployment rates of European countries in the I, II and III quarters of 2014. An analogous presentation of the results can be found in [7], where inflation rates were considered instead of unemployment rates. The first real application of the discussed approaches started on this dataset and is presented in [3]. The dataset of unemployment rates is enclosed in Appendix 1. The parameters of the algorithm are presented in Table 4.1.
The number of elements in the investigated data is equal to 42. The maximal tested number of clusters is 10; consequently, the average content of a cluster is about 4 elements. Such groups are finely divided and may add
Table 4.1: Algorithm parameters

Parameter                            Value
Cluster number set [kmin, kmax]      [2, 10]
Variances sequence                   σ1 = 0.07; σi = σi−1 + 0.07, i = 2..20
Clustering algorithms                k-means algorithm, hierarchical clustering
Distance between clusterings         lb(αk(X), αk(Xσi))
Number of repetitions/simulations    50
Perturbation distributions           Uniform distribution
noise into the research (a clustering into 42 clusters would be more stable). Only the uniform distribution U(−δ, δ) is considered; some practical advantages of this distribution are listed in Section 3.
Figure 4.1 displays the variability frequencies νσi,k for the cluster number set [2, 5] for the k-means clustering method. The variances set is correct: each group number contains frequencies less than 0.1 and greater than 0.5. There are two stable points, 2 and 3, with average frequencies 0.15 and 0.3 respectively (Figure 4.1 (b) and (c)).
Figure 4.3 represents the same results as Figure 4.1, but with a different clustering algorithm. The other clustering algorithm gives closer frequency curves for the stable grouping numbers. On average, the cluster numbers 2 and 3 continue to be stable compared to the other cluster number cases.

The clusterings of the underlying set into 2 and 3 clusters are respectively the same for both clustering algorithms. In the first case, the clusters consist of the countries presented in Table 4.2.
In the second cluster, countries resulting from the collapse of Yugoslavia prevail. Bosnia and Herzegovina, Croatia, Macedonia and Serbia are post-communist countries. All countries of the second cluster have close geographical locations. Greece in 2014 was in an economic crisis and had problems with its unemployment rate in that year.

The clustering results for 3 clusters have the composition presented in Table 4.3.
Most of the countries from the second cluster are countries which participated in the Soviet Union. After the Union's collapse, the economies of most
Figure 4.1: The k-means algorithm is considered as the clustering algorithm for the variability frequency calculation
(a) The frequencies for different numbers of clusters (black: 2, red: 3, green: 4, blue: 5 clusters)
(b) The median of the frequencies for each considered number of clusters; perturbations are uniform
(c) The mean of the frequencies for each considered number of clusters; perturbations are uniform
Table 4.2: The clustering of European countries by unemployment rate for two clusters

Cluster Nr.1: Albania, Armenia, Austria, Belgium, Bulgaria, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Hungary, Iceland, Ireland, Italy, Kazakhstan, Latvia, Lithuania, Luxembourg, Malta, Moldova, Montenegro, Netherlands, Norway, Poland, Portugal, Romania, Russia, Slovakia, Slovenia, Sweden, Switzerland, Turkey, Ukraine, United Kingdom

Cluster Nr.2: Bosnia and Herzegovina, Croatia, Greece, Macedonia, Serbia, Spain
Figure 4.3: The hierarchical clustering algorithm is considered for the variability frequency calculation
(a) The frequencies for different numbers of clusters (black: 2, red: 3, green: 4, blue: 5 clusters)
(b) The median of the frequencies for each considered number of clusters; perturbations are uniform
(c) The mean of the frequencies for each considered number of clusters; perturbations are uniform
Table 4.3: The clustering of European countries by unemployment rate for three clusters

Cluster Nr.1: Austria, Belgium, Czech Republic, Denmark, Estonia, Finland, France, Germany, Iceland, Ireland, Kazakhstan, Luxembourg, Malta, Moldova, Netherlands, Norway, Romania, Russia, Sweden, Switzerland, Turkey, Ukraine, United Kingdom

Cluster Nr.2: Albania, Armenia, Bulgaria, Croatia, Cyprus, Ireland, Italy, Latvia, Lithuania, Montenegro, Poland, Portugal, Slovakia, Slovenia

Cluster Nr.3: Bosnia and Herzegovina, Greece, Macedonia, Spain
of the countries had an equal level of development, and this affected today's economic level.

The interpretation of the result admits many points of view; additional background or an analysis of the results by an expert in the economic area would make this outcome more valuable.
The cluster analysis of European countries by unemployment rates has been carried out. Two clustering algorithms were tested with uniformly distributed perturbations. Two stable clusterings were obtained and interpreted in this section.
4.2 The clustering analysis of European Union countries by demographic parameters

Current demographic trends in Europe are characterized by a fairly high mortality rate, a low birth rate, low natural population growth and high migration activity.
Cluster analysis reveals significant natural groupings of data. Abstracting from the physical sense, it is possible to determine divisions and connections in the tested data at the technical level. This analysis provides non-trivial results.
The heterogeneity of the measurement scales by which the demographic indicators are assessed requires additional analysis tools. According to the initial values, local normalization has been applied.
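The thesis does not spell out the normalization formula; as one plausible reading, a per-indicator (local) min-max scaling in R could look like this (normalize is an illustrative name):

#Scale every indicator (column) to [0, 1] independently
normalize = function(X) {
  as.data.frame(lapply(X, function(v) (v - min(v)) / (max(v) - min(v))))
}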
The following demographic indicators were the basis of the analysis: birth rate, mortality rate, migration, natural population increase and the proportion of the urban population in the total population. The indicator values are enclosed in Appendix 2 [35].
The clustering was carried out on different sets of parameters, and the most interesting results are presented in this section. The second algorithm, described in Section 2.3, is used for the stable clustering estimation. The settings of the algorithm are presented in Table 4.4.
Table 4.4: Algorithm parameters

Parameter                            Value
Cluster number set [kmin, kmax]      [2, 10]
Variances sequence                   σ1 = 0.005; σi = σi−1 + 0.005, i = 2..10
Clustering algorithms                hierarchical clustering
Distance between clusterings         lb(αk(X), αk(Xσi))
Number of repetitions/simulations    50
Perturbation distributions           Uniform distribution
Figure 4.5: The hierarchical clustering algorithm is considered for the calculation of the variability frequencies
(a) Five demographic indicators are studied (birth rate, mortality rate, migration, natural population increase and the proportion of the urban population in the total population)
(b) Two demographic indicators are studied (birth rate, natural population increase)
(c) Three demographic indicators are studied (natural growth rate, migration rate and the proportion of the urban population in the total population)
There are five indicators for each country. Clusterings over all subsets of these indicators were carried out in this experiment. The outcomes with stable points are presented below.

Figure 4.5 (c) displays the cluster number five as the more stable one. The data have the groups presented in Table 4.5.
Table 4.5: Clusters for three demographic indicators

Cluster Nr.1: Austria, Germany, Netherlands, Finland, Greece, Spain, Italy, Portugal, Bulgaria, Bosnia and Herzegovina, Hungary, Latvia, Lithuania, Macedonia, Poland, Romania, Serbia, Slovakia, Slovenia, Croatia, Montenegro, Czech Republic, Estonia

Cluster Nr.2: Belgium, United Kingdom, Luxembourg, France, Switzerland, Denmark, Norway, Sweden, Andorra, Cyprus, Malta, Monaco, San Marino

Cluster Nr.3: Ireland, Albania

Cluster Nr.4: Liechtenstein

Cluster Nr.5: Iceland
The first group, which includes Western European and Eastern European countries with zero or close-to-zero values of natural increase, a mostly positive migration rate and a rather high proportion of urban population, is the largest. The second cluster includes the developed countries of Western and Northern Europe, with positive values of natural growth, a high migration increase and high values of the urban population ratio.

The third cluster includes only two countries, with a high natural population increase, a very low migration increase and low average values of the proportion of urban population. It comprises an economically moderately developed country of Europe, Ireland, and a country with a relatively low level of economic development among the other European countries, Albania. In Albania, the highest natural population growth is achieved mainly due to the high proportion of the Muslim population. These countries are less attractive to migrants than the developed countries of Western and Northern Europe with higher standards of living. Liechtenstein forms the fourth cluster, with approximately equal low values of natural population growth, migration and the proportion of urban population.
The fifth cluster includes only one country, Iceland, with a fairly good indicator of natural population increase among the other European countries, a negative migration growth and a very high proportion of urban population.
The five clusters are built into a rating of the level of demographic development. According to the three indicators, the best demographic situation is observed in the countries that make up the second cluster. The fourth cluster is in second place and includes a country with positive indicators affecting the changes in the population, but with a low proportion of urban population. The fifth cluster takes third place in the ranking and includes a country with high rates of natural population increase and of the proportion of urban population, but with a low migration increase. The first cluster is in fourth place in the ranking; the countries belonging to it have small positive or negative natural population growth and migration, that is, low population dynamics. At the same time, a sufficiently high proportion of urban population is observed in the first cluster. The third cluster includes countries with a very uneven distribution of the demographic indicator values, which distinguishes it from the other clusters.
Figure 4.5 (a) represents the division into four clusters according to five indicators and demonstrates trends similar to the previous clustering: a favorable demographic situation in the economically developed countries of Western and Northern Europe and a less favorable one in the countries of Eastern and South-Eastern Europe. The clusters are presented in Table 4.6.
In the region of Central and Eastern Europe, a major factor in the decline of the birth rate is the change in the socioeconomic order that followed the collapse of the world socialist system. Earlier, the Central and Eastern European socialist countries belonged to one group; now they belong to the countries whose economies are in transition (from socialist to market). Under socialism there were many conditions that had a positive effect on the birth rate: the availability of jobs, better housing, benefits for children. As a result of the transformation of the economic system from the early 1990s to the beginning of the XXI century, the countries of Central and Eastern
Table 4.6: Clusters for five demographic indicators and partition into 4 groups

Cluster Nr.1: Austria, Germany, Finland, Greece, Italy, Portugal, Bulgaria, Bosnia and Herzegovina, Hungary, Latvia, Lithuania, Macedonia, Poland, Romania, Serbia, Slovakia, Slovenia, Croatia, Montenegro, Czech Republic, Estonia

Cluster Nr.2: Belgium, United Kingdom, Luxembourg, Netherlands, France, Switzerland, Denmark, Norway, Sweden, Andorra, Spain, Cyprus, Malta, Monaco, San Marino

Cluster Nr.3: Ireland, Albania, Iceland

Cluster Nr.4: Liechtenstein
Europe lost their social security; the social assistance system was not yet as highly developed as in Western and Northern Europe, and people had to seek employment under difficult social and economic conditions. All this did little to increase the birth rate [36].
Table 4.7: Clusters for five demographic indicators and partition into 2 groups

Cluster Nr.1: Austria, Germany, Finland, Greece, Italy, Portugal, Bulgaria, Bosnia and Herzegovina, Hungary, Latvia, Lithuania, Macedonia, Poland, Romania, Serbia, Slovakia, Slovenia, Croatia, Montenegro, Czech Republic, Estonia, Belgium, United Kingdom, Luxembourg, Netherlands, France, Switzerland, Denmark, Norway, Sweden, Andorra, Spain, Cyprus, Malta, Monaco, San Marino

Cluster Nr.2: Ireland, Albania, Iceland, Liechtenstein
Table 4.7 presents the second variant of the grouping of European countries by five indicators. It is virtually identical to the previous result (Table 4.6): the first and second clusters are merged into one (the first), while the third and fourth clusters are combined into the second cluster. The first cluster is characterized by birth and mortality rates that are average for the European region, low positive, zero or negative values of natural population and migration growth, and relatively high values of the proportion of urban population. In the second cluster there are countries with a positive natural population growth and negative or low values of the migration
population growth. The second cluster is composed of small countries (by area), including those with a moderate level of economic development.
Clustering in terms of the birth rate and the coefficient of migration growth allows a grouping of the countries based on the migration gain and on one of the components of natural population increase, births (Table 4.8), without taking into account mortality rates and other demographic indicators. The clustering result differs from the two previous clustering outcomes. Luxembourg, Cyprus and Monaco, small states (micro-states) of Europe, constitute the third cluster. They have average European migration growth values and a high birth rate compared to other European countries. The first cluster includes the bulk of the studied European countries, with average European birth rates and relatively low or negative growth in migration. This demographic situation can be described as stagnant. The second cluster includes only two countries with a relatively high birth rate but a low negative migration increase, so the overall population growth is negative or slightly positive.
Table 4.8: Clusters for two demographic indicators and partition into three groups

Cluster Nr.1: Austria, Germany, Finland, Greece, Italy, Portugal, Bulgaria, Bosnia and Herzegovina, Hungary, Latvia, Lithuania, Macedonia, Poland, Romania, Serbia, Slovakia, Slovenia, Croatia, Montenegro, Czech Republic, Estonia, Belgium, United Kingdom, Luxembourg, Netherlands, France, Switzerland, Denmark, Norway, Sweden, Andorra, Spain, Malta, San Marino, Iceland

Cluster Nr.2: Ireland, Albania

Cluster Nr.3: Cyprus, Monaco, Liechtenstein
Clustering carried out on different groups of indicators leads to the conclusion that the European countries with a higher level of economic development form a cluster of countries with a more stable demographic situation, in which a higher proportion of urban population is observed and the fluctuations in the values of the different indicators are small. Most of the countries of Europe make up, as a rule, large clusters, which contain the countries with a stagnant but more favorable demographic situation compared with the rest of Europe. Small countries with a moderate level of economic development form small clusters.
Thus, it is possible to identify the following demographic trends as a result of the clustering of European countries. Major countries in Western and Northern Europe with a high level of economic development are often grouped into one cluster. The small countries of Western Europe and the countries of Central, Eastern and Southern Europe, as a rule, are grouped into another cluster. Less economically developed European countries, moderately developed Western European countries and the small countries of Western and Northern Europe are grouped in a separate cluster.
Conclusion
In this dissertation we have investigated new approaches to cluster number determination under the clustering stability concept. The stability concept is based on the assumption of data inaccuracies, which gives us a treatment of the data inaccuracy problem itself as well as of the cluster number estimation. The existing notation is complemented by additional notation.

Two different algorithms, which use the same assumptions but different interpretations of the data inaccuracies, are proposed. The algorithms are tested on an artificial dataset, and the results of the algorithms' work are compared with the expected result. The adjustment of the algorithms' parameters has been demonstrated, and an interpretation of the parameters is given in Section 3.

Two experiments on real data are carried out. A stable clustering of the European countries' unemployment rates under the proposed criterion is found, and a historical and geographical interpretation of the result is provided. A cluster analysis of the European Union countries' demographic indices is carried out.
Bibliography
[1] Tibshirani, R., Walther, G., Hastie, T. (2001). Estimating the number of clusters via the gap statistic. J. Royal Statist. Soc. B, Vol. 63(2), pp. 411–423.

[2] Lozkins A., Bure V. M. (2016). A probabilistic approach to the determination of the locally optimal number of clusters. Vestnik Sankt-Peterburgskogo Universiteta, Ser. 10: Applied Mathematics, Informatics, Control Processes, Issue 1, pp. 28–38. (In Russian)

[3] Lozkins A. (2015). Cluster analysis of European countries by unemployment rate. In: Proceedings of the XLVI International Scientific Conference of Postgraduates and Students "Control Processes and Stability". St. Petersburg: Fedorova G. V. Publishing House, Vol. 2(18), No. 1, pp. 641–646. (In Russian)

[4] Lozkins A., Bure V. M. (2015). The method of clusters stability assessing. In: "Stability and Control Processes" in Memory of V. I. Zubov (SCP), 2015 International Conference. IEEE, pp. 479–482.

[5] Lozkins A. (2015). Cluster analysis of European countries by unemployment rate. In: Book of Abstracts of the Conference "Control Processes and Stability" 2015. (In Russian)
[6] Lozkins A., Bure V. M. (2015). An empirical approach to assessing the stability of clustering methods. In: Proceedings of the III International Conference "Stability and Control Processes", dedicated to the 85th anniversary of the birth of Professor V. I. Zubov, Corresponding Member of the RAS. St. Petersburg: Fedorova G. V. Publishing House, pp. 431–433. (In Russian)
[7] Lozkins A. (2015). Clustering of European Countries by an Inflation Rate and Clusters Research. Abstracts of the 20th International Conference of Mathematical Modelling and Analysis, p. 56.

[8] Lozkins A., Bure V. M. (2016). The approach for estimation of clustering robustness. 12th German Probability and Statistics Days, Book of Abstracts, pp. 197–198.

[9] Fowlkes, E. B., Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, Vol. 78, pp. 553–584.

[10] Jaccard, P. (1901). Distribution de la Flore Alpine: dans le Bassin des Dranses et dans quelques regions voisines. Rouge.

[11] Jain, A., Dubes, A. (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, New Jersey.

[12] Ben-Hur, A., Elisseeff, A., Guyon, I. (2002). A stability based method for discovering structure in clustered data. Proceedings of the Pacific Symposium on Biocomputing, pp. 6–17.

[13] Lange, T. et al. (2004). Stability-based validation of clustering solutions. Neural Computation, Vol. 16, No. 6, pp. 1299–1323.
[14] Dunn, J. C. (1974). Well Separated Clusters and Optimal Fuzzy Partitions. Journal of Cybernetics, Vol. 4, pp. 95–104.

[15] Calinski, R., Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, Vol. 3, pp. 1–27.

[16] Hartigan, J. A. (1985). Statistical theory in clustering. Journal of Classification, Vol. 2, pp. 63–76.

[17] Krzanowski, W., Lai, Y. (1985). A criterion for determining the number of groups in a dataset using sum of squares clustering. Biometrics, Vol. 44, pp. 23–34.

[18] Sugar, C., James, G. (2003). Finding the Number of Clusters in a Data Set: An Information Theoretic Approach. Journal of the American Statistical Association, Vol. 98, pp. 750–763.

[19] Gordon, A. D. (1994). Identifying genuine clusters in a classification. Computational Statistics and Data Analysis, Vol. 18, pp. 561–581.

[20] Milligan, G., Cooper, M. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, Vol. 50, pp. 159–179.

[21] Hubert, L., Schultz, J. (1976). Quadratic assignment as a general data-analysis strategy. British Journal of Mathematical and Statistical Psychology, Vol. 29, pp. 190–241.

[22] Wishart, D. (1969). Mode analysis: A generalization of nearest neighbor which reduces chaining effects. Numerical Taxonomy, pp. 282–311.

[23] Hartigan, J. (1975). Clustering Algorithms. New York: John Wiley.
[24] Cheng, R., Milligan, G. W. (1996). Measuring the influence of individual data points in a cluster analysis. Journal of Classification, Vol. 13, pp. 315–335.

[25] Levine, E., Domany, E. (2001). Resampling Method for Unsupervised Estimation of Cluster Validity. Neural Computation, Vol. 13, pp. 2573–2593.

[26] Ben-Hur, A., Elisseeff, A., Guyon, I. (1998). Statistical learning theory and randomized algorithms for control. IEEE Control Systems, Vol. 12, pp. 69–85.

[27] Ben-Hur, A., Guyon, I. (2003). Methods in Molecular Biology / M. J. Brownstein and A. Khodursky (Eds.). Humana Press, pp. 159–182.

[28] Pelleg, D., Moore, A. (2000). X-means: Extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conf. on Machine Learning. San Francisco: Morgan Kaufmann, pp. 727–734.

[29] Volkovich, Z., Brazly, Z., Toledano-Kitai, D., Avros, R. (2010). The Hotelling's metric as a cluster stability measure. Computer Modelling and New Technologies, Vol. 14 (4), pp. 65–72.

[30] Volkovich, Z., Brazly, Z., Morozensky, L. (2008). A statistical model of cluster stability. Pattern Recognition, Vol. 41 (7), pp. 2174–2188.

[31] Barzily, Z., Golani, M., Volkovich, Z. (2008). On a simulation approach to cluster stability validation. Special Issue, Mathematical and Computer Modeling in Applied Problems, Institute Informatics Problems, RAS, pp. 86–112.

[32] Toledano-Kitai, D., Avros, R., Volkovich, Z., Weber, G.-W., Yahalom, O. (2013). A binomial noised model for cluster validation. Journal of Intelligent and Fuzzy Systems, Special Issue: Recent Advances in Intelligent & Fuzzy Systems, pp. 417–427.
[33] Bure V. M., Parilina E. M. Probability Theory and Mathematical Statistics. St. Petersburg: Lan, 416 pp. (In Russian)

[34] Trading Economics [electronic resource]. URL: http://tradingeconomics.com/country-list/unemployment-rate (date of access: 28.01.2015).

[35] Starkova N. V. (2015). Economic and Social Geography of European Countries: A Study Guide. St. Petersburg: SOLO, 119 pp. (In Russian)

[36] Klupt M. (2008). Demography of the World's Regions. St. Petersburg: Piter, 347 pp. (In Russian)
Appendix
Appendix 1: The unemployment rates of European countries for 2014.
Country                  First quarter   Second quarter   Third quarter
Albania                  13.86           13.45            12.91
Armenia                  17.80           17.50            17.10
Austria                  8.90            7.73             9.00
Belgium                  8.50            8.50             8.60
Bosnia and Herzegovina   44.18           43.86            43.66
Bulgaria                 13.00           11.40            10.80
Croatia                  22.47           19.67            17.67
Cyprus                   15.87           16.07            16.27
Czech Republic           8.46            7.60             7.36
Denmark                  4.17            3.97             4.00
Estonia                  8.50            6.90             7.50
Finland                  9.03            9.63             7.53
France                   10.10           10.10            10.40
Germany                  5.07            5.00             5.00
Greece                   27.17           26.83            26.03
Hungary                  8.60            8.03             7.63
Iceland                  5.43            5.03             4.90
Ireland                  11.85           11.70            11.20
Italy                    12.67           12.53            12.87
Kazakhstan               5.10            5.03             5.07
Latvia                   11.90           10.70            10.60
Lithuania                12.40           11.20            9.10
Luxembourg               7.10            7.23             7.23
Macedonia                28.39           28.22            27.90
Malta                    6.00            5.80             5.84
Moldova                  5.10            3.60             3.30
Montenegro               14.94           14.11            13.48
Netherlands              8.70            8.57             8.07
Norway                   2.77            3.27             3.60
Poland                   13.70           12.50            11.67
Portugal                 15.10           13.90            13.10
Romania                  7.00            6.93             6.77
Serbia                   20.80           20.30            17.60
Russia                   5.53            5.03             4.87
Slovakia                 13.46           12.81            12.57
Slovenia                 14.10           13.07            12.50
Spain                    25.93           24.47            23.67
Sweden                   8.55            8.63             7.23
Switzerland              3.40            3.03             2.97
Turkey                   10.00           8.97             10.13
Ukraine                  9.40            8.60             8.90
UK                       6.97            6.50             6.07
Appendix 2: The demography indicators of European Union countries.
Country                  Birth rate   Mortality rate   Natural population   Migration population   Proportion of urban
                                                       growth rate          growth rate            population
Austria                  9            9                0                    5                      67
Albania                  13           7                6                    -15                    54
Andorra                  9            4                5                    4                      90
Belgium                  12           9                3                    6                      99
Bulgaria                 9            15               -6                   -1                     73
Bosnia and Herzegovina   8            9                -1                   0                      46
UK                       13           9                4                    2                      80
Hungary                  9            13               -4                   1                      69
Germany                  8            11               -3                   5                      73
Greece                   9            10               -1                   -1                     73
Denmark                  10           9                1                    4                      87
Ireland                  16           6                10                   -7                     60
Iceland                  14           6                8                    -1                     95
Spain                    10           8                2                    -3                     77
Italy                    9            10               -1                   4                      68
Cyprus                   12           7                5                    14                     62
Latvia                   10           14               -4                   -2                     68
Lithuania                10           14               -4                   -7                     67
Liechtenstein            10           6                4                    3                      15
Luxembourg               11           7                4                    19                     83
Macedonia                11           10               1                    0                      65
Malta                    10           8                2                    3                      100
Monaco                   6            6                0                    13                     100
Netherlands              10           8                2                    1                      66
Norway                   12           8                4                    9                      80
Poland                   10           10               0                    0                      61
Portugal                 9            10               -1                   -4                     38
Romania                  9            12               -3                   0                      55
San Marino               9            7                2                    7                      84
Serbia                   9            14               -5                   1                      59
Slovakia                 10           10               0                    1                      54
Slovenia                 11           9                2                    0                      50
Finland                  11           10               1                    3                      68
France                   13           9                4                    1                      78
Croatia                  10           12               -2                   -1                     56
Montenegro               12           10               2                    0                      64
Czech Republic           10           10               0                    1                      74
Switzerland              10           8                2                    8                      74
Sweden                   12           10               2                    5                      84
Estonia                  11           12               -1                   -5                     69
Appendix 3: The artificial data generation program code in R.
#Open the library
library("mvtnorm", lib.loc="~/R/win-library/3.1")
# Covariance matrix
varcov = matrix(c(0.15, 0, 0, 0.15), 2)
#Generation of five two-dimensional samples from
#the multivariate normal distribution
y11 = rmvnorm(100, mean=c(1,1), varcov)
y21 = rmvnorm(100, mean=c(1,4), varcov)
y31 = rmvnorm(100, mean=c(3.5,2.5), matrix(c(0.2, 0, 0, 0.2), 2))
y41 = rmvnorm(100, mean=c(6,1), varcov)
y51 = rmvnorm(100, mean=c(6,4), varcov)
#Union of samples
y1 = c(y11[,1], y21[,1], y31[,1], y41[,1], y51[,1])
y2 = c(y11[,2], y21[,2], y31[,2], y41[,2], y51[,2])
#Making the dataframe
y = data.frame(y1, y2)
#Plot the dataset
plot(y$y1, y$y2, xlab = "x", ylab = "y",
main = "Artificial data set")
Appendix 4: The implementation in R of the algorithm described in Section 2.2.
#Initial clustering
vectOfInitClust = function (numClust, dataSet) {
  kmeanInit = kmeans(dataSet, centers = numClust)
  return (kmeanInit)
}
#Frequencies generation for uniformly distributed noises
kDataClustUnif = function(numClust, lenghts = 500, nRepeat = 50,
                          dataSet = y, variance, kmeanInit){
  #initial (unperturbed) hierarchical clustering
  MatrixOfDistance = dist(dataSet, method = "euclidean",
                          diag = FALSE)
  hClust1 = hclust(MatrixOfDistance)
  initClust = cutree(hClust1, k = numClust)
  k = 0
  for (i in 1:nRepeat){
    #extended set: the data plus a uniformly perturbed copy,
    #U(-variance, variance) noise in each coordinate
    newDataAdding1 = c(dataSet[,1], dataSet[,1] +
      runif(lenghts, min = -variance, max = variance))
    newDataAdding2 = c(dataSet[,2], dataSet[,2] +
      runif(lenghts, min = -variance, max = variance))
    newDataAdding = data.frame(newDataAdding1, newDataAdding2)
    Distance = dist(newDataAdding, method = "euclidean",
                    diag = FALSE)
    TreeGeneration = hclust(Distance)
    vect = cutree(TreeGeneration, k = numClust)
    #count a mismatch when the clustering of the extended set,
    #restricted to the original points, differs from initClust
    if (mean(abs(initClust - vect[1:lenghts])) != 0)
    {
      k = k + 1
    }
  }
  return(k/nRepeat)
}
#Set of frequencies for a fixed number of clusters (for uniform noise)
frequnceUnifSet = function(nClust, variances, nRepeat = 50,
                           dataSet = y){
  kmeanInit = kmeans(dataSet, centers = nClust)
  i = 0
  frequences = 0
  for (sigma in variances){
    i = i + 1
    frequences[i] = kDataClustUnif(numClust = nClust,
                                   dataSet = dataSet,
                                   variance = sigma,
                                   nRepeat = nRepeat,
                                   kmeanInit = kmeanInit)
  }
  return (frequences)
}
# Iteration through the numbers of clusters (median over variances)
frequenceNumClustUnifMedian = function(setNumClust = 2:10,
                                       nRepeat = 50, varSet, dataSet = y){
  i = 0
  medFrequence = 0
  for (k in setNumClust){
    i = i + 1
    medFrequence[i] = median(frequnceUnifSet(nClust = k,
      variances = varSet, nRepeat = nRepeat, dataSet = dataSet))
  }
  return (medFrequence)
}