OK. Let us first play with a very simple 2D data set, created by sampling from two identical normal (=Gaussian) distributions with displaced centres.
library("mvtnorm")
set.seed(10)
setA <- rmvnorm(1000,c(0,0),matrix(c(1,0,0,1),2,2))
setB <- rmvnorm(1000,c(3,3),matrix(c(1,0,0,1),2,2))
mock.data <- rbind(setA,setB)
plot(mock.data)
\(k\)-means is a very popular and simple iterative clustering algorithm. Let us learn its steps with this very simple two-cluster 2D problem.
We start by randomly selecting two points that we will take as starting solution.
rc <- sample(1:dim(mock.data)[1],2)
plot(mock.data)
points(mock.data[rc,],pch=16,col="seagreen",cex=2)
We will then take these two points as the centres of the two clusters. We could have tried three or more clusters.
If these two points are the centres of the two clusters, each cluster will be made up of the points that are closer to each centre. Closer is a word that implies many things. In particular, it implies a distance and a metric. In the following, I will assume that the metric is Euclidean, but you have various choices (Euclidean, Manhattan, L\(_\infty\)) that you should consider before applying (most) clustering algorithms.
A quote from Elements of Statistical Learning
“An appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm. This aspect of the problem … depends on domain specific knowledge and is less amenable to general research.”
If you decide to use the euclidean distance, take into account its definition:
\(d(\mathbb{x},\mathbb{y}) = \sqrt{\sum_{i=1}^{n} (x_i-y_i)^2}\)
If dimension \(i\) has a range between 0 and, say, \(10^6\) while dimension \(j\) only goes from 0 to 1, differences in \(i\) will dominate the distance even if they only reflect noise.
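To see the effect, here is a small illustration with a made-up two-column data set (hypothetical values, not part of the examples below): the dimension with the huge range dominates the Euclidean distances unless we rescale first.
# Illustrative only: one huge-range and one unit-range dimension
z <- cbind(runif(5, 0, 1e6), runif(5, 0, 1))
round(as.matrix(dist(z)), 1)          # distances driven almost entirely by the first column
round(as.matrix(dist(scale(z))), 2)   # after standardizing each column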
Also, the choice of metric dictates the position of the centres, as we shall see later.
Now, we define a function to compute 2D distances:
d <- function(center,data){
  s1 <- t(apply(data,1,"-",center))   # subtract the centre from every data point
  s2 <- s1^2                          # square the differences...
  s3 <- apply(s2,1,sum)               # ... sum over the dimensions...
  d <- sqrt(s3)                       # ... and take the square root
  return(d)
}
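As a quick sanity check (illustrative only), here are the distances from the first randomly chosen centre to the first few points:
head(d(mock.data[rc[1],], mock.data))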
… and will assign each point to the closest centre (in the euclidean sense)
# We define a function to do that
E <- function(data,centres){
  # First some definitions...
  n.data <- dim(data)[1]
  n.dims <- dim(data)[2]
  n.cl <- dim(centres)[1]
  dists <- matrix(NA,n.data,n.cl)
  # Then, compute distances from each point to each cluster centre
  for (i in 1:n.cl) dists[,i] <- d(centres[i,],data)
  # ... and define the cluster as the index with the minimum distance
  cluster <- apply(dists,1,which.min)
  return(cluster)
}
centres <- mock.data[rc,]
# Call the function
cluster <- E(mock.data,centres)
plot(mock.data,pch=16,col=cluster)
points(mock.data[rc,],pch=16,col="seagreen",cex=2)
We see that this assignment produces a linear boundary defined by the points at the same distance from the two centres. But it does not look right. This is because we have only taken the first step of a \(k\)-means cycle.
The second step consists of recalculating the positions of the cluster centres given the cluster assignments (partition) above:
# We define a function to recalculate the centres...
M <- function(data,cluster){
  labels <- sort(unique(cluster))   # sort so that row i corresponds to cluster label i
  n.cl <- length(labels)
  n.dims <- dim(data)[2]
  means <- matrix(NA,n.cl,n.dims)
  # The new centre of each cluster is the mean of the points assigned to it
  for (i in 1:n.cl) means[i,] <- apply(data[cluster==labels[i],],2,mean)
  return(means)
}
means <- M(mock.data,cluster)
plot(mock.data,pch=16,col=cluster)
points(mock.data[rc,],pch=16,col="seagreen",cex=2)
points(means,pch=16,col="green",cex=2)
So far, we have made several decisions:
1. we have chosen to measure the (dis-)similarity between elements using the Euclidean distance;
2. we have decided to attach a label (cluster) depending on the minimum distance to the centres;
3. and finally, we have decided to recalculate the cluster centres as the mean of (and only of) the points with a given cluster label (we could have used the median, or the point closest to the mean or the median, or…).
Now, let us redo the cycle (steps 1 and 2) one more time:
cluster <- E(mock.data,means)
plot(mock.data,pch=16,col=cluster)
points(means,pch=16,col="seagreen",cex=2)
OK. If we do this over and over again, until one cycle does not produce any change in the cluster assignments and means, we have done a \(k\)-means clustering.
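Putting it all together, here is a minimal sketch of that full loop using the E() and M() functions defined above (it stops when the assignments no longer change):
centres <- mock.data[rc,]
cluster <- E(mock.data, centres)
repeat {
  centres <- M(mock.data, cluster)          # M-like step: recompute the centres
  cluster.new <- E(mock.data, centres)      # E-like step: reassign the points
  if (all(cluster.new == cluster)) break    # stop when nothing changes
  cluster <- cluster.new
}
plot(mock.data, pch=16, col=cluster)
points(centres, pch=16, col="seagreen", cex=2)
This is essentially what R's built-in kmeans() function does for us: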
km <- kmeans(mock.data,2)
plot(mock.data,col=km$cluster,pch=16)
points(km$centers,pch=16,col="seagreen",cex=2)
So far, so good…
But life is not 2D, and if Science only dealt with 2D problems, most of the field of multivariate statistics would be pointless.
Let us have a look at a real (yet still simple) data set:
data <- read.table("OGLE.dat",sep=",",header=T)
attach(data)
plot(logP,WI,pch=".")
The data set corresponds to the OGLE III survey of periodic variable stars in the Large Magellanic Cloud, and the plot above shows only two of the 6 attributes that describe each source:
the logarithm of the period of each cycle
the logarithm of the first Fourier term amplitude
the ratio of amplitudes of the first two Fourier terms
the phase difference between the first two Fourier terms
the V-I colour index
the (reddening free) Wessenheit index
The 6 attributes have been re-scaled between 0 and 1.
The educated eye sees the classical pulsators, some extrinsic variability, and spurious structure.
Unfortunately, the spurious structure overlaps with true variability (for example, \(\gamma\) Doradus stars and Slowly Pulsating B-type variables can have periods around 1 day). Therefore, careless cleaning (such as the one practised below) is just wrong.
The variability zoo is beyond the scope of these lectures, so you will have to have faith and believe me.
Just as a teaser, I cannot help but show you the HR diagram:
OK. Let us do a quick and dirty cleaning just so that you can play with the data…
mask <- (logP < 1-0.754 & logP > 1-0.76) | (logP < 1-0.827 & logP > 1-0.833) | (logP < 1-0.872 & logP > 1-0.873)
data <- data[!mask,]
attach(data)
plot(logP,WI,pch=".")
Now, let us try to apply \(k\)-means to the OGLE data set. How many clusters will we try to derive? So far, I have given you no heuristic to decide on the optimal number of clusters. So let us try 10, just for fun:
km10.ogle <- kmeans(data,10)
plot(logP,WI,pch=".",col=km10.ogle$cluster)
This is certainly too few. Let us try with 25 clusters:
set.seed(5)
km25.ogle <- kmeans(data,25,iter.max = 20)
plot(logP,WI,pch=".",col=km25.ogle$cluster)
points(km25.ogle$centers[,c(1,6)],col=seq(1,25),pch=16,cex=2)
OK: we see that eclipsing binaries are separated from the RR Lyrae stars. But the Long Period Variables sequences are mixed up, despite their being clearly separated in this 2D projection.
Let us try a different random seed, just to check that this was not a poor initialization:
set.seed(15)
km25.ogle <- kmeans(data,25,iter.max=20)
plot(logP,WI,pch=".",col=km25.ogle$cluster)
points(km25.ogle$centers[,c(1,6)],col=seq(1,25),pch=16,cex=2)
Ooooops. The LPVs are still mixed, and the RR Lyrae got mixed up now with the eclipsing binaries.
The problem is the inclusion of an attribute that is only useful for a subset of classes:
plot(logP,phi21,pch=".",col=km25.ogle$cluster)
What happens if I remove it?
set.seed(15)
km25.ogle <- kmeans(data[,-4],25,iter.max=20)
plot(logP,WI,pch=".",col=km25.ogle$cluster)
points(km25.ogle$centers[,c(1,5)],col=seq(1,25),pch=16,cex=2)
A bit better, but still unsatisfactory. One would wish to remove the dependency on the random initialization, and to have a guide as to how many clusters there actually are in the data. More on this later…
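As a teaser of that "later", one common heuristic (not part of the original analysis) is to run kmeans with several random starts (the nstart argument alleviates the initialization dependence) and plot the total within-cluster sum of squares against the number of clusters, looking for an "elbow". A minimal sketch, with arbitrary choices of the k range and nstart:
# Elbow plot: total within-cluster sum of squares versus number of clusters
wss <- sapply(2:30, function(k) kmeans(data[,-4], k, nstart=10, iter.max=20)$tot.withinss)
plot(2:30, wss, type="b", xlab="number of clusters", ylab="total within-cluster SS")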
OK. \(k\)-means is just a special case of a more general algorithm known as Expectation-Maximization (EM).
I will show you what it is by working out a practical example (learn by doing).
First, let me generate again a synthetic data set, tailored to highlight the main advantage of going from the particular (\(k\)-means) to the general:
set.seed(10)
setA <- rmvnorm(1000,c(0,0),matrix(c(3,0,0,3),2,2))
setB <- rmvnorm(1000,c(3,3),matrix(c(1,0.99,0.99,1),2,2))
setC <- rmvnorm(1000,c(-1,4),matrix(c(1,-0.9,-0.9,1),2,2))
mock.data <- rbind(setA,setB,setC)
plot(mock.data)
So this is our data set. It is two-dimensional to help illustrate the underlying concepts and the maths, but we could perfectly well try in more than 2D.
Our data set consists of 3000 points drawn from three normal (=Gaussian) distributions with varying shapes and orientations. The key point to keep in mind is that the three distributions overlap in 2D space.
Let us see the performance of our now-known \(k\)-means technique.
km <- kmeans(mock.data,3)
plot(mock.data,col=km$cluster,pch=16,cex=.5)
points(km$centers,pch=15,col="orange")
Not bad, but it clearly can be improved. We see that the distance threshold is a reasonable approach, but returns clusters of roughly the same area (not true for more than three clusters). Hence the inappropriate cluster assignments.
Let me now propose a different algorithm based on the concept of generative model. It naturally leads us to define the probability of the data given the model parameters, the so-called likelihood.
Since we know the true model underlying our data set, let us write down a log-likelihood function that would tell us what model parameters maximize the probability of our data set given the model.
\[\mathcal{N}(\vec{x};\vec{\mu},\Sigma)=\frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}}\cdot e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^T\Sigma^{-1}(\vec{x}-\vec{\mu})}\]
loglikNormal <- function(D,means,Sigmas,cluster){
  labels <- sort(unique(cluster))
  n.cl <- length(labels)
  loglik <- 0
  for (i in 1:n.cl)
  {
    # Log-density of the points assigned to cluster i under its Gaussian...
    logliki <- dmvnorm(x=D[cluster==labels[i],],mean=means[i,],sigma=Sigmas[,,i],log=TRUE)
    # ... summed over those points and accumulated over clusters
    loglik <- loglik+sum(logliki)
  }
  return(loglik)
}
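As an illustrative call (an assumption on my part, not part of the original notes: we reuse the 3-cluster k-means partition obtained above, with the per-cluster sample covariances):
Sigmas.km <- array(NA, dim=c(2,2,3))
for (i in 1:3) Sigmas.km[,,i] <- cov(mock.data[km$cluster==i,])   # sample covariance per cluster
loglikNormal(mock.data, km$centers, Sigmas.km, km$cluster)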
The loglikNormal function has four arguments: the data, the means and covariance matrices of the normal distributions, and… the cluster assignments! For three clusters and 3000 points in 2D, that amounts to 6+9+3000 = 3015 model parameters. And this is only 2D.
The space of parameters is HUGE: the means, the covariance matrices, and the partition! We cannot possibly test all combinations of these 3015 parameters. How could we conceivably explore the likelihood in this high-dimensional space?
Solution 1: the EM algorithm
(Solution 2 is… go Bayesian and use MCMC!)
Let us simplify the problem: assume that the three normals have the same (unknown) covariance, that the covariance is diagonal and that the two variances are equal.
Then, you have \(k\)-means! Except for a few subtleties…
Now, let us remove some constraints: the covariance matrices can be arbitrary, and we try to maximize the log-likelihood. Difficult eh? Then, let us use the EM algorithm.
The EM algorithm is an iterative algorithm to find (local) maxima of the likelihood function or the posterior distribution for models with a large number of latent variables. It alternates the following two steps:
the E(xpectation) step: given a set of model parameters, compute the expected values of the latent variables;
the M(aximization) step: given the latent values, compute the model parameters that maximize the likelihood.
rc <- sample(1:dim(mock.data)[1],3)
plot(mock.data,pch=16,cex=.5)
points(mock.data[rc,],pch=16,col="seagreen",cex=2)
# We start with a random selection of points as centres...
means <- mock.data[rc,]
Sigmas <- array(0,dim=c(2,2,3))
# And diagonal (unit) matrices for the covariances.
Sigmas[1,1,]=1
Sigmas[2,2,]=1
# And here is the function to compute the expected values of the
# latent variables:
E <- function(data,means,Sigmas)
{
  n.cl <- dim(means)[1]
  probs <- matrix(NA,nrow=dim(data)[1],ncol=n.cl)
  for (i in 1:n.cl)
  {
    probs[,i] <- dmvnorm(x=data,mean=means[i,],sigma=Sigmas[,,i])
  }
  cluster <- max.col(probs)
  return(cluster)
}
# OK. let us do it:
cluster <- E(mock.data,means,Sigmas)
points(mock.data,pch=16,cex=.5,col=cluster)
points(mock.data[rc,],pch=16,col="seagreen",cex=2)
We have accomplished the E-step, let us now complete the first cycle:
M <- function(data,cluster){
  n.cl <- length(unique(cluster))
  n.dims <- dim(data)[2]
  means <- matrix(NA,n.cl,n.dims)
  Sigmas <- array(NA,dim=c(n.dims,n.dims,n.cl))
  for (i in 1:n.cl)
  {
    # Maximum-likelihood estimates given the current assignments
    means[i,] <- apply(data[cluster==i,],2,mean)
    Sigmas[,,i] <- cov(data[cluster==i,])
  }
  return(list(means=means,covariances=Sigmas))
}
plot(mock.data,pch=16,cex=.5,col=cluster)
points(mock.data[rc,],pch=16,col="seagreen",cex=2)
parameters <- M(mock.data,cluster)
points(parameters$means,pch=16,col="orange",cex=2)
cluster <- E(mock.data,parameters$means,parameters$covariances)
plot(mock.data,pch=16,cex=.5,col=cluster)
points(parameters$means,pch=16,col="orange",cex=2)
Looks fine! Now, let us do it until a cycle results in no changes in either the model parameters or the cluster assignments:
for (i in 1:50)
{
  parameters <- M(mock.data,cluster)
  cluster.new <- E(mock.data,parameters$means,parameters$covariances)
  ndiff <- sum(cluster!=cluster.new)
  print(ndiff)
  cluster <- cluster.new
  plot(mock.data,pch=16,cex=.5,col=cluster)
  points(parameters$means,pch=16,col="orange",cex=2)
  if(ndiff==0) break
}
## [1] 221
## [1] 204
## [1] 189
## [1] 173
## [1] 143
## [1] 89
## [1] 44
## [1] 8
## [1] 6
## [1] 0
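For reference, a packaged implementation of Gaussian-mixture EM is available in the mclust package (an aside on my part, not used in the rest of these notes); a minimal sketch:
library(mclust)
gmm <- Mclust(mock.data, G = 3)                           # fit a 3-component Gaussian mixture by EM
plot(mock.data, pch=16, cex=.5, col=gmm$classification)   # colour by the most probable component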
Let us now turn to a density-based algorithm: DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Here is the pseudo-code:
DBSCAN(DB, dist, eps, minPts) {
    C = 0                                              /* Cluster counter */
    for each point P in database DB {
        if label(P) ≠ undefined then continue          /* Previously processed in inner loop */
        Neighbors N = RangeQuery(DB, dist, P, eps)     /* Find neighbors */
        if |N| < minPts then {                         /* Density check */
            label(P) = Noise                           /* Label as Noise */
            continue
        }
        C = C + 1                                      /* Next cluster label */
        label(P) = C                                   /* Label initial point */
        Seed set S = N \ {P}                           /* Neighbors to expand */
        for each point Q in S {                        /* Process every seed point */
            if label(Q) = Noise then label(Q) = C      /* Change Noise to border point */
            if label(Q) ≠ undefined then continue      /* Previously processed */
            label(Q) = C                               /* Label neighbor */
            Neighbors N = RangeQuery(DB, dist, Q, eps) /* Find neighbors */
            if |N| ≥ minPts then {                     /* Density check */
                S = S ∪ N                              /* Add new neighbors to seed set */
            }
        }
    }
}
RangeQuery(DB, dist, Q, eps) {
    Neighbors = empty list
    for each point P in database DB {     /* Scan all points in the database */
        if dist(Q, P) ≤ eps then {        /* Compute distance and check epsilon */
            Neighbors = Neighbors ∪ {P}   /* Add to result */
        }
    }
    return Neighbors
}
library("dbscan")
cluster <- dbscan(data[,-4],minPts = 5, eps=0.02)
mask <- cluster$cluster != 0
plot(data[,c(1,6)],pch=".")
points(data[mask,c(1,6)],pch=16,cex=.2,col=(cluster$cluster[mask]+1))
plot(data[,c(1,2)],pch=".")
points(data[mask,c(1,2)],pch=16,cex=.2,col=(cluster$cluster[mask]+1))
plot(data[,c(1,3)],pch=".")
points(data[mask,c(1,3)],pch=16,cex=.2,col=(cluster$cluster[mask]+1))
plot(data[,c(1,4)],pch=".")
points(data[mask,c(1,4)],pch=16,cex=.2,col=(cluster$cluster[mask]+1))
plot(data[,c(1,5)],pch=".")
points(data[mask,c(1,5)],pch=16,cex=.2,col=(cluster$cluster[mask]+1))
cluster <- optics(data[,-4],minPts = 15, eps=0.02)
#res <- optics_cut(cluster, eps_cl = 0.0195)
res <- extractDBSCAN(cluster, eps_cl = 0.0195)
mask <- res$cluster != 0
column <- 6
plot(data[,c(1,column)],pch=".")
points(data[mask,c(1,column)],pch=16,cex=.5,col=res$cluster[mask]+1)
plot(data[,c(1,2)],pch=".")
points(data[mask,c(1,2)],pch=16,cex=.5,col=(res$cluster[mask]+1))
plot(data[,c(1,3)],pch=".")
points(data[mask,c(1,3)],pch=16,cex=.2,col=(res$cluster[mask]+1))
plot(data[,c(1,4)],pch=".")
points(data[mask,c(1,4)],pch=16,cex=.2,col=(res$cluster[mask]+1))
plot(data[,c(1,5)],pch=".")
points(data[mask,c(1,5)],pch=16,cex=.2,col=(res$cluster[mask]+1))
apply(data,2,range)
## logP logA11 R21 phi21 V.I WI
## [1,] 0 0.0000000 0 2.429557e-05 0.0000000 0
## [2,] 1 0.9941714 1 9.999900e-01 0.9907912 1
Now, I would like you to play with the dbscan or optics parameters and see if you can find a way to separate all variability types.
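A hint to get you started (a sketch, with an arbitrary choice of k): the dbscan package provides kNNdistplot(), which plots the sorted distance to the k-th nearest neighbour of every point; a "knee" in that curve is a common guide for choosing eps.
kNNdistplot(data[,-4], k = 5)
abline(h = 0.02, lty = 2)   # the eps value used above, for reference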
When I had to deal with this data set, I actually used a different algorithm known as Hierarchical Mode Association Clustering (HMAC).
Scheme:
Let us apply it to the mock data with 3 Gaussian components that overlap in 2D:
plot(mock.data)
library("kohonen")
##
## Attaching package: 'kohonen'
## The following object is masked from 'package:maps':
##
## map
library("fields")
cluster <- som(mock.data)
plot(cluster, type="code", main="Main")
plot(cluster, type="changes", main="Changes")
plot(cluster, type = "property", property = cluster$codes[[1]][,2], main=names(cluster$data)[2], palette.name=tim.colors)
plot(cluster, type="dist.neighbours")
plot(cluster, type="count")
Agglomerative hierarchical clustering starts with one cluster per data point and, at each step, merges the two clusters that are closest together, with the distance between clusters measured as one of:
single linkage: the minimum distance amongst all pairs of points with one point in cluster \(C_i\) and the other in cluster \(C_j\);
average linkage: the average distance over all such pairs;
complete linkage: the maximum distance over all such pairs;
centroid linkage: the distance between the centres of mass of the two clusters, or the data points closest to them;
minimum energy linkage: similar to average linkage, but with a penalization term that includes the intra-cluster variances of \(C_i\) and \(C_j\);
Ward's criterion: merge the two clusters that yield the smallest increase in the total intra-cluster variance.
sample <- sample(dim(data)[1],10000)
dists <- dist(data[sample,])
#dists <- dist(data)
clusters <- hclust(dists, method = 'average')
#clusters <- hclust(dists, method = 'ward.D')
#clusters <- hclust(dists, method = 'ward.D2')
#clusters <- hclust(dists, method = 'single')
#clusters <- hclust(dists, method = 'complete')
#clusters <- hclust(dists, method = 'centroid')
#clusters <- hclust(dists, method = 'median')
#clusters <- hclust(dists, method = 'mcquitty')
plot(clusters)
clusterCut <- cutree(clusters, k=15)
#plot(data[,1],data[,6], col=clusterCut,pch=16,cex=.3)
plot(data[sample,1],data[sample,6], col=clusterCut,pch=16,cex=.3)
clusterCut <- cutree(clusters, k=200)
#plot(data[,1],data[,4], col=clusterCut,pch=16,cex=.3)
plot(data[sample,1],data[sample,4], col=clusterCut,pch=16,cex=.3)
dists <- dist(data[sample,-4])
#dists <- dist(data[,-4])
clusters <- hclust(dists, method = 'average')
clusterCut <- cutree(clusters, k=200)
#plot(data[,1],data[,4], col=clusterCut,pch=16,cex=.3)
plot(data[sample,1],data[sample,4], col=clusterCut,pch=16,cex=.3)
clusters <- hclust(dists, method = 'single')
clusterCut <- cutree(clusters, k=200)
#plot(data[,1],data[,4], col=clusterCut,pch=16,cex=.3)
plot(data[sample,1],data[sample,4],pch=16,cex=.3, col=clusterCut)
First, some definitions:
Subspace clustering starts from the hypothesis that, although the data set has been drawn from a multi-variate distribution in D dimensions (usually \(\mathbb{R}^D\)), each cluster lives in a linear or affine subspace of the original space. Imagine for example, a data set in 3D, where the data points lie in
two planes for clusters 1 & 2
a line for cluster 3
The data set is the union of sets from each of the subspaces. The objective then is to find at the same time the cluster assignments, and the subspace definitions.
In mathematical terms, each subspace is defined as
\[ S_i = \{x \in \mathbb{R}^D: x=\mathbb{\mu}_i+U_i\cdot\mathbb{y}\} \]
where \(U_i\) is a matrix with the subspace basis vectors as columns, \(\mathbb{y}\) are the new coordinates in that basis, and \(\mathbb{\mu}_i\) is the centre (in the case of affine subspaces). Our goal is to find the \(\mathbb{\mu}_i\), \(U_i\), the dimensionality \(d_i\) of each subspace, and the cluster labels.
This can be achieved in a number of ways (see [this introduction](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.225.2898) for a comprehensive, yet not particularly clear, overview):
Algebraic methods
Generative methods
Spectral methods
Here I will only explain the iterative approach, which is the simplest, although some of its features are shared amongst the various methods.
In the iterative approach, we start from a (possibly random) initial clustering (obtained, for example, with \(k\)-means). Then, we alternate two steps: fitting a low-dimensional subspace (for example, by PCA) to the points currently assigned to each cluster, and reassigning each point \(\mathbb{x}_j\) to the cluster \(i\) whose subspace minimizes the residual
\[ ||\mathbb{x}_j-(\mathbb{\mu}_i+U_i\cdot\mathbb{y}_j)|| \]
Problems:
the subspaces are linear
we have to have an initial clustering
we need to fix a priori the number of clusters and the dimensionality of the subspaces
library("orclus")
cluster <- orclus(data,k=10,l=3,k0=50,inner.loops=20)
## iteration : 1
## Initialization with 50 clusters
## Warning: did not converge in 10 iterations
## Actual Subspace dimension : 6
## New number of clusters : 25
##
## iteration : 2
## Actual Subspace dimension : 4
## New number of clusters : 12
##
## iteration : 3
## Actual Subspace dimension : 3
## New number of clusters : 10
##
## Final reassigment...
plot(data[,1],data[,6],col=cluster$cluster,pch=".")
for(i in 1:length(cluster$size)) {
  plot(as.matrix(data) %*% cluster$subspaces[[i]], col = cluster$cluster,
       ylab = paste("Identified subspace for cluster",i))
}
#for(i in 1:length(cluster$size)) {
for(i in 1:5) {
  mask <- cluster$cluster==i
  plot(data[,1],data[,6],col="black",pch=".")
  points(data[mask,1],data[mask,6],col="red",pch=16,cex=.5)
}
Spectral clustering can be better understood if we view the data set as an undirected graph. In the graph, data points are vertices that are connected by links with strength or weight. The weight is supposed to reflect the similarity between two vertices (points).
In principle, the graph can be constructed including links between all pairs of points, although this may be too computationally expensive (imagine the case of Gaia with \(10^9\) points: we would need \(10^{18}\) links). If we need to simplify the affinity matrix, we can include only links to points within an \(\epsilon\)-hyperball, or only to the \(k\)-nearest neighbours (this needs to be made symmetric).
Once we have the vertices and links/arcs, there are several ways to decide how we measure affinity or similarity (the weights). In the \(\epsilon\)-neighbourhood, there is no strong need to define a weight (they could all be set to one). But in the fully connected graph, we need a measure of affinity that has to be positive and symmetric for all pairs of points \(\mathbb{x}_i\) and \(\mathbb{x}_j\). The most popular by far is the Gaussian kernel:
\[s_{ij}=s(\mathbb{x}_i,\mathbb{x}_j) = K(\mathbb{x}_i,\mathbb{x}_j) = \exp(\frac{-||\mathbb{x}_i-\mathbb{x}_j||^2}{2\sigma^2})\]
Let \(S=\{s_{ij}\}\) be the similarity (or adjacency) matrix of the graph.
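A small numerical sketch of this matrix (assumptions: a random subsample to keep it manageable, and an arbitrary kernel width):
sub.idx <- sample(nrow(data), 500)       # random subsample
sub <- as.matrix(data[sub.idx, -4])
sigma <- 0.05                            # kernel width: a free parameter, arbitrary here
S <- exp(-as.matrix(dist(sub))^2 / (2*sigma^2))   # Gaussian-kernel similarity matrix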
Now it would be time to look at the kernel trick but we do not have time. There are other choices of kernel to measure similarity.
So, in summary we are left with a graph which we want to partition. We would like to divide the graph into disjoint sets of vertices such that vertices within one partition have strong links, while at the same time links that cross cluster frontiers have weak weights. How do we do this? The answer has to do with Laplace…
One obvious (and potentially wrong) way to partition the graph is to find the cluster assignments that minimize
\[ \mathcal{W}=\sum_{i=1}^{k} S(C_i,\neg{C_i}), \qquad S(C_i,\neg{C_i})=\sum_{j\in C_i,\,l\in \neg{C_i}} s(j,l), \]
where the \(C_i\) are the clusters in our partition. It is potentially wrong because in many scenarios the optimal solution is to split off a single point. Hence, we need to bias the minimization process to avoid such singleton clusters. We could minimize \(\mathcal{W}\) divided (scaled) by the cluster size (which we can measure by 1. the number of vertices or 2. the so-called volume of the cluster).
Minimizing these scaled cuts leads, after relaxation, to eigenvalue problems involving the graph Laplacian \(L=D-S\), where \(D\) is the diagonal degree matrix formed by
\[d_{i}=\sum_{j=1}^{n}s(i,j) \]
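Continuing the sketch above (with the similarity matrix S computed on the subsample), the degree matrix and the unnormalized graph Laplacian would be:
D <- diag(rowSums(S))   # degree matrix: d_i on the diagonal
L <- D - S              # unnormalized graph Laplacian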
A subset \(A \subset V\) of a graph is connected if any two vertices in \(A\) can be joined by a path such that all intermediate points also lie in \(A\).
A subset \(A\) is called a connected component if it is connected and if there are no connections between vertices in \(A\) and \(\bar{A}\).
The nonempty sets \(A_1, A_2, ..., A_k\) form a partition of the graph if \(A_i\cap A_j=\emptyset\) and \(A_1\cup A_2\cup ... A_k=V\).
The (justified) hope is that the new space of features (the coordinates of the points in the leading eigenvectors of the Laplacian) is more separable than the original one.
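A sketch of that final step of (unnormalized) spectral clustering on the subsample above: embed each point using the eigenvectors associated with the k smallest eigenvalues of L, and run k-means in that space. In practice one would use a normalized Laplacian or a packaged implementation such as the kernlab call commented out below.
k <- 5                                       # arbitrary choice of number of clusters
ev <- eigen(L, symmetric = TRUE)             # eigenvalues returned in decreasing order
U <- ev$vectors[, (ncol(L)-k+1):ncol(L)]     # eigenvectors of the k smallest eigenvalues
cl <- kmeans(U, k)$cluster                   # k-means in the spectral embedding
plot(sub[,1], sub[,5], col = cl, pch = 16, cex = .5)   # logP vs WI for the subsample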
#library(kernlab)
##cluster <- specc(as.matrix(data[sample,]),centers=5,kernel="rbfdot",kpar="automatic")
#cluster <- specc(as.matrix(data),centers=5,kernel="rbfdot",kpar="automatic")
#plot(data[,1],data[,6],col=cluster,pch=".")
It is important to remark again that the features used in a clustering analysis have to be put on a common scale in order to avoid the clustering results being dominated by a few variables with very large ranges. What this scale is depends on your problem and clustering algorithm. The most common choices are:
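Two widespread choices are standardization (zero mean and unit variance per variable) and min-max rescaling to [0, 1]. A minimal sketch in R (the OGLE attributes above were already min-max scaled):
X.std <- scale(data)                                                    # zero mean, unit variance per column
X.01  <- apply(data, 2, function(x) (x - min(x)) / (max(x) - min(x)))   # min-max rescaling to [0,1]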
Once we have corrected for scale differences, we can look at common measures of (dis)similarity for non-continuous variables. For binary variables, you can use the classical measures:
where \(S_{ij}\) is the number of elements that are i in vector 1 and j in vector 2. For example, for vectors (0,1,1,1,1,0,1) and (1,1,1,0,1,0,0), \(S_{00}=1\), \(S_{01}=1\), \(S_{10}=2\), and \(S_{11}=3\).
A very important addition is the Hamming distance. “In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.” (From Wikipedia.) This will prove useful for categorical variables. It is usually normalized by dividing by the number of vector elements (the length).
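A minimal sketch, using the two binary vectors of the example above:
hamming <- function(a, b) sum(a != b) / length(a)   # normalized Hamming distance
hamming(c(0,1,1,1,1,0,1), c(1,1,1,0,1,0,0))         # 3/7: mismatches at positions 1, 4 and 7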
Each similarity measure has an associated dissimilarity measure:
The Hamming distance can also be used to measure the (dis)similarity between vectors of categorical variables. In particular, we could use it to measure the distance between strings: d(“house”,“mouse”)=1. It can even be applied to strings of different lengths if we count the symbols in both strings.
An alternative to the Hamming distance for vectors of categorical variables is binarization: convert a categorical variable like colour or profession into a series of binary variables. This requires knowledge of the universe of possible values. For example, if we have a variable that can take values in the set {teacher, businessman, salesman, tertiary}, we can cast it into four binary values that take the value 1 if and only if the categorical value coincides. Order here is important: assuming the order above, the case of a salesman would be characterized by the following four binary values: (0,0,1,0). Of course, this would be even more useful in the case where a variable can simultaneously take several values.
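A sketch of this binarization for the example above (model.matrix builds one indicator column per factor level, in the order of the levels):
profession <- factor("salesman",
                     levels = c("teacher","businessman","salesman","tertiary"))
model.matrix(~ profession - 1)   # indicator columns: (0, 0, 1, 0)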
For ordered variables like rankings, we have an alternative: treating them as intervals. Take, for example, the case where we have 6 different categories in a company (trainee, temporary, stable, supervisor, coordinator, chief). The simplest approach would be to assign a natural number to each, from 0 to 5, and then rescale the range to [0,1] by dividing by the maximum. Of course, we are free to use non-equidistant schemes if, say, chief and coordinator are closer to one another than trainee and temporary. This is analogous to using unequal bins in histograms (distances between intervals).
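A sketch of the equidistant scheme for that example:
grade <- factor("supervisor",
                levels = c("trainee","temporary","stable","supervisor","coordinator","chief"),
                ordered = TRUE)
(as.integer(grade) - 1) / (nlevels(grade) - 1)   # ranks 0..5 rescaled to [0,1]: 0.6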
Now, if we have a mixture of several types of variables, we can use some kind of weighting scheme:
\[ d(i,j)=\frac{\sum_{k=1}^K w_k\cdot d^{(k)}(i,j)}{\sum_{k=1}^K w_{k}} \]
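In practice, this kind of weighted mixed-type dissimilarity is what the Gower coefficient computes; a hedged sketch with cluster::daisy (the data frame and weights below are hypothetical):
library(cluster)
# mixed.df: a hypothetical data frame mixing numeric and factor columns; w: per-variable weights
# d.mixed <- daisy(mixed.df, metric = "gower", weights = w)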
The \(k\)-prototypes algorithm iteratively recomputes cluster prototypes and reassigns clusters (in the spirit of EM). Clusters are assigned using the distance
\[ d(x, y) = d_{\rm euclid} (x, y) + \lambda\cdot d_{\rm simplematching}(x, y), \]
where \(d_{\rm simplematching}\) is the Sokal-Michener similarity measure described above.
Cluster prototypes are computed as cluster means for numeric variables and modes for factors (cf. Huang, 1998).
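A hedged sketch of this \(k\)-prototypes procedure using the clustMixType package (an assumption on my part, not used elsewhere in these notes; the data frame is hypothetical):
library(clustMixType)
# mixed.df: a hypothetical data frame with numeric and factor columns
# kp <- kproto(mixed.df, k = 4)   # the weight lambda can also be set explicitly
# kp$cluster                      # cluster assignments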
The center of a cluster of variables is a synthetic variable but is not a ’mean’ as for classical k-means. This synthetic variable is the first principal component calculated by PCAmix. PCAmix is defined for a mixture of qualitative and quantitative variables and includes ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA) as special cases.
The correlation ratio is defined as
\[ \eta^2 = \frac{\sum_{x} n_x\,(\bar{y}_x-\bar{y})^2}{\sum_{x}\sum_{i} (y_{xi}-\bar{y})^2}, \]
where \(\bar{y}\) is the overall mean, \(\bar{y}_x\) is the mean of cluster (category) \(x\), \(n_x\) is the number of cases in that cluster, and \(y_{xi}\) is the value of the i-th case in cluster \(x\).
The correlation ratio \(\eta\) takes values between 0 and 1. The limit \(\eta\) =0 represents the special case of no dispersion among the means of the different categories, while \(\eta =1\) refers to no dispersion within the respective categories. \(\eta\) is undefined when all data points of the complete population take the same value.
The homogeneity of a cluster of variables is defined as the sum of the correlation ratio (for qualitative variables) and the squared correlation (for quantitative variables) between the variables and the center of the cluster, which is in all cases a numerical variable.
Missing values are replaced by means for quantitative variables and by zeros in the indicator matrix for qualitative variables.
Ascendant hierarchical clustering of a set of variables. Variables can be quantitative, qualitative or a mixture of both. The aggregation criterion is the decrease in homogeneity for the clusters being merged.
The homogeneity of a cluster is the sum of the correlation ratio (for qualitative variables) and the squared correlation (for quantitative variables) between the variables and the center of the cluster which is the first principal component of PCAmix. PCAmix is defined for a mixture of qualitative and quantitative variables and includes ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA) as special cases. Missing values are replaced by means for quantitative variables and by zeros in the indicator matrix for qualitative variables.
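This description follows the clustering-of-variables approach implemented in the ClustOfVar package (an assumption on my part); a hedged sketch with hypothetical inputs:
library(ClustOfVar)
# X.num: quantitative variables, X.fac: qualitative variables (both hypothetical here)
# tree <- hclustvar(X.quanti = X.num, X.quali = X.fac)
# plot(tree)                       # dendrogram of the variables
# part <- cutreevar(tree, k = 3)   # cut into 3 clusters of variables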
The similarity measure most commonly used for transactions is the Jaccard index, \(J(A,B)=\frac{|A\cap B|}{|A\cup B|}\).
So what is ROCK? RObust Clustering using linKs. It is a hierarchical clustering algorithm based on the link function above. It iteratively merges the two clusters that maximize the following objective function:
\[ g(C_i,C_j)=\frac{{\rm link}(C_i,C_j)}{(n_i+n_j)^{1+2f(\theta)}-n_i^{1+2f(\theta)}-n_j^{1+2f(\theta)}}, \]
where \(n_i\) and \(n_j\) represent the number of elements in clusters \(i\) and \(j\), the numerator represents the number of links between both clusters considering all potential pairs, and the denominator represents the increment in the expected number of links after merging. \(f(\theta)\) is a function such that \(n^{f(\theta)}\) represents the expected number of neighbours of a point in a cluster of size \(n\).
It turns out that it is sufficient for \(f(\theta)\) to be a reasonable function to produce sensible results. For example, for market transactions, \(f(\theta)=\frac{1-\theta}{1+\theta}\) is a good choice.
Once clustering is complete, we assign the remaining data points (read from disk) by determining which cluster contains the most neighbours of each point (normalised by the expected number of neighbours).
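A hedged sketch assuming the ROCK implementation in the cba package (the binary transaction matrix x below is hypothetical):
library(cba)
# x: a hypothetical logical (binary) transaction-by-item matrix
# rc <- rockCluster(x, n = 10, theta = 0.5)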
Since purchases contain different numbers of items, some rows have NA (not available) values.
We want to find patterns as transaction types, not to model the customer.
I want a piece of work of 10 one-sided A4 pages that distils the knowledge you have gained by applying unsupervised classification techniques to the cases matrix. It must include an exploratory analysis of the variables, a discussion of the variables used and why, and a discussion of the chosen algorithm(s) and of the results.