In Chapter 2 we saw how clustering methods such as the K-means algorithm and hierarchical clustering can be used to identify data points that are similar to each other, so that they belong to the same group (cluster) of points. However, when dealing with data (especially large data sets), the algorithms discussed so far have some limitations. We want to apply clustering algorithms whilst making the fewest assumptions possible and without having to specify the number of clusters before starting.
Also, we want clustering algorithms to discover clusters of arbitrary shape and to be efficient on large data sets. Although the algorithms we have already seen offer solutions to some of these problems, none of them offers a solution to all of them. This is why in this chapter we will look at more advanced clustering techniques, which may provide better insight into the data and overcome these problems.

3.1 Density-Based Clustering (DBSCAN)

In Section 3.1 I have used pages 226-231 of M. Ester, H. Kriegel, J. Sander and X. Xu's paper, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise [9], as a guideline for my definitions and discussions.
If we look at Figure 3.1, we see the result of the K-means algorithm with K = 5 applied to a synthetic data set designed to expose the limitations of K-means clustering. Looking at the plot without the clustering, there seems to be some noise within the data but also five distinct clusters: two elliptical clusters, two linear clusters and a small cluster in the bottom right-hand corner. We recognise these clusters because of the high density of data points in a certain area, and we regard data points as noise where the density is much lower. However, the clustering of the data in Figure 3.1 does not match these expected clusters and is clearly interpreting the data incorrectly, hence we should use a different clustering algorithm to group this data.

We will now look in particular at DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based clustering algorithm used to identify clusters of arbitrary shape in data containing noise and outliers. In this algorithm, each data point that belongs to a cluster has at least a minimum number of points within a neighbourhood of a given radius of itself. If a data point does not belong to a cluster, the neighbourhood of the point does not contain the required number of points. Although we can use any appropriate similarity measure (Section 2.1) when defining the µ-neighbourhood of a point, we will use the Euclidean distance from here onwards in this section.

Definition 10 (µ-neighbourhood of a point). The µ-neighbourhood of a point p is defined by

N_µ(p) = {q ∈ D | d(p, q) < µ},    (3.1)

where D is the data set. Any point which lies in the µ-neighbourhood of p is called a neighbour of p.

Therefore, to start the DBSCAN algorithm we need two parameters, µ and n, where µ is the radius of the neighbourhood around a point p and n is the minimum number of points that must lie in that neighbourhood for the point to feature in a cluster. Any point p with at least n neighbours is called a core point. If the neighbour count is less than n but the point belongs to the µ-neighbourhood of some core point r, the point is called a border point. If a point is neither of these, it is called a noise point or an outlier. Before we state the algorithm, we define three terms that are required for understanding DBSCAN.

Definition 11 (Directly density reachable). A point p is directly density reachable from another point q if p is in the µ-neighbourhood of q and q is a core point.

Definition 12 (Density reachable). A point p is density reachable from q if there exists a chain of points leading from q to p in which each point is directly density reachable from the previous one.

Definition 13 (Density connected). Two points p and q are called density connected if there exists a core point r such that both p and q are density reachable from r.

Now that we have these three definitions, we can introduce the DBSCAN algorithm, which is stated below.

DBSCAN algorithm

1. For each point p in the data, find all of its neighbours. If a point p has at least n neighbours, mark p as a core point.

2. For each core point that is not already assigned to a cluster, create a new cluster, then find all of the points that are density connected to it and assign them to the same cluster as the core point.

3. Treat every point that is neither a core point nor a border point as an outlier or noise.
If we have a border point in the data, it may belong to more than one cluster; in this case the point will be assigned to the cluster that was discovered first. Apart from this, however, the result of DBSCAN is independent of the order in which the points are visited.

In an ideal situation we would know µ and n for each cluster and at least one point from each cluster. With this information, we could find an accurate clustering of the data by using density reachability. However, there is usually no way of getting this information in advance. We can, though, determine the parameters of the least dense cluster which is not considered to be noise, and these are then a good candidate for the parameters of the algorithm.

To do this we first look at finding the value of µ. Note that the DBSCAN algorithm is very sensitive to the choice of µ, particularly when clusters have different densities. For example, if µ is too large, denser clusters may merge together, and if µ is too small, sparse clusters may be considered to be noise. This means that if there is a large difference in cluster densities, a single value of µ may not suffice. However, for the time being we will try to find a single optimal value for µ.

Firstly, we calculate the distances between every point in the data and its K nearest neighbours (K-NN), and take the average of these distances for each point, where K is given by n. We then sort these average distances into ascending order and plot them. On this graph we look for the point which corresponds to the optimal µ: all points to the left of it should correspond to core points and all points to the right of it to noise. This means we are looking for a knee in the curve, the threshold where a sharp change occurs along the K-distance curve, and this in turn corresponds to the value of µ. It turns out that K-distance graphs for K > 4 do not differ greatly from the 4-distance graph, therefore for 2-dimensional data we eliminate this parameter by setting n = 4.

Figure 3.2: Finding the optimal value for µ

Looking at Figure 3.2 we can see that the knee of the data is at 0.15, therefore when applying the DBSCAN algorithm we set µ = 0.15 and n = 4. Now that we have the two parameter values we need, let's apply the algorithm to the data. Figure 3.3 shows us the visualization of the DBSCAN algorithm. We can see that this is a much more appropriate clustering than that provided by the K-means algorithm, which is shown in Figure 3.1. Since the DBSCAN algorithm can group points into arbitrary shapes, it can distinguish what appear to be the correct clusters within the data, whereas the K-means algorithm was not able to do this. If we look at Figure 3.1, the inaccuracies of the plot all seem to be accounted for by the K-means algorithm not being able to distinguish arbitrary shapes. Also note how the K-means algorithm includes every single point in a cluster, whereas the black points in the DBSCAN plot account for outliers, meaning not every point is clustered. This in turn leads to a more robust result, as it is less likely to be affected by unusually behaving data.

Figure 3.3: Visualization of DBSCAN algorithm
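As a sketch of how the value µ = 0.15 might be found and used in practice, the code below plots the sorted average 4-NN distances and then fits DBSCAN with the chosen parameters using scikit-learn. The array X is assumed to be the data set shown in Figure 3.1, and the values µ = 0.15 and n = 4 come from the discussion above; the variable names and plotting details are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

# X is assumed to be the (m, 2) array of points from Figure 3.1.
n = 4

# Distance from each point to its n nearest neighbours; the closest
# "neighbour" returned is the point itself at distance 0, so ask for n + 1.
nn = NearestNeighbors(n_neighbors=n + 1).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = dists[:, 1:].mean(axis=1)   # average distance to the n nearest neighbours

# Sorted K-distance plot: the knee suggests the radius µ (here about 0.15).
plt.plot(np.sort(k_dist))
plt.xlabel(f"points sorted by average {n}-NN distance")
plt.ylabel(f"average {n}-NN distance")
plt.show()

# Fit DBSCAN with the parameters read off the plot; label -1 marks noise.
labels = DBSCAN(eps=0.15, min_samples=n).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points labelled as noise:", int(np.sum(labels == -1)))
```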
Reviewing DBSCAN, we can say it is a very good clustering algorithm to use when the data involves a lot of noise and outliers. It is also more appropriate than algorithms such as the K-means algorithm when the clusters are of arbitrary shape and not just spherical. Finally, unlike the algorithms we have seen before, at no point do we have to specify the number of clusters to split the data into, which means less prior knowledge of the data is needed. However, as we have seen, the algorithm is very sensitive to the choice of µ, and we need to try to choose the most appropriate value for it.
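To illustrate this sensitivity, a quick experiment is to refit DBSCAN over a range of radii and watch how the number of clusters and noise points changes; again X is assumed to be the data set from this section, and the radii tried here are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# X is assumed to be the data set used above; n = 4 as before.
for eps in [0.05, 0.10, 0.15, 0.25, 0.50]:
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    # Too small a radius turns sparse clusters into noise;
    # too large a radius merges neighbouring dense clusters.
    print(f"µ = {eps:.2f}: {n_clusters} clusters, {n_noise} noise points")
```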