# Clustering

Contents

• Introduction to Clustering
• Hierarchical Clustering techniques
• K means Clustering Techniques
• Fuzzy C means Clustering techniques
• Applications of Clustering

Clustering Analysis-

Clustering algorithm helps us to cluster the objects based on any common feature and arrange them into different clusters.

The different aspects that it makes clusters can be anything depending on shape, color or size. Fig: Various types of objects

For example, here we see different objects. The objects like apple and a box have same color. And therefore it can be clustered based on the color category. They can also be clustered on the basis of shapes. The algorithm takes up the patterns of the objects and cluster them based on the similar pattern.

How do we know how much clusters we will be needing?

It depends on the minimum threshold value. Once that value is achieved we won’t make more clusters.

What is clustering?

Clustering is the classification of objects into different groups, or more precisely, the partitioning of data set into subsets (Clusters) so that the data in each subset (ideally) shares some common trait- often according to some defined distance measure.

Types of clustering are:

1. Hierarchical Clustering-

A. Agglomerative (bottom-up)
-Computes all pair-wise pattern-pattern similarity coefficients. Ie. the difference between objects within cluster is minimum.
-Place each of n patterns into a class of its own.
-Merge the two most similar clusters into one – Replaces the two clusters into the new cluster and re-computes inter-cluster similarity scores w.r.t. The new cluster.
-Repeat the above step until there are k clusters left.(k can be 1)
-Each data is assigned as an individual cluster. K is having the maximum value and from there we go to a minimum value for k. Minimum value of k can be minimum distance between a cluster.

Agglomerative Clustering (Bottom Up)-
The nearest 2 points are merged to form a cluster. In this we combine the two closest points to form the first cluster. Similarly, we keep on forming clusters until we reach to a k point where all clusters are done. So, when you get merge the similar and close points, you can make clusters completely. Finally these k clusters are left.

B. Divisive Approach (top-down)
– Start at the top with all the patterns in one cluster
– The cluster is split using a flat clustering algorithm
– This procedure is applied recursively until each pattern is in its own singleton cluster. We have k=2, then we split it to k=4, and then k=6. From k=1 to k= maximum. We apply hierarchical approach only for small data sets and not for large data sets.

1. K means clustering
– Each cluster is associated with a centroid.
– Each data object is assigned to closet centroid.
– The centroid of each cluster is then updated based on the data objects assignment to the cluster.
– Repeat the assignment and update steps until convergence.
2. Fuzzy C-Clustering:
– An extension of k-means.
– Hierarchical, k means generates partitions. Each data object is assigned to closest centroid.
– Fuzzy-c means allow data points to be assigned into more than one cluster. Each data point has a degree of membership of belonging to each other.
– It is frequently used in pattern recognition. It is based on minimization of the following object function.

Fuzzy c means algorithm:
1. Let x(i) be a vector of values for data point g(i). Initialize membership U(0)= [u(ij)] for data point g(i) of cluster cl(j) by random.
2. At the kth step, compute the fuzzy centroid C(k) =[c(j)] for j=1,….,nc, where nc is the number of clusters, using where m is the fuzzy parameter and n is the number of data points.
3. Update the fuzzy membership U(k) = [u(ij)] , using 1. If ||U(k) – U(k-1)|| < , then STOP, else return to step2.
5. Determine membership cutoff – For each data point g assign g(i) to cluster cl(j) if u(ij) of               U(k)>

Application of Clustering-
• Data Mining
• Pattern Recognition
• Image Analysis
• Bioinformatics
• Voice mining
• Image processing
• Text mining
• Web cluster engines
• Weather report analysis

Multiple Clusters-
– Clustering

– K means Clustering Algorithm
– Example
– How to choose value of k?
– Applications of K means algorithm

We know that we want to try and partition data points  into the set of clusters based on the feature space. The data points that retain the closest point are merged into a cluster.

K means Clustering technique-
1. Choose k initial centroids- Suppose k=3.
2. Each cluster is associated with a centroid

1. Each data object is assigned to closest centroid.
2. The centroid of each cluster is then updated based on the data objects assignment to the cluster.
3. Repeat the assignment and update steps until convergence.

For example : K=3.

Step1: Make random assignments and compute centroids (big dots) Step2: Assign points to the nearest centroids. Step3: Recompute centroids (in this example, solution is now stable) K means setup-

• X1, x2, x3, ….. Xn are data points or vectors of observations
• Each observation (vector xi) will be assigned to one and only one cluster
• Dissimilarity measure: Euclidean distance metric
• K-means minimizes within cluster point scatter Where C(i) denotes cluster number for the ith observation
m(k) is the mean vector of the kth cluster
N(k) is the number of observations in the kth cluster.

The method works:

Step1 : Initialization-

• Randomly choose 2 centroids (k=2) for two clusters. For example: • In this case the two centroids are:
m(1) = (1.0,1.0) and
m(2) = (5.0,7.0)
• Finding the nearest centroid for every element-
For every element calculating its distance from center (Euclidian Distance)
This distance is calculated on the basis of this equation: Step2: Assigning elements to any of the cluster-

• Thus we obtain 2 clusters containing {1,2,3} and {4,5,6,7}. Once we have the two clusters then we can calculate the updated the centroid position, mean of the position of all elements in that particular cluster. So, C1= {1,2,3}= {1.0+1.5+3.0} and C2= {4,5,6,7} = {5.0+3.5+4.5+3.5}  This will be mean1 and mean 2 will be summation of second values C1m2={1.0+2.0+4.0} and C2m2={7.0+5.0+5.0+4.5} • So the new centroids are: Step3: Assigning elements to new clusters according to distance:

• Now using these centroids, m1={1.83,2.33} and m2={4.12,5.38}. Based on these values we calculate the Euclidean distance of each object as shown.
As we observe, that the value for data set3 is larger for C1 than in C2. So, with the updated cluster version the value for data set 3 will lie in C2.
Therefore, new clusters are: C1={1,2} and C2={ 3,4,5,6,7}
Next Centroids are:
m1=(1.25,1.5) and m2=(3.9,5.1)

Step4: Convergence

• Now using the centroids m1=(1.25, 1.5) and m2= (3.9,5.1) we compute the Euclidean distance of each object as shown.
• Therefore, new clusters are:{1,2} and {3,4,5,6,7}. Therefore, no change in the cluster.
• Thus algorithm comes to a halt here and final result consists of 2 clusters {1,2} and {3,4,5,6,7}

How to choose value of k?

1. How do we choose K depends on the distance between elements of a particular cluster and centroid of the same minimizes. This is one more clustering method called EM.
2. Run algorithm on data with several different values of k.
3. Use the prior knowledge about the characteristics of the problem.

Applications of K means clustering-

1. It is relatively faster and efficient. It computes result at 0(t n), where n is the number of objects or points, k is the number of clusters and t is the number of iterations.
2. K-means clustering can be applied to machine learning or data mining.
3. Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization for Image Segmentation)
4. Also used for choosing color palettes on old fashioned graphical display devices and Image Quantization.

This means k-means algorithm is useful for undirected knowledge discovery and is relatively simple. K-means has found widespread usage in lot of fields, ranging from unsupervised learning of neural networks, pattern recognitions, Classification analysis, Artificial Intelligence, image processing, machine vision and many others.

PCA Algorithm:

• Clustering

Applications in Self Driving-

For getting your hands on Image processing and real time Image Acquistion using OpenCV:

1. First you need to install Open CV if it is not already installed. Open your command prompt, and then just type this:
2. Once it is installed, we can simply check it by typing
```>>> import cv2 (if it shows no error that means it is installed) >>> import numpy >>> import matplotlib (these 2 libraries are needed) If it is done, we can start with the programming: import cv2 import numpy #reading image from the computer img=cv2.imread(r’C:\images\image2.png,1) ----- If you want to see this window in grayscale mode then just change 1 to 0. cv2.imshow(‘Image’,img) cv2.waitKey(0) cv2.destroyAllWindows()```
——————————————————————————–
```import cv2 import numpy #reading image from the computer img=cv2.imread(r’C:\images\image2.png,1) #if you need to design your own window, then change the frame dimensions cv2.namedWindow(‘Image’,cv2.WINDOW_NORMAl) cv2.imshow(‘Image’,img) cv2.waitKey(0) cv2.destroyAllWindows() --------------------------------------------------------------------------------------------- import cv2 import numpy #reading image from the computer img=cv2.imread(r’C:\images\image2.png,1) cv2.namedWindow(‘Image’,cv2.WINDOW_NORMAl) #if you need the image output cv2.imwrite(‘C:\images\output.jpg’,img) cv2.imshow(‘Image’,img) cv2.waitKey(0) cv2.destroyAllWindows()```
—————————————————————————————
If we want to work with real time images:
```cap=cv2.VideoCapture(0) while TRUE: ret,frame=cap.read() cv2.imshow(‘frame’, frame) if cv2.waitKey(1) && 0*FF==ord(‘q’): break cap.release() cv2.destroyAllWindows() ---------------------------------------------------------------------------------- # For Drawing anything on the image: import cv2 import numpy img=cv2.imread(r’C:\images\image2.png,1) cv2.namedWindow(‘Image’,cv2.WINDOW_NORMAl) #add a line to the image img=cv2.line(img,(10,120), (400, 500),(255,0,0),5)--- first the source coordinates, then the destination coordinates, color and the width. #add a rectangle to the image img=cv2.rectangle(img,(50,0),(200,180),(0,0,255),3) cv2.imshow(‘Image’,img) cv2.waitKey(0) cv2.destroyAllWindows()```
————————————————————————————
For edge detection:

```import cv2 import numpy import matplotlib.pyplot as plt img=cv2.imread(r’C:\images\image4.jpg’,0) edges=cv2.Canny(img,100,200) plt.subplot(121),plt.imshow(img,cmpa=’gray’) plt.title(‘original Image’),plt.xticks([]).plt.yticks([]) plt.subplot(122),plt.imshow(edges,cmap=’gray’) plt.title(‘Edge Image’),plt.xticks([]),plt.xticks([]) plt.show()```
————————————————————————————
For corner Detection:
We will use Harry’s corner algorithm-
```import cv2 import numpy img=cv2.imread(r’C:\images\chessboard.png) gray=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY) gray=numpy.float32(gray) dst=cv2.cornerHarris(gray,2,3,0.04) dst=cv2.dilate(dst,NONE) #threshold for a particular value img[dst>0.01*dst.max()]=[0,0,255] cv2.imshow(‘dst’,img) cv2.waitkey(0) cv2.destroyAllWindows()```
This will help in image detection, object detection, feature detection using opencv and python!
3. Color Quantization using K means Clustering:
We will use the similar syntax and k means clustering for color quantization. This can be defined as reducing the number of colors in an image to reduce the memory. For color quantization we will use a command- cv.kmeans() – input parameters.
A. samples: It should be of np.float32 data type, and each feature should be put in a single column.
B. nclusters(K): number of clusters required at the end
C. criteria: It is the iteration termination criteria. When this criteria is satisfied, algorithm iteration stops. Actually, it should be a tuple of 3 parameters. They are (type, max_iter, epsilon): Termination criteria has 3 flags-
cv2.TERM_CRITERIA_EPS- Stop the algorithm iteration if specified accuracy, epsilon, is reached.
cv2.TERM_CRITERIA_MAX_ITER- stops the algorithm after the specified number of iterations, max_iter.
cv2.TERM_CRITERIA_EPS+TERM_CRITERIA_MAX_ITER- stop the iteration when any of the above condition is met.
max_iter- An integer specifying maximum number of iterations
epsilon- Required Accuracy

D. attempts: Flag to specify the number of times the algorithm is executed using different initial labellings.
E.flags: This flag is used to specify how initial centers are taken. Normally two flags are used for this: cv2.KMEANS_PP_CENTERS and cv2.KMEANS_RANDOM_CENTERS
cv.kmeans()-Output Parameters
A. Compactness- It is the sum of squared distance from each point to their corresponding centers.
B. labels: This is the label array, where each element marked ‘0’,’1’…
C. centers : This is the array of centers of clusters.

```import numpy import cv2 img=cv2.imread(r’C:\images\img1.jpg’) img2=img.reshape((-1,3) #converting to numpy.float32 img2=numpy.float32(img2) #define the criteria criteria=(cv2.TERM_CRITERIA_EPS+cv2.TERM_CRITERIA_MAX_ITER,10,1.0) k1=8 ret,label1,center1=cv2.kmeans(img2,k1,None,criteria,10,cv2.KMEANS_RANDOM_CENTERS) center1=numpy.uint8(center1) res1=center1[label1.flatten] res1final=res1.reshape((img.shape))```
#define a window for image
```cv2.namedWindow(‘Kmeans Clustering’,cv2.WINDOW_NORMAl) final1=numpy.concatenate((img,res1final),axis=1) cv.imshow(‘Kmeans Clustering’,final1) cv2.waitKey(0) cv2.destroyAllWindows() ```
This was clustering with k=8 and we can do with more cluster too
```import numpy import cv2 img=cv2.imread(r’C:\images\img1.jpg’) img2=img.reshape((-1,3) #converting to numpy.float32 img2=numpy.float32(img2) #define the criteria criteria=(cv2.TERM_CRITERIA_EPS+cv2.TERM_CRITERIA_MAX_ITER,10,1.0) k1=2 ret,label1,center1=cv2.kmeans(img2,k1,None,criteria,10,cv2.KMEANS_RANDOM_CENTERS) center1=numpy.uint8(center1) res1=center1[label1.flatten] res1final=res1.reshape((img.shape)) #define a window for image cv2.namedWindow(‘Kmeans Clustering’,cv2.WINDOW_NORMAl) final1=numpy.concatenate((img,res1final),axis=1) cv.imshow(‘Kmeans Clustering’,final1) cv2.waitKey(0) cv2.destroyAllWindows()```

FACE DETECTION AND OBJECT DETECTION USING OPEN CV
```import numpy import cv2 fc=cv2.CascadeClassifier(r‘C:\images\haarcascade_frontalface_default.xml’) ec=cv2.CascadeClassifier(r‘C:\images\haarcascade_eye.xml’) img=cv2.imread(r‘C:\images\man.jpeg’) gray=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY) faces=fc.detectMultiScale(gray,1.3,5) for(x,y,w,h) in faces: img=cv2.rectangle(img,(x,y),(x+w),(y+h),(255,0,0),2) roi_gray=gray[y:y+h,x:x+w] roi_color=img[y:y+h,x:x+w] eyes=ec.detectMultiScale(roi_Gray) for(ex,ey,eh) in eyes: cv2.rectangle(roi_color, (ex,ey),(ex+ew,ey+eh),(0,0,255),2) cv.imshow(‘Face detection’,img) cv2.waitKey(0) cv2.destroyAllWindows()```

Recommended Reading: Chapter 5: Linear Regression