CLUSTER ANALYSIS

Introduction

Cluster analysis is a technique for classifying cases into groups that are relatively homogeneous within themselves and heterogeneous between each other on the basis of selected characteristics. These groups are called clusters. Clusters need not consist of people only; there can be clusters of brands, products or sectors that share common characteristics. A cluster consists of variables that are highly correlated with one another and comparatively weakly correlated with the variables in other clusters. The purpose of cluster analysis is to find a number of disjoint clusters based on the similarity of profiles among the entities. It is useful, for example, for segmenting the market for a product on the basis of customer characteristics.

Methods

Cluster analysis can be performed using a Hierarchical or a Non-Hierarchical (K-Means) procedure. Hierarchical cluster analysis computes a range of possible cluster solutions and helps to select the best one. K-Means cluster analysis, by contrast, requires the number of clusters to be specified in advance. The best practice is therefore to identify the number of clusters with Hierarchical cluster analysis first and then to run K-Means cluster analysis with that number. The Agglomeration Schedule, the Dendrogram and the Icicle plot are parts of the output of Hierarchical cluster analysis. The initial cluster centers and the final cluster centers are parts of the output of the K-Means method. The final cluster centers are the average values of each variable within each cluster and are used to interpret the clusters.

Data

Metric data (interval-scaled and ratio-scaled) are best suited to cluster analysis. Non-metric data (nominal-scaled and ordinal-scaled) can also be used after binary conversion (0 = absence and 1 = presence of an attribute).
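The binary conversion mentioned above can be sketched as follows. This is only an illustration: the function name `to_binary_indicators` and the `skin_type` variable are hypothetical and are not part of the case data.

```python
def to_binary_indicators(values):
    """Convert a list of nominal values into 0/1 indicator columns."""
    categories = sorted(set(values))
    # One column per category: 1 = attribute present, 0 = attribute absent.
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical nominal variable (not taken from the case data).
skin_type = ["dry", "oily", "dry", "normal"]
indicators = to_binary_indicators(skin_type)
for category, column in indicators.items():
    print(category, column)
```

Each resulting column can then be treated like any other 0/1 variable in the input data matrix.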

CASE ANALYSIS-1


PROBLEM

A cosmetic product company wants to know which characteristics of a product attract consumers when they think of using it. The company's marketing team prepared twelve statements to measure the major attributes behind the preference for cosmetic products.

The statements are as follows.

V1 = I give value to the price of the cosmetic product.

V2 = I prefer to use cosmetic products which are easily available in the market.

V3 = I feel safe to use the recommended cosmetic product.

V4 = The brand of the product influences me to purchase the product.

V5 = Attractive promotional offers attract people to choose the product.

V6 = I prefer to use the cosmetic products with the knowledge of the dealer.

V7 = People prefer to use the product with varieties.

V8 = I like to purchase cosmetic products that have a refund policy.

V9 = People are more conscious about the current trend.

V10 = I enjoy testing the new cosmetic product in the market.

V11 = Advertising encourages people to choose a particular cosmetic product.

V12 = Celebrity endorsement of a cosmetic product increases the popularity of the product.

INPUT DATA

The data were collected from 18 respondents on a 5-point Likert scale. Each respondent had to agree or disagree with each statement (1 = completely agree, 2 = agree, 3 = neither agree nor disagree, 4 = disagree, 5 = completely disagree). The data so collected are shown in the following table and are treated as the input data matrix for the cluster analysis.

Table-1: Input Data

Performing the Analysis with SPSS

For SPSS Version 11, click on Analyze ⇒ Classify ⇒ Hierarchical Cluster

This will bring up the SPSS screen dialogue box as shown below.

Select the variables (v1, v2, ..., v12) and move them into the Variable box. Now click on the Plot button. This will bring up the dialogue box shown below.

Select Dendrogram and the Vertical or Horizontal orientation button, and then click Continue. This will return you to the Hierarchical Cluster Analysis dialogue box.

Now click on Method, choose the Between-groups linkage method and then click Continue. This will return you to the Hierarchical Cluster Analysis dialogue box.

Now click OK. The outputs so produced are illustrated on the following pages.

SPSS Output

Table-2: Case Processing Summary

a Squared Euclidean Distance used

b Average Linkage (Between Groups)

Table-2 simply reports whether any data are missing.
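The two footnotes to the Case Processing Summary name the distance measure (squared Euclidean distance) and the clustering method (average linkage between groups). A minimal sketch of how these two choices produce a sequence of merges is given below. It is not SPSS's implementation; the function names and the four respondents are hypothetical, and cases are numbered from 0.

```python
from itertools import combinations

def squared_euclidean(a, b):
    """Sum of squared differences between two rating profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_linkage(cases, group_a, group_b):
    """Mean squared Euclidean distance over all cross-group pairs of cases."""
    total = sum(squared_euclidean(cases[i], cases[j])
                for i in group_a for j in group_b)
    return total / (len(group_a) * len(group_b))

def agglomerate(cases):
    """Repeatedly merge the two closest clusters; return one row per stage."""
    clusters = [[i] for i in range(len(cases))]
    schedule = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average-linkage distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_linkage(cases, clusters[p[0]], clusters[p[1]]))
        coefficient = average_linkage(cases, clusters[a], clusters[b])
        schedule.append((len(schedule) + 1, clusters[a], clusters[b], coefficient))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return schedule

# Hypothetical ratings of four respondents on three statements.
cases = [[1, 2, 5], [2, 2, 4], [5, 4, 1], [5, 5, 1]]
schedule = agglomerate(cases)
for stage, left, right, coefficient in schedule:
    print(stage, left, right, coefficient)
```

With n cases there are n – 1 stages, and the coefficient recorded at each stage is the distance at which the two clusters were joined.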

Table-3: Agglomeration Schedule

The agglomeration schedule helps to determine the number of clusters to be retained from the solution. The column labeled Coefficients holds the values of the distance statistic used to form the clusters. A good cluster solution shows a sudden jump in the distance coefficient when the schedule is read from the bottom up. The difference in the coefficients is 30.938 – 26.867 = 4.071 at stage 17 and 26.867 – 23.857 = 3.010 at stage 16. The next difference, at stage 15, is 23.857 – 22.000 = 1.857. But there is a sudden jump in the coefficients, with a difference of 22.000 – 18.490 = 3.510, at the 14th stage of the solution.

So, the number of clusters = Number of cases – Stage with sudden jump = 18 – 14 = 4
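The arithmetic above can be reproduced with a short sketch. The coefficients are the ones quoted in the text for stages 13 to 17; the function names are hypothetical. The code only computes the increases and the resulting cluster count; choosing stage 14 as the jump remains the analyst's reading of the schedule.

```python
# Coefficients quoted in the text for stages 13-17 of the agglomeration schedule.
coefficients = {13: 18.490, 14: 22.000, 15: 23.857, 16: 26.867, 17: 30.938}

def increments(coefs):
    """Increase in the coefficient at each stage relative to the previous stage."""
    stages = sorted(coefs)
    return {s: round(coefs[s] - coefs[prev], 3) for prev, s in zip(stages, stages[1:])}

def clusters_for_stage(n_cases, stage):
    """Number of clusters = number of cases - stage with the sudden jump."""
    return n_cases - stage

jumps = increments(coefficients)
for stage in sorted(jumps, reverse=True):  # read from the bottom up
    print(f"stage {stage}: increase {jumps[stage]:.3f}")
print("clusters:", clusters_for_stage(18, 14))
```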

Table-4: Horizontal Icicle plot

An icicle plot is a visual representation of the agglomeration schedule. Below is a horizontal icicle plot. The output is read from bottom to top (vertical orientation) or from left to right (horizontal orientation). It is a default SPSS output and is of little use in interpreting the results.

Fig-1: Dendrogram

This option depicts the links between cases, and its structure makes it possible to see visually how cases form clusters. Dendrograms, or tree diagrams, represent the process of going from individual cases to one large cluster. The dendrogram also helps to identify the number of clusters. When reading a dendrogram, we want to determine at what stage the distance between the clusters being combined becomes large, so we look for large distances between sequential vertical lines. As can be seen here, four clusters can be visually identified.

Once the number of clusters has been decided, a non-hierarchical (k-means) clustering option can be run on the data.

Performing Non-Hierarchical Cluster Analysis with SPSS

For SPSS Version 11, click on Analyze ⇒ Classify ⇒ K-Means Cluster

This will bring up the SPSS screen dialogue box as shown below.

Select the variables (v1, v2, ..., v12) and move them into the Variable box. Now set 4 as the number of clusters and click on the Option button. This will bring up the dialogue box shown below.

Select Initial Cluster Centers, ANOVA table and Cluster information for each case, and then click Continue. This will return us to the K-Means Cluster Analysis dialogue box as follows.

Now click OK. The outputs so produced are illustrated on the following pages, from Table-5 to Table-11.

Table-5: Initial Cluster Centers

The Initial Cluster Centers table shows the first step in k-means clustering: finding the starting centers.

Table-6: Iteration History

a   Convergence achieved due to no or small distance change. The maximum distance by which any center has changed is .000. The current iteration is 3. The minimum distance between initial centers is 5.099.

The Iteration History table shows the number of iterations needed until the cluster centers no longer changed substantially.
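The idea of iterating until the centers stop moving can be sketched with a plain k-means loop. This is a generic illustration and not SPSS's exact procedure, which selects its own initial centers and applies its own convergence criterion. The data, the initial centers and the names `k_means`, `max_iter` and `tol` are all hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(cases, centers, max_iter=10, tol=1e-9):
    """Assign each case to its nearest center, recompute centers, repeat."""
    history = []  # largest distance moved by any center, per iteration
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every case.
        labels = [min(range(len(centers)), key=lambda k: euclidean(c, centers[k]))
                  for c in cases]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [c for c, lab in zip(cases, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(list(centers[k]))  # empty cluster keeps its center
        shift = max(euclidean(old, new) for old, new in zip(centers, new_centers))
        history.append(shift)
        centers = new_centers
        if shift <= tol:  # no center moved: convergence
            break
    return labels, centers, history

# Hypothetical two-variable data and initial centers.
cases = [[1, 1], [1, 2], [2, 1], [5, 5], [5, 4], [4, 5]]
labels, centers, history = k_means(cases, [[1, 1], [5, 5]])
print(labels)
print([round(h, 3) for h in history])
```

The last entry of `history` plays the role of the ".000" change reported in the footnote to the Iteration History table.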

Table-7: Cluster Membership

The Cluster Membership table gives the cluster each case belongs to and the Euclidean distance of each case from its cluster center.

Table-8: Final Cluster Centers

The final cluster centers give the mean value of each variable for each of the four clusters. A mean value below 3 is equivalent to ‘agree’, a mean value equal to 3 indicates a ‘neutral attitude’, and a mean value above 3 means ‘disagree’ with the statement.
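The reading rule for the final cluster centers can be sketched as below, assuming the scale used in this case (1 is the "agree" end and 5 the "disagree" end). The ratings and the function names are hypothetical and are not taken from Table-8.

```python
def cluster_center(members):
    """Mean rating per variable across the cases assigned to one cluster."""
    return [sum(column) / len(members) for column in zip(*members)]

def attitude(mean_rating, neutral=3.0):
    """Read a mean on the 5-point scale, where 1 is the 'agree' end."""
    if mean_rating < neutral:
        return "agree"
    if mean_rating > neutral:
        return "disagree"
    return "neutral"

# Hypothetical ratings of three cluster members on three statements.
members = [[1, 3, 5], [2, 3, 4], [1, 3, 5]]
center = cluster_center(members)
profile = [attitude(m) for m in center]
print([round(m, 2) for m in center])
print(profile)
```

In practice a mean rarely equals exactly 3, so an analyst may prefer to treat a small band around 3 as neutral.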

Cluster Interpretation

Cluster-1 (N=3)

People belonging to this group do not give importance to new cosmetic products in the market and do not bother about any refund policy. They are less price conscious, less affected by celebrity endorsement and by the knowledge of the dealer, and neutral towards easily available products and recommendations. The people of this group are more conscious about the brand. They are influenced by advertisements and prefer cosmetic products that follow the current trend. They prefer products with attractive promotional offers and variety.

This group consists of moderately beauty-conscious people who are easily affected by advertisements and promotional offers.

Cluster-2 (N=3)

This cluster of individuals is more inclined towards new and branded cosmetic products. The people of this group regard variety, current trend and availability as the prime factors in choosing a cosmetic product. They do not give value to price, recommendations or celebrity-endorsed cosmetic products. The people belonging to this group are neutral towards advertising and the knowledge of the dealer. Attractive promotional offers and refund policies do not attract these people.

This group of people is more beauty conscious and trendy.

Cluster-3 (N=7)

This cluster may be viewed as the group of people who are not influenced by attractive offers or refund policies. New and easily available cosmetic products in the market do not influence them to buy. They give the least preference to recommended products and have a neutral attitude towards variety. Current trend and branded cosmetic products influence them to purchase. They give value to price and want to have the knowledge of the dealer. They are also influenced by advertisements and celebrities.

This group consists of typical traditional beauty-conscious people who value branded products along with the knowledge of the dealer.

Cluster-4 (N=5)

The people of this group like to use branded cosmetic products. They are attracted by advertisements and give value to the price of the product. They give importance to attractive promotional offers and to new products in the market. The people of this group have no liking for refund policies, variety or recommendations.

This group consists of more value-conscious people who give importance to promotional offers.

Table-9: Distances between Final Cluster Centers

This table shows the Euclidean distances between the final cluster centers. Greater distances between clusters correspond to greater dissimilarities. Clusters 1 and 2 are most different.
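A table of this kind can be sketched as follows. The three centers are hypothetical and are not the values from Table-9; the function name is also an assumption.

```python
import math
from itertools import combinations

def center_distances(centers):
    """Euclidean distance between every pair of final cluster centers."""
    return {(i, j): math.dist(centers[i], centers[j])
            for i, j in combinations(sorted(centers), 2)}

# Hypothetical final centers on two variables, keyed by cluster number.
centers = {1: [1.0, 1.0], 2: [4.0, 5.0], 3: [1.0, 2.0]}
distances = center_distances(centers)
most_different = max(distances, key=distances.get)
for pair, d in distances.items():
    print(pair, round(d, 3))
print("most different:", most_different)
```

The pair with the largest distance is the pair of clusters whose average profiles differ the most.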

Table-10: ANOVA

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

The ANOVA table indicates which variables contribute the most to the cluster solution. Variables with large mean square errors provide the least help in differentiating between clusters. The variables ‘Recommendation’ and ‘Refund policy’ had the two highest mean square errors (and the lowest F statistics); therefore, these variables were not as helpful as the other variables in forming and differentiating the clusters.
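How the mean square error and the F statistic of one variable arise from the cluster labels can be sketched as below. The ratings, labels and function name are hypothetical, and, as noted above, the F values are descriptive only. The sketch assumes at least two clusters and a non-zero within-cluster mean square.

```python
def f_statistic(values, labels):
    """One-way ANOVA for one variable: between and within mean squares and F."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of the cluster means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Variation of the cases around their own cluster mean (the error term).
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between, ms_within, ms_between / ms_within

# Hypothetical cluster labels and ratings on two variables.
labels = [1, 1, 1, 2, 2, 2]
separating = [1, 1, 2, 4, 5, 5]   # differs clearly between the two clusters
weak = [2, 4, 3, 3, 2, 4]         # same mean in both clusters

for name, ratings in (("separating", separating), ("weak", weak)):
    ms_b, ms_w, f = f_statistic(ratings, labels)
    print(name, round(ms_b, 3), round(ms_w, 3), round(f, 3))
```

The variable with the larger error mean square and the smaller F is the one that helps less in telling the clusters apart.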

Table-11: Number of Cases in each Cluster


CASE ANALYSIS-2


PROBLEM

A cell phone company is interested in knowing its potential customers and their purposes in using a cell phone. The company prepared the following statements, rated on a 5-point Likert scale (1 = completely agree, 2 = agree, 3 = neither agree nor disagree, 4 = disagree, 5 = completely disagree). The data were collected from 20 respondents as given in Table-1 and are treated as the input data matrix for the cluster analysis.

V1 = Owning a cell phone gives a social status.

V2 = It keeps me connected to the world.

V3 = I cannot think of a life without cell phone.

V4 = It has made my life easier and comfortable.

V5 = It gives me a sense of independence.

V6 = It is a necessity rather than a luxury.

V7 = It makes me feel very modern and techno-savvy.

V8 = I prefer to change my cell regularly.

V9 = Cell phone usage is a means of passing time.

V10 = Cell phone usage is not good for teenagers.

V11 = Cell phone usage increases creativity.

Table-1: Input Data

Table-2: Case Processing Summary

a   Squared Euclidean Distance used

b Average Linkage (Between Groups)

 

Table-3: Agglomeration Schedule

We see that there is a difference of 48.563 – 41.743 = 6.820 in the coefficients at stage 19; the next difference, at stage 18, is 41.743 – 39.933 = 1.810. But there is a sudden jump in the coefficients, with a difference of 39.933 – 34.500 = 5.433, at the 17th stage of the solution.

So, the number of clusters = Number of cases – Stage with sudden jump = 20 – 17 = 3

Table-4: Horizontal Icicle plot

Fig-2: Dendrogram

Table-5: Iteration History

a  Convergence achieved due to no or small distance change. The maximum distance by which any center has changed is .000. The current iteration is 3. The minimum distance between initial centers is 6.633.

Table-6: Initial Cluster Centers

Table-7: Cluster Membership

Table-8: Final Cluster Centers

Cluster-1

People of this group change their cell phone regularly; they regard it as a necessary item that keeps them connected to the world. They also believe that cell phone usage increases their creativity. For them the cell phone is a symbol of social status and a good means of passing the time, and it gives them a sense of independence. But the cell phone does not make their life easier or more comfortable, and they can lead their life without a mobile.

This cluster may be viewed as the group of people who believe in the social importance of the cell phone.

Cluster-2

The people of this group believe that the cell phone is a symbol of social status and that it has made their life easier and more comfortable. They cannot think of life without a cell phone, and it keeps them connected to the world. The people of this group are neutral towards changing the handset regularly. The cell phone is a luxury item for the people belonging to this group, and it does not enhance their creativity. They do not think that using a cell phone is a means of passing the time or a symbol of modernity.

The people of this group consider the cell phone part and parcel of their life.

Cluster-3

This group of people regards the cell phone as a symbol of social status and modernity. Cell phone usage has made their life easier and more comfortable. They feel that a cell phone is necessary for everybody and that it gives an independent way of life. They believe that cell phone usage is not good for teenagers and that they can live without a cell phone.

This group of people regards the cell phone as today’s basic requirement for maintaining an independent life.

Table-9: Distances between Final Cluster Centers

Table-10: ANOVA

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

Table-11: Number of Cases in each Cluster

SPSS commands

Hierarchical Cluster Analysis

  1. Click on ANALYZE at the SPSS menu bar (in older versions of SPSS, click on STATISTICS instead of ANALYZE).
  2. Click on CLASSIFY, followed by HIERARCHICAL CLUSTER.
  3. Select the variables and move them into the Variable box.
  4. Click PLOT and select DENDROGRAM. Then select vertical or horizontal under ORIENTATION and click CONTINUE.
  5. Now click on METHOD in the main dialogue box, choose the BETWEEN-GROUPS LINKAGE method and then click CONTINUE.
  6. Finally, click OK in the Hierarchical Cluster Analysis dialogue box.

Non Hierarchical Cluster Analysis (K-Means Cluster)

  1. Click on ANALYZE at the SPSS menu bar (in older versions of SPSS, click on STATISTICS instead of ANALYZE).
  2. Click on CLASSIFY, followed by K-MEANS CLUSTER.
  3. Select the variables and move them into the Variable box.
  4. Set the number of clusters identified in the Hierarchical Cluster Analysis.
  5. Now click on OPTION, select INITIAL CLUSTER CENTERS, ANOVA TABLE and CLUSTER INFORMATION FOR EACH CASE, and then click CONTINUE.
  6. Click OK in the K-Means Cluster Analysis dialogue box.
