DISCRIMINANT ANALYSIS

Introduction

Discriminant analysis performs a task similar to multiple linear regression, but for a categorical rather than a metric dependent variable. In many cases the outcome of interest is categorical, such as loyalty and disloyalty, users and non-users, or buyers and non-buyers of a product. Discriminant analysis is used in these cases to find out which independent variable influences the dependent variable the most. The dependent variable in discriminant analysis is categorical (measured on a nominal scale), whereas the predictors are interval or ratio scaled.

Discriminant Analysis Model

Discriminant analysis is a statistical technique designed to classify the data into homogeneous groups. The purpose is to determine the groups based on a set of identified variables known as predictors or independent variables.

The mathematical model for discriminant analysis is:

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6

Y = discriminant score, X1 to X6 are the independent variables or predictors, b1 to b6 are the discriminant coefficients of the independent variables, and b0 is the constant term. The dependent (grouping) variable is categorical; each case is assigned to a group according to its discriminant score Y.

CASE ANALYSIS-1


PROBLEM

A television manufacturing company wants to understand the purchase behaviour of two categories of respondents: buyers and non-buyers of TV. Discriminant analysis is used to find the most prominent variable discriminating between buyers and non-buyers of TV.

Input Data

Dependent variable

Y = Buyers/Non Buyers

The analysis uses ‘code 1’ for buyers and ‘code 2’ for non-buyers of TV.

Independent variable

X1= Brand, X2 = Picture Quality, X3 = Higher Durability, X4 = Proximity to a dealer, X5 = Advertisement

Objective of using Discriminant analysis is to:

  • Find out the relatively better variable in discriminating the buyers of TV.
  • Determine the statistical significance of discriminant function.
  • Find the cut off score of classification.
  • Find the accuracy of classification.

Data Collection

The data were collected from selected respondents using a structured questionnaire on selected attributes of buying a TV. The statements were measured on a Likert-type scale of 1 to 11, where 1 indicates strongly agree and 11 indicates strongly disagree. Each respondent rated each of the six attributes (Brand, Picture Quality, Higher Durability, Price, Proximity to a dealer and Advertisement) and then indicated whether he/she is a buyer of that brand or not.

Table-1: Input Data

The data so collected are analyzed by SPSS-11.

Performing the Analysis with SPSS

For SPSS Version 11, click on Analyze ⇒ Classify ⇒ Discriminant. This will bring up the SPSS screen dialogue box as shown below.

The data sheet gives the responses collected from 30 respondents on 5 attributes of buying a TV.

After clicking on Discriminant, the SPSS screen gives the following dialogue box. Select the variables x1, x2, ..., x5 and move them into the Independents box. Just below the Independents box, select “Enter independents together”. Select the dependent variable and move it into the Grouping Variable box.

Define the range of values of the grouping variable by clicking on Define Range just below the Grouping Variable box. This will bring up the dialogue box shown above. Code 1 is used for buyers and code 2 for non-buyers, so the minimum is 1 and the maximum is 2. Now click on Continue. This returns you to the Discriminant dialogue box. Now click on Statistics in the lower part of the main dialogue box. This opens the following dialogue box.

Under Function Coefficients, choose Unstandardized; under Descriptives, choose Means and Univariate ANOVAs; and under Matrices, choose Within-groups correlation. Now click on Continue. This returns you to the Discriminant dialogue box. Now click on Classify in the lower part of the main dialogue box. This opens the following dialogue box.

Select All groups equal under Prior Probabilities, Summary table and Leave-one-out classification under Display, and Within-groups under Use Covariance Matrix. Now click on Continue. This returns you to the Discriminant dialogue box. Click on Save in the lower part of the main dialogue box. This opens the following dialogue box.

Select Predicted group membership, Discriminant scores and then Continue button. This will return you to the following Discriminant dialogue box.

Now click on the OK button. The output that will be produced is illustrated on the following pages.

SPSS Output

The SPSS output of the discriminant analysis is shown in Table-2 to Table-14.

Table-2: Analysis Case Processing Summary

Table-3: Group Statistics

The mean score for ‘X1 = Brand’ is 1.7368 for the buyers group and 3.0909 for the non-buyers group, a large difference. The differences in mean scores for ‘X2 = Picture Quality’ and ‘X4 = Proximity to a dealer’ are low, whereas those for ‘X1 = Brand’, ‘X3 = Higher Durability’ and ‘X5 = Advertisement’ are high. Thus ‘Brand’, ‘Higher Durability’ and ‘Advertisement’ are the prominent variables discriminating between buyers and non-buyers of TV. The variables ‘X4 = Proximity to a dealer’ and ‘X2 = Picture Quality’ show little difference in variability, as measured by their standard deviations.

Table-4: Tests of Equality of Group Means

The results of the univariate ANOVAs carried out for each independent variable are given in Table-4. A significant difference in group means exists only for ‘X1 = Brand’, whose p-value of 0.020 is less than 0.05 (5% level of significance). The p-values for all other variables are greater than 0.05, so the group means of those variables do not differ significantly.

Table-5: Pooled Within-Groups Matrices

Table-5 presents the pooled within-groups correlation matrix for the independent variables. No pair of variables has a correlation coefficient greater than 0.75, so there is no sign of multicollinearity among the predictors. The data are thus suitable for discriminant analysis.
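The multicollinearity check can be sketched in a few lines of Python. This is only an illustration: the ratings, the variable names and the helper `pooled_within_corr` are hypothetical, and the 0.75 threshold is the one used in the text. Each variable is centred on its own group mean before correlating, which is one way to obtain a pooled within-groups correlation.

```python
from itertools import combinations
from math import sqrt

def pooled_within_corr(x, y, groups):
    """Correlate x and y after centring each on its own group mean."""
    def centre(values):
        means = {}
        for g in set(groups):
            members = [v for v, gg in zip(values, groups) if gg == g]
            means[g] = sum(members) / len(members)
        return [v - means[g] for v, g in zip(values, groups)]

    dx, dy = centre(x), centre(y)
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings for eight respondents (not the case-study data).
groups = [1, 1, 1, 1, 2, 2, 2, 2]
predictors = {
    "x1": [1, 2, 2, 3, 3, 4, 5, 4],
    "x2": [4, 2, 5, 3, 3, 5, 2, 4],
    "x3": [2, 3, 3, 5, 5, 7, 9, 8],  # constructed to track x1 closely
}

THRESHOLD = 0.75  # rule of thumb used in the text

# Collect every pair whose absolute correlation exceeds the threshold.
flagged = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pooled_within_corr(predictors[a], predictors[b], groups)) > THRESHOLD
]
print(flagged)
```

An empty `flagged` list would correspond to the situation described for Table-5, where no pair exceeds the threshold.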

Summary of Canonical Discriminant Functions

Table-6: Eigenvalues

                             a  First 1 canonical discriminant functions were used in the analysis.

The eigenvalue is the ratio of between-group to within-group variation in the discriminant scores; a large eigenvalue is associated with a strong function. The canonical correlation for this model is 0.519 (not very high), which signifies a moderate relationship between the discriminant scores and the groups (buyers and non-buyers). The square of the canonical correlation is (0.519)² = 0.269361, which means that 26.9% of the variance between buyers and non-buyers is accounted for by the independent variables.

Table-7: Wilks’ Lambda

The significance of the discriminant function is tested using Wilks’ Lambda. A small lambda indicates that the group means differ. The value of Wilks’ Lambda here is 0.731 (a high value), indicating that the groups are not well separated. The chi-square test confirms that the discrimination between the two groups is not significant, as the p-value (Sig. = 0.249) is greater than 0.05.
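With a single discriminant function, the canonical correlation, Wilks’ Lambda and the eigenvalue are tied together by simple identities. The short Python sketch below, whose variable names are chosen here for illustration, starts from the reported canonical correlation of 0.519. The eigenvalue it prints is the value implied by that correlation, not a figure quoted from the SPSS table.

```python
# Reported canonical correlation for the TV case.
canonical_corr = 0.519

# Share of between-group variance explained by the predictors.
r_squared = canonical_corr ** 2

# For one discriminant function, Wilks' Lambda = 1 - Rc^2.
wilks_lambda = 1 - r_squared

# Eigenvalue implied by the canonical correlation: Rc^2 / (1 - Rc^2).
eigenvalue = r_squared / (1 - r_squared)

print(round(r_squared, 6), round(wilks_lambda, 3), round(eigenvalue, 3))
```

The Wilks’ Lambda recovered this way agrees with the 0.731 reported in Table-7.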

Table-8: Standardized Canonical Discriminant Function Coefficients

The standardized discriminant coefficients reflect the relative contribution of each predictor to the discriminant function. A coefficient that is small in absolute value signifies a smaller impact of that predictor. As seen from the table, ‘Brand’ has the largest coefficient (1.069) and is the most influential predictor.

Table-9: Structure Matrix

                             *Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions

                             **Variables ordered by absolute size of correlation within function.

Structure coefficients are the correlations between the discriminant scores and the variables used in the discriminant analysis. The correlation for ‘X1 = Brand’ is 0.769, which marks it as the most important discriminating variable.

Table-10: Unstandardized discriminant coefficients

                             *Unstandardized coefficients

Unstandardized discriminant coefficients are interpreted in the same way as standardized coefficients, but their values depend on the units of measurement of the variables. For this reason Table-10 also includes a constant term (-1.576).

Unstandardised discriminant coefficients are used to form the discriminant equation.

Y = Discriminant Score, X1= Brand, X2 = Picture Quality, X3 = Higher Durability, X4 = Proximity to a dealer, X5 = Advertisement

Y = -1.576 + 0.739 X1 + 0.074 X2 + 0.248 X3 - 0.114 X4 + 0.008 X5 - 0.292 X6

Table-11: Functions at Group Centroids

                             Unstandardized canonical discriminant functions evaluated at group means

‘Functions at Group Centroids’ gives the average discriminant score of the subjects in each of the two groups. The cut-off score ‘C’ for classification into buyers and non-buyers is calculated from the group centroids using the rule:

C = (N1Y1 + N2Y2)/(N1 + N2) = (19(-0.446) + 11(0.771))/(19 + 11) = 0.007/30 ≈ 0.0002

(This rule is used when the sample sizes are not equal.)

Where N1 and N2 are the sizes of the two groups and Y1 and Y2 are the group centroids. The buyers’ centroid (-0.446) lies below the cut-off and the non-buyers’ centroid (0.771) lies above it, so any respondent whose discriminant score is less than the cut-off is classified as a buyer of TV, whereas one with a score greater than the cut-off is classified as a non-buyer of TV.
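The weighted-centroid rule can be evaluated directly. The following is a minimal Python sketch using the group sizes and centroids quoted above (19 buyers with centroid -0.446, 11 non-buyers with centroid 0.771). The function names are illustrative and not part of SPSS.

```python
def weighted_cutoff(n1, y1, n2, y2):
    """Cut-off score C = (N1*Y1 + N2*Y2) / (N1 + N2)."""
    return (n1 * y1 + n2 * y2) / (n1 + n2)

def classify(score, cutoff, centroids):
    """Assign the group whose centroid lies on the same side of the cut-off."""
    low_label, high_label = sorted(centroids, key=centroids.get)
    return high_label if score > cutoff else low_label

# Group centroids and sizes as quoted in the text.
centroids = {"buyer": -0.446, "non-buyer": 0.771}
cutoff = weighted_cutoff(19, centroids["buyer"], 11, centroids["non-buyer"])

print(cutoff)
print(classify(-1.0, cutoff, centroids))  # a hypothetical score
print(classify(1.2, cutoff, centroids))   # a hypothetical score
```

Because the discriminant scores are centred on the overall mean, a cut-off computed with this rule always falls very close to zero.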

Classification Statistics

Table-12: Classification Processing Summary

Table-13: Prior Probabilities for Groups

Table-14: Classification Results

                             a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.

                             b 80.0% of original grouped cases correctly classified.

                             c 60.0% of cross-validated grouped cases correctly classified.

‘Classification Results’ is a simple summary of the number and percentage of subjects classified correctly and incorrectly.
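The difference between the original and the cross-validated percentages comes from how each case is scored. The Python sketch below illustrates leave-one-out classification on hypothetical one-predictor data. With a single predictor and equal priors, the two-group linear discriminant rule reduces to assigning a case to the group with the nearer mean, and that simplified rule is used here in place of SPSS's full procedure.

```python
def nearest_mean_rule(xs, labels):
    """Fit a one-predictor, two-group rule: assign x to the group with the nearer mean."""
    means = {}
    for g in (1, 2):
        members = [x for x, lab in zip(xs, labels) if lab == g]
        means[g] = sum(members) / len(members)
    return lambda x: 1 if abs(x - means[1]) <= abs(x - means[2]) else 2

def original_hit_ratio(xs, labels):
    """Classify every case with a rule fitted on all cases."""
    predict = nearest_mean_rule(xs, labels)
    hits = sum(predict(x) == lab for x, lab in zip(xs, labels))
    return hits / len(xs)

def leave_one_out_hit_ratio(xs, labels):
    """Classify each case with a rule fitted on all the other cases."""
    hits = 0
    for i, (x, lab) in enumerate(zip(xs, labels)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        predict = nearest_mean_rule(rest_x, rest_labels)
        hits += predict(x) == lab
    return hits / len(xs)

# Hypothetical ratings and group codes (not the case-study data).
ratings = [1, 2, 2, 3, 4, 5, 6, 7, 8, 3]
groups = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

print(original_hit_ratio(ratings, groups))
print(leave_one_out_hit_ratio(ratings, groups))
```

The cross-validated ratio is usually the lower of the two, since a case cannot influence the rule that classifies it.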

Hit Ratio

As already mentioned, a respondent is classified as a buyer or a non-buyer of TV by comparing the discriminant score with the cut-off score. Using this logic, the results of classification for all the cases are presented in Table-14. Out of the 19 respondents who are buyers of TV, 17 were predicted by the model as buyers. Similarly, out of the 11 non-buyers of TV, 7 were predicted as non-buyers. The overall classification ability of the model, measured by the hit ratio, is:

Hit ratio = Number of correct predictions/Total number of cases = (17 + 7)/30 = 80%

The reliability of the hit ratio can be tested by comparing it with the proportional chance criterion CPRO, as given below:

CPRO = p² + (1 - p)² = 0.63² + (1 - 0.63)² = 0.53

Where p = proportion of individuals belonging to group 1 (buyers group) = 19 ÷ 30 = 0.63

A classification accuracy of 80% seems to be good compared to the proportional chance criterion of 53%.
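The hit ratio and the proportional chance criterion reduce to a few lines of arithmetic. The Python sketch below uses the counts from the classification table (17 of 19 buyers and 7 of 11 non-buyers correctly classified). The function names are illustrative, and p is rounded to two decimals as in the text.

```python
def hit_ratio(correct_per_group, total_cases):
    """Share of cases classified into their actual group."""
    return sum(correct_per_group) / total_cases

def proportional_chance(p):
    """C_PRO = p^2 + (1 - p)^2 for two groups, p being the share of group 1."""
    return p ** 2 + (1 - p) ** 2

hit = hit_ratio([17, 7], 30)

# Proportion of buyers, rounded to two decimals as in the text (0.63).
p_buyers = round(19 / 30, 2)
c_pro = proportional_chance(p_buyers)

print(f"hit ratio = {hit:.1%}, C_PRO = {c_pro:.2f}")
```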


CASE ANALYSIS-2


PROBLEM

A retail outlet wants to know its customers and their loyalty to the store. The purpose is to classify the customers into loyal and disloyal customers on the basis of their age, income, average number of visits per month and number of years attached to the same retail store. The data have been collected from visitors to the store.

Input Data

The dependent variable (loyalty of the customer) is a categorical variable with two categories:

Loyal customer (code-1)

Disloyal customer (code-2)

Independent variable

X1= Age, X2= Income in 000’s of rupees, X3=Average number of visits per month, X4= Number of years attached to the same retail store.

Data Collection

The data set comprises the information gathered from visitors to the retail store, as shown below.

Table-1: Input Data

SPSS Output

Table-2: Analysis Case Processing Summary

Table-3: Group Statistics

The differences in the mean scores of X1 = Age and X2 = Income are large, but so are the differences in their standard deviations. The next largest difference in mean values is for X3 = Average number of visits per month, with a negligible difference in standard deviation. So X3 = Average number of visits per month is the most prominent variable discriminating between loyalty and disloyalty to the store.

Table-4: Tests of Equality of Group Means

The p-value for X3 = Average number of visits per month (Sig. = 0.000) is less than 0.05 (5% level of significance), whereas for all other variables it is greater than 0.05. So only the difference in the mean scores of X3 is significant.

Table-5: Pooled Within-Groups Matrices

The correlation coefficient for any pair of variables is not more than 0.75 and so the model is reliable for discriminant analysis.

Summary of Canonical Discriminant Functions

Table-6: Eigenvalues

                             a  First 1 canonical discriminant functions were used in the analysis.

The large eigenvalue (3.124) indicates a strong discriminant function. The canonical correlation for this model is 0.870 (high), which signifies a strong relationship between the discriminant scores and the groups (loyal and disloyal).

Table-7: Wilks’ Lambda

The value of Wilks’ Lambda is 0.242 (a low value), indicating a significant difference in group means. The discrimination between the two groups is significant, as the p-value of the chi-square test (Sig. = 0.000) is less than 0.05.

Table-8: Standardized Canonical Discriminant Function Coefficients

As seen from the table, X3 = Average number of visits per month has the largest coefficient (1.076) and is the most influential predictor.

Table-9: Structure Matrix

                             Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.

The high structure coefficient of 0.874 for ‘X3 = Average number of visits per month’ marks it as the most important discriminating variable.

Table-10: Unstandardized discriminant coefficients

                             Unstandardized coefficients

Unstandardised discriminant coefficients are used to form the discriminant equation.

Y = Discriminant Score, X1= Age, X2= Income in 000’s of rupees, X3=Average number of visits per month, X4= Number of years attached to the same retail store

Y = -5.514 + 0.000 X1 + 0.038 X2 + 1.528 X3 + 0.313 X4

Table-11: Functions at Group Centroids

                             Unstandardized canonical discriminant functions evaluated at group means

The cut-off score ‘C’ for classification into loyal and disloyal customers is zero because the two groups are of equal size; it is the midpoint of the two group centroids (1.692 and -1.692). Any respondent whose discriminant score is more than zero is classified as a loyal customer, whereas one with a score less than zero is classified as a disloyal customer.
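The discriminant equation and the zero cut-off can be applied to a new customer as follows. This Python sketch uses the unstandardized coefficients reported above. The two respondents are hypothetical, and the function names are chosen here for illustration.

```python
# Unstandardized coefficients from the discriminant equation in the text.
CONSTANT = -5.514
COEFFICIENTS = {
    "age": 0.000,
    "income": 0.038,   # income in thousands of rupees
    "visits": 1.528,   # average number of visits per month
    "years": 0.313,    # years attached to the same store
}

def discriminant_score(respondent):
    """Y = constant + sum of coefficient * predictor value."""
    return CONSTANT + sum(COEFFICIENTS[name] * respondent[name] for name in COEFFICIENTS)

def classify(score, cutoff=0.0):
    """Scores above the cut-off are treated as loyal, as described in the text."""
    return "loyal" if score > cutoff else "disloyal"

# Hypothetical respondents (not taken from the input data table).
frequent_visitor = {"age": 35, "income": 40, "visits": 4, "years": 3}
occasional_visitor = {"age": 50, "income": 30, "visits": 1, "years": 1}

for person in (frequent_visitor, occasional_visitor):
    score = discriminant_score(person)
    print(round(score, 3), classify(score))
```

Age contributes nothing to the score because its coefficient is 0.000 to three decimals, while each additional monthly visit raises the score by about 1.5.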

Fig-1

Table-12: Classification Processing Summary

Table-13: Prior Probabilities for Groups

Table-14: Classification Results

                              a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.

                              b 95.8% of original grouped cases correctly classified.

                              c 91.7% of cross-validated grouped cases correctly classified.

 

The overall classification ability of the model, measured by the hit ratio, is:

Hit ratio = Number of correct predictions/Total number of cases = (12 + 11)/24 = 95.8%

The reliability of the hit ratio can be tested by comparing it with the proportional chance criterion CPRO, as given below:

CPRO = p² + (1 - p)² = 0.50² + (1 - 0.50)² = 0.50

Where p = proportion of individuals belonging to group 1 (loyal group) = 12 ÷ 24 = 0.50

A classification accuracy of 95.8% seems to be good compared to the proportional chance criterion of 50%.

SPSS Commands for Discriminant Analysis

  1. Click on ANALYZE at the SPSS menu bar (in older versions of SPSS, click on STATISTICS instead of ANALYZE).
  2. Click on CLASSIFY, followed by DISCRIMINANT.
  3. Select the GROUPING VARIABLE (dependent categorical variable in discriminant analysis) and transfer it from the variable list on the left to the grouping variable box on the right.
  4. Define the range of values of the grouping variable by clicking on DEFINE RANGE just below the grouping variable box. Fill in the minimum and maximum values (the codes used in the problem, say 1 and 2) and then click CONTINUE.
  5. Select all the independent variables for discriminant analysis from the variable list and transfer them to the INDEPENDENTS box on the right.
  6. Click on STATISTICS on the lower part of the main dialogue box. This opens up a smaller dialogue box. On this, choose MEANS, UNIVARIATE ANOVAS, UNSTANDARDIZED and WITHIN-GROUPS CORRELATION and then CONTINUE.
  7. Now, click on CLASSIFY on the lower part of the main dialogue box and select ALL GROUPS EQUAL under PRIOR PROBABILITIES; SUMMARY TABLE and LEAVE-ONE-OUT CLASSIFICATION under the title DISPLAY; and WITHIN-GROUPS under USE COVARIANCE MATRIX. Now click on CONTINUE.
  8. Click on SAVE on the lower part of the main dialogue box. Select PREDICTED GROUP MEMBERSHIP, DISCRIMINANT SCORES and then CONTINUE. This will return us to the Discriminant dialogue box.
  9. Finally click on OK of the main dialogue box.
