Friday, April 17, 2020

Chi-Square test for Dependency between categorical variables (Independent and target variable)


One of the most common problems we come across in machine learning is determining whether input features are relevant to the outcome to be predicted. This is the problem of feature selection.

In the case of classification problems where input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent or independent of the input variables.

       “ Categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values.”

Pearson’s chi-squared test is an example of a statistical hypothesis test for independence between categorical variables.
Let us take an example: Is gender independent of education level? A random sample of 395 people was surveyed, and each person was asked to report the highest education level they obtained. The data from the survey is summarized in the following table:



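|        | High School | Bachelors | Masters | Ph.D. | Total |
|--------|-------------|-----------|---------|-------|-------|
| Female | 60          | 54        | 46      | 41    | 201   |
| Male   | 40          | 44        | 53      | 57    | 194   |
| Total  | 100         | 98        | 99      | 98    | 395   |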

This table is called a contingency table, a name given by Karl Pearson, because the intent is to help determine whether one variable is contingent upon, or depends upon, the other variable.

The Chi-Squared test is a statistical hypothesis test that assumes (the null hypothesis) that the observed frequencies for a categorical variable match the expected frequencies for the categorical variable. For a contingency table, the test first calculates the expected frequencies for the groups, then determines whether the observed frequencies (the actual division of the groups) match the expected frequencies.
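As a rough sketch of the first step, the expected frequency of a cell under independence is (row total * column total) / grand total. The snippet below applies that to the survey table; the function name `expected_frequencies` is only illustrative and not part of any library.

```python
# Observed counts from the survey table
# (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]


def expected_frequencies(table):
    """Expected count per cell if the row and column variables were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Each cell: (row total * column total) / grand total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for label, row in zip(["Female", "Male"], expected):
    print(label, [round(value, 2) for value in row])
```

For example, the expected count for Female / High School is 201 * 100 / 395, which is a little under 51, compared with the 60 actually observed.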

The result of the test is a test statistic that has a chi-squared distribution and can be interpreted to reject or fail to reject the assumption or null hypothesis that the observed and expected frequencies are the same.
The statistic is the sum, over all cells of the table, of (observed - expected)^2 / expected. When an observed frequency is far from the expected frequency, the corresponding term in the sum is large; when the two are close, this term is small. Large values of Chi-square indicate that observed and expected frequencies are far apart. Small values of Chi-square mean the opposite: observed are close to expected.
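To make the sum concrete, here is a minimal sketch that adds up (observed - expected)^2 / expected over every cell of the survey table. The helper name `chi_square_statistic` is an assumption for illustration; it uses only plain Python.

```python
# Observed counts from the survey table
# (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected_count = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected_count) ** 2 / expected_count
    return statistic


statistic = chi_square_statistic(observed)
print(round(statistic, 3))
```

For this table the statistic comes out at roughly 8.0. The largest contributions come from the High School and Ph.D. columns, where observed and expected counts differ the most.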

        “The variables are considered independent if the observed and expected frequencies are similar, that is, the levels of the variables do not interact and are not dependent.”

We can interpret the dependency of the variables in two ways:
1. Using the test statistic
2. Using the p-value

1. Using the test statistic
We can interpret the test statistic in the context of the chi-squared distribution with the requisite number of degrees of freedom as follows:
  • If Statistic >= Critical Value: significant result, reject null hypothesis (H0), dependent.
  • If Statistic < Critical Value: not significant result, fail to reject null hypothesis (H0), independent.
The degrees of freedom for the chi-squared distribution are calculated based on the size of the contingency table as:

                     degrees of freedom: (rows - 1) * (cols - 1)
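The critical-value rule can be sketched with SciPy's `chi2.ppf`, which returns the quantile of the chi-squared distribution. The significance level of 0.05 is an assumption made here for illustration; the post does not fix a value.

```python
from scipy.stats import chi2

# Observed counts from the survey table
# (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = [[60, 54, 46, 41], [40, 44, 53, 57]]

rows, cols = len(observed), len(observed[0])
dof = (rows - 1) * (cols - 1)  # (2 - 1) * (4 - 1) = 3

# Test statistic: sum of (observed - expected)^2 / expected over all cells.
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)
statistic = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(rows)
    for j in range(cols)
)

alpha = 0.05  # assumed significance level
critical_value = chi2.ppf(1 - alpha, dof)

if statistic >= critical_value:
    decision = "reject H0 (dependent)"
else:
    decision = "fail to reject H0 (independent)"

print(f"dof={dof}, statistic={statistic:.3f}, critical={critical_value:.3f}")
print(decision)
```

With three degrees of freedom and the assumed alpha of 0.05, the statistic for the survey table lands just above the critical value, so the null hypothesis is rejected. A stricter alpha such as 0.01 raises the critical value and would change that decision.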

2. Using the p-value
In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:
  • If p-value <= alpha: significant result, reject null hypothesis (H0), dependent.
  • If p-value > alpha: not significant result, fail to reject null hypothesis (H0), independent.
For the test to be effective, each cell of the contingency table should have an expected frequency of at least five.
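In practice the whole test is usually run with a library call. The sketch below uses SciPy's `chi2_contingency`, which returns the statistic, the p-value, the degrees of freedom and the expected frequencies. As before, alpha = 0.05 is an assumed significance level, not one taken from the survey.

```python
from scipy.stats import chi2_contingency

# Observed counts from the survey table
# (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = [[60, 54, 46, 41], [40, 44, 53, 57]]

alpha = 0.05  # assumed significance level

statistic, p_value, dof, expected = chi2_contingency(observed)

# Rule of thumb: every expected frequency should be at least five.
smallest_expected = expected.min()
if smallest_expected < 5:
    print("Warning: some expected frequencies are below 5")

if p_value <= alpha:
    decision = "reject H0 (dependent)"
else:
    decision = "fail to reject H0 (independent)"

print(f"statistic={statistic:.3f}, p-value={p_value:.4f}, dof={dof}")
print(f"smallest expected frequency={smallest_expected:.2f}")
print(decision)
```

For this table the p-value falls slightly below 0.05, so both interpretations agree: at the assumed alpha the test suggests that education level depends on gender in this sample. All expected frequencies are well above five, so the rule of thumb is satisfied.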


