A common problem in machine learning is determining whether
input features are relevant to the outcome to be predicted. This is the problem
of feature selection.
In the case of classification problems where the input variables
are also categorical, we can use statistical tests to determine whether the
output variable depends on or is independent of the input variables.
“A categorical variable is a variable that can take on one of a limited, and usually fixed, number of
possible values.”
Pearson’s chi-squared test is an example
of a statistical hypothesis test for independence between categorical variables.
Let us take
an example: is gender independent of education level? A random sample
of 395 people was surveyed, and each person was asked to report the highest
education level they had obtained. The data from the survey are
summarized in the following table:
| | High School | Bachelors | Masters | Ph.D. | Total |
|---|---|---|---|---|---|
| Female | 60 | 54 | 46 | 41 | 201 |
| Male | 40 | 44 | 53 | 57 | 194 |
| Total | 100 | 98 | 99 | 98 | 395 |
Karl Pearson called this kind of table
a contingency table, because the intent is to help
determine whether one variable is contingent upon, or depends upon, the other
variable.
The Chi-Squared test is a statistical hypothesis test that assumes (the null
hypothesis) that the observed frequencies for a categorical variable match the
expected frequencies for the categorical variable. The Chi-Squared test does
this for a contingency table, first calculating the expected frequencies for
the groups, then determining whether the division of the groups, called the
observed frequencies, matches the expected frequencies.
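The text does not spell out how the expected frequencies are obtained. The sketch below assumes the usual rule for a test of independence, in which each cell's expected count is its row total times its column total divided by the grand total. The variable names are illustrative only.

```python
# Observed counts from the survey table
# (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]

row_totals = [sum(row) for row in observed]        # [201, 194]
col_totals = [sum(col) for col in zip(*observed)]  # [100, 98, 99, 98]
grand_total = sum(row_totals)                      # 395

# Assumed rule: expected = row total * column total / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for label, row in zip(["Female", "Male"], expected):
    print(label, [round(value, 2) for value in row])
```

If gender and education level were independent, the observed counts would be expected to sit close to these values.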
The result of the test is a test statistic that
has a chi-squared distribution and can be interpreted to reject or fail to
reject the assumption or null hypothesis that the observed and expected
frequencies are the same.
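The test statistic is a sum over all cells of the table, where each cell
contributes the squared difference between its observed and expected frequency,
divided by the expected frequency. When the observed frequency is far from the
expected frequency, the corresponding term in the sum is large; when the two
are close, this term is small. Large values of **Chi-square** indicate that
observed and expected frequencies are far apart. Small values of
**Chi-square** mean the opposite: observed are close to expected.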
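To make the sum concrete, here is a minimal sketch that computes the statistic for the survey table in plain Python. It assumes the standard Pearson form, in which each cell contributes (observed - expected)^2 / expected, and the usual independence rule for the expected counts. The names are illustrative.

```python
# Observed counts (rows: Female, Male; columns: the four education levels).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts, assuming the two variables are independent.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Sum the per-cell contributions (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(round(chi_square, 3))  # roughly 8.006 for this table
```

The largest contributions come from the High School and Ph.D. columns, where the observed counts differ most from the expected ones.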
“The variables
are considered independent if the observed and expected frequencies are
similar, that is, the levels of the variables do not interact and are not dependent.”
We can interpret the dependency of the variables in two ways:
1. Using the test statistic
2. Using the p-value
1. Using the test statistic
We can
interpret the test statistic in the context of the chi-squared distribution
with the requisite number of degrees of freedom as follows:
- If Statistic >= Critical Value: significant result, reject null hypothesis (H0), dependent.
- If Statistic < Critical Value: not significant result, fail to reject null hypothesis (H0), independent.
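The
degrees of freedom for the chi-squared distribution are calculated based on the
size of the contingency table as:
degrees of freedom: (rows - 1) * (cols - 1)
For the gender and education table, which has 2 rows and 4 columns once the totals are excluded, this gives (2 - 1) * (4 - 1) = 3.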
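The comparison against a critical value can be sketched with SciPy. This is one possible approach, not necessarily the one the author used. It assumes SciPy is installed and a significance level of 0.05, which is not fixed by the text. `chi2_contingency` and `chi2.ppf` are existing functions in `scipy.stats`.

```python
from scipy.stats import chi2, chi2_contingency

# Observed counts (rows: Female, Male; columns: the four education levels).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]

# Degrees of freedom from the table size: (rows - 1) * (cols - 1).
rows, cols = len(observed), len(observed[0])
dof = (rows - 1) * (cols - 1)

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom and the expected frequencies, in that order.
stat, p_value, scipy_dof, expected = chi2_contingency(observed)

alpha = 0.05                         # assumed significance level
critical = chi2.ppf(1 - alpha, dof)  # critical value of the distribution

print(f"dof={dof}, statistic={stat:.3f}, critical={critical:.3f}")
if stat >= critical:
    print("Significant result: reject H0 (dependent)")
else:
    print("Not significant: fail to reject H0 (independent)")
```

With this table the statistic is slightly above the critical value at the assumed 0.05 level, so the null hypothesis of independence is rejected.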
2. Using the p-value
In terms
of a p-value and a chosen significance level (alpha), the test can be
interpreted as follows:
- If p-value <= alpha: significant result, reject null hypothesis (H0), dependent.
- If p-value > alpha: not significant result, fail to reject null hypothesis (H0), independent.
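The p-value rule can be sketched in the same way. The example assumes SciPy is available and again uses alpha = 0.05 as an illustrative choice; the text itself does not fix a value.

```python
from scipy.stats import chi2_contingency

# Observed counts (rows: Female, Male; columns: the four education levels).
observed = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]

# The second returned value is the p-value of the test.
stat, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level

print(f"p-value={p_value:.4f}")
if p_value <= alpha:
    print("Significant result: reject H0 (dependent)")
else:
    print("Not significant: fail to reject H0 (independent)")
```

For this sample the p-value falls just below 0.05, so the conclusion changes with a stricter alpha such as 0.01.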
For the
test to be reliable, each cell of the contingency table should have an
expected frequency of at least five.