Categorical Data Analysis
I. So far:
- we’ve been looking at continuous data arranged into two or more groups, where each
“group” has more than one observation.
- e.g., a series of measurements on two or more “things”.
II. Now:
- we’re interested in data that are categorical. Often the data are just counts of how many
observations fall into each category.
- for example:
Number of people with blood type A: y1
Number of people with blood type B: y2
Number of people with blood type AB: y3
Number of people with blood type O: y4
- if we had some theory about the distribution of blood types, how would we be able to
test it?
- For example, we have reason (somehow) to believe that 34% of people have blood type
A, 15% blood type B, 23% blood type AB, and 28% blood type O.
- we go out and collect a sample of 100 people, and find the following:
A: 12 B: 56 AB: 2 O: 30
- is our result compatible with our hypothesis??
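As a first look at this question, the sketch below lines up the observed counts against the counts the hypothesized percentages would predict for a sample of 100. The variable names are illustrative only and do not come from the notes.

```python
# Hypothesized proportions and observed counts from the blood type example.
hypothesized = {"A": 0.34, "B": 0.15, "AB": 0.23, "O": 0.28}
observed = {"A": 12, "B": 56, "AB": 2, "O": 30}

n = sum(observed.values())  # total sample size (100 here)

# Expected count = hypothesized proportion x sample size.
expected = {group: p * n for group, p in hypothesized.items()}

for group in observed:
    diff = observed[group] - expected[group]
    print(f"{group:>2}: observed {observed[group]:3d}, "
          f"expected {expected[group]:5.1f}, difference {diff:+6.1f}")
```

The large gaps for B and AB are the kind of discrepancy a goodness of fit test puts a number on.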
III. The Chi-square goodness of fit test.
- Essentially, we have a number of categories for our data, and we have some idea as to
the “proportion” or frequency of “things” we expect in each of our categories.
A) Here’s another example (from genetics):
1) We have a simple dominant-recessive relationship. Say, yellow and purple
corn kernel color. Purple is dominant, and yellow is recessive. Without going
into the details, if we have two heterozygous parents, for our offspring we would
expect 3/4 of our kernels to be purple and 1/4 yellow (you’ll get the details in
genetics).
2) You go out and collect a sample of 267 corn kernels.
i) how many do you expect to be purple?
3/4 x 267 = 200.25
ii) yellow?
1/4 x 267 = 66.75
(notice - take the proportion you expect, and multiply this by the
total number in your sample to get what you expect in your
sample)
3) Now you can compare this to what you actually got. Suppose you took your
sample and got:
- 157 purple
- 110 yellow
4) Looking at this, you got more yellows and fewer purples than you expected. But is
this just due to random chance?
5) Set up your hypotheses:
H0: Pr{purple} = .75, Pr{yellow} = .25
(your H0 is different from what you’re used to - we’ll learn more soon)
H1: At least one probability listed in H0 is incorrect (see your text for
another way of phrasing this).
6) Decide on alpha (let’s pick .05)
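7) Calculate your test statistic. This is now Chi-squared star (or sub s, as your text
calls it):
χ²* = Σ (Oi − Ei)² / Ei
- Oi is the observed count and Ei is the expected count in category i.
- i goes from 1 to c, where c is the number of categories (2 in our
example).
- for our example we have:
χ²* = (157 − 200.25)²/200.25 + (110 − 66.75)²/66.75 = 9.34 + 28.02 = 37.36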
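The calculation in step 7 can be sketched in Python as follows. The function name `chi_square_stat` is illustrative; it sums (observed − expected)²/expected over the categories, which is the usual form of the goodness of fit statistic.

```python
def chi_square_stat(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Corn kernel example: 267 kernels, hypothesized 3/4 purple and 1/4 yellow.
observed = [157, 110]                   # purple, yellow
n = sum(observed)                       # 267
expected = [0.75 * n, 0.25 * n]         # 200.25 and 66.75

chi_sq_star = chi_square_stat(observed, expected)
print(f"chi-squared-star = {chi_sq_star:.2f}")
```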
8) Compare this to the tabulated Chi-squared with c-1 degrees of freedom and the
appropriate level of alpha.
a) the Chi-squared distribution.
- the value you calculate, Chi-square-star, will follow a Chi-
squared distribution if n is moderately large.
(notice that the Chi-squared test is based on an
approximation, just like the KW test. You can get exact
values, but they’re a bit of a pain.)
- Just like many other distributions, the value of the Chi-squared
distribution depends on d.f. (or ν).
- here is what it looks like for a couple of different values of ν:
[figure not reproduced: Chi-squared density curves for several values of ν]
- just like before, we reject for values that wind up in the tails
(usually only in the upper tail).
- incidentally, what would a distribution look like composed of
squared normally distributed variables?
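One way to explore that last question is by simulation. The sketch below (an illustration, not part of the course material) draws standard normal values, squares them, and adds up ν of them at a time; sums built this way follow a Chi-squared distribution with ν degrees of freedom. The seed and the number of draws are arbitrary choices.

```python
import random


def simulate(df, draws, seed=1):
    """Return `draws` values, each the sum of `df` squared standard normal draws."""
    rng = random.Random(seed)  # arbitrary seed so the run is repeatable
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
            for _ in range(draws)]


for df in (1, 2, 5):
    sample = simulate(df, draws=20000)
    mean = sum(sample) / len(sample)
    share_above_mean = sum(x > mean for x in sample) / len(sample)
    # Values are never negative, the mean lands near df, and fewer than half
    # of the values exceed the mean, which points to a long right tail.
    print(f"df = {df}: min = {min(sample):.3f}, mean = {mean:.2f}, "
          f"share above the mean = {share_above_mean:.2f}")
```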
b) If Chi-squared-star ≥ tabulated Chi-square, we reject our H0.
- here’s our comparison:
χ²* = 37.36 ≥ 3.84 (the tabulated Chi-square for 1 d.f. and α = .05)
so we reject our H0 and conclude that at least one of our
proportions is not as specified in H0.
- important point - notice that with just two categories, if
one of our proportions is wrong, that immediately implies
that the other one is wrong as well (why??).
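The comparison in step 8 can also be made with a p-value instead of a table lookup. The sketch below assumes 1 degree of freedom, where the upper-tail area of the Chi-squared distribution equals P(|Z| > √x) for a standard normal Z and can be written with the complementary error function. The function name is illustrative, and the tabulated value is the rounded table entry for 1 d.f. and α = .05.

```python
import math


def upper_tail_1df(x):
    """P(Chi-squared with 1 d.f. > x), via P(|Z| > sqrt(x)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))


alpha = 0.05
tabulated = 3.84  # rounded table value for 1 d.f. and alpha = .05

# Corn kernel example: observed 157 and 110, expected 200.25 and 66.75.
chi_sq_star = (157 - 200.25) ** 2 / 200.25 + (110 - 66.75) ** 2 / 66.75

p_value = upper_tail_1df(chi_sq_star)
reject_by_table = chi_sq_star >= tabulated
reject_by_p = p_value <= alpha

print(f"chi-squared-star = {chi_sq_star:.2f}, p-value = {p_value:.2e}")
print("reject H0" if reject_by_p else "fail to reject H0")
```

Rejecting when the p-value is at most α is the same decision as rejecting when Chi-squared-star is at least the tabulated value.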
B) Some comments:
a) except in the case of two categories, the alternative hypothesis is non-
directional. But for a directional alternative, the binomial test is much better (see
comments below).
b) except in this two category case, the null hypothesis is actually compound -
meaning it consists of more than one part. For example:
Pr(I) = y1, Pr(II) = y2, Pr(III) = y3.
You need to specify, with numbers, proportions, or probabilities at least
two of the things above (the other one is then automatically determined -
incidentally, that’s how degrees of freedom works).
Instead of μ1 = μ2, or even μ1 = μ2 = μ3, you have several “equations” for
your H0.
- In the two category case, once you specify one category (say Pr(I) = y1),
you don’t need to specify the other because it is “determined” by the first
(once you know the first, you know the second). So this remains a
“simple” null hypothesis.
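The bookkeeping behind “the other one is then automatically determined” can be sketched as below. The helper name and the three-category probabilities are made up for illustration.

```python
def complete_null(specified):
    """Given c - 1 category probabilities, return all c (they must sum to 1)."""
    last = 1.0 - sum(specified)
    if last < 0 or any(p < 0 for p in specified):
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    return list(specified) + [last]


# Three categories: specify two probabilities, the third follows.
three = complete_null([0.5, 0.3])
# Two categories: specify one, the second follows (a "simple" null).
two = complete_null([0.6])

for probs in (three, two):
    degrees_of_freedom = len(probs) - 1   # number of freely specified values
    print([round(p, 4) for p in probs], "d.f. =", degrees_of_freedom)
```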
c) in the two category case you could specify a directional alternative as follows:
H0: Pr{Male} = .6 (and therefore Pr{Female} = .4)
H1: Pr{Male} > .6 (so what are females?)
- you should be fairly comfortable with this concept by now.
- However, this test is not very powerful (see below), and since Minitab
(surprise!) doesn’t do goodness of fit tests, we won’t learn how to do a one-sided
test. (A computer package would increase the power just a bit because of the way
you have to look up critical values in tables).
If you really need to do a one-sided test, use the binomial test - it’s much better for
two categories in any case (again, see below).
C. Two examples of the Chi-squared test:
1) Exercise 10.1 from p. 392 [10.1, p. 399]:
a) Geneticists propose that the color of summer squash should follow a
12:3:1 ratio. Researchers collected the following data:
white: 155 yellow: 40 green: 10
b) H0: Pr{white} = .75 (12+3+1=16, so 12/16 = .75)
Pr{yellow} = .1875
Pr{green} = .0625 (we didn’t have to specify green - why not?)
H1: at least one of these proportions is not as specified.
c) α = .10
d) Calculate our expected values:
(our total sample size is 205)
white: .75 x 205 = 153.75
yellow: .1875 x 205 = 38.4375
green: .0625 x 205 = 12.8125
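e) Calculate chi-squared-star:
χ²* = (155 − 153.75)²/153.75 + (40 − 38.4375)²/38.4375 + (10 − 12.8125)²/12.8125
= 0.010 + 0.064 + 0.617 = 0.691
f) Our tabulated chi-squared (c − 1 = 2 d.f., α = .10):
4.61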
g) Because chi-square-star is less than our tabulated chi-square, we “fail to
reject”, and conclude that our null hypothesis is consistent with the data.
We have no evidence to show that summer squash does not follow a
12:3:1 ratio.
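The squash example can be checked with a short sketch. With 2 degrees of freedom the upper-tail area of the Chi-squared distribution has the closed form e^(−x/2), which the code uses in place of a table; the variable names are illustrative.

```python
import math

observed = {"white": 155, "yellow": 40, "green": 10}
ratio = {"white": 12, "yellow": 3, "green": 1}       # hypothesized 12:3:1

n = sum(observed.values())                           # 205
total_parts = sum(ratio.values())                    # 16
expected = {k: n * ratio[k] / total_parts for k in ratio}

chi_sq_star = sum((observed[k] - expected[k]) ** 2 / expected[k]
                  for k in observed)

# Upper-tail area for 2 d.f. (number of categories minus 1).
p_value = math.exp(-chi_sq_star / 2.0)

alpha = 0.10
print(f"chi-squared-star = {chi_sq_star:.3f}, p-value = {p_value:.3f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```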
2) Color vision in squirrels [exercise 10.9, p. 401]. A squirrel was exposed to a
red panel and two white panels. By pressing the red panel, the squirrel was
rewarded; no reward was given for pressing the white panel. In 75 trials, the
squirrel correctly pressed the red panel 45 times. Can the squirrel see color?
H0: Pr{red} = 1/3 (so Pr{white} = 2/3)
H1: Pr{red} ≠ 1/3
(Incidentally, a squirrel is going to do the best it can to get food, so the
alternative probably should be one-sided here: H1: Pr{red} > 1/3.)
choose α = .02.
calculate our expected:
for red, 75 x 1/3 = 25
for white, 75 x 2/3 = 50
(also note that the observed proportion of red is 45/75 = .6, which is above 1/3, so if
we had gone with our directional alternative, it would have made sense).
calculate our chi-squared-star:
χ²* = (45 − 25)²/25 + (30 − 50)²/50 = 16 + 8 = 24
the tabulated chi-square (1 d.f., α = .02) is:
5.41
Obviously, since chi-squared-star >> the tabulated value, we get to reject
and conclude that squirrels can see the color red.
- We’re pretty much done. But if we want to get an approximate p-
value, here’s how:
- look in table 8 [9] until we get to the closest number that
is less than our calculated chi-squared-star.
- In table 8, for one d.f., this gives us 15.14, and an
associated p < .0001.
Bottom line: we are very confident that squirrels can see the color red.
- incidentally, you’re really just interested in the minimum possible
p-value, so don’t bother with all the bracketing stuff you see in
your text.
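For the squirrel example, the sketch below recomputes chi-squared-star and, instead of scanning the table, gets the p-value for 1 d.f. from the complementary error function (using the fact that a Chi-squared variable with 1 d.f. is a squared standard normal). This is only an illustration; the names are not from the text.

```python
import math

trials = 75
observed = [45, 30]                          # red, white
expected = [trials * 1 / 3, trials * 2 / 3]  # 25 and 50

chi_sq_star = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Upper-tail area of Chi-squared with 1 d.f.: P(|Z| > sqrt(x)), Z standard normal.
p_value = math.erfc(math.sqrt(chi_sq_star / 2.0))

alpha = 0.02
print(f"chi-squared-star = {chi_sq_star:.2f}, p-value = {p_value:.1e}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```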
3) The binomial test (overview only). A better test for two categories is the “binomial
test”. We don’t cover it in this class, though some introductory classes do
(see section 6.6 [6.6] of your text to get an idea of what’s going on).
- the basic idea is that with two categories you’re really testing a single
proportion (e.g., is a baseball player hitting .300?). We know how to
calculate the probabilities of a binomial, so this really isn’t that difficult.
E.g., assume p = .3, q = .7, and our baseball player got 30 hits in 130 at-bats
(p-hat = 30/130 = .23); if p really is .3, what is the probability of that outcome?
- you really want to look into this if you need a one-sided test. It has better
power than the chi-square test with two categories, particularly with a
one-sided test.
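To give an idea of what the binomial calculation looks like, here is a minimal sketch for the baseball example. It computes the probability of exactly 30 hits and, assuming a one-sided alternative (the player hits below .300), the probability of 30 or fewer hits in 130 at-bats when p = .3. The choice of alternative is an assumption made for illustration, not something stated above.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 130, 0.3
hits = 30

prob_exactly = binom_pmf(hits, n, p)
# One-sided p-value: outcomes at least as far below expectation as observed.
prob_at_most = sum(binom_pmf(k, n, p) for k in range(hits + 1))

print(f"P(exactly {hits} hits) = {prob_exactly:.4f}")
print(f"P({hits} or fewer hits) = {prob_at_most:.4f}")
```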
D. Some assumptions:
- the data are collected randomly (you just can’t get away from this one!)
- The smallest expected count is at least 5. Remember, the chi-square test is an
approximation, and approximations get better the bigger the sample size.
Conversely, they get worse the smaller the sample size. If any of your categories
has an expected value below 5, then your approximation will be
pretty awful. There are other techniques (e.g., exact methods) for dealing with
smaller sample sizes, but we won’t learn them here.
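The expected-count rule is easy to check before running the test. A minimal sketch, with an illustrative helper name; the second sample size (40) is made up to show a case that fails the rule.

```python
def smallest_expected(proportions, n):
    """Smallest expected count (proportion x sample size) across categories."""
    return min(p * n for p in proportions)


squash = [12 / 16, 3 / 16, 1 / 16]     # hypothesized 12:3:1 ratio

for n in (205, 40):                    # 205 is the squash sample; 40 is hypothetical
    smallest = smallest_expected(squash, n)
    verdict = "approximation OK" if smallest >= 5 else "expected count too small"
    print(f"n = {n}: smallest expected = {smallest:.4f} -> {verdict}")
```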