Csci 418/618, Simulation Models, Spring, 2002

Statistical Topic #2

 

The Input Side:

Procedure for Carrying Out a Chi-square Goodness-of-Fit Test

 

How might we determine whether a set of observed data is drawn from a particular statistical distribution?  The Chi-square goodness-of-fit test provides a quantitative statistical basis for making this judgment.  In essence, the test compares the observed number of values of the random variable falling within each of a set of intervals against the number expected under the hypothesized distribution.  The null hypothesis is a statement that the data follow a particular distribution.  The results either lead us to reject the null hypothesis or fail to do so.  If the null hypothesis is not rejected, the observed values are consistent with the hypothesized distribution at the chosen significance level; this is not proof that the hypothesis is true.

 

Steps in carrying out a chi-square test:

 

i)                    Construct a frequency table of the observed values of the random variable and calculate the expected (theoretical) frequency for each interval under the assumption that the hypothesized distribution is correct.  The expected number of observations in each interval should be at least five.  Try to choose the intervals so that few of them contain fewer than three observations.

 

ii)                   Calculate $(O_i - E_i)^2 / E_i$ for each interval $i = 1, 2, \ldots, r$, where $O_i$ is the observed and $E_i$ is the expected frequency in interval $i$.

 

v)         Calculate the test statistic $\chi^2 = \sum_{i=1}^{r} (O_i - E_i)^2 / E_i$

 

The test statistic is approximately distributed according to the Chi-square distribution with $r - p - 1$ degrees of freedom, where $r$ is the number of intervals and $p$ is the number of parameters estimated from the data for the hypothesized distribution.

 

vi)        Look up the value of $\chi^2_{1-\alpha,\, r-p-1}$ in a chi-square table and reject the null hypothesis if

                                    $\chi^2 \ge \chi^2_{1-\alpha,\, r-p-1}$

 

Note:  A common mistake in step vi is to assume that failing to reject the null hypothesis means the null hypothesis must be true, i.e., that we “accept” it.  Strictly speaking, this result should be interpreted as insufficient evidence to reject the null hypothesis, which is not the same as accepting it.  It could happen, for example, that there are not enough data available to discriminate between the hypothesized distribution and the true one.
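The steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function names are hypothetical, and the critical value is assumed to have been looked up in a chi-square table and passed in by the caller.

```python
def chi_square_statistic(observed, expected, min_expected=5.0):
    """Sum over intervals of (O_i - E_i)^2 / E_i (steps ii and v)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Step i guideline: every expected frequency should be at least five.
    if any(e < min_expected for e in expected):
        raise ValueError("an expected frequency is below the recommended minimum")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(num_intervals, num_estimated_params):
    """r - p - 1, the degrees of freedom stated for the test statistic."""
    return num_intervals - num_estimated_params - 1


def reject_null(statistic, critical_value):
    """Step vi: reject H0 when the statistic reaches the tabulated value."""
    return statistic >= critical_value
```

A return value of `False` from `reject_null` means only that the evidence was insufficient to reject the null hypothesis; it does not mean the hypothesis has been accepted.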

 

 

Example of a Chi-square Goodness of Fit Test

 

The following data are the observed lifetimes, in days, of 50 microprocessors run at 1.5 times nominal voltage as a stress test:

 

   79.919     3.081     0.062     1.961     5.845
    3.027     6.505     0.021     0.013     0.123
    6.769    59.899     1.192    34.760     5.009
   18.387     0.141    43.565    24.420     0.433
  144.695     2.663    17.967     0.091     9.003
    0.941     0.878     3.371     2.157     7.579
    0.624     5.380     3.148     7.078    23.960
    0.590     1.928     0.300     0.002     0.543
    7.004    31.764     1.005     1.147     0.219
    3.217    14.382     1.008     2.336     4.562

 

 

 

 

Bin upper boundary (days)    Frequency
          3                     24
          6                      9
          9                      5
         12                      1
         15                      1
         18                      1
         21                      1
         24                      1
         27                      1
         30                      0
         33                      1
         36                      1
         39                      0
         42                      0
        147                      4

 

 

 

Note that there are a few examples of very large values, but most of the values are much smaller (e.g., around 30 of the 50 are 5 days or less).  A rough look at the data, including a graphical histogram, suggests that the exponential distribution might be a reasonable fit. 
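The frequency table can be reproduced with a short Python sketch. It assumes that each bin boundary is an upper edge and that a value equal to a boundary falls in the lower bin; no value in this data set sits exactly on a boundary, so that assumption does not affect the counts here. The function name `bin_counts` is hypothetical.

```python
from bisect import bisect_left

# Lifetimes in days, copied from the data listing above.
lifetimes = [
    79.919, 3.081, 0.062, 1.961, 5.845, 3.027, 6.505, 0.021, 0.013, 0.123,
    6.769, 59.899, 1.192, 34.760, 5.009, 18.387, 0.141, 43.565, 24.420, 0.433,
    144.695, 2.663, 17.967, 0.091, 9.003, 0.941, 0.878, 3.371, 2.157, 7.579,
    0.624, 5.380, 3.148, 7.078, 23.960, 0.590, 1.928, 0.300, 0.002, 0.543,
    7.004, 31.764, 1.005, 1.147, 0.219, 3.217, 14.382, 1.008, 2.336, 4.562,
]

# Upper bin boundaries from the frequency table.
boundaries = [3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 147]


def bin_counts(values, upper_bounds):
    """Count values per bin, where bin j is (upper_bounds[j-1], upper_bounds[j]]."""
    counts = [0] * len(upper_bounds)
    for v in values:
        # bisect_left gives the index of the first boundary >= v.
        # A value above the last boundary would raise IndexError.
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


counts = bin_counts(lifetimes, boundaries)
for upper, count in zip(boundaries, counts):
    print(f"<= {upper:>3}: {'#' * count} ({count})")
```

The printed rows of `#` characters give a rough text histogram, which makes the heavy concentration of short lifetimes and the long right tail easy to see.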

Recall that the exponential distribution has a probability density function given by

 

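$$f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0,$$

and $f(x) = 0$ otherwise, where $\lambda > 0$ is the rate parameter.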

Note that the average value of the data is 11.905; its reciprocal gives the estimated rate $\lambda = 1 / 11.905 = 0.084$.  To use a Chi-square goodness-of-fit test, form the hypotheses as follows:

 

Null hypothesis            H0    :  the random variable follows the exponential distribution

Alternative hypothesis H1    :  the random variable does not follow the exponential distribution

 

We proceed by performing the Chi-square test with intervals of equal probability.  If we choose $k = 8$ intervals, the probability of an observation falling into any one of them will be $p = 1/k = 0.125$.  The cumulative distribution function of the exponential distribution is given by

$$F(a_i) = 1 - \exp(-\lambda a_i)$$

where $a_i$ is the endpoint of interval $i$, $i = 1, 2, \ldots, k$.  Dividing the range of $F$ into $k$ equal parts gives $F(a_i) = ip$, so we can write

$$ip = 1 - \exp(-\lambda a_i)$$

which can be solved for $a_i$ with the following result:

$$a_i = -\frac{1}{\lambda} \ln(1 - ip), \qquad i = 0, 1, 2, \ldots, k$$

so that $a_0 = 0$ and $a_k = \infty$.
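For completeness, the intermediate algebra behind this result is short. Starting from the equal-probability condition, and assuming $0 \le ip < 1$ so that the logarithm is defined:

```latex
\begin{aligned}
ip &= 1 - \exp(-\lambda a_i) \\
\exp(-\lambda a_i) &= 1 - ip \\
-\lambda a_i &= \ln(1 - ip) \\
a_i &= -\frac{1}{\lambda}\,\ln(1 - ip)
\end{aligned}
```

For $i = k$ the argument of the logarithm is zero, which corresponds to the last interval extending to infinity.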

 

With $\lambda$ estimated as 0.084 and $k = 8$, the first interval breakpoint is

$$a_1 = -\frac{1}{0.084} \ln(1 - 0.125) = 1.590$$

Applying the equation again gives the subsequent breakpoints 3.425, 5.595, and so on.  The table below gives the observed and expected frequencies.

Interval              O_i     E_i      (O_i - E_i)^2 / E_i
[0, 1.590)             19     6.25          26.01
[1.590, 3.425)         10     6.25           2.25
[3.425, 5.595)          3     6.25           1.69
[5.595, 8.252)          6     6.25           0.01
[8.252, 11.677)         1     6.25           4.41
[11.677, 16.503)        1     6.25           4.41
[16.503, 24.755)        4     6.25           0.81
[24.755, ∞)             6     6.25           0.01
Total                  50    50.00          39.60
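The table can be recomputed from the raw data with the sketch below. It takes the rate estimate 0.084 and $k = 8$ from the text; the function name `breakpoints` is hypothetical, and the last interval is assumed to be unbounded above.

```python
import math

lam = 0.084  # rate estimate quoted in the text
k = 8        # number of equal-probability intervals

# Lifetimes in days, copied from the data listing above.
lifetimes = [
    79.919, 3.081, 0.062, 1.961, 5.845, 3.027, 6.505, 0.021, 0.013, 0.123,
    6.769, 59.899, 1.192, 34.760, 5.009, 18.387, 0.141, 43.565, 24.420, 0.433,
    144.695, 2.663, 17.967, 0.091, 9.003, 0.941, 0.878, 3.371, 2.157, 7.579,
    0.624, 5.380, 3.148, 7.078, 23.960, 0.590, 1.928, 0.300, 0.002, 0.543,
    7.004, 31.764, 1.005, 1.147, 0.219, 3.217, 14.382, 1.008, 2.336, 4.562,
]


def breakpoints(lam, k):
    """Endpoints a_0..a_k satisfying F(a_i) = i/k for the exponential CDF."""
    p = 1.0 / k
    inner = [-math.log(1.0 - i * p) / lam for i in range(1, k)]
    return [0.0] + inner + [math.inf]


a = breakpoints(lam, k)
expected = len(lifetimes) / k  # equal expected frequency in every interval

# Count observations in each half-open interval [a_{i-1}, a_i).
observed = [sum(1 for x in lifetimes if lo <= x < hi) for lo, hi in zip(a, a[1:])]
terms = [(o - expected) ** 2 / expected for o in observed]

for lo, hi, o, t in zip(a, a[1:], observed, terms):
    print(f"[{lo:7.3f}, {hi:7.3f})  O={o:2d}  E={expected:.2f}  term={t:.2f}")
print(f"chi-square = {sum(terms):.2f}")
```

Using equal-probability intervals keeps every expected frequency at 6.25, which satisfies the guideline of at least five expected observations per interval.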

 

 

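Adding the last column gives a chi-square value of 39.6.  The degrees of freedom are given by $k - s - 1 = 8 - 1 - 1 = 6$, where $s = 1$ is the number of parameters estimated from the data (here, $\lambda$).  From Chi-square tables, with a significance level of 0.05, the tabulated value is 12.6.  Since 39.6 > 12.6, the null hypothesis is rejected: at the 0.05 level, the exponential distribution is not an adequate fit to these data.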
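As a cross-check on the table lookup, the tail probability of the test statistic can be computed directly. The sketch below relies on a closed form for the chi-square tail that holds only when the degrees of freedom are even, which happens to be the case here; the function name is hypothetical and this is not a general-purpose replacement for a statistics library.

```python
import math


def chi_square_sf_even_df(x, df):
    """P(X >= x) for a chi-square variable with even, positive degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to even, positive df")
    half = x / 2.0
    # exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


statistic = 39.6  # chi-square value from the example
df = 6            # 8 intervals - 1 estimated parameter - 1
alpha = 0.05      # significance level used in the example

p_value = chi_square_sf_even_df(statistic, df)
print(f"p-value = {p_value:.2e}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Comparing the p-value with the significance level is equivalent to comparing the statistic with the tabulated critical value, so both routes lead to the same decision.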