
Csci 418/618, Simulation Models, Spring, 2002

**Statistical Topic #2**


**The Input Side:**

How might we determine whether a set of observed data is drawn from a particular statistical distribution? The Chi-square goodness-of-fit test provides a quantitative statistical basis for making this judgment. In essence, the test compares the observed number of values of the random variable falling within each of a set of intervals against the number expected under the hypothesized distribution. The null hypothesis is a statement that the data follow a particular distribution. The results either provide evidence against the null hypothesis or fail to do so. If the null hypothesis is not rejected, the values we observe are consistent with the values that we would expect under the hypothesis, although this does not prove that they come from the same random variable.

*Steps in carrying out a chi-square test:*

i) Construct a frequency table of the observed values of the random variable and calculate the expected (theoretical) frequency for each interval under the assumption that the hypothesized distribution is correct. The expected number of observations in each interval should be at least five. Try to choose the intervals so that few of them contain fewer than three observations.

ii) Calculate $(O_i - E_i)^2 / E_i$ for each interval $i = 1, 2, \ldots, r$, where $O_i$ is the observed and $E_i$ is the expected frequency in interval $i$.

v) Calculate the test statistic

$$\chi^2 = \sum_{i=1}^{r} \frac{(O_i - E_i)^2}{E_i}$$

The test statistic is distributed according to the Chi-square distribution with $r - p - 1$ degrees of freedom, where $r$ is the number of intervals and $p$ is the number of parameters estimated for the hypothesized distribution.

vi) Look up the value of $\chi^2_{1-\alpha,\, r-p-1}$ in a chi-square table and reject the null hypothesis if

$$\chi^2 \ge \chi^2_{1-\alpha,\, r-p-1}$$

Note: A common mistake in step vi. is to assume that if we fail to reject the null hypothesis, then the null hypothesis must be true, i.e., we “accept” the null hypothesis. Strictly speaking, it is better to interpret this result as one of insufficient evidence to reject the null hypothesis, which is not the same as accepting it. It could happen, for example, that there is not enough data available to discriminate.
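The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names are made up for this sketch, the counts are hypothetical (they are not data from these notes), and the critical value 7.815 is the standard table entry for a significance level of 0.05 with 3 degrees of freedom.

```python
def chi_square_statistic(observed, expected):
    """Sum (O_i - E_i)^2 / E_i over all intervals."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(num_intervals, num_estimated_params):
    """r - p - 1, where p parameters were estimated from the data."""
    return num_intervals - num_estimated_params - 1


def reject_null(statistic, critical_value):
    """Reject H0 when the statistic reaches or exceeds the table value."""
    return statistic >= critical_value


# Hypothetical counts for illustration only (not from the notes):
# 4 intervals, 40 observations, 10 expected in each interval.
observed = [12, 9, 11, 8]
expected = [10.0, 10.0, 10.0, 10.0]

stat = chi_square_statistic(observed, expected)  # (4 + 1 + 1 + 4) / 10
df = degrees_of_freedom(len(observed), 0)        # no parameters estimated
print(stat, df, reject_null(stat, 7.815))
```

In this hypothetical case the statistic is far below the table value, so the null hypothesis is not rejected. As the note above stresses, that is not the same as accepting it.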


**Example of a Chi-square Goodness of Fit Test**


The following data are the observed lifetimes, in days, of a sample of 50 microprocessors run at 1.5 times nominal voltage to provide testing under stress:

|  |  |  |  |  |
|---|---|---|---|---|
| 79.919 | 3.081 | 0.062 | 1.961 | 5.845 |
| 3.027 | 6.505 | 0.021 | 0.013 | 0.123 |
| 6.769 | 59.899 | 1.192 | 34.760 | 5.009 |
| 18.387 | 0.141 | 43.565 | 24.420 | 0.433 |
| 144.695 | 2.663 | 17.967 | 0.091 | 9.003 |
| 0.941 | 0.878 | 3.371 | 2.157 | 7.579 |
| 0.624 | 5.380 | 3.148 | 7.078 | 23.960 |
| 0.590 | 1.928 | 0.300 | 0.002 | 0.543 |
| 7.004 | 31.764 | 1.005 | 1.147 | 0.219 |
| 3.217 | 14.382 | 1.008 | 2.336 | 4.562 |

| Bin upper boundary (days) | Frequency |
|---|---|
| 3 | 24 |
| 6 | 9 |
| 9 | 5 |
| 12 | 1 |
| 15 | 1 |
| 18 | 1 |
| 21 | 1 |
| 24 | 1 |
| 27 | 1 |
| 30 | 0 |
| 33 | 1 |
| 36 | 1 |
| 39 | 0 |
| 42 | 0 |
| 147 | 4 |

Note that there are a few examples of very large values, but most of the values are much smaller (e.g., around 30 of the 50 are 5 days or less). A rough look at the data, including a graphical histogram, suggests that the exponential distribution might be a reasonable fit.
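The frequency table can be reproduced from the raw lifetimes. The sketch below assumes each bin boundary is an upper limit and that bins are right-closed, which is consistent with the listed frequencies; the variable names are illustrative only.

```python
from bisect import bisect_left

# The 50 observed lifetimes (days) from the data table.
lifetimes = [
    79.919, 3.081, 0.062, 1.961, 5.845,
    3.027, 6.505, 0.021, 0.013, 0.123,
    6.769, 59.899, 1.192, 34.760, 5.009,
    18.387, 0.141, 43.565, 24.420, 0.433,
    144.695, 2.663, 17.967, 0.091, 9.003,
    0.941, 0.878, 3.371, 2.157, 7.579,
    0.624, 5.380, 3.148, 7.078, 23.960,
    0.590, 1.928, 0.300, 0.002, 0.543,
    7.004, 31.764, 1.005, 1.147, 0.219,
    3.217, 14.382, 1.008, 2.336, 4.562,
]

# Upper bin boundaries: 3, 6, ..., 42, plus a final catch-all bin ending at 147.
boundaries = list(range(3, 43, 3)) + [147]

counts = [0] * len(boundaries)
for x in lifetimes:
    # bisect_left gives the index of the first boundary >= x, so each bin
    # is (previous boundary, boundary]. Values above 147 are not handled;
    # the largest observation here is 144.695.
    counts[bisect_left(boundaries, x)] += 1

for upper, count in zip(boundaries, counts):
    print(f"<= {upper:>3}: {count}")
```

Printing the counts as a text histogram makes the heavy right skew easy to see, which is what motivates trying the exponential distribution.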

Recall that the exponential distribution has a probability density function given by

$$f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0,$$

where $\lambda > 0$ is the rate parameter.

Note that the average value of the data is 11.894, so the estimated rate, which is the reciprocal of the mean, is $\hat{\lambda} = 1/11.894 \approx 0.084$. To use a Chi-square goodness-of-fit test, form the hypotheses as follows:

Null hypothesis $H_0$: the random variable follows the exponential distribution

Alternative hypothesis $H_1$: the random variable does not follow the exponential distribution

We proceed by performing the Chi-square test with intervals of equal probability. If we choose $k = 8$ intervals, the probability of an observation falling into any one of them will be $p = 0.125$. The cumulative distribution function for the exponential is given by

$$F(a_i) = 1 - \exp(-\lambda a_i)$$

where $a_i$ is the endpoint of interval $i$, $i = 1, 2, \ldots, k$. Dividing the range of the dependent variable $F$ into $k$ equal parts gives $F(a_i) = ip$, so we can write

$$ip = 1 - \exp(-\lambda a_i)$$

which can be solved for $a_i$ with the following result:

$$a_i = -\frac{1}{\lambda} \ln(1 - ip), \qquad i = 0, 1, 2, \ldots, k$$

with $a_0 = 0$ and $a_k = \infty$.

With the estimator of $\lambda$ given by 0.084 and $k = 8$, we get the first interval breakpoint as

$$a_1 = -\frac{1}{0.084} \ln(1 - 0.125) = 1.590$$
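The breakpoint formula can be evaluated for every $i$ at once. The following is a minimal sketch that assumes the rate estimate 0.084 and $k = 8$ from the text; the function name `equal_probability_breakpoints` is illustrative and not part of the original notes.

```python
import math


def equal_probability_breakpoints(rate, k):
    """Return a_0..a_k such that F(a_i) = i/k for an exponential with this rate."""
    p = 1.0 / k
    points = [0.0]  # a_0: ln(1) = 0
    # Interior breakpoints a_1..a_{k-1} from a_i = -(1/rate) * ln(1 - i*p).
    points += [-(1.0 / rate) * math.log(1.0 - i * p) for i in range(1, k)]
    points.append(math.inf)  # a_k: the last interval is open-ended
    return points


breakpoints = equal_probability_breakpoints(rate=0.084, k=8)
for i, a in enumerate(breakpoints):
    print(f"a_{i} = {a:.3f}")
```

Because the rate is rounded to 0.084, the printed values can differ in the last decimal place from figures computed with an unrounded estimate.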


Applying the equation again gives the subsequent breakpoints 3.425, 5.595, …. The table below gives the observed and expected values.

| Interval | $O_i$ | $E_i$ | $(O_i - E_i)^2 / E_i$ |
|---|---|---|---|
| [0, 1.590) | 19 | 6.25 | 26.01 |
| [1.590, 3.425) | 10 | 6.25 | 2.25 |
| [3.425, 5.595) | 3 | 6.25 | 1.69 |
| [5.595, 8.252) | 6 | 6.25 | 0.01 |
| [8.252, 11.677) | 1 | 6.25 | 4.41 |
| [11.677, 16.503) | 1 | 6.25 | 4.41 |
| [16.503, 24.755) | 4 | 6.25 | 0.81 |
| [24.755, ∞) | 6 | 6.25 | 0.01 |

Adding the last column gives a chi-square value of 39.6. The degrees of freedom are given by $k - s - 1 = 8 - 1 - 1 = 6$, where $s = 1$ is the number of parameters estimated from the data (here $\lambda$). From Chi-square tables, with a significance level of 0.05, the tabulated value is 12.6. Since $39.6 > 12.6$, the null hypothesis is rejected.
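The arithmetic of the example can be checked with a short script. This sketch takes the observed counts from the table and, instead of a printed table, uses the closed-form tail probability of the Chi-square distribution for 6 degrees of freedom. The helper `chi2_sf_df6` is written for this illustration and is valid only for exactly 6 degrees of freedom.

```python
import math

observed = [19, 10, 3, 6, 1, 1, 4, 6]  # O_i for the eight intervals
n, k = sum(observed), len(observed)
expected = n / k                        # 50 / 8 = 6.25 in every interval

statistic = sum((o - expected) ** 2 / expected for o in observed)


def chi2_sf_df6(x):
    """P(X >= x) for a chi-square variable with exactly 6 degrees of freedom.

    Uses the closed form for even degrees of freedom 2m (here m = 3):
    exp(-x/2) * sum over j < m of (x/2)^j / j!
    """
    h = x / 2.0
    return math.exp(-h) * (1.0 + h + h * h / 2.0)


p_value = chi2_sf_df6(statistic)
print(f"chi-square = {statistic:.2f}, p-value = {p_value:.2e}")
print("reject H0 at the 0.05 level:", p_value < 0.05)
```

Evaluating `chi2_sf_df6(12.6)` gives a tail probability of about 0.05, which agrees with the tabulated critical value used above.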