Chi-Square Statistical Test Evaluates Independence

June 2013

Statisticians describe independence as whether the occurrence of one event or characteristic makes it neither more nor less probable that other event(s) or characteristic(s) occur(s). The chi-square test described below is one of the most widely used tests for evaluating independence of variables, particularly when the number of observations and/or variables becomes larger. This article focuses on testing whether employment discrimination is occurring, but chi-square tests can be used for numerous other applications.

How the U.S. Government Defines and Tests for Employment Discrimination

The “Uniform Guidelines on Employee Selection Procedures” published by the U.S. Equal Employment Opportunity Commission (EEOC) describes their purpose as follows:

“These guidelines incorporate a single set of principles which are designed to assist employers, labor organizations, employment agencies, and licensing and certification boards to comply with requirements of Federal law prohibiting employment practices which discriminate on grounds of race, color, religion, sex, and national origin. They are designed to provide a framework for determining the proper use of tests and other selection procedures.”

The EEOC’s guidelines define adverse impact as:

“A substantially different rate of selection in hiring, promotion, or other employment decision which works to the disadvantage of members of a race, sex, or ethnic group.”

The U.S. Department of Labor (DOL) provides a similar definition, as follows:

“Disparate impact is a theory of employment discrimination based on the disproportionate effect of a racially neutral criterion/ process. The theory refers to the discriminatory effects of uniformly applied employment criteria/processes that are neutral on their face but which more harshly affect minorities or women and cannot be justified by business necessity or job relatedness. Because the disparate impact analysis addresses the effects of a particular requirement on groups of people, it is generally a statistical proof.”

The EEOC’s guidelines describe a simplistic way of testing for adverse impact using simple ratios, as follows:

“A selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact, while a greater than four-fifths rate will generally not be regarded by Federal enforcement agencies as evidence of adverse impact. … Greater differences in selection rate may not constitute adverse impact where the differences are based on small numbers and are not statistically significant …”

The DOL similarly defines the same 80% test as follows:

“A substantially different rate of selection in hiring, promotion, transfer, training or in other employment decisions which works to the disadvantage of minorities or women. If such rate is less than 80 percent of the selection rate of the race, sex, or ethnic group with the highest rate of selection, this will generally be regarded as evidence of adverse impact. Adverse impact analyses based on the 80% rule may be buttressed by a test of statistical significance.”

For additional background information, see RIF Statistical Audits Reduce Discrimination Risks.

The following illustrates the EEOC’s and DOL’s adverse impact calculation using the four-fifths or 80% rule. Assume a company with 100 employees (50 men and 50 women) is facing a layoff or reduction in force (RIF) of 12 people. The employer decides to lay off 5 men and 7 women. The four-fifths or 80% rule would be calculated as follows:

                     Before the RIF    RIF    % RIF
Men                        50            5     10%
Women                      50            7     14%
Total                     100           12     12%

Adverse Impact Ratio (calculated as 10% / 14%):  71%


The EEOC’s and DOL’s 80% test fails. However, the number of laid-off employees (twelve) is small, so anything other than a perfectly even split (in this case, six men and six women) would cause the 80% test to fail. Intuitively, most observers would conclude that (i) the layoff percentages shown in the above table could reasonably be attributed to chance rather than to men being favored over women, and (ii) the termination of one additional woman in this situation should not, by itself, expose the employer to liability for discrimination.
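
The four-fifths arithmetic above can be sketched in a few lines. This is a minimal illustration using the table’s figures; the `adverse_impact_ratio` helper is named for illustration only, not an EEOC-defined function:

```python
# Four-fifths (80%) rule applied to the layoff table above.
# The helper name is illustrative, not an EEOC-defined function.
def adverse_impact_ratio(selected_a, total_a, selected_b, total_b):
    """Ratio of the lower selection rate to the higher selection rate."""
    rate_a = selected_a / total_a
    rate_b = selected_b / total_b
    return min(rate_a, rate_b) / max(rate_a, rate_b)

# 5 of 50 men and 7 of 50 women were laid off.
ratio = adverse_impact_ratio(5, 50, 7, 50)

print(f"adverse impact ratio = {ratio:.1%}")   # about 71.4%
print("80% test passes" if ratio >= 0.8 else "80% test fails")
```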

Inferential Statistics Provide a More Complete Answer

Inferential statistics can assist in evaluating this situation. The classic tool for teaching probability involves placing different colored balls into a container and then randomly removing them. The objective is to determine the probability that a ball of a certain color will be removed. If everything occurred “perfectly”, the removed balls would always match the starting percentage of each color in the container. Of course, this cannot happen because (i) only one whole ball (not a fraction of a ball) is removed at a time, and (ii) random chance can cause one color to be removed more frequently than another. With small sample sizes, the outcome of a single draw dramatically affects the percentage of each color that has been pulled.
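
A short simulation can make this concrete. The sketch below uses an urn matching the layoff example (50 balls of each color, 12 drawn); the counts and random seed are illustrative assumptions:

```python
import random

random.seed(42)  # fixed seed so repeated runs match

# Urn matching the layoff example: 50 "men" balls and 50 "women" balls.
# Draw 12 without replacement many times, and count how often the draw
# splits perfectly evenly (6 of each) -- the "perfect" outcome described above.
urn = ["M"] * 50 + ["W"] * 50
trials = 10_000
even_splits = sum(
    1 for _ in range(trials) if random.sample(urn, 12).count("M") == 6
)

print(f"perfectly even draws: {even_splits / trials:.1%}")
```

Roughly three out of four random draws are uneven, even though the urn itself is perfectly balanced.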

One could calculate the probability of every possible combination of balls being pulled. This is practical only for small sample sizes, since the number of possibilities grows rapidly as the number of observations increases. For this reason, this method is usually limited from a practical perspective to a two-by-two table with small numbers in each quadrant. Such a test is called the Fisher exact test, named after the statistician R.A. Fisher. The Fisher exact test calculates the probability of observing a particular table result or a group of related results.
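
A minimal sketch of the Fisher exact calculation, applied to the layoff table above. The `fisher_exact_two_sided` helper is written from scratch for illustration; library routines such as `scipy.stats.fisher_exact` perform the same computation:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability is no greater than the observed table's.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def p_table(x):  # P(first cell == x) given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)

# Layoff example: 5 of 50 men and 7 of 50 women were laid off.
p = fisher_exact_two_sided(5, 45, 7, 43)
print(f"p-value = {p:.3f}")
```

The resulting p-value is far above .05, so the layoff pattern is easily consistent with chance.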

Other than their simplistic 80% test, the EEOC’s and DOL’s guidelines do not specify which alternative statistical tests should be used. However, the following general guidance is provided:

“The degree of relationship between selection procedure scores and criterion measures should be examined and computed, using professionally acceptable statistical procedures. Generally, a selection procedure is considered related to the criterion, for the purposes of these guidelines, when the relationship between performance on the procedure and performance on the criterion measure is statistically significant at the 0.05 level of significance, which means that it is sufficiently high as to have a probability of no more than one (1) in twenty (20) to have occurred by chance.”

The 95% confidence level described above (which corresponds to a .05 level of significance) is widely accepted in a broad range of other circumstances. However, there is nothing magical about a 95% confidence level. Other confidence levels can appropriately be used based on the applicable legal standards and legal counsel’s judgment of what those standards mean.

The Chi-Square Test

With larger numbers, an approximation of the “brute force” calculation performed in the Fisher exact test can be performed more easily. This approximation is called a chi-square test. Unfortunately, the chi-square test may yield unreliable results with tables that (i) contain fewer than 50 total observations or (ii) have an expected count below approximately 5 in any cell. In such circumstances, the Fisher exact test usually remains feasible and reliable.

The chi-square test evaluates the null hypothesis that the two variables (e.g., gender and employment status in the above example) are not related. To test this hypothesis, the “actual” values (the values that were observed) are compared to the “statistically expected” values (values calculated from the row and column totals, assuming the hypothesis is true). Based on the difference between the actual and statistically expected values, the chi-square test either rejects or fails to reject the hypothesis at a chosen confidence level.

When a chi-square test is applied to the layoff data above, the actual values are NOT significantly different from the statistically expected values at the 95% confidence level. This means that (i) one can NOT successfully assert with 95% confidence that discrimination has occurred, and (ii) there is more than a 5 percent chance that the layoff pattern occurred randomly. Interestingly, this conclusion remains the same even if the confidence level is reduced to 80%. This is the opposite of the conclusion that the EEOC’s and DOL’s 80% test provided using the same data.
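
The comparison of actual to statistically expected values can be sketched as follows for the layoff example. This is a from-scratch Pearson chi-square without the Yates continuity correction; the helper name is illustrative:

```python
from math import erf, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom,
    no Yates continuity correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 df, P(chi2 > s) equals the two-sided normal tail at sqrt(s).
    p_value = 2 * (1 - 0.5 * (1 + erf(sqrt(stat) / sqrt(2))))
    return stat, p_value

# Layoff example: the statistically expected values are 6 layoffs per gender.
stat, p = chi_square_2x2(5, 45, 7, 43)
print(f"chi-square = {stat:.3f}, p-value = {p:.3f}")
```

Because the p-value is well above .05, the hypothesis of independence is not rejected, matching the conclusion described above.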

A screenshot of a chi-square calculator with the input and test results follows:

[Screenshot: Online Chi Square Calculator]

Different results between the 80% rule and the chi-square test are quite common when sample sizes are small. However, differences between the two tests’ conclusions regularly occur even when sample sizes are large. Here are some general observations about the discrepancies:

  1. If there are few observations, the chi-square test lacks statistical power, and will indicate no significant differences in selection ratios. In this situation, the 80% rule might show there is an adverse impact when the chi-square test indicates there is no adverse impact.
  2. With larger sample sizes, the chi-square test is the more reliable test. If there are many observations, the chi-square test will indicate statistically-significant differences in selection ratios when the 80% rule shows there is no adverse impact. When the sample size is large, the chi-square test will identify even small differences in selection ratios as having adverse impact. In contrast, the 80% test is an unsophisticated rule of thumb that accepts a large absolute difference so long as the overall ratio remains above 80%.

When these two tests provide different conclusions, a statistics expert can provide explanations and perspective. When the conclusions contradict each other, the expert should include a sensitivity analysis as part of the presentation.

Importantly, the chi-square test is NOT limited to two possible outcomes; it can be applied to multiple outcomes within each of the two categories. A more detailed data input avoids inappropriate aggregation of dissimilar data. For example, assume a company with two business lines employs predominantly different types of employees (e.g., different genders) in each line. If only one of the business lines faces a RIF, the laid-off employees will predominantly be of one type, even though this has nothing to do with discrimination. With a simplistic chi-square data input that aggregates both business lines, the test would show discrimination; separating the two business lines provides the opposite result.
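
The aggregation problem can be sketched with hypothetical numbers (all counts below are invented; within each business line, men and women are laid off at identical rates):

```python
from math import erf, sqrt

def chi2_p_2x2(a, b, c, d):
    """P-value of the Pearson chi-square test (1 df) on [[a, b], [c, d]]."""
    n = a + b + c + d
    exp_counts = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                  (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    stat = sum((o - e) ** 2 / e for o, e in zip([a, b, c, d], exp_counts))
    return 2 * (1 - 0.5 * (1 + erf(sqrt(stat) / sqrt(2))))

# Hypothetical two-line company (laid off / kept):
#   Line A: men 45/45, women 5/5   -> 50% layoff rate for BOTH genders
#   Line B: men  1/9,  women 9/81  -> 10% layoff rate for BOTH genders
p_line_a = chi2_p_2x2(45, 45, 5, 5)
p_line_b = chi2_p_2x2(1, 9, 9, 81)
p_aggregate = chi2_p_2x2(46, 54, 14, 86)    # both lines pooled together

print(f"Line A p = {p_line_a:.2f}")         # no evidence of discrimination
print(f"Line B p = {p_line_b:.2f}")         # no evidence of discrimination
print(f"Aggregated p = {p_aggregate:.6f}")  # spuriously "significant"
```

Each business line, tested separately, shows no gender difference at all; only the inappropriate pooling of the two lines produces an apparently significant result.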

Other Examples of Chi-Square Applications

As noted at the beginning of this article, chi-square tests can be used for numerous other applications. A calculator can accept a data matrix larger than the two-by-two example described above, and so can address more complex issues. Here are just a few examples of practical questions that can be answered using a chi-square test:

  1. Is there statistically-valid evidence that hiring practices are dependent on firm size?
  2. Is there statistically-valid evidence that profits in different industries are different?
  3. Is there statistically-valid evidence that lodging vacancy rates have been hurt at location(s) along the Gulf Coast hardest hit by the Deepwater Horizon oil spill?
  4. Is there statistically-valid evidence that investors’ income level and portfolio risk are related?
  5. Is there statistically-valid evidence of a change in the market shares obtained by certain competitors?
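
As a sketch of the first question above, a chi-square test on a hypothetical three-by-two matrix of hiring outcomes by firm size (all counts are invented; with exactly 2 degrees of freedom, the chi-square survival function reduces to exp(-x/2)):

```python
from math import exp

# Hypothetical hiring outcomes by firm size: rows are firm-size groups,
# columns are [hired, not hired]. All counts are invented for illustration.
table = [[30, 70],   # small firms
         [45, 55],   # medium firms
         [60, 40]]   # large firms

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Pearson chi-square computed over the full 3x2 matrix.
stat = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / grand_total) ** 2
    / (row_totals[i] * col_totals[j] / grand_total)
    for i in range(3) for j in range(2)
)
dof = (3 - 1) * (2 - 1)   # 2 degrees of freedom
p_value = exp(-stat / 2)  # chi-square survival function for exactly 2 df

print(f"chi-square = {stat:.2f}, dof = {dof}, p-value = {p_value:.6f}")
```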

Fulcrum Inquiry performs statistical analyses and related expert testimony in employment and other disputes.
