By: Rick Dobson | Last Updated: 2018-11-23
Problem

Please extend earlier tips on how to compute and interpret a one-way ANOVA or a two-way ANOVA with SQL. In particular, I seek examples of how to implement with SQL a statistical test that tells which pairs of level means within an ANOVA are significantly different from one another.
Solution

Analysis of Variance (ANOVA) is a rich topic for assessing the statistical significance of differences between category or level means. Two prior tips illustrate how to use SQL to implement a one-way ANOVA or a two-way ANOVA (one-way here and two-way here). ANOVA tests are good for determining whether the group means in a factor are all the same. However, ANOVA tests do not generally indicate which particular group means are different from one another. A wide variety of ANOVA post hoc tests exist for assessing which category or level means differ from one another when a prior ANOVA test confirms that a set of group means is not all the same. A brief survey of these post hoc tests for a one-way ANOVA appears here.
The Tukey Honestly Significant Difference (HSD) test is one of the most widely used post hoc tests for assessing whether differences between pairs of factor level means are statistically significant after an ANOVA confirms the statistical significance of a factor or, in a two-way ANOVA, of an interaction between factors. This tip presents multiple examples of how to compute the Tukey HSD test. The specific steps for implementing a Tukey HSD test can vary depending on whether it is a post hoc test for a one-way ANOVA or a two-way ANOVA. When performing the Tukey HSD for a two-way ANOVA, the computational steps can also change depending on whether the interaction between factors is statistically significant.
A prior tip presents a survey of selected inferential and predictive analytical techniques for data science projects. Another prior tip introduces the SQL Statistical Package, which is a computing framework for implementing statistics with SQL. This tip is part of a continuing series on how to expand the scope of the SQL Statistics Package and related SQL scripts. You will learn how to coordinate Tukey HSD tests with one-way and two-way ANOVA tests that can be invoked from stored procedures within the SQL Statistical Package or from stand-alone SQL scripts.
What is a Tukey HSD test?

The Tukey HSD test is referred to by a variety of names. Wikipedia offers this selection of names for the Tukey HSD test: Tukey range test, Tukey test, Tukey method, Tukey's honestly significant difference test, or Tukey Kramer method. Because the Tukey HSD test aims to identify the statistical significance of pairs of level means after an ANOVA test, it is a post hoc test. That is, the Tukey HSD test is only appropriate to compute when a one-way ANOVA or a two-way ANOVA discovers statistically significant outcomes, such as for a main factor effect or when a factor in a two-way ANOVA interacts significantly with the other factor.
The Tukey HSD test performs pair-wise comparisons between group means to assess statistical significance. For example, if there are three levels with names of level 1, level 2, and level 3 for a one-way ANOVA, then there are three distinct pair-wise comparisons. These comparisons can be denoted by
- Level 1 versus level 2
- Level 1 versus level 3
- Level 2 versus level 3

The number of possible pair-wise comparisons varies based on the number of levels within a factor. The test is indifferent to the order of levels within a comparison; it assumes that level 1 versus level 2 is the same as level 2 versus level 1. Also, it is not legitimate to compare a level with itself; the test assumes that two levels with the same name are, by definition, the same. As a result of these constraints, the number of comparisons for n levels within a one-way ANOVA is derived from the number of permutations of n items taken two at a time (nP2). Because each unordered pair appears twice among those permutations, the permutation count is halved. The equation for the total number of HSD pair-wise comparisons is
(nP2)/2 = (n!/(n-2)!)/2 = n(n-1)/2

The number of legitimate paired comparisons grows quickly as n increases. The following table illustrates the growth in legitimate paired comparisons based on various n values. While the number of comparisons starts at 3 for an n value of 3, it grows to 66 comparisons for an n value of 12. In practice, it is rare to see a one-way ANOVA with values of n that exceed 5 or 6.
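The pair counts tabulated in this section can also be generated with a short query. The following is a minimal T-SQL sketch, not part of the SQL Statistics Package; the derived table name `levels` and the column aliases are hypothetical names chosen for illustration. It relies on the identity nP2 = n(n-1), which avoids computing large factorials.

```sql
-- Illustrative sketch: count the Tukey HSD pair-wise comparisons for several n values.
-- The derived table "levels" and its column "n" are hypothetical names.
SELECT
    n,
    n * (n - 1)     AS nP2,         -- permutations of n levels taken 2 at a time
    n * (n - 1) / 2 AS pair_count   -- unordered pairs: (nP2)/2
FROM (VALUES (3), (4), (5), (6), (12)) AS levels(n)
ORDER BY n;
```

For the n values 3, 4, 5, 6, and 12, the pair_count column returns 3, 6, 10, 15, and 66.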
| n | r | n! | (n - r)! | nPr | (nPr)/2 |
|---|---|----|----------|-----|---------|
| 3 | 2 | 6 | 1 | 6 | 3 |
| 4 | 2 | 24 | 2 | 12 | 6 |
| 5 | 2 | 120 | 6 | 20 | 10 |
| 6 | 2 | 720 | 24 | 30 | 15 |
| 12 | 2 | 479001600 | 3628800 | 132 | 66 |

When implementing Tukey HSD for levels within a two-way ANOVA, determining how to compare group means is not as straightforward as with a one-way ANOVA. You may find the following points of value when considering how to implement Tukey HSD tests with SQL.
- Again, if neither the main effects nor the interaction effect for two factors is statistically significant, then there is no need to compute a Tukey HSD value. There are no statistically significant pair-wise comparisons because there are no significant main or interaction effects.
- Even if there is a significant main effect (and no interaction effect), you do not need to compute a Tukey HSD value when the factor has just two levels. With just two levels, the ANOVA main effect significance level already confirms whether the two factor levels are statistically different from each other.
- However, if there is a significant main effect (and no interaction effect) and the factor has more than two levels, then you can compute Tukey HSD tests to assess the statistical significance of differences between levels within the factor. With as few as three levels there are three possible comparisons for level means, and the ANOVA main effect F value does not by itself distinguish which pairs are different from one another. The Tukey HSD value is distinct for each distinct pair of level means, and the HSD value for each pair can be compared to a critical HSD value to verify whether the comparison is statistically significant. You can compute the HSD values in this scenario just like for the one-way ANOVA, but you must repeat the comparisons for the levels of each factor separately.
- When there is a significant interaction effect, some argue that it is not appropriate to compare level means within a factor without considering the other factor. In this case, you can compare the levels of one factor to each other while holding the level of the second factor constant. This approach is demonstrated via a SQL script within this tip.
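To make the last point concrete, the following is a minimal T-SQL sketch of how the pairs might be enumerated when the second factor is held constant. It is not the script from this tip: the table variable `@scores`, its columns, and the sample values are all made up for illustration. The sketch only lists each pair of factor_a levels within each factor_b level along with the absolute difference between the cell means; turning those differences into HSD values and comparing them to a critical value is not shown.

```sql
-- Illustrative sketch with made-up data: factor_a has three levels, factor_b has two.
-- @scores and its column names are hypothetical.
DECLARE @scores TABLE (factor_a varchar(10), factor_b varchar(10), score float);

INSERT INTO @scores (factor_a, factor_b, score) VALUES
    ('a1', 'b1', 10), ('a1', 'b1', 12),
    ('a2', 'b1', 15), ('a2', 'b1', 17),
    ('a3', 'b1', 20), ('a3', 'b1', 22),
    ('a1', 'b2', 11), ('a1', 'b2', 13),
    ('a2', 'b2', 12), ('a2', 'b2', 14),
    ('a3', 'b2', 13), ('a3', 'b2', 15);

-- Mean score for every combination of factor_a and factor_b levels
WITH cell_means AS (
    SELECT factor_a, factor_b, AVG(score) AS mean_score
    FROM @scores
    GROUP BY factor_a, factor_b
)
SELECT
    x.factor_b,
    x.factor_a AS level_1,
    y.factor_a AS level_2,
    ABS(x.mean_score - y.mean_score) AS abs_mean_difference
FROM cell_means AS x
JOIN cell_means AS y
    ON  x.factor_b = y.factor_b     -- hold the second factor constant
    AND x.factor_a < y.factor_a     -- each unordered pair once, no self-comparisons
ORDER BY x.factor_b, level_1, level_2;
```

The `x.factor_a < y.factor_a` join condition enforces both constraints described earlier: a level is never compared with itself, and level 1 versus level 2 is not repeated as level 2 versus level 1. With three levels of factor_a and two levels of factor_b, the query returns six rows. Dropping the `factor_b` column and the equality condition gives the corresponding enumeration for a single factor.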
Because you are performing pair-wise comparisons with the Tukey HSD test, you may wonder why you cannot just perform a separate t test between each pair of level means that you need to compare (this prior tip explains in detail how to perform several different types of t tests with the SQL Statistics Package). The reason separate t tests are not ap