# Updating to R 2.15, warnings in R and an updated function list for Serious Stats

Whilst writing the book the latest version of R changed several times. Although I started on an earlier version, the bulk of the book was written with 2.11 and it was finished under R 2.12. The final versions of the R scripts were therefore run and checked using R 2.12 and, in the main, the most recent package versions for R 2.12.

By the time it came to proofreading, R 2.13 was already out and therefore most of the examples were also checked with that version, but I stuck with R 2.12 on my home and work machines until last week.

In general I don’t see the point of updating to a new version number if everything is working fine. One advantage of this approach is that the version I install will usually have bugs from the initial release already ironed out. That said, new versions of R have (in my experience) been very stable.

I tend to download the latest version only when I fall several versions behind or if it is a requirement for a new package or package version. On this occasion it turned out that the latest version of the ordinal package (for fitting ordered logistic regression and multilevel ordered logistic regression models) required a newer version of R. There are two main drawbacks with updating. The first is reinstalling all your favourite package libraries (and generally getting it set up how you like it). The second is dealing with changes in the way R behaves.

For re-installing all my packages I use a very crude system. For any given platform (Mac OS, Windows or Linux) there are cleverer solutions (that you can find via Google). My solution works cross-platform and is fairly robust, if inelegant. I simply keep an R script with a number of install.packages() commands such as:

install.packages(c('lme4', 'exactci', 'pwr', 'arm'))

I run these in batches after installing the new R version. I find this useful because I’m forever installing R on different machines (so far Mac OS or Windows) at work (e.g., for teaching or if working away from the office or on a borrowed machine). I can also comment the file (e.g., to note if there are issues with any of the packages under a particular version of R). This usually suffices for me as I usually run a ‘vanilla’ set-up without customization. It would be more efficient for me to customize my set-up, but for teaching purposes I find it helps not to do that. Likewise, I tend to work with a clean workspace (and use a script file to save R code that creates my workspaces). I should stress that this isn’t advice – and I would work differently myself if I didn’t use R so much for teaching.
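A slightly less crude variant of the same idea, sketched here on the assumption that the package list is kept in a character vector, is to install only the packages that are not already present. The helper names missing_packages() and install_missing() are hypothetical (they are not part of the Serious stats functions); installed.packages() and install.packages() are standard R.

```r
# Package list kept at the top of the script (edit to taste)
pkgs <- c("lme4", "exactci", "pwr", "arm")

# Return the packages in 'pkgs' that are not yet installed in this R library
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Install whatever is missing; returns (invisibly) the names it tried to install
install_missing <- function(pkgs) {
  todo <- missing_packages(pkgs)
  if (length(todo) > 0) install.packages(todo)
  invisible(todo)
}

# After installing a new version of R (needs an internet connection):
# install_missing(pkgs)
```

The install call is left commented out so that sourcing the script does not trigger downloads by accident.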

One of the first things that happened after installing R 2.15 was that some of my own functions started producing warnings. R warnings can be pretty scary for new users but are generally benign. Some of them are there to detect behaviour associated with common R errors or common statistical errors (and thus give you a chance to check your work). Others alert you to non-standard behaviour from a function in R (e.g., changing the procedure it uses when sample sizes are small). Yet others offer tips on writing better R code. Only very rarely are they an indication that something has gone badly wrong.

Thus most R warnings are slightly annoying but potentially useful. In my case R 2.15 disliked a number of my functions of the form:

mean(data.frame)

The precise warning was:

Warning message:
mean(<data.frame>) is deprecated.
Use colMeans() or sapply(*, mean) instead.

All the functions worked just fine, but (after my initial irritation had receded) I realized that colMeans() is a much better function. It is more efficient but, even better, it is obvious that it calculates the means of the columns of a data frame or matrix. With the more general mean() function it is not immediately obvious what will happen when it is called with a data frame as an argument. It is also trivial to infer that rowMeans() calculates the row means.
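As a quick illustration (using a made-up data frame rather than anything from the book), the two replacements suggested by the warning give the same column means, and rowMeans() does the corresponding job for rows:

```r
# Small illustrative data frame (made-up numbers)
scores <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

col.means <- colMeans(scores)          # one mean per column
sapply.means <- sapply(scores, mean)   # same result, computed column by column
row.means <- rowMeans(scores)          # one mean per row

col.means
row.means
```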

I have now re-written  a number of functions to deal with this problem and to make a few other minor changes. The latest version of my functions can be loaded with the call:

source('http://www2.ntupsychology.net/seriousstats/SeriousStatsAllfunctions.txt')

I will try and keep this file up-to-date with recent versions of R and correct any bugs as they are detected.

The functions can be downloaded as a text file from:

# R functions for serious stats

UPDATE: Some problems arose with my previous host so I have now updated the links here and elsewhere on the blog.

The companion web site for Serious Stats has a zip file with R scripts for each chapter. This contains examples of R code and all my functions from the book (and a few extras). This is a convenient form for working through the examples. However, if you just want to access the functions it is more convenient to load them all in at once.

The functions can be downloaded as a text file from:

http://www2.ntupsychology.net/seriousstats/SeriousStatsAllfunctions.txt

More conveniently, you can load them directly into R with the following call:

source('http://www2.ntupsychology.net/seriousstats/SeriousStatsAllfunctions.txt')

In addition to the Serious Stats functions, a number of other functions are contained in the text file. These include functions published on this blog for comparing correlations or confidence intervals for independent measures ANOVA and functions from my paper on confidence intervals for repeated measures ANOVA.

N.B. R code formatted via Pretty R at inside-R.org

# Serious stats companion web site now live: sample chapter, data and R scripts

The companion web site for Serious stats is now live:

http://www.palgrave.com/psychology/Baguley/

It includes a sample chapter (Chapter 15: Contrasts), data sets, R scripts for all the examples and supplementary material.

# Independent measures (between-subjects) ANOVA and displaying confidence intervals for differences in means

In Chapter 2 (Confidence Intervals) of Serious stats I consider the problem of displaying confidence intervals (CIs) of a set of means (which I illustrate with the simple case of two independent means). Later, in Chapter 16 (Repeated Measures ANOVA), I consider the trickier problem of displaying CIs for two or more means from paired or repeated measures. The example in Chapter 16 uses R functions from my recent paper reviewing different methods for displaying means for repeated measures (within-subjects) ANOVA designs (Baguley, 2012b). For further details and links see a brief summary on my psychological statistics blog. The R functions included a version for independent measures (between-subjects) designs, but this was a rather limited version designed for comparison purposes (and not for actual use).

The independent measures case is relatively straight-forward to implement and I hadn’t originally planned to write functions for it. Since then, however, I have decided that it is worth doing. Setting up the plots can be quite fiddly and it may be useful to go over the key points for the independent case before you move on to the repeated measures case. This post therefore adapts my code for independent measures (between-subjects) designs.

The approach I propose is inspired by Goldstein and Healy (1995) – though other authors have made similar suggestions over the years (see Baguley, 2012b). Their aim was to provide a simple method for displaying a large collection of independent means (or other independent statistics). At its simplest the method reduces to plotting each statistic with error bars equal to ±1.39 standard errors of the mean. This result is a normal approximation that can be refined in various ways (e.g., by using the t distribution or by extending it to take account of correlations between conditions). Using a Goldstein-Healy plot two means are considered different with 95% confidence if their two intervals do not overlap. In other words non-overlapping CIs are (in this form of plot) approximately equivalent to a statistically significant difference between the two means with α = .05. For convenience I will refer to CIs that have this property as difference-adjusted CIs (to distinguish them from conventional CIs).
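The ±1.39 multiplier can be recovered with one line of arithmetic: it is the usual normal quantile for a 95% CI scaled by $\sqrt 2 /2$ (the adjustment explained later in this post). A minimal check in R:

```r
z <- qnorm(0.975)               # about 1.96 for a conventional 95% CI
multiplier <- z * sqrt(2)/2     # scale by sqrt(2)/2 for a difference-adjusted CI
round(multiplier, 2)            # 1.39
```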

It is important to realize that conventional 95% CIs constructed around each mean won’t have this property. For independent means they are usually around 40% too wide and thus will often overlap even if the usual t test of their difference is statistically significant at p < .05. This happens because the variance of a difference is (in independent samples) equal to the sum of the variances of the individual samples. Thus the standard error of the difference is around $\sqrt 2$ times larger than the standard error of either mean (assuming equal variances), whereas two conventional margins of error placed end to end are twice as large (a ratio of $2/\sqrt 2 \approx 1.41$). For a more comprehensive explanation see Chapter 3 of Serious stats or Baguley (2012b).
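A small numerical sketch may help. The means and standard errors below are hypothetical, and the normal approximation is used for simplicity (rather than the t distribution):

```r
m1 <- 0; m2 <- 3    # two hypothetical sample means
se <- 1             # standard error of each mean (assumed equal)
z <- qnorm(0.975)

se.diff <- sqrt(se^2 + se^2)          # variance sum law: sqrt(2) times se
z.stat <- (m2 - m1)/se.diff           # about 2.12
p.value <- 2 * pnorm(-abs(z.stat))    # below .05

# Each row holds the lower and upper limit for one mean
conventional <- rbind(m1 + c(-1, 1) * z * se,
                      m2 + c(-1, 1) * z * se)
adjusted <- rbind(m1 + c(-1, 1) * z * sqrt(2)/2 * se,
                  m2 + c(-1, 1) * z * sqrt(2)/2 * se)

conventional[1, 2] > conventional[2, 1]   # TRUE: the conventional CIs overlap
adjusted[1, 2] > adjusted[2, 1]           # FALSE: the difference-adjusted CIs do not
```

So the difference is statistically significant, yet the conventional intervals overlap. The difference-adjusted intervals do not.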

#### What to plot

If you have only two means there are at least three basic options:

1) plot the individual means with conventional 95% CIs around each mean

2) plot the difference between means and a 95% CI for the difference

3) plot some form of difference-adjusted CI

Which option is best? It depends on what you are trying to do. A good place to start is with your reasons for constructing a graphical display in the first place. Graphs are not particularly good for formal inference and other options (e.g., significance tests, reporting point estimates and CIs in text, likelihood ratios, Bayes factors and so forth) exist for reporting the outcome of formal hypothesis tests. Graphs are appropriate for informal inference. This includes exploratory data analysis, aiding the interpretation of complex patterns or summarizing a number of simple patterns in a single display. If the patterns are very clear, informal inference might be sufficient. In other cases it can be supplemented with formal inference.

What patterns do the three basic options above reveal? Option 1) shows the precision around individual means. This readily supports inference about the individual means (but not their difference). For example, a true population mean outside the 95% CI is considered implausible (and the observed mean would be different from that hypothesized value with p < .05 using a one sample t test).

Option 2) makes for a rather dull plot because it just involves a single point estimate for the difference in means and the 95% CI for the difference. If this is the only quantity of interest you’d be better off just reporting the mean and 95% CI in the text. This has the advantage of being more compact and more accurate than trying to read the numbers off a graph. [This is one reason that graphs aren’t optimal for formal inference; it can be hard, for instance, to tell whether a line includes zero or excludes zero when the difference is just statistically significant or just statistically non-significant. With informal inference you shouldn’t care whether p = .049 or p = .051, but whether there are any clear patterns in the data.]

Option 3) shows you the individual means but calibrates the CIs so that you can tell if it is plausible that the sample means differ (using 95% confidence in the difference as a standard). Thus it seems like a good choice for graphical display if you are primarily interested in the differences between means. For formal inference it can be supplemented by reporting a hypothesis test in the text (or possibly a Figure caption).

It is worth noting that option 3) becomes even more attractive if you have more than two means to plot. It allows you to see patterns that emerge over the set of means (e.g., linear or non-linear trends or – if n per sample is similar – changes in variances) and to compare pairs of means to see whether it is plausible that they are different.

In contrast, option 2) is rather unattractive with more than two means. First, with J means there are J(J-1)/2 differences and thus an unnecessarily cluttered graphical display (e.g., with J = 5 means there are 10 CIs to plot). Second, plotting only the differences can obscure important patterns in the data (e.g., an increasing or decreasing trend in the means or variances would be difficult to identify).

#### Difference-adjusted CIs using the t distribution

Where only a few means are to be plotted (as is common in ANOVA) it makes sense to take a slightly more accurate approach than the approximation originally proposed by Goldstein and Healy for large collections of means. This approach uses the t distribution. A similar approach is advocated by Afshartous and Preston (2010) who also provide R code for calculating multipliers for the standard errors using the t distribution (and an extension for repeated measures). My approach is similar, but involves calculating the margin of error (half width of the error bars) directly rather than computing a multiplier to apply to the standard error.

A difference-adjusted CI for the mean of each sample from an independent measures (between-subjects) ANOVA design is given by Equation 3.31 of Serious stats:

$\hat \mu _j \pm t_{n_j - 1,1 - \alpha /2} \frac{\sqrt 2 }{2} \times \hat \sigma _{\hat \mu _j }$

The $\hat \mu _j$ term is the mean of the jth sample (where samples are labeled j = 1 to J) and $\hat \sigma _{\hat \mu _j }$ is the standard error of that sample. The $t_{n_j - 1,1 - \alpha /2}$ term is the quantile of the t distribution with $n_j - 1$ degrees of freedom (where $n_j$ is the size of the jth sample) that cuts off the central 100(1 – α)% of the distribution.

Thus, apart from the $\sqrt 2 /2$ term, this equation is identical to that for a 95% CI around the individual means, with the proviso that the standard error here is computed separately for each sample. This differs from the usual approach to plotting CIs for independent measures ANOVA designs – where it is common to use a pooled standard error computed from a pooled standard deviation (the root mean square error of the ANOVA). While a pooled error term is sometimes appropriate, it is generally a bad idea for graphical display of the CIs because it will obscure any patterns in the variability of the samples. [Nevertheless, where $n_j$ is very small it may make sense to use a pooled error term on the grounds that each sample provides an exceptionally poor estimate of its population standard deviation.]
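To see what pooling hides, here is a minimal sketch using simulated data (the group sizes and standard deviations are arbitrary assumptions chosen only for illustration). It compares standard errors computed separately for each group with those based on the pooled standard deviation:

```r
set.seed(1)
group <- factor(rep(c("a", "b", "c"), times = c(8, 10, 12)))
y <- rnorm(30, mean = 10, sd = rep(c(1, 2, 4), times = c(8, 10, 12)))

n <- tapply(y, group, length)
s <- tapply(y, group, sd)

se.separate <- s/sqrt(n)    # uses each group's own SD

# Pooled SD: the root mean square error of the one-way ANOVA
s.pooled <- sqrt(sum((n - 1) * s^2)/(sum(n) - length(n)))
se.pooled <- s.pooled/sqrt(n)

# The same pooled SD obtained from the fitted model
rmse <- summary(lm(y ~ 0 + group))$sigma

round(cbind(n, sd = s, se.separate, se.pooled), 3)
```

With a pooled error term the standard errors differ between groups only through sample size, so any pattern in the group SDs disappears from the plot.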

However, the most important change is the $\sqrt 2 /2$ term. It creates a difference-adjusted CI by ensuring that the joint width of the margin of error around any two means is $\sqrt 2$ times larger than for a single mean. The division by 2 arises merely as a consequence of dealing jointly with two error bars. Their total has to be $\sqrt 2$ times larger and therefore each one needs only to be $\sqrt 2 /2$ times its conventional value (for an unadjusted CI). This is discussed in more detail by Baguley (2012a; 2012b).
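For a single sample the calculation can be sketched in a few lines. The function name diff.adj.ci() and the data are hypothetical and used only for illustration; the margin of error follows the equation above.

```r
# Difference-adjusted (or conventional) CI for the mean of one sample
diff.adj.ci <- function(x, conf.level = 0.95, difference = TRUE) {
  n <- length(x)
  se <- sd(x)/sqrt(n)                        # standard error of the mean
  adj <- if (difference) sqrt(2)/2 else 1    # difference adjustment
  moe <- qt(1 - (1 - conf.level)/2, n - 1) * adj * se
  c(lower = mean(x) - moe, mean = mean(x), upper = mean(x) + moe)
}

x <- c(12, 15, 9, 14, 11, 13, 16, 10)        # made-up scores
adjusted <- diff.adj.ci(x)
conventional <- diff.adj.ci(x, difference = FALSE)

# Ratio of the interval widths: sqrt(2)/2, about 0.71
width.ratio <- unname((adjusted["upper"] - adjusted["lower"]) /
                      (conventional["upper"] - conventional["lower"]))
width.ratio
```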

This equation should perform well (e.g., providing fairly accurate coverage) as long as variances are not very unequal and the samples are approximately normal. Even when these conditions are not met, remember the aim is not to support formal inference. In addition, the approach is likely to be slightly more robust than ANOVA (at least to heterogeneity of variance and unequal sample sizes). So this method is likely to be a good choice whenever ANOVA is appropriate.

#### R functions for independent measures (between-subjects) ANOVA designs

Two R functions for difference-adjusted CIs in independent measures ANOVA designs are provided here.  The first function bsci() calculates conventional or difference-adjusted CIs for a one-way ANOVA design.

bsci <- function(data.frame, group.var=1, dv.var=2, difference=FALSE, pooled.error=FALSE, conf.level=0.95) {
  data <- subset(data.frame, select=c(group.var, dv.var))
  fact <- factor(data[[1]])
  dv <- data[[2]]
  J <- nlevels(fact)  # number of groups
  N <- length(dv)  # total sample size
  ci.mat <- matrix(,J,3, dimnames=list(levels(fact), c('lower', 'mean', 'upper')))
  ci.mat[,2] <- tapply(dv, fact, mean)
  n.per.group <- tapply(dv, fact, length)
  if(difference==TRUE) diff.factor <- 2^0.5/2 else diff.factor <- 1  # sqrt(2)/2 for difference-adjusted CIs
  if(pooled.error==TRUE) {
    for(i in 1:J) {
      moe <- summary(lm(dv ~ 0 + fact))$sigma/(n.per.group[[i]])^0.5 * qt(1-(1-conf.level)/2,N-J) * diff.factor  # pooled SD, N - J df
      ci.mat[i,1] <- ci.mat[i,2] - moe
      ci.mat[i,3] <- ci.mat[i,2] + moe
    }
  }
  if(pooled.error==FALSE) {
    for(i in 1:J) {
      group.dat <- subset(data, data[1]==levels(fact)[i])[[2]]
      moe <- sd(group.dat)/sqrt(n.per.group[[i]]) * qt(1-(1-conf.level)/2,n.per.group[[i]]-1) * diff.factor  # separate SD, n_j - 1 df
      ci.mat[i,1] <- ci.mat[i,2] - moe
      ci.mat[i,3] <- ci.mat[i,2] + moe
    }
  }
  ci.mat
}

plot.bsci <- function(data.frame, group.var=1, dv.var=2, difference=TRUE, pooled.error=FALSE, conf.level=0.95, xlab=NULL, ylab=NULL, level.labels=NULL, main=NULL, pch=21, ylim=c(min.y, max.y), line.width=c(1.5, 0), grid=TRUE) {
  data <- subset(data.frame, select=c(group.var, dv.var))
  data[[1]] <- factor(data[[1]])  # convert first so that levels() also works for character input
  if(missing(level.labels)) level.labels <- levels(data[[1]])
  dv <- data[[2]]
  J <- nlevels(data[[1]])
  ci.mat <- bsci(data.frame=data.frame, group.var=group.var, dv.var=dv.var, difference=difference, pooled.error=pooled.error, conf.level=conf.level)
  moe.y <- max(ci.mat) - min(ci.mat)
  min.y <- min(ci.mat) - moe.y/3  # used by the default ylim (evaluated lazily)
  max.y <- max(ci.mat) + moe.y/3
  if (missing(xlab)) xlab <- "Groups"
  if (missing(ylab)) ylab <- "Confidence interval for mean"
  plot(0, 0, ylim = ylim, xaxt = "n", xlim = c(0.7, J + 0.3), xlab = xlab, ylab = ylab, main = main)
  if (grid==TRUE) graphics::grid()  # grid lines are drawn only if requested
  points(ci.mat[,2], pch = pch, bg = "black")
  index <- 1:J
  segments(index, ci.mat[, 1], index, ci.mat[, 3], lwd = line.width[1])
  segments(index - 0.02, ci.mat[, 1], index + 0.02, ci.mat[, 1], lwd = line.width[2])
  segments(index - 0.02, ci.mat[, 3], index + 0.02, ci.mat[, 3], lwd = line.width[2])
  axis(1, index, labels=level.labels)
}

The default for bsci() is difference=FALSE (on the basis that these are the CIs most likely to be reported in text or tables).
The second function plot.bsci() uses the former function to plot the means and CIs; the default here is difference=TRUE (on the basis that the difference-adjusted CIs are likely to be more useful for graphical display). For both functions the default is a separate error term for each sample (pooled.error=FALSE) and a 95% confidence level (conf.level=0.95). Each function also takes input as a data frame and assumes that the grouping variable is the first column and the dependent variable the second column. If the appropriate variables are in different columns, the correct columns can be specified with the arguments group.var and dv.var. The plotting function also takes some standard graphical parameters (e.g., for labels and so forth).

The following examples use the diagram data set from Serious stats. The first line loads the data set (if you have a live internet connection). The second line generates the difference-adjusted CIs. The third line plots the difference-adjusted CIs. Note that the grouping variable (factor) is in the second column and the DV is in the fourth column.

diag.dat <- read.csv('http://www2.ntupsychology.net/seriousstats/diagram.csv')
bsci(diag.dat, group.var=2, dv.var=4, difference=TRUE)
plot.bsci(diag.dat, group.var=2, dv.var=4, ylab='Mean description quality', main = 'Difference-adjusted 95% CIs for the Diagram data')

In this case the graph looks like this:

[Figure: difference-adjusted 95% CIs for the Diagram data]

It should be immediately clear that the segmented diagram condition (S) tends to have higher scores than the text (T) or picture (P) conditions, while the full diagram (F) condition is somewhere in between. This matches the uncorrected pairwise comparisons where S > P = T, S = F, and F = P = T.

At some point I will also add a function to plot two-tiered error bars (combining options 1 and 3). For details of the extension to repeated measures designs see Baguley (2012b).

The code and data sets are available here.

#### References

Afshartous, D., & Preston, R. A. (2010).
Confidence intervals for dependent data: equating nonoverlap with statistical significance. Computational Statistics and Data Analysis, 54, 2296-2305.

Baguley, T. (2012a, in press). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.

Baguley, T. (2012b). Calculating and graphing within-subject confidence intervals for ANOVA. Behavior Research Methods, 44, 158-175.

Goldstein, H., & Healy, M. J. R. (1995). The graphical presentation of a collection of means. Journal of the Royal Statistical Society. Series A (Statistics in Society), 158, 175-177.

Schenker, N., & Gentleman, J. F. (2001). On judging the significance of differences by examining the overlap between confidence intervals. The American Statistician, 55, 182-186.

Update: A revised version of the function that allows you to flip the axes is available here.

# Beware the Friedman test!

In section 10.4.4 of Serious stats (Baguley, 2012) I discuss the rank transformation and suggest that it often makes sense to rank transform data prior to application of conventional ‘parametric’ least squares procedures such as t tests or one-way ANOVA. There are several advantages to this approach over the usual approach (which involves learning and applying a new test such as Mann-Whitney U, Wilcoxon T or Kruskal-Wallis for almost every situation). One is pedagogic. It is much easier to teach or learn the rank transformation approach (especially if you also cover other transformations in your course).

Another reason is that there are situations where widely used rank randomization tests perform very badly, yet the rank transformation approach does rather well. In contrast, Conover and Iman (1981) show that rank transformation versions of parametric tests mimic the properties of the best known rank randomization tests (e.g., Spearman’s rho, Mann-Whitney U or Wilcoxon T) rather closely with moderate to large sample sizes.
The better rank randomization tests tend to have the edge on rank transformation approaches only when sample sizes are small (and that advantage may not hold if there are many ties).

The potential pitfalls of rank randomization tests are nicely illustrated by the case of the Friedman test (and related tests such as Page’s L). I’ll try and explain the problem here.

#### Why the Friedman test is an impostor …

I’ve always thought there was something odd about the way the Friedman test worked. Like most psychology students I first learned the Wilcoxon signed ranks (T) test. This is a rank randomization analog of the paired t test. It involves computing the absolute difference between paired observations, ranking them and then adding the original sign back in. Imagine that the raw data consist of the following paired measurements (A and B) from four people (P1 to P4):

| | A | B |
|---|---|---|
| P1 | 13 | 4 |
| P2 | 6 | 9 |
| P3 | 11 | 9 |
| P4 | 12 | 6 |

This results in the following ranks being assigned:

| | A – B | Rank |
|---|---|---|
| P1 | +9 | +4 |
| P2 | -3 | -2 |
| P3 | +2 | +1 |
| P4 | +6 | +3 |

The signed ranks are then used as input to a randomization (i.e., permutation) test that, if there are no ties, gives the exact probability of the observed sum of the ranks (or a sum more extreme) being obtained if the paired observations had fallen into the categories A or B at random (in which case the expected sum is zero). The basic principle here is similar to the paired t test (which is a one sample t test on the raw differences).

The Friedman test is (incorrectly) generally considered to be a rank randomization equivalent of one-way repeated measures (within-subjects) ANOVA in the same way that the Wilcoxon test is a rank randomization equivalent of paired t. It isn’t. To see why, consider three repeated measures (A, B and C) for two participants.
Here are the raw scores:

| | A | B | C |
|---|---|---|---|
| P1 | 6 | 7 | 12 |
| P2 | 8 | 5 | 11 |

Here are the corresponding ranks:

| | A | B | C |
|---|---|---|---|
| P1 | 1 | 2 | 3 |
| P2 | 2 | 1 | 3 |

The ranks for the Friedman test depend only on the order of scores within each participant – they completely ignore the differences between participants. This differs dramatically from the Wilcoxon test where information about the relative size of differences between participants is preserved. Zimmerman and Zumbo (1993) discuss this difference in procedures and explain that the Friedman test (devised by the noted economist and champion of the ‘free market’ Milton Friedman) is not really a form of ANOVA but an extension of the sign test. It is an impostor.

This is bad news because the sign test tends to have low power relative to the paired t test or Wilcoxon signed ranks test. Indeed, the asymptotic relative efficiency of the Friedman test relative to ANOVA is .955 J/(J+1) where J is the number of repeated measures (see Zimmerman & Zumbo, 1993). Thus it is about .72 for J = 3 and .76 for J = 4, implying quite a big hit in power relative to ANOVA when the assumptions are met. This is a large sample limit, but small samples should also have considerably less power because the sign test and the Friedman test, in effect, throw information away.

The additional robustness of the sign test may sometimes justify its application (as it may outperform Wilcoxon for heavy-tailed distributions), but this does not appear to be the case for the Friedman test. Thus, where one-way repeated measures ANOVA is not appropriate, rank transformation followed by ANOVA will provide a more robust test with greater statistical power than the Friedman test.

#### Running one-way repeated measures ANOVA with a rank transformation in R

The rank transformation version of the ANOVA is relatively easy to set up.
The main obstacle is that the ranks need to be derived by treating all nJ scores as a single sample (where n is the number of observations in each of the J repeated measures conditions – usually the number of participants). If your software arranges repeated measures data in broad format (e.g., as in SPSS) this can involve some messing about cutting and pasting columns and then putting them back (for which I would use Excel). For this sort of analysis I would in any case prefer R – in which case the data would tend to be in a single column of a data frame or in a single vector anyway.

The following R code, using demo data from the excellent UCLA R resources, runs first a Friedman test, then a one-way repeated measures ANOVA and then the rank transformation version of the ANOVA. For these data pulse is the DV, time is the repeated measures factor and id is the subjects identifier.

demo3 <- read.csv("http://www.ats.ucla.edu/stat/data/demo3.csv")
friedman.test(pulse ~ time|id, demo3)
library(nlme)
lme.raw <- lme(fixed = pulse ~ time, random =~1|id, data=demo3)
anova(lme.raw)
rpulse <- rank(demo3$pulse)
lme.rank <- lme(fixed = rpulse ~ time, random =~1|id, data=demo3)
anova(lme.rank)

It may be helpful to point out a couple of features of the R code. The Friedman test is built into R and can take formula or matrix input. Here I used formula input and specified a data frame that contains the demo data. The vertical bar notation indicates that the time factor varies within participants. The repeated measures ANOVA can be run in many different ways (see Chapter 16 of Serious stats). Here I chose to run it as a multilevel model using the nlme package (which should still work even if the design is unbalanced). As you can see, the only difference between the code for the conventional ANOVA and the rank transformation version is that the DV is rank transformed prior to analysis.
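The difference between the three ranking schemes discussed in this post can be checked directly in base R. This sketch reuses the small illustrative data sets from the Wilcoxon and Friedman examples above; the object names are my own and purely illustrative.

```r
# Paired data from the Wilcoxon example (A and B for P1 to P4)
A <- c(13, 6, 11, 12)
B <- c(4, 9, 9, 6)
d <- A - B
signed.ranks <- sign(d) * rank(abs(d))   # rank |A - B|, then restore the sign
signed.ranks                             # 4 -2 1 3

# Three repeated measures (A, B and C) for two participants
scores <- rbind(P1 = c(A = 6, B = 7, C = 12),
                P2 = c(A = 8, B = 5, C = 11))

# Friedman-style ranks: ranked separately within each participant
friedman.ranks <- t(apply(scores, 1, rank))

# Rank transformation: all nJ scores ranked as a single sample
pooled.ranks <- matrix(rank(scores), nrow = nrow(scores),
                       dimnames = dimnames(scores))

friedman.ranks
pooled.ranks
```

The pooled ranks retain information about differences between participants (P2 has the smallest score overall, for instance), which the within-participant ranks discard.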

Although this example uses R, you could almost as easily use any other software for repeated measures ANOVA (though as noted it is simplest with software that takes data structured in long form – with the DV in a single column or vector).

#### Other advantages of the approach

The rank transformation is, as a rule, more versatile than using rank randomization tests. For instance, ANOVA software often has options for testing contrasts or correcting for multiple comparisons. Although designed for analyses of raw data some procedures are very general and can be straightforwardly applied to the rank transformation approach – notably powerful modified Bonferroni procedures such as the Hochberg or Westfall procedures. A linear contrast can also be used to run the equivalent of a rank randomization trend test such as the Jonckheere test (independent measures) or Page’s L (repeated measures). A rank transformation version of the Welch-Satterthwaite t test is also superior to the more commonly applied Mann-Whitney U test (being robust to heterogeneity of variance when sample sizes are unequal, which the Mann-Whitney U test is not).
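A rank transformation version of the Welch-Satterthwaite t test needs nothing beyond base R. The sketch below uses simulated data with unequal variances and unequal sample sizes (the specific values are arbitrary assumptions), with the Mann-Whitney U test run alongside for comparison:

```r
set.seed(42)
g <- factor(rep(c("small", "large"), times = c(10, 30)))
y <- c(rnorm(10, mean = 0, sd = 4),    # small, highly variable sample
       rnorm(30, mean = 1, sd = 1))    # large, less variable sample

r <- rank(y)                           # rank all scores as a single sample

welch.rank <- t.test(r ~ g)            # var.equal = FALSE by default (Welch-Satterthwaite)
mann.whitney <- wilcox.test(y ~ g)     # Mann-Whitney U (Wilcoxon rank sum) test

welch.rank$p.value
mann.whitney$p.value
```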

#### References

Baguley, T. (2012, in press). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.

Conover, W. J., & Iman, R. L. (1981). Rank transformations as a bridge between parametric and nonparametric statistics. American Statistician, 35, 124-129.

Zimmerman, D. W., & Zumbo, B. D. (1993). Relative power of the Wilcoxon test, the Friedman test, and repeated-measures ANOVA on ranks. Journal of Experimental Education, 62, 75-86.


# Comparing correlations: independent and dependent (overlapping or non-overlapping)

In Chapter 6 (correlation and covariance) I consider how to construct a confidence interval (CI) for the difference between two independent correlations.  The standard approach uses the Fisher z transformation to deal with boundary effects (the squashing of the distribution and increasing asymmetry as r approaches -1 or 1). As zr is approximately normally distributed (which r is decidedly not) you can create a standard error for the difference by summing the sampling variances according to the variance sum law (see chapter 3).

This works well for the CI around a single correlation (assuming the main assumptions – bivariate normality and homogeneity of variance – broadly hold) or for differences between means, but can perform badly when looking at the difference between two correlations. Zou (2007) proposed a modification to the standard approach that uses the upper and lower bounds of the CIs for individual correlations to calculate a CI for their difference. He considered three cases: independent correlations and two types of dependent correlations (overlapping and non-overlapping). He also considered differences in $R^2$ (not relevant here).
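For comparison, the standard Fisher z approach that Zou’s method modifies can be sketched as follows. The correlations and sample sizes are hypothetical, and the resulting interval is on the z scale (for the difference $z_{r_1} - z_{r_2}$), not on the scale of the difference in r:

```r
r1 <- 0.50; n1 <- 103    # hypothetical correlation and sample size, sample 1
r2 <- 0.20; n2 <- 103    # hypothetical correlation and sample size, sample 2

z1 <- atanh(r1)          # Fisher z transformation
z2 <- atanh(r2)

# Variance sum law: sampling variances 1/(n - 3) add for independent samples
se.diff <- sqrt(1/(n1 - 3) + 1/(n2 - 3))

z.stat <- (z1 - z2)/se.diff
p.value <- 2 * pnorm(-abs(z.stat))

# 95% CI for the difference on the z scale
ci.z <- (z1 - z2) + c(-1, 1) * qnorm(0.975) * se.diff
ci.z
```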

Independent correlations

In section 6.6.2 (p. 224) I illustrate Zou’s approach for independent correlations and provide R code in sections 6.7.5 and 6.7.6 to automate the calculations. Section 6.7.5 shows how to write a simple R function and illustrates it with a function to calculate a CI for Pearson’s r using the Fisher transformation. Whilst writing the book I encountered several functions that do exactly this. The cor.test() function in base R (the stats package) does this for raw data (along with computing the correlation and usual NHST). A number of functions compute it using the usual text book formula. My function relies on R primitive hyperbolic functions (as the Fisher z transformation is related to the geometry of hyperbolas), which may be useful if you need to use it intensively (e.g., for simulations):

rz.ci <- function(r, N, conf.level = 0.95) {
zr.se <- 1/(N - 3)^0.5
moe <- qnorm(1 - (1 - conf.level)/2) * zr.se
zu <- atanh(r) + moe
zl <- atanh(r) - moe
tanh(c(zl, zu))
}
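As a quick check (with simulated data, so the particular numbers are arbitrary), the interval from rz.ci() computed from r and N should agree with the one cor.test() reports for the raw data. The function is repeated here so that the example runs on its own:

```r
rz.ci <- function(r, N, conf.level = 0.95) {
  zr.se <- 1/(N - 3)^0.5
  moe <- qnorm(1 - (1 - conf.level)/2) * zr.se
  zu <- atanh(r) + moe
  zl <- atanh(r) - moe
  tanh(c(zl, zu))
}

set.seed(123)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

r <- cor(x, y)
from.summary <- rz.ci(r, N = 50)            # CI from r and N alone
from.raw <- cor.test(x, y)$conf.int         # CI computed from the raw data

from.summary
from.raw
```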

The function in section 6.7.6 uses the rz.ci() function to construct a CI for the difference between two independent correlations. See section 6.6.2 of Serious stats or Zou (2007) for further details and a worked example. My function from section 6.7.6 is reproduced here:

r.ind.ci <- function(r1, r2, n1, n2=n1, conf.level = 0.95) {
  # lower and upper limits of the CI for each correlation
  L1 <- rz.ci(r1, n1, conf.level = conf.level)[1]
  U1 <- rz.ci(r1, n1, conf.level = conf.level)[2]
  L2 <- rz.ci(r2, n2, conf.level = conf.level)[1]
  U2 <- rz.ci(r2, n2, conf.level = conf.level)[2]
  # combine the individual limits into limits for the difference r1 - r2
  lower <- r1 - r2 - ((r1 - L1)^2 + (U2 - r2)^2)^0.5
  upper <- r1 - r2 + ((U1 - r1)^2 + (r2 - L2)^2)^0.5
  c(lower, upper)
}

To call the function, use the two correlation coefficients and the sample sizes as input (the default is to assume equal n and a 95% CI).
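
A minimal usage sketch follows, with made-up inputs (r1 = .50 with n1 = 100, r2 = .20 with n2 = 80); these numbers are assumptions chosen only to show the calling pattern, not an example from the book or from Zou (2007).

```r
# functions from the post, repeated so the snippet runs on its own
rz.ci <- function(r, N, conf.level = 0.95) {
  zr.se <- 1/(N - 3)^0.5
  moe <- qnorm(1 - (1 - conf.level)/2) * zr.se
  zu <- atanh(r) + moe
  zl <- atanh(r) - moe
  tanh(c(zl, zu))
}

r.ind.ci <- function(r1, r2, n1, n2 = n1, conf.level = 0.95) {
  L1 <- rz.ci(r1, n1, conf.level = conf.level)[1]
  U1 <- rz.ci(r1, n1, conf.level = conf.level)[2]
  L2 <- rz.ci(r2, n2, conf.level = conf.level)[1]
  U2 <- rz.ci(r2, n2, conf.level = conf.level)[2]
  lower <- r1 - r2 - ((r1 - L1)^2 + (U2 - r2)^2)^0.5
  upper <- r1 - r2 + ((U1 - r1)^2 + (r2 - L2)^2)^0.5
  c(lower, upper)
}

# unequal sample sizes, default 95% CI (illustrative values)
ci.95 <- r.ind.ci(0.5, 0.2, 100, 80)
ci.95

# equal sample sizes: n2 can be omitted because it defaults to n1
r.ind.ci(0.5, 0.2, 100)

# a 99% CI for the same inputs is wider than the 95% CI
ci.99 <- r.ind.ci(0.5, 0.2, 100, 80, conf.level = 0.99)
ci.99
```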

A caveat

As I point out in chapter 6, just because you can compare two correlation coefficients doesn’t mean it is a good idea. Correlations are standardized simple linear regression coefficients and even if the two regression coefficients measure the same effect, it doesn’t follow that their standardized counterparts do. This is not merely the problem that it may be meaningless to compare, say, a correlation between height and weight with a correlation between anxiety and neuroticism. Two correlations between the same variables in different samples might not be meaningfully comparable (e.g., because of differences in reliability, range restriction and so forth).

Dependent overlapping correlations

In many cases the correlations you want to compare aren’t independent. One reason for this is that the correlations share a common variable. For example, if you correlate X with Y and X with Z you might be interested in whether the correlation rXY is larger than rXZ. As X is common to both correlations, they are not independent. Zou (2007) describes how to adjust the interval to account for this correlation. In essence the sampling variances of the correlations are tweaked using a version of the variance sum law (again see chapter 3).

The following functions (not in the book) compute the correlation between the correlations and use it to adjust the CI for the difference in correlations to account for overlap (a shared predictor). Note that both functions and rz.ci() must be loaded into R. Also included is a call to the main function that reproduces the output from example 2 of Zou (2007).

rho.rxy.rxz <- function(rxy, rxz, ryz) {
  # correlation between two correlations that share the variable x
  num <- (ryz - 1/2 * rxy * rxz) * (1 - rxy^2 - rxz^2 - ryz^2) + ryz^3
  den <- (1 - rxy^2) * (1 - rxz^2)
  num/den
}

r.dol.ci <- function(r12, r13, r23, n, conf.level = 0.95) {
  # lower and upper limits of the CI for each correlation
  L1 <- rz.ci(r12, n, conf.level = conf.level)[1]
  U1 <- rz.ci(r12, n, conf.level = conf.level)[2]
  L2 <- rz.ci(r13, n, conf.level = conf.level)[1]
  U2 <- rz.ci(r13, n, conf.level = conf.level)[2]
  # correlation between the two overlapping correlations
  rho.r12.r13 <- rho.rxy.rxz(r12, r13, r23)
  lower <- r12 - r13 - ((r12 - L1)^2 + (U2 - r13)^2 -
    2 * rho.r12.r13 * (r12 - L1) * (U2 - r13))^0.5
  upper <- r12 - r13 + ((U1 - r12)^2 + (r13 - L2)^2 -
    2 * rho.r12.r13 * (U1 - r12) * (r13 - L2))^0.5
  c(lower, upper)
}

# input from example 2 of Zou (2007, p.409)
r.dol.ci(.396, .179, .088, 66)

The r.dol.ci() function takes three correlations as input – the correlations of interest (e.g., rXY and rXZ) and the correlation between the non-overlapping variables (e.g., rYZ). Also required is the sample size (often identical for both correlations).
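
In practice the three correlations usually come from raw data. The sketch below is one way to pull them out of a correlation matrix and pass them to r.dol.ci(); the simulated variables x, y and z (and the seed and effect sizes used to generate them) are assumptions for illustration only.

```r
# functions from the post, repeated so the snippet runs on its own
rz.ci <- function(r, N, conf.level = 0.95) {
  zr.se <- 1/(N - 3)^0.5
  moe <- qnorm(1 - (1 - conf.level)/2) * zr.se
  tanh(c(atanh(r) - moe, atanh(r) + moe))
}

rho.rxy.rxz <- function(rxy, rxz, ryz) {
  num <- (ryz - 1/2 * rxy * rxz) * (1 - rxy^2 - rxz^2 - ryz^2) + ryz^3
  den <- (1 - rxy^2) * (1 - rxz^2)
  num/den
}

r.dol.ci <- function(r12, r13, r23, n, conf.level = 0.95) {
  L1 <- rz.ci(r12, n, conf.level = conf.level)[1]
  U1 <- rz.ci(r12, n, conf.level = conf.level)[2]
  L2 <- rz.ci(r13, n, conf.level = conf.level)[1]
  U2 <- rz.ci(r13, n, conf.level = conf.level)[2]
  rho.r12.r13 <- rho.rxy.rxz(r12, r13, r23)
  lower <- r12 - r13 - ((r12 - L1)^2 + (U2 - r13)^2 -
    2 * rho.r12.r13 * (r12 - L1) * (U2 - r13))^0.5
  upper <- r12 - r13 + ((U1 - r12)^2 + (r13 - L2)^2 -
    2 * rho.r12.r13 * (U1 - r12) * (r13 - L2))^0.5
  c(lower, upper)
}

# simulated data (illustrative only): x is the variable shared by both correlations
set.seed(123)
n <- 66
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)
z <- 0.2 * x + rnorm(n)

# correlation matrix of the three variables; cbind() supplies the dimnames
R <- cor(cbind(x, y, z))

# CI for r(x,y) - r(x,z); r(y,z) is the correlation between the non-overlapping variables
ci <- r.dol.ci(R["x", "y"], R["x", "z"], R["y", "z"], n)
ci
```

Indexing the matrix by name rather than by position makes it harder to pass the correlations in the wrong order.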

Dependent non-overlapping correlations

Overlapping correlations are not the only cause of dependency between correlations. The samples themselves could be correlated. Zou (2007) gives the example of a correlation between two variables for a sample of mothers. The same correlation could be computed for their children. As the children and mothers have correlated scores on each variable, the two correlations between the same pair of variables will themselves be correlated (but not overlapping in the sense used earlier). The following functions compute the CI for the difference between dependent non-overlapping correlations. Also included is a call to the main function that reproduces Zou (2007) example 3.

rho.rab.rcd <- function(rab, rac, rad, rbc, rbd, rcd) {
  # correlation between two correlations with no variable in common
  num <- 1/2 * rab * rcd * (rac^2 + rad^2 + rbc^2 + rbd^2) +
    rac * rbd + rad * rbc -
    (rab * rac * rad + rab * rbc * rbd + rac * rbc * rcd + rad * rbd * rcd)
  den <- (1 - rab^2) * (1 - rcd^2)
  num/den
}

r.dnol.ci <- function(r12, r13, r14, r23, r24, r34, n12, n34=n12, conf.level=0.95) {
  # lower and upper limits of the CI for each correlation being compared
  L1 <- rz.ci(r12, n12, conf.level = conf.level)[1]
  U1 <- rz.ci(r12, n12, conf.level = conf.level)[2]
  L2 <- rz.ci(r34, n34, conf.level = conf.level)[1]
  U2 <- rz.ci(r34, n34, conf.level = conf.level)[2]
  # correlation between the two non-overlapping correlations
  rho.r12.r34 <- rho.rab.rcd(r12, r13, r14, r23, r24, r34)
  lower <- r12 - r34 - ((r12 - L1)^2 + (U2 - r34)^2 -
    2 * rho.r12.r34 * (r12 - L1) * (U2 - r34))^0.5
  upper <- r12 - r34 + ((U1 - r12)^2 + (r34 - L2)^2 -
    2 * rho.r12.r34 * (U1 - r12) * (r34 - L2))^0.5
  c(lower, upper)
}

# from example 3 of Zou (2007, p.409-10)

r.dnol.ci(.396, .208, .143, .023, .423, .189, 66)

Although this call reproduces the final output for example 3, it produces a slightly different intermediate result (0.0891 vs. 0.0917) for the correlation between correlations. Zou (personal communication) confirms that this is either a typo or rounding error (e.g., arising from hand calculation) in example 3 and that the function here produces accurate output. The input here requires all six correlations between the four variables involved (and the relevant sample size for the correlations being compared). The easiest way to get the correlations is from a correlation matrix of the four variables.
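
The sketch below shows one way to take the six correlations from a correlation matrix and pass them to r.dnol.ci(). The simulated variables v1 to v4 (with v1 and v2 as one pair and v3 and v4 as the related pair) are assumptions for illustration only; the final line recomputes the intermediate correlation between correlations for Zou’s example 3.

```r
# functions from the post, repeated so the snippet runs on its own
rz.ci <- function(r, N, conf.level = 0.95) {
  zr.se <- 1/(N - 3)^0.5
  moe <- qnorm(1 - (1 - conf.level)/2) * zr.se
  tanh(c(atanh(r) - moe, atanh(r) + moe))
}

rho.rab.rcd <- function(rab, rac, rad, rbc, rbd, rcd) {
  num <- 1/2 * rab * rcd * (rac^2 + rad^2 + rbc^2 + rbd^2) +
    rac * rbd + rad * rbc -
    (rab * rac * rad + rab * rbc * rbd + rac * rbc * rcd + rad * rbd * rcd)
  den <- (1 - rab^2) * (1 - rcd^2)
  num/den
}

r.dnol.ci <- function(r12, r13, r14, r23, r24, r34, n12, n34 = n12, conf.level = 0.95) {
  L1 <- rz.ci(r12, n12, conf.level = conf.level)[1]
  U1 <- rz.ci(r12, n12, conf.level = conf.level)[2]
  L2 <- rz.ci(r34, n34, conf.level = conf.level)[1]
  U2 <- rz.ci(r34, n34, conf.level = conf.level)[2]
  rho.r12.r34 <- rho.rab.rcd(r12, r13, r14, r23, r24, r34)
  lower <- r12 - r34 - ((r12 - L1)^2 + (U2 - r34)^2 -
    2 * rho.r12.r34 * (r12 - L1) * (U2 - r34))^0.5
  upper <- r12 - r34 + ((U1 - r12)^2 + (r34 - L2)^2 -
    2 * rho.r12.r34 * (U1 - r12) * (r34 - L2))^0.5
  c(lower, upper)
}

# simulated data (illustrative only): four related variables measured on n cases
set.seed(42)
n <- 66
v1 <- rnorm(n)
v2 <- 0.4 * v1 + rnorm(n)
v3 <- 0.3 * v1 + rnorm(n)
v4 <- 0.3 * v2 + 0.2 * v3 + rnorm(n)

# 4 x 4 correlation matrix; rows and columns are in the order v1, v2, v3, v4
R <- cor(cbind(v1, v2, v3, v4))

# CI for r(v1,v2) - r(v3,v4), passing the correlations in the order r12, r13, r14, r23, r24, r34
ci <- r.dnol.ci(R[1, 2], R[1, 3], R[1, 4], R[2, 3], R[2, 4], R[3, 4], n)
ci

# intermediate value for Zou's example 3; compare with the 0.0891 quoted above
rho.ex3 <- rho.rab.rcd(.396, .208, .143, .023, .423, .189)
rho.ex3
```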

Robust alternatives

Wilcox (2009) describes a robust alternative to these methods for independent correlations and modifications to Zou’s method that make the dependent correlation methods robust to violations of bivariate normality and (in particular) homogeneity of variance assumptions. Wilcox provides R functions for these approaches on his web pages. His functions take raw data as input and are computationally intensive. For instance the dependent correlation methods use Zou’s approach but take bootstrap CIs for the individual correlations as input (rather than the simpler Fisher z transformed versions).

The relevant functions are twopcor() for the independent case, TWOpov() for the dependent overlapping case and TWOpNOV() for the non-overlapping case.
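
To make the general idea concrete, here is a rough sketch that plugs plain percentile bootstrap CIs into Zou’s formula for the overlapping case. It is not Wilcox’s implementation and does not reproduce TWOpov(): the function names boot.r.ci and boot.dol.ci, the simple percentile method, the number of resamples and the simulated data are all assumptions made for illustration.

```r
# correlation between two overlapping correlations (as defined earlier in the post)
rho.rxy.rxz <- function(rxy, rxz, ryz) {
  num <- (ryz - 1/2 * rxy * rxz) * (1 - rxy^2 - rxz^2 - ryz^2) + ryz^3
  den <- (1 - rxy^2) * (1 - rxz^2)
  num/den
}

# hypothetical helper: percentile bootstrap CI for a single Pearson correlation
boot.r.ci <- function(a, b, nboot = 2000, conf.level = 0.95) {
  n <- length(a)
  r.star <- replicate(nboot, {
    i <- sample.int(n, replace = TRUE)  # resample cases (pairs) with replacement
    cor(a[i], b[i])
  })
  alpha <- 1 - conf.level
  unname(quantile(r.star, c(alpha/2, 1 - alpha/2)))
}

# hypothetical function: Zou-style CI for r(x,y) - r(x,z) using bootstrap limits
boot.dol.ci <- function(x, y, z, nboot = 2000, conf.level = 0.95) {
  rxy <- cor(x, y)
  rxz <- cor(x, z)
  ryz <- cor(y, z)
  ci1 <- boot.r.ci(x, y, nboot, conf.level)  # limits for r(x,y)
  ci2 <- boot.r.ci(x, z, nboot, conf.level)  # limits for r(x,z)
  rho <- rho.rxy.rxz(rxy, rxz, ryz)
  lower <- rxy - rxz - ((rxy - ci1[1])^2 + (ci2[2] - rxz)^2 -
    2 * rho * (rxy - ci1[1]) * (ci2[2] - rxz))^0.5
  upper <- rxy - rxz + ((ci1[2] - rxy)^2 + (rxz - ci2[1])^2 -
    2 * rho * (ci1[2] - rxy) * (rxz - ci2[1]))^0.5
  c(lower, upper)
}

# simulated raw data (illustrative only); set.seed() makes the resampling repeatable
set.seed(1)
n <- 80
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)
z <- 0.2 * x + rnorm(n)

boot.ci <- boot.dol.ci(x, y, z)
boot.ci
```

The only structural change from r.dol.ci() is where the individual limits come from, which is why this kind of approach needs raw data and many more computations than the Fisher z version.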

UPDATE

Zou’s modified asymptotic method is easy enough that you can run it in Excel. I’ve added an Excel spreadsheet to the blog resources that should implement the methods (and matches the output from R fairly closely). As it uses Excel it may not cope gracefully with some calculations (e.g., with extremely small or large values of r or other extreme cases) – and I have more confidence in the R code.

References

Baguley, T. (2012, in press). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.

Zou, G. Y. (2007). Toward using confidence intervals to compare correlations. Psychological Methods, 12, 399-413.

Wilcox, R. R. (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and non-normality. Communications in Statistics – Simulation & Computation, 38, 2220-2234.

N.B. R code formatted via Pretty R at inside-R.org

# Serious stats – a quick chapter summary

Here is a list of the contents by chapter with quick notes on chapter content …

0. Preface (About the book; notes on software, mathematics and types of boxed sections)

1. Data, Samples and Statistics (A gentle review of measures of central tendency and dispersion with a little more depth in places – flagging up the distinction between descriptive and inferential formulas and perhaps introducing a few unfamiliar statistics such as the geometric mean)

2. Probability Distributions (A background chapter giving a whirlwind tour of the main probability distributions – discrete and continuous – that crop up in later chapters. It also introduces important concepts such as probability mass functions, probability density functions and cumulative distribution functions and characteristics of distributions such as skew, kurtosis and whether they are bounded. From a statistical point of view it is a quick overview missing out a lot of the difficult stuff.)

3. Confidence Intervals (This chapter introduces interval estimation using confidence intervals (CIs) and gives examples for discrete and continuous distributions – particularly those for means and differences between independent or paired means using the t distribution. This chapter also introduces Monte Carlo methods – with emphasis on the bootstrap.)

4. Significance Tests (This chapter introduces significance tests. These are deliberately covered after CIs – which are less popular in the behavioral sciences but generally more useful. A number of common tests are covered – notably t tests and chi-square tests. The chapter ends with some comments on the appropriate use of significance tests – a point picked up again in chapter 11.)

5. Regression (This chapter introduces regression – with an emphasis on simple linear regression. Later chapters draw heavily on this basic material including concepts such as prediction, leverage and influence. The versatility of regression approaches is shown by illustrating how an independent t test is a simple regression model and how a linear model can fit some curvilinear relationships.)

6. Correlation and Covariance (Introduces covariance and correlation with emphasis on the link between Pearson’s r and simple linear regression. The chapter also introduces standardization and problems of working with standardized quantities such as boundary effects, range restriction and small sample bias. Methods for inference with correlation coefficients and comparing correlations (e.g., using the Fisher z distribution) are considered. Some alternatives to Pearson’s r are also introduced.)

7. Effect Size (This chapter focuses on effect size, starting with an overview of the different uses of effect size metrics. The chapter gives a tour of different types of effect size metrics, distinguishing between: continuous and discrete metrics; simple (unstandardized) and standardized metrics; focused (1 df) and unfocused (multiple df) metrics; base rate sensitive and base rate insensitive metrics. I argue that standardized metrics whether based on differences or correlations (d family or r family) are not good measures of the practical, clinical or theoretical importance of an effect because they confound the magnitude of an effect with its variability – though they may be useful in some situations.)

8. Statistical Power (This chapter introduces statistical power – starting by explaining the link between the effect size and statistical power, illustrating why standardized effect size (by combining the magnitude of an effect with its variability) is often a convenient way to summarize an effect in order to estimate statistical power or the sample size required to detect an effect. Problems and pitfalls in statistical power and sample size estimation are discussed. Later sections introduce the accuracy in parameter estimation approach to power in relation to the width of a confidence interval.)

9. Exploring Messy Data (This chapter looks at exploratory analysis of data with emphasis on graphical methods for checking statistical assumptions.)

10. Dealing with Messy Data (This chapter surveys approaches to dealing with violations of statistical assumptions with particular emphasis on robust methods and transformations.)

11. Alternatives to Classical Statistical Inference (This chapter looks at criticism of classical, frequentist methods of inference and considers frequentist responses and three alternative approaches: likelihood, Bayesian and information-theoretic methods. I illustrate each of the alternatives both here and in later chapters.)

12. Multiple Regression and the General Linear Model (This chapter extends regression to models with multiple predictors. The problem of fitting these models when predictors are not orthogonal (i.e., when they are correlated) is introduced and a solution is illustrated using matrix algebra. The rest of the chapter introduces partial and semi-partial correlation and focuses on interpreting a multiple regression model and related issues such as collinearity and suppression.)

13. ANOVA and ANCOVA with Independent Measures (This chapter introduces ANOVA and ANCOVA as special cases of multiple regression with categorical predictors (e.g., using dummy or effect coding). The chapter ends by introducing the multiple comparison problem in relation to differences between means for a factor in ANOVA or differences between adjusted means in ANCOVA. For the latter, the main focus is on modified Bonferroni procedures, though alternatives such as control of false discovery rate and information-theoretic approaches are briefly considered.)

14. Interactions (This chapter looks at modeling non-additive effects of predictors in multiple regression models through the inclusion of interaction terms. It starts by looking at the most general form of an interaction model in multiple regression (often termed a moderated multiple regression) before looking at polynomial terms in regression and interactions in the context of ANOVA and ANCOVA. The main emphasis is on interpreting and exploring interaction effects (e.g., through graphical methods). The chapter also looks at simple main effects and simple interaction effects.)

15. Contrasts (This chapter looks at the often neglected topic of contrasts – mainly in the context of ANOVA and ANCOVA models (where they are weighted combinations of differences in means or adjusted means). Methods for setting up contrasts to test hypotheses about patterns of means are explained for simple cases and extended for unbalanced designs, adjusted means and interaction effects.)

16. Repeated Measures ANOVA (This chapter introduces repeated measures and related (e.g., matched) designs. These increase statistical power by removing individual differences from the ANOVA error term, but at the cost of increased complexity (e.g., making stronger assumptions about the errors of the model). Again, the chapter focuses on checking and dealing with violations of assumptions and on the interpretation of the model. It also briefly considers MANOVA and repeated measures ANCOVA models and the use of gain scores.)

17. Modelling Discrete Outcomes (This chapter explains how the regression approach of the general linear model can be extended to models with discrete outcomes using the generalized linear model and related approaches. The main focus is on logistic regression (including multinomial and ordered logistic regression) and Poisson regression, but negative binomial regression and models for excess zeroes (zero-inflated and hurdle models) are briefly reviewed. The chapter ends by considering the difficulty of modeling correlated observations in logistic regression.)

18. Multilevel Models (This chapter introduces multilevel models with particular emphasis on their application to the analysis of repeated measures data. The chapter considers conventional nested designs (e.g., repeated measures within participants or children within schools) and moves on to fully crossed models and a brief overview of multilevel generalized linear models.)

All chapters come with several examples within the chapter and R code (at the end). Most also have notes on SPSS syntax. I don’t include full SPSS instructions because these are often already available in popular texts. If they aren’t available it is generally because SPSS couldn’t readily implement these analyses. Also note that recent versions of SPSS can be set up to call R via syntax (though I find it easier to use R directly).

Online supplements

The book is around 800 pages long and some material cut from the final draft will be available in five online supplements. This material is either parenthetical (being more detailed than required) or made up of self-contained sections that could stand alone and were perhaps not relevant for all readers.

OS1. Meta-analysis (This section was included in chapter 7 Effect size and introduces meta-analysis. Most meta-analytic approaches for continuous data use standardized effect size metrics. As the chapter argues that simple effect size metrics are often superior for summarizing and comparing effects, this supplement uses meta-analysis of simple (raw) mean differences to illustrate fixed effect and random effects models. There is a nice link between random effects meta-analysis and multilevel models – so it was a shame to drop it.)

OS2. Dealing with missing data (An overview of methods for dealing with missing data that was part of chapter 10. The main focus is on multiple imputation – an extremely useful and underused approach in the behavioral sciences and a worked example is demonstrated for both R and SPSS. There are nice links between multiple imputation and meta-analysis – so it made sense to move this chapter out once I had decided to leave out meta-analysis. If you work with missing data and aren’t already familiar with multiple imputation you should take a careful look at this chapter – as most standard methods for dealing with missing data are biased and have low statistical power.)

OS3. Replication probabilities and prep (When I started writing the book there was quite an interest in replication probabilities, and in prep in particular, as an alternative to p values. This interest has largely faded and my (largely critical) take on prep is now mainly a historical curiosity. The main text now covers this topic briefly in chapter 11.)

OS4. Pseudo-R² and related measures (A reader of the final draft of chapter 17 commented that, given the problems with these measures and my own critical stance on standardized effect size metrics, my coverage of this topic was too detailed. I greatly reduced the emphasis on pseudo-R² in the text by moving most of the material here. Of these measures my favourite is Zheng and Agresti’s predictive power measure – which I find most intuitive.)

OS5. Loglinear models (Loglinear models are models of contingency table data (closely related to Poisson regression, and under certain conditions equivalent). As Poisson models are generally more flexible, loglinear models were cut from the final draft. However, as they are quite popular in the behavioral sciences, this supplement is provided. Loglinear models are also a convenient way to parameterize a count model to make it more “chi-square-like”. Note: the term loglinear model can also be used in a more general sense to include models with log link functions or log transformations.)

# Serious stats: A guide to advanced statistics for the behavioral sciences

This is a blog to accompany my forthcoming book “Serious stats” published by Palgrave.

Baguley, T. (2012, in press). Serious stats: A guide to advanced statistics for the behavioral sciences. Basingstoke: Palgrave.

The book is available for pre-order (e.g., via amazon) and instructors should be able to pre-order inspection copies via Macmillan in the US (or Palgrave in the UK).
The proofs have been checked and returned and I am hoping for a publication date of May 2012.