3 R practicals – Statistical Modelling

3.1 Getting started

For running the code below, you will need the R packages MASS and SMPracticals. The MASS package (Venables & Ripley, 2002) is one of the recommended R package and is included with the binary distributions of R, so you should have it. The SMPracticals package (Davison, 2024) is the R package providing the datasets and a few functions for use with the practicals outlined in Davison (2003, Appendix A), and can be installed by running

install.packages("SMPracticals")

The packages can be loaded and attached by doing

library("MASS")
library("SMPracticals")

3.2 `trees` data

trees contains data on the volume of timber, height and girth (diameter) of 31 felled black cherry trees; girth is measured four feet six inches above ground (Atkinson, 1985, p. 63). The problem is to find a simple linear model for predicting volume from height and girth. See ?trees for more details.

data("trees", package = "datasets")
pairs(trees, panel = panel.smooth)
pairs(log(trees), panel = panel.smooth)

coplot() generates conditioning plots, in which the relationship between two variables is displayed conditional on subsets of values of other variables. This is useful to see if the relationship is stable over the range of other variables. The plots should be read from left to right, starting from the bottom row, and each plot corresponds to the ranges of values (from left to right) shown on the top plot for the conditioning variable. For the relationship of log volume and log girth, conditional on height we get:

coplot(log(Volume) ~ log(Girth) | Height, data = trees, panel = panel.smooth)

Produce and interpret the conditioning plots on the orginal scale.

For an initial fit, we take a linear model and assess model fit using diagnostic plots:

m_trees <- glm(Volume ~ Girth + Height, data = trees)
summary(m_trees)
plot.glm.diag(m_trees)

What do you make of the m_trees fit?

To assess the possibility of transformation:

boxcox(m_trees)

Both \(\lambda = 1\) and \(\lambda = 0\) lie outside the confidence interval, though the latter is better supported. One possibility is to take \(\lambda = 1/3\), corresponding to response \(\texttt{Volume}^{1/3}\).

What transformations for Girth and Height are then needed for dimensional compatibility? Fit this model, give interpretations of the parameter estimates, and discuss its suitability.

An alternative is to suppose that a tree is conical in shape, in which case
\[ \texttt{Volume} \propto \texttt{Height} \times \texttt{Girth}^2 \, . \] Equivalently, we fit

m_trees_log <- glm(log(Volume) ~ log(Girth) + log(Height), data = trees)
summary(m_trees_log)
plot.glm.diag(m_trees_log)

Are the parameter estimates consistent with this model? Does it fit adequately? What advantage has it over the others for prediction of future volumes?

3.3 `salinity` data

salinity contains \(n = 28\) observations on the salinity of water in Pamlico Sound, North Carolina (Atkinson, 1985, p. 48; Ruppert & Carroll, 1980). The response sal is the bi-weekly average of salinity. The other three columns contain values of the covariates, respectively a lagged value of salinity lag, a trend indicator trend, and the river discharge dis.

Using the techniques from the analysis of the tree data set as a guide, find a model suitable for prediction of salinity from the covariates. The data contain at least one outlier.

3.4 `shuttle` data

shuttle contains the data in Davison (2003, Table 1.3) on O-ring failures for the space shuttle (Dalal et al., 1989).

data("shuttle", package = "SMPracticals")
row.names(shuttle) <- NULL

To fit a binomial logistic regression model with covariate temperature, we do:

m_shuttle <- glm(cbind(r, m - r) ~ temperature, family = binomial(), data = shuttle)
anova(m_shuttle)
summary(m_shuttle)

Try fitting with and without both covariates. To assess model fit, try

plot.glm.diag(m_shuttle)

Do you find these diagnostics useful?

3.5 `bliss` data

bliss provides data on mortality of flour-beetles as a function of dose of a poison (Bliss, 1935). To plot the death rates and fit a logistic regression model, we do:

data("bliss", package = "SMPracticals")
m_bliss_logit <- glm(cbind(r, m - r) ~ log(dose), family = binomial(), data = bliss)
summary(m_bliss_logit)

with(bliss, {
    plot(log(dose), r/m, ylim = c(0, 1), ylab = "Proportion dead")
    points(log(dose), fitted(m_bliss_logit), pch = 3, col = 2)
})

Does the fit seem reasonble to you?

Try again with the probit and cloglog link functions.

For example, for the cloglog link function we have:

m_bliss_cloglog <- glm(cbind(r, m-r) ~ log(dose), family = binomial("cloglog"), data = bliss)
with(bliss, {
    plot(log(dose), r/m, ylim = c(0, 1), ylab = "Proportion dead")
    points(log(dose), fitted(m_bliss_logit), pch = 3, col = 2)
    points(log(dose), fitted(m_bliss_cloglog), pch = 3, col = 3)
})

Which link function fits best? Give a careful interpretation of the resulting model.

3.1 Getting started

3.2 trees data

3.3 salinity data

3.4 shuttle data

3.5 bliss data

3.2 `trees` data

3.3 `salinity` data

3.4 `shuttle` data

3.5 `bliss` data