Chapter 18 When we only want interpretation on some predictors

Caution: this chapter is in an early stage of development! See Section 1.1.

18.1 Non-identifiability in GAMs

Here is some help on Lab 4 Exercise 1(b). Exercise 1(b) is intended to get you to think about what the \(h\) functions in a Generalized Additive Model (GAM) are.

An interpretation of the \(h\) functions can only make sense in light of the non-identifiability issue of GAMs, so that’s discussed first. Then, hints are given for the first two questions in Exercise 1(b).

18.1.1 Non-identifiability

What is “non-identifiability”, exactly? It can happen for any model that’s not carefully specified (not just GAMs). Let’s look at an example first.

In simple linear regression, why not write the model \[ Y = \beta_0 + \alpha_0 + \beta_1 X + \varepsilon, \] where \(\mathbb{E}(\varepsilon)=0\)? It’s because three parameters are too many to describe a line. In other words, more than one choice of parameter values gives exactly the same line. For example, the model \(Y=1+X+\varepsilon\) can be written with \[ \beta_0 = 0, \alpha_0 = 1, \beta_1 = 1, \] or \[ \beta_0 = -1, \alpha_0 = 2, \beta_1 = 1, \] etc. In fact, as long as \(\alpha_0 = 1 - \beta_0\) and \(\beta_1=1\), we get the same regression line.
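To see this non-identifiability show up in software, here is a small illustration (not part of the lab) using simulated data: if we hand lm an extra, redundant intercept column, it cannot estimate both intercepts and reports one of them as NA. The data and variable names below are made up purely for demonstration.

```r
## Simulated data, for illustration only: Y = 1 + X + noise
set.seed(1)
x <- runif(50)
y <- 1 + x + rnorm(50, sd = 0.1)
extra_intercept <- rep(1, 50)  # plays the role of alpha_0's column

## lm() detects that beta_0 and alpha_0 cannot both be estimated,
## and reports NA for the redundant coefficient.
coef(lm(y ~ extra_intercept + x))
```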

In general, and roughly speaking, when more than one parameter selection gives you the same model, there’s a non-identifiability issue. It leads to problems with estimation and with the properties of the estimators. It also leads to an interpretation problem: the parameters don’t have a unique meaning, since they can represent more than one thing in the model.

This is even true in non-parametric cases, such as the GAM. Let’s look at a two-predictor GAM: \[ Y = \beta_0 + h_1\left(X_1\right) + h_2\left(X_2\right) + \varepsilon, \] where \(\beta_0\) is any real number, \(h_1\) and \(h_2\) are any smooth functions, and \(\mathbb{E}(\varepsilon)=0\). As it is, this model is non-identifiable: if you pick a \(\beta_0\), \(h_1\), and \(h_2\), I can find another set of \(\beta_0\), \(h_1\), and \(h_2\) that gives the same regression surface. How? I can just add a constant \(c\) to your \(\beta_0\), and subtract that constant from your, say, \(h_1\) (i.e., “vertically shift” your \(h_1\) function downwards by \(c\)).

So, the “parameters” (which include the \(h\) functions) in a GAM are non-identifiable – the \(h\) functions can be vertically shifted, and \(\beta_0\) can just compensate for these shifts to give the same regression surface.
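Here is a tiny numerical check of that statement (not part of the lab; the \(h\) functions and the shift \(c\) below are arbitrary choices made up for illustration): shifting \(h_1\) down by \(c\) and \(\beta_0\) up by \(c\) leaves the regression surface unchanged.

```r
## Two parameterizations of the same two-predictor additive surface
h1 <- function(x) sin(x)
h2 <- function(x) x^2
beta0 <- 2
c_shift <- 5  # an arbitrary vertical shift

x1 <- seq(0, 1, by = 0.25)
x2 <- seq(0, 1, by = 0.25)
surface_a <- beta0 + h1(x1) + h2(x2)
surface_b <- (beta0 + c_shift) + (h1(x1) - c_shift) + h2(x2)

all.equal(surface_a, surface_b)  # TRUE: same surface, different "parameters"
```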

To make the model identifiable, we force the \(h\) functions to be vertically centered at zero. Here’s how: we ensure that after transforming the \(j\)’th predictor to \(h_j\left(X_j\right)\), the resulting data are centered at 0. Mathematically, we ensure that \[ \frac{1}{n}\sum_{i=1}^{n}h_j\left(x_{ij}\right) = 0 \] for each predictor \(j\), where \(x_{ij}\) for \(i=1,\ldots,n\) are the observations.
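If you want to see this constraint in action, here is a sketch, assuming the fit is made with mgcv::gam (the data, the predictor names x1 and x2, and the smooth terms below are simulated and made up for illustration): the fitted \(\hat{h}_j\left(x_{ij}\right)\) values, which predict(..., type = "terms") returns, average to essentially zero over the training data.

```r
## Sketch: checking the zero-centering of the fitted h functions
## (simulated data; assumes an mgcv::gam fit)
library(mgcv)
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.2)
fit <- gam(y ~ s(x1) + s(x2))

## One column per smooth term, containing h_j(x_ij) for each observation;
## each column should average to (numerically) zero.
colMeans(predict(fit, type = "terms"))
```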

18.1.2 Question 1b

Notation: Let’s call \(\hat{\beta}_0\) the estimate of \(\beta_0\), and the functions \(\hat{h}_1\) and \(\hat{h}_2\) the estimates of \(h_1\) and \(h_2\), respectively.

The prediction on observation \(i\), denoted \(\hat{Y}_i\), is \[ \hat{Y}_i = \hat{\beta}_0 + \hat{h}_1\left(x_{i1}\right) + \hat{h}_2\left(x_{i2}\right). \] This will help with the first question:

Suppose the gam fit is called fit. Why is mean(predict(fit)) the same as the estimate of the intercept?

Here’s a hint: predict(fit) gives you the vector \(\hat{Y}_1, \ldots, \hat{Y}_n\). Then, mean averages them. The question is asking you to indicate why we have \[ \frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i = \hat{\beta}_0. \] The answer uses the zero-centering equation from Section 18.1.1.
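As a sanity check, you can verify the identity numerically. The fit below is a simulated mgcv::gam example made up for illustration; the lab’s own fit object should behave the same way.

```r
## Numerical check that mean(predict(fit)) equals the intercept estimate
library(mgcv)
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.2)
fit <- gam(y ~ s(x1) + s(x2))

mean(predict(fit))         # average of the Y-hats
coef(fit)["(Intercept)"]   # the estimated intercept: the two agree
```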

The next question asks you to think about how you’d recover an \(h\) function. It asks:

For each \(h\) function, write an R function that evaluates the \(h\) function over a grid of values, without calling the plot function on the fit. Show that the function works by evaluating it over a small grid of values.

Suppose you want to evaluate the function \(\hat{h}_1\) at some generic point \(x_0\). You can do this using the predict function, and somehow specifying \(x_0\) in the newdata argument (in place of “predictor 1”). But predict will give you all three components of the model, added together: the \(\hat{\beta}_0\) part, plus the \(\hat{h}_1\) part (evaluated at whatever is in the “predictor 1” column), plus the \(\hat{h}_2\) part (evaluated at whatever is in the “predictor 2” column). Your job is to “isolate” the \(\hat{h}_1\) part, evaluated at \(x_0\). We can subtract out \(\hat{\beta}_0\), which is reported in the model output. But you can’t just subtract out the \(\hat{h}_2\) part, because we don’t know it. Your job is to use a property of \(\hat{h}_2\) (hint: the zero-centering equation from Section 18.1.1) to remove it.

You can also think of it this way: if mean(predict(fit)) “zeroes-out” both \(h\) functions, how can you modify the prediction data so that one of the \(h\) functions doesn’t zero-out, but instead evaluates at some desired point?
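To make the idea concrete, here is one possible shape such a function could take. This is only a sketch, not the intended lab solution: it assumes an mgcv::gam fit on simulated data with predictor columns named x1 and x2, and the helper name h1_hat is made up. The trick is to pin predictor 1 at \(x_0\) while leaving predictor 2 at its original observed values, so that the \(\hat{h}_2\) part averages to zero.

```r
## Sketch: evaluating h1-hat over a grid without calling plot()
## (simulated data; column names x1, x2 and helper name h1_hat are assumptions)
library(mgcv)
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.2)
fit <- gam(y ~ s(x1) + s(x2))

h1_hat <- function(x0) {
  sapply(x0, function(v) {
    ## Predictor 1 pinned at v; predictor 2 keeps its observed values,
    ## so the h2 contributions average to zero across the rows.
    newdat <- data.frame(x1 = v, x2 = x2)
    mean(predict(fit, newdata = newdat)) - coef(fit)[["(Intercept)"]]
  })
}

## Evaluate over a small grid of values
h1_hat(seq(0, 1, by = 0.25))
```

(For an mgcv::gam fit, predict(fit, type = "terms") would hand you the \(\hat{h}_j\) contributions directly, but the point of the exercise is to recover them yourself using the centering property.)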