Chapter 18 When we only want interpretation on some predictors

Caution: this chapter is in a highly developmental stage! See Section 1.1.

18.1 Non-identifiability in GAMs

Here is some help on Lab 4 Exercise 1(b), which is intended to get you to think about what the $$h$$ functions in a Generalized Additive Model (GAM) are.

An interpretation of the $$h$$ functions only makes sense in light of the non-identifiability issue of GAMs, so that’s discussed first. Then, hints are given for the first two questions in Exercise 1(b).

18.1.1 Non-identifiability

What is “non-identifiability”, exactly? It can arise in any model that’s not carefully specified (not just GAMs). Let’s look at an example first.

In simple linear regression, why not write the model $Y = \beta_0 + \alpha_0 + \beta_1 X + \varepsilon,$ where $$\mathbb{E}(\varepsilon)=0$$? It’s because three parameters are too many to describe a line. In other words, more than one choice of parameters describes the same line. For example, the model $$Y=1+X+\varepsilon$$ can be written with $\beta_0 = 0, \alpha_0 = 1, \beta_1 = 1,$ or $\beta_0 = -1, \alpha_0 = 2, \beta_1 = 1,$ etc. In fact, as long as $$\alpha_0 = 1 - \beta_0$$ and $$\beta_1=1$$, we get the same regression line.
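To make this concrete, here is a tiny R illustration (the numbers are arbitrary): two different parameter choices trace out exactly the same line, so no dataset could ever distinguish them.

```r
# Two parameterizations of the line Y = 1 + X (ignoring the error term):
x <- seq(0, 1, length.out = 5)
line_a <-  0 + 1 + 1 * x   # beta0 = 0,  alpha0 = 1, beta1 = 1
line_b <- -1 + 2 + 1 * x   # beta0 = -1, alpha0 = 2, beta1 = 1
all.equal(line_a, line_b)  # TRUE: the two lines are identical
```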

In general, and roughly speaking, when more than one parameter selection gives you the same model, there’s a non-identifiability issue. It leads to problems in estimation and in the properties of the estimators. It also leads to an interpretation problem: the parameters don’t have a well-defined meaning, since they can represent more than one thing in the model.

This is even true in non-parametric cases, such as the GAM. Let’s look at a two-predictor GAM: $Y = \beta_0 + h_1\left(X_1\right) + h_2\left(X_2\right) + \varepsilon,$ where $$\beta_0$$ is any real number, $$h_1$$ and $$h_2$$ are any smooth functions, and $$\mathbb{E}(\varepsilon)=0$$. As it is, this model is non-identifiable: if you pick a $$\beta_0$$, $$h_1$$, and $$h_2$$, I can find another set of $$\beta_0$$, $$h_1$$, and $$h_2$$ that gives the same regression surface. How? I can just add a constant $$c$$ to your $$\beta_0$$, and subtract that constant from your, say, $$h_1$$ (i.e., “vertically shift” your $$h_1$$ function downwards by $$c$$).
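Here is a quick numerical check of that shift argument in R; the particular $$h$$ functions and the constant $$c$$ below are made up purely for illustration.

```r
# Shifting h1 down by a constant and adding it to beta0 leaves the surface unchanged.
h1 <- function(x) sin(x)
h2 <- function(x) x^2
beta0 <- 2
shift <- 0.5
x1 <- runif(10); x2 <- runif(10)
surface_a <- beta0 + h1(x1) + h2(x2)
surface_b <- (beta0 + shift) + (h1(x1) - shift) + h2(x2)
all.equal(surface_a, surface_b)  # TRUE: the data cannot tell these apart
```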

So, the “parameters” (which include the $$h$$ functions) in a GAM are non-identifiable: the $$h$$ functions can be vertically shifted, and $$\beta_0$$ can compensate for these shifts to give the same regression surface.

To make the model identifiable, we force the $$h$$ functions to be vertically centered at zero. Here’s how: we ensure that after transforming the $$j$$’th predictor to $$h_j\left(X_j\right)$$, the resulting data are centered at 0. Mathematically, we ensure that $\frac{1}{n}\sum_{i=1}^{n}h_j\left(x_{ij}\right) = 0$ for each predictor $$j$$, where $$x_{ij}$$ for $$i=1,\ldots,n$$ are the observations.
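If you’d like to see this constraint in action, here is a sketch using the mgcv package on simulated data; the data-generating process and the variable names x1, x2, and y are assumptions made for illustration. In mgcv, predict with type = "terms" returns each fitted smooth evaluated at the data, so each column should average to (numerically) zero.

```r
library(mgcv)

set.seed(1)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 2 + sin(2 * pi * dat$x1) + dat$x2^2 + rnorm(n, sd = 0.2)
fit <- gam(y ~ s(x1) + s(x2), data = dat)

# One column per smooth term; each column mean is ~0 by the centering constraint.
colMeans(predict(fit, type = "terms"))
```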

18.1.2 Question 1b

Notation: Let’s call $$\hat{\beta}_0$$ the estimate of $$\beta_0$$, and the functions $$\hat{h}_1$$ and $$\hat{h}_2$$ the estimates of $$h_1$$ and $$h_2$$, respectively.

The prediction on observation $$i$$, denoted $$\hat{Y}_i$$, is $\hat{Y}_i = \hat{\beta}_0 + \hat{h}_1\left(x_{i1}\right) + \hat{h}_2\left(x_{i2}\right).$ This will help with the first question:

Suppose the gam fit is called fit. Why is mean(predict(fit)) the same as the estimate of the intercept?

Here’s a hint: predict(fit) gives you the vector $$\hat{Y}_1, \ldots, \hat{Y}_n$$. Then, mean averages them. The question is asking you to indicate why we have $\frac{1}{n}\sum_{i=1}^{n}\hat{Y}_i = \hat{\beta}_0.$ The answer uses the centering constraint from Section 18.1.1, which forces each $$\hat{h}_j$$ to average to zero over the data.
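As a quick numerical check of this identity (continuing with the simulated mgcv fit from the sketch above):

```r
# The average fitted value equals the intercept estimate, because each
# fitted smooth averages to zero over the data.
mean(predict(fit))        # average of the Y-hats
as.numeric(coef(fit)[1])  # the "(Intercept)" estimate -- should match
```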

The next question asks you to think about how you’d recover an $$h$$ function. It asks:

For each $$h$$ function, write an R function that evaluates the $$h$$ function over a grid of values, without calling the plot function on the fit. Show that the function works by evaluating it over a small grid of values.

Suppose you want to evaluate the function $$\hat{h}_1$$ at some generic point $$x_0$$. You can do this using the predict function, somehow specifying $$x_0$$ in the newdata argument (in place of “predictor 1”). But predict will give you all three components of the model added together: the $$\hat{\beta}_0$$ part, plus the $$\hat{h}_1$$ part (evaluated at whatever is in the “predictor 1” column), plus the $$\hat{h}_2$$ part (evaluated at whatever is in the “predictor 2” column). Your job is to “isolate” the $$\hat{h}_1$$ part, evaluated at $$x_0$$. We can subtract out $$\hat{\beta}_0$$, which is given in the model output. But you can’t just subtract out the $$\hat{h}_2$$ part, because we don’t know it. Your job is to use a property of $$\hat{h}_2$$ (hint: the centering constraint from Section 18.1.1) to remove it.

You can also think of it this way: if mean(predict(fit)) “zeroes out” both $$h$$ functions, how can you modify the prediction data so that one of the $$h$$ functions doesn’t zero out, but instead evaluates at some desired point? A sketch of this idea follows.
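Here is one possible sketch of that idea in R, continuing with fit and dat from the earlier mgcv example (so x1 and x2 are assumed variable names, and the grid is arbitrary): fix predictor 1 at a grid point, let predictor 2 run over its observed values, and average the predictions; the $$\hat{h}_2$$ contributions average to zero, so subtracting $$\hat{\beta}_0$$ leaves $$\hat{h}_1$$ evaluated at the grid point.

```r
# Evaluate h1-hat over a grid of values, without calling plot().
h1_hat <- function(x0_grid) {
  sapply(x0_grid, function(x0) {
    # Fix predictor 1 at x0; recycle x0 across the observed predictor-2 values.
    newdat <- data.frame(x1 = x0, x2 = dat$x2)
    # h2-hat averages to zero over the observed x2 values, so averaging the
    # predictions and subtracting beta0-hat isolates h1-hat(x0).
    mean(predict(fit, newdata = newdat)) - as.numeric(coef(fit)[1])
  })
}

h1_hat(seq(0, 1, length.out = 5))  # evaluate over a small grid
```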