# Chapter 1 Preamble

## 1.1 Caution

This book is in its very preliminary stages. Content will be moving around and updated.

Currently, much of the book is a stitching together of previous pieces of my writing that I think might be relevant to this book. These chapters will be updated to consider the different context, audience, and content organization that’s best for this book.

Stage of development:

1. Gather existing and potentially relevant pieces of writing
2. Create a new chapter structure
3. Parse existing writing into new chapters
4. Writing:
    1. Write concepts/objectives of each chapter
    2. Fill in content
    3. Write preambles for each chapter and part

## 1.2 Purpose of the book

There’s a vast and powerful statistical framework out there. This book takes a modularized approach to making this framework accessible, so that as a problem solver, the reader can make a sequence of decisions to build models that are best suited to address the problem.

This book focuses on the why of the statistical framework. Why set up a model in a certain way? Why is this definition useful? What are the implications of this definition or this model?

Regression analysis can help solve two main types of problems:

1. Interpreting the relationship between variables.
2. Predicting a new outcome.

This book primarily focuses on models for interpretation, and often references these two competing needs. The prediction problem is still discussed to some extent: once a model suited for interpretation is developed, it is still important to be able to use it to make predictions. Those interested primarily in prediction should check out resources on supervised learning, which aims to optimize predictions.

There is another layer to interpretation that this book adopts: describing and understanding the motivation and inner workings behind each method. This is in contrast to a purely mathematical presentation of statistical methods. This book presents both an interpretation of the high-level idea behind a method, as well as a mathematical presentation to make these concepts precise. For example, the Kaplan-Meier estimate of the survival function is explained both intuitively and mathematically.

Most statistical methods are built on a foundation of assumptions imposed on the data. But an assumption is almost never exactly true, so we instead explore consequences based on “how close” an assumption is to being true. In some cases, we even find that an anticipated assumption is not required at all, depending on how we would like to interpret the model: an example being the “requirement” of linear data in linear regression. Consequently, one aim of this book is to discourage the thinking that a method either “can” or “cannot” be applied, instead thinking about the pros and cons of various methods.
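To illustrate the linear regression example, here is a minimal sketch with simulated data (all variable names are hypothetical). The true mean response is curved, yet the fitted slope remains interpretable as the best linear approximation to the mean response — the linearity “assumption” changes what the slope means, not whether the model can be fit:

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x + 0.1 * x^2 + rnorm(200, sd = 1)  # truth is nonlinear in x

fit <- lm(y ~ x)
coef(fit)["x"]
# The slope estimates the best-fitting linear approximation to the
# (curved) mean response, not the "true" curve itself.
```

Whether this interpretation is useful depends on the problem, which is precisely the pros-and-cons view this book encourages.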

Methods are demonstrated using the R statistical software. This is because R has an extensive selection of packages available for statistical analysis, and the tidyverse and tidymodels meta-packages make data analysis readable and organized. Since this book does not focus on optimally flexing a model function to conform to data (non-parametric supervised learning), languages better suited for this task, such as Python, are not considered.

The audience of this book is quite wide, with its modularization, interpretation, and non-binary view of assumptions each appealing to different readers:

1. Practitioners may find the modularization useful when making decisions when fitting models, the non-binary view of assumptions useful when evaluating the trustworthiness of their models, and the interpretation of their model useful when communicating their results.
2. Experts in the field of Statistics might benefit from the unique modularized framework of the field, as well-used notions such as the expected value are challenged and expanded upon.
3. Learners may find the modularization useful for compartmentalizing concepts, and the interpretation of methods useful for understanding concepts.

## 1.3 Tasks that motivate Regression

Real world problems for which regression is an appropriate tool generally fall into two categories:

1. Prediction: Predicting the response of a new unit/individual, sometimes also describing uncertainty in the prediction.
2. Interpretation: Interpreting how predictors influence the response.

For example, consider a person undergoing artificial insemination.

- Prediction: Given the person’s age, what is the chance of pregnancy?
- Interpretation: How does age influence the chance of pregnancy? How does time of insemination after a spike in luteinizing hormone affect the chance of pregnancy, and how is this different for people over 40?

This book does not focus on optimizing predictions, but focuses on the other tasks. This means:

1. describing the uncertainty in predictions, or estimates in general, and
2. interpreting how predictors influence the response.

Why not focus on optimizing predictions? This is the objective of supervised learning, an entire discipline in itself, and including it would make the scope of this book too large.

## 1.4 Examples

```r
library(mice)
library(broom)
library(tidyverse)

Wage <- ISLR::Wage
NCI60 <- ISLR::NCI60
baseball <- Lahman::Teams %>%
  as_tibble() %>%
  select(runs = R, hits = H)
esoph <- as_tibble(esoph) %>%
  mutate(agegp = as.character(agegp))
titanic <- na.omit(titanic::titanic_train)
```