# Chapter 1 Preamble

## 1.1 Caution

This book is in its very preliminary stages. Content will be moving around and updated.

Currently, much of the book is a stitching together of previous pieces of my writing that I think might be relevant to this book. These chapters will be updated to consider the different context, audience, and content organization that’s best for this book.

Stage of development:

1. Gather existing and potentially relevant pieces of writing
2. Create a new chapter structure
3. Parse existing writing into new chapters
4. Writing:
1. Write concepts/objectives of each chapter
2. Fill in content
3. Write preambles for each chapter and part

## 1.2 Purpose of the book

This book is aimed at practitioners trying to gain insight from numeric data. This task of “gaining insight from data” falls squarely in the field of Statistics, but this book is not designed to be a Statistics text book.

While there’s a vast and powerful statistical framework out there capable of providing insight to a vast array of data problem types, this framework remains difficult to navigate and piece together. This hinders the practitioner’s ability to make effective modelling decisions, who faces putting their trust and confidence in the statust quo way of doing things.

The statistical framework can be difficult to navigate for the following reasons:

1. The statistical framework tends to be described from a method-first perspective.

Instead, this book takes a problem-first approach to exploring Statistics. For example, you can traditionally learn about the type of conclusions that can be drawn from linear regression, or from an ANOVA, but in this book …

1. Descriptions of statistical methods tend to emphasize the requirement of assumptions.

The reality is, assumptions are never true in practice. Instead, this book evaluates assumptions as approximations instead, commenting on whether and how important a good approximation is.

1. Descriptions of statistical methods tend to emphasize the “what” and “how”.

Typical descriptions aim to describe what a model is and how to apply it, spending time on explaining the mechanical guts of the method. But a deeper discussion into why the method is important and set up in a specific way, and how it fits into the statistical framework as a whole is lacking. This book places emphasis on the “why”, and tends to assume you are smart and can figure out how the guts of a method works on your own by doing some googling.

What’s the result? A book that takes a modularized approach to statistical decision making, so that as a problem solver, the reader can make a sequences of decisions to build models that are best suited to address the problem at hand.

As mentioned, the statistical framework that has been defined to date is vast. This book cannot realistically aim to cover everything. Instead, we’ll focus on foundational ways of thinking through regression analysis, which itself is quite broad and useful for solving a wide array of problems. Some examples include linear regression, survival analysis, mixed effects models, and quantile regression. Using statistical jargon, the focus is on describing aspects of conditional distributions across a covariate space. More advanced statistical methods, like state space models or continuous time markov chains, tend to stem from concepts in regression analysis.

Even if you have not taken a Statistics course before, you will find this book accessible because fundamental concepts like probability and distributions are defined from scratch. Even if you are somewhat familiar with Statistics, you may find these fundamental concepts worth revisiting, since you’ll be challenged to re-think why well-known concepts such as the mean and variance are prominent. A basic knowledge of intro calculus is assumed throughout this book, but as a general rule, we will not need to delve into mathematical details.

Methods are demonstrated using the R statistical software. This is because R has an extensive selection of packages available for statistical analysis, and the tidyverse and tidymodels meta-packages make data analysis readable and organized.

## 1.3 A focus on Interpretation

Regression analysis can help solve two main types of problems:

1. Interpreting the relationship between variables.

For example, we might want to know how someone’s age influences the click-through rate on an online advertisement, taking into account other factors like their location and the type of ad presented. Or, perhaps we would like to find the factors that are most highly related to fatal car crashes. Or, how does age influence the chance of pregnancy? How does time of insemination after a spike in Luteinizing hormone affect the chance of pregnancy, and how is this different for people over 40?

1. Predicting a new outcome.

For example, we might want to predict the click-through rate of an online advertisement, given someone’s age and location, and the type of ad presented. Or, we might want to predict the flow of a river given a certain amount of rainfall and snowmelt.

Although both problem types are considered, this book primarily focusses on building models for interpretation. A focus on prediction lies in the realm of machine learning, and a large part of the decision making here involves optimizing a model’s fit to new data – a problem for which there are many resources available.

library(mice)
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
##     cbind, rbind
library(broom)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.4
## ✔ tidyr   1.0.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::lag()      masks stats::lag()
Wage <- ISLR::Wage
titanic <- na.omit(titanic::titanic_train)