Preface
You’re probably wondering about the book title. Here’s how I like to stylize it: Monty the \(\mathcal{H}\mathrm{ipp}_0\). Why? I don’t want to spoil it before we even get started. Check out Chapter 1. In Chapter 2 and beyond I use the word hippopotamus exactly once, in lieu of a different word. See if you can find it. Treat it like a game of “Where’s Waldo?”
My hope with this book is to provide an unconventional approach to a first course in statistics, one that resonates with an audience who grew up in the age of computing. It aims to be friendly in introducing new concepts, without talking down to the uninitiated. You can be the judge as to whether I’ve succeeded in that.
Many professionals, when speaking to me in passing, report that “Stats 101” was the least favorite/hardest class they took toward their degree. I think that’s because stats is taught in a boring, old-fashioned way that focuses on memorization and on carrying out complex, inscrutable formulas that lack intuition. Folks remember working on calculators or spreadsheets, and combing through tables of quantiles and percentiles squirreled away in textbook appendices. Maybe those things made sense at one time. Now they’re out of touch with how people experience data analytics in a modern context, and they do a disservice to an important and interesting area of scientific inquiry.
My target audience is second-year college students who’ve already taken a class on probability, and who are comfortable with calculus. It’s alright if you’re not a probability expert, but it will help a lot to be comfortable with core concepts like mass, density, distribution, expectation, conditioning, independence, etc. Derivative calculus is essential, and integral calculus is helpful. You must have some experience coding. The book is written to help you learn R. Supplementary Python is also provided. You need not be an expert in either, but some familiarity will help. If you’ve never coded in a language like one of those (e.g., MATLAB®, C, C++, Julia, Rust), then you’ll have a hard time with the material here. I hope that this book is also a good fit for master’s-level students who came from another, non-statistical undergraduate program.
This isn’t the kind of book you can skim for formulas. It must be read to understand what you’re meant to learn/do, and the context behind why. Like acquiring any real skill, getting good at this takes practice and time for reflection. Again, this book is intended to be read. That seems like a silly thing to say about a book, but I think it’s important here. Many students treat textbooks as reference manuals, but this one is not intended as a reference manual. My focus is on concepts, not recipes. You can get a lot from the code examples, but if you don’t read the words around the code, and around the math, you’ll miss out: not just on real understanding, but also on getting the homework completely right, and you’ll probably lose points. This book is a few hundred pages. If you can’t be bothered to read twenty pages a week in a fifteen-week college semester, then you probably aren’t ready for college.
At the risk of beating a dead horse, I’ve got two more paragraphs on this. Although this isn’t a reference text, about 75% of the homework questions can be done by simply cutting-and-pasting code from earlier in the chapter with minor modifications – a few characters here and there. The trick is to know what to change/adapt. Students will need help recognizing this, especially if they’re not reading the book, which means not reading this Preface either. A common mistake is to use Chapter 4 methods (approximation) on Chapter 3 questions (exact). This happens because students prefer AI-assisted web search over reading the chapter that the questions are in.1 It sometimes happens as early as Chapter 1 because students think they know how to do it already, like from an earlier stats class, but they’re not noticing that what I’m teaching – and thus encouraging them to practice – is actually a new skill. When I use this material at Virginia Tech (VT), students lose lots of points on early homework sets despite warnings. It takes until Chapter 5 before I start getting taken seriously, and I’ve yet to figure out how to short-circuit that journey.
Presuming that students have prior experience with statistical analysis and probability lets me move a little faster, but it presents a hazard for understanding. I want to show you/your students a new way of doing things, and if you don’t pay attention you’ll likely miss out, and end up circling back later. Try to un-know what you think you know, especially if it falls outside of the scope of the presentation. Keep your mind open. There’s no harm in doing outside research. However, if you’re using something found outside of the text, with the exception of pointers I’ve made explicitly via links, then you’re definitely not doing it how I intended, and you may be doing it wrong. Each homework section has a preamble where I outline how to approach the problems. Usually it’s a long-winded way of saying “use methods from the current chapter, not from somewhere else; only use libraries and other sources to check your work”.
I make many references to other materials that can be found on the internet, either via footnotes (printed versions) or hyperlinks (HTML version). Many of those are links to Wikipedia, which is a wonderful resource. You are encouraged to click on them, but mostly they’re there so you know you can find details, and more precise definitions, somewhere. The idea is to add color and context without derailing the message. Only visit those links if you’re curious. Clicking on them is not required to follow the development, which is intended to be self-contained. This is a book about ideas, not trivia. There are many good sources, including other textbooks; my linking to Wikipedia is expedient, and meant only as an example. I always encourage students to track things down if they’re curious. I hope those links are a good place to start.
My narrative favors process over precision. I’m a little defensive about this, and apologize for it multiple times in the text. There will be times where I need to make seemingly arbitrary adjustments so that a calculation matches output from a software library. I’ll be the first to admit when I don’t know why it works that way. There may be a good reason for it, or it may simply be convention. I arrived at some of the adjustments that I show you by peeking under the hood at the R implementation. Sometimes that code isn’t particularly well documented, and your guess is as good as mine. Often a good guess can be made forensically. Sometimes it’s best to move on.
You might think writing a textbook in such a cavalier way is a cop-out. But I genuinely believe it’s not worth investing neurons on minutiae. I realize that where I draw the line on that is arbitrary and may not be the same place you would. I can be overly compulsive about stuff that I can’t pin down, believe me. I tell myself that edge cases are inevitable, and usually you can’t address all of them and keep your sanity. Precision can be the enemy of understanding. I want my readers to get a feeling for how statistics works. Sometimes you can draw comfort from seeing how numbers line up with some benchmark. Yet in many cases that benchmark is an approximation, and often the sense in which that approximation is good (or eventually converges asymptotically) is either unrealistic, or is based on assumptions that are unverifiable. I hope to impress upon you that my preferred approach – more on that in Chapter 1 – suffers fewer limitations because it doesn’t cut corners. Getting an accurate answer might require a lot (more) computer work. But it’s not your work; the transistors are doing it for you once the code is written.
I’m not going to list the subjects that are covered in this book. That’s what the Contents is for. I do want to say something about broad themes. There are five modules, as it were, though I don’t make a show of that with any special headings. The first two chapters are a warm-up. The idea is to give you a feel for how I like to think about statistical inference. The next two add mathematical rigor to, and expand upon, ideas from the first two chapters. The third module branches out into more complex methodology, filling out a “classical” parametric approach to basic applied statistics, all the way through to simple linear regression (SLR). Then I transition to nonparametric (non-P) methods. This is unconventional for an introductory statistics text, but I think that my unique approach (still not giving that away here) makes non-P accessible. Non-P methods are fascinating and beautiful, both conceptually and in practice. Many of the topics from the first half of the book are revisited here, through a new lens. This helps reinforce understanding, while broadening the toolkit. The book concludes with a final chapter on “fancy” regression methods, including transformations and multiple linear regression. I hope this is a “treat”, to whet your appetite for a second course in statistics.
There are many subjects that are not covered here, but very well could be. Perhaps the most important omission is power. I allude to the power of statistical tests from time to time, and point to Appendix A.2 for a glimpse at details. I won’t deny the importance of power analysis, but I don’t teach it in any of my introductory classes on statistics. My feeling is that it’s too in the weeds for a first pass. Before you have a solid grasp of the fundamentals involved – what comprises a statistical test, operationally speaking – it doesn’t make much sense to discuss which tests are “better” than others in terms of power.
Every subject herein is paired with illustration in worked code. There’s not a single table or figure in the book – hand drawings excepted – that’s not supported by code on the page. Everything is fully reproducible via bookdown (Xie 2018a) on CRAN, which combines the knitr (Xie 2015, 2018b) and rmarkdown (J. J. Allaire et al. 2018) packages. What you see is what you get (WYSIWYG). Many examples and some data sets are based on
pseudo-random numbers. This means that
when you run the code on your machine, you won’t get the same thing that I
print in the book. Any time random numbers are involved, I encourage you to
run the example multiple times to see how answers may vary from one random
“instance” to another. It’s impossible to fully remove randomness from the
experience of engaging with this book’s material. That results in some hedged verbiage, at times, on my end.
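To make that point concrete, here is a minimal sketch in Python (the book’s supplementary language; this example is mine, not a listing from the book) of how two runs of the same pseudo-random calculation give different answers:

```python
import random

# Two runs of the same simple Monte Carlo calculation. Because no seed is
# fixed, the runs will (almost surely) disagree, just as your runs will
# differ from the output printed in the book.
def mc_mean(n=100000):
    """Monte Carlo estimate of the mean of a Uniform(0, 1) draw."""
    return sum(random.random() for _ in range(n)) / n

est1 = mc_mean()
est2 = mc_mean()

# Both estimates are close to the true mean of 1/2, but they differ in
# later decimal places from run to run.
print(est1, est2)
```

Fixing a seed (e.g., `random.seed(42)`, or `set.seed(42)` in R) would make the runs identical, but embracing the variation is part of the point.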
Each chapter ends with homework exercises that have been vetted over three semesters of teaching at VT. Some of those questions are mathematical in nature, but most are practical implementation exercises and data analysis. Before jumping into any data analysis, I encourage students to code up their own “library version” of the procedures required, which usually means tests and confidence intervals. As with the examples that illustrate every subject in the main body of the chapters, data for the homeworks come from a variety of sources. Some of them are totally fabricated; others come from repositories like the Data and Story Library (DASL). Some have been passed down from colleagues of mine over the years; others are borrowed from the academic literature. Some I have found on my own, or pulled from other resources on the web. Whenever possible, links to origins and citations to research papers are provided.
Although my presentation exclusively favors R, supplementary
Python code is provided on the book
webpage. Unfortunately, many of the library
procedures that are showcased for comparison in R are not available in
Python. All of the bespoke code – that means the code that I wrote from
scratch for this book – is relatively easy to translate. Many thanks to
Anya Raval, a talented student at
VT, for the Python code files linked from
the book webpage. I hope it would be similarly straightforward, if tedious, to
port things over to other languages like MATLAB or
Julia. Most of the code in this book involves simple for
loops, random number generation and vectorized aggregation (say via mean or
var). All of the homework questions can be solved in any language, but it
could be handy to have R to check your work. R is an easy language to learn if
you already have familiarity with Python, MATLAB or Julia. Let me encourage
you to use this book as a tutorial and, at the end, add another language to
your résumé.
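As a hypothetical illustration (mine, not a listing from the book), the coding style described above translates almost line for line into Python: a plain for loop, pseudo-random number generation, and aggregation with mean and variance (analogues of R’s `mean` and `var`).

```python
import random
import statistics

# Simulate the sampling distribution of the sample mean: a for loop over
# Monte Carlo repetitions, random draws within each, and mean/var to
# aggregate, mirroring the book's typical R style.
reps = 2000   # number of Monte Carlo repetitions
n = 30        # sample size within each repetition

xbars = []
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]  # n standard normals
    xbars.append(statistics.mean(sample))            # aggregate each sample

# The variance of the sample means should be close to 1/n.
print(statistics.mean(xbars), statistics.variance(xbars))
```

The same three ingredients – loop, random draws, aggregation – port readily to MATLAB, Julia, or any language with a random number generator.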
Code readability, reliability and reproducibility of output are at least as
important as efficiency. R code in this book is peppered with ample
commentary, both as coded ## comments and as prose around the code. Partly
to ensure that the code is human-digestible, I don’t make use of the
Tidyverse. Everything is ordinary R. The Tidyverse
is a very important part of the R ecosystem, but its target audience is data
wranglers. This is not a data-wrangling book; it is an algorithms and
procedures text. I want anyone with coding experience, not only R/tidy
experts, to be able to follow the examples. I also aimed for the bare minimum
of dependencies, in terms of other R packages. I wanted most of the code to
be easy to translate to other languages, and to be executable without an
internet connection (to install packages, etc.). In part for this reason, I
use basic R plotting as opposed to ggplot (Wickham 2016). Colleagues sometimes
tell me that ggplot makes prettier plots than base R does. Aesthetics are a
matter of opinion. For what I need in this book, base R plotting is
sufficient and provides nice-looking graphics with simple code.
I’m ready to get started – are you? I have a few folks to thank before jumping in. I am grateful to the VT Department of Statistics and the Computational Modeling and Data Analytics (CMDA) program at VT for allowing me to teach CMDA 2006 on multiple occasions. The material from this book came primarily from that course, but also from a nonparametric statistics class for VT/Stats, and from classes that I taught while at the University of Chicago. The first version of CMDA 2006 was during the Covid-19 pandemic, which turned out to be a blessing for this material/book. Online delivery meant that I had to be careful about the presentation. I needed a record of all of the code and bookwork, which made it easier to re-purpose for this project. I’m not saying I’m grateful to a virus, but it is what it is. I’m saying this book is lemonade.
I am grateful for my family. They’ve been patient with me, right from the moment their eyes bulged out of their skulls when I said I was thinking I’d write another book. Fortunately, this one was easier than the first one. Practice and experience make everything easier and more efficient. Most important of all, though, is a comfortable base of operations, and that’s what family is for. Mine’s the mostest bestest. Mama and kiddos: I hope that I help you be your best the way you do for me.
Robert B. Gramacy
Blacksburg, VA
References
Allaire, J. J., et al. 2018. rmarkdown: Dynamic Documents for R. https://CRAN.R-project.org/package=rmarkdown.
Wickham, H. 2016. ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer-Verlag. https://ggplot2.tidyverse.org.
Xie, Y. 2015. Dynamic Documents with R and knitr. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC. http://yihui.name/knitr/.
Xie, Y. 2018a. bookdown: Authoring Books and Technical Documents with R Markdown. https://CRAN.R-project.org/package=bookdown.
Xie, Y. 2018b. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.
Although this may change over time, not much of what I’m teaching in this book is AI-searchable at the time of writing.↩︎