Note for Instructors: This module is written to be accessible to both instructors and students. Feel free to assign it as pre-reading or use it as a resource for your own understanding. The core concepts apply to anyone learning to think statistically.
Introduction
As scientists, we’ve all been there: You get excited about a research finding, see that p = 0.03, and think “Great! That means there’s a 97% chance my hypothesis is right!” But then, while scrolling, you run across a video where someone on the internet says that’s not actually what a p-value means at all. The inevitable follow-up question is: “Okay… so what are the chances my hypothesis is right?”
And here’s the problem: with the frequentist statistical tools most of us learn (and teach), we genuinely can’t answer that question.
This course is about learning a statistical framework that can answer the questions scientists (and students) actually want to ask. By the end of this module, you’ll understand why Bayesian statistics offers a more intuitive way of thinking about data, hypotheses, and uncertainty—one that aligns naturally with how scientists actually reason about the world.
Learning Objectives
Upon completion of this module, you will be able to:
Identify the types of questions that frequentist statistics cannot directly answer
Explain what a p-value actually represents (and what it doesn’t)
Describe the limitations of the “reject/fail to reject” framework in scientific practice
Recognize Bayesian reasoning as a formalization of how we naturally update beliefs
Articulate the conceptual advantage of Bayesian methods for addressing research questions
Connect statistical thinking to the replication crisis and methodological reform in science
Key elements of this first unit:
Focused on logic, not on calculations.
You don’t need any computer programming experience.
The goal is to explore concepts and interpretations.
1. The Questions We Really Want to Ask
One of the best parts of being a scientist is being allowed to be curious about the world every day, and to peek at the hidden mechanisms behind both the ostentatious and the (seemingly) mundane. For me, uncovering the secrets of everyday behaviors that most others overlook is especially cool. For instance, I have noticed that people often press the elevator button more than once. Maybe you have noticed this too; if not, now that I have pointed it out you won’t be able to stop noticing it. I suspect people do this because they aren’t sure the press registered, so I would predict that adding feedback confirming the press (like a backlight) will reduce the number of presses. In a simple experiment, we can deactivate the lights behind the buttons for one week and activate them for another week, measuring the number of presses in each condition.
After a lengthy IRB process (in which the reviewers requested detailed brightness specs for the lit buttons and an independent review of finger anatomy), plus several failed grant applications, a few months later we have finally made some observations (i.e., data) and are ready to analyze them. Given all the precision up to this point, we need to be precise here too: What question are we actually asking?
If you are like me, it is going to be something along the lines of one of these:
“How likely is it that feedback changes the amount of button pressing?”
“What are the chances this result would replicate?”
“Should I be confident in this conclusion?”
“How strong is the evidence for my hypothesis?”
These are perfectly reasonable scientific questions. In fact, they’re exactly the kinds of questions we should be asking. They’re about the probability of hypotheses given the data we have observed, which is what science is fundamentally about: evaluating competing explanations in light of evidence.
But here’s the problem: Standard frequentist statistics cannot directly answer any of these questions.
What Frequentist Methods Actually Tell Us
The tools most of us were trained on (and thus teach our students)—t-tests, ANOVAs, chi-square tests—are designed to answer a very different question. They ask: “If my null hypothesis were true, how surprising would my data be?” Or in terms of this example: “If adding a light doesn’t do anything, how rarely would I get a difference in button presses this big or larger?”
That’s what a p-value represents: the probability of getting data as extreme as (or more extreme than) what you observed, assuming the null hypothesis is true.
In notation: P(Data | H₀)
Notice the direction here. We’re conditioning on the hypothesis and asking about the data. But what we really want is the reverse: to condition on the data and ask about the hypothesis.
We want: P(H | Data)
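If that still feels abstract, here is a minimal sketch (using invented counts, and no statistics at all) showing that the two directions of a conditional probability are genuinely different numbers:

```python
# Minimal sketch: the two directions of a conditional probability
# are not the same quantity. Counts below are invented for illustration.
counts = {
    # (species, has_four_legs): number of animals at a hypothetical shelter
    ("dog", True): 40,
    ("cat", True): 50,
    ("bird", False): 10,
}

total = sum(counts.values())
p_dog_and_four_legs = counts[("dog", True)] / total
p_dog = sum(n for (sp, _), n in counts.items() if sp == "dog") / total
p_four_legs = sum(n for (_, legs), n in counts.items() if legs) / total

print(p_dog_and_four_legs / p_dog)        # P(four legs | dog)  = 1.00
print(p_dog_and_four_legs / p_four_legs)  # P(dog | four legs) ~= 0.44
```

Every dog in this made-up shelter has four legs, but most four-legged animals there are not dogs. Swapping what you condition on changes the answer completely, and the same is true for hypotheses and data.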
This is not a subtle difference. These are fundamentally different quantities, and confusing them leads to some of the most common and consequential misinterpretations in science. If this is news to you, you are not alone. Many instructors teaching research methods and statistical tests would struggle to accurately define a p-value (see Haller & Krauss, 2002, for empirical misconception rates of up to 72% among instructors). I’ve definitely felt like I was getting a workout from the mental gymnastics required to fully understand how the p-value connects to what I actually want to know.
If a bunch of people with advanced specialized degrees whose work critically depends on applying these concepts in the real world struggle to understand what they are actually doing, what does this imply about our ability to pass the baton of knowledge to the next generation? Which is worse, that we don’t realize we’re misconstruing what these numbers mean (while hanging ALL of our inferences on it and the credibility of our scientific disciplines) or that we keep teaching it this way even though we know how limited it is?
[Figure: The avatar of calculus takes aim at you with his fiery integral bow, but little does he know you wield the power of computing.]
I think we can do better. There are tools out there that make much more sense for directly answering the kinds of questions we are interested in. I got “lucky” and took a postdoc in a lab that used Bayesian approaches, which put me in a sink-or-swim situation. But I found Bayesian ideas to be profoundly more rational and useful for the kinds of work most of us in the life sciences do, with applications far beyond statistical inference (we’ll explore some of these in Module 5). And while there have been prophets touring from conference to conference, guest lecturing on the value of Bayes and how we need a Bayesian revolution, every time you try to find a straightforward introduction to the topic, they want you to be a calculus expert, enroll in a lengthy or expensive course, or have a strong background in programming your own scripts. But the reality is that Thomas Bayes (and Pierre-Simon Laplace, who may actually be the real hero; more on that in Module 2) didn’t have computers to program. Bayes’ rule is an elegant and simple idea with deep implications and applicability. Even if we didn’t get much training in Bayes, we can all learn the basics and give our students a strong foundation on which to build as they learn more. After all, we have a duty to do a better job preparing our students than our intellectual forebears did for us.
2. What a P-Value Is (and Isn’t)
Let’s be crystal clear about what a p-value means. We’ll put it in the context of a t-test for example’s sake, but this applies to p-values generally. If we measured any two samples, even two drawn from the same population, we would almost never get exactly the same average in both groups, for any measurable trait. So we have to ask: “If we could account for the variability due to sampling error, would the groups still be different?” Laplace figured out that there is a pattern to the error, and thus we can account for it. In his “Law of Errors” he showed that the error follows a normal distribution; that is, two samples from the same population are more likely to differ by a small amount than a large amount due to error alone.
Note (not all p-values are alike): While I said this definition applies generally, some statistical tests ask very different questions than we might think. For example, a Wilcoxon nonparametric test doesn’t test whether the means are different; rather, its null hypothesis asserts that if you randomly pick one value from Group A and one value from Group B, the probability that the value from A is larger than the value from B is exactly 50% (or 0.5). In statistical terms: H₀: P(A > B) = 0.5. Since we know real-life data contain some random error, the p-value asks: “If the null hypothesis is true (i.e., if P(A > B) really is 50%), what is the probability that we would get a ranking pattern as extreme as (or more extreme than) the one we observed in our sample, just due to random chance?” But just typing all of that makes me want to sell all my belongings, move to Kalamazoo, and open a woodcarving shop. I can’t foist this nuance upon my sophomore undergrad students, bright-eyed and bushy-tailed and ready to make a real difference in the world. It’s too late for me, but we can save them from this madness.
Imagine that you made a distribution of the mean differences from every possible pair of samples drawn from the population. This is known as a sampling distribution. Since the error follows a normal distribution, you can use that knowledge to determine the probability of getting any particular difference in means. You could then say something like, “we’d only get a standardized difference bigger than ±1.96 standard errors five percent of the time due to random chance.” That seems rare, so if our test statistic lands beyond ±1.96, we can say that the null hypothesis isn’t a good explanation for these data. To calculate this test statistic, divide your observed mean difference by your estimated standard error; your p-value then tells you how often you’d get a difference that big (or bigger) due to random chance alone.
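If you are comfortable with a little code (entirely optional in this course), here is a minimal simulation sketch of that idea; the population parameters and the “observed” statistic below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Minimal simulation sketch: draw TWO samples from the SAME population,
# many times over, and record the standardized mean difference each time.
pop_mean, pop_sd, n, reps = 10.0, 2.0, 30, 100_000

a = rng.normal(pop_mean, pop_sd, size=(reps, n))
b = rng.normal(pop_mean, pop_sd, size=(reps, n))

# Test statistic: mean difference divided by its estimated standard error
se = np.sqrt(a.var(axis=1, ddof=1) / n + b.var(axis=1, ddof=1) / n)
t_stats = (a.mean(axis=1) - b.mean(axis=1)) / se

# Pure sampling error pushes the statistic past +/- 1.96 about 5% of the time
print((np.abs(t_stats) > 1.96).mean())       # ~0.05

# The p-value for a hypothetical observed statistic is just its tail area
observed = 2.2
print((np.abs(t_stats) >= observed).mean())  # ~0.03
```

No theory required: we literally drew two samples from the same population 100,000 times and asked how often chance alone produces a difference as big as ours.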
In simpler terms, if you run a study and get p = 0.03, here’s what that tells you:
“If the null hypothesis were true, and I repeated this exact experiment infinite times, I would expect to see data this extreme (or more extreme) about 3% of the time, purely by chance.”
That’s it. That’s all a p-value can tell you.
Here’s what it does NOT tell you:
❌ The probability that your hypothesis is true
❌ The probability that the null hypothesis is false
❌ The probability that your result is a “real” effect
❌ The probability that your result will replicate
❌ The strength of evidence for your hypothesis
And yet, if you look at how p-values are discussed in published papers, student theses, and even in some textbooks, you’ll find these exact interpretations everywhere.
Why This Matters
This isn’t just a semantic quibble. Misinterpreting p-values has real consequences:
Example: The “Significant” Misinterpretation
A student finds p = 0.04 and concludes there’s a 96% chance their treatment works. They might design their entire dissertation around this “confirmed” effect. But the p-value doesn’t tell them that. All it says is that if there were no effect, data like theirs would be somewhat unusual.
But “somewhat unusual under the null” is very different from “probably true.” A result can be statistically significant and still have a low probability of being true, especially if:
The effect size is small
The measurement is noisy
Similar hypotheses in this field rarely pan out (low prior probability)
The researcher tested many outcomes and only reported this one (p-hacking)
The p-value doesn’t account for any of these factors. Bayesian methods do.
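For the programming-inclined, here is a minimal sketch (the prior and power values are assumed for illustration) of how those ignored factors change what a “significant” result is actually worth. It previews the Bayesian bookkeeping we will formalize in Module 2:

```python
# Minimal sketch (assumed numbers): the probability that a "significant"
# result reflects a real effect depends on factors the p-value ignores.

def prob_real_given_significant(prior, power, alpha=0.05):
    """P(effect is real | p < alpha), via Bayes' rule."""
    true_pos = power * prior    # real effects that reach significance
    false_pos = alpha * (1 - prior)  # null effects that do so by chance
    return true_pos / (true_pos + false_pos)

# A well-powered test of a plausible hypothesis:
print(prob_real_given_significant(prior=0.50, power=0.80))  # ~0.94

# A noisy, underpowered test of a long-shot hypothesis:
print(prob_real_given_significant(prior=0.05, power=0.20))  # ~0.17
```

Same p < 0.05 in both cases, wildly different probabilities that the effect is real.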
3. The Crisis We’re All Navigating
If you’ve been in science for more than a few years, you’ve heard about the “replication crisis.” Study after study in psychology, medicine, and even ecology has failed to replicate. Some estimates suggest that more than half of published findings may be false positives.
How did we get here?
Part of the problem is the way we use (and misuse) frequentist statistics:
The Publish-or-Perish Pressure Cooker
The academic incentive structure rewards “significant” results. Papers with p < 0.05 get published; papers with p = 0.08 get rejected or buried. This creates enormous pressure to find significance, leading to:
P-hacking: Running multiple analyses and only reporting the one that “works”
HARKing: Hypothesizing After Results are Known
Optional stopping: Collecting data until you hit significance, then stopping
The file drawer problem: Null results never see the light of day, no matter how well designed the study was
The Arbitrary Threshold Problem
Why is p = 0.049 publishable but p = 0.051 is not? The 0.05 threshold is arbitrary—a convention from the 1920s that has calcified into dogma. But reality doesn’t have a sharp cliff at p = 0.05. Evidence exists on a continuum.
The frequentist framework, with its “reject” or “fail to reject” dichotomy, forces us to collapse that continuum into a binary decision. It’s like asking “Is it raining?” when what we really need to know is “Should I bring an umbrella, a poncho, or build an ark?”
A Statistical Approach for a Different Era
It’s worth noting that frequentist methods were developed in an era when computation was expensive and data was (relatively) cheap. P-values gave researchers a simple rule: collect data, calculate a number, make a decision. It worked well for agricultural field trials and quality control in factories.
But modern science is different. We often have:
Complex, multivariate data
Expensive or limited samples
Questions about strength of evidence, not just “yes/no” decisions
A need to integrate across multiple studies
Access to massive computational power
The tools should match the task. And increasingly, researchers are recognizing that Bayesian methods are better suited to the questions we’re actually trying to answer.
4. Introducing “Belief”: How We Actually Think
How do your beliefs continuously update as you gain new information over time?
Here’s a simple scenario. Imagine you’re meeting a friend for coffee at 2:00 PM. It’s now:
2:05 PM – Your friend isn’t there yet. Your thought: “Probably caught at a stoplight. They’ll be here any second.”
2:15 PM – Still not there. Your thought: “Maybe they hit traffic? Or couldn’t find parking?”
2:45 PM – Still not there. Your calls go to voicemail. Your thought: “Okay, something is actually wrong. Did they forget? Are they okay?”
3:30 PM – You’re about to leave. Your thought: “They definitely forgot. Or there was an emergency. Either way, I’m leaving.”
Notice what you did. You started with a belief (they’re on their way), and as new evidence came in (more time passed with no arrival), you updated that belief. By 3:30 PM, you’re not thinking “they’re running late”—you’ve moved to a qualitatively different explanation.
This is Bayesian reasoning. You didn’t run a t-test. You didn’t calculate a p-value. You took your prior expectation, observed new data, and revised your beliefs accordingly. This is how humans naturally reason under uncertainty.
Formalizing Intuition
What Bayesian statistics does is take this intuitive process and give it a rigorous mathematical structure. It provides a framework for:
Starting with what you know (or what you assume)—the prior
Observing new data—the evidence
Updating your beliefs in a logically consistent way—the posterior
The beautiful thing? The math isn’t telling you what to believe. It’s telling you how to update your beliefs when new evidence comes in. The logic is simple:
If the data are very consistent with Hypothesis A and very inconsistent with Hypothesis B, you should believe A more strongly than B (all else being equal).
That’s it. That’s Bayesian inference in plain English.
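To see that logic made concrete, here is a minimal sketch of the coffee-shop update; every probability in it is invented for illustration:

```python
# Minimal sketch (invented numbers): updating belief in two competing
# explanations as evidence accumulates.

# Two hypotheses about your absent friend:
#   "late"   -- they're on their way, just delayed
#   "forgot" -- they aren't coming
prior = {"late": 0.90, "forgot": 0.10}

# How likely is "still not here after 45 minutes AND calls go to
# voicemail" under each hypothesis? (Invented for illustration.)
likelihood = {"late": 0.05, "forgot": 0.60}

# Bayes' rule: posterior is proportional to prior * likelihood
unnorm = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnorm.values())
posterior = {h: round(unnorm[h] / total, 2) for h in unnorm}

print(posterior)  # {'late': 0.43, 'forgot': 0.57} -- belief has flipped
```

Change the invented numbers and the posterior changes accordingly; the updating rule itself stays the same.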
5. What Bayes Does: A Sneak Preview
We’ll dive deep into the mechanics in the next module, but here’s the basic idea:
Bayesian statistics gives us a formal engine for combining:
Prior Knowledge – What you believe (or assume) before seeing the data
New Evidence – The data you just collected
Updated Belief – Your revised belief after seeing the data
The output is a probability distribution over hypotheses. Not a binary “reject” or “fail to reject,” but a full picture of what’s plausible given everything you know.
What This Means for Your Students
Imagine being able to tell your student:
“Based on your data, there’s about a 99% probability that Fertilizer A produces more growth than Fertilizer B. The most likely true difference is around 4 cm, though it could reasonably be anywhere from 1 to 7 cm.”
That’s the kind of statement Bayesian methods let you make. It directly answers the questions scientists actually care about.
Compare that to:
“We reject the null hypothesis that there is no difference between fertilizers (p = 0.03).”
Which answer would you rather have?
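If you are curious how software arrives at statements like the first one, here is a minimal sketch. It assumes the analysis has already produced a roughly normal posterior for the true difference (centered at 4 cm with SD 1.5 cm, both invented); real tools like JASP estimate that posterior from your data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch (assumed posterior): draw samples from a posterior for
# the true growth difference (Fertilizer A minus B), in cm.
posterior_diff = rng.normal(loc=4.0, scale=1.5, size=100_000)

# "Probability that Fertilizer A produces more growth than B":
print((posterior_diff > 0).mean())  # ~0.996 with these assumed numbers

# A 95% credible interval: the middle 95% of plausible differences.
print(np.percentile(posterior_diff, [2.5, 97.5]))  # ~[1.1, 6.9] cm
```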
Common Questions & Misconceptions
Before we move on, let’s address some questions that come up frequently:
Q: “If p-values don’t tell us what we think they do, why do we still use them?”
A: Historical inertia, mostly. Frequentist methods became the standard in the mid-20th century and were cemented into textbooks, journal guidelines, and statistical software. They’re also simpler to calculate by hand (important before computers). But just because something is traditional doesn’t mean it’s optimal for every question. Plus, there’s a chicken-and-egg problem: We know what we were taught, and we teach what we know. It’s hard work to take time to learn something that is a new framework (and you’re doing it right now, you beautiful thing! You’re doing amazing.)
Q: “Does this mean all published research using p-values is wrong?”
A: Not at all! Frequentist methods can be perfectly valid tools when used correctly and interpreted carefully. The problem is that they’re often misinterpreted and misapplied. Bayesian methods offer an alternative that’s more intuitive for many research questions, especially when you want to make direct probability statements about hypotheses.
Q: “Isn’t Bayesian statistics ‘subjective’ because you choose a prior?”
A: All statistical methods involve subjective choices. Which test to use, when to stop collecting data, which variables to include, how to handle outliers. Frequentist methods hide many of these choices; Bayesian methods make them explicit. That transparency is actually a scientific virtue. Plus, in Module 2, you’ll see how different priors converge as data accumulates, making the choice less critical than you might think.
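As a quick preview of that convergence, here is a minimal sketch with an invented coin-flip dataset, in which two very different priors land in nearly the same place:

```python
# Minimal sketch (invented data): two analysts start with very different
# priors about a coin's probability of heads, then see the same 100 flips.
# With a Beta(a, b) prior, updating on coin flips is just arithmetic:
# posterior mean = (a + heads) / (a + b + flips).
heads, flips = 62, 100  # invented data

priors = {"optimist": (8.0, 2.0),  # expects a heads-heavy coin
          "skeptic": (2.0, 8.0)}   # expects a tails-heavy coin

for name, (a, b) in priors.items():
    prior_mean = a / (a + b)
    posterior_mean = (a + heads) / (a + b + flips)
    print(f"{name}: prior {prior_mean:.2f} -> posterior {posterior_mean:.3f}")

# optimist: 0.80 -> 0.636; skeptic: 0.20 -> 0.582
# They started far apart, but the shared data pulled them together,
# and with more flips they converge further.
```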
Q: “Do I need to choose between frequentist and Bayesian approaches?”
A: No! They’re tools in your toolkit. Some questions are well-suited to frequentist methods, others to Bayesian methods. Many researchers use both. The goal isn’t to replace one with the other, but to understand when each approach is most useful.
Q: “Is this going to require a lot of math or programming?”
A: The underlying math uses calculus and probability theory, but you don’t need to do it by hand. Software handles the computations. In Module 4, you’ll learn to use JASP, a free point-and-click program that makes Bayesian analysis as easy as running a t-test in SPSS. No programming required (though we’ll provide resources if you want to learn R or Python).
The Trade-Off: Transparency About Assumptions
There’s an important caveat: Bayesian methods require you to specify your prior beliefs. This makes some people uncomfortable. It feels “subjective.” Don’t worry, we’ll walk you through it.
But here’s the thing: all statistical analyses involve assumptions. Frequentist methods hide many of their assumptions in the choice of test, the stopping rule, the model structure. Bayesian methods make those assumptions explicit and quantifiable.
Is that a bug or a feature? I’d argue it’s a feature. Science benefits from transparency. If someone disagrees with your prior, they can:
See exactly what you assumed
Re-run the analysis with their own prior
Check whether the conclusions are robust to different priors
This transparency is actually a strength, not a weakness.
6. Where This Course Will Take You
By the end of this course, you’ll be able to:
✅ Teach your students the conceptual logic of Bayesian reasoning
✅ Conduct a Bayesian equivalent of a t-test using free, user-friendly software (JASP)
✅ Interpret Bayesian outputs (Bayes factors, credible intervals, posterior distributions)
✅ Explain how Bayesian thinking connects to neuroscience, animal behavior, and learning
✅ Make the case for why Bayesian methods are particularly valuable in the life sciences
You won’t need to write code (though we’ll provide resources if you want to). You won’t need advanced math (though we won’t shy away from formulas). The focus is on understanding the logic and being able to teach it confidently.
Module Summary
Let’s recap what we’ve covered:
Frequentist statistics answer P(Data | Hypothesis), but scientists want P(Hypothesis | Data)
P-values are widely misinterpreted, leading to poor scientific reasoning
The replication crisis is partly driven by the “reject/fail to reject” dichotomy and perverse incentives
Bayesian reasoning formalizes how we naturally update beliefs in light of evidence
Bayesian methods provide probabilities of hypotheses, directly answering the questions we care about
All statistics involve assumptions; Bayesian methods make those assumptions explicit
The bottom line: You’re not learning Bayesian statistics because frequentist methods are “wrong”. They’re just designed for a different question. Bayesian methods are designed for the questions you and your students are already asking. This course will teach you the logic and tools to start answering those questions with rigor and confidence.
What’s Next
In Module 2, we’ll dive into the mathematics of Bayes’ theorem. You’ll learn the core components—priors, likelihoods, posteriors—and work through concrete examples that bring these ideas to life. By the end of Module 2, you’ll have the conceptual toolkit to think in Bayesian terms.
Let’s get started. Are you ready? Then on to Module 2!
Key References & Further Reading
On P-values and Their Misinterpretation:
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129-133. [The American Statistical Association’s official guidance on p-values]
Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. Seminars in Hematology, 45(3), 135-140.
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research, 7(1), 1-20.
On the Replication Crisis:
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. [The famous study showing that only 36% of psychology studies successfully replicated]
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. [A provocative but influential analysis of publication bias and false positives]
Introduction to Bayesian Thinking:
Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1), 178-206. [Excellent overview of Bayesian methods for experimental scientists]
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6(3), 274-290. [Accessible comparison of the two approaches]
For Life Scientists Specifically:
Ellison, A. M. (2004). Bayesian inference in ecology. Ecology Letters, 7(6), 509-520. [Tailored to ecological applications]
Kéry, M., & Schaub, M. (2011). Bayesian Population Analysis using WinBUGS: A Hierarchical Perspective. Academic Press. [If you want to go deeper into ecological applications]
Reflection Questions
What questions do you wish you could answer about your own research or your students’ projects?
Have you ever had a “significant” result that you weren’t quite sure how to interpret?
How might your teaching change if you could give students probabilities of hypotheses instead of just p-values?
These questions don’t have right or wrong answers; they’re just prompts to help you connect this material to your own experience. Feel free to jot down some thoughts. We’ll revisit these ideas as the course progresses.