If you’ve ever read a wild headline like, “Study Shows Chewing Rocks Prevents Cancer,” you’ve probably wondered how that could be possible.

If you look closer at this type of article you may find that the sample size for the study was a mere handful of people.

If one person in a group of five chewed rocks and didn’t get cancer, does that mean chewing rocks prevented cancer?

Definitely not. The study for such a conclusion doesn’t have statistical significance—though the study was performed, its conclusions don’t really mean anything because the sample size was small.

So what is statistical significance, and how do you calculate it?

In this article, we’ll cover what it is, when it’s used, and go step-by-step through the process of determining if an experiment is statistically significant on your own.

## What Is Statistical Significance?

As I mentioned above, the fake study about chewing rocks isn’t statistically significant. What that means is that the conclusion reached in it isn’t valid, because there’s not enough evidence that what happened was not random chance.

A statistically significant result would be one where, after rigorous testing, you reach a certain degree of confidence in the results.

We call that degree of confidence our confidence level, which expresses how sure we are that our data was not skewed by random chance. More specifically, the confidence level is the share of intervals, calculated from repeated samples, that would contain the true value of the parameter we’re testing.

There are two major ways of determining statistical significance, and for the same alpha they lead to the same conclusion:

- If you run an experiment and your p-value is less than your alpha (significance) level, your test is statistically significant

- If your confidence interval doesn’t contain your null hypothesis value, your test is statistically significant

Whenever your p-value is less than your alpha, the matching confidence interval will not contain your null hypothesis value, so the two checks agree.

This info probably doesn’t make a whole lot of sense if you’re not already acquainted with the terms involved in calculating statistical significance, so let’s take a look at what it means in practice.

Say, for example, that we want to determine the average typing speed of 12-year-olds in America. We’ll confirm our results using the second method, our confidence interval, as it’s the simplest to explain quickly.

First, we’ll need to set our alpha (significance level), the cutoff we’ll compare our p-value against. The p-value tells us the probability of our results being at least as extreme as they were in our sample data if our null hypothesis (a statement that there is no difference between tested information, such as that all 12-year-old students type at the same speed) is true.

A typical alpha is 5 percent, or 0.05, which is appropriate for many situations but can be lowered for more sensitive experiments, such as in building airplanes. For our experiment, 5 percent is fine.

If our alpha is 5 percent, our confidence level is 95 percent; it’s always the complement of your alpha (1 minus alpha). Our confidence level expresses how often, if we were to repeat our experiment with new samples, the interval we calculate would contain the true average. It is not a representation of the likelihood that the entire population will fall within this range.

Testing the typing speed of every 12-year-old in America is unfeasible, so we’ll take a sample—100 12-year-olds from a variety of places and backgrounds within the US. Once we average all that data, we determine the average typing speed of our sample is 45 words per minute, with a standard deviation of five words per minute.

From there, we can estimate that the average typing speed of 12-year-olds in America is somewhere between $45 – z(5/√100)$ and $45 + z(5/√100)$ words per minute. We divide the standard deviation by the square root of the sample size because we’re estimating an average, and the average of 100 students varies far less from sample to sample than individual students do. That’s our confidence interval: a range of numbers we can be confident contains our true value, in this case the real average of the typing speed of 12-year-old Americans.

Our z-score, ‘z,’ is determined by our confidence level. For a 95 percent confidence level, z is 1.96.

In our case, that would look like $45 – 1.96(0.5)$ and $45 + 1.96(0.5)$, making our confidence interval 44.02 to 45.98.

A larger standard deviation, say 15 words per minute, would give us a wider interval at the same confidence level ($45 ± 1.96(1.5)$, or 42.06 to 47.94), which makes our estimate less precise.

More importantly for our purposes, if your confidence interval doesn’t include the null hypothesis value, your result is statistically significant.

Suppose, purely for illustration, that our null hypothesis had been that the true average is 30 words per minute. That value falls outside our interval, so we would reject it and call our result significant.
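The interval for a sample average can be sketched in a few lines of Python. This is a minimal illustration, not a full statistics routine: the function name `mean_confidence_interval` and the null value of 30 words per minute are made up for the example, and the margin uses the standard error of the mean (the standard deviation divided by the square root of the sample size).

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Return (low, high) for a z-based confidence interval around a sample mean."""
    alpha = 1 - confidence
    # Two-sided z-score: about 1.96 for a 95 percent confidence level.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the mean shrinks as the sample grows.
    standard_error = sample_sd / sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


# Typing-speed sample from the article: mean 45, standard deviation 5, n = 100.
low, high = mean_confidence_interval(45, 5, 100)

# Hypothetical null value, chosen only for illustration.
null_value = 30
significant = not (low <= null_value <= high)
print(f"interval: {low:.2f} to {high:.2f}, significant: {significant}")
```

Raising `n` narrows the interval, while raising `sample_sd` or `confidence` widens it.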

One reason to be cautious about how much confidence you place in a result is sampling bias. Sampling bias, which is a common cause of skewed data, is what happens when your sample doesn’t represent the population you’re studying. (It’s often loosely called sampling error, though strictly speaking sampling error is the natural difference between a random sample and the full population.)

For example, if you polled a group of people at McDonald’s about their favorite foods, you’d probably get a good amount of people saying hamburgers. If you polled the people at a vegan restaurant, you’d be unlikely to get the same results, so if your conclusion from the first study is that most people’s favorite food is hamburgers, you’re relying on a biased sample.

It’s important to remember that statistical significance is not necessarily a guarantee that something is objectively true.

Statistical significance can be strong or weak, and researchers can factor in bias or variances to figure out how valid the conclusion is. Any rigorous study will have numerous phases of testing—one person chewing rocks and not getting cancer is not a rigorous study.

Essentially, statistical significance tells you that your hypothesis has basis and is worth studying further.

For example, say you have a suspicion that a quarter might be weighted unevenly. If you flip it 100 times and get 75 heads and 25 tails, that might suggest that the coin is rigged. A fair coin would produce a split that lopsided far less than 5 percent of the time, so the result is statistically significant.

Because each coin flip has a 50/50 chance of being heads or tails, these results would tell you to look deeper into it, not that your coin is definitely rigged to flip heads over tails. The results are statistically significant in that there is a clear tendency to flip heads over tails, but that itself is not an indication that the coin is flawed.
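To see how unusual 75 heads in 100 flips would be for a fair coin, here is a small sketch that counts the outcomes directly. The function name `fair_coin_p_value` is invented for this example, and it assumes a two-sided question (too many heads or too many tails both count as extreme).

```python
from math import comb


def fair_coin_p_value(heads: int, flips: int) -> float:
    """Exact two-sided p-value for `heads` out of `flips`, assuming a fair coin."""
    # How far the observed count is from the expected 50/50 split.
    distance = abs(heads - flips / 2)
    # Count every sequence of flips whose head count is at least that far
    # from the middle, in either direction.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    # Each of the 2**flips sequences is equally likely for a fair coin.
    return extreme / 2 ** flips


alpha = 0.05
p_value = fair_coin_p_value(75, 100)
print(f"p-value = {p_value:.2e}, significant: {p_value < alpha}")
```

A tiny p-value here says the result is hard to explain by chance alone. It does not say why the coin behaved that way, which is why the next step is a closer look rather than a verdict.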

## What Is Statistical Significance Used For?

Statistical significance is important in a variety of fields. Any time you need to test whether something is effective, statistical significance plays a role.

This can be very simple, like determining whether the dice produced for a tabletop role-playing game are well-balanced, or it can be very complex, like determining whether a new medicine that sometimes causes an unpleasant side effect is still worth releasing.

Statistical significance is also frequently used in business to determine whether one thing is more effective than another. This is called A/B testing—two variants, one A and one B, are tested to see which is more successful.

In school, you’re most likely to learn about statistical significance in a science or statistics context, but it can be applied in a great number of fields. Any time you need to determine whether something is demonstrably true or just up to chance, you can use statistical significance!

## How to Calculate Statistical Significance

Calculating statistical significance is complex, so most people use calculators rather than try to solve equations by hand. Z-test calculators and t-test calculators are two ways you can drastically slim down the amount of work you have to do.

However, learning how to calculate statistical significance by hand is a great way to ensure you really understand how each piece works. Let’s go through the process step by step!

### Step 1: Set a Null Hypothesis

To set up calculating statistical significance, first designate your null hypothesis, or $H_0$. Your null hypothesis should state that there is no difference between your data sets.

For example, let’s say we’re testing the effectiveness of a fertilizer by taking a group of 20 plants and treating half of them with fertilizer. Our null hypothesis will be something like, “This fertilizer will have no effect on the plant’s growth.”

### Step 2: Set an Alternative Hypothesis

Next, you need an alternative hypothesis, $H_a$. Your alternative hypothesis is generally the opposite of your null hypothesis, so in this case it would be something like, “This fertilizer will cause the plants that get treated with it to grow faster.”

### Step 3: Determine Your Alpha

Third, you’ll want to set the significance level, also known as alpha, or α.

The alpha is the probability of rejecting a null hypothesis when that hypothesis is true.

In the case of our fertilizer example, the alpha is the probability of concluding that the fertilizer does make plants treated with it grow more when the fertilizer does not actually have an effect.

An alpha of 0.05, or 5 percent, is standard, but if you’re running a particularly sensitive experiment, such as testing a medicine or building an airplane, 0.01 may be more appropriate. For our fertilizer experiment, a 0.05 alpha is fine.

Your confidence level is $(1 – α) × 100%$, so if your alpha is 0.05, that makes your confidence level 95%.

Again, your alpha can be changed depending on the sensitivity of the experiment, but most will use 0.05.

### Step 4: One- or Two-Tailed Test

Fourth, you’ll need to decide whether a one- or two-tailed test is more appropriate.

One-tailed tests examine the relationship between two things in one direction, such as if the fertilizer makes the plant grow. A two-tailed test measures in two directions, such as if the fertilizer makes the plant grow or shrink.

Since in our example we don’t want to know if the plant shrinks, we’d choose a one-tailed test. But if we were testing something more complex, like whether a particular ad placement made customers more likely to click on it or less likely to click on it, a two-tailed test would be more appropriate.

A two-tailed test is also appropriate if you’re not sure which direction the results will go, just that you think there will be an effect. For example, if you wanted to test whether or not adding salt to boiling water while making pasta made a difference to taste, but weren’t sure if it would have a positive or negative effect, you’d probably want to go with a two-tailed test.
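The practical difference between the two choices shows up in the cutoff your result has to beat. The sketch below uses the standard normal distribution as a stand-in to keep the example short (the t-test later in this walkthrough would use a t-table instead), and the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed: all of alpha sits in a single tail of the distribution.
one_tailed_cutoff = standard_normal.inv_cdf(1 - alpha)

# Two-tailed: alpha is split evenly between the two tails.
two_tailed_cutoff = standard_normal.inv_cdf(1 - alpha / 2)

print(f"one-tailed cutoff: {one_tailed_cutoff:.3f}")
print(f"two-tailed cutoff: {two_tailed_cutoff:.3f}")
```

The two-tailed cutoff is higher, so the same data has to show a stronger effect to count as significant when you test in both directions.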

### Step 5: Sample Size

Next, determine your sample size. To do so, you’ll conduct a power analysis, which gives you the probability of seeing your hypothesis demonstrated given a particular sample size.

Statistical power tells us the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true.

A higher statistical power lowers our probability of getting a false negative response for our experiment.

In the case of our fertilizer experiment, a higher statistical power means that we will be less likely to accept that there is no effect from fertilizer when there is, in fact, an effect.

A power analysis consists of four major pieces:

- The effect size, which tells us the magnitude of a result within the population
- The sample size, which tells us how many observations we have within the sample
- The significance level, which is our alpha
- The statistical power, which is the probability that we accept an alternative hypothesis if it is true

Many experiments are run with a typical power of 80 percent, which corresponds to a false negative rate, or β, of 20 percent (power is $1 – β$). Because these calculations are complex, it’s not recommended to try to calculate them by hand. Instead, most people will use a calculator to figure out their sample size.

Conducting a power analysis lets you know how big of a sample size you’ll need to determine statistical significance.

If you only test on a handful of samples, you may end up with a result that’s inaccurate—it may give you a false positive or a false negative. Doing an accurate power analysis helps ensure that your results are legitimate.
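To give a feel for how the four pieces fit together, here is a rough sketch of a sample-size estimate. It relies on several assumptions that are not part of the fertilizer example: it uses the common normal approximation for comparing two group averages, the function name `plants_per_group` is invented, and the effect size of 0.8 (the difference in averages divided by the standard deviation) is a hypothetical value. Dedicated power calculators use more exact methods and may give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def plants_per_group(effect_size, alpha=0.05, power=0.80, tails=1):
    """Approximate sample size per group for comparing two group averages.

    Normal-approximation formula: n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    """
    normal = NormalDist()
    # Cutoff for the significance level (alpha is split in half if two-tailed).
    z_alpha = normal.inv_cdf(1 - alpha / tails)
    # z-score matching the desired power (1 minus the false negative rate).
    z_power = normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical effect size of 0.8, chosen only for illustration.
n = plants_per_group(effect_size=0.8)
print(f"plants needed per group: {n}")
```

Smaller effects need more observations: lowering `effect_size` or raising `power` makes the required sample grow quickly.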

### Step 6: Find Standard Deviation

Sixth, you’ll be calculating the standard deviation, $s$ (the matching value for a whole population is written as $σ$). The standard deviation tells you how spread out your data is, and this is where the formula gets particularly complex.

The formula for standard deviation of a sample is:

$$s = \sqrt{\frac{\sum (x_i - \mu)^2}{N - 1}}$$

In this equation,

- $s$ is the standard deviation
- $∑$ tells you to sum the values that follow it across all the data you collected
- $x_i$ is each individual data point
- $µ$ is the mean of your data for each group
- $N$ is the number of observations in your sample

So, to work this out, let’s go with our preliminary fertilizer test on ten plants, which might give us data something like this:

| Plant | Growth (inches) |
| --- | --- |
| 1 | 2 |
| 2 | 1 |
| 3 | 4 |
| 4 | 5 |
| 5 | 3 |
| 6 | 1 |
| 7 | 5 |
| 8 | 4 |
| 9 | 4 |
| 10 | 4 |

We need to average that data, so we add it all together and divide by the total sample number.

$(2 + 1 + 4 + 5 + 3 + 1 + 5 + 4 + 4 + 4) / 10 = 3.3$

Next, we subtract the average from each sample $(x_i – µ)$, which will look like this:

| Plant | Growth (inches) | $x_i – µ$ |
| --- | --- | --- |
| 1 | 2 | -1.3 |
| 2 | 1 | -2.3 |
| 3 | 4 | 0.7 |
| 4 | 5 | 1.7 |
| 5 | 3 | -0.3 |
| 6 | 1 | -2.3 |
| 7 | 5 | 1.7 |
| 8 | 4 | 0.7 |
| 9 | 4 | 0.7 |
| 10 | 4 | 0.7 |

Now we square all of those numbers and add them together.

$(-1.3)^2 + (-2.3)^2 + 0.7^2 + 1.7^2 + (-0.3)^2 + (-2.3)^2 + 1.7^2 + 0.7^2 + 0.7^2 + 0.7^2 = 20.1$

Next, we’ll divide that number by the total sample number, N, minus 1.

$20.1/9 = 2.23$

And finally, to find the standard deviation, we’ll take the square root of that number.

$√2.23=1.4933184523$
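The same steps can be checked with a short Python sketch using the growth numbers from the table above. The variable names are illustrative, and `statistics.stdev` is the standard library’s sample standard deviation (it divides by N minus 1, like the formula here).

```python
from statistics import mean, stdev

# Growth in inches for the ten fertilized plants.
fertilized_growth = [2, 1, 4, 5, 3, 1, 5, 4, 4, 4]

# Average the data.
average = mean(fertilized_growth)

# Subtract the average from each value, square, and add up.
squared_deviations = [(x - average) ** 2 for x in fertilized_growth]
sum_of_squares = sum(squared_deviations)

# Divide by N - 1, then take the square root.
variance = sum_of_squares / (len(fertilized_growth) - 1)
by_hand = variance ** 0.5

# The standard library does all of the above in one call.
built_in = stdev(fertilized_growth)

print(f"mean = {average}, sum of squares = {sum_of_squares:.1f}")
print(f"standard deviation = {by_hand:.4f} (built-in: {built_in:.4f})")
```

The code gives roughly 1.494 rather than 1.4933 because it carries 20.1/9 at full precision instead of rounding it to 2.23 before taking the square root.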

But that’s not the end. If we have more than one sample group, we also need to calculate the standard deviation of each additional group. In our case, let’s say that we did a second experiment where we didn’t add fertilizer so we could see what the growth looked like on its own, and these were our results:

| Plant | Growth (inches) |
| --- | --- |
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 1 |
| 5 | 3 |
| 6 | 1 |
| 7 | 1 |
| 8 | 2 |
| 9 | 1 |
| 10 | 1 |

So let’s run through the standard deviation calculation again.

#### #1: Average Data

$1 + 1 + 2+ 1 + 3 + 1 + 1 + 2 + 1 + 1 = 14$

$14/10 = 1.4$

#### #2: Subtract the average from each sample $(x_i – µ)$, then square the results and add them together.

$(-0.4)^2 + (-0.4)^2 + 0.6^2 + (-0.4)^2 + 1.6^2 + (-0.4)^2 + (-0.4)^2 + 0.6^2 + (-0.4)^2 + (-0.4)^2 = 4.4$

#### #3: Divide the last number by the total sample number, N, minus 1.

$4.4/9 = 0.4889$

#### #4: Take the square root of the previous number.

$√0.4889 = 0.6992$

### Step 7: Run Standard Error Formula

Okay, now we have our two standard deviations (one for the group with fertilizer, one for the group without). Next, we need to run through the standard error formula, which is:

$$s_d = \sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}$$

In this equation:

- $s_d$ is the standard error
- $s_1$ is the standard deviation of group one
- $N_1$ is the sample size of group one
- $s_2$ is the standard deviation of group two
- $N_2$ is the sample size of group two

Note that each standard deviation is squared before it’s divided by the sample size. Squaring simply undoes the square root from Step 6, so we can reuse the values we found just before taking the square root: 2.23 for group one and 0.4889 for group two.

So let’s work through this.

First, let’s figure out $s_1^2/N_1$.

With our numbers, that becomes $2.23/10$, or 0.223.

Next, let’s do $s_2^2/N_2$.

With our numbers, that becomes $0.4889/10$, or 0.04889.

Next, we need to add those two numbers together.

$0.223 + 0.04889 = 0.27189$

And finally, we’ll take the square root:

$√0.27189 = 0.5214$

So our standard error, $s_d$, is 0.5214.

### Step 8: Find t-Score

But we’re still not done! Now you’re probably seeing why most people use a calculator for this.

Next up: t-score. Your t-score measures how far apart your two group averages are relative to the standard error, which lets you find the probability of seeing a difference this large if the null hypothesis were true. The formula for t-score is

$$t = \frac{\mu_1 - \mu_2}{s_d}$$

where:

- $t$ is the t-score
- $µ_1$ is the average of group one
- $µ_2$ is the average of group two
- $s_d$ is the standard error

So for our numbers, this equation would look like:

$t = (3.3 – 1.4)/0.5214$

$t = 3.644$

### Step 9: Find Degrees of Freedom

We’re almost there! Next, we’ll find our degrees of freedom ($df$), which tells you how many values in a calculation are free to vary. To calculate this, we add the number of samples in each group and subtract two. In our case, that looks like this:

$$(10 + 10) - 2 = 18$$

### Step 10: Use a T-Table to Find Statistical Significance

And now we’ll use a t-table to figure out whether our conclusions are significant. To use the t-table, we first look on the left-hand side for our $df$, which in this case is 18.

Next, scan along that row of t-values until you find where ours, 3.644, would fall. It lands between 3.610 and 3.922, the entries for one-tailed p-values of 0.001 and 0.0005. Scan upward until you see the p-values at the top of the chart and you’ll find that our p-value is between 0.0005 and 0.001, which is well below our significance level.

So is our study on whether our fertilizer makes plants grow taller valid?

The final stage of determining statistical significance is comparing your p-value to your alpha.

In this case, our alpha is 0.05, and our p-value is well below 0.05. Since one of the methods of determining statistical significance is to demonstrate that your p-value is less than your alpha level, we’ve succeeded!

The data seems to suggest that our fertilizer does make plants grow, and with a p-value below 0.001 at a significance level of 0.05, it’s definitely significant!

Now, if we’re doing a rigorous study, we should test again on a larger scale to verify that the results can be replicated and that there weren’t any other variables at work to make the plants taller.
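The hand calculation in Steps 6 through 10 can also be cross-checked in a few lines of Python. This is a sketch rather than a replacement for a proper statistics package: the function name `two_sample_t` is invented here, the standard error is built from each group’s sample variance (the squared standard deviation) divided by its sample size, and the cutoff of 1.734 is the one-tailed critical value for alpha 0.05 at 18 degrees of freedom, copied from a standard t-table.

```python
from math import sqrt
from statistics import mean, variance

# Growth in inches from the two tables in Step 6.
fertilized = [2, 1, 4, 5, 3, 1, 5, 4, 4, 4]
unfertilized = [1, 1, 2, 1, 3, 1, 1, 2, 1, 1]


def two_sample_t(group_one, group_two):
    """Return (t_score, degrees_of_freedom) for two independent groups."""
    n1, n2 = len(group_one), len(group_two)
    # statistics.variance is the sample variance (divides by N - 1).
    standard_error = sqrt(variance(group_one) / n1 + variance(group_two) / n2)
    t_score = (mean(group_one) - mean(group_two)) / standard_error
    degrees_of_freedom = n1 + n2 - 2
    return t_score, degrees_of_freedom


t_score, df = two_sample_t(fertilized, unfertilized)

# One-tailed critical value for alpha = 0.05 and df = 18, from a t-table.
CRITICAL_T = 1.734
significant = t_score > CRITICAL_T

print(f"t = {t_score:.3f}, df = {df}, significant at 0.05: {significant}")
```

Because the code works from the raw measurements without rounding any intermediate step, its t-score can differ slightly in the later decimal places from a version worked by hand.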

## Tools to Use For Statistical Significance

Calculators make calculating statistical significance a lot easier.

Most people will do their calculations this way instead of by hand, as doing them without tools is more likely to introduce errors in an already sensitive process.

To get you started, here are some calculators you can use to make your work simpler:

- How to Calculate T-Score on a TI-83
- Find Sample Size and Confidence Interval
