Did you read a hot new study on a diet which has won your affection? Let’s not rush things. Here’s what every honorable scientist and sceptic needs to know about how to judge if a study is of high or low quality, made simple with intuitive examples of correlation and causation!
The main point of this article is to teach you that not all studies are created equal! This is important to understand, especially if you seriously want to understand fitness and nutrition, since many of the issues discussed below are common in fitness research.
Experimental Studies: Randomized Controlled Trials are Best
The best way to prove that a diet causes better health or weight loss etc. is to conduct an experimental study. This is where one group gets to try a certain diet, and the other acts as a “do nothing” group (a so-called control group). There are several types of experimental studies:
- The worst type of experimental study: Studies on cells: Here, conclusions are drawn based on how cells in a lab react to various substances. For obvious reasons, things that happen in a test tube (“in vitro”) may not happen in the body of living beings (“in vivo”), which is why this type of research really only acts as the first “step” before doing studies on the living.
- The okay type of experimental study: Studies on animals: Studies on animals are better than studies on cells, but once again don’t always reflect effects seen in humans since our bodies differ. For example, rats, which are usually studied, handle sugar much differently from humans, making research on the effects of sugar on rats less applicable to humans. Animal studies are, however, the only means by which things that would be unethical to study in humans can be tested.
- The best type of experimental study: The Randomized Controlled Trial (RCT): This is the only type of study which REALLY can prove that one thing causes another. The “full” name of a proper RCT is a “double-blinded placebo-controlled randomized controlled trial”. Say that 5 times. In medicine, all drugs need to go through RCTs before being sold. In order to prove that one thing causes another, an RCT will do the following:
- A group of individuals is randomly divided (“randomized”) into a treatment group (“intervention group”) and a “do nothing” group (“control group”). By randomly dividing people into two groups, any differences between individuals will be evenly divided between the groups. This means that, for example, people with good genetics, bad genetics, stressful life habits, smokers, sick people, healthy people etc. will be found in equal amounts in both groups. So randomization gives you two groups which are equal on a population level (see the sketch below).
- The intervention group gets whatever we are testing: be it a supplement, a diet or a new exercise.
- The control group gets a “fake product” (a placebo): To know if our tested product works, we need to compare it to a group that is “getting nothing”. Strictly speaking, the control group shouldn’t just “get nothing”, but instead get a “fake” product, called a placebo. This is because “doing nothing” might affect the control group negatively, making the tested product look better than it is. A placebo makes the control group behave as though they are getting the tested product, so that we know it’s not just behavioral changes or other factors giving us the effect, but the actual product.
- Neither the participants nor the researchers know who is getting which product, i.e. the study is “blinded”: If participants know that they are in the intervention group, they may behave differently than if they know that they’re getting the placebo. These behaviors might affect study results. Similarly, researchers might treat intervention group participants differently from placebo participants, also affecting the results. “Blinding” is therefore the best way to avoid these biases. When both participants and researchers are “blind” to the given treatment, this is known as double-blinding. There are also simpler variants where only study participants or only researchers are blinded, but these are generally less reliable.
This way, an RCT really manages to isolate the product tested, while keeping everything else constant, following groups over time to know if the PRODUCT actually CAUSES an effect.
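To make randomization concrete, here is a minimal Python sketch (the participants and group sizes are made up purely for illustration, this is not code from any real trial) showing how random assignment splits people into two groups:

```python
import random

# Hypothetical participants (made-up labels for illustration)
participants = [f"person_{i}" for i in range(1, 21)]

# Shuffle randomly, then split the list in half
random.shuffle(participants)
intervention_group = participants[:10]  # gets the diet/supplement being tested
control_group = participants[10:]       # gets the placebo

print("Intervention group:", intervention_group)
print("Control group:", control_group)
```

Because the assignment is random, traits like genetics, stress and smoking habits end up roughly evenly spread between the two groups, at least when the groups are large enough.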
Experimental Studies on DIETS are Limited by the Following Factors
- It is impossible to blind groups, so the study participants and researchers know which participants are getting which diet. This can influence behavior and results. Simply knowing you are in the “vegetarian group” or the “control group” may influence other daily behaviors.
- People are generally bad at following a new diet. It is not uncommon for groups to follow the diet as instructed only about 50% of the time, meaning that the “intervention group” won’t fairly represent the tested diet. So simply telling people in your intervention group to follow a diet is not the same as them actually following the diet. Some studies provide meals to their participants, which makes them more likely to follow the given diet, but at the end of the day we don’t know if participants have been “cheating” or not. The most reliable way of knowing that participants actually stick to what they are supposed to eat is to lock them in a room where you control everything they eat (a.k.a. a metabolic chamber). These types of studies are very rare today for ethical reasons, but have been done historically, like in the Vipeholm Mental Hospital Experiments (conducted close to where we went to medical school, actually!).
- Experimental studies are expensive and not long lasting. Some things, like heart disease and cancer, take YEARS to develop, but instructing a group to switch to a vegetarian diet for 20 years and getting them to actually follow it is insanely hard. Think about it, would YOU be able to follow a diet for 20 years for a study? That’s a pretty major life change for a damn study. Therefore, experimental studies usually only last for a few weeks or a couple of months, tops. To assess disease risk, they usually measure so-called surrogate markers, like blood pressure and blood fats, which are assumed to reflect the risk of future disease (and this may not necessarily be true).
- They have fewer participants. In addition to the dilemma above, participants of experimental studies usually require some form of compensation to stay engaged in the study. It is difficult both to gather enough participants and to make sure people stay in the study, leading to fewer participants. If there are too few participants in a study, it can be difficult to detect group differences over time. The study is then said to be underpowered. This means that one group might, for example, actually build more muscle or lose more fat over time, but the study may be too small to detect this difference between groups (see the sketch after this list).
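To give a feel for what “underpowered” means, here is a rough Python sketch using statsmodels (the effect size and group sizes are assumptions chosen only for illustration) estimating how many participants per group are needed to reliably detect a moderate difference:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Assume a moderate true difference between groups (Cohen's d = 0.5)
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Participants needed per group: ~{n_per_group:.0f}")  # roughly 64

# With only 15 participants per group, the chance of detecting that
# same difference drops far below the desired 80%
power_small = analysis.solve_power(effect_size=0.5, nobs1=15, alpha=0.05)
print(f"Power with 15 per group: {power_small:.0%}")
```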
So even if experimental studies are the best way to prove causation, some of the limitations above can be remedied by looking at existing statistics on different groups and drawing conclusions. This type of study is known as an observational study (or epidemiological study), because we are not actively changing anything like in experimental studies, but merely looking at existing patterns. The problem with these studies is that they cannot prove that one thing causes another; they can only correlate two things together (also called an association or a “link” in the media).
To avoid confusion, correlation, association and link all mean the same thing: That two things happen at the same time or vary together.
Correlation vs Causation
To understand the difference between causality from experimental studies and correlation in observational studies, the following example is helpful:
Looking at the statistics, we can see that taller people tend to wear larger sized t-shirts. This is the equivalent of an observational study finding a correlation between large t-shirts and tallness.
However, this doesn’t mean that large t-shirts cause tallness. If we want to find out, a suitable experimental study would be to take a group of people of similar height and divide them into two equal groups, giving one group larger t-shirts and the other smaller t-shirts. Following them for 2 years, we would find no difference in height between the groups, meaning that larger t-shirts don’t cause tallness.
Another example is the classic association between diet soda and diabetes. People who drink diet soda also tend to have more diabetes, as seen in observational studies, but when conducting experimental studies where we give healthy people diet soda or water and follow them over time, we don’t see more diabetes developing in the group receiving diet soda. Thus the experimental study shows that diet soda doesn’t cause diabetes. What actually seems to be the case is that people who have diabetes tend to choose diet soda over regular soda, as it is better for their blood sugar, explaining the correlation seen in observational studies.
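As a toy illustration of this reverse causation (the numbers below are completely made up, not real data), here is a small Python simulation where diabetes makes people more likely to choose diet soda, yet a naive comparison still shows a “link” between diet soda and diabetes:

```python
import random

random.seed(42)
people = []
for _ in range(10_000):
    has_diabetes = random.random() < 0.10  # assume 10% of people have diabetes
    # Diet soda does NOT cause diabetes here; instead, diabetics are
    # more likely to CHOOSE diet soda (reverse causation)
    drinks_diet_soda = random.random() < (0.60 if has_diabetes else 0.20)
    people.append((drinks_diet_soda, has_diabetes))

drinkers = [d for soda, d in people if soda]
non_drinkers = [d for soda, d in people if not soda]
print(f"Diabetes among diet soda drinkers: {sum(drinkers) / len(drinkers):.1%}")
print(f"Diabetes among non-drinkers:       {sum(non_drinkers) / len(non_drinkers):.1%}")
```

An observational study on data like this would correctly report a correlation, but the causal arrow points from diabetes to soda choice, not the other way around.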
Observational Studies: Prospective Cohort Studies are Best
There are three main types of observational studies:
- The worst type of observational study: Cross-sectional studies: These look at data at ONE point in time (a “cross-section” of time) and find links. For example, asking people whether they have a diabetes diagnosis, also asking them whether they regularly consume diet soda, and presenting the results would be a cross-sectional study.
- The okay type of observational study: Case control studies: These go to cases with a certain outcome today and ask them about their past to try to find links. The equivalent example to the one above would be to go to a group of diabetics and ask them if they regularly consumed diet soda last year. This way you can try to find a “link” to the past. If you then ask healthy people if they regularly consumed diet soda last year, you could compare the results and suspect that there is a link between past soda consumption and diabetes, but once again this is only a correlation and not causality since there could be other things in the past of diabetics, like smoking and lack of exercise, that may contribute to their disease.
- The best type of observational study: Cohort studies: These studies take a group (a “cohort”) with a suspected risk factor and link it to an outcome. This can be done either by following the group over time (a prospective cohort study), or by looking at a group exposed to a past risk factor to see how many have the outcome now (a retrospective cohort study). To stick with the examples above, a retrospective cohort study would look at a group of people who regularly consumed diet soda last year to see how many have diabetes now. A prospective cohort study would take a group who say that they consume diet soda regularly, follow them over time, and later ask them if they have developed diabetes. Of the two, prospective studies are considered the “better” option when it comes to evidence quality.
All three observational study types above only show CORRELATIONS and can never fully prove what CAUSES what. That’s what experimental studies, especially randomized controlled trials, do.
Here are a few bizarre correlations for you:
Correlation is not Causality!
- Ice Cream and Murder
- Organic Food and Autism
- Cholesterol Levels, Justin Bieber and Facebook
So you see that just because two things happen at the same time does NOT mean that one causes the other.
By now you might think that observational studies SUCK, since they mislead people, and YES, the media uses observational studies to create scary headlines. BUT observational studies DO have some advantages over experimental studies:
- Cheaper: no paying participants, no paying teams to follow people up. To make an observational study, all you need is access to data.
- Follow individuals over a longer time: as long as there is data on the same individuals over time, potential long-term links can be found using observational studies.
- Larger numbers of participants: This is because data usually comes from registries with thousands of participants, making it possible to detect rarer outcomes like rare diseases over time.
- Help create hypotheses for experimental studies: Observational data may reveal many correlations, which can then be further investigated using a randomized controlled trial to see if the found links are causal. At the same time, finding NO correlation between two variables is a good indication that there ISN’T a causal link.
How to Interpret Study Results? Statistical Significance & P-Values
Now that you understand the different study types, you may have realized that studies present results by demonstrating a DIFFERENCE between two groups. You cannot simply give a group a supplement, measure an increase in their muscle mass 12 weeks later, and give the supplement the credit. How do you know that it was the supplement? Maybe participants happened to start a workout program that week. Maybe being part of the study made them realize how lazy they were? The point is, YOU need a group to compare to. The same principle goes for observational studies.
Now, EVEN if you are comparing two groups, there is a CHANCE that you were just LUCKY to find a difference between the two groups. To TEST for this, there is a mathematical value called the P-value. In simple terms, the P-value tells you how big the chance is that a difference like the one between your groups could appear by pure LUCK (i.e. if there were no real effect). So a P-value of 0.6 means that there is a 60% chance that the group difference seen in a study occurred by chance. Scientists have arbitrarily agreed that if the P-value is less than 0.05 (i.e. less than a 5% chance that the findings were just luck), the results of a study are deemed to be STATISTICALLY SIGNIFICANT, meaning that they probably are a real finding and not just dumb luck.
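If you want to see what this looks like in practice, here is a minimal Python sketch using scipy (the muscle-gain numbers are invented for illustration) that compares two made-up groups with a t-test and prints the P-value:

```python
from scipy import stats

# Invented muscle gain (kg) after 12 weeks in two hypothetical groups
supplement_group = [1.8, 2.1, 1.5, 2.4, 1.9, 2.2, 1.7, 2.0]
placebo_group    = [1.2, 1.6, 1.1, 1.5, 1.4, 1.3, 1.7, 1.0]

t_stat, p_value = stats.ttest_ind(supplement_group, placebo_group)
print(f"P-value: {p_value:.4f}")
# If p_value < 0.05, the group difference is called "statistically significant"
```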
This 0.05 threshold is the most important thing for you to know. Beyond that, it is worth knowing that the chance of finding a statistically significant P-value increases if the study population is large, or if you simply run many statistical analyses. Thus a P-value can also be a “false positive”, as the sketch below illustrates.
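Here is a rough simulation of the “many analyses” problem (an assumed setup, not a real study): if you run 20 comparisons where there is truly no effect at all, about one of them will still come out “significant” by pure luck:

```python
import random
from scipy import stats

random.seed(1)
false_positives = 0
for _ in range(20):
    # Both groups are drawn from the SAME distribution, so any "difference" is luck
    group_a = [random.gauss(0, 1) for _ in range(30)]
    group_b = [random.gauss(0, 1) for _ in range(30)]
    _, p = stats.ttest_ind(group_a, group_b)
    if p < 0.05:
        false_positives += 1

print(f"'Significant' results out of 20 effect-free comparisons: {false_positives}")
```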
So How Do I Know What is True?
You may now be able to appreciate that it takes time to know if something really is true or not, and a single observational study isn’t enough to say with certainty that something is true. To make things even more complicated, some studies reach their conclusions just by LUCK, which means that a single randomized controlled trial also isn’t enough to say that something is true.
Dr. C. Glenn Begley (a big name from the pharmaceutical industry) found that only SIX of 53 landmark studies published in big journals could be reproduced to reach the same conclusion (!). This is why we need MANY studies, ideally both observational and experimental, pointing towards the same conclusion before saying with certainty that a hypothesis is TRUE. Most of the time, however, that amount of studies is lacking, which is why the conclusion of many studies often is “more studies are needed”. Thus, there are LEVELS to this:
Make Things Easy: Find a Systematic Review & Meta-Analysis
The GOOD news is that you don’t have to look up all studies yourself, as there are researchers doing it for you, by conducting a so-called systematic review. These are articles where researchers try to sum up all the evidence on a topic by searching all available databases for studies on that topic. Most importantly, they explain how they conducted the search for studies, meaning that the risk of them “cheating” by cherry-picking studies is much smaller. After all, anyone can recheck their work by redoing the search. This is the main difference between a systematic review and a regular review article, where the author doesn’t present how the literature search was conducted.
After all relevant studies have been found for a systematic review, the data from all studies can be combined by doing a so-called meta-analysis. We can then conclude what ALL available research says about a topic. Meta-analyses will usually present their results in a diagram called a forest plot. By looking at a forest plot you can get a quick overview of the results. The image below teaches you how to interpret a forest plot:
As you see above, the five studies reach different conclusions, with one disagreeing with the other four. Then again, the “potential result variation” (which in scientific terms is called the 95% confidence interval) shows that if that study were repeated, its results might agree with the other studies. All in all, we can see from the diamond that the combined data from all studies suggests that the results favor group 1. That’s why meta-analyses are so powerful: they give you the big picture. A blogger might write about Study 3 to disprove someone else writing about Study 5, but the meta-analysis can then settle the dispute.
To make things even clearer, here’s an example of a forest plot from an imagined meta-analysis.
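For the curious, here is a minimal Python sketch (with made-up study results) of how a meta-analysis pools effects using inverse-variance weighting, which is roughly what the diamond in a forest plot summarizes:

```python
import math

# Invented (effect estimate, standard error) pairs from five imaginary studies
studies = [(0.30, 0.15), (0.25, 0.20), (-0.10, 0.25), (0.40, 0.18), (0.20, 0.22)]

# More precise studies (smaller standard error) get a larger weight
weights = [1 / se ** 2 for _, se in studies]
pooled = sum(w * effect for (effect, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

# 95% confidence interval for the pooled effect (the "diamond" in the forest plot)
ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled effect: {pooled:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Real meta-analyses use dedicated software and often random-effects models, but the weighting idea is the same.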
Meta-analyses can combine results from RCTs as well as observational studies. It is important to note that the quality of a meta-analysis largely depends on what studies you put into it. A meta-analysis of bad studies usually results in a bad meta-analysis. Also, meta-analyses can sometimes give misleading conclusions because of the mix of studies included.
Meta-Analysis Problems: Heterogeneity
If a meta-analysis aims to look at studies measuring the effect of protein on muscle mass, one study may be on a thousand 75-year-olds who eat more beans daily for 3 weeks without muscle gains, while another is on six 19-year-old athletes who take steroids and drink whey protein daily for 2 years with huge muscle gains.
A meta-analysis of these two studies might reach a conclusion based mostly on the results of the first study, giving the false impression that protein barely matters. This issue is known as heterogeneity and simply means that the different studies in the meta-analysis may not be comparable due to different methods, designs and populations.
Some aspects of heterogeneity can be measured (called statistical heterogeneity) while others cannot (called clinical heterogeneity). Statistical heterogeneity means that the results of the included studies vary more than you would expect from chance alone, and it is measured with a P value and an I² value. A P value under 0.05 indicates statistical heterogeneity, and the I² value indicates HOW MUCH heterogeneity there is: generally, 25% is considered low and 75% is considered high. This says nothing about the CLINICAL heterogeneity though. The only way to judge this is to look at the individual studies and use your expertise to say if you think the studies were done on comparable situations. Meta-analyses usually present tables of all included studies to make this easier, but it can be tricky!
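Continuing the made-up example from the pooling sketch above, I² could be computed roughly like this (a simplified illustration of Cochran’s Q and Higgins’ I², not a replacement for a proper meta-analysis package):

```python
from scipy.stats import chi2

studies = [(0.30, 0.15), (0.25, 0.20), (-0.10, 0.25), (0.40, 0.18), (0.20, 0.22)]
weights = [1 / se ** 2 for _, se in studies]
pooled = sum(w * effect for (effect, _), w in zip(studies, weights)) / sum(weights)

# Cochran's Q: how much the individual study results deviate from the pooled effect
Q = sum(w * (effect - pooled) ** 2 for (effect, _), w in zip(studies, weights))
df = len(studies) - 1
p_heterogeneity = chi2.sf(Q, df)   # P value for the heterogeneity test
I2 = max(0.0, (Q - df) / Q) * 100  # share of variation beyond chance, in %

print(f"Q = {Q:.2f}, P = {p_heterogeneity:.3f}, I² = {I2:.0f}%")
```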
RCT vs Meta-Analysis!
Now you might also ask: what is better, a large randomized controlled trial of 5000 patients, or a meta-analysis of 100 small studies, each with 50 participants? Well, we would argue that the large RCT is more likely to keep a higher standard, and should therefore be trusted more, but it all depends on how the studies look, so these things aren’t easy. A study in the New England Journal of Medicine (one of the world’s most prestigious journals) found that 35% of large RCTs reached results disagreeing with previous meta-analyses (1). Thus we could say (at the risk of over-simplifying) that you probably can trust the results of a meta-analysis with 65% certainty.
Take Home Message
That’s the end of our long science guide. If you are lazy and just want a quick answer to your question, search for systematic reviews and meta-analyses. The newer the better, since they will then probably include the latest studies on a given topic! Alternatively, you can try searching for a large randomized controlled trial on a population relevant to you. Science is complex and can never give any definite answers. The goal is to explain reality as accurately as possible, but we must always stay humble to the possibility of being wrong. Good luck!
Despite the long read, there is still much more to be said, and some stuff on here is simplified to make it easier for beginners to understand. Please let me know if anything is unclear, or if you feel like it’s been over-simplified. I hope that I’ve brought you one step closer to being able to independently seek out reliable knowledge!
If you want to learn more about science, check out the science category on our blog for all of our science related articles!
Artin Entezarjou, M.D., PhD Student
Co-Founder of EBT
Source:
1. LeLorier J, et al. Discrepancies between Meta-Analyses and Subsequent Large Randomized, Controlled Trials. N Engl J Med. 1997;337(8):536-542.