For research and educational purposes only. Not medical advice.

Reading published evidence without the medical degree: tiers, p-values, effect sizes, and how to spot a thin study

Most peptide and GLP-1 discussions cite published research without engaging the actual mechanics of how that research is built. The mechanics are not arcane.…

Category: Research Gaps. 14 min read. Published 2026-04-28.

The evidence pyramid, fairly stated

Clinical-research methodology textbooks describe an evidence pyramid that ranks study designs by their susceptibility to bias and confounding. The bottom of the pyramid is the single anecdote: one person reports an outcome. The top is the high-quality systematic review and meta-analysis of large randomized controlled trials. Between those endpoints sit case series, observational cohort studies, case-control studies, individual randomized trials, and unsystematic reviews.

The pyramid is a useful first cut, not a precise ranking. A small RCT with poor blinding and high dropout can be less informative than a large, well-designed prospective cohort. A systematic review of weak studies inherits the weakness of its inputs. The Cochrane Collaboration has spent decades developing tools (GRADE, Risk of Bias 2) that take the pyramid intuition and turn it into something more rigorous [1]. The pyramid is the heuristic; GRADE is the methodology.

  • Systematic review + meta-analysis (high quality): synthesis of multiple primary studies under a pre-registered protocol with defined inclusion criteria and quantitative pooling. Strongest when the underlying studies are themselves well-designed.
  • Randomized controlled trial (RCT): the most informative single-study design because randomization neutralizes baseline confounding in expectation. Quality varies by sample size, blinding, dropout, and analysis plan adherence.
  • Prospective cohort: follows a defined population forward over time. Cannot fix confounding by random assignment, but can adjust for measured confounders. Good for long-term outcomes that are unethical to randomize.
  • Case-control: compares people with the outcome to people without, looking backward. Efficient for rare outcomes. Vulnerable to recall bias.
  • Cross-sectional: snapshot at a single time point. Useful for prevalence, not causation.
  • Case series: reports outcomes for a defined patient group, no comparator. Hypothesis-generating, not confirmatory.
  • Single case report: one person, one outcome. Useful only as a flag for further investigation.

What p-values actually say, and what they do not

A p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. It is a probability about the data given the null, not a probability that the null is true given the data, and not a probability that the alternative hypothesis is true. The American Statistical Association issued a formal statement clarifying these distinctions in 2016 because the misinterpretations are widespread in primary literature [2].

Concretely, p < 0.05 does not mean the result is true with 95 percent probability. It does not mean the effect is large. It does not mean the effect will replicate. It means that, if there were genuinely no effect, the chance of observing a result this extreme by random sampling alone would be less than 5 percent. That is useful information; it is not the only information.
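
To make the definition concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are installed; every number is invented): simulate many two-arm trials in which the true effect is exactly zero and count how often a t-test still returns p < 0.05.

  # Sketch: when the null is true, p < 0.05 still shows up about 5 percent of
  # the time by random sampling alone. Requires numpy and scipy; all numbers invented.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  n_trials, n_per_arm = 10_000, 50
  false_positives = 0

  for _ in range(n_trials):
      # Both arms drawn from the same distribution, so there is truly no effect.
      drug = rng.normal(loc=0.0, scale=5.0, size=n_per_arm)
      placebo = rng.normal(loc=0.0, scale=5.0, size=n_per_arm)
      result = stats.ttest_ind(drug, placebo)
      if result.pvalue < 0.05:
          false_positives += 1

  print(f"Share of null trials with p < 0.05: {false_positives / n_trials:.3f}")
  # Prints a value close to 0.05.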

Effect size beats statistical significance

A trial with 10,000 patients can detect a 0.1 kg weight difference with high statistical significance. A 0.1 kg difference is meaningless to a person trying to lose 30 kg. Effect size measures the magnitude of difference between groups, independent of sample size. Common measures include Cohen's d for continuous outcomes (small ~0.2, medium ~0.5, large ~0.8), risk ratios and risk differences for binary outcomes, and absolute risk reduction (ARR) and number needed to treat (NNT) for clinical decisions.
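
As a rough illustration of the arithmetic (Python, with invented numbers rather than data from any trial), Cohen's d, absolute risk reduction, and NNT each reduce to a few lines:

  # Illustrative effect-size arithmetic with invented numbers; not data from any trial.
  import math

  def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
      # Standardized mean difference using the pooled standard deviation.
      pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
      return (mean1 - mean2) / pooled_sd

  # Hypothetical continuous outcome: kg lost, treatment vs placebo.
  d = cohens_d(mean1=12.0, sd1=8.0, n1=100, mean2=3.0, sd2=8.0, n2=100)

  # Hypothetical binary outcome: rate of some unwanted event in each arm.
  risk_control, risk_treated = 0.30, 0.18
  arr = risk_control - risk_treated   # absolute risk reduction
  nnt = 1 / arr                       # number needed to treat
  rr = risk_treated / risk_control    # risk ratio

  print(f"Cohen's d = {d:.2f} (about 0.8 or above is conventionally 'large')")
  print(f"ARR = {arr:.2f}, NNT = {nnt:.1f}, risk ratio = {rr:.2f}")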

Reading a trial well means looking at the effect size first and the p-value second. Reports of the primary endpoint usually give both. STEP-1 reported a mean weight reduction of 14.9 percent with semaglutide versus 2.4 percent with placebo at 68 weeks, a placebo-subtracted difference of roughly 12.4 percentage points; that is the effect size, and it is large [3]. SURMOUNT-1 reported a placebo-subtracted difference of roughly 17.8 percentage points at the highest tirzepatide dose; that is the effect size, and it is larger [4]. The p-values for both are vanishingly small, but the p-value is not what makes the result interesting.

Confidence intervals tell you what the trial actually narrowed down

A 95 percent confidence interval is the range of effect sizes the data are consistent with. A trial that reports a mean weight reduction of 14.9 percent with a 95 percent CI of 14.0 to 15.8 percent has tightly constrained the effect; a different trial reporting 14.9 percent with a CI of 6 to 24 percent has not.

Wide confidence intervals usually come from small samples, large within-group variance, or both. They do not mean the result is wrong; they mean the trial has weakly constrained the answer. A pilot study showing 'a large but uncertain effect' (large point estimate, wide CI) is hypothesis-generating. A confirmatory trial converting that to a tighter CI is the load-bearing follow-up.
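
A small sketch (Python; the standard deviation and arm sizes are invented, and the normal approximation with equal arms is a simplification) of how the 95 percent CI for a mean difference narrows as the sample grows while the point estimate stays put:

  # Illustrative only: the same point estimate with different sample sizes gives
  # very different 95 percent confidence intervals. SDs and Ns are invented.
  import math

  def ci_95(mean_diff, sd, n_per_arm):
      # Normal-approximation CI for a difference of two independent means,
      # assuming equal SDs and equal arm sizes.
      se = sd * math.sqrt(2 / n_per_arm)
      return mean_diff - 1.96 * se, mean_diff + 1.96 * se

  for n in (20, 200, 2000):
      lo, hi = ci_95(mean_diff=-12.4, sd=9.0, n_per_arm=n)
      print(f"n per arm = {n:4d}: estimate -12.4, 95% CI ({lo:.1f}, {hi:.1f})")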

Heterogeneity, and the meta-analysis traps that come with it

Meta-analysis pools effect estimates across studies. The pooling assumption (that the studies are estimating the same underlying effect) breaks down when the studies actually measured different things, used different doses, enrolled different populations, or had different durations. Heterogeneity statistics (I-squared, Q-statistic) attempt to quantify how much the across-study variability exceeds within-study variability. High I-squared (>50 percent) is a flag that the pooled estimate may be hiding meaningful between-study differences [5].
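
A minimal sketch (Python; the effect estimates and standard errors are invented, not taken from any meta-analysis) of how Cochran's Q and I-squared fall out of per-study estimates under fixed-effect inverse-variance pooling:

  # Fixed-effect inverse-variance pooling plus Cochran's Q and I-squared.
  # The effect estimates and standard errors below are invented.
  effects = [-11.0, -13.5, -6.0, -15.2]   # per-study effect estimates
  ses = [1.2, 1.5, 2.0, 1.8]              # per-study standard errors

  weights = [1 / se**2 for se in ses]     # inverse-variance weights
  pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

  q = sum(w * (e - pooled)**2 for w, e in zip(weights, effects))  # Cochran's Q
  df = len(effects) - 1
  i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

  print(f"Pooled fixed-effect estimate: {pooled:.2f}")
  print(f"Q = {q:.1f} on {df} df, I-squared = {i_squared:.0f} percent")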

  • When a meta-analysis reports high heterogeneity, look at the forest plot. If individual study estimates are spread across both sides of the line of no effect, the pooled point estimate is averaging studies that disagree.
  • Subgroup analysis can sometimes resolve heterogeneity (different doses, different populations). When it does not, the pooled effect should be reported with its uncertainty rather than treated as a single answer.
  • Publication bias inflates pooled effect estimates because positive trials are more likely to be published than null trials. Funnel plots and Egger's test can flag it; trim-and-fill methods attempt to adjust for it.
  • Pre-registered systematic reviews (PROSPERO, Cochrane) are less vulnerable to selective inclusion than non-registered narrative reviews.

Diagnostic questions that separate a load-bearing study from a thin one

  1. What was the primary endpoint, and was it pre-specified before data collection? Pre-specified endpoints are more credible than secondary endpoints repurposed as headlines.
  2. How were participants assigned to groups? Randomization neutralizes baseline confounding; non-random assignment leaves it.
  3. Was the trial blinded, and to whom? Double-blind reduces both expectation effects in patients and assessment bias in clinicians.
  4. What was the dropout rate, and was the analysis intention-to-treat? High dropout combined with per-protocol analysis often inflates effect size (see the sketch after this list).
  5. Was the sample size calculation reported? Trials underpowered for the primary endpoint should be read as hypothesis-generating regardless of significance.
  6. How are missing data handled? Last-observation-carried-forward, multiple imputation, and complete-case analysis can give different answers from the same dataset.
  7. Were there protocol deviations? Mid-trial changes to the primary endpoint, dose, or analysis plan substantially weaken the result.
  8. Who funded the trial, and what is the conflict-of-interest disclosure? Industry funding does not automatically invalidate a result, but it shifts the burden of proof on subgroup analyses and post-hoc framing.
  9. Is the journal peer-reviewed? Preprints are useful but unreviewed; pay attention to reviewer comments when the published version becomes available.
  10. Has the result been replicated? A single trial is hypothesis-generating; replication across labs and populations is what makes a finding load-bearing.
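
To make question 4 concrete, here is a small, entirely synthetic simulation (Python with NumPy; the dropout mechanism is invented) showing how a completers-only analysis can overstate an effect when people who respond poorly are more likely to drop out:

  # Entirely synthetic illustration of question 4: a completers-only ("per-protocol")
  # average can overstate the effect when poor responders are more likely to drop out.
  import numpy as np

  rng = np.random.default_rng(1)
  n = 500

  # Invented individual responses in the treatment arm (kg lost), mean 5.
  response = rng.normal(loc=5.0, scale=4.0, size=n)

  # Invented dropout mechanism: the worse the response, the higher the dropout odds.
  dropout_prob = 1 / (1 + np.exp(response - 2.0))
  completed = rng.random(n) > dropout_prob

  all_randomized = response.mean()              # counts everyone, in the spirit of ITT
  completers_only = response[completed].mean()  # completers only

  print(f"Mean response, all randomized: {all_randomized:.2f} kg")
  print(f"Mean response, completers only: {completers_only:.2f} kg  (higher)")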

How to actually search PubMed when you have a question

PubMed indexes most biomedical primary literature. The search syntax matters. Boolean operators (AND, OR, NOT in capitals), MeSH terms (the indexed subject vocabulary), and field tags ([Title], [Author], [Journal]) all narrow a search faster than free text alone. The PubMed Help pages walk through the syntax and the filter set [6].

  • Start with the compound name and the indication or outcome. 'tirzepatide AND obesity' is a reasonable starting query.
  • Use Filters: Article Type (Randomized Controlled Trial, Meta-Analysis), Publication Date (last 5 years), Species (Humans).
  • Sort by 'Best Match' for relevance or by 'Most Recent' to surface the newest literature.
  • When a paper looks promising, read the abstract first; check effect size and confidence interval; check whether the abstract conclusion matches the data presented in the abstract methods and results.
  • For paywalled articles, the abstract is usually free; the full text often is not. PubMed Central (PMC) hosts open-access versions where available.
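
For programmatic searching, NCBI's E-utilities expose the same index over HTTP. A rough sketch using only the Python standard library (the query string mirrors the filter advice above and is illustrative, not a recommended pipeline):

  # Sketch of a PubMed query against the NCBI E-utilities esearch endpoint.
  # The search term mirrors the filter advice above; adjust it to your question.
  import json
  import urllib.parse
  import urllib.request

  query = (
      "tirzepatide AND obesity "
      "AND randomized controlled trial[Publication Type] "
      "AND humans[MeSH Terms]"
  )
  params = urllib.parse.urlencode({
      "db": "pubmed",
      "term": query,
      "retmax": 20,
      "sort": "relevance",
      "retmode": "json",
  })
  url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"

  with urllib.request.urlopen(url) as resp:
      data = json.load(resp)

  result = data["esearchresult"]
  print(f"Total matches: {result['count']}")
  print("First PMIDs:", result["idlist"])

The returned PMIDs can be pasted back into the PubMed web interface or passed to the efetch endpoint to retrieve abstracts.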

What this changes when reading pepSmart and similar references

pepSmart's library catalog and articles cite primary sources by URL. Each link in the references list points at a specific paper, regulatory page, or trial registration. The decision of whether to trust a claim should not stop at 'pepSmart said it'; it should continue to 'what does the cited source say, and is the source itself credible'. The diagnostic questions in the prior section apply equally to anything pepSmart cites and to anything any other source cites.

References

  1. GRADE methodology overview (Cochrane Library)
  2. ASA statement on p-values (Wasserstein and Lazar, 2016) (PubMed)
  3. STEP-1: Once-weekly semaglutide in adults with overweight or obesity (Wilding et al., NEJM 2021) (PubMed)
  4. SURMOUNT-1: Tirzepatide once weekly for the treatment of obesity (Jastreboff et al., NEJM 2022) (PubMed)
  5. Cochrane Handbook chapter on heterogeneity in meta-analysis (Cochrane)
  6. PubMed search help and syntax (NCBI)