Research & Statistics for FRCEM Final: The Complete Guide
SLO 10 and 11 account for about 10 questions on FRCEM Final — and they're some of the most commonly failed. Most candidates skip stats revision because it feels dry and intimidating. But these are genuinely free marks if you learn the basics. This guide can't cover everything you'll need, but it covers the fundamentals and the most common exam themes.
Evidence hierarchy (strongest to weakest):
- Level 1a — Systematic review of RCTs
- Level 1b — Individual RCT (with narrow CI)
- Level 2a — Systematic review of cohort studies
- Level 2b — Individual cohort study / low-quality RCT
- Level 3a — Systematic review of case-control studies
- Level 3b — Individual case-control study
- Level 4 — Case series / case report
- Level 5 — Expert opinion
RCT: Gold standard for testing treatments. Randomises participants to intervention vs control. Minimises bias through randomisation and blinding.
Cohort study: Follows groups over time. Can be prospective or retrospective. Good for prognosis and identifying risk factors. Cannot prove causation but can show strong associations.
Case-control study: Starts with the outcome and looks back for exposure. Good for rare diseases because you don't need to follow thousands of people. Quicker and cheaper than cohort studies.
Cross-sectional study: A snapshot at one point in time. Good for prevalence. Cannot show causation or temporal relationships.
The 2x2 table:
| | Disease + | Disease − |
| --- | --- | --- |
| Test + | True Positive (TP) | False Positive (FP) |
| Test − | False Negative (FN) | True Negative (TN) |
Sensitivity = TP / (TP + FN) — how good the test is at detecting disease. A highly sensitive test rules OUT disease when negative (SnNOut).
Specificity = TN / (TN + FP) — how good the test is at confirming health. A highly specific test rules IN disease when positive (SpPIn).
PPV = TP / (TP + FP) — affected by prevalence. Higher prevalence leads to higher PPV.
NPV = TN / (TN + FN) — affected by prevalence. Lower prevalence leads to higher NPV.
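The four formulae above can be checked with a short worked example. This is a minimal sketch: the function name `diagnostic_metrics` and the patient counts are hypothetical, chosen to show the same test (90% sensitivity, 80% specificity) applied at 10% and then 1% prevalence.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return the four headline accuracy measures from 2x2 table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # detected / all with disease
        "specificity": tn / (tn + fp),  # cleared / all without disease
        "ppv": tp / (tp + fp),          # truly diseased / all test-positive
        "npv": tn / (tn + fn),          # truly healthy / all test-negative
    }

# Hypothetical cohort of 1,000 patients, prevalence 10%
high_prev = diagnostic_metrics(tp=90, fp=180, fn=10, tn=720)

# Same test, hypothetical cohort of 1,000 patients, prevalence 1%
low_prev = diagnostic_metrics(tp=9, fp=198, fn=1, tn=792)

for label, result in (("10% prevalence", high_prev), ("1% prevalence", low_prev)):
    print(label, {name: round(value, 3) for name, value in result.items()})
```

Sensitivity and specificity are identical in both cohorts, but PPV falls from about 33% to about 4% as prevalence drops, while NPV rises slightly.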
ARR (Absolute Risk Reduction) = control event rate − treatment event rate.
NNT (Number Needed to Treat) = 1 / ARR. Always uses absolute risk reduction, never relative.
ARI (Absolute Risk Increase) = treatment adverse event rate − control adverse event rate.
NNH (Number Needed to Harm) = 1 / ARI. If NNH < NNT, harms occur more often than benefits, so the drug is likely to do more harm than good (the severity of each outcome still matters).
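A short sketch of the ARR and NNT arithmetic, showing why NNT must come from the absolute rather than the relative risk reduction. The function name `risk_summary` and the trial numbers are hypothetical; exact fractions are used so that rounding the NNT up is not upset by floating-point error.

```python
from fractions import Fraction
import math

def risk_summary(control_events, control_n, treated_events, treated_n):
    """ARR, RRR and NNT from event counts in each trial arm."""
    cer = Fraction(control_events, control_n)   # control event rate
    eer = Fraction(treated_events, treated_n)   # treatment event rate
    arr = cer - eer
    if arr <= 0:
        raise ValueError("No absolute risk reduction, so NNT is undefined")
    return {
        "arr": float(arr),
        "rrr": float(arr / cer),
        "nnt": math.ceil(1 / arr),  # NNT is conventionally rounded up
    }

# Hypothetical high-risk population: 20% vs 15% event rate
high_risk = risk_summary(200, 1000, 150, 1000)
# Hypothetical low-risk population: 2% vs 1.5% event rate
low_risk = risk_summary(20, 1000, 15, 1000)

print(high_risk)  # RRR 25%, ARR 5%, NNT 20
print(low_risk)   # RRR 25%, ARR 0.5%, NNT 200
```

Both trials report the same 25% relative risk reduction, yet the NNT differs tenfold. NNH follows the same arithmetic with the adverse event rates, using ARI in place of ARR.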
ROC (Receiver Operating Characteristic) curve: Plots sensitivity (y-axis) vs 1−specificity (x-axis) at different diagnostic thresholds. A curve bowing toward the top-left corner indicates a good test. The diagonal line represents chance (a useless test).
AUC (Area Under the Curve) measures overall test accuracy:
- 0.5 = useless (follows the diagonal — no better than chance)
- 0.7–0.8 = acceptable
- 0.8–0.9 = excellent
- >0.9 = outstanding
- 1.0 = perfect test
The top-left corner represents a perfect test (100% sensitivity and 100% specificity). If the exam shows two ROC curves, the one with the larger AUC is the better test.
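To make the link between thresholds, the ROC curve and the AUC concrete, here is a minimal sketch that builds ROC points from raw test values and measures the area with the trapezoid rule. The function names and the biomarker values are hypothetical, and a result at or above the threshold is assumed to count as a positive test.

```python
def roc_points(diseased, healthy, thresholds):
    """(1 - specificity, sensitivity) at each threshold, plus the two corners."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for t in thresholds:
        sensitivity = sum(v >= t for v in diseased) / len(diseased)
        false_positive_rate = sum(v >= t for v in healthy) / len(healthy)
        points.add((false_positive_rate, sensitivity))
    return sorted(points)

def auc_trapezoid(points):
    """Area under the curve by summing trapezoids between adjacent points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical biomarker values
diseased = [12, 25, 40, 60, 80]
healthy = [3, 5, 8, 14, 30]

points = roc_points(diseased, healthy, thresholds=diseased + healthy)
auc = auc_trapezoid(points)
print(round(auc, 2))  # 0.88
```

Each lower threshold moves the point up (more sensitivity) and to the right (more false positives), which is why the curve runs from the bottom-left to the top-right corner.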
Power = 1 − β = the probability of detecting a true difference if one exists. Conventionally set at 80% (0.8).
α (alpha) = significance level (usually 0.05) = probability of Type I error (false positive — finding a difference that doesn't exist).
β (beta) = probability of Type II error (false negative — missing a real effect).
Power calculation components:
- Effect size (must be clinically meaningful)
- Alpha level
- Desired power
- Variance in the data
How each component affects sample size:
- Bigger effect size = smaller sample needed
- Smaller alpha = larger sample needed
- Higher power = larger sample needed
- More variance = larger sample needed
Power calculations are performed before the study to determine sample size.
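The effect of each component on sample size can be seen with a rough calculation. This sketch assumes one common textbook approximation for comparing two means with equal group sizes, n per group = 2 × ((z for alpha + z for power) × SD / effect size)². The function name and the numbers are hypothetical, and real trials would normally use dedicated software or a statistician.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided comparison of two means."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha 0.05
    z_power = standard_normal.inv_cdf(power)          # about 0.84 for 80% power
    n = 2 * ((z_alpha + z_power) * sd / effect_size) ** 2
    return math.ceil(n)

baseline = sample_size_per_group(effect_size=5, sd=10)
print(baseline)  # 63 per group

print(sample_size_per_group(effect_size=10, sd=10))             # bigger effect: fewer
print(sample_size_per_group(effect_size=5, sd=10, alpha=0.01))  # smaller alpha: more
print(sample_size_per_group(effect_size=5, sd=10, power=0.90))  # higher power: more
print(sample_size_per_group(effect_size=5, sd=20))              # more variance: more
```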
Lowering a diagnostic threshold (e.g. troponin cut-off) increases sensitivity but decreases specificity, so there are more false positives (analogous to a Type I error).
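Concealed allocation: The person enrolling a participant cannot know or predict the upcoming group assignment until the participant has been entered into the trial (e.g. sequentially numbered, opaque, sealed envelopes or central randomisation). Prevents selection bias at enrolment. It is not the same as blinding, which hides the assignment AFTER allocation.
Randomisation: Assigning participants to groups by chance AFTER enrolment. Balances known and unknown confounders, so groups are comparable at baseline on average.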
Blinding:
- Single-blind: patient doesn't know their group
- Double-blind: patient + researcher don't know
- Triple-blind: patient + researcher + assessor don't know
Blinding reduces observer bias and placebo effects.
Surrogate endpoint: A proxy marker used instead of the actual clinical outcome. E.g. DVT on Doppler as a surrogate for PE in pregnancy (where CTPA radiation is a concern).
Kappa coefficient: Measures inter-rater reliability. 0 = no agreement beyond chance. 0.5 = moderate agreement. 1.0 = perfect agreement.
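"Agreement beyond chance" is easier to see with numbers. The sketch below computes Cohen's kappa for two raters making a yes/no call, using the standard formula kappa = (observed agreement − expected agreement) / (1 − expected agreement). The function name and the counts are hypothetical.

```python
def cohens_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Cohen's kappa for two raters and a binary rating."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_yes_b_no) / n   # proportion rater A calls "yes"
    b_yes = (both_yes + a_no_b_yes) / n   # proportion rater B calls "yes"
    # Agreement expected if both raters were guessing at their own rates
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (observed - expected) / (1 - expected)

# Hypothetical: two clinicians reporting 100 X-rays as fracture / no fracture
kappa = cohens_kappa(both_yes=40, a_yes_b_no=10, a_no_b_yes=10, both_no=40)
print(round(kappa, 2))  # 0.6
```

The raters agree on 80% of films, but half of that agreement would be expected by chance alone, so kappa is 0.6 rather than 0.8.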
Demographic table: Compares baseline characteristics between groups using p-values. You WANT p > 0.05: this suggests there is no statistically significant difference at baseline, so the groups are comparable.
- Selection bias: Systematic differences in who is recruited. Groups are not representative of the target population.
- Recall bias: Participants with disease remember exposures differently from those without. Common in case-control studies.
- Observer/detection bias: Assessors measure outcomes differently based on knowledge of group allocation. Prevented by blinding.
- Attrition bias: Systematic differences in dropouts between groups. If sicker patients drop out of the treatment arm, results look falsely good.
- Publication bias: Positive studies are more likely to be published than negative ones. Detected with funnel plots.
- Lead-time bias: Earlier detection appears to increase survival time without actually changing outcomes. Common in screening studies.
- Hawthorne effect: Participants change behaviour because they know they're being observed, not because of the intervention itself.
Confounding: A third variable affects both the exposure and the outcome, creating a spurious association. Controlled by randomisation, matching, stratification, or multivariate analysis.
The Jadad score (Oxford quality scoring system) assesses RCT quality on a scale of 0–5:
- Was the study described as randomised? (+1)
- Was the randomisation method appropriate? (+1) — e.g. computer-generated, NOT alternate allocation
- Was the study described as double-blind? (+1)
- Was the blinding method appropriate? (+1) — e.g. identical placebo
- Were withdrawals and dropouts described? (+1) — numbers and reasons in each group
Deductions: −1 if randomisation method inappropriate. −1 if blinding method inappropriate.
Score ≤2 = low quality. Score ≥3 = high quality.
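The scoring rules above can be written as a small function. This is a sketch of one reading of the scale, not an official implementation: the function and parameter names are hypothetical, `None` is assumed to mean "method not described" (no bonus point and no deduction), and a method can only gain or lose a point if the study is described as randomised or double-blind in the first place.

```python
from typing import Optional

def jadad_score(randomised: bool,
                randomisation_appropriate: Optional[bool],
                double_blind: bool,
                blinding_appropriate: Optional[bool],
                withdrawals_described: bool) -> int:
    """Jadad score (0 to 5). None = method not described."""
    score = 0
    if randomised:
        score += 1
        if randomisation_appropriate is True:
            score += 1   # e.g. computer-generated sequence
        elif randomisation_appropriate is False:
            score -= 1   # e.g. alternate allocation
    if double_blind:
        score += 1
        if blinding_appropriate is True:
            score += 1   # e.g. identical placebo
        elif blinding_appropriate is False:
            score -= 1
    if withdrawals_described:
        score += 1
    return max(score, 0)

def jadad_quality(score: int) -> str:
    return "high quality" if score >= 3 else "low quality"

# Hypothetical trial: alternate allocation, open label, dropouts reported
example = jadad_score(True, False, False, None, True)
print(example, jadad_quality(example))  # 1 low quality
```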
Other quality assessment tools:
- CONSORT: Reporting standard for RCTs
- PRISMA: Reporting standard for systematic reviews
- GRADE: Overall evidence quality rating system
- Newcastle-Ottawa: Quality assessment for cohort and case-control studies
P-value <0.05 = statistically significant by convention. This does NOT mean clinically important — a huge trial can find a tiny, meaningless difference that reaches statistical significance.
Confidence interval: The range within which the true value likely lies. If the 95% CI crosses 1.0 (for ratios like RR/OR) or 0 (for differences), the result is not statistically significant at the 5% level, which corresponds to p > 0.05.
- Narrow CI = precise estimate
- Wide CI = imprecise estimate (often from small sample)
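The points above about crossing 1.0 and about sample size can be illustrated with a relative risk and its 95% CI. This sketch assumes the usual large-sample method (a normal approximation on the log scale); the function name and both trials are hypothetical.

```python
import math
from statistics import NormalDist

def risk_ratio_ci(treated_events, treated_n, control_events, control_n, level=0.95):
    """Relative risk with a confidence interval from the log-scale approximation."""
    rr = (treated_events / treated_n) / (control_events / control_n)
    se_log_rr = math.sqrt(1 / treated_events - 1 / treated_n
                          + 1 / control_events - 1 / control_n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical large trial: 15/100 events on treatment vs 30/100 on control
large = risk_ratio_ci(15, 100, 30, 100)
# Hypothetical small trial with the same event rates: 3/20 vs 6/20
small = risk_ratio_ci(3, 20, 6, 20)

for name, (rr, lower, upper) in (("large", large), ("small", small)):
    verdict = "crosses 1.0" if lower <= 1 <= upper else "excludes 1.0"
    print(f"{name}: RR {rr:.2f} (95% CI {lower:.2f} to {upper:.2f}), {verdict}")
```

Both trials have the same point estimate (RR 0.5), but only the larger trial has a CI narrow enough to exclude 1.0.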
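Heterogeneity (I²) in meta-analysis: The percentage of variation between study results that reflects real differences rather than chance. >50% is substantial: you should question whether the studies should be combined at all.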
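Intention-to-treat (ITT) vs per-protocol: ITT analyses everyone in the group to which they were randomised, including dropouts and those who did not follow the protocol. It preserves the benefits of randomisation, is more conservative, and limits attrition bias. Per-protocol only analyses those who completed the study as allocated, which risks overestimating the treatment effect.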
Forest plot (blobogram): Displays meta-analysis results. Each horizontal line represents one study: the square is the point estimate (bigger square = larger study = more weight), the line is the confidence interval. The diamond at the bottom is the overall pooled effect, and its width is the pooled confidence interval. The vertical line of no effect sits at 1.0 for ratios (RR/OR) or 0 for differences. If the diamond touches or crosses this line, the overall result is not significant.
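For a feel of where the diamond and the I² figure come from, here is a minimal sketch of a fixed-effect (inverse-variance) pooled estimate with Cochran's Q and I². It is one common approach, not the only one (random-effects models are often used instead); the function name and the three study results are hypothetical.

```python
import math

def fixed_effect_pool(studies):
    """studies: list of (ratio estimate, standard error of the log ratio)."""
    log_effects = [math.log(ratio) for ratio, _ in studies]
    weights = [1 / se ** 2 for _, se in studies]  # precise studies weigh more
    total_weight = sum(weights)
    pooled_log = sum(w * y for w, y in zip(weights, log_effects)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    # Cochran's Q: weighted squared distance of each study from the pooled value
    q = sum(w * (y - pooled_log) ** 2 for w, y in zip(weights, log_effects))
    df = len(studies) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "pooled": math.exp(pooled_log),
        "ci_lower": math.exp(pooled_log - 1.96 * pooled_se),
        "ci_upper": math.exp(pooled_log + 1.96 * pooled_se),
        "i_squared": i_squared,
    }

# Hypothetical relative risks and standard errors from three trials
result = fixed_effect_pool([(0.70, 0.20), (0.80, 0.10), (0.60, 0.30)])
print({name: round(value, 2) for name, value in result.items()})
```

In this example the study with the smallest standard error dominates the pooled estimate, the diamond (roughly 0.64 to 0.90) stays clear of 1.0, and I² is 0% because the three results are consistent with one another.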
Funnel plot: Used in a meta-analysis to detect publication bias. Studies are plotted by effect size (x-axis) vs precision/sample size (y-axis). In an unbiased sample, dots should be symmetrically distributed around the central line like an inverted funnel. Asymmetry suggests publication bias: smaller studies with negative results may not have been published.
Kaplan-Meier curve: Survival analysis over time. The y-axis is survival probability, x-axis is time. Steps downward represent individual events (e.g. deaths). Two curves are plotted (treatment vs control) — separation between the curves indicates a treatment effect. The greater the separation, the larger the effect.
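The downward steps come from a simple running calculation: at each event time, survival is multiplied by the fraction of at-risk patients who did not have the event. A minimal sketch follows, with a hypothetical function name and hypothetical follow-up data; it also handles censored patients (those lost to follow-up or still alive at the end), who leave the at-risk group without causing a step.

```python
def kaplan_meier(times, events):
    """Return (time, survival probability) at each time an event occurs.

    times:  follow-up time for each patient
    events: True if the event (e.g. death) occurred, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        event_count = 0
        leaving = 0
        # Group everyone who leaves the study at this time point
        while i < len(data) and data[i][0] == current_time:
            event_count += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if event_count:
            survival *= (at_risk - event_count) / at_risk
            curve.append((current_time, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up in months for six patients
curve = kaplan_meier(times=[2, 3, 3, 5, 8, 10],
                     events=[True, True, False, True, False, True])
for month, probability in curve:
    print(month, round(probability, 3))
```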
Scatter plot: Shows the relationship between two continuous variables. Dots trending upward = positive correlation, downward = negative, no pattern = no correlation. A line of best fit may be drawn. The r-value (correlation coefficient) measures strength: closer to 1 or -1 = stronger. Important: correlation does NOT equal causation.
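The r-value can be calculated by hand from the paired data. This sketch implements the standard Pearson correlation formula; the function name and the data points are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

# Hypothetical paired measurements
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 2))  # 0.77, a fairly strong positive correlation
```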
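Box & whisker plot: Summarises the distribution of data. The line inside the box = median. The box edges = interquartile range (IQR, 25th to 75th percentile). The whiskers extend to the most extreme values that are not outliers (conventionally within 1.5 × IQR of the box; some plots run them to the minimum and maximum instead). Dots beyond the whiskers = outliers. Useful for comparing distributions between groups: if the boxes don't overlap, a real difference is likely, but this is a visual cue rather than a formal test.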
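The numbers behind a box plot can be worked out directly. This sketch assumes the common 1.5 × IQR convention for whiskers and outliers and uses Python's "inclusive" quartile method (other software may give slightly different quartiles); the function name and the waiting times are hypothetical.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Median, quartiles, whisker ends and outliers for a box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

# Hypothetical waiting times in hours
summary = box_plot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

Here the median is 6, the box runs from 4 to 8, the whiskers reach 2 and 9, and the single value of 30 is plotted as an outlier.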
ROC curve: See Section 3 (Diagnostic Formulae) above for full detail including AUC values and illustration.
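Audit: Measures current practice against an existing STANDARD. Follows the audit cycle (measure against the standard, implement change, re-audit); changes are often tested using PDSA (Plan-Do-Study-Act) cycles. Must re-audit to close the loop. Does NOT need ethics approval.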
Research: Generates NEW knowledge. Tests a hypothesis. NEEDS ethics approval.
QI methodologies:
- EBCD (Experience-Based Co-Design): Uses patient and staff experiences to drive improvement
- Driver diagrams: Primary aim → primary drivers → secondary drivers → change ideas
- 5 Whys / 3 Whys: Root cause analysis — keep asking "why?" until you reach the underlying cause
- Ishikawa (fishbone) diagram: Categorises causes into groups (people, process, equipment, environment, etc.)
- Pareto analysis (80/20 rule): 80% of problems come from 20% of causes. Fix the top causes first for maximum impact.
- Process mapping: Visualising the patient journey to identify bottlenecks and waste
- Run charts / SPC charts: Tracking improvement over time. Distinguishes special cause variation (something changed) from common cause variation (normal fluctuation).
- Gantt chart: Project timeline showing tasks, duration, and dependencies
You do NOT need 100% of data for PDSA — just enough, collected sequentially, to test your change idea.
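The Pareto analysis from the list above amounts to ranking causes by frequency and working down until the running total reaches about 80%. A minimal sketch, with a hypothetical function name and hypothetical incident counts:

```python
def pareto_vital_few(cause_counts, cutoff=0.8):
    """Return the most frequent causes that together reach the cutoff share."""
    total = sum(cause_counts.values())
    ranked = sorted(cause_counts.items(), key=lambda item: item[1], reverse=True)
    selected = []
    running_total = 0
    for cause, count in ranked:
        selected.append(cause)
        running_total += count
        if running_total / total >= cutoff:
            break
    return selected

# Hypothetical causes of delayed analgesia from 100 incident reports
delays = {
    "Delayed triage": 45,
    "Missing drug chart": 35,
    "No cubicle available": 8,
    "Equipment fault": 6,
    "Transport delay": 4,
    "Other": 2,
}
vital_few = pareto_vital_few(delays)
print(vital_few)  # ['Delayed triage', 'Missing drug chart']
```

In this made-up dataset, two of the six causes account for 80% of the delays, so those are the ones to target first.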
These are free marks
Most candidates panic when they see a stats question and guess. If you learn the formulae, recognise the graph types, and understand the basic concepts, you'll pick up 5–10 marks that others throw away. That's the difference between passing and failing.
Recommended Reading
"Critical Appraisal for FCEM" — Bootland, Coughlan, Galloway & Goubet (CRC Press). Covers everything you need for SLO 10/11 in plain language.