Pre-registered, statistically rigorous A/B experiments on cold email infrastructure, outbound tooling, and go-to-market strategy. No vendor relationships. No conflicts. Just data.
Lemlist analyzed 182 million emails and published reply rate benchmarks. Instantly published a deliverability report from billions of platform sends. ZoomInfo claims 95% data accuracy. Every AI SDR vendor promises 4-7x conversion lifts. None of it is independently verified. None of it uses a control group. None of it is pre-registered.
Observational platform data, no matter how large, cannot establish causation. A vendor reporting high reply rates among its users is reporting selection bias: its users are already skilled at cold email, or chose the platform because it works for them. You cannot attribute the result to the product.
GTM Reports runs randomized controlled trials. Every experiment is pre-registered before a single email sends. Hypotheses are committed to in public. Analysis follows the pre-registered plan. Null results are published in full. The methodology section is longer than the findings section. Raw data is available to every subscriber.
This is what accountability looks like in a space that has never had it.
Directional hypothesis, primary metric, sample size calculation, and analysis plan published publicly before any data is collected. Timestamped and immutable.
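One way a pre-registration can be made tamper-evident is to publish a cryptographic hash of the plan before launch: any later edit changes the digest. The sketch below is illustrative only; the field names and values are hypothetical, not GTM Reports' actual schema.

```python
import hashlib
import json

# Hypothetical pre-registration fields; the real plan contains whatever
# the published document specifies.
prereg = {
    "experiment": "GTM-001",
    "primary_metric": "positive_reply_rate",
    "n_per_arm": 3840,
    "analysis": "two-proportion z-test, Bonferroni alpha = 0.05 / 6",
}
# Canonical JSON (sorted keys, no whitespace) so the hash is reproducible
canonical = json.dumps(prereg, sort_keys=True, separators=(",", ":")).encode()
digest = hashlib.sha256(canonical).hexdigest()
print(digest)  # publish before launch; any later edit changes the hash
```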
Every non-test variable (domain age, warmup protocol, DNS config, sending cadence, ICP) is held constant or stratified across arms before launch.
Leads assigned to arms using stratified random allocation by receiving inbox provider, industry, and seniority. Chi-square balance check required to proceed.
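The allocation step above can be sketched as follows: shuffle within each stratum, deal round-robin across arms, then run a chi-square balance check on the arm sizes. The lead fields (`provider`, `industry`, `seniority`) and the equal-split assumption are illustrative, not the production implementation.

```python
import random
from collections import defaultdict

def stratified_assign(leads, arms, seed=42):
    """Stratified random allocation: shuffle within each stratum,
    then deal round-robin so every arm gets a near-equal share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for lead in leads:
        strata[(lead["provider"], lead["industry"], lead["seniority"])].append(lead)
    assignment = {arm: [] for arm in arms}
    for bucket in strata.values():
        rng.shuffle(bucket)
        for i, lead in enumerate(bucket):
            assignment[arms[i % len(arms)]].append(lead)
    return assignment

def chi_square_stat(counts):
    """Goodness-of-fit statistic against equal expected arm sizes."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

arms = ["A", "B", "C", "D"]
leads = [{"provider": p, "industry": "saas", "seniority": s}
         for p in ("gmail", "outlook")
         for s in ("vp", "director")
         for _ in range(400)]
assignment = stratified_assign(leads, arms)
counts = [len(assignment[a]) for a in arms]
stat = chi_square_stat(counts)
print(counts, stat < 7.815)  # 7.815 = chi-square critical value, df=3, alpha=0.05
```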
Daily checks on spam complaint rate, bounce rate, and blacklist status. Sample Ratio Mismatch is tested weekly. Primary metrics are never observed until the run window ends.
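A minimal version of the SRM check might look like this. The 0.05 alpha (critical value 3.841 at df = 1) is illustrative; production SRM checks commonly use a much stricter threshold such as 0.001.

```python
def srm_check(observed, intended, critical=3.841):
    """Sample Ratio Mismatch check via chi-square goodness of fit.

    observed: actual unit counts per arm, e.g. [5020, 4980]
    intended: planned allocation ratios, e.g. [0.5, 0.5]
    critical: chi-square critical value for df = len(observed) - 1
              (3.841 corresponds to alpha = 0.05 with two arms)
    Returns (statistic, passed).
    """
    total = sum(observed)
    stat = sum((o - total * r) ** 2 / (total * r)
               for o, r in zip(observed, intended))
    return stat, stat < critical

# Healthy 50/50 split with normal random wobble
print(srm_check([5020, 4980], [0.5, 0.5]))
# Broken split: one arm silently losing ~8% of its traffic
print(srm_check([5000, 4600], [0.5, 0.5]))
```

A failed SRM check means the randomization itself is broken, so the correct response is to investigate the pipeline, not to analyze the metrics.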
Primary analysis runs exactly as documented. Bonferroni correction applied for multiple comparisons. Exploratory segmentation clearly labeled as such. Null results published in full.
| ID | Experiment | n | Primary Metric | Status | Finding |
|---|---|---|---|---|---|
| GTM-001 | Inbox Provider Performance: Mailreef vs. PremiumInboxes vs. ZapMail vs. Custom SMTP · 4-arm RCT · Controlled for domain age, warmup, copy, ICP, sending volume | 15,360 | Positive reply rate | ● Active | Pre-registered. Results publish Day 60. |
| GTM-002 | AI-Generated vs. Human-Written Sequences: Claude Agents SDK vs. Manual Copy · 2-arm RCT · Identical ICP, infrastructure, and sending cadence | 6,400 | Positive reply rate | Upcoming | Pre-registration opens for subscriber review April 14. |
| GTM-003 | Data Enrichment Accuracy: Apollo vs. ZoomInfo vs. Clay Waterfall vs. Manual · Ground-truth verification of 10,000 records against LinkedIn and direct confirmation | 10,000 | Email validity rate at 90 days | Upcoming | Subscriber vote confirmed. Design phase Q3. |
| GTM-004 | Follow-Up Sequence Length: 2-Step vs. 4-Step vs. 6-Step · 3-arm RCT · Identical step-one copy, randomized interval timing | 9,600 | Positive reply rate | Upcoming | Open for subscriber vote. |
| GTM-005 | Sending Volume Per Mailbox: 30 vs. 50 vs. 75 Emails/Day · Deliverability degradation study over a 42-day window with seed account measurement | TBD | Inbox placement rate at Day 42 | Upcoming | Open for subscriber vote. |
Every hypothesis, primary metric, sample size calculation, and analysis plan is published publicly before the experiment begins. This prevents p-hacking and post-hoc rationalization, the most common sources of false-positive research findings.
*Established best practices for trustworthy online controlled experiments.*

Multi-arm experiments apply Bonferroni correction for all pairwise comparisons. A 4-arm experiment tests 6 pairs, so the per-comparison significance threshold drops to 0.0083 (0.05 / 6), not 0.05. Uncorrected multi-arm tests produce false positives at rates above 26%.
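A sketch of that pairwise analysis: six two-proportion Z-tests evaluated against the Bonferroni-adjusted threshold. The reply counts are synthetic, and the stdlib-only p-value uses the erfc identity for a two-sided normal tail.

```python
import math
from itertools import combinations

def two_prop_z(x1, n1, x2, n2):
    """Two-sided two-proportion Z-test; returns (z, p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

# Synthetic per-arm results: (positive replies, sends)
arms = {"A": (96, 3200), "B": (70, 3200), "C": (68, 3200), "D": (65, 3200)}
pairs = list(combinations(arms, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: 6 pairwise tests -> 0.05 / 6
for a, b in pairs:
    z, p = two_prop_z(*arms[a], *arms[b])
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: z={z:+.2f} p={p:.4f} {verdict}")
```

Note how the correction bites: a pairwise p-value of 0.02 would pass an uncorrected 0.05 threshold but fails here.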
*Standard for family-wise error rate control in RCT analysis.*

A weekly chi-square goodness-of-fit test verifies that actual experimental splits match the intended allocation. SRM indicates implementation errors, differential filtering, or provider throttling that would invalidate results regardless of sample size.
*Standard trustworthiness check for online controlled experiments.*

Most cold email "A/B tests" use 100-500 emails per variant, sufficient only to detect relative lifts of 100% or more. GTM Reports targets 3,200+ emails per arm at 80% power and 95% confidence for a minimum detectable effect of 0.8 percentage points.
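A per-arm target of this order can be reproduced with the standard normal-approximation formula for comparing two proportions. The 1% baseline positive reply rate below is an assumption for illustration; each experiment's pre-registration states its own baseline, and higher baselines require larger samples for the same absolute effect.

```python
import math
from statistics import NormalDist

def n_per_arm(p_base, mde, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided two-proportion test (normal approximation).

    p_base: baseline positive reply rate (assumed)
    mde:    minimum detectable effect in absolute percentage points
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for two-sided alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    p_alt = p_base + mde
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_a + z_b) ** 2 * variance / mde ** 2)

# Assumed 1% baseline, 0.8 pp minimum detectable effect
print(n_per_arm(0.01, 0.008))
```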
*Two-proportion Z-test with Bonferroni-adjusted alpha threshold.*

If an experiment finds no statistically significant difference, that result is published in full. "No detectable difference between these providers" is a finding with direct operational value. Unpublished null results are a scientific integrity failure.
*Publication bias prevention, standard in academic research.*

Reply classification (Positive / Neutral / Negative) is performed without knowledge of which experimental arm produced the reply. Arm labels are added to the dataset only after classification is complete, preventing unconscious scoring bias.
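The blinding step can be sketched as a strip-shuffle-remerge workflow; the reply schema (`id`, `arm`, `text`) is a hypothetical illustration.

```python
import random

def blind(replies, seed=7):
    """Strip arm labels and shuffle; return the blinded batch plus a
    sealed id -> arm key used only after classification."""
    key = {r["id"]: r["arm"] for r in replies}
    batch = [{"id": r["id"], "text": r["text"]} for r in replies]
    random.Random(seed).shuffle(batch)
    return batch, key

def unblind(classified, key):
    """Re-attach arm labels once every reply has been scored."""
    return [{**c, "arm": key[c["id"]]} for c in classified]

replies = [{"id": i, "arm": "A" if i % 2 else "B", "text": f"reply {i}"}
           for i in range(4)]
batch, key = blind(replies)
assert all("arm" not in r for r in batch)  # the classifier never sees the arm
classified = [{**r, "label": "Positive"} for r in batch]
restored = unblind(classified, key)
```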
*Double-blind reply coding with inter-rater reliability check.*

Annual subscribers set the research agenda. The current vote is live.
Test vendor-claimed conversion rate lifts against a human SDR baseline using identical ICP, copy, and infrastructure. The first independent validation of the AI SDR category.
A controlled experiment isolating send timing as the sole variable across 8 arms — Monday through Friday at peak and off-peak windows — with identical copy and ICP.
2x2 factorial design testing subject line length (short vs. long) and personalization level (generic vs. company-specific) with positive reply rate as primary outcome.
Tests whether waterfall enrichment through Clay produces meaningfully better deliverability and reply rates than single-source data at equivalent per-lead cost.
The GTM research ecosystem has one structural problem: every entity publishing benchmarks has a financial interest in the outcome. Platform data comes from platforms. Comparison studies come from companies adjacent to the tools being compared. There is no independent layer.
GTM Reports is structurally different. Our business model — annual subscriber fees — means we profit when findings are trustworthy and lose subscribers when they are not. This is not a stated value. It is an incentive structure.
GTM Reports generates revenue exclusively from subscriber access fees. We hold no equity, affiliate arrangements, or paid partnerships with any vendor whose performance is the subject of a report. Tools used as experimental constants across all arms are disclosed as such in every methodology section. Where free accounts are provided by vendors for inclusion in an experiment, this is disclosed and the vendor has no influence over test design, analysis, or published findings. Our incentive is accuracy. Our business model depends on it.
Full access to a single published report — methodology, findings, raw data, and statistical code.
Full archive access, all future reports as published, subscriber voting rights, and raw data for every experiment.
Full annual access for your entire team, plus quarterly benchmark briefings and priority pre-registration review.
Billed annually. No monthly option — the research cadence is 6-8 week experiments, not monthly content.