GTM-001 Active: Inbox Provider RCT — Day 14 of 42 · 15,360 emails deployed across 4 arms · SRM check passed — chi-square p = 0.71 · Guardrail metrics nominal — all arms below threshold · GTM-002 pre-registration drops April 14 · Subscriber vote open: AI SDR vs. Human — 847 vs. 312
Independent GTM Research

GTM vendors make
claims. We run
controlled experiments.

Pre-registered, statistically rigorous A/B experiments on cold email infrastructure, outbound tooling, and go-to-market strategy. No vendor relationships. No conflicts. Just data.

5 Reports Planned
95% Confidence Threshold
0 Vendor Relationships
Active & Upcoming Experiments
EXPERIMENT GTM-001 · RCT · 4-ARM
● Active — Day 14/42

Inbox Provider Performance: Mailreef vs. PremiumInboxes vs. ZapMail vs. Custom SMTP

n = 15,360 emails · Metric: Positive reply rate · Power: 80%
GTM-002 · UPCOMING
AI SDR vs. Human-Written Sequences: A Controlled Comparison
GTM-003 · UPCOMING
Data Enrichment Accuracy: Apollo vs. ZoomInfo vs. Clay Waterfall
GTM-004 · SUBSCRIBER VOTE
Follow-Up Sequence Length: 2-Step vs. 4-Step vs. 6-Step
Pre-Registration Protocol · Bonferroni Correction Applied · Sample Ratio Mismatch Detection · Stratified Randomization · Blind Reply Classification · RCT Design Framework · Guardrail Metric Monitoring · Raw Data Published · Null Results Reported · Zero Vendor Relationships
The Problem With GTM Research

Every benchmark you've ever read was published by someone selling you something.

Lemlist analyzed 182 million emails and published reply rate benchmarks. Instantly published a deliverability report from billions of platform sends. ZoomInfo claims 95% data accuracy. Every AI SDR vendor promises 4-7x conversion lifts. None of it is independently verified. None of it uses a control group. None of it is pre-registered.

Observational platform data, no matter how large, cannot establish causation. A vendor reporting high reply rates from their users is reporting selection bias — their users are good at cold email, or picked the platform because it works for them. You cannot attribute the result to the product.

GTM Reports runs randomized controlled trials. Every experiment is pre-registered before a single email sends. Hypotheses are committed to in public. Analysis follows the pre-registered plan. Null results are published in full. The methodology section is longer than the findings section. Raw data is available to every subscriber.

This is what accountability looks like in a space that has never had it.

§ 01

How Every Experiment
Gets Built

01

Hypothesis & Pre-Registration

Directional hypothesis, primary metric, sample size calculation, and analysis plan published publicly before any data is collected. Timestamped and immutable.

02

Control Variable Documentation

Every non-test variable — domain age, warmup protocol, DNS config, sending cadence, ICP — held constant or stratified across arms before launch.

03

Stratified Randomization

Leads assigned to arms using stratified random allocation by receiving inbox provider, industry, and seniority. Chi-square balance check required to proceed.
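The allocation step above can be sketched in a few lines: shuffle leads within each (provider, industry, seniority) stratum, then deal them round-robin across arms. This is a hypothetical sketch, not the pre-registered protocol, which defines the exact procedure and seed handling.

```python
import random
from collections import defaultdict

def stratified_assign(leads, arms, seed=42):
    """Shuffle leads within each stratum, then deal them
    round-robin across arms so every stratum stays balanced."""
    rng = random.Random(seed)  # fixed seed -> reproducible allocation
    strata = defaultdict(list)
    for lead in leads:
        key = (lead["provider"], lead["industry"], lead["seniority"])
        strata[key].append(lead)
    assignment = {}
    for stratum_leads in strata.values():
        rng.shuffle(stratum_leads)
        for i, lead in enumerate(stratum_leads):
            assignment[lead["id"]] = arms[i % len(arms)]  # round-robin deal
    return assignment
```

Because the deal is round-robin within each stratum, arm counts per stratum differ by at most one, which the chi-square balance check then confirms.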

04

Guardrail Monitoring

Daily checks on spam complaint rate, bounce rate, and blacklist status. Sample Ratio Mismatch tested weekly. Primary metrics never observed until runtime ends.
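The daily guardrail check above reduces to comparing each arm's rates against pre-set ceilings. The thresholds below are illustrative placeholders, not the program's pre-registered values:

```python
# Illustrative guardrail ceilings -- the real values live in the pre-registration.
GUARDRAILS = {"spam_complaint_rate": 0.001, "bounce_rate": 0.03}

def guardrail_breaches(arm_metrics):
    """Return (arm, metric) pairs that exceed a ceiling; empty means nominal."""
    return [
        (arm, metric)
        for arm, metrics in arm_metrics.items()
        for metric, ceiling in GUARDRAILS.items()
        if metrics[metric] > ceiling
    ]
```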

05

Pre-Registered Analysis

Primary analysis runs exactly as documented. Bonferroni correction applied for multiple comparisons. Exploratory segmentation clearly labeled as such. Null results published in full.
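The primary analysis described above is, at its core, a two-proportion z-test per pairwise comparison, judged against the Bonferroni-adjusted threshold. A minimal sketch using only the standard library; the reply counts are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# A 4-arm experiment has 6 pairwise comparisons, so each one is
# judged at alpha = 0.05 / 6, not 0.05.
z, p = two_proportion_z(96, 3840, 58, 3840)  # hypothetical reply counts
significant = p < 0.05 / 6
```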

§ 02

The Research Program

GTM-001 — Inbox Provider Performance: Mailreef vs. PremiumInboxes vs. ZapMail vs. Custom SMTP
Design: 4-arm RCT · Controlled for domain age, warmup, copy, ICP, sending volume
n = 15,360 · Primary metric: Positive reply rate · Status: ● Active
Pre-registered. Results publish Day 60.

GTM-002 — AI-Generated vs. Human-Written Sequences: Claude Agents SDK vs. Manual Copy
Design: 2-arm RCT · Identical ICP, infrastructure, and sending cadence
n = 6,400 · Primary metric: Positive reply rate · Status: Upcoming
Pre-registration opens for subscriber review April 14.

GTM-003 — Data Enrichment Accuracy: Apollo vs. ZoomInfo vs. Clay Waterfall vs. Manual
Design: Ground-truth verification of 10,000 records against LinkedIn and direct confirmation
n = 10,000 · Primary metric: Email validity rate at 90 days · Status: Upcoming
Subscriber vote confirmed. Design phase Q3.

GTM-004 — Follow-Up Sequence Length: 2-Step vs. 4-Step vs. 6-Step
Design: 3-arm RCT · Identical step-one copy, randomized interval timing
n = 9,600 · Primary metric: Positive reply rate · Status: Upcoming
Open for subscriber vote.

GTM-005 — Sending Volume Per Mailbox: 30 vs. 50 vs. 75 Emails/Day
Design: Deliverability degradation study over 42-day window with seed account measurement
n = TBD · Primary metric: Inbox placement rate at Day 42 · Status: Upcoming
Open for subscriber vote.
§ 03

Methodological
Standards

Σ

Pre-Registration

Every hypothesis, primary metric, sample size calculation, and analysis plan is published publicly before the experiment begins. This prevents p-hacking and post-hoc rationalization — the most common sources of false positive research.

Established best practices for trustworthy online controlled experiments
α

Bonferroni Correction

Multi-arm experiments apply Bonferroni correction for all pairwise comparisons. A 4-arm experiment tests 6 pairs — the significance threshold adjusts to 0.0083, not 0.05. Uncorrected multi-arm tests produce false positives at 26%+ rates.
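The arithmetic behind the adjusted threshold is straightforward, and it also reproduces the 26%+ family-wise error figure for uncorrected tests:

```python
from math import comb

arms = 4
pairs = comb(arms, 2)          # 6 pairwise comparisons for 4 arms
alpha_adjusted = 0.05 / pairs  # 0.00833... per comparison
# Probability of at least one false positive if each pair is
# (wrongly) tested at 0.05 with no correction:
fwer_uncorrected = 1 - (1 - 0.05) ** pairs  # ~0.265
```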

Standard for family-wise error rate control in RCT analysis
χ²

Sample Ratio Mismatch

Weekly chi-square goodness-of-fit test verifies actual experimental splits match intended allocation. SRM indicates implementation errors, differential filtering, or provider throttling that would invalidate results regardless of sample size.

Standard trustworthiness check for online controlled experiments
n

Adequate Sample Sizes

Most cold email "A/B tests" use 100-500 emails per variant — sufficient only to detect 100%+ relative lifts. GTM Reports targets 3,200+ per arm at 80% power and 95% confidence for a minimum detectable effect of 0.8 percentage points.
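The stated numbers can be checked with the standard two-proportion sample-size formula, using only the standard library. The 1% baseline positive reply rate below is an assumption for illustration; the pre-registration fixes the real baseline:

```python
from math import sqrt, ceil
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_b = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# Baseline 1.0% vs. 1.8% (a 0.8-point MDE), 95% confidence, 80% power:
n = n_per_arm(0.010, 0.018)  # roughly 3,400 per arm, consistent with "3,200+"
```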

Two-proportion Z-test with Bonferroni-adjusted alpha threshold

Null Result Publication

If an experiment finds no statistically significant difference, that result is published in full. "No detectable difference between these providers" is a finding with direct operational value. Unpublished null results are a scientific integrity failure.

Publication bias prevention — standard in academic research

Blinded Classification

Reply classification (Positive / Neutral / Negative) is performed without knowledge of which experimental arm produced the reply. Arm labels are added to the dataset only after classification is complete, preventing unconscious scoring bias.
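Mechanically, the blinding described above means the coder sees only reply text and an opaque ID; arm labels are joined back afterwards. A minimal sketch, where `classify_reply` is a hypothetical stand-in for the human or model coder:

```python
def blind_classify(replies, arm_map, classify_reply):
    """replies: [{'id': ..., 'text': ...}] with NO arm field.
    arm_map: id -> arm, kept separate until coding is complete."""
    coded = {r["id"]: classify_reply(r["text"]) for r in replies}
    # Arm labels are joined only after every reply has been classified.
    return [
        {"id": rid, "label": label, "arm": arm_map[rid]}
        for rid, label in coded.items()
    ]
```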

Double-blind reply coding with inter-rater reliability check
§ 04

Subscribers Vote on
What Gets Tested Next

Annual subscribers set the research agenda. Current vote is live.

GTM-006 · CANDIDATE · 847 votes

AI SDR Platform Claims: Independent Performance Audit of 6 Vendors

Test vendor-claimed conversion rate lifts against a human SDR baseline using identical ICP, copy, and infrastructure. The first independent validation of the AI SDR category.

GTM-007 · CANDIDATE · 312 votes

Sending Time and Day-of-Week Effects on Cold Email Reply Rate

A controlled experiment isolating send timing as the sole variable across 8 arms — Monday through Friday at peak and off-peak windows — with identical copy and ICP.

GTM-008 · CANDIDATE · 621 votes

Subject Line Length and Personalization: A Factorial Experiment

2x2 factorial design testing subject line length (short vs. long) and personalization level (generic vs. company-specific) with positive reply rate as primary outcome.

GTM-009 · CANDIDATE · 489 votes

Clay Workflow Conversion: Waterfall Enrichment vs. Single Source Accuracy

Tests whether waterfall enrichment through Clay produces meaningfully better deliverability and reply rates than single-source data at equivalent per-lead cost.

Get Access to Vote — $499/year
§ 05

Why You Can
Trust This

The GTM research ecosystem has one structural problem: every entity publishing benchmarks has a financial interest in the outcome. Platform data comes from platforms. Comparison studies come from companies adjacent to the tools being compared. There is no independent layer.

GTM Reports is structurally different. Our business model — annual subscriber fees — means we profit when findings are trustworthy and lose subscribers when they are not. This is not a stated value. It is an incentive structure.

GTM Reports generates revenue exclusively from subscriber access fees. We hold no equity, affiliate arrangements, or paid partnerships with any vendor whose performance is the subject of a report. Tools used as experimental constants across all arms are disclosed as such in every methodology section. Where free accounts are provided by vendors for inclusion in an experiment, this is disclosed and the vendor has no influence over test design, analysis, or published findings. Our incentive is accuracy. Our business model depends on it.

§ 06

Access the
Research

Individual Report
Single Report
$199
one-time purchase

Full access to a single published report — methodology, findings, raw data, and statistical code.

  • Complete methodology documentation
  • Full statistical output and confidence intervals
  • Raw data export (CSV)
  • R and Python analysis code
  • Permanent access to that report
Purchase Report
Teams
Agency Access
$2,499
per year · 10 seats

Full annual access for your entire team, plus quarterly benchmark briefings and priority pre-registration review.

  • 10 seats — full annual access
  • Quarterly briefings with the researcher
  • Priority review of pre-registration documents
  • Custom segmentation requests for published data
  • Whitelabeled summaries for client reports
Contact for Access

Billed annually. No monthly option — the research cadence is 6-8 week experiments, not monthly content.