Pre-registered, statistically rigorous A/B experiments on cold email infrastructure, outbound tooling, and go-to-market strategy. No vendor relationships. No conflicts. Just data.
Lemlist analyzed 182 million emails and published reply rate benchmarks. Instantly published a deliverability report from billions of platform sends. ZoomInfo claims 95% data accuracy. Every AI SDR vendor promises 4-7x conversion lifts. None of it is independently verified. None of it uses a control group. None of it is pre-registered.
Observational platform data, no matter how large, cannot establish causation. A vendor reporting high reply rates among its users is reporting selection bias: its users are already skilled at cold email, or chose the platform because it works for them. You cannot attribute the result to the product.
GTM Reports runs randomized controlled trials. Every experiment is pre-registered before a single email sends. Hypotheses are committed to in public. Analysis follows the pre-registered plan. Null results are published in full. The methodology section is longer than the findings section. Raw data is available to every subscriber.
This is what accountability looks like in a space that has never had it.
Directional hypothesis, primary metric, sample size calculation, and analysis plan published publicly before any data is collected. Timestamped and immutable.
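One way a pre-registration can be made tamper-evident is to publish a cryptographic hash of the plan before launch: any later edit changes the digest. The sketch below is illustrative only; the field names and values are hypothetical, not GTM Reports' actual schema.

```python
import hashlib
import json

# Hypothetical pre-registration fields; the real plan contains whatever
# the published document specifies.
prereg = {
    "experiment": "GTM-001",
    "primary_metric": "positive_reply_rate",
    "n_per_arm": 3840,
    "analysis": "two-proportion z-test, Bonferroni alpha = 0.05 / 6",
}
# Canonical JSON (sorted keys, no whitespace) so the hash is reproducible
canonical = json.dumps(prereg, sort_keys=True, separators=(",", ":")).encode()
digest = hashlib.sha256(canonical).hexdigest()
print(digest)  # publish before launch; any later edit changes the hash
```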
Every non-test variable (domain age, warmup protocol, DNS config, sending cadence, ICP) is held constant or stratified across arms before launch.
Leads assigned to arms using stratified random allocation by receiving inbox provider, industry, and seniority. Chi-square balance check required to proceed.
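The allocation step above can be sketched as follows: shuffle within each stratum, deal round-robin across arms, then run a chi-square balance check on the arm sizes. The lead fields (`provider`, `industry`, `seniority`) and the equal-split assumption are illustrative, not the production implementation.

```python
import random
from collections import defaultdict

def stratified_assign(leads, arms, seed=42):
    """Stratified random allocation: shuffle within each stratum,
    then deal round-robin so every arm gets a near-equal share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for lead in leads:
        strata[(lead["provider"], lead["industry"], lead["seniority"])].append(lead)
    assignment = {arm: [] for arm in arms}
    for bucket in strata.values():
        rng.shuffle(bucket)
        for i, lead in enumerate(bucket):
            assignment[arms[i % len(arms)]].append(lead)
    return assignment

def chi_square_stat(counts):
    """Goodness-of-fit statistic against equal expected arm sizes."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

arms = ["A", "B", "C", "D"]
leads = [{"provider": p, "industry": "saas", "seniority": s}
         for p in ("gmail", "outlook")
         for s in ("vp", "director")
         for _ in range(400)]
assignment = stratified_assign(leads, arms)
counts = [len(assignment[a]) for a in arms]
stat = chi_square_stat(counts)
print(counts, stat < 7.815)  # 7.815 = chi-square critical value, df=3, alpha=0.05
```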
Daily checks on spam complaint rate, bounce rate, and blacklist status. Sample Ratio Mismatch is tested weekly. Primary metrics are never observed until the run window ends.
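A minimal version of the SRM check might look like this. The 0.05 alpha (critical value 3.841 at df = 1) is illustrative; production SRM checks commonly use a much stricter threshold such as 0.001.

```python
def srm_check(observed, intended, critical=3.841):
    """Sample Ratio Mismatch check via chi-square goodness of fit.

    observed: actual unit counts per arm, e.g. [5020, 4980]
    intended: planned allocation ratios, e.g. [0.5, 0.5]
    critical: chi-square critical value for df = len(observed) - 1
              (3.841 corresponds to alpha = 0.05 with two arms)
    Returns (statistic, passed).
    """
    total = sum(observed)
    stat = sum((o - total * r) ** 2 / (total * r)
               for o, r in zip(observed, intended))
    return stat, stat < critical

# Healthy 50/50 split with normal random wobble
print(srm_check([5020, 4980], [0.5, 0.5]))
# Broken split: one arm silently losing ~8% of its traffic
print(srm_check([5000, 4600], [0.5, 0.5]))
```

A failed SRM check means the randomization itself is broken, so the correct response is to investigate the pipeline, not to analyze the metrics.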
Primary analysis runs exactly as documented. Bonferroni correction applied for multiple comparisons. Exploratory segmentation clearly labeled as such. Null results published in full.
| ID | Experiment | n | Primary Metric | Status | Finding |
|---|---|---|---|---|---|
| GTM-001 | Inbox Provider Performance: Mailreef vs. PremiumInboxes vs. ZapMail vs. Custom SMTP · 4-arm RCT · Controlled for domain age, warmup, copy, ICP, sending volume | 15,360 | Positive reply rate | ● Active | Pre-registered. Results publish Day 60. |
| GTM-002 | AI-Generated vs. Human-Written Sequences: Claude Agents SDK vs. Manual Copy · 2-arm RCT · Identical ICP, infrastructure, and sending cadence | 6,400 | Positive reply rate | Upcoming | Pre-registration opens for subscriber review April 14. |
| GTM-003 | Data Enrichment Accuracy: Apollo vs. ZoomInfo vs. Clay Waterfall vs. Manual · Ground-truth verification of 10,000 records against LinkedIn and direct confirmation | 10,000 | Email validity rate at 90 days | Upcoming | Subscriber vote confirmed. Design phase Q3. |
| GTM-004 | Follow-Up Sequence Length: 2-Step vs. 4-Step vs. 6-Step · 3-arm RCT · Identical step-one copy, randomized interval timing | 9,600 | Positive reply rate | Upcoming | Open for subscriber vote. |
| GTM-005 | Sending Volume Per Mailbox: 30 vs. 50 vs. 75 Emails/Day · Deliverability degradation study over a 42-day window with seed account measurement | TBD | Inbox placement rate at Day 42 | Upcoming | Open for subscriber vote. |
Every hypothesis, primary metric, sample size calculation, and analysis plan is published publicly before the experiment begins. This prevents p-hacking and post-hoc rationalization, the most common sources of false-positive research findings.
*Established best practices for trustworthy online controlled experiments.*

Multi-arm experiments apply Bonferroni correction for all pairwise comparisons. A 4-arm experiment tests 6 pairs, so the per-comparison significance threshold drops to 0.0083 (0.05 / 6), not 0.05. Uncorrected multi-arm tests produce false positives at rates above 26%.
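A sketch of that pairwise analysis: six two-proportion Z-tests evaluated against the Bonferroni-adjusted threshold. The reply counts are synthetic, and the stdlib-only p-value uses the erfc identity for a two-sided normal tail.

```python
import math
from itertools import combinations

def two_prop_z(x1, n1, x2, n2):
    """Two-sided two-proportion Z-test; returns (z, p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

# Synthetic per-arm results: (positive replies, sends)
arms = {"A": (96, 3200), "B": (70, 3200), "C": (68, 3200), "D": (65, 3200)}
pairs = list(combinations(arms, 2))
alpha = 0.05 / len(pairs)  # Bonferroni: 6 pairwise tests -> 0.05 / 6
for a, b in pairs:
    z, p = two_prop_z(*arms[a], *arms[b])
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: z={z:+.2f} p={p:.4f} {verdict}")
```

Note how the correction bites: a pairwise p-value of 0.02 would pass an uncorrected 0.05 threshold but fails here.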
*Standard for family-wise error rate control in RCT analysis.*

A weekly chi-square goodness-of-fit test verifies that actual experimental splits match the intended allocation. SRM indicates implementation errors, differential filtering, or provider throttling that would invalidate results regardless of sample size.
*Standard trustworthiness check for online controlled experiments.*

Most cold email "A/B tests" use 100-500 emails per variant, sufficient only to detect relative lifts of 100% or more. GTM Reports targets 3,200+ emails per arm at 80% power and 95% confidence for a minimum detectable effect of 0.8 percentage points.
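A per-arm target of this order can be reproduced with the standard normal-approximation formula for comparing two proportions. The 1% baseline positive reply rate below is an assumption for illustration; each experiment's pre-registration states its own baseline, and higher baselines require larger samples for the same absolute effect.

```python
import math
from statistics import NormalDist

def n_per_arm(p_base, mde, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided two-proportion test (normal approximation).

    p_base: baseline positive reply rate (assumed)
    mde:    minimum detectable effect in absolute percentage points
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for two-sided alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    p_alt = p_base + mde
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_a + z_b) ** 2 * variance / mde ** 2)

# Assumed 1% baseline, 0.8 pp minimum detectable effect
print(n_per_arm(0.01, 0.008))
```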
*Two-proportion Z-test with Bonferroni-adjusted alpha threshold.*

If an experiment finds no statistically significant difference, that result is published in full. "No detectable difference between these providers" is a finding with direct operational value. Unpublished null results are a scientific integrity failure.
*Publication bias prevention, standard in academic research.*

Reply classification (Positive / Neutral / Negative) is performed without knowledge of which experimental arm produced the reply. Arm labels are added to the dataset only after classification is complete, preventing unconscious scoring bias.
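The blinding step can be sketched as a strip-shuffle-remerge workflow; the reply schema (`id`, `arm`, `text`) is a hypothetical illustration.

```python
import random

def blind(replies, seed=7):
    """Strip arm labels and shuffle; return the blinded batch plus a
    sealed id -> arm key used only after classification."""
    key = {r["id"]: r["arm"] for r in replies}
    batch = [{"id": r["id"], "text": r["text"]} for r in replies]
    random.Random(seed).shuffle(batch)
    return batch, key

def unblind(classified, key):
    """Re-attach arm labels once every reply has been scored."""
    return [{**c, "arm": key[c["id"]]} for c in classified]

replies = [{"id": i, "arm": "A" if i % 2 else "B", "text": f"reply {i}"}
           for i in range(4)]
batch, key = blind(replies)
assert all("arm" not in r for r in batch)  # the classifier never sees the arm
classified = [{**r, "label": "Positive"} for r in batch]
restored = unblind(classified, key)
```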
*Double-blind reply coding with inter-rater reliability check.*

Annual subscribers set the research agenda. The current vote is live.
Test vendor-claimed conversion rate lifts against a human SDR baseline using identical ICP, copy, and infrastructure. The first independent validation of the AI SDR category.
A controlled experiment isolating send timing as the sole variable across 8 arms — Monday through Friday at peak and off-peak windows — with identical copy and ICP.
2x2 factorial design testing subject line length (short vs. long) and personalization level (generic vs. company-specific) with positive reply rate as primary outcome.
Tests whether waterfall enrichment through Clay produces meaningfully better deliverability and reply rates than single-source data at equivalent per-lead cost.
The GTM research ecosystem has one structural problem: every entity publishing benchmarks has a financial interest in the outcome. Platform data comes from platforms. Comparison studies come from companies adjacent to the tools being compared. There is no independent layer.
GTM Reports is structurally different. Our business model — annual subscriber fees — means we profit when findings are trustworthy and lose subscribers when they are not. This is not a stated value. It is an incentive structure.
GTM Reports generates revenue exclusively from subscriber access fees. We hold no equity, affiliate arrangements, or paid partnerships with any vendor whose performance is the subject of a report. Tools used as experimental constants across all arms are disclosed as such in every methodology section. Where free accounts are provided by vendors for inclusion in an experiment, this is disclosed and the vendor has no influence over test design, analysis, or published findings. Our incentive is accuracy. Our business model depends on it.
Full access to a single published report — methodology, findings, raw data, and statistical code.
Full archive access, all future reports as published, subscriber voting rights, and raw data for every experiment.
Full annual access for your entire team, plus quarterly benchmark briefings and priority pre-registration review.
Billed annually. No monthly option — the research cadence is 6-8 week experiments, not monthly content.