Subject Line Testing in a post-iOS 15 World for Email Marketers & Data Analysts

Like most other email marketers, my heart skipped a beat when I read about the iOS 15 Mail Privacy Protection feature that obscures open information on emails on iOS devices. As a statistician and devout believer in an ongoing test-and-learn strategy, I relied on open rate A/B testing as one of the core tools that allowed us to grow email conversions at a mid-size specialty retailer by 300% YoY for 3 years. When iOS 15 was announced, my team was sending approximately 350 million emails per year across 5 product lines with send sizes ranging from 30k to 4M. Most deployments were optimized for subject lines, often testing 2-5 subject lines per send on a portion of the population. Our ESP automatically deploys the winning subject line to the remainder of the population, so we are able to realize the benefits of a test immediately.

The thought of dulling that important optimization tool was definitely scary. Should we give up on subject line testing entirely? Would we still see differentiation, just less significant? Would we still be able to achieve statistically significant results? Or should we eschew open rate completely in favor of click rate?

Our team decided to take a conservative approach: we continued to subject line test on almost every email, but increased the test volume and scaled back subject line testing to 2 variations each. We decided that after 6 months, we would evaluate whether we were still frequently able to achieve statistically significant results and extract meaningful insights. See the “Real World Experience & Final Thoughts” section for the results of this project.

Impact of the iOS 15 change on subject line testing

With iOS 15, automatically registered opens obfuscate or “mute” the true impact of a test because a portion of the population is no longer reacting to the actual test. The table below shows a typical B2C email sender with an average 15% open rate pre-iOS 15. If they were to run a subject line test that creates 1 point of lift (e.g., the winning subject line opens at 16% vs. 15%), in a post-iOS 15 world the measured impact of the same test is closer to 0.6 pts, because the portion of the audience on iOS registers an open regardless of subject line.
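The muting effect can be written as a one-line calculation. The Python sketch below is illustrative only: the function name and the 40% auto-open share are my assumptions (the share is chosen simply because it reproduces the roughly 0.6 pt example above), and it treats auto-opening recipients as contributing no measurable lift at all.

```python
def measured_lift_pts(true_lift_pts: float, auto_open_share: float) -> float:
    """Lift visible in reported open rate when a share of the audience
    registers an open regardless of subject line (simplifying assumption)."""
    if not 0 <= auto_open_share <= 1:
        raise ValueError("auto_open_share must be between 0 and 1")
    # Only recipients who do not auto-open can react to the subject line.
    return true_lift_pts * (1 - auto_open_share)


# Hypothetical 40% auto-open share (an assumption, not a measured value).
print(round(measured_lift_pts(1.0, 0.40), 2))  # prints 0.6
```

Adjust `auto_open_share` to whatever share of your own audience registers automatic opens.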

Depending on audience size (for example, a total audience of 100,000), this may be the difference between a statistically significant test with a 1 pt lift and a test with a 0.6 pt lift that is not statistically significant, as shown in the example below.
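One way to check a scenario like this is sketched below in Python. It builds a 95% normal-approximation confidence interval around each variant's open rate and calls the test significant only when the two intervals do not overlap. That decision rule is my assumption about how to apply the intervals, and it is stricter than a two-proportion z-test. The sketch also assumes the 100,000 audience is split evenly into two 50,000 cells and holds the base rate at 15% in both cases; in practice auto-opens would also inflate the measured base rate, which this ignores.

```python
import math

Z_95 = 1.96  # two-sided 95% confidence


def margin(rate: float, n: int) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    return Z_95 * math.sqrt(rate * (1 - rate) / n)


def intervals_separate(base_rate: float, test_rate: float, n: int) -> bool:
    """True when the test interval sits entirely above the base interval."""
    return base_rate + margin(base_rate, n) < test_rate - margin(test_rate, n)


cell_size = 100_000 // 2  # assumed even split across 2 variants

print(intervals_separate(0.15, 0.160, cell_size))  # 1 pt lift
print(intervals_separate(0.15, 0.156, cell_size))  # 0.6 pt lift
```

Under these assumptions the 1 pt lift clears the bar and the 0.6 pt lift does not.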

This disproportionately affects organizations with smaller audiences, who now face bigger hurdles to statistically significant test results. Continue reading for more details.

Four Best Practices in Subject Line Testing in a Post iOS 15 World

Know your audience size

How large is your audience for each campaign? Are there some specialty messages that fall below the threshold where you are likely to see statistically significant results? If so, don’t waste time subject line testing them.

Set Expectations

Based on your audience size, know how much of an impact the test has to have to reach statistical significance. Ensure your stakeholders understand that the bar is higher than it was pre-iOS 15, and that you may see statistically significant results less frequently or need to test fewer, more differentiated variants to increase the likelihood of a reliable answer. With the implementation of iOS 15 MPP, open rate testing going forward is effectively measured on those opening on non-iOS devices, which may be a slightly biased sample.

Maximize differentiation in all tests

No longer can we expect statistically significant results from a nuanced test where perhaps we only change one word of a subject line. Once, my team was accused of using exclamation points too frequently in subject lines, so we decided to test the same subject line with one version ending in a period and one ending in an exclamation point. It ended in an exact tie (not just a statistical tie, but an exact mathematical tie to 2 decimal places) and caused a deployment error in our ESP, which used a strict inequality (> rather than ≥) to declare a winner. Now is the time for testing very different subject lines, for example pricing/savings vs. highlighting product features.

Run valid tests

A/B testing is only statistically valid when based on a random sample methodology. Never, ever compare a subject line from last year/month/week against one run today, as changes in time of day, seasonality, economic environment, audience composition, deliverability, and other factors could also be responsible for the open rate (or click rate) impact. This is even more important when trying to read results through the veil of iOS 15 auto opens. Longitudinal trend analysis is a valuable tool, but shouldn’t be used to evaluate content directly.

Can I still test subject lines? A Statistical Approach.

Despite the changes with iOS 15, subject line/open rate testing is still a powerful optimization tool in many organizations and should not be ignored.

Great news! Math has all the answers we need to set a plan for continuing to optimize your email program in a post-iOS 15 world. Using the estimated subject line impact (the “true” impact, without the effect of iOS auto-opens) and audience size, the table below can be used to estimate whether a subject line test will still be statistically significant at a 95% confidence level for a typical B2C audience before and after the impact of iOS 15 MPP.

For example, an organization with a 2M audience size can run a 2 variant test (1M each) and get statistically significant results as long as the test subject line is sufficiently differentiated to expect a “true” 0.4 pt lift or more in open rate from the “base” subject line.

From this chart, the greatest losses in the ability to reach a conclusive result are for small audiences (<100k test cells), so organizations with sufficiently large audiences to send 500k-1M per variant should experience very little degradation in statistical significance despite misleading open activity from a significant portion of the population.

My experience pre-iOS 15 in B2C subject line testing is that “true” differentiation of 0.5-1pt is the reasonable expected range for a good subject line test (this is reduced to a 0.3 – 0.6 pt measured lift with the impact of iOS 15 auto opens). It is difficult to craft a subject line and get more than 1 point of lift, unless it is an amazing subject line!
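For readers who want to approximate the threshold for their own cell sizes, the Python sketch below searches for the smallest “true” lift whose muted, measured lift still separates two 95% normal-approximation intervals. It is not the method used to build the table above, and its outputs should not be expected to match it: the function name, the non-overlap decision rule, the 0.01 pt search step, and the simplification of holding the measured base rate fixed are all my assumptions.

```python
import math

Z_95 = 1.96  # two-sided 95% confidence


def margin(rate: float, n: int) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    return Z_95 * math.sqrt(rate * (1 - rate) / n)


def min_detectable_true_lift(cell_size, base_rate, auto_open_share,
                             step_pts=0.01, max_pts=10.0):
    """Smallest true lift (in points) that separates the two intervals,
    or None if nothing up to max_pts is detectable."""
    steps = int(round(max_pts / step_pts))
    for i in range(1, steps + 1):
        true_lift_pts = i * step_pts
        # Assumption: auto-opening recipients contribute no measurable lift.
        measured = true_lift_pts * (1 - auto_open_share) / 100
        test_rate = base_rate + measured
        if test_rate >= 1:
            break
        if base_rate + margin(base_rate, cell_size) < test_rate - margin(test_rate, cell_size):
            return round(true_lift_pts, 2)
    return None


# Hypothetical inputs: 15% base open rate, 40% auto-open share.
for cell in (50_000, 500_000, 1_000_000):
    print(cell, min_detectable_true_lift(cell, 0.15, 0.40))
```

The required lift shrinks as the cell size grows, which is the same pattern described above for large versus small audiences.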

A similar table with assumptions for a B2B audience (lower penetration of iOS 15 and higher overall open rates) is below:

Want to run your own scenarios? Update the BLUE cells in the calculator below to get personalized recommendations based on your audience size, # of variants to test, iOS penetration and estimated subject line impact.

Calculator outputs: Pre-iOS 15; Post-iOS 15

Calculator inputs: Send Size (total audience); Base Open Rate; # variants to test; Expected open rate test impact (pts); Est % of audience on iOS

Note: input percentages as whole numbers (eg. 10 for 10% open rate, 1.5 for 1.5 percentage points expected improvement)
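If the interactive calculator is not available to you, a rough stand-in can be sketched in Python using the same five inputs. This is not the original calculator, and several pieces are my assumptions: the even split of the send across variants, the comparison of a single test variant against the base, the non-overlap rule for 95% normal-approximation intervals, and the treatment of 80% of iOS audience members as registering an open no matter what (my reading of the 80% estimate in the assumptions section of this article).

```python
import math

Z_95 = 1.96                # two-sided 95% confidence
IOS_AUTO_OPEN_RATE = 0.80  # assumption: share of iOS recipients who auto-open


def separated(p_a: float, p_b: float, n: int) -> bool:
    """True when the 95% intervals around p_a and p_b do not overlap."""
    m_a = Z_95 * math.sqrt(p_a * (1 - p_a) / n)
    m_b = Z_95 * math.sqrt(p_b * (1 - p_b) / n)
    return p_a + m_a < p_b - m_b


def run_scenario(send_size, base_open_pct, variants, impact_pts, ios_pct):
    """Inputs are whole-number percentages, e.g. 15 for 15%, 1.5 for 1.5 pts."""
    cell = send_size // variants
    base = base_open_pct / 100
    lift = impact_pts / 100
    auto = (ios_pct / 100) * IOS_AUTO_OPEN_RATE
    # Auto-opens inflate the measured base rate and mute the measured lift.
    post_base = auto + (1 - auto) * base
    post_lift = (1 - auto) * lift
    return {
        "cell_size": cell,
        "pre_significant": separated(base, base + lift, cell),
        "post_measured_lift_pts": round(post_lift * 100, 2),
        "post_significant": separated(post_base, post_base + post_lift, cell),
    }


# Hypothetical scenario: 100k send, 15% open rate, 2 variants, 1 pt lift, 50% iOS.
print(run_scenario(100_000, 15, 2, 1.0, 50))
```

Unlike a fixed-base-rate check, this version lets the measured base open rate rise with auto-opens, which widens the intervals as well as shrinking the measured lift.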

What about click rate? Should I judge a subject line test on click rate instead?

Measuring a test on click rate does remove the issue of “fake” iOS 15 opens, so the impact of a test on click rate is still a pure measure. However, because click rates are much lower than open rates, the same relative lift (e.g., a 10% increase in click rate) is a much smaller percentage point increase, and is therefore harder to detect with statistical significance. For rates near the extremes of the binomial distribution (e.g., events with probability close to 0% or 100%, like click-through rate) and smaller sample sizes (e.g., <500k), a lift of 5-10% on click rate may not yield statistical significance where a 5-10% lift on open rate at the same sample size does.
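A small Python sketch makes the comparison concrete. The 2% click rate and the 50,000 cell size are hypothetical values chosen for illustration, and the non-overlap rule for 95% normal-approximation intervals is my assumption about how to judge significance.

```python
import math

Z_95 = 1.96  # two-sided 95% confidence


def intervals_separate(base_rate: float, test_rate: float, n: int) -> bool:
    """True when the 95% intervals around the two rates do not overlap."""
    m_base = Z_95 * math.sqrt(base_rate * (1 - base_rate) / n)
    m_test = Z_95 * math.sqrt(test_rate * (1 - test_rate) / n)
    return base_rate + m_base < test_rate - m_test


cell_size = 50_000      # hypothetical cell size
relative_lift = 0.10    # the same 10% relative lift on both metrics
open_base = 0.15        # 15% open rate
click_base = 0.02       # hypothetical 2% click rate

open_significant = intervals_separate(
    open_base, open_base * (1 + relative_lift), cell_size)    # 15% -> 16.5%
click_significant = intervals_separate(
    click_base, click_base * (1 + relative_lift), cell_size)  # 2% -> 2.2%

print(open_significant, click_significant)  # prints: True False
```

The same 10% relative lift is 1.5 pts on open rate but only 0.2 pts on click rate, which is too small to separate the intervals at this cell size.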

Real World Experience & Final Thoughts

While the automatic opens that started to occur in iOS 15 definitely created higher hurdles to understanding test results, open rate continues to be a powerful metric for understanding how subject lines, deliverability, and other outside influences affect consumer behavior. For organizations with large audience sizes, even small differences in open rate are still very valuable for understanding creative performance and optimizing subject lines.

Our retail organization had 3 consistent audiences that we used for subject line testing on a regular basis: one large audience of approximately 3-3.5M and two smaller audiences of 300k-400k. Prior to iOS 15, we saw statistically significant results for 80-90% of subject line tests on the large audience and 80% of our tests on our smaller audiences. After evaluating 6+ months of testing post-iOS 15 rollout, we saw no change in statistical significance frequencies for our 3M member audience, and only slight degradation (~72% of our tests were significant) for our smaller audiences.

Need help understanding if subject line testing is still relevant in your program? Seeking to optimize your customer touchpoints by leveraging testing, insights, and data-driven personalization? Contact ContinuumGlobal to learn more.

Statistical Details & Assumptions

Statistical significance methodology is based on binomial proportion confidence intervals using the normal approximation, with α=0.05 → z=1.96. No Bonferroni correction has been applied for tests with multiple variants, so the actual confidence level may be <95%. For probabilities very close to 0 or 1, other methods of confidence interval estimation will be more accurate.
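For readers who do want to apply a Bonferroni correction when testing several variants, the Python sketch below shows how the critical z value would change. It was not used for any results in this article; the function name is hypothetical, and how many comparisons to count (for example, each challenger against the base subject line) is a choice left to the analyst.

```python
from statistics import NormalDist


def critical_z(alpha: float = 0.05, comparisons: int = 1) -> float:
    """Two-sided critical z value after dividing alpha across comparisons."""
    if comparisons < 1:
        raise ValueError("comparisons must be at least 1")
    adjusted_alpha = alpha / comparisons
    return NormalDist().inv_cdf(1 - adjusted_alpha / 2)


print(round(critical_z(0.05, 1), 2))  # prints 1.96 (no correction)
print(round(critical_z(0.05, 4), 2))  # prints 2.5 (4 comparisons)
```

A larger critical value widens every confidence interval, so a corrected test with many variants needs either bigger cells or more differentiated subject lines to reach significance.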

iOS open rate audience volumes and impact on open rate (estimated at 80% for iOS audience members) are based on personal experience before and after iOS deployment and can be adjusted for variations observed in other organizations.

About the Author

Hillary Bliss is the Sr Director of Data & Analytics at ContinuumGlobal, a leading Marketing Services agency with a dedicated and dynamic team who aren’t afraid to challenge the status quo. ContinuumGlobal creatively scales marketing operations at pace and with ease, making marketing operations more agile, cost-effective and scalable.

Hillary is a data scientist with a Masters of Statistics and an MBA from Georgia Tech and has worked on direct marketing program strategy, customer behavior and marketing spend analysis at major retailers in the United States. She currently leads the Data & Analytics practice at ContinuumGlobal focusing on helping quantify, visualize, and optimize impact of consumer touchpoints for major technology clients including multiple Google product lines and Fortune 200 companies like Coinbase.