Mobile testing strategies

Tue, 14 Oct 2025, reading time: 8 minutes


Picture this: It’s 2 AM and your phone buzzes with Slack notifications. Your latest app release is crashing on startup for users with older devices. Your boss is asking for an ETA on the fix.

The answer? “We’ll have a patch ready tomorrow, but users won’t see it for 2–3 days minimum due to App Store review and slow roll-outs.”

It’s the reality every mobile team fears. Unlike web applications where you can push a fix in minutes, or even roll back, mobile releases are unforgiving. Once your app goes live, there’s no quick rollback, no hotfix pipeline that deploys instantly.

This harsh reality shapes everything about mobile quality assurance.

Mobile apps live in a world of app store reviews, staged rollouts, and users who might not update for days. A single crash affecting 1% of your user base could mean thousands or even millions of frustrated customers before you can respond.

Yet, some mobile teams hand off their work to one or two QA members, hoping that this is enough to catch serious problems.

That helps, but it’s more of a last resort, and oftentimes not enough to keep quality high.

Feature flags can help a lot, but if your app crashes on startup, feature flags can’t even load, which is a great way to increase your blood pressure.

That’s why, in this article, I want to explore why mobile development requires a different testing approach.

I’ll cover:

  • Blind spots for manual and UI testing strategies
  • Understanding where manual tests fit, and where UI Tests can supplement them
  • Trade-offs and comparisons between different testing approaches in a mobile context

By understanding what “quality” really means in mobile development, you’ll think differently about how you write code and test your apps.


The role of manual testing

Manually testing our applications is an important step; we can’t only run automated tests and assume our app works great. Human eyes can catch things that automated tests don’t: weird UI glitches, UX issues, labels being awkwardly trimmed off screens, odd translations, or buttons being slightly out of bounds.

Especially for new features and changes, it’s great to have human verification that our app works as intended, because it’s new UI that hasn’t been verified before.

The great benefit is that anyone in the company can help find issues, and it’s a perfect way to check an app against internal or production servers.

However, manual testing doesn’t scale.

For example, maybe your app works well for 99% of customers, but 1% might hit obscure crashes. It could be that a local database migration fails, but only on slower devices with low disk space. Or slower devices start running out of memory. Whatever it is, it’s always something different. If there’s some weird code path or context (such as a specific OS + device combination) that isn’t tested by your team, then users will find those crashes for you. In that case, manual testing and feature flags may not be enough, and it will be painful to get a fix out.

Sometimes things still break beyond our control. We may have the most perfect, stable build, with a bazillion feature flags and 100% test coverage. Yet a new OS version could introduce crashes to a once-stable build, like a rug being pulled out from underneath us.

A new OS version means that even if you were to beta-test extensively, your own team may not find everything. But those thousands, if not millions, of users will most likely find something that a small testing team won’t.

The Combinatorial Explosion: Why Manual Testing Can’t Scale

When you leave testing to a single QA person, you will only cover a tiny fraction of all device combinations.

Let’s say you support a modest range of devices and conditions:

  • 5 device models (iPhone 13, 14, 15, 16, 17)
  • 3 OS versions (iOS 16, 17, 18)
  • 2 orientations (portrait, landscape)
  • 2 appearance modes (light, dark)
  • 2 storage states (plenty of space, almost full)

That’s already 5 × 3 × 2 × 2 × 2 = 120 combinations.

Now add some more real-world variables:

  • App backgrounding scenarios
  • Phone call interruptions
  • Low memory warnings
  • Large- and small-text accessibility settings
  • Right-to-left languages
  • Split-screen multitasking
  • Other accessibility settings
  • Various network conditions

You’re suddenly looking at thousands of possible states your app could encounter.
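To make the multiplication concrete, here’s a quick back-of-the-envelope sketch in Swift. The first five dimension sizes match the list above; the rest are illustrative guesses:

```swift
// Rough state-space count for manual testing.
let dimensions = [
    ("device models", 5),
    ("OS versions", 3),
    ("orientations", 2),
    ("appearance modes", 2),
    ("storage states", 2),
    ("backgrounding scenarios", 2),
    ("interruption states", 2),
    ("text size settings", 3),
    ("layout directions", 2),
    ("network conditions", 3),
]

let total = dimensions.reduce(1) { $0 * $1.1 }
print("Possible states: \(total)") // Possible states: 8640
```

Even with these conservative guesses, a thorough manual pass over every state is off the table.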

Skipping tests and placing full confidence in a QA person saying “Looks good, ship it!” is humorous, considering it’s humanly impossible to verify all these states. Yet, many teams do work this way!

In my opinion, QA and human verification work best for new and critical features. But for catching regressions, I favor automation, so you can “forget” about a lot of features and assume they are stable.


Flagship bias

Another problem I see in teams is what I call “flagship bias”: the gap between how we test and how users actually use our apps.

Developers gravitate toward the latest devices. That iPhone Air handles memory leaks like a champ, and your office WiFi makes every screen load buttery-smooth.

But most users are still on older devices with limited storage and spotty connections.

The testing confidence becomes satirical: “I tested the payment flow on my iPhone 17 Pro Max with 500GB free space in perfect WiFi conditions. Ship it!”

Meanwhile, your app is about to process a purchase on a 4-year-old phone with almost no storage remaining, while the user walks through a subway tunnel.

So be aware of your bias: keep testing on older devices, and regularly throttle your network (for instance with Apple’s Network Link Conditioner). Maybe that fancy UI animation isn’t as important, maybe you need a skeleton loader, or maybe you need to talk to a backend engineer about splitting up that API call so you can show data a bit earlier.

The lesson here is: Make your worst-case scenario your primary testing environment.


The hidden costs of canceling a build

Thursday afternoon: Your team has tested the release all week. Marketing has the announcement ready. Customer support prepared their FAQ. Then QA finds a showstopper: the payment flow crashes. Oh crap.

The build gets canceled. “No worries,” you say. “We’ll fix it and ship it next release.”

But now a subtle problem creeps in. The next build contains three layers of changes: its own new features, everything from the canceled build, and the fix for the showstopper itself.

Each canceled build makes the next release bigger and harder to test.

This creates a snowball effect. More surface area means more potential issues, which leads to more canceled builds, which creates even larger releases.

Beyond the technical complexity, canceled builds damage team confidence and morale. Features sit unreleased for weeks, and people become uncertain whether anything will actually ship.

Prevention becomes critical. That’s why I recommend focusing more on prevention than on putting out fires.

One way to test more preventatively is to look into shift-left testing approaches.

Cross-team dependencies amplify the risk

Mobile apps don’t exist in isolation; they’re interconnected systems where one team’s changes can break another team’s features in unexpected ways.

Consider this scenario: Your onboarding team ships a beautiful new user registration flow. It works perfectly in testing. But two weeks later, the settings team pushes an update that changes how user preferences are stored. Suddenly, new users can’t complete registration because the onboarding flow can’t save their initial preferences.

Now two seemingly unrelated teams and features are blocking each other.

The larger your organization, the more this compounds. That’s why it’s even more important to catch issues early, before QA even receives a build.

In mobile, it’s difficult to create full autonomy; modules help, but they’re not enough. To see various approaches to increasing autonomy, check out the modular chapters of the Mobile System Design book.


Build-to-build testing

Another easy and often overlooked check is testing an update from an older version of your app to the latest build on the same device.

For instance, developers may hand the testing team the latest build, and it may all work great. But for users, an obscure crash may occur when the latest build has to perform a local database migration from the previous version, and that could affect most of them. Verifying a build upgrade is easy to miss.

Maybe it generally works fine, but a slower device may have trouble finishing a database migration in time during startup, and now the initial app load time is negatively affected.

These issues tend to slip through the cracks easily, because when testing internally we are constantly overwriting and updating builds on our devices, which makes it harder to detect a real issue in regular day-to-day work. Even when distributing builds, a single crash may appear once and never show up again, making a serious issue look like a fluke. If a crash happens to a team member, they might just reopen the app, and everything looks fine, while in practice that crash could be hitting thousands of users. So take weird flukes seriously.

With all of this in mind, be sure to incorporate build updates in your testing scenarios. This should preferably be carried out on the slowest device you can get your hands on.

This process can be (semi-)automated: check out the latest production version, build and run it on a device (or simulator), then check out the upcoming version and build it onto the same device (or simulator). Then verify whether any crashes or errors occurred.
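As a rough illustration, here’s what such a check could look like as a Swift script driving the iOS simulator. The paths and bundle identifier are hypothetical, and both builds are assumed to exist already:

```swift
#!/usr/bin/env swift
// Sketch of a build-to-build upgrade check on a booted iOS simulator.
import Foundation

let bundleID = "com.example.myapp"          // hypothetical bundle identifier
let previousApp = "build/MyApp-1.4.2.app"   // last production build
let candidateApp = "build/MyApp-1.5.0.app"  // upcoming release build

@discardableResult
func simctl(_ args: String...) -> Int32 {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/xcrun")
    process.arguments = ["simctl"] + args
    try! process.run()
    process.waitUntilExit()
    return process.terminationStatus
}

// 1. Install and launch the previous version so it writes its local state
//    (databases, caches, user defaults).
simctl("install", "booted", previousApp)
simctl("launch", "booted", bundleID)
sleep(10) // crude: give the app time to finish its first launch
simctl("terminate", "booted", bundleID)

// 2. Install the new build over the old one. simctl keeps the app's data
//    container, so the next launch exercises the real migration path.
simctl("install", "booted", candidateApp)
let status = simctl("launch", "booted", bundleID)

// 3. A non-zero exit status (or a fresh crash log) means the upgrade broke.
print(status == 0 ? "Upgrade launch succeeded" : "Upgrade launch failed")
```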


Where UI Tests fit

In some companies, UI Tests are never written for various reasons. A team may decide that unit tests + manual checks are enough to ensure quality.

Compared to unit tests, UI Tests can be slower to write and harder to maintain, and they are known to be brittle and flaky. Another big downside is that running UI Tests on a server is slow (such as when they’re part of a CI pipeline).

So it makes sense that companies decide to omit UI Tests and compensate for this by relying on manual tests.

But we have to be aware of what it means not to write UI Tests: manual tests don’t scale; they can only cover a subset of how our build will run. With UI Tests, you can spin up a fleet of simulators and run tons of flows in parallel.

And think about when you would learn about issues. Manual tests are often performed late by QA, usually during the release phase, which makes releases more stressful because issues are found right before a deadline. But you can run UI Tests whenever and as often as you want, even while you’re sleeping!

And consider the overhead. With a failing UI Test, you see exactly at which step the failure happens. With manual testing, you get a report, and you probably have to ask for more information through videos or meetings. All in all, it costs more time to communicate and clarify bugs.

One big benefit of manual tests, however, is that they are commonly run against real servers, for instance when team members test a new app version against a production environment. Unit tests and UI Tests, by contrast, tend to have their network calls mocked out. Be aware of this, and leverage manual tests for this purpose.

It’s not an either/or situation, though. Consider mixing testing styles to increase the testing surface.

If UI Tests are a pain to write, then check out Stop Fighting Your UI Tests to get them up and running more quickly.
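To give an idea of what this looks like in practice, here’s a minimal XCUITest sketch of a login flow. The accessibility identifiers and the -uiTesting launch argument are hypothetical stand-ins for whatever your app exposes:

```swift
import XCTest

final class LoginFlowTests: XCTestCase {
    func testLoginShowsHomeScreen() {
        let app = XCUIApplication()
        app.launchArguments = ["-uiTesting"] // e.g. to enable a mocked network
        app.launch()

        let email = app.textFields["email_field"]
        XCTAssertTrue(email.waitForExistence(timeout: 5))
        email.tap()
        email.typeText("test@example.com")

        let password = app.secureTextFields["password_field"]
        password.tap()
        password.typeText("correct-horse-battery")

        app.buttons["login_button"].tap()

        // On failure, the report pinpoints exactly which step broke.
        XCTAssertTrue(app.staticTexts["home_title"].waitForExistence(timeout: 5))
    }
}
```

A suite of flows like this can run unattended, in parallel, on as many simulators as your hardware allows.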

New features versus catching regressions

Consider manual testing for new and critical features that you’re about to release. Human eyes can catch a lot of foreseen and unforeseen issues and odd bugs.

Consider regularly distributing internal builds within the company so that the entire department can help find issues early and often. That’s the benefit of manual testing: anyone can pitch in and hunt for issues in the app.

Conversely, implement UI Tests to make sure most, if not all, features remain stable. They are great for catching any regressions that might pop up.

With a strong unit-test suite, complemented by UI Tests and internal builds, you’ll be able to catch a lot of issues early and often.

Time and money investments

When considering UI Tests, the upfront investment is a common concern. For a small app, some manual checks can be enough, but there is a tipping point once your app grows.

For a serious app, having an entire team go over the whole app manually costs considerable time and resources for every release.

Granted, having a few people perform manual tests can be cheaper than having a development team working full-time to keep CI jobs stable.

However, the idea that UI Tests are expensive quickly becomes a fallacy for any serious app. When weighing time investments, think about what is more expensive: running a giant UI Test suite on a fleet of devices multiple times a day, or hiring a team of people to manually test on a handful of devices right before submitting a build to the App Store / Play Store.

Dealing with slow UI Tests

One main gripe with UI Tests is that they are slow to run, and so they block people’s pull requests. They can also clog up CI capacity in day-to-day work.

To combat this, consider making UI Tests optional in your CI pipeline, so people can run them when they want, but they won’t automatically run for every job. Also allow developers to merge even when UI Tests fail in their CI job; nobody likes a flaky UI test blocking a tiny unit-test hotfix. If there’s a real issue with a UI Test, someone should pick that up separately.

Alternatively, consider running UI Tests periodically, such as nightly, outside of pull requests and when the CI jobs aren’t clogged up. Then, the next morning, the department can receive a test report of what is possibly broken.

Periodic UI Test runs can serve as a canary in a coal mine, notifying the team of possible issues way before the release phase starts.

Reducing flakiness

One argument against UI Tests is that they tend to be flaky and brittle. They can fail even when the app is healthy. In other words, they require more babysitting.

For instance, maybe a test scrolls down to expose an element, but on a smaller phone that element still isn’t on screen, causing a failure.

Or maybe someone accidentally matched on display text (instead of an accessibility identifier), making a UI Test sensitive to a specific language and causing failures in other locales.
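As a small sketch, the difference looks like this in XCUITest (the identifier is hypothetical and would be set in the app via accessibilityIdentifier):

```swift
import XCTest

final class CheckoutFlowTests: XCTestCase {
    func testContinueButton() {
        let app = XCUIApplication()
        app.launch()

        // Fragile: matching the visible label breaks the moment the app
        // runs in another language, or when the copy changes.
        // app.buttons["Continue"].tap()

        // Sturdier: an accessibility identifier stays stable across
        // languages and copy tweaks.
        app.buttons["checkout_continue_button"].tap()
    }
}
```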

When I was on a platform team, we would grab the failing tests and rerun only those. If you do this twice in a row, you get a much better idea of whether your tests are “really” flaky. This helped us discern broken tests from merely flaky ones. (xcodebuild can automate such reruns with options like -retry-tests-on-failure and -test-iterations.)

Snapshot tests

Snapshot tests capture the visual appearance of your UI components and screens, storing them as reference images. When you run the tests again, they compare the current state against these stored snapshots and flag any differences.

This approach is particularly valuable for catching visual regressions that other testing methods might miss. While UI tests can verify that a button exists and is tappable, they can’t tell you if the button’s padding is suddenly wrong or if a font weight has changed unexpectedly.

The main benefit is automated visual regression detection. You can catch layout issues, color changes, font problems, and spacing inconsistencies without manual inspection.

However, snapshot tests come with maintenance overhead. Every intentional UI change requires updating the reference images. Different devices, screen sizes, and OS versions can generate different snapshots, multiplying the number of reference images you need to maintain.

Use snapshot tests strategically: focus on critical UI components and key user flows rather than trying to snapshot test everything. They work best for stable UI elements that shouldn’t change frequently.
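For illustration, here’s what a snapshot test might look like with the swift-snapshot-testing library, one popular option (the article doesn’t prescribe a tool, and PaymentButton is a hypothetical view):

```swift
import SnapshotTesting
import XCTest

final class PaymentButtonSnapshotTests: XCTestCase {
    func testPaymentButtonAppearance() {
        // Hypothetical UIView under test.
        let button = PaymentButton()

        // The first run records a reference image; subsequent runs fail on
        // any pixel difference, such as changed padding or font weight.
        assertSnapshot(of: button, as: .image)
    }
}
```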


Mixing testing styles

When talking about UI Tests, the line of what to test becomes blurry. To some, it’s not clear where UI Tests end and integration tests (talking to real servers) begin, or where they overlap with manual tests.

The fact that the line between UI Tests and integration tests becomes blurry is actually a good sign, because it means we are testing a build close to the real world. This has a lot of value!

Having said that, let’s consider the trade-offs of various testing styles and how we can make use of them.

  • With manual tests we can verify internal and/or production builds that use real network calls. We can find UI issues that UI Tests won’t catch (such as wrong copy or weird margins). But manual testing doesn’t scale, unfortunately, and it’s usually performed at a late stage in the release process.
  • With Unit Tests, we verify our code but not UI, and we tend to not test the network.
  • With UI Tests we can’t (easily) verify code directly, but we can verify behavior very close to what users get, minus networking. They cover the UI only to some extent, though, because they don’t find visual bugs such as wrong layout margins or typos.
  • With Snapshot tests, we can automatically catch visual regressions by comparing current UI renders against stored reference images. They excel at detecting layout changes, color shifts, and font issues that other tests miss. However, they require maintenance overhead when UI changes intentionally.
  • With Integration tests, we can test a build close to production, since it uses actual network calls. But you need to keep servers up and running, deal with the continuous creation and resetting of test accounts, and keep those servers in sync with upcoming features and various versions.

The lesson here is: No single approach catches everything. Combine multiple testing styles for the broadest coverage.

A rule of thumb to ensure quality: run unit tests and UI Tests with a mocked network environment to confirm that the bulk of the app (say, 90%) works well. Then the actual network calls and remaining UI issues can be tested manually.
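One common way to get that mocked network environment, for both unit tests and UI Tests, is a URLProtocol stub that the app installs when launched in test mode. A minimal sketch, with a hypothetical route and response:

```swift
import Foundation

// A URLProtocol-based network stub that serves canned responses
// instead of hitting real servers.
final class MockURLProtocol: URLProtocol {
    static var stubs: [String: Data] = [
        "/api/profile": Data(#"{"name": "Test User"}"#.utf8),
    ]

    override class func canInit(with request: URLRequest) -> Bool { true }

    override class func canonicalRequest(for request: URLRequest) -> URLRequest {
        request
    }

    override func startLoading() {
        guard let url = request.url else { return }
        let data = Self.stubs[url.path] ?? Data()
        let response = HTTPURLResponse(
            url: url, statusCode: 200, httpVersion: nil, headerFields: nil
        )!
        client?.urlProtocol(self, didReceive: response, cacheStoragePolicy: .notAllowed)
        client?.urlProtocol(self, didLoad: data)
        client?.urlProtocolDidFinishLoading(self)
    }

    override func stopLoading() {}
}

// Installed early at app startup, e.g. behind a "-uiTesting" launch flag:
// URLProtocol.registerClass(MockURLProtocol.self)
```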

Snapshot tests are interesting, but I consider them a bonus, since they require maintaining a library of reference images to test against.

Then, on top of all that, consider whether integration tests are worth the extra investment, since they require a team to keep servers up and running and in sync with new features and upcoming changes. Integration tests are a giant topic and warrant their own article, but rest assured: they are the most work of all the testing styles mentioned above.


Final thoughts

Mobile testing is about building confidence in a deployment pipeline with no easy way back.

The key insight is timing. In mobile development, the earlier you catch issues, the exponentially cheaper they are to fix. A bug found during development costs minutes or hours to resolve. Found during release testing? Many hours or even days. In production? Days and tons of frustrated users.

This is why successful mobile teams focus on damage prevention rather than damage control. The mobile platform doesn’t forgive testing shortcuts, but with the right strategy, you can turn this constraint into a competitive advantage.

If you enjoyed reading this, check out various testing strategies and improving team autonomy by reading the Mobile System Design book.

The Mobile System Design book bundle

Want to level up your architecture skills or prepare for tough system design interviews? The Mobile System Design book bundle gives you practical tools and real-world insights to help you grow as a mobile engineer.

It includes the Quick Reference Cheat Sheet to help you land the job, and the full books to help you thrive in it.

Mobile System Design Book Collection

Trusted by engineers at Facebook, New York Times, and more!

I am the author of these books, but you may also know me from the highly-rated Swift in Depth book.

Even if you're a senior or staff-level engineer, I am confident you'll get a lot out of it.

It covers topics in depth such as:

  • Tackle system design interviews with confidence
  • Scale apps and teams without sacrificing speed
  • Avoid over-engineering with practical patterns
  • Use dependency injection without the framework bloat
  • Deliver features faster with fewer mistakes
  • Seamlessly integrate design systems into real-world apps

... and much more!

Get the Mobile System Design book bundle


Written by

Tjeerd in 't Veen has a background in product development inside startups, agencies, and enterprises. His roles included being a staff engineer at Twitter (before X) and iOS Tech Lead at ING Bank.

He is the author of the Mobile System Design and Swift in Depth books.