How to Identify Flaky Tests (with Real Data)
Most teams treat test results like a checkbox: green is good, red is bad, ship or block. The interesting signal lives in everything that happens between those two states — runtime variance, retry counts, the same five tests showing up in every postmortem. That signal is where engineering decisions actually get made.
Flaky tests undermine continuous integration: they breed mistrust in test results and delay releases. This article shows how to identify them using real data from your own test runs, so you can focus on meaningful engineering signal rather than noise.
By the end, you should be able to detect flaky tests using tools like Grafana, Allure, and ClickHouse, and make informed decisions for your CI pipeline.
This matters more in microservices architectures, where the test suite grows more complex with each service you add.
What This Actually Is
Flaky tests are tests with nondeterministic outcomes: they pass or fail for reasons unrelated to the code changes they are meant to validate. In practice this surfaces as intermittent failures during CI runs, caused by things like network latency, timing issues, or test dependencies such as shared state or execution order.
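One way to make this definition concrete: if the same test both passes and fails on the same commit, the code did not change, so something else did. The following is a minimal sketch, assuming each run is available as a (test_name, commit_sha, passed) tuple. That tuple layout and the sample data are illustrative assumptions, not the output of any particular tool.

```python
from collections import defaultdict


def find_mixed_outcome_tests(runs):
    """Return sorted names of tests that both passed and failed on one commit.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    outcomes = defaultdict(set)  # (test_name, commit_sha) -> set of outcomes seen
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(bool(passed))
    # Two distinct outcomes for one (test, commit) pair means the result
    # changed while the code stayed the same.
    flaky = {name for (name, _sha), seen in outcomes.items() if len(seen) == 2}
    return sorted(flaky)


# Made-up sample data for illustration.
runs = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),    # same commit, different outcome
    ("test_checkout", "abc123", False),
    ("test_checkout", "def456", True),  # outcome changed along with the code
]
print(find_mixed_outcome_tests(runs))  # ['test_login']
```

Note that test_checkout is not flagged: its result changed between commits, which is what a healthy test is supposed to do.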
Identifying these tests matters because they erode confidence in your test suite, lengthen builds through retries, and create unnecessary bottlenecks in the development pipeline.
In CI/CD pipelines, flaky tests often hide in plain sight: retry mechanisms turn them green, and sporadic failures get dismissed. Pinpointing them takes per-run data and analysis, not just the final pass/fail status.
How To Implement It
To identify flaky tests, you need a combination of data collection and analysis tools. Start by instrumenting your test suites to collect detailed run data, including execution time, retries, and environment variables. Tools like Allure and ReportPortal can help aggregate this data.
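As a rough illustration of what "detailed run data" might look like, here is a minimal sketch of a per-run record and a serializer that writes one JSON object per line, the layout used by ClickHouse's JSONEachRow format. The class name, field names, and sample values are all hypothetical. Adapt them to whatever your reporter (Allure, ReportPortal, or a custom hook) actually emits.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RunRecord:
    # All field names are hypothetical; align them with your own schema.
    test_name: str
    commit_sha: str
    passed: bool
    duration: float  # assumed to be seconds
    attempt: int     # 1 for the first try, 2 and up for retries
    runner: str      # label for the CI agent or environment


def to_json_lines(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


# Made-up sample data for illustration.
records = [
    RunRecord("test_login", "abc123", True, 1.42, 1, "linux-x64"),
    RunRecord("test_login", "abc123", False, 9.87, 1, "linux-x64"),
]
payload = to_json_lines(records)
print(payload)
```

Recording the attempt number and the runner alongside the outcome is the design choice that matters here: without them, a retry that eventually passed looks identical to a clean first-time pass.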
Next, use a database like ClickHouse to store test execution data. Here's a simple SQL query to identify tests with high variance in execution time:
    SELECT
        test_name,
        COUNT(*) AS run_count,
        AVG(duration) AS avg_duration,
        stddevSamp(duration) AS stddev_duration
    FROM test_runs
    GROUP BY test_name
    HAVING stddev_duration > 0.2 * avg_duration;
This query keeps tests whose standard deviation of duration exceeds 20% of their average duration. Unstable timing is a useful heuristic for flakiness rather than proof of it: a test can be noisy and still always pass, so treat the result as a list of candidates to investigate.
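To see what the 20% threshold means in practice, the same check can be worked through by hand on a few durations. This is a sketch of the arithmetic only, using the sample standard deviation. The function names and the sample durations are made up for illustration.

```python
from statistics import mean, stdev


def duration_cv(durations):
    """Coefficient of variation: sample standard deviation divided by the mean.

    Returns None when there are too few runs or the mean is zero.
    """
    if len(durations) < 2:
        return None  # a single run says nothing about spread
    avg = mean(durations)
    if avg == 0:
        return None
    return stdev(durations) / avg


def high_variance_tests(durations_by_test, threshold=0.2):
    """Return {test_name: cv} for tests whose cv exceeds the threshold."""
    flagged = {}
    for name, durations in durations_by_test.items():
        cv = duration_cv(durations)
        if cv is not None and cv > threshold:
            flagged[name] = cv
    return flagged


# Made-up durations, assumed to be in seconds.
samples = {
    "test_stable": [2.0, 2.1, 1.9, 2.0],   # cv is about 0.04
    "test_jittery": [1.0, 4.0, 1.0, 6.0],  # cv is about 0.82
    "test_single": [3.0],                  # skipped: only one run
}
flagged = high_variance_tests(samples)
print(sorted(flagged))  # ['test_jittery']
```

The explicit guard for tests with fewer than two runs is worth carrying over to any production version of the check, since a spread estimated from one or two runs is not meaningful.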
Visualize this data in Grafana for easier analysis. Here's a sample JSON configuration for a Grafana panel:
    {
      "title": "Flaky Test Detection",
      "type": "table",
      "datasource": "ClickHouse",
      "targets": [
        {
          "query": "SELECT test_name, run_count, avg_duration, stddev_duration FROM flaky_tests ORDER BY stddev_duration DESC"
        }
      ]
    }
The panel reads from flaky_tests, which is presumably a view or table holding the output of the variance query. Treat the JSON as a simplified sketch: the exact field names inside targets depend on the ClickHouse data source plugin you use.
Integrating this setup with your CI pipeline, such as GitHub Actions or Jenkins, lets you detect and report flaky tests automatically. That can cut triage time substantially: one team went from 22 minutes per failure to under 4 minutes by wiring their dashboards to Loki for real-time logs.
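The automatic reporting step could look something like the sketch below, which turns rows shaped like the query output into a plain-text summary for a CI log or a pull request comment. How the rows are fetched and where the text is posted are left out on purpose. The function name, the assumption that durations are in seconds, and the sample rows are all illustrative.

```python
def build_flaky_report(rows, max_rows=10):
    """Format flaky-test candidates as a plain-text summary.

    `rows` is a list of dicts with the keys test_name, run_count,
    avg_duration and stddev_duration, mirroring the query columns.
    Durations are assumed to be in seconds.
    """
    if not rows:
        return "No flaky test candidates found."
    # Show the noisiest tests first, matching the dashboard ordering.
    ranked = sorted(rows, key=lambda r: r["stddev_duration"], reverse=True)
    lines = [f"{len(rows)} flaky test candidate(s):"]
    for r in ranked[:max_rows]:
        lines.append(
            f"- {r['test_name']}: {r['run_count']} runs, "
            f"avg {r['avg_duration']:.2f}s, stddev {r['stddev_duration']:.2f}s"
        )
    return "\n".join(lines)


# Made-up rows for illustration.
rows = [
    {"test_name": "test_login", "run_count": 40,
     "avg_duration": 1.5, "stddev_duration": 0.9},
    {"test_name": "test_checkout", "run_count": 38,
     "avg_duration": 3.2, "stddev_duration": 2.1},
]
print(build_flaky_report(rows))
```

Keeping the formatter free of any network or CI-specific calls makes it easy to reuse from either GitHub Actions or Jenkins, since both can run a script and capture its output.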
Common Pitfalls
One common mistake is to rely solely on retry mechanisms to 'solve' flaky tests. This approach masks the problem rather than addressing the root cause, leading to a false sense of stability.
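If retries stay enabled, it helps to at least count how often they are hiding a failure. The sketch below tallies, per test, the jobs in which the first attempt failed and a later attempt passed. It assumes attempts are recorded as (job_id, test_name, attempt, passed) tuples with attempts numbered from 1. That layout and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def retry_masked_tests(attempts):
    """Count, per test, the jobs where it failed first and then passed on retry.

    `attempts` is an iterable of (job_id, test_name, attempt, passed) tuples,
    with `attempt` numbered from 1. The layout is an assumption of this sketch.
    """
    by_job = defaultdict(dict)  # (job_id, test_name) -> {attempt: passed}
    for job_id, test_name, attempt, passed in attempts:
        by_job[(job_id, test_name)][attempt] = bool(passed)

    masked = defaultdict(int)
    for (_job_id, test_name), results in by_job.items():
        first_failed = results.get(1) is False
        later_passed = any(ok for n, ok in results.items() if n > 1)
        if first_failed and later_passed:
            masked[test_name] += 1
    return dict(masked)


# Made-up sample data for illustration.
attempts = [
    ("job-1", "test_upload", 1, False),
    ("job-1", "test_upload", 2, True),   # job is green, but only after a retry
    ("job-2", "test_upload", 1, True),
    ("job-3", "test_upload", 1, False),
    ("job-3", "test_upload", 2, False),
    ("job-3", "test_upload", 3, True),
    ("job-1", "test_search", 1, True),
]
print(retry_masked_tests(attempts))  # {'test_upload': 2}
```

A test that appears in this tally week after week is a candidate for root-cause work, even though every one of those jobs was reported as passing.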
Another pitfall is failing to differentiate between true failures and environmental issues that cause flakiness. This can happen when teams do not invest in proper logging and observability, making it difficult to trace the origin of test failures.
Finally, some teams focus too much on test coverage as a metric of quality without considering test reliability. This can lead to a bloated test suite with high coverage but low confidence in its results. Avoid these pitfalls by prioritizing the identification and resolution of flaky tests in your CI strategies.
What Most Teams Get Wrong
Many teams believe that pass/fail metrics are the ultimate signal for test quality. In reality, the variance and patterns in test results are where valuable insights lie.
Another misconception is that high code coverage equates to high-quality tests. In fact, without addressing flakiness, high coverage might simply mean high maintenance overhead without true reliability.
Lastly, some teams think that once a dashboard is in place, the problem of flaky tests is solved. Dashboards are tools for visibility, not solutions in themselves. They must be part of a broader strategy that includes root-cause analysis and continuous improvement.
If you implement this, the next thing worth measuring is mean-time-to-first-signal on production incidents. That metric helps you keep refining your CI/CD pipeline for performance and reliability without leaning on retries as a crutch.