Auto-Detecting Flaky Tests in CI

Flaky Tests & Reliability 5 min read May 05, 2026

Most teams treat test results like a checkbox: green is good, red is bad, ship or block. The interesting signal lives between those two states: runtime variance, retry counts, the same five tests showing up in every postmortem. That signal is where engineering decisions actually get made. Flaky tests are a major source of frustration in continuous integration (CI): they create noise and delay, mislead engineers, and obscure real failures. Analyzing test results carefully can turn your CI pipeline from a bottleneck into a reliable quality gate.

By the end of this article, you will know how to set up automated flaky test detection with tools like Jenkins, GitHub Actions, and Grafana, and how to interpret the results to drive engineering decisions. This matters because, as CI pipelines become more central to software delivery, detecting flaky tests is essential for maintaining velocity without sacrificing reliability.

Understanding flaky test detection is no longer optional for teams that want to scale their CI pipelines and ship high-quality software consistently. Modern observability tools and analytical databases make it practical to track flakiness across many runs instead of judging each run in isolation.

What flaky tests are and how CI systems detect them

Flaky tests are those that sometimes pass and sometimes fail even when there are no changes to the system under test. This inconsistency can be caused by various factors, including asynchronous code execution, non-deterministic test data, or unstable dependencies such as external services and databases. Flaky tests undermine the credibility of a test suite, leading to wasted time as engineers attempt to diagnose and fix phantom issues.

In a modern test architecture, flaky tests are typically identified by analyzing patterns over multiple test runs. This involves collecting and examining test execution metrics such as execution time variance, success-to-failure ratios, and environmental dependencies. Advanced CI systems and testing frameworks provide hooks and plugins that gather this data seamlessly, allowing for the automatic detection of test flakiness.
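One way to make "patterns over multiple test runs" concrete is to look for tests that both passed and failed on the same commit, since the code under test did not change between those runs. The sketch below is a minimal illustration under that assumption; the `RunRecord` type, its field names, and the sample data are hypothetical and not tied to any particular CI system.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RunRecord(NamedTuple):
    """One execution of one test (hypothetical schema)."""
    test_name: str
    commit_sha: str  # identifies the exact code under test
    passed: bool


def find_flaky_tests(runs: Iterable[RunRecord]) -> dict[str, float]:
    """Return {test_name: failure_rate} for tests with mixed outcomes on one commit."""
    outcomes = defaultdict(set)            # (test, commit) -> set of observed results
    totals = defaultdict(lambda: [0, 0])   # test -> [failures, total runs]
    for run in runs:
        outcomes[(run.test_name, run.commit_sha)].add(run.passed)
        totals[run.test_name][0] += 0 if run.passed else 1
        totals[run.test_name][1] += 1
    # A test is a flakiness candidate if any single commit saw both True and False.
    flaky = {name for (name, _), seen in outcomes.items() if len(seen) == 2}
    return {name: totals[name][0] / totals[name][1] for name in sorted(flaky)}


runs = [
    RunRecord("test_login", "abc123", True),
    RunRecord("test_login", "abc123", False),
    RunRecord("test_login", "abc123", True),
    RunRecord("test_checkout", "abc123", False),  # fails consistently: likely a real defect
    RunRecord("test_checkout", "abc123", False),
    RunRecord("test_search", "abc123", True),
]
print(find_flaky_tests(runs))
```

Here only `test_login` is reported: `test_checkout` fails every time, which points to a real defect rather than flakiness.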

Implementing flaky test detection in CI pipelines allows engineering teams to maintain the reliability of their test suites and prioritize debugging efforts effectively. This ensures that test results remain a trustworthy indicator of code quality, enabling teams to catch real defects early in the development cycle and maintain a rapid delivery cadence.

Collecting and querying test execution data in CI pipelines

To effectively auto-detect flaky tests, you need to establish a robust data collection mechanism across your CI pipelines. Begin by configuring your CI tools, such as Jenkins or GitHub Actions, to export test results in a format that can be consumed by analytics tools. JUnit XML is a common choice due to its compatibility with many analysis tools.
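As a rough sketch of what consuming JUnit XML can look like, the following uses Python's standard `xml.etree.ElementTree` module to turn a report into flat records. The sample report and the output field names (`test_name`, `status`, `execution_time`) are assumptions for illustration; real reports vary by framework, so treat this as a starting point rather than a complete parser.

```python
import xml.etree.ElementTree as ET

# A small hand-written report in the common JUnit XML layout (illustrative only).
SAMPLE_REPORT = """<testsuite name="api" tests="3" failures="1" skipped="1">
  <testcase classname="tests.test_auth" name="test_login" time="0.42"/>
  <testcase classname="tests.test_auth" name="test_refresh" time="1.73">
    <failure message="timeout">Timed out waiting for token</failure>
  </testcase>
  <testcase classname="tests.test_cart" name="test_add" time="0.00">
    <skipped/>
  </testcase>
</testsuite>
"""


def parse_junit(xml_text: str) -> list[dict]:
    """Flatten <testcase> elements into one record per test execution."""
    root = ET.fromstring(xml_text)
    records = []
    # iter() finds <testcase> whether the root is <testsuite> or <testsuites>.
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            status = "failed"
        elif case.find("skipped") is not None:
            status = "skipped"
        else:
            status = "passed"
        records.append({
            "test_name": f"{case.get('classname', '')}.{case.get('name', '')}",
            "status": status,
            "execution_time": float(case.get("time", 0) or 0),
        })
    return records


for record in parse_junit(SAMPLE_REPORT):
    print(record)
```

Records in this shape can then be loaded into whichever store you use for analysis, with the commit, branch, and runner added from the CI environment.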

Once your test results are being captured, load them into an analytical database such as ClickHouse or BigQuery so you can query them over time. To identify flaky tests, look for patterns such as high variance in execution time or inconsistent pass/fail results across identical environments. Here's an example SQL query that could be used to flag potentially flaky tests based on execution time variance:

-- threshold is a placeholder: replace it with a number in the same unit as execution_time.
SELECT
  test_name,
  COUNT(*) AS run_count,
  stddev(execution_time) AS exec_time_stddev
FROM test_results
GROUP BY test_name
HAVING COUNT(*) > 10 AND stddev(execution_time) > threshold
ORDER BY exec_time_stddev DESC;

This query identifies tests whose execution time varies widely, measured as the standard deviation across at least ten runs. A wide spread is a useful signal of flakiness but not proof of it, so treat the results as candidates for investigation. Set a threshold that makes sense for your context, taking into account the natural variance of your test environments.
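An absolute threshold treats a 60-second integration test and a 50-millisecond unit test the same way. One possible alternative, shown here as a sketch rather than the article's method, is to compare the standard deviation to the mean (the coefficient of variation). The function name, the `max_cv` default of 0.5, and the sample timings are all illustrative assumptions.

```python
from statistics import mean, stdev


def timing_outliers(durations_by_test, min_runs=10, max_cv=0.5):
    """Flag tests whose run time spread is large relative to their mean run time.

    durations_by_test maps a test name to a list of run times in seconds.
    Returns {test_name: coefficient_of_variation} for flagged tests.
    """
    flagged = {}
    for name, durations in durations_by_test.items():
        if len(durations) <= min_runs:
            continue  # too few runs to say anything, mirroring run_count > 10
        avg = mean(durations)
        if avg == 0:
            continue  # avoid dividing by zero for tests that report no time
        cv = stdev(durations) / avg
        if cv > max_cv:
            flagged[name] = cv
    return flagged


samples = {
    "test_stable": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0, 2.1],
    "test_erratic": [0.5, 4.0, 0.6, 3.8, 0.5, 4.2, 0.7, 0.5, 3.9, 0.6, 4.1, 0.5],
    "test_slow_but_steady": [60.0, 61.5, 59.0, 60.5, 62.0, 58.5, 60.0, 61.0, 59.5, 60.0, 60.5, 61.0],
    "test_new": [1.0, 9.0],
}
print(sorted(timing_outliers(samples)))
```

With these sample timings only `test_erratic` is flagged. `test_slow_but_steady` has a standard deviation of about one second, which an absolute threshold below that would catch, yet relative to its 60-second mean the spread is small.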

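Visualizing these results can provide immediate insights into flakiness patterns. Integrate Grafana with a time-series database like Prometheus to create dashboards that track test stability over time. Because CI jobs are short-lived, they usually cannot be scraped directly and instead push their metrics through something like the Prometheus Pushgateway. A sample Grafana panel configuration might look like the following: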

{
  "title": "Flaky Test Detection",
  "type": "graph",
  "targets": [
    {
      "expr": "rate(test_failures[5m])",
      "legendFormat": "{{test_name}}",
      "intervalFactor": 2,
      "refId": "A"
    }
  ],
  "xaxis": { "mode": "time" },
  "yaxes": [
    { "format": "short" },
    { "format": "short" }
  ]
}

These dashboards can be configured to generate alerts when new flaky tests are detected, providing an immediate signal to developers to investigate further. In practice, this approach has allowed teams to reduce their average triage time from over 20 minutes to less than 5 minutes per incident.
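Alerting on "new" flaky tests implies remembering which ones are already known. The sketch below shows that comparison in plain Python, independent of any alerting product; the function names, the 5% failure-rate cutoff, and the sample data are assumptions made for illustration.

```python
def newly_flaky(known_flaky, failure_rates, min_failure_rate=0.05):
    """Return tests above the failure-rate cutoff that are not already tracked."""
    return sorted(
        name
        for name, rate in failure_rates.items()
        if rate >= min_failure_rate and name not in known_flaky
    )


def format_alert(test_names):
    """Build a one-line alert message, or None when there is nothing to report."""
    if not test_names:
        return None
    return "New flaky tests detected: " + ", ".join(test_names)


known = {"test_login"}  # already triaged, so it should not alert again
rates = {"test_login": 0.30, "test_checkout": 0.12, "test_search": 0.01}

alert = format_alert(newly_flaky(known, rates))
if alert:
    print(alert)
```

Alerting only on the difference keeps the signal useful: a test that has been flaky for weeks belongs on a backlog, not in a repeated notification.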

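Moreover, consider implementing a retry mechanism in your CI pipeline to automatically rerun failed tests. A test that fails and then passes on retry against the same code is a strong flakiness candidate, while one that fails every attempt points to a persistent issue. Record each retry rather than letting it silently turn the build green, so you keep a clear picture of which tests are truly flaky.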
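The retry idea can be sketched in a few lines. This is a simplified illustration, not a replacement for the retry features of your test runner or CI tool; `run_with_retries`, the three-attempt limit, and the simulated tests are hypothetical.

```python
from typing import Callable


def run_with_retries(test: Callable[[], None], max_attempts: int = 3) -> str:
    """Run a test up to max_attempts times and classify the outcome.

    Returns "passed" (first attempt succeeded), "flaky" (succeeded only after
    at least one failure), or "failed" (every attempt failed).
    """
    failures = 0
    for _ in range(max_attempts):
        try:
            test()
        except Exception:
            failures += 1
            continue
        return "passed" if failures == 0 else "flaky"
    return "failed"


def make_transient_test(fail_times: int) -> Callable[[], None]:
    """Build a fake test that fails its first fail_times calls, then passes."""
    state = {"calls": 0}

    def test() -> None:
        state["calls"] += 1
        assert state["calls"] > fail_times, "simulated transient failure"

    return test


def always_fails() -> None:
    raise AssertionError("simulated real defect")


print(run_with_retries(lambda: None))            # passed
print(run_with_retries(make_transient_test(1)))  # flaky
print(run_with_retries(always_fails))            # failed
```

Returning a distinct "flaky" status, instead of collapsing it into "passed", is the design choice that keeps retries from hiding the problem.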

Avoiding pass/fail-only analysis and dismissing flaky tests

One common mistake is to rely solely on pass/fail statistics to judge the reliability of tests. This approach misses the subtleties of transient failures and their underlying causes, leading to incomplete insights into test flakiness. Engineers should instead focus on collecting and analyzing detailed execution metrics.
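A small example shows why a pass rate alone can mislead. Two tests can share the same pass rate while behaving very differently: one breaks once and stays broken, the other alternates. The sketch below adds a "flip rate" (how often consecutive results differ) as one possible extra metric; the name and the sample histories are illustrative, and each history is assumed to be in chronological order.

```python
def pass_rate(history: list[bool]) -> float:
    """Fraction of runs that passed."""
    return sum(history) / len(history)


def flip_rate(history: list[bool]) -> float:
    """Fraction of consecutive run pairs whose outcome changed."""
    if len(history) < 2:
        return 0.0
    flips = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return flips / (len(history) - 1)


# Passes five times, then fails five times: looks like a regression.
regression = [True] * 5 + [False] * 5
# Alternates between pass and fail: looks like flakiness.
intermittent = [True, False] * 5

for name, history in [("regression", regression), ("intermittent", intermittent)]:
    print(name, pass_rate(history), round(flip_rate(history), 2))
```

Both histories have a 50% pass rate, but the first flips once in nine transitions and the second flips on every one, which is the distinction a pass/fail ratio cannot express.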

Another pitfall is the tendency to dismiss flaky tests as unimportant or too difficult to fix. This mindset often stems from organizational pressures to prioritize feature delivery over test maintenance. However, ignoring flaky tests can lead to technical debt that undermines long-term productivity and quality.

To avoid these pitfalls, it's crucial to foster a culture that values test stability and quality. Use automated tools for continuous monitoring and invest in fixing flaky tests as part of regular engineering cycles. Encouraging collaboration between developers and quality engineers can also help address the root causes of flakiness more effectively.

Debunking myths about coverage, flakiness, and dashboards

A pervasive myth is that high test coverage equates to high test quality. While coverage is an important metric, it doesn't account for the reliability of the tests themselves. High coverage with flaky tests can lead to a false sense of security and may obscure real issues in the codebase.

Another misunderstanding is the belief that flakiness is inevitable and unfixable. In truth, many flaky tests result from poor test design, environmental instability, or dependencies that can be addressed with targeted efforts. By systematically identifying and addressing these issues, teams can significantly improve test reliability.

Lastly, some teams assume that dashboards alone can solve the problem of test flakiness. While dashboards provide essential visibility, they must be complemented by actionable alerts and continuous improvement processes to effectively address flaky tests. Teams need to integrate these insights into their development workflows to drive real change.

Flaky test detection is a critical component of a robust CI pipeline. By implementing these strategies, you can improve your test reliability and focus on delivering quality software. The next step is to measure mean-time-to-first-signal on production incidents, ensuring that your detection systems provide timely and actionable insights. This will help you maintain a high level of software quality and operational efficiency.
