How to Stop Flakes from Coming Back

Flaky Tests & Reliability 4 min read May 05, 2026

Most teams treat test results like a checkbox: green is good, red is bad, ship or block. The interesting signal lives in everything that happens between those two states: runtime variance, retry counts, and the same five tests showing up in every postmortem. That signal is where engineering decisions actually get made.

Flaky tests are a persistent problem in continuous integration pipelines: they turn what should be a reliable signal into misleading noise. The challenge is not just to detect flakes but to keep them from coming back. This article covers strategies that experienced engineers can use to reduce flakiness and keep it down.

By the end of this article, you'll be equipped with the know-how to identify the root causes of flaky tests, efficiently triage them, and implement robust solutions that prevent recurrence. We'll explore tool-specific configurations, from GitHub Actions to Grafana, that enable you to monitor and act on patterns effectively.

This matters now more than ever as teams scale and architectures become more complex, putting pressure on CI/CD systems to be as reliable as the applications they support. A modern approach to flaky test management is not a luxury but a necessity.

Why flaky tests break CI/CD pipelines and erode trust

Flaky tests are tests with non-deterministic outcomes: they can pass or fail on the same code and configuration. The usual causes are timing issues, concurrency problems, or unstable external dependencies. Once engineers learn that a red build might mean nothing, they start ignoring red builds, including the ones that point at real regressions.
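The definition above can be turned into a simple check. The following is a minimal sketch, assuming each test run is recorded as a (test_name, commit_sha, status) tuple; the function name and the record shape are illustrative, not taken from any particular tool.

```python
from collections import defaultdict


def classify_tests(runs):
    """Label each test 'flaky', 'passing', or 'failing'.

    `runs` is an iterable of (test_name, commit_sha, status) tuples, where
    status is 'passed' or 'failed' (an assumed record shape). A test is
    flaky if any single commit saw it both pass and fail.
    """
    outcomes = defaultdict(set)  # (test, commit) -> set of statuses seen
    for test_name, commit_sha, status in runs:
        outcomes[(test_name, commit_sha)].add(status)

    labels = {}
    for (test_name, _commit), statuses in outcomes.items():
        if len(statuses) > 1:
            # Same code, different outcomes: the definition of a flake.
            labels[test_name] = "flaky"
        elif labels.get(test_name) != "flaky":
            # Consistent on this commit; the last commit seen wins.
            labels[test_name] = "passing" if "passed" in statuses else "failing"
    return labels
```

Grouping by commit is what separates a flake from a regression: a test that passes on one commit and fails on the next may simply have been broken by the change in between.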

In a modern testing architecture, flaky tests disrupt the CI/CD pipeline, leading to wasted resources, delayed releases, and frustrated engineers. They require additional cycles of investigation and rerunning, which diverts attention from delivering quality software.

Understanding flaky tests takes more than one technique: observability tools, test result analytics, and pattern recognition all help isolate the underlying causes. Fixing them takes both technical solutions and the organizational discipline to enforce good practices in test writing and maintenance.

Using Grafana, Loki, and SQL to detect flaky test patterns

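To stop flaky tests from coming back, start by integrating observability directly into your CI pipeline. Tools like Grafana and Loki are good choices for aggregating and visualizing test run data. Here’s a basic Grafana panel definition for monitoring test failures. The query is PromQL, so it assumes a Prometheus-compatible data source and a test_failures_total counter, labeled by test_name, that your CI jobs export; data stored in Loki would need an equivalent LogQL query: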

{
  "panels": [
    {
      "type": "timeseries",
      "title": "Flaky Test Frequency",
      "targets": [
        {
          "expr": "sum by (test_name) (increase(test_failures_total[1d]))",
          "format": "time_series"
        }
      ]
    }
  ]
}

This panel shows how many times each test failed over a 24-hour window. Failure frequency is a starting point rather than proof of flakiness, since a test that fails on every run is simply broken, but it tells you which tests to triage first.
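A dashboard can only show what the pipeline records. One way to feed a log aggregator such as Loki is to write one structured line per test result. The sketch below assumes a hypothetical field layout (event, run_id, commit_sha, attempt, and so on); none of these names are required by Grafana or Loki.

```python
import json


def to_log_lines(results, run_id, commit_sha):
    """Render one JSON log line per test result.

    `results` is a list of dicts with 'test_name', 'status', 'duration_s',
    and an optional 'attempt'. All field names are illustrative assumptions.
    """
    lines = []
    for r in results:
        record = {
            "event": "test_result",
            "run_id": run_id,
            "commit_sha": commit_sha,
            "test_name": r["test_name"],
            "status": r["status"],
            "duration_s": round(r["duration_s"], 3),
            "attempt": r.get("attempt", 1),  # default: first attempt
        }
        # sort_keys keeps the output stable, which makes lines easy to diff.
        lines.append(json.dumps(record, sort_keys=True))
    return lines
```

Printing these lines to the job's standard output is enough if your log shipper already collects CI logs; recording the commit and attempt number is what later lets you tell flakes from regressions.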

Next, use SQL to analyze test results stored in a database such as ClickHouse or BigQuery. As a first pass, the following query lists tests that have failed more than three times, assuming a test_results table with test_name and status columns:

SELECT test_name, COUNT(*) as failure_count
FROM test_results
WHERE status = 'failed'
GROUP BY test_name
HAVING failure_count > 3
ORDER BY failure_count DESC;

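A high failure count alone does not prove flakiness: a test that fails on every run is broken, not flaky. Treat the output as a candidate list, then check whether each test also passes on the same code before labeling it flaky.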
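To look at retry patterns specifically, the query needs to know which attempts belong to the same run. The sketch below assumes two extra columns, run_id and attempt, that are not part of the table shown above, and uses an in-memory SQLite database so that it can be run anywhere. The SQL itself sticks to constructs that most warehouses accept.

```python
import sqlite3

# Tests that both failed and passed within the same run, i.e. passed on retry.
RETRY_FLAKES_SQL = """
SELECT test_name, COUNT(*) AS passed_on_retry
FROM (
    SELECT test_name, run_id
    FROM test_results
    GROUP BY test_name, run_id
    HAVING SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN status = 'passed' THEN 1 ELSE 0 END) > 0
) AS mixed_runs
GROUP BY test_name
ORDER BY passed_on_retry DESC, test_name;
"""


def find_retry_flakes(conn):
    """Return (test_name, passed_on_retry) rows, most frequent first."""
    return conn.execute(RETRY_FLAKES_SQL).fetchall()


def demo():
    """Build a tiny sample table (hypothetical data) and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test_results "
        "(test_name TEXT, run_id TEXT, attempt INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO test_results VALUES (?, ?, ?, ?)",
        [
            ("test_checkout", "run-1", 1, "failed"),
            ("test_checkout", "run-1", 2, "passed"),
            ("test_checkout", "run-2", 1, "failed"),
            ("test_checkout", "run-2", 2, "passed"),
            ("test_login", "run-1", 1, "passed"),
            ("test_search", "run-1", 1, "failed"),
            ("test_search", "run-1", 2, "failed"),
            ("test_search", "run-2", 1, "failed"),
            ("test_search", "run-2", 2, "passed"),
        ],
    )
    return find_retry_flakes(conn)
```

In the sample data, test_search fails both attempts in run-1, so that run does not count toward its total: a test that never passes within a run looks broken, not flaky.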

For CI/CD integration, use GitHub Actions to automate flaky test identification and reporting. A simple workflow could look like this:

name: Flaky Test Detector

on:
  push:
    branches:
      - main

jobs:
  detect-flakes:
    runs-on: ubuntu-latest
    steps:
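    - name: Checkout code
      uses: actions/checkout@v4
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.x'
    - name: Install dependencies
      # Assumption: pytest is the only dependency; install your project's requirements here.
      run: pip install pytest
    - name: Run tests
      run: pytest --junitxml=results.xml
    - name: Analyze results
      # Run even when the test step fails; otherwise failures are never analyzed.
      if: always()
      run: python scripts/analyze_flakes.py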

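This workflow runs the tests on every push to main and then runs scripts/analyze_flakes.py, a script you supply, against results.xml. A single run cannot separate a flaky test from a broken one, so the script needs results from reruns or from earlier runs of the same commit to compare against.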
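The article does not show scripts/analyze_flakes.py. As a hypothetical sketch of what such a script could do, the code below reads several JUnit XML reports produced from the same commit and lists the tests whose outcome changed between them. It assumes the testcase, failure, error, and skipped elements that pytest's --junitxml output uses; the function names are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def outcomes_from_junit(xml_text):
    """Map test id -> 'passed' | 'failed' | 'skipped' for one JUnit report."""
    outcomes = {}
    for case in ET.fromstring(xml_text).iter("testcase"):
        test_id = f"{case.get('classname')}::{case.get('name')}"
        if case.find("failure") is not None or case.find("error") is not None:
            outcomes[test_id] = "failed"
        elif case.find("skipped") is not None:
            outcomes[test_id] = "skipped"
        else:
            outcomes[test_id] = "passed"
    return outcomes


def find_flaky(reports):
    """Return sorted test ids that both passed and failed across reports.

    `reports` is a list of JUnit XML strings, assumed to come from reruns
    of the same commit.
    """
    seen = defaultdict(set)
    for xml_text in reports:
        for test_id, status in outcomes_from_junit(xml_text).items():
            seen[test_id].add(status)
    return sorted(t for t, s in seen.items() if {"passed", "failed"} <= s)
```

Where the reports come from is a separate design choice: rerunning the suite a few times in one job is simple but slow, while comparing against stored reports from earlier runs of the same commit costs nothing extra at test time.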

Retries, environment gaps, and poor test isolation to avoid

A common pitfall is relying solely on test retries to deal with flaky tests. While retries can temporarily mask the problem, they do not solve the underlying issues and often lead to bloated test run times. Instead, focus on identifying root causes through observability.

Another mistake is overlooking the impact of test environment variability. Flakiness often arises from differences between the environments where tests run, such as a developer laptop versus a CI runner, or one runner image versus the next. Keep the test environment consistent from run to run, and as close to production as possible, to mitigate this risk.
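Environment differences are easier to spot when each run records what it ran on. The sketch below computes a short fingerprint from the interpreter version, the operating system, and a few environment variables. The choice of tracked variables (TZ, LANG, CI) is an assumption; pick the ones that matter for your suite.

```python
import hashlib
import json
import os
import platform


def environment_fingerprint(env=None, tracked_vars=("TZ", "LANG", "CI")):
    """Return (short_hash, details) describing the test environment.

    `env` defaults to os.environ; pass a dict to fingerprint a recorded
    environment. The tracked variable names are illustrative assumptions.
    """
    env = os.environ if env is None else env
    details = {
        "python": platform.python_version(),
        "os": platform.system(),
        "machine": platform.machine(),
        "vars": {name: env.get(name, "") for name in tracked_vars},
    }
    # Serialize with sorted keys so the same details always hash the same.
    payload = json.dumps(details, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12], details
```

Storing the fingerprint next to each test result lets you ask whether a test's failures cluster on one fingerprint, which points at the environment rather than the test.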

Lastly, engineers sometimes ignore the importance of proper test isolation. Tests that share state or dependencies can produce flaky results. Use dependency injection and mock services to ensure tests are isolated and repeatable.
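Time is one of the most common hidden dependencies. As a small illustration of dependency injection, the sketch below passes the clock into the code under test so the test controls it directly instead of sleeping. The Session and FakeClock classes are hypothetical examples, not part of any library.

```python
from datetime import datetime, timedelta, timezone


class Session:
    """A session that expires ttl_seconds after creation."""

    def __init__(self, ttl_seconds, now=lambda: datetime.now(timezone.utc)):
        # The clock is injected; production code uses the real one by default.
        self._now = now
        self._expires_at = now() + timedelta(seconds=ttl_seconds)

    def is_expired(self):
        return self._now() >= self._expires_at


class FakeClock:
    """A callable clock that only moves when the test tells it to."""

    def __init__(self, start):
        self.current = start

    def __call__(self):
        return self.current

    def advance(self, seconds):
        self.current += timedelta(seconds=seconds)


def test_session_expires_after_ttl():
    clock = FakeClock(datetime(2026, 1, 1, tzinfo=timezone.utc))
    session = Session(ttl_seconds=30, now=clock)
    assert not session.is_expired()
    clock.advance(29)
    assert not session.is_expired()
    clock.advance(1)
    assert session.is_expired()
```

Because the test never calls time.sleep or reads the wall clock, it gives the same result on a slow CI runner as on a fast laptop.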

Debunking myths about pass/fail signals and flakiness

One myth is that pass/fail is the ultimate signal of test quality. In reality, a test that always passes but whose execution time swings widely, or whose behavior differs between environments, often points to a problem that has not yet surfaced as a failure. Pay attention to these nuances in your observability data.
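Runtime variance can be measured with a few lines of code. The sketch below flags tests whose durations have a high coefficient of variation (standard deviation divided by mean). The thresholds, a minimum of five runs and a cutoff of 0.5, are arbitrary assumptions to tune against your own data.

```python
from statistics import mean, pstdev


def duration_variability(durations_by_test, min_runs=5, cv_threshold=0.5):
    """Return {test_name: cv} for tests with unusually variable runtimes.

    `durations_by_test` maps a test name to a list of durations in seconds.
    Tests with fewer than min_runs samples are skipped as too noisy to judge.
    """
    flagged = {}
    for test_name, durations in durations_by_test.items():
        if len(durations) < min_runs:
            continue
        avg = mean(durations)
        if avg == 0:
            continue  # avoid dividing by zero for instantaneous tests
        cv = pstdev(durations) / avg
        if cv > cv_threshold:
            flagged[test_name] = round(cv, 2)
    return flagged
```

Using a ratio rather than raw standard deviation keeps slow and fast tests comparable: a half-second swing matters for a one-second test and is noise for a five-minute one.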

Another misconception is that code coverage equates to quality. While coverage is a useful metric, it doesn't account for the reliability of the tests themselves. A high-coverage suite filled with flaky tests is less valuable than a lower-coverage suite that is reliable.

Finally, many teams believe that flakiness is an unavoidable aspect of testing. With modern tools and practices, flaky tests can be systematically reduced and often eliminated, improving CI/CD pipeline reliability significantly.

Addressing flaky tests is a continuous process that demands vigilance and the right set of tools. By building observability into your test suite and adopting good practices for test isolation and environment consistency, you can minimize flakiness and improve pipeline reliability. Once that is in place, the next thing worth measuring is mean-time-to-first-signal on production incidents, so that real-world issues get a quick response.
