iTestResults

Quarantine vs Fix: When to Use Each

Most teams treat test results like a checkbox: green is good, red is bad, ship or block. The interesting signal lives in everything that happens between those two states — runtime variance, retry counts, the same five tests showing up in every postmortem. That signal is where engineering decisions actually get made.

Flaky tests present a unique challenge: they intermittently fail, clouding the true state of your codebase. The choice between quarantining these tests and fixing them determines the reliability of your CI pipeline and the confidence in your deployments.

By the end of this article, you'll understand when to quarantine a test to keep the pipeline moving and when to prioritize fixing it to maintain test suite health. This decision matters more as teams scale their architectures, because the cost of delayed feedback loops grows with them.

Microservice architectures add network calls and shared environments, which give tests more ways to fail intermittently. At the same time, modern observability tools make those failures easier to measure, so it is worth refining how we handle flaky tests.

What This Actually Is

Quarantining a test means temporarily removing it from the main CI pipeline to prevent it from affecting the overall build status. It is a tactical move, often used to keep deployments on schedule while a deeper investigation occurs.

Fixing a test, on the other hand, involves identifying the root cause of its flakiness and altering the test or the system under test to eliminate the source of failure. This is a strategic approach aimed at long-term stability and reliability.

In a modern test architecture, these practices fit into the broader category of test maintenance and reliability engineering. They involve collaboration among SDETs, DevOps, and software engineers to ensure test suites remain robust against code changes and infrastructure variations.

How To Implement It

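To quarantine a test in a GitHub Actions pipeline, tag it with a dedicated pytest marker and deselect that marker in the main test step. The example below assumes a custom marker named quarantine, registered in your pytest configuration and applied to the flaky test with @pytest.mark.quarantine.

name: CI

on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Install test dependencies
      run: python -m pip install pytest

    # Tests carrying the (assumed) quarantine marker are deselected here.
    - name: Run tests
      run: pytest --disable-warnings -m "not quarantine"

This workflow still runs the rest of the suite on every push and pull request, but tests carrying the quarantine marker are deselected, so known flaky tests cannot fail the build.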
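If you would rather keep the quarantine list in one place than scatter it across test files, pytest's --deselect option accepts node IDs on the command line. The following is a minimal sketch, assuming a hypothetical in-repo list of entries with an expiry date; the QuarantineEntry type, the ticket IDs, and the test paths are illustrative and not part of any library.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QuarantineEntry:
    node_id: str   # pytest node ID, e.g. "tests/test_api.py::test_checkout"
    ticket: str    # hypothetical tracking ticket for the fix
    expires: date  # last day the test stays out of the main run


def deselect_args(entries, today):
    """Return pytest --deselect arguments for entries that have not expired."""
    args = []
    for entry in entries:
        if today <= entry.expires:
            args.append(f"--deselect={entry.node_id}")
    return args


# Illustrative data only; replace with your own tests and tickets.
QUARANTINE = [
    QuarantineEntry("tests/test_api.py::test_checkout", "QA-101", date(2024, 6, 30)),
    QuarantineEntry("tests/test_db.py::test_migration", "QA-87", date(2024, 5, 1)),
]

if __name__ == "__main__":
    # Print the arguments so a CI step can append them to the pytest command.
    print(" ".join(deselect_args(QUARANTINE, date.today())))
```

Because each entry expires, a quarantined test returns to the main run on its own unless someone consciously extends the date.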

Fixing a test requires a different approach. It often starts with data collection. You can use a tool like Datadog to trace requests and identify the failure patterns.

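from ddtrace import tracer

EXPECTED_VALUE = "ok"  # placeholder for the real expected result

def some_service_call():
    # Placeholder for the real call that behaves intermittently
    return "ok"

@tracer.wrap(name="test_flaky_function")
def test_flaky_function():
    # The span records the duration of each run and any error raised
    assert some_service_call() == EXPECTED_VALUE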

By instrumenting tests with tracing, you can gather metrics on execution time, failure rates, and other relevant data to pinpoint the cause of flakiness.
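Once run data is collected, a simple first analysis is to separate flaky tests from tests that fail consistently. The sketch below assumes you can export run history as (test_id, commit_sha, passed) tuples; that record shape is an assumption, not a Datadog format. It treats a test as flaky when it both passed and failed on the same commit, since the code did not change between those runs.

```python
from collections import defaultdict


def flaky_tests(runs):
    """Return {test_id: failure_rate} for tests that both passed and failed
    on at least one identical commit.

    `runs` is an iterable of (test_id, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(lambda: defaultdict(set))
    totals = defaultdict(lambda: [0, 0])  # test_id -> [failures, runs]
    for test_id, commit, passed in runs:
        outcomes[test_id][commit].add(passed)
        totals[test_id][0] += 0 if passed else 1
        totals[test_id][1] += 1

    result = {}
    for test_id, by_commit in outcomes.items():
        # Mixed results on unchanged code point at the test or its environment.
        if any(len(seen) == 2 for seen in by_commit.values()):
            failures, count = totals[test_id]
            result[test_id] = failures / count
    return result
```

A test that fails on one commit and passes on the next is not reported, because a code change could explain the difference.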

Another practical step is to use Grafana dashboards to visualize test failures over time. With Grafana's integration with Prometheus, you can set up alerts based on failure thresholds to trigger automated quarantining or to notify the team when a test's flakiness exceeds acceptable limits.

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "panels": [
    {
      "alert": {
        "conditions": [
          {
            "evaluator": {
              "params": [5],
              "type": "gt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": ["A", "5m", "now"]
            },
            "reducer": {
              "type": "avg"
            },
            "type": "query"
          }
        ],
        "executionErrorState": "alerting",
        "frequency": "1m",
        "handler": 1,
        "name": "Test Flakiness Alert",
        "noDataState": "no_data",
        "notifications": []
      },
      "cacheTimeout": null,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {},
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "fieldOptions": {
          "calcs": ["mean"],
          "defaults": {},
          "overrides": []
        },
        "showHeader": true
      },
      "pluginVersion": "7.5.7",
      "targets": [
        {
          "expr": "rate(test_failures_total[5m])",
          "interval": "",
          "legendFormat": "Flaky Test Failures",
          "refId": "A"
        }
      ],
      "title": "Test Failures Over Time",
      "type": "graph"
    }
  ],
  "schemaVersion": 30,
  "version": 1
}

This panel shows when test failures cluster, which helps decide whether to quarantine or fix a test. Pairing it with logs explains why they fail: triage time dropped from 22 minutes per failure to under 4 once we wired the dashboard to Loki.

Common Pitfalls

One common mistake is treating all flaky tests as equal candidates for quarantine. This often happens due to a lack of proper test categorization and prioritization, leading to unnecessary quarantining and missed opportunities for quick fixes.
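Categorization can start as a few explicit rules. The sketch below is one possible triage function; the inputs (failure rate, whether the test guards a critical path, estimated fix effort) and every threshold are assumptions chosen for illustration and should be tuned to your own suite.

```python
from enum import Enum


class Action(Enum):
    FIX_NOW = "fix now"
    QUARANTINE = "quarantine with ticket"
    MONITOR = "keep running and monitor"


def triage(failure_rate, guards_critical_path, est_fix_hours):
    """Pick an action for a flaky test. All thresholds are illustrative."""
    if failure_rate < 0.01:
        return Action.MONITOR      # rare failures: leave the test in place
    if est_fix_hours <= 2:
        return Action.FIX_NOW      # a quick fix avoids quarantine overhead
    if guards_critical_path:
        return Action.FIX_NOW      # losing this coverage is too risky
    return Action.QUARANTINE       # expensive fix, non-critical coverage
```

For example, triage(0.2, False, 8) returns Action.QUARANTINE, while the same test guarding a critical path returns Action.FIX_NOW.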

Another pitfall is failing to revisit quarantined tests. Teams might quarantine a test and forget about it, assuming it’s a solved problem. This oversight can lead to a growing backlog of unaddressed issues, which eventually impacts CI performance and reliability.
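A scheduled report of overdue quarantines is one way to keep that backlog visible. The sketch below assumes you record the date each test was quarantined; the mapping shape and the 14-day default are assumptions, not a standard.

```python
from datetime import date, timedelta


def overdue(quarantined, today, max_age_days=14):
    """Return test IDs quarantined for longer than max_age_days, oldest first.

    `quarantined` maps test ID -> date the test was quarantined.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = [(since, test_id) for test_id, since in quarantined.items() if since < cutoff]
    return [test_id for since, test_id in sorted(stale)]
```

Running this in a nightly job and posting the result to the team channel turns a forgotten quarantine into a visible, dated item.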

Lastly, relying solely on manual triage processes can be inefficient. Automating alerts and using dashboards to monitor and manage flaky tests can streamline the process significantly, reducing human error and allowing engineers to focus on value-adding tasks.

What Most Teams Get Wrong

A prevailing myth is that pass/fail is the only signal of test reliability. In truth, metrics like test duration variance and flakiness rates provide a deeper understanding of test suite health.
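Duration variance is straightforward to compute from recorded run times. The sketch below uses the coefficient of variation (standard deviation divided by mean) so that slow and fast tests can be compared on the same scale; the 0.5 threshold and the history mapping are assumptions for illustration.

```python
from statistics import mean, stdev


def duration_cv(durations):
    """Coefficient of variation (stdev / mean) of a test's run times in seconds."""
    if len(durations) < 2:
        return 0.0
    avg = mean(durations)
    return stdev(durations) / avg if avg > 0 else 0.0


def unstable_tests(history, threshold=0.5):
    """Return test IDs whose duration CV exceeds the threshold, sorted by name."""
    return sorted(t for t, d in history.items() if duration_cv(d) > threshold)
```

A test that always passes but swings widely in run time is often waiting on a timeout or a shared resource, which makes it a likely future source of flakiness.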

Another misconception is equating high test coverage with high quality. Coverage is a useful metric, but without considering the reliability and relevance of tests, it can be misleading.

Finally, some believe that flakiness is an unfixable attribute of tests. While some tests may inherently be more prone to flakiness, many can be stabilized through careful analysis and adjustments in their setup or execution context.

Understanding when to quarantine versus when to fix flaky tests is crucial for maintaining a healthy CI pipeline and a reliable codebase. As you refine your approach, consider measuring the mean-time-to-first-signal on production incidents to further enhance your test strategy's effectiveness.
