Parsing JUnit XML Reports to Extract Flaky-Test Signals

Test Analytics & Metrics | 6 min read | July 24, 2026

Most teams treat JUnit XML as a transport format: CI reads it, marks the build green or red, and the file rots in an artifact store. The real value isn't in the final status — it's in the distribution of outcomes across runs. A test that fails one in five times on the same commit, or whose duration drifts from 400 ms to 4 s over two weeks, is a reliability debt accruing interest every sprint.

The problem is structural. JUnit XML was designed to report a single run, not to reason across runs. Nothing in the schema encodes history, retry counts, or environment context. Extracting flaky-test signals means building that layer yourself — ingestion, normalization, and a query model that treats test identity as a time series, not a snapshot.

By the end of this article you'll have a working pipeline: parse raw JUnit XML in CI, normalize it into a schema suitable for ClickHouse or PostgreSQL, write the queries that surface flakiness rates and duration regressions, and wire the output into a Grafana dashboard your team will actually use.

JUnit XML as a Flakiness Data Source: Schema, Limits, and Signal Density

A JUnit XML report is a <testsuites> document containing one or more <testsuite> elements, each holding <testcase> nodes that carry classname, name, time, and optional <failure>, <error>, or <skipped> children. That's the entire schema. There's no concept of a retry, no run ID, no branch, no executor. Pytest's --junit-xml adds properties blocks; Playwright's JUnit reporter adds retries as a custom attribute. Most parsers silently drop those extensions.

In a modern test architecture, JUnit XML sits at the collection layer — it's what your runner emits before anything interesting happens. The signal density increases only when you join it with run metadata: Git SHA, branch, CI job ID, retry attempt number, and wall-clock timestamp. Without those joins, you can count failures. With them, you can distinguish a test that is consistently broken (needs a fix) from one that is non-deterministically broken (needs isolation and root-cause analysis). That distinction is the entire point.
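To make the broken-versus-flaky distinction concrete, here is a minimal sketch that labels tests from enriched rows. It assumes each row is a dict with test_id, sha, and status keys (the shape the parser in the next section produces); the function name and the three labels are illustrative choices, not part of any standard.

```python
from collections import defaultdict


def classify_tests(rows):
    """Label each test_id as 'flaky', 'broken', or 'deterministic'.

    Assumes rows are dicts with 'test_id', 'sha', and 'status' keys, where
    status is one of 'pass', 'failure', 'error', 'skipped'.
    """
    # Collect the set of pass/fail outcomes seen per (test, commit).
    outcomes = defaultdict(set)
    for row in rows:
        if row["status"] == "skipped":
            continue  # a skipped run says nothing about determinism
        outcomes[(row["test_id"], row["sha"])].add(row["status"] == "pass")

    per_test = defaultdict(list)
    for (test_id, _sha), seen in outcomes.items():
        per_test[test_id].append(seen)

    labels = {}
    for test_id, per_commit in per_test.items():
        if any(len(seen) == 2 for seen in per_commit):
            labels[test_id] = "flaky"          # passed and failed on one commit
        elif all(seen == {False} for seen in per_commit):
            labels[test_id] = "broken"         # never passed on any commit
        else:
            labels[test_id] = "deterministic"  # outcome follows the code
    return labels
```

A test that fails on one commit and passes on the next lands in "deterministic": the outcome changed because the code changed, which is a regression or a fix rather than flakiness.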

Building the Ingestion and Query Pipeline: From XML to Flakiness Rate

Start with a Python parser that enriches each test case with run-level context before it ever touches a database. The key is to treat every <testcase> as a row with a stable test identity key (classname + name, normalized) and mutable per-run fields.

import hashlib
import json
import os
import sys
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def parse_junit(xml_path: str, run_meta: dict) -> list[dict]:
    """Flatten one JUnit XML file into rows, one per <testcase>, tagged with run metadata."""
    root = ET.parse(xml_path).getroot()
    rows = []
    # iter() reaches every <testcase> whether the root is <testsuites> or a
    # bare <testsuite>, including suites nested inside other suites.
    for tc in root.iter("testcase"):
        name = tc.get("name", "").strip()
        classname = tc.get("classname", "").strip()
        # Stable identity key: short hash of the trimmed classname::name pair.
        identity = hashlib.sha1(f"{classname}::{name}".encode()).hexdigest()[:12]
        # A <failure> outranks an <error>, which outranks <skipped>.
        if tc.find("failure") is not None:
            status = "failure"
        elif tc.find("error") is not None:
            status = "error"
        elif tc.find("skipped") is not None:
            status = "skipped"
        else:
            status = "pass"
        rows.append({
            "test_id":   identity,
            "classname": classname,
            "name":      name,
            "duration":  float(tc.get("time") or 0),  # seconds; missing time becomes 0
            "status":    status,
            "run_id":    run_meta["run_id"],
            "branch":    run_meta["branch"],
            "sha":       run_meta["sha"],
            "ts":        run_meta["ts"],
        })
    return rows


if __name__ == "__main__":
    meta = {
        "run_id": os.environ["GITHUB_RUN_ID"],
        "branch": os.environ["GITHUB_REF_NAME"],
        "sha":    os.environ["GITHUB_SHA"],
        "ts":     datetime.now(timezone.utc).isoformat(),
    }
    # Emit one JSON object per line, the shape ClickHouse's JSONEachRow expects.
    for row in parse_junit(sys.argv[1], meta):
        print(json.dumps(row))

Wire this into your GitHub Actions workflow so it runs immediately after the test step, before artifacts are uploaded. Pipe the JSON to a ClickHouse HTTP insert or a COPY into PostgreSQL.

# .github/workflows/test.yml (relevant excerpt)
- name: Run tests
  run: pytest tests/ --junit-xml=results/junit.xml

- name: Ingest test results
  # Run even when the test step fails: failing runs carry the signal.
  if: always()
  env:
    GITHUB_RUN_ID: ${{ github.run_id }}
    GITHUB_REF_NAME: ${{ github.ref_name }}
    GITHUB_SHA: ${{ github.sha }}
  run: |
    python scripts/parse_junit.py results/junit.xml \
      | curl -sS --fail --data-binary @- \
        "https://clickhouse.internal/insert?query=INSERT+INTO+test_runs+FORMAT+JSONEachRow"

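Once you have 30+ days of data, the flakiness query is straightforward. The strictest definition of flaky is mixed outcomes on the same commit. Mixed outcomes across runs on a single branch is a practical approximation of that, and a far better one than a global failure rate that blends in unstable feature branches. This ClickHouse query gives you a ranked flakiness report for main: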

SELECT
    classname,
    name,
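    countIf(status IN ('failure', 'error')) AS failures,   -- an error is a failed outcome too
    countIf(status != 'skipped')            AS total_runs, -- skipped runs never executed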
    round(failures / total_runs, 3) AS flakiness_rate,
    avg(duration)                AS avg_duration_s,
    quantile(0.95)(duration)     AS p95_duration_s
FROM test_runs
WHERE branch = 'main'
  AND ts >= now() - INTERVAL 30 DAY
GROUP BY classname, name
HAVING total_runs >= 10
   AND flakiness_rate BETWEEN 0.05 AND 0.95   -- exclude near-always-green and consistently broken tests
ORDER BY flakiness_rate DESC
LIMIT 50;

The BETWEEN 0.05 AND 0.95 predicate is deliberate: a test that fails on nearly every run is broken, not flaky, and belongs in a different triage queue. The 0.05 floor also hides tests that fail less than once in twenty runs, so lower it if rare failures matter in your suite. Feed this query into a Grafana panel using the ClickHouse data source plugin (v3+), set the refresh to 1 hour, and add a variable filter for branch. One team running this setup cut their flaky-test triage time from 22 minutes per failure to under 4 minutes once engineers could link directly from the Slack alert to the pre-filtered dashboard row.
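If you want to sanity-check the thresholds before they reach a dashboard, the same arithmetic can be sketched in plain Python over parsed rows. This is an illustration, not a replacement for the SQL: the function name and the failing parameter are assumptions, and this version counts both failure and error as failing and ignores skipped rows. Narrow failing to ("failure",) if you want assertion failures only.

```python
from collections import Counter


def flakiness_report(rows, min_runs=10, low=0.05, high=0.95,
                     failing=("failure", "error")):
    """Rank tests whose failure rate falls inside [low, high].

    Assumes rows are dicts with 'classname', 'name', and 'status' keys.
    """
    failures, totals = Counter(), Counter()
    for row in rows:
        if row["status"] == "skipped":
            continue  # a skipped test did not run, so it is not a sample
        key = (row["classname"], row["name"])
        totals[key] += 1
        if row["status"] in failing:
            failures[key] += 1

    report = []
    for key, total in totals.items():
        rate = round(failures[key] / total, 3)
        # Same gates as the query: enough samples, and neither always
        # green nor consistently broken.
        if total >= min_runs and low <= rate <= high:
            report.append({
                "test": key,
                "failures": failures[key],
                "total_runs": total,
                "flakiness_rate": rate,
            })
    report.sort(key=lambda entry: entry["flakiness_rate"], reverse=True)
    return report
```

Running it against a few hand-built rows is a cheap way to confirm that a 1-in-30 failure falls below the floor and a 10-in-10 failure falls above the ceiling.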

Two Ingestion Mistakes That Corrupt Your Flakiness Signal

Ingesting only the final merged XML. Many CI systems (Jenkins with the JUnit plugin, GitHub Actions with dorny/test-reporter) aggregate XML files across parallel shards before storing results. If a test fails in shard 3 and passes in shards 1, 2, and 4, the aggregated file may record only the last-seen status. You lose the failure entirely. Fix this by ingesting per-shard XML files individually, keyed by shard index, before any aggregation step.
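Per-shard ingestion can be sketched as below. The file naming (junit-shard-3.xml) and the shard field are assumptions for illustration; use whatever your CI actually writes. The status logic is deliberately reduced to pass, failure, or skipped to keep the example short.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical naming convention: results/junit-shard-<index>.xml
SHARD_PATTERN = re.compile(r"shard-(\d+)")


def rows_per_shard(results_dir):
    """Yield one row per <testcase> per shard file, tagged with the shard index."""
    for path in sorted(Path(results_dir).glob("*.xml")):
        match = SHARD_PATTERN.search(path.name)
        shard = int(match.group(1)) if match else None
        for tc in ET.parse(path).getroot().iter("testcase"):
            if tc.find("failure") is not None or tc.find("error") is not None:
                status = "failure"
            elif tc.find("skipped") is not None:
                status = "skipped"
            else:
                status = "pass"
            yield {
                "classname": tc.get("classname", ""),
                "name": tc.get("name", ""),
                "shard": shard,  # keeps same-named results from different shards apart
                "status": status,
            }
```

Because every row keeps its shard index, a failure in one shard and a pass in another both survive as separate rows instead of collapsing into whichever result an aggregator saw last.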

Not normalizing test identity across refactors. When a class is renamed or a test is moved, its classname::name key changes and your history breaks. The flakiness rate resets to zero for what is functionally the same test. A lightweight fix is to maintain a test_aliases table mapping old identities to canonical ones, updated via a post-merge hook. The deeper fix is a stable ID attached to the test itself, for example a custom @pytest.mark.test_id("uuid") marker. Pytest does not ship such a marker and does not write markers to the report on its own, so you need a fixture or hook that copies the value into the XML <properties> block (the built-in record_property fixture does this) and a parser that reads it back. Most teams skip this until they've been burned by a false "no flakiness" report after a large refactor.
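Reading a stable ID back out of the report might look like the sketch below. It assumes the ID was written as a property named test_id inside each <testcase>'s <properties> block, which is where pytest's record_property fixture puts its values; the property name and the function name are illustrative.

```python
import hashlib
import xml.etree.ElementTree as ET


def stable_test_id(tc: ET.Element, property_name: str = "test_id") -> str:
    """Prefer an explicit ID from <properties>; fall back to hashing classname::name."""
    for prop in tc.findall("properties/property"):
        if prop.get("name") == property_name and prop.get("value"):
            return prop.get("value")
    # No explicit ID recorded: fall back to the rename-sensitive hash.
    key = f"{tc.get('classname', '')}::{tc.get('name', '')}"
    return hashlib.sha1(key.encode()).hexdigest()[:12]
```

With this in place, a test that moves from one class to another keeps its history as long as the recorded ID moves with it, while untagged tests behave exactly as before.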

What Teams Misread in Their Own JUnit Data

Treating retry-pass as a clean pass. Retry plugins such as pytest-retry may report only the final outcome, so a test that passes on the third attempt looks identical to one that passed on the first, and your flakiness rate is silently understated. Maven Surefire's rerunFailingTestsCount option does keep the earlier attempts, but as <flakyFailure> and <flakyError> children of an otherwise passing <testcase>, which a parser that checks only <failure> and <error> never sees. Check your runner's retry configuration and its actual XML output, and confirm that retry attempts are written as separate <testcase> nodes, recorded in retry-specific elements, or surfaced in a <properties> block you're actually reading.
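A parser can surface retries that a runner does record. The sketch below looks for the retry elements Maven Surefire writes when reruns are enabled (flakyFailure, flakyError, rerunFailure, rerunError); treat those tag names as an assumption to verify against your own reports, since other runners use different conventions or none at all. The function name and returned keys are illustrative.

```python
import xml.etree.ElementTree as ET

# Element names as written by Maven Surefire reruns; verify against your runner.
RETRY_TAGS = ("flakyFailure", "flakyError", "rerunFailure", "rerunError")


def retry_summary(tc: ET.Element) -> dict:
    """Summarize retry evidence recorded on a single <testcase> element."""
    retry_failures = sum(len(tc.findall(tag)) for tag in RETRY_TAGS)
    final_failed = tc.find("failure") is not None or tc.find("error") is not None
    return {
        # A plain <failure>/<error> counts as one more failed attempt.
        "failed_attempts": retry_failures + (1 if final_failed else 0),
        "final_status": "failure" if final_failed else "pass",
        # Failed at least once, then ended green: the retry-pass case.
        "passed_on_retry": retry_failures > 0 and not final_failed,
    }
```

Storing passed_on_retry as its own column lets the flakiness query count a retry-pass as a failed attempt instead of a clean pass.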

Equating duration variance with flakiness. A test whose runtime swings from 200 ms to 2 s but always passes is not flaky — it's a performance signal, and it belongs in a separate duration-regression alert, not your flakiness queue. Mixing the two inflates your "flaky test count" metric and trains engineers to ignore it. Keep status-based flakiness and duration-based regression as distinct queries with distinct alert thresholds. The P95 duration column in your ClickHouse query is a starting point; a 2× week-over-week increase in P95 for a previously stable test is worth a separate Grafana alert rule.
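The week-over-week check is simple enough to prototype outside Grafana. This sketch assumes you have already grouped durations (in seconds) by test ID for each week; the function names, the 2× factor default, and the ten-sample minimum are illustrative choices.

```python
from statistics import quantiles


def p95(durations):
    """95th percentile of a list of durations, in the same unit as the input."""
    if len(durations) < 2:
        return durations[0] if durations else 0.0
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    return quantiles(durations, n=20, method="inclusive")[-1]


def duration_regressions(last_week, this_week, factor=2.0, min_samples=10):
    """Return {test_id: (old_p95, new_p95)} where P95 grew by at least `factor`.

    Both arguments map test_id -> list of durations for that week.
    """
    flagged = {}
    for test_id, current in this_week.items():
        previous = last_week.get(test_id, [])
        if len(previous) < min_samples or len(current) < min_samples:
            continue  # too few samples for a meaningful percentile
        old, new = p95(previous), p95(current)
        if old > 0 and new >= factor * old:
            flagged[test_id] = (old, new)
    return flagged
```

Nothing here looks at pass or fail status, which is the point: the output feeds a duration alert, and the flakiness queue stays status-only.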

The JUnit XML format will not evolve to solve these problems for you — it's a 20-year-old reporting convention, not an observability platform. The teams getting value from it are the ones who treat it as raw telemetry and build the enrichment layer themselves. If you want a reference schema and a set of pre-built ClickHouse migrations to start from, the Test Analytics section of this site has both. Start there, then adapt the flakiness query to your branching model.
