Parsing JUnit XML Reports to Extract Flaky-Test Signals

Most teams treat JUnit XML as a pass/fail receipt — something Jenkins or GitHub Actions consumes, renders green or red, and discards. The real signal is in the attributes nobody reads: time, timestamp, retry counts smuggled into rerun-failure elements, and the same classname appearing in five consecutive failed runs. That's where flakiness lives, and it's all sitting in a file format that hasn't changed meaningfully since Ant 1.6.

The problem isn't that JUnit XML lacks signal — it's that most pipeline tooling stops at aggregation. You get a count of failures; you don't get a distribution of outcomes for LoginTest::test_session_expiry across the last 200 runs. Without that history, every flaky failure looks like a new bug.

By the end of this article you'll have a working Python parser that extracts per-test outcome history from JUnit XML, a ClickHouse schema to store it, and a Grafana query that ranks your flakiest tests by retry count and failure rate, the kind of signal that actually drives a triage queue.

What JUnit XML Flakiness Parsing Actually Is

JUnit XML is a de facto standard, not an official spec. It was originally defined by the Ant JUnit task and is now emitted by pytest (--junitxml), Playwright, k6, the runners behind most Selenium suites, and dozens of other frameworks. Each <testcase> element carries classname, name, and time attributes plus an optional child element: <failure>, <error>, <skipped>, or nothing (pass). Maven Surefire also emits <rerunFailure> and <flakyFailure> children when rerunFailingTestsCount is set; retry reporting in other runners varies by plugin and version. Where these elements are present, they are the richest flakiness indicators in the format.
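To make the element layout concrete, here is a minimal sketch that lists the child elements of each <testcase> in a small hand-written report. The SAMPLE document and the summarize helper are illustrative assumptions, not output captured from a real runner, and real reports usually carry more attributes.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: one clean pass, one hard failure, one pass-after-retry.
SAMPLE = """<testsuite name="LoginTest" tests="3" timestamp="2024-05-01T10:00:00">
  <testcase classname="LoginTest" name="test_ok" time="0.120"/>
  <testcase classname="LoginTest" name="test_broken" time="0.450">
    <failure message="assert failed">trace</failure>
  </testcase>
  <testcase classname="LoginTest" name="test_session_expiry" time="1.300">
    <flakyFailure message="timeout">trace</flakyFailure>
  </testcase>
</testsuite>"""


def summarize(xml_text):
    """Map each test name to the tags of its child elements."""
    root = ET.fromstring(xml_text)
    return {tc.get("name"): [child.tag for child in tc] for tc in root.iter("testcase")}


summary = summarize(SAMPLE)
for name, tags in summary.items():
    print(name, tags or "(no children: pass)")
```

A test with no children passed cleanly; a test whose only child is <flakyFailure> passed only after a retry, which is exactly the case a pass/fail count hides.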

In a modern test architecture, JUnit XML sits at the boundary between test execution and observability. Allure and ReportPortal both ingest it, but they optimize for human browsing of individual runs, not for time-series analysis across hundreds of runs. To detect flakiness as a pattern — not just an incident — you need to extract per-test outcome tuples (run_id, test_id, status, duration_ms, timestamp) and push them into a store that supports windowed aggregation: ClickHouse, BigQuery, or PostgreSQL with a partitioned table.

How to Build the Extraction Pipeline

Start with a Python parser that normalizes the XML into structured records. The standard library's xml.etree.ElementTree is sufficient; lxml adds full XPath 1.0 support but isn't worth the dependency for this use case.

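import hashlib
import json
import sys
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def parse_junit(path: str, run_id: str) -> list[dict]:
    """Flatten one JUnit XML file into one record per <testcase>."""
    root = ET.parse(path).getroot()
    # UTC in "YYYY-MM-DD hh:mm:ss" form, which ClickHouse DateTime parses by default.
    ingested_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    records = []
    # iter() also yields the root itself when the file is a bare <testsuite>.
    for suite in root.iter("testsuite"):
        ts = suite.get("timestamp")
        # Direct children only, so testcases in nested suites are not counted twice.
        for tc in suite.findall("testcase"):
            # Stable 16-character ID derived from classname + name.
            test_id = hashlib.sha1(
                f"{tc.get('classname')}.{tc.get('name')}".encode()
            ).hexdigest()[:16]
            failure = tc.find("failure")
            error = tc.find("error")
            skipped = tc.find("skipped")
            rerun_f = tc.findall("rerunFailure")
            flaky_f = tc.findall("flakyFailure")
            if failure is not None or error is not None:
                status = "failed"
            elif skipped is not None:
                status = "skipped"
            else:
                status = "passed"
            records.append({
                "run_id":       run_id,
                "test_id":      test_id,
                "classname":    tc.get("classname"),
                "name":         tc.get("name"),
                "status":       status,
                # round() avoids float truncation; "or 0" covers a missing or empty time.
                "duration_ms":  round(float(tc.get("time") or 0) * 1000),
                "retries":      len(rerun_f) + len(flaky_f),
                "has_flaky_el": int(bool(flaky_f)),  # 0/1 to match the UInt8 column
                "suite_ts":     ts,
                "ingested_at":  ingested_at,
            })
    return records


if __name__ == "__main__":
    # Usage: parse_junit.py FILE [FILE ...] RUN_ID  (prints one JSON object per line)
    *paths, run_id = sys.argv[1:]
    for path in paths:
        for record in parse_junit(path, run_id):
            print(json.dumps(record))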

Wire this into your CI as a post-test step. In GitHub Actions, upload results to ClickHouse via the HTTP interface — no extra SDK needed:

# .github/workflows/test.yml  (post-test job step)
- name: Ingest JUnit results
  if: always()
  env:
    CH_URL: ${{ secrets.CLICKHOUSE_URL }}
    RUN_ID: ${{ github.run_id }}-${{ github.run_attempt }}
  run: |
    shopt -s globstar   # bash does not expand ** recursively by default
    python scripts/parse_junit.py test-results/**/*.xml "$RUN_ID" \
      | curl -sS --fail \
             "$CH_URL/?query=INSERT+INTO+test_runs+FORMAT+JSONEachRow&date_time_input_format=best_effort" \
             --data-binary @-
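
If shelling out to curl is inconvenient, the same insert could be built with the standard library. This is a minimal sketch: build_insert_request is a hypothetical helper, the localhost URL is a placeholder, and authentication is left out. The sketch only constructs the request so that it runs offline; urllib.request.urlopen(req) would send it.

```python
import json
import urllib.parse
import urllib.request


def build_insert_request(ch_url, table, records):
    """Build a POST request that inserts records as JSONEachRow over HTTP."""
    query = f"INSERT INTO {table} FORMAT JSONEachRow"
    url = ch_url.rstrip("/") + "/?" + urllib.parse.urlencode({"query": query})
    # JSONEachRow is one JSON object per line.
    body = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


req = build_insert_request(
    "http://localhost:8123",  # placeholder endpoint
    "test_runs",
    [{"run_id": "1-1", "status": "passed"}],
)
print(req.full_url)
```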

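The ClickHouse schema uses a ReplacingMergeTree keyed on (test_id, run_id), so re-ingesting the same run collapses to one row per test instead of double-counting. Deduplication happens during background merges and only within a partition, so queries that need exact counts should read with FINAL: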

CREATE TABLE test_runs (
    run_id       String,
    test_id      FixedString(16),
    classname    LowCardinality(String),
    name         String,
    status       LowCardinality(String),
    duration_ms  UInt32,
    retries      UInt8,
    has_flaky_el UInt8,
    suite_ts     Nullable(DateTime),
    ingested_at  DateTime DEFAULT now()
) ENGINE = ReplacingMergeTree(ingested_at)
  PARTITION BY toYYYYMM(ingested_at)
  ORDER BY (test_id, run_id);

With 30 days of data loaded, the Grafana query for your flakiness leaderboard becomes straightforward. A test counts as flaky here only if it both passed and failed within the same rolling window, which is a stricter signal than raw failure rate:

-- Grafana / ClickHouse data source
SELECT
    classname,
    name,
    countIf(status = 'passed')                              AS passes,
    countIf(status = 'failed')                              AS failures,
    sum(retries)                                            AS total_retries,
    round(countIf(status='failed') / count() * 100, 1)     AS fail_pct,
    round(quantile(0.95)(duration_ms) / 1000.0, 2)         AS p95_sec
FROM test_runs FINAL   -- collapse duplicates that merges have not removed yet
WHERE ingested_at >= now() - INTERVAL 30 DAY
GROUP BY classname, name
HAVING fail_pct BETWEEN 5 AND 95   -- pure failures aren't flaky
ORDER BY total_retries DESC
LIMIT 20;

The HAVING fail_pct BETWEEN 5 AND 95 predicate is deliberate — tests that fail 100% of the time are broken, not flaky; tests that fail 0.1% of the time are noise. The 5–95 band is where the actionable flakiness lives. One team running ~4,000 tests per day saw triage time drop from 22 minutes per failure to under 4 minutes once engineers could click directly from a Slack alert (routed via Grafana OnCall) to a pre-filtered panel showing that test's 30-day outcome history.
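
To sanity-check the leaderboard logic without a database, the same aggregation can be sketched over the parser's record dicts. flaky_leaderboard and the synthetic sample below are assumptions for illustration; the p95 duration column is omitted for brevity.

```python
from collections import defaultdict


def flaky_leaderboard(records, lo=5.0, hi=95.0, limit=20):
    """Rank tests by total retries, keeping only those with lo <= fail_pct <= hi."""
    stats = defaultdict(lambda: {"passes": 0, "failures": 0, "runs": 0, "total_retries": 0})
    for r in records:
        s = stats[(r["classname"], r["name"])]
        s["runs"] += 1
        s["total_retries"] += r.get("retries", 0)
        if r["status"] == "passed":
            s["passes"] += 1
        elif r["status"] == "failed":
            s["failures"] += 1
    rows = []
    for (classname, name), s in stats.items():
        # Same definition as the SQL: failures over all runs, rounded to 0.1.
        fail_pct = round(s["failures"] / s["runs"] * 100, 1)
        if lo <= fail_pct <= hi:
            rows.append({
                "classname": classname,
                "name": name,
                "passes": s["passes"],
                "failures": s["failures"],
                "total_retries": s["total_retries"],
                "fail_pct": fail_pct,
            })
    rows.sort(key=lambda row: row["total_retries"], reverse=True)
    return rows[:limit]


def _runs(name, passed, failed, retries_per_failure=0):
    """Synthetic history for one test (illustrative data only)."""
    ok = [{"classname": "LoginTest", "name": name, "status": "passed", "retries": 0}] * passed
    bad = [{"classname": "LoginTest", "name": name, "status": "failed",
            "retries": retries_per_failure}] * failed
    return ok + bad


sample = (
    _runs("test_flaky", 7, 3, 1)
    + _runs("test_broken", 0, 10)
    + _runs("test_stable", 10, 0)
    + _runs("test_very_flaky", 5, 5, 2)
)
board = flaky_leaderboard(sample)
for row in board:
    print(row["name"], row["fail_pct"], row["total_retries"])
```

The always-failing and always-passing tests fall outside the band, so only the two mixed-outcome tests remain, ordered by retries.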

Common Pitfalls

Conflating retry-pass with clean-pass. When pytest-rerunfailures or Surefire retries a test and it eventually passes, the final JUnit XML may record only the passing <testcase>, with no evidence of the prior failure, unless per-attempt reporting is explicitly enabled or the runner emits the <flakyFailure> extension. Engineers then wonder why the dashboard shows 0 retries on a suite that visibly retried three tests. Always validate your framework's retry-reporting behavior against raw XML before trusting aggregated counts. Support for <rerunFailure> differs across runners and plugin versions, and versions that lack it silently drop the retry history.
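
One way to keep the two kinds of pass apart is to classify each <testcase> by its child elements. The sketch below assumes Surefire-style element names (flakyFailure, flakyError, rerunFailure, rerunError); classify and its status labels are hypothetical, and the check only works if your runner actually writes these elements.

```python
import xml.etree.ElementTree as ET

RETRY_TAGS = {"flakyFailure", "flakyError", "rerunFailure", "rerunError"}


def classify(tc):
    """Return 'failed', 'skipped', 'flaky_pass', or 'clean_pass' for a <testcase>."""
    tags = {child.tag for child in tc}
    if "failure" in tags or "error" in tags:
        return "failed"
    if "skipped" in tags:
        return "skipped"
    if tags & RETRY_TAGS:
        return "flaky_pass"  # passed in the end, but only after a retry
    return "clean_pass"


clean = ET.fromstring('<testcase classname="A" name="t1" time="0.1"/>')
retried = ET.fromstring(
    '<testcase classname="A" name="t2" time="0.3"><flakyFailure message="x"/></testcase>'
)
print(classify(clean), classify(retried))
```

Storing flaky_pass as its own status keeps retried passes from inflating the clean pass count.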

Using wall-clock suite timestamps instead of per-test timestamps. JUnit XML's <testsuite timestamp> is a single value for the whole suite. In parallel test execution (pytest-xdist, Gradle parallel, Argo Workflows fan-out), tests from different workers can land in the same XML with the same timestamp, so you cannot tell when each test actually ran or which tests overlapped, and duration-based anomaly detection loses most of its value. One fix is to enrich each record with the CI job's start time plus the cumulative time offset, an approximation that only holds within a single serial worker. A better one is to emit OpenTelemetry spans from your test runner and use Honeycomb or Grafana Tempo for duration analysis, reserving JUnit XML strictly for pass/fail/retry status.
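
The cumulative-offset enrichment can be sketched as follows. estimate_start_times is a hypothetical helper, and it assumes the tests ran one after another in document order on a single worker, which is exactly the assumption that parallel execution breaks.

```python
from datetime import datetime, timedelta


def estimate_start_times(suite_ts, durations_ms):
    """Estimate each test's start as suite start plus the durations before it."""
    start = datetime.fromisoformat(suite_ts)
    offset = timedelta(0)
    starts = []
    for duration in durations_ms:
        starts.append(start + offset)
        offset += timedelta(milliseconds=duration)
    return starts


starts = estimate_start_times("2024-05-01T10:00:00", [120, 450, 1300])
for s in starts:
    print(s.isoformat())
```

Treat the result as an ordering hint rather than a measurement; span-based tracing remains the more reliable source for timing.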

What Most Teams Get Wrong

Treating failure rate as the flakiness signal. A test that fails 80% of the time isn't flaky; it's broken. Flakiness is non-determinism: the same code, same commit, different outcomes. The metric that captures this is outcome entropy across runs at the same SHA, not aggregate failure rate across all SHAs. Teams that alert on "failure rate > 10%" end up paging on genuinely broken tests while the quietly flaky ones (30% fail, 70% pass, always retried, never fixed) accumulate and erode suite credibility over months. The HAVING fail_pct BETWEEN 5 AND 95 filter above is a start; pairing it with a GROUP BY test_id, commit_sha gets you closer to true non-determinism detection. That grouping requires a commit_sha column, which the schema above does not yet include.
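
As a rough illustration of outcome entropy, the sketch below computes the Shannon entropy of pass/fail outcomes per (test_id, commit_sha). The commit_sha field is an assumption (the parser above does not populate it), and outcome_entropy is a hypothetical helper. A value of 0 means fully deterministic at that commit; 1 bit means a 50/50 split.

```python
import math
from collections import Counter, defaultdict


def outcome_entropy(records):
    """Shannon entropy (in bits) of passed/failed outcomes per (test_id, commit_sha)."""
    by_key = defaultdict(Counter)
    for r in records:
        if r["status"] in ("passed", "failed"):  # skipped runs carry no signal here
            by_key[(r["test_id"], r["commit_sha"])][r["status"]] += 1
    result = {}
    for key, counts in by_key.items():
        total = sum(counts.values())
        result[key] = sum(
            (n / total) * math.log2(total / n) for n in counts.values()
        )
    return result


history = (
    [{"test_id": "t1", "commit_sha": "abc", "status": "passed"}] * 2
    + [{"test_id": "t1", "commit_sha": "abc", "status": "failed"}] * 2
    + [{"test_id": "t2", "commit_sha": "abc", "status": "failed"}] * 4
)
entropy = outcome_entropy(history)
for key, bits in entropy.items():
    print(key, round(bits, 3))
```

Here t2 fails every time at the same commit, so its entropy is 0 (broken, not flaky), while t1 splits evenly and scores the maximum of 1 bit.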

Assuming Allure or ReportPortal replaces a time-series store. Use Allure when you need rich per-run artifact browsing and stakeholder-facing HTML reports — it's excellent for that. Use ReportPortal when you need a hosted, multi-project test management layer with built-in defect triage workflows and LDAP auth. Neither tool is designed for ad-hoc SQL over 90 days of outcome history, and neither exposes the raw per-test time-series in a way that feeds a Grafana SLO panel or a PagerDuty flakiness-budget alert. They're complementary to a data store, not a replacement for one.

JUnit XML is older than most of the CI systems that consume it, but it's still the most portable test-result format in the ecosystem. The extraction pattern here — parse to tuples, store in ClickHouse or BigQuery, query for outcome entropy — scales from a 200-test suite to 200,000 without architectural changes. A practical next step: run the parser against your last 30 days of archived XML artifacts and sort by total_retries DESC. The top five rows will tell you more about your suite's health than any coverage report.
