iTestResults

How to Track Quality Over Time (Without Vanity Metrics)

Most teams treat test results like a checkbox: green is good, red is bad, ship or block. The interesting signal lives in everything that happens between those two states: runtime variance, retry counts, the same five tests showing up in every postmortem. That signal is where engineering decisions actually get made. This article covers how to track software quality over time without falling back on vanity metrics, and how to pull actionable insights out of test results beyond pass/fail. That matters more as systems grow more complex and teams scale.

What This Actually Is

Tracking quality over time means observing trends in test results, identifying patterns, and using them to inform engineering decisions. Counting passes and failures is not enough; the point is to understand why tests fail and why their runtime or stability fluctuates. In a modern test architecture this closes the loop for continuous improvement: insights from test results drive changes in code, tests, and processes.

In a typical CI/CD pipeline, tools like Jenkins, GitHub Actions, or CircleCI execute test suites and record results. These results can be stored in databases like PostgreSQL or ClickHouse and visualized in Grafana dashboards. The goal is to move from reactive firefighting to proactive quality management, using data to guide decisions.

By tracking quality over time, teams can anticipate issues before they become critical, allocate resources more effectively, and refine their development processes. This approach leads to more stable releases and a deeper understanding of the software's behavior in real-world conditions.

How To Implement It

To start tracking quality over time without relying on vanity metrics, you need a robust setup to collect, store, and analyze test results. Begin by setting up a pipeline that captures detailed test execution data. Use a tool like Allure or ReportPortal to aggregate results from your test runs. For example, in Jenkins, you can configure a post-build action to publish test results to Allure:

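pipeline {
	agent any
	stages {
		stage('Test') {
			steps {
				// --alluredir needs the allure-pytest plugin; it writes raw results to allure-results/
				sh 'pytest --junitxml=results.xml --alluredir=allure-results'
			}
		}
	}
	post {
		always {
			// Publish even when tests fail; a failing sh step would otherwise skip reporting
			junit 'results.xml'
			allure results: [[path: 'allure-results']]
		}
	}
}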

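Next, store your test results in a database like PostgreSQL for historical analysis. A simple SQL query can then surface insights, such as the tests with the highest failure rates (this assumes a test_results table with one row per test execution):

SELECT test_name,
       COUNT(*) FILTER (WHERE status = 'failed') AS failure_count,
       COUNT(*) AS run_count,
       -- cast to numeric so the division is not truncated to an integer
       ROUND(COUNT(*) FILTER (WHERE status = 'failed')::numeric / COUNT(*), 3) AS failure_rate
FROM test_results
GROUP BY test_name
ORDER BY failure_rate DESC, failure_count DESC
LIMIT 10;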
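
The query above presumes that rows already exist in a test_results table. As a rough sketch of how they might get there, the following parses a JUnit XML report (the format pytest writes with --junitxml) and inserts one row per test case. The column layout, the commit_sha and run_at fields, and the helper names are assumptions made for illustration, and sqlite3 stands in for PostgreSQL only so the example runs without a server.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema; the article does not specify one.
SCHEMA = """
CREATE TABLE IF NOT EXISTS test_results (
    test_name  TEXT NOT NULL,
    status     TEXT NOT NULL,   -- 'passed', 'failed', or 'skipped'
    duration_s REAL NOT NULL,
    commit_sha TEXT NOT NULL,
    run_at     TEXT NOT NULL    -- ISO 8601 timestamp
)
"""


def parse_junit(xml_text):
    """Yield (test_name, status, duration_s) for each <testcase> element."""
    root = ET.fromstring(xml_text)
    for case in root.iter("testcase"):
        name = f"{case.get('classname', '')}::{case.get('name', '')}"
        if case.find("failure") is not None or case.find("error") is not None:
            status = "failed"
        elif case.find("skipped") is not None:
            status = "skipped"
        else:
            status = "passed"
        yield name, status, float(case.get("time", 0.0))


def store_run(conn, xml_text, commit_sha, run_at):
    """Insert one row per test case and return the number of rows written."""
    rows = [(n, s, d, commit_sha, run_at) for n, s, d in parse_junit(xml_text)]
    conn.executemany("INSERT INTO test_results VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up report used only to exercise the sketch.
SAMPLE = """<testsuites><testsuite name="pytest" tests="3">
  <testcase classname="tests.test_cart" name="test_add" time="0.012"/>
  <testcase classname="tests.test_cart" name="test_checkout" time="1.5">
    <failure message="assert 1 == 2">traceback</failure>
  </testcase>
  <testcase classname="tests.test_cart" name="test_legacy" time="0.0">
    <skipped message="not supported"/>
  </testcase>
</testsuite></testsuites>"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
written = store_run(conn, SAMPLE, "abc123", "2024-01-01T00:00:00Z")
```

Recording the commit and timestamp alongside each result is what makes later questions about trends and flakiness answerable; a bare pass/fail row cannot be placed on a timeline.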

Visualize trends using Grafana by connecting it to your database. Create panels to monitor metrics like test pass rate, average runtime, and failure trends over time. This dashboard provides a real-time overview of your test suite's health and historical performance.
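
Before building panels, it can help to pin down exactly what each metric means. The sketch below computes a per-day pass rate and mean runtime from raw result rows. The row shape (run_date, status, duration_s) and the choice to exclude skipped tests from the pass rate are assumptions for illustration, not something a particular dashboard tool prescribes.

```python
from collections import defaultdict
from statistics import mean


def daily_trend(rows):
    """Aggregate (run_date, status, duration_s) rows into per-day metrics.

    Returns a dict keyed by date, in ascending date order.
    """
    by_day = defaultdict(list)
    for run_date, status, duration_s in rows:
        if status == "skipped":
            continue  # skipped tests carry no pass/fail information
        by_day[run_date].append((status, duration_s))

    trend = {}
    for day in sorted(by_day):
        results = by_day[day]
        passed = sum(1 for status, _ in results if status == "passed")
        trend[day] = {
            "runs": len(results),
            "pass_rate": passed / len(results),
            "mean_duration_s": mean(d for _, d in results),
        }
    return trend


# Made-up rows used only to exercise the sketch.
rows = [
    ("2024-01-01", "passed", 1.0),
    ("2024-01-01", "failed", 3.0),
    ("2024-01-02", "passed", 2.0),
    ("2024-01-02", "passed", 4.0),
    ("2024-01-02", "skipped", 0.0),
]
trend = daily_trend(rows)
```

In practice the same aggregation would more likely live in a SQL query behind the panel; writing it out once makes the definition explicit and easy to review.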

A setup like this can cut triage time substantially. In our case, after integrating test results with Grafana and Loki (Grafana's log aggregation system), the time spent diagnosing failures dropped from 22 minutes to under 4 minutes per failure.

Finally, consider adding observability to your tests with OpenTelemetry. Instrumenting test code to emit traces and spans gives deeper insight into execution flow and helps pinpoint bottlenecks and flaky tests.

Common Pitfalls

One common mistake is over-relying on pass/fail metrics without considering test stability or reliability. This approach misses the nuance of flaky tests, which can skew perceptions of quality. Avoid this by tracking test flakiness and prioritizing stabilization efforts.
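
One workable definition of flakiness, offered here as an assumption rather than a standard, is a test that both passes and fails on the same commit, since the code under test did not change between those runs. The sketch below computes a flake rate per test on that basis; the row shape and function name are hypothetical.

```python
from collections import defaultdict


def flaky_tests(rows):
    """Return {test_name: flake_rate} for tests with mixed outcomes on a commit.

    rows is an iterable of (test_name, commit_sha, status). The flake rate is
    the share of commits on which the test both passed and failed.
    """
    outcomes = defaultdict(lambda: defaultdict(set))
    for test_name, commit_sha, status in rows:
        if status in ("passed", "failed"):
            outcomes[test_name][commit_sha].add(status)

    rates = {}
    for test_name, by_commit in outcomes.items():
        mixed = sum(1 for seen in by_commit.values() if len(seen) == 2)
        if mixed:
            rates[test_name] = mixed / len(by_commit)
    return rates


# Made-up rows used only to exercise the sketch.
rows = [
    ("test_login", "a1", "passed"),
    ("test_login", "a1", "failed"),   # same commit, different outcome
    ("test_login", "b2", "passed"),
    ("test_search", "a1", "failed"),  # consistently failing, not flaky
    ("test_search", "b2", "failed"),
    ("test_cart", "a1", "passed"),
]
flaky = flaky_tests(rows)
```

A consistently failing test is deliberately excluded: it points at a real defect or a broken test, which calls for a different response than stabilization work.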

Another pitfall is failing to account for environmental factors that impact test results, such as server load or network latency. Engineers often overlook these, leading to misdiagnosed issues. Mitigate this by correlating test data with infrastructure metrics using tools like Datadog or Prometheus.
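
As a small illustration of that correlation step, the sketch below joins per-run test durations with a per-run infrastructure metric (here CPU load) and computes a Pearson coefficient. The run ids, the dict shapes, and the assumption that the metric has already been exported from the monitoring system are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def duration_vs_load(durations_by_run, load_by_run):
    """Correlate load with duration over the runs present in both mappings."""
    shared = sorted(set(durations_by_run) & set(load_by_run))
    return pearson(
        [load_by_run[r] for r in shared],
        [durations_by_run[r] for r in shared],
    )


# Made-up data: run-5 has no load sample and is dropped by the join.
durations_by_run = {"run-1": 10.0, "run-2": 12.0, "run-3": 14.0, "run-4": 16.0, "run-5": 11.0}
load_by_run = {"run-1": 0.2, "run-2": 0.4, "run-3": 0.6, "run-4": 0.8}
r = duration_vs_load(durations_by_run, load_by_run)
```

A coefficient near 1 suggests slow runs track host load rather than a code change, though correlation on a handful of runs is only a hint about where to look next.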

Finally, teams sometimes collect too much data without a clear analysis strategy, leading to information overload. Focus on key metrics that directly impact quality and use automation to highlight significant changes. This ensures data is actionable and not just noise.
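
One simple way to automate "highlight significant changes" is to compare the latest value of each metric against its own recent history and flag only large deviations. The sketch below uses a z-score for this; the threshold of 3, the minimum sample count, and the metric names are illustrative assumptions, and other change-detection methods may suit noisy data better.

```python
from statistics import mean, stdev


def significant_changes(history, latest, threshold=3.0, min_samples=5):
    """Return {metric: z_score} for metrics whose latest value is an outlier.

    history maps metric name to a list of past values; latest maps metric
    name to the newest value.
    """
    flagged = {}
    for name, value in latest.items():
        past = history.get(name, [])
        if len(past) < min_samples:
            continue  # not enough baseline to judge
        sigma = stdev(past)
        if sigma == 0:
            continue  # z-score is undefined for a constant baseline
        z = (value - mean(past)) / sigma
        if abs(z) >= threshold:
            flagged[name] = round(z, 2)
    return flagged


# Made-up metrics used only to exercise the sketch.
history = {
    "suite_runtime_s": [100, 102, 98, 101, 99],
    "pass_rate": [0.98, 0.97, 0.99, 0.98, 0.98],
}
latest = {"suite_runtime_s": 130, "pass_rate": 0.98}
flagged = significant_changes(history, latest)
```

Here only the runtime jump is reported; the unchanged pass rate stays quiet, which is the point of filtering before anything reaches a person.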

What Most Teams Get Wrong

Many teams mistakenly believe that a high test coverage percentage equates to high quality. Coverage only shows which code ran during the tests, not whether the tests assert anything meaningful about it, so it says little about the depth or relevance of test cases. Instead, focus on how effective tests are at catching defects and how well they align with critical user paths.

Another misconception is that flakiness is an inherent, unfixable part of testing. Flakiness often results from issues like timing dependencies or unreliable third-party services, which can be addressed through test isolation and better mocking strategies.
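
To make the two fixes concrete, here is a minimal sketch using Python's standard unittest.mock: the clock is injected so a timing-dependent check needs no real waiting, and a third-party rates service is replaced by a mock so the test needs no network. The functions is_expired and price_in_eur, and the get_rate method on the client, are hypothetical examples invented for this illustration.

```python
import time
from unittest.mock import Mock


def is_expired(started_at, ttl_s, clock=time.time):
    """Return True once ttl_s seconds have elapsed; the clock is injectable."""
    return clock() - started_at >= ttl_s


def price_in_eur(amount_usd, rates_client):
    """Convert using a rate fetched from a (normally remote) rates service."""
    return round(amount_usd * rates_client.get_rate("USD", "EUR"), 2)


def test_expiry_without_sleeping():
    # A fake clock removes the dependency on wall time and on sleep().
    fake_clock = Mock(return_value=1_000.0)
    assert not is_expired(started_at=990.0, ttl_s=30, clock=fake_clock)
    fake_clock.return_value = 1_020.0
    assert is_expired(started_at=990.0, ttl_s=30, clock=fake_clock)


def test_conversion_without_network():
    # The mock stands in for the third-party service and returns a fixed rate.
    rates_client = Mock()
    rates_client.get_rate.return_value = 0.5
    assert price_in_eur(10.0, rates_client) == 5.0
    rates_client.get_rate.assert_called_once_with("USD", "EUR")
```

Both tests are deterministic: neither outcome depends on how fast the machine is or whether an external service is reachable.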

Finally, the belief that dashboards alone provide all necessary insights is misleading. While dashboards are valuable for visualization, they must be paired with deep-dive analyses to understand root causes and inform strategic decisions. Dashboards are tools, not solutions.

Tracking quality over time without resorting to vanity metrics requires a nuanced approach that goes beyond simple pass/fail counts. By focusing on trends, patterns, and actionable insights, teams can drive meaningful improvements in software quality. As a next step, consider measuring the mean-time-to-first-signal on production incidents to further enhance your approach to quality tracking.
