Beyond the Average: Quantile Regression and Stepwise Policies

01/12/2025
Camilo Arias (Universidad de la Sabana) y John Bravo (Universidad de la Sabana)
Machine Learning

Looking at the Distribution, Not Just the Average

Suppose a government implements a new health policy aimed at reducing avoidable hospitalizations. A traditional evaluation might tell us that, on average, hospitalizations fall by 10%. But that single figure hides key questions: Who benefits the most—high-risk patients or medium-risk patients? Does the policy reduce extreme events (colas of the distribution) or only improve moderate situations? Is the impact similar across different regions or adoption cohorts?

Quantile Regression (QR) addresses precisely these kinds of questions. Instead of focusing on the conditional mean of an outcome Y given X, it allows us to study how a given quantile (for example, the 25th, 50th, or 75th percentile) changes when a policy is introduced. Schematically, quantile regression estimates, for a given percentile, a linear predictor that approximates the conditional quantile. Instead of penalizing errors quadratically (as OLS does), QR uses a loss function that penalizes positive and negative residuals differently, so that the percentile is properly centered.

Figure 1: Quantile regression: looking at how each quantile of the distribution shifts, not just the average.

In the evaluation of policies implemented in a staggered manner over time (where implementation is not simultaneous for all beneficiaries), it is natural to combine QR with difference-in-differences (DiD) designs. The question is: can we simply use a quantile regression with two-way fixed effects (TWFE-QR) and feel confident about the results?

Our work shows that the answer is no, and that the problem of forbidden comparisons, already known in the context of means, carries over—subtly—to quantiles.

The Scenario: Policies with Staggered Adoption

Imagine a panel with units i (municipalities, hospitals, firms) and periods t (years, quarters, months). Each unit adopts a policy at a different point in time: some early, others late, and some never. The TWFE-QR estimator seeks the value that best fits the quantiles of the outcome variable’s distribution once fixed effects have been absorbed. At first glance, this is a natural extension of classical TWFE to the world of quantiles. However, when adoption is staggered and policy effects differ across cohorts, the TWFE-QR estimator ends up mixing treated-versus-treated comparisons across different periods, something that goes against the logic of the DiD design.

What TWFE-QR Is Actually Doing

Behind the scenes, quantile regression with fixed effects solves a convex optimization problem. The Karush–Kuhn–Tucker (KKT) conditions impose that, at the optimum, certain residual balances must equal zero when weighted by the covariates (including the treatment variable). Without going into the technical notation, the key message is that for each cohort g (defined by the timing of adoption) and period l, we can think of a score associated with the cell (g,l): it measures how many observations lie above or below the target quantile. The TWFE-QR estimator chooses a single value for the treatment effect estimator such that the sum of all these scores, aggregated across all treated cells, balances out to zero. If every cohort had the same policy effect, this balance would be coherent: all cells would “push” in the same direction. But when effects differ across cohorts (heterogeneity), some cells push upward, others downward, and the final result becomes a kind of weighted average of cohort effects.

In our work, we show that under fairly general conditions, TWFE-QR can be interpreted as a weighted average of the “clean” effects of each cohort g (if it were compared only against appropriate controls), with nonnegative weights that depend on cohort size, the number of treated periods, and how informative the distribution is around the quantile of interest. The problem is that these weights do not explicitly follow a policy logic, but instead emerge from the KKT conditions. The weights depend on the structure of the panel and on the density of the distribution around the quantile of interest, not on an explicit design criterion. From a public policy perspective, this complicates interpretation: what effect are we actually reporting? Does it correspond to a treated-versus-never-treated contrast, or does it mix treated-versus-treated comparisons across different stages of adoption?

Our Proposal

To address this problem, we propose an alternative estimator based on a simple idea: we construct clean cohort-specific contrasts and then aggregate them using weights with a clear interpretation. The procedure is as follows:

Define cohorts and clean controls. For each cohort k (e.g., municipalities adopting in 2022), we consider as controls only those units that will be treated in the future or will never be treated. In this way, we avoid using already-treated units as controls.
Construct a cohort-specific 2×2 design. Within each cohort, we compare treated versus control quantiles before and after adoption, using quantile regression with fixed effects adapted to the relevant subset. This produces a cohort-specific effect.
Aggregation with Transparent Weights. Finally, we define an aggregated estimator in which the weights are proportional to the number of units in the cohort and the number of periods during which the cohort contributes as treated. These are nonnegative weights that sum to one, and their meaning is clear: they measure how much “data mass” (unit-time observations) each cohort contributes.

The key is that each 2×2 comparison effect is constructed while avoiding treated-versus-treated comparisons at inappropriate times, and that the final aggregation respects the design logic: more weight is assigned to cohorts with a larger number of treated observations, but without allowing treated units to serve as their own controls at another stage.

What We Found in Simulations

To illustrate the behavior of both estimators, we consider a simulated data experiment in which: the untreated outcome combines a unit fixed effect, a time fixed effect, and a random shock; and the policy effect is heterogeneous across cohorts—some cohorts have a larger impact in the upper tail, others around the median, and so on. In this setting, we compare: the aggregated TWFE-QR estimator, which estimates a single effect for the entire panel, and the cohort-based estimator with transparent aggregation. The results show that: TWFE-QR exhibits substantial bias at several quantiles (P25, P50, P75) when effects are heterogeneous. In some scenarios, the sign of the estimated effect does not match the true effect. Our estimator systematically reduces bias across the distribution: the percentage bias falls substantially between P25 and P75, and the estimator preserves the correct direction of the effect at all quantiles.

Figure 2: TWFE: target and bias across percentiles. In simulations with four cohorts (early, middle, late, and never treated), the TWFE-QR estimator exhibits substantial bias and, at some quantiles, can even reverse sign relative to the true effect (orange). Our proposal (green) consistently reduces bias at P25, P50, and P75.

Beyond the numerical metrics, the qualitative interpretation also improves: the analyst can decompose the effect into cohort-specific components, identify which groups benefit the most at each quantile, and communicate results in a way that aligns with the actual structure of the policy (who was treated and when).

Implications

At this stage of our research, the main findings have been the following: if the policy is adopted at different times and there is reason to suspect heterogeneous effects, using TWFE-QR mechanically leads to misleading interpretations: the estimator mixes early and late treatments as if they were controls for one another. At a minimum, a design based on clean cohort-specific comparisons is necessary to preserve the original DiD logic: comparing treated units only with “never treated” or “not yet treated” units, and then constructing an average that reflects that structure. The transparency of the weights is crucial: knowing exactly how much each cohort contributes to the aggregated effect facilitates communication with decision-makers and the design of targeted policies.

Cosmology to the Extreme: Artificial Intelligence for Mapping the Universe on a Large Scale

What if the laws of physics as we know them were wrong? Not in some minor detail, but in something fundamental. That is one of the two possible conclusions that emerge from the most recent data on the large-scale universe …

Read article

Safety

Adversarial Robustness: How difficult is it to break a language model?

Large language models have become everyday tools: they assist in writing texts, support medical diagnoses, generate code, and answer complex questions in seconds …

Read article

Technology

When Mistakes Don’t Matter: Rethinking How We Train Decision-Making Models

The standard way to evaluate predictive models is dominated by a simple idea: if prediction error decreases, the model is better. Metrics such as MSE or accuracy have become the standard in most industrial pipelines …

Read article

Technology