PISA’s Plausible Values and How to Work with Them
- Yulia Kuzmina
- Feb 6
- 4 min read
Updated: Feb 9

Today, I want to share some thoughts on the specifics of analyzing data from PISA and other large-scale educational studies like TIMSS and PIAAC. Just as a quick reminder, PISA (Programme for International Student Assessment) has been conducted every three years since 2000 (with the 2021 cycle postponed to 2022). It evaluates students’ literacy in mathematics, reading, and science, plus, in some cycles, additional domains such as financial literacy (first assessed in 2012).
Unlike traditional assessments that produce a single test score, PISA does not report point estimates of student performance. Instead, it measures performance using plausible values (PVs), which are generated through Item Response Theory (IRT) models. These models estimate a student’s proficiency from their responses to a subset of test questions together with background questionnaire data, and each PV is a random draw from that student’s estimated proficiency distribution rather than a single best guess. Because students answer different sets of questions, PVs allow researchers to recover population-level proficiency more accurately than any single score could. Each student is assigned multiple plausible values: five in earlier cycles and ten since PISA 2015. Importantly, statistical analyses should use the full set of PVs rather than averaging them at the individual level.
PISA’s technical reports emphasize that the study is designed to analyze student performance at the population level, identifying broad trends and relationships rather than producing precise individual scores.
For my grant project, I’ve been analyzing PISA 2018 and 2022 data. I have to admit that working with ten plausible values instead of a single score is far from convenient! Many researchers simplify their analyses by averaging the PVs and treating the result as a regular variable, for example as the dependent variable in a regression model. However, this approach is incorrect. PISA’s documentation warns that averaging PVs leads to biased estimates of population parameters, especially variances, standard errors, and other statistics that depend on the shape of the proficiency distribution.
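In Stata, that shortcut takes only a couple of lines. The sketch below is for illustration only: the variable names (pv1math through pv10math for the mathematics PVs, w_fstuwt for the final student weight, cntschid for the school ID) follow PISA’s naming conventions but should be checked against your codebook, and escs is just a placeholder covariate, not the actual model from my project.

```stata
* The "averaged PV" shortcut that the documentation warns against:
* collapse the ten PVs into one score and model it as an ordinary outcome.
egen pv_math_avg = rowmean(pv*math)          // mean of pv1math ... pv10math
mixed pv_math_avg escs [pw = w_fstuwt] || cntschid:
```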
Instead, the correct approach is more involved: estimate the model separately for each plausible value, average these estimates to obtain the final result, and then combine the sampling variance with the between-PV (measurement) variance when calculating standard errors, essentially applying Rubin’s rules for multiply imputed data. This procedure ensures that the reported uncertainty reflects both sampling error and the measurement error carried by the PVs.
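For readers who want to see the mechanics, here is a minimal sketch of that loop in Stata. It makes the same assumptions about variable names as the snippet above, and it takes the model-based standard error as the sampling variance, whereas dedicated tools also use PISA’s 80 BRR replicate weights for that part, so treat it as an illustration of the combining logic rather than a drop-in replacement.

```stata
* Estimate the same model once per plausible value, then combine the
* results with Rubin's rules (illustrative sketch, simplified weighting).
local M = 10
matrix b = J(`M', 1, .)     // coefficient on escs, one per PV
matrix u = J(`M', 1, .)     // its squared standard error, one per PV

forvalues i = 1/`M' {
    quietly mixed pv`i'math escs [pw = w_fstuwt] || cntschid:
    matrix b[`i', 1] = _b[escs]
    matrix u[`i', 1] = _se[escs]^2
}

* Combined point estimate: the mean of the M per-PV estimates.
* Total variance: mean sampling variance + (1 + 1/M) * between-PV variance.
mata:
    b = st_matrix("b"); u = st_matrix("u"); M = rows(b)
    qbar = mean(b)
    T    = mean(u) + (1 + 1/M) * variance(b)
    printf("coef = %9.4f   se = %9.4f\n", qbar, sqrt(T))
end
```

The (1 + 1/M) times the between-PV variance term is exactly what the averaged-PV shortcut throws away, which is why the two approaches can disagree about standard errors even when the coefficients match.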
To make things easier, various programs and packages automate these steps. One example is IDB Analyzer, developed by the International Association for the Evaluation of Educational Achievement (IEA). This tool correctly handles plausible values and complex survey weights and is freely available for researchers working with international large-scale assessment data. However, IDB Analyzer does not support multilevel regression. If you need to run multilevel models, you can use HLM (Hierarchical Linear Models) software, but it is not free and has limited visualization options. I used it early in my career, but I now prefer other programs.
In Stata, the repest package allows for the correct estimation of multilevel regression models while accounting for plausible values. However, this approach is extremely slow. For example, I recently ran a simple two-level model (without random slopes or interactions) on PISA Serbia data (2018 and 2022). With repest, it took about an hour to obtain results, whereas the same model estimated on averaged PVs took less than a minute.
Is the "Correct" Approach Always Worth It?
So, is it worth using the "correct" approach? Below, I compare the regression coefficients from the same model estimated in two ways:
- The second column shows estimates from a multilevel regression on averaged PVs, using final student weights (Stata’s xtmixed command).
- The third column shows estimates from the "correct" approach, using repest together with xtmixed.
Table 1. Results of multilevel regression analysis (PISA 2018 and 2022 data, Serbia)
As you can see, the differences are tiny. The coefficient estimates are nearly identical, which makes sense: the "correct" method averages the estimates obtained from each PV, and averaging the PVs before the analysis yields almost the same point estimates. The standard errors do differ, however. In some cases they increase, but in most cases they decrease.
As a result of this decrease in standard errors, some previously insignificant coefficients became significant. For example:
- The difference between 2018 and 2022 became significant.
- The performance gap between students from large cities (>1M population) and medium-sized cities (100K–1M population) also became significant.
This raises the question: does this small difference justify the additional computational time, especially given that we are working with educational data, where Type II errors (failing to detect a true effect) are generally less costly than in medical or other high-risk research?
Whether this matters depends on the purpose of the analysis. If the goal is to describe broad population trends, the differences may be negligible. However, if you work within the frequentist paradigm, where NHST (Null Hypothesis Significance Testing) and statistical significance are central to the study, ignoring proper PV handling could lead to misleading conclusions.
In a Bayesian framework, plausible values can be handled differently—by modeling student ability as a latent variable rather than treating PVs as multiple imputations. This avoids the need to manually average results over multiple PVs but still requires correctly incorporating measurement uncertainty. While Bayesian methods are also computationally demanding, they align better with the theoretical foundations of PISA’s measurement model and provide richer inferences.
Ultimately, researchers have some flexibility in how they handle PVs, but the choice should reflect the research goals and the computational trade-offs. For simple descriptive analyses, averaging PVs might be acceptable. For frequentist hypothesis testing (NHST), proper PV handling keeps standard errors and significance tests honest. In Bayesian latent-variable models, measurement uncertainty is handled within the model itself. The correct approach is more rigorous, but its benefits need to be weighed against the practical costs, especially with large datasets and complex models. Personally, I find Bayesian multilevel models more conceptually satisfying, although I can’t say I’ve fully integrated them into my workflow just yet.
What do you think? Do you strictly follow the recommended approach, or do you find alternative methods more practical? Is it important in educational studies to obtain such accurate results?
Notes: The cover picture was generated by AI from the prompt "educational studies, plausible values, Bayesian approach" :)