Human biases in psychological data and sample reconstruction techniques
Séminaire Données et Aléatoire Théorie & Applications
7/10/2021 - 14:00 Jean-Charles Quinton (LJK - UGA) Room 106
Distributions in psychological science are often summarized by means and standard deviations, even when the underlying measures do not satisfy the psychometric properties these statistics presuppose. For instance, means may be reported for single Likert-style items (made of a small set of ordered responses) or for scales made of many such items whose score distributions are asymmetric and tightly constrained by the bounds of their support. How easily human biases (e.g., desirability or utility response biases) generate such distributions will be illustrated on the Big Five inventory, the most widespread personality scale used in psychopathological and recruitment procedures (revisiting simulations from Paunonen & Lebel, 2012). Because linear models are commonly used to test effects on such measures, test assumptions are easily violated, possibly leading to inflated type I error rates and, more broadly, to incorrect inferences drawn from data. Consequently, as part of the "replicability crisis" in empirical sciences, there has been a recent surge of interest in checking that summary statistics are correct (to prevent reporting errors or fraud) and that they represent adequate distributions. To this end, several sample reconstruction techniques have been developed to generate possible distributions given a reported mean, standard deviation, and other measure constraints. Nevertheless, current methods are either heuristic (e.g., iterative and based on conservation of sums in SPRITE; Heathers et al., 2018) or impose strong constraints (e.g., integer measures for the resolution of Diophantine systems in CORVIDS; Wilner, Wood, & Simons, 2018), yet they are used to detect errors in publications. They implicitly assume that the distributions of real-world study samples closely resemble random distributions generated through model-based simulations.
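To make the heuristic family of methods concrete, here is a minimal sketch in the spirit of sum-conserving iterative reconstruction. It is not the published SPRITE algorithm: the function name `sprite_like`, its parameters, the flat starting sample, and the random pair-shifting rule are all assumptions made for illustration. It also assumes integer responses on a bounded scale, the sample (n - 1) standard deviation, and Python's built-in rounding, which may differ from the rounding used in a given paper.

```python
import random
import statistics


def sprite_like(mean, sd, n, lo, hi, decimals=2, max_iter=100_000, seed=0):
    """Heuristically search for ONE integer sample in [lo, hi] whose rounded
    mean and SD match the reported values. Returns a sorted list or None.
    Hypothetical helper, for illustration only."""
    rng = random.Random(seed)
    target_mean = round(mean, decimals)
    target_sd = round(sd, decimals)

    # Integer totals compatible with the reported (rounded) mean.
    totals = [t for t in range(lo * n, hi * n + 1)
              if round(t / n, decimals) == target_mean]
    if not totals:
        return None  # no integer sample can produce this mean

    # Start from the flattest sample with a compatible total.
    base, extra = divmod(totals[0], n)
    sample = [base + 1] * extra + [base] * (n - extra)

    for _ in range(max_iter):
        current = statistics.stdev(sample)
        if round(current, decimals) == target_sd:
            return sorted(sample)
        i, j = rng.sample(range(n), 2)
        if sample[i] < sample[j]:
            i, j = j, i  # ensure sample[i] >= sample[j]
        if current < sd:
            # Push the pair apart: the sum is conserved, the spread grows.
            if sample[i] < hi and sample[j] > lo:
                sample[i] += 1
                sample[j] -= 1
        elif sample[i] - sample[j] >= 2:
            # Pull the pair together: the sum is conserved, the spread shrinks.
            sample[i] -= 1
            sample[j] += 1
    return None  # not found; this does NOT prove that no sample exists


print(sprite_like(3.00, 1.15, n=10, lo=1, hi=5))
```

A `None` returned after `max_iter` iterations only means the search failed, which is exactly the limitation of heuristic approaches raised in the abstract.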
The probability of generating a specific sample through deterministic or stochastic simulations may be very low (which does not prove that the sample cannot exist), and it may also be an inaccurate estimate of the probability of real-world samples, which are shaped by experimental manipulation and study constraints. A first contribution to this emerging field is the calculation of the exact number of samples matching a given rounded mean, rounded dispersion, and measure constraints. Combinatorics makes it possible to establish with certainty whether given statistics can be observed on a real sample satisfying the constraints, and thus to assess the exhaustiveness of the solutions provided by sample reconstruction techniques. The second contribution is the reconstruction of the exhaustive set of valid samples under relaxed constraints (e.g., on any finite set of arbitrary values), relying on graph-based methods and algorithmic optimization to explore the sample space. The resulting directed graph (where each path represents a unique valid sample) can be generated in a few seconds for problems of realistic dimensions (in terms of sample size and value set). Beyond circumventing methodological issues of existing approaches, these contributions (on response biases and sample reconstruction) more broadly aim to show that all mathematical and computational developments currently used to assess the validity of empirical results rely on a set of assumptions that must be satisfied for the inference to be correct.
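As an illustration of exact counting, the sketch below counts unordered samples by dynamic programming over (observations used, sum, sum of squares). This is one possible approach under stated assumptions, not necessarily the method presented in the talk: the function name `count_samples` is hypothetical, values are assumed to be integers, n is assumed to be at least 2, the dispersion is taken to be the sample (n - 1) standard deviation, and rounding follows Python's built-in `round`. The layers of states can be read as a directed acyclic graph in which each path (one count per value) corresponds to a unique sample.

```python
from collections import defaultdict
from math import sqrt


def count_samples(mean, sd, n, values, decimals=2):
    """Count unordered samples of size n (n >= 2) drawn from `values` whose
    rounded mean and rounded SD equal the reported ones.
    Hypothetical helper, for illustration only."""
    # states[(k, s, q)] = number of ways to pick k observations so far
    # with sum s and sum of squares q, processing one value per layer.
    states = {(0, 0, 0): 1}
    for v in sorted(set(values)):
        nxt = defaultdict(int)
        for (k, s, q), ways in states.items():
            for c in range(n - k + 1):  # take c copies of value v
                nxt[(k + c, s + c * v, q + c * v * v)] += ways
        states = nxt

    target_mean = round(mean, decimals)
    target_sd = round(sd, decimals)
    total = 0
    for (k, s, q), ways in states.items():
        if k != n:
            continue  # keep only complete samples
        var = max((q - s * s / n) / (n - 1), 0.0)  # guard against -0.0 noise
        if (round(s / n, decimals) == target_mean
                and round(sqrt(var), decimals) == target_sd):
            total += ways
    return total


print(count_samples(2.00, 1.00, n=3, values=[1, 2, 3]))  # 1: only {1, 2, 3}
print(count_samples(2.00, 0.50, n=3, values=[1, 2, 3]))  # 0: impossible statistics
```

A count of zero establishes that no sample on the given value set can produce the reported statistics, whereas a failed heuristic search cannot; a positive count gives the size of the set that an exhaustive reconstruction should return.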