🤖 AI Summary
Researchers warn that using large language models to create "silicon samples" (synthetic datasets meant to stand in for human survey respondents) involves substantial analytic flexibility that can undermine research validity. The paper maps the many choices researchers must make, including prompt design, model selection, sampling and decoding parameters, and how generated text is mapped back onto survey scales, and it evaluates 252 distinct analytic configurations. The results show that small changes can dramatically alter how well a synthetic sample reproduces human data on three core dimensions: the rank ordering of participants, response distributions, and between-scale correlations. Critically, no single configuration performed well on all three dimensions; configurations that matched the human data on one metric often failed on the others.
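To make the evaluation concrete, here is a minimal sketch of how a single configuration might be scored against human data on those three dimensions. The specific metric choices (mean Spearman correlation for rank ordering, Kolmogorov-Smirnov distance for response distributions, and mean absolute difference between correlation matrices) are illustrative assumptions, not necessarily the paper's exact measures.

```python
# Hypothetical sketch: scoring one "silicon sample" against human data.
# Both DataFrames are assumed to share columns (scale scores) and rows
# (matched respondents / personas); the metrics are illustrative choices.
import numpy as np
import pandas as pd
from scipy import stats

def evaluate_configuration(human: pd.DataFrame, silicon: pd.DataFrame) -> dict:
    scales = human.columns

    # 1) Rank ordering: do synthetic scores order the same participants
    #    the same way their human counterparts are ordered?
    rank_rho = np.mean([
        stats.spearmanr(human[s], silicon[s]).statistic for s in scales
    ])

    # 2) Response distributions: how far apart are the marginal
    #    distributions of each scale?
    ks_dist = np.mean([
        stats.ks_2samp(human[s], silicon[s]).statistic for s in scales
    ])

    # 3) Between-scale correlations: does the synthetic sample reproduce
    #    the correlation structure among scales?
    corr_gap = np.abs(human.corr().values - silicon.corr().values).mean()

    return {"rank_rho": rank_rho, "ks_dist": ks_dist, "corr_gap": corr_gap}
```

Running a scorer like this over every point in the configuration grid yields a per-configuration profile, which is what exposes the trade-offs described above: a configuration can score well on one dimension while scoring poorly on another.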
For the AI/ML and social-science communities this matters because silicon samples are positioned as a scalable, lower-cost alternative to human subjects, yet their reliability depends heavily on undocumented analytic choices. The authors call for heightened scrutiny: systematic benchmarking, transparent reporting of generation pipelines, pre-registration of analytic choices, and sensitivity analyses that quantify how robust conclusions are to configuration changes. Without such standards, synthetic-to-human comparisons risk reproducing biases and yielding fragile findings that are artifacts of modeling choices rather than genuine social patterns.
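One way to carry out the suggested sensitivity analysis is a simple specification-curve loop: re-estimate a single substantive result under every point in the configuration grid and report how much it moves. The grid, the `generate_silicon_sample` helper, and the target estimate below are hypothetical placeholders, not the paper's actual pipeline or its 252 configurations.

```python
# Hypothetical sensitivity-analysis sketch: sweep a configuration grid and
# summarize how one substantive estimate (a scale-to-scale correlation)
# varies across configurations. The grid and generation step are stand-ins.
from itertools import product
import numpy as np

grid = {
    "model": ["model_a", "model_b"],
    "temperature": [0.2, 0.7, 1.0],
    "prompt_style": ["persona", "interview"],
    "scale_mapping": ["direct_numeric", "text_then_code"],
}

def generate_silicon_sample(**config):
    """Placeholder for an LLM generation pipeline returning two scale scores."""
    rng = np.random.default_rng(abs(hash(frozenset(config.items()))) % 2**32)
    x = rng.normal(size=200)
    y = 0.3 * x + rng.normal(size=200)  # stand-in data, not real model output
    return x, y

estimates = []
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    x, y = generate_silicon_sample(**config)
    estimates.append(np.corrcoef(x, y)[0, 1])

print(f"{len(estimates)} configurations: "
      f"median r = {np.median(estimates):.2f}, "
      f"range = [{min(estimates):.2f}, {max(estimates):.2f}]")
```

A conclusion that survives the sweep with a narrow range is robust to the generation choices; a wide range signals that the finding may be an artifact of one particular configuration.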