Using LLMs to create datasets: reconstructing the historical memory of Colombia (arxiv.org)

🤖 AI Summary
Researchers used GPT to turn more than 200,000 Spanish-language newspaper articles about violence into a structured dataset for reconstructing Colombia’s fragmented historical memory. Instead of coding stories by hand, the team prompted the LLM to read each article and answer a set of predefined questions, yielding labels and extracted facts. These support descriptive analyses and an applied study of the relationship between violent incidents and coca crop eradication. The arXiv submission links code, data and demos that show the end-to-end workflow and support reproducibility. Technically, the project demonstrates that LLM-based question answering can scale annotation to massive, unstructured text corpora covering events that governments never systematically recorded, which opens new avenues for policy research and historical reconstruction. Key implications include faster creation of event datasets, richer contextual features for causal or spatial analyses, and the ability to surface narratives that official records omit. The authors also flag important caveats: label noise, model hallucination and bias mean that outputs need validation, and that prompt design and evaluation should be transparent. With those safeguards in place, LLM-enabled dataset construction could become a powerful tool for social-science and public-policy research in under-documented settings.
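To make the question-answering step concrete, here is a minimal Python sketch of how one article might be turned into one structured row. It is an illustration under stated assumptions, not the authors' pipeline: the question set, the field names, and the `ask_llm` callable (any function that sends a prompt to a model and returns its text reply) are all hypothetical, and `fake_llm` is a stand-in so the example runs without a model.

```python
import json
from typing import Callable, Dict

# Hypothetical question set; the paper's actual questions are not reproduced here.
QUESTIONS: Dict[str, str] = {
    "is_violent_event": "Does the article report a violent incident? Answer true or false.",
    "municipality": "In which municipality did it occur? Answer null if not stated.",
    "date": "On what date (YYYY-MM-DD) did it occur? Answer null if not stated.",
}


def build_prompt(article: str) -> str:
    """Combine the predefined questions and one article into a single prompt."""
    question_lines = [f'- "{key}": {text}' for key, text in QUESTIONS.items()]
    return (
        "Read the Spanish-language news article below and reply with one JSON "
        "object containing exactly these keys:\n"
        + "\n".join(question_lines)
        + "\n\nArticle:\n"
        + article
    )


def annotate(article: str, ask_llm: Callable[[str], str]) -> Dict[str, object]:
    """Return one dataset row; replies that fail validation are flagged, not trusted."""
    invalid_row: Dict[str, object] = {"valid": False, **{key: None for key in QUESTIONS}}
    raw_reply = ask_llm(build_prompt(article))
    try:
        answers = json.loads(raw_reply)
    except json.JSONDecodeError:
        return invalid_row
    # Reject replies that are not an object or that add or drop keys.
    if not isinstance(answers, dict) or set(answers) != set(QUESTIONS):
        return invalid_row
    return {"valid": True, **{key: answers[key] for key in QUESTIONS}}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: a real client would go here)."""
    return '{"is_violent_event": true, "municipality": "Tumaco", "date": null}'


if __name__ == "__main__":
    print(annotate("Texto de la noticia...", fake_llm))
```

Flagging malformed replies instead of silently keeping them is one simple way to act on the caveats above: rows marked invalid can be re-queried or sent to human review, and a hand-checked sample of valid rows gives an estimate of label noise.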