An LLM trained only on data from certain time periods to reduce modern bias (github.com)

0 points 206 days ago

🤖 AI Summary

A new language model has been developed using an approach called Selective Temporal Training (STT): the model is trained exclusively on historical texts from a single period, 1800 to 1875, so that its output reflects the language and worldview of that era rather than modern assumptions. The first versions, v0 and v0.5, used nanoGPT's architecture, while v1 was built on Microsoft's Phi 1.5. The model writes in a 19th-century style and can recall historical events and connect them with figures from the dataset, although its outputs are sometimes incoherent because of the limits of its training corpus.

The project is of interest to the AI/ML community because it addresses modern bias in language models and offers a view of historical contexts without contemporary influence. v1 shows improvements, such as a more coherent Victorian writing style, but it still has a high rate of factual inaccuracies and residual OCR noise inherited from the training data. The project stresses that historical data must be curated and cleaned carefully to represent the linguistic style of a specific period accurately, and it may open the way for further research into language models with less modern bias.
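The core idea of restricting training data to one period can be illustrated with a minimal sketch. This is not the project's actual pipeline: the `Document` record, its `year` field, and the `select_by_period` helper are hypothetical names introduced here, and the sketch assumes every text already carries a reliable publication year.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical record type; the real corpus format is not described in the summary.
    title: str
    year: int  # assumed publication year taken from metadata
    text: str


def select_by_period(docs, start_year=1800, end_year=1875):
    """Keep only documents published within [start_year, end_year], inclusive."""
    return [d for d in docs if start_year <= d.year <= end_year]


# Toy corpus with made-up entries, purely for illustration.
corpus = [
    Document("Pamphlet A", 1790, "placeholder text"),
    Document("Newspaper B", 1834, "placeholder text"),
    Document("Novel C", 1875, "placeholder text"),
    Document("Blog post D", 2021, "placeholder text"),
]

selected = select_by_period(corpus)
print([d.title for d in selected])
```

Only the texts that pass this filter would go on to cleaning (for example, removing OCR noise) and then to training; anything outside the window is dropped before the model ever sees it.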
