AI does not excel at history, according to a new study.
A new study has found that even the most advanced language models perform poorly on high-level history questions.
A recent study reveals that, although AI models can excel at tasks such as programming or podcast creation, they struggle to answer advanced history questions correctly. A team of researchers developed a new benchmark called Hist-LLM to test three prominent language models, OpenAI's GPT-4, Meta's Llama, and Google's Gemini, on historical questions. The benchmark grades the models' answers against the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.
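The paper's code is not reproduced here, but the evaluation it describes follows the standard benchmark pattern: pose each question to a model, compare its answer to the databank's ground truth, and report the share answered correctly. The sketch below illustrates that loop under stated assumptions; the placeholder questions, labels, and the query_model stub are illustrative inventions, not the actual Hist-LLM data or harness.

```python
from typing import Callable

# Hypothetical mini-benchmark in the spirit of Hist-LLM: yes/no questions
# about whether a technology or institution existed in a given society and
# period, graded against ground-truth labels. All items here are
# placeholders, not the study's actual data.
QUESTIONS = [
    {"prompt": "Was technology X present in society Y during period Z?", "answer": "no"},
    {"prompt": "Did society Y maintain institution W during period Z?", "answer": "yes"},
]

def evaluate(query_model: Callable[[str], str]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = sum(
        1 for q in QUESTIONS if query_model(q["prompt"]) == q["answer"]
    )
    return correct / len(QUESTIONS)

if __name__ == "__main__":
    # A trivial "model" that always answers "yes": on a balanced set of
    # yes/no items this scores about 50%, i.e. the random-guessing
    # baseline the article compares GPT-4 Turbo's 46% against.
    print(f"accuracy: {evaluate(lambda prompt: 'yes'):.0%}")
```

A real harness would additionally need prompt templating, answer parsing, and per-region score breakdowns like the ones the study reports.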
The results, presented at NeurIPS, a major artificial intelligence conference, were disappointing. The best-performing model was GPT-4 Turbo, which achieved only about 46% accuracy, barely better than random guessing. According to Maria del Rio-Chanona, co-author of the study and associate professor of computer science at University College London, the main takeaway is that, despite their impressive capabilities, language models still lack the depth of understanding required for advanced history. They are good enough for basic facts, but not yet ready for more nuanced, doctoral-level historical inquiry.
The researchers shared examples of historical questions that the language models got wrong. In one case, GPT-4 Turbo was asked whether scale armor was used during a specific period in Egyptian history. The model claimed it was, even though the technology appeared in Egypt 1,500 years later. Asked why the models stumble on technical history questions while excelling at complex questions in other fields, del Rio-Chanona suggested that they tend to extrapolate from very prominent historical data and struggle to retrieve knowledge from more obscure historical contexts.
The study also noted that the OpenAI and Llama models performed especially poorly for regions such as sub-Saharan Africa, suggesting possible biases in their training data. This shows that language models cannot yet replace humans in certain domains. Still, the researchers remain hopeful that the models could one day assist historians. They are working to improve the benchmark by adding more data from underrepresented regions and posing more complex questions. In short, while the results highlight where language models fall short, they also underscore their potential to contribute to historical research.