
Keeping up with the latest research is vital for scientists, but given how many papers are published every year, that can prove difficult. Artificial intelligence systems show promise for quickly synthesizing seas of information, but they still tend to make things up, or "hallucinate."
For instance, when a team led by researchers at the University of Washington and the Allen Institute for Artificial Intelligence, or Ai2, studied a recent OpenAI model, they found that it fabricated 78-90% of its research citations. And general-purpose AI models like ChatGPT often can't access papers that were published after their training data was collected.
So the UW and Ai2 research team built OpenScholar, an open-source AI model designed specifically to synthesize current scientific research. The team also created the first large, multi-domain benchmark for evaluating how well models can synthesize and cite scientific research. In tests, OpenScholar cited sources as accurately as human experts, and 16 scientists preferred its responses to those written by subject experts 51% of the time.
The team published its results Feb. 4 in Nature. The project's code and models are publicly available and free to use.
"After we started this work, we put the demo online, and quickly we got a lot of queries, far more than we'd expected," said senior author Hannaneh Hajishirzi, a UW associate professor in the Paul G. Allen School of Computer Science & Engineering and senior director at Ai2. "When we started looking through the responses, we realized our colleagues and other scientists were actively using OpenScholar. It really speaks to the need for this sort of open-source, transparent system that can synthesize research."
Researchers trained the model and then created a datastore of 45 million scientific papers for OpenScholar to pull from, grounding its answers in established research. They coupled this with a technique called "retrieval-augmented generation," which lets the model search for new sources, incorporate them and cite them after it has been trained.
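The retrieval-augmented generation idea described above can be sketched in a few lines. This is a toy illustration, not OpenScholar's implementation: the miniature "datastore," the word-overlap scoring and the prompt format are all stand-ins for the team's 45-million-paper corpus and learned neural retrievers.

```python
# A minimal sketch of retrieval-augmented generation (RAG): retrieve the
# papers most relevant to a question, then show the model only those
# sources so every claim in its answer can carry a citation.
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    abstract: str

# Toy "datastore" standing in for the 45 million papers.
DATASTORE = [
    Paper("Attention Is All You Need",
          "We propose the transformer architecture based entirely on attention"),
    Paper("Deep Residual Learning for Image Recognition",
          "Residual connections ease the training of very deep networks"),
    Paper("Dense Passage Retrieval",
          "We learn dense representations for retrieving relevant passages"),
]

def score(query: str, paper: Paper) -> int:
    """Crude relevance score: number of words shared by query and paper."""
    qwords = set(query.lower().replace("?", "").split())
    pwords = set((paper.title + " " + paper.abstract).lower().split())
    return len(qwords & pwords)

def retrieve(query: str, k: int = 2) -> list:
    """Return the k papers most relevant to the query."""
    return sorted(DATASTORE, key=lambda p: score(query, p), reverse=True)[:k]

def build_grounded_prompt(query: str) -> str:
    """Number the retrieved papers so the model can cite them as [1], [2], ..."""
    papers = retrieve(query)
    sources = "\n".join(f"[{i}] {p.title}: {p.abstract}"
                        for i, p in enumerate(papers, start=1))
    return f"Question: {query}\nSources:\n{sources}\nAnswer, citing sources by number:"

prompt = build_grounded_prompt("How do transformer models use attention?")
print(prompt)
```

Because retrieval happens at question time, a pipeline like this can surface and cite papers published after the underlying model was trained, which is the gap the article describes.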
"Early on, we experimented with using an AI model with Google's search data, but we found it wasn't very good on its own," said lead author Akari Asai, a research scientist at Ai2 who completed this research as a UW doctoral student in the Allen School. "It might cite some research papers that weren't the most relevant, or cite just one paper, or pull from a blog post randomly. We realized we needed to ground this in scientific papers. We then made the system flexible so that it could incorporate emerging research through search results."
To test their system, the team created ScholarQABench, a benchmark for evaluating systems on scientific search. They gathered 3,000 queries and 250 long-form answers written by experts in computer science, physics, biomedicine and neuroscience.
"AI is getting better and better at real-world tasks," Hajishirzi said. "But the big question ultimately is whether we can trust that its answers are correct."
The team compared OpenScholar against other state-of-the-art AI models, such as OpenAI's GPT-4o and two models from Meta. ScholarQABench automatically evaluated the AI models' answers on metrics such as accuracy, writing quality and relevance.
OpenScholar outperformed all the systems it was tested against. The team then had 16 scientists review answers from the models and compare them with human-written responses. The scientists preferred OpenScholar's answers to human answers 51% of the time, but when the team combined OpenScholar's citation methods and pipelines with GPT-4o (a much bigger model), the scientists preferred the AI-written answers to human answers 70% of the time. They picked answers from GPT-4o on its own only 32% of the time.
"Scientists see so many papers coming out every day that it's impossible to keep up," Asai said. "But existing AI systems weren't designed for scientists' specific needs. We've already seen a lot of scientists using OpenScholar, and because it's open-source, others are building on this research and already improving on our results. We're working on a follow-up model that builds on OpenScholar's findings and performs multi-step search and information gathering to produce more comprehensive responses."
Other co-authors include UW doctoral students in the Allen School; a UW professor emeritus in the Allen School who is also general manager and chief scientist at Ai2; a UW postdoctoral scholar in the Allen School and at Ai2; a UW professor and a UW assistant professor in the Allen School; Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'Arcy, David Wadden, Matt Latzke, Jenna Sparks and Jena D. Hwang of Ai2; Wen-tau Yih of Meta; Minyang Tian, Shengyan Liu, Hao Tong and Bohao Wu of the University of Illinois Urbana-Champaign; Pan Ji of the University of North Carolina; Yanyu Xiong of Stanford University; and Graham Neubig of Carnegie Mellon University.
For more information, contact Asai at akaria@allenai.org and Hajishirzi at hannaneh@cs.washington.edu.