AI2 is developing a large language model optimized for science

PaLM 2. GPT-4. The list of text-generating AI models grows practically by the day.

Most of these models are gated behind APIs, making it impossible for researchers to see exactly what makes them tick. But increasingly, community efforts are producing open source AI that is as sophisticated as its commercial counterparts, if not more so.

The latest of these efforts is the Open Language Model, a large language model that the nonprofit Allen Institute for AI (AI2) plans to release sometime in 2024. The Open Language Model, or OLMo for short, is being developed in collaboration with AMD and the Large Unified Modern Infrastructure (LUMI) consortium, which provides supercomputing power for training and education, as well as Surge AI and MosaicML (which provide training data and code).

“The research and technology communities need access to open language models to advance this science,” Hanna Hajishirzi, senior director of NLP research at AI2, told TechDigiPro in an email interview. “With OLMo, we are working to close the gap between research capabilities and public and private knowledge by building a competitive language model.”

One might wonder, as this reporter did, why AI2 felt the need to develop an open language model when there are already several to choose from (see Bloom, Meta's LLaMA, etc.). The way Hajishirzi sees it, while the open source releases to date have been valuable and have even pushed the envelope, they have missed the mark in a number of ways.

AI2 sees OLMo as a platform, not just a model: one that will allow the research community to take every component AI2 creates and either use it themselves or seek to improve it. Everything AI2 creates for OLMo will be openly available, Hajishirzi says, including a public demo, training dataset, and API, and will be documented, with “very limited” exceptions, under “proper” licenses.

“We are building OLMo to create greater access for the AI research community to work directly on language models,” Hajishirzi said. “We believe that the wide availability of all aspects of OLMo will allow the research community to take what we are creating and work to improve it. Our ultimate goal is to collaboratively build the best open language model in the world.”

The other differentiator of OLMo, according to Noah Smith, senior director of NLP research at AI2, is a focus on enabling the model to better leverage and understand textbooks and scholarly papers rather than, say, code. There have been other attempts at this, such as Meta's infamous Galactica model. But Hajishirzi believes that AI2’s work in academia and the tools it has developed for research, such as Semantic Scholar, will help make OLMo “uniquely suitable” for scientific and academic applications.

“We think OLMo has the potential to be something really special in the field, especially in a landscape where many are rushing to cash in on interest in generative AI models,” Smith said. “AI2’s unique ability to act as external experts gives us the opportunity to not only draw on our own world-class expertise, but also to collaborate with the strongest minds in the industry. As a result, we believe our rigorous and documented approach will lay the foundation for building the next generation of safe and effective AI technologies.”

That’s a nice sentiment, no doubt. But what about the thorny ethical and legal issues surrounding the training and release of generative AI? Debate is raging over the rights of content owners (among other affected stakeholders), and countless lingering issues have yet to be resolved in court.

To allay concerns, the OLMo team plans to work with AI2’s legal department and to-be-determined outside experts, stopping at “checkpoints” in the model-building process to reassess privacy and intellectual property rights issues.

“We hope that through an open and transparent dialogue about the model and its intended use, we can better understand how to mitigate bias, toxicity, and shed light on outstanding research questions within the community, ultimately resulting in one of the most robust models available,” Smith said.

What about the potential for misuse? Models, which are often toxic and biased to begin with, are ripe targets for bad actors intent on spreading disinformation and generating malicious code.

Hajishirzi said that AI2 will use a combination of licensing, model design, and selective access to the underlying components to “maximize scientific benefits and reduce the risk of harmful use.” To guide the policy, OLMo has an ethics review committee with internal and external advisers (AI2 would not say who exactly) who will provide feedback throughout the model creation process.

We’ll see to what extent that makes a difference. For now, a lot is up in the air, including most of the model’s technical specifications. (AI2 did reveal that it will have around 70 billion parameters, with parameters being the parts of the model learned from historical training data.) Training is set to begin in the coming months on the LUMI supercomputer in Finland, which as of January was the fastest supercomputer in Europe.

AI2 is inviting collaborators to contribute to and critique the model development process. Those interested can contact the organizers of the OLMo project here.

