Hugging Face, ServiceNow, and Nvidia Released ‘StarCoder2’, a Free Code-Generating Model

IBL News | New York

The BigCode project, an open scientific collaboration focused on the development of LLMs for code (Code LLMs), released StarCoder2 this week, an AI-powered open-source code generator with a less restrictive license than GitHub Copilot, Amazon CodeWhisperer, and Meta’s Code Llama.

Like most other code generators, StarCoder2 can suggest ways to complete unfinished lines of code as well as summarize and retrieve snippets of code when asked in natural language.

All StarCoder2 variants are trained on The Stack v2, a new large, high-quality code dataset. StarCoder2 is a family of open LLMs for code that comes in three sizes: a 3B-parameter model trained by ServiceNow, a 7B model trained by Hugging Face, and a 15B model trained by NVIDIA with NVIDIA NeMo on NVIDIA accelerated infrastructure.

The largest of these, StarCoder2-15B, was trained on over 4 trillion tokens spanning 600+ programming languages from The Stack v2.

BigCode released all the models, the dataset, and the data-processing and training code, as explained in an accompanying paper.

Among the universities participating in the project were Northeastern University, University of Illinois Urbana-Champaign, Johns Hopkins University, Leipzig University, Monash University, University of British Columbia, MIT, Technical University of Munich, Technion – Israel Institute of Technology, University of Notre Dame, Princeton University, Wellesley College, University College London, UC San Diego, Cornell University, and UC Berkeley.

Beyond academia, the project gathered Kaggle, Roblox, Sea AI Lab, CSIRO’s Data61, Mazzuma, Contextual AI, Cohere, and Salesforce.

StarCoder 2 can be fine-tuned in a few hours on a GPU such as the Nvidia A100 using first- or third-party data to create apps such as chatbots and personal coding assistants. And because it was trained on a larger and more diverse dataset than the original StarCoder (roughly 619 programming languages), StarCoder 2 can make more accurate, context-aware predictions — at least hypothetically.

Harm de Vries, head of ServiceNow’s StarCoder 2 development team, told TechCrunch in an interview that “with StarCoder2, developers can use its capabilities to make coding more efficient without sacrificing speed or quality.”

A recent Stanford study found that engineers who use code-generating systems are more likely to introduce security vulnerabilities in the apps they develop. Moreover, a poll from the cybersecurity firm Sonatype shows that the majority of developers are concerned about the lack of insight into how code from code generators is produced, and about “code sprawl” from generators producing more code than teams can manage.

StarCoder 2’s license might also prove to be a roadblock for some, according to TechCrunch.

“StarCoder 2 is licensed under the BigCode Open RAIL-M 1.0, which aims to promote responsible use by imposing ‘light touch’ restrictions on both model licensees and downstream users. While less constraining than many other licenses, RAIL-M isn’t truly ‘open’ in the sense that it doesn’t permit developers to use StarCoder 2 for every conceivable application (medical advice-giving apps are strictly off limits, for example). Some commentators say RAIL-M’s requirements may be too vague to comply with in any case — and that RAIL-M could conflict with AI-related regulations like the EU AI Act.”

ServiceNow has already used StarCoder to create Now LLM, a code-generation product fine-tuned for ServiceNow workflow patterns, use cases, and processes. Hugging Face, which offers model implementation consulting plans, is providing hosted versions of the StarCoder 2 models on its platform. Nvidia is making StarCoder 2 available through an API and web front-end.