StarCoder, a New Free Code-Generating Model Alternative to GitHub’s Copilot

IBL News | New York

Hugging Face and ServiceNow released StarCoder, a free AI code-generating system positioned as an alternative to GitHub’s Copilot (powered by OpenAI’s Codex), DeepMind’s AlphaCode, and Amazon’s CodeWhisperer.

StarCoder, which is licensed to allow royalty-free use by anyone, including corporations, was trained on more than 80 programming languages as well as text from GitHub repositories, including documentation and Jupyter programming notebooks.

It also integrates with Microsoft’s Visual Studio Code code editor and, like OpenAI’s ChatGPT, can follow basic instructions (e.g., “create an app UI”) and answer questions about code.

ServiceNow supplied an in-house compute cluster of 512 Nvidia V100 GPUs to train the StarCoder model.

Leandro von Werra, who works at Hugging Face and is a co-lead on StarCoder, claimed that StarCoder matches or outperforms the AI model from OpenAI that was used to power initial versions of Copilot.

Unlike Copilot, the 15-billion-parameter StarCoder was trained over the course of several days on an open-source dataset called The Stack, which has over 19 million curated, permissively licensed repositories and more than six terabytes of code in over 350 programming languages.

Because it’s permissively licensed, code from The Stack can be copied, modified, and redistributed.
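To illustrate what "permissively licensed" means in practice, here is a minimal sketch of filtering repositories by declared license. It is not the procedure used to build The Stack: the license set is a small hand-picked sample of well-known permissive SPDX identifiers, and the repository records and function names are hypothetical.

```python
# Illustrative only: a small sample of well-known permissive SPDX identifiers,
# NOT the actual license list used to build The Stack.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}


def is_permissive(license_id):
    """Return True if the SPDX identifier is in the sample permissive set."""
    return license_id is not None and license_id in PERMISSIVE_LICENSES


def filter_permissive(repos):
    """Keep only repositories whose declared license is in the sample set."""
    return [repo for repo in repos if is_permissive(repo.get("license"))]


# Hypothetical repository records, invented for this example.
repos = [
    {"name": "example/alpha", "license": "MIT"},
    {"name": "example/beta", "license": "GPL-3.0-only"},
    {"name": "example/gamma", "license": None},
    {"name": "example/delta", "license": "Apache-2.0"},
]

kept = filter_permissive(repos)
print([repo["name"] for repo in kept])
```

In this sketch, repositories with a copyleft license or no declared license are dropped, which mirrors the general idea of keeping only code that can be freely copied, modified, and redistributed.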

StarCoder isn’t open source in the strictest sense. Rather, it’s being released under a licensing scheme, OpenRAIL-M, that includes “legally enforceable” use case restrictions.

The StarCoder code repositories, model training framework, dataset-filtering methods, code evaluation suite, and research analysis notebooks are available on GitHub as of this week.

“At launch, StarCoder will not ship as many features as GitHub Copilot, but with its open-source nature, the community can help improve it along the way as well as integrate custom models,” Leandro von Werra told TechCrunch.

The nonprofit Software Freedom Conservancy, among others, has criticized GitHub and OpenAI for using public source code, not all of which is under a permissive license, to train and monetize Codex.

AI-powered coding tools can cut development costs substantially while allowing coders to focus on more creative tasks. A study from the University of Cambridge found that at least half of developers’ efforts are spent debugging and not actively programming, which costs the software industry an estimated $312 billion per year.