IBL News | New York
Hugging Face and ServiceNow released StarCoder, a free AI code-generating system and an alternative to GitHub’s Copilot (powered by OpenAI’s Codex), DeepMind’s AlphaCode, and Amazon’s CodeWhisperer.
StarCoder, which is licensed to allow royalty-free use by anyone, including corporations, was trained on code in over 80 programming languages as well as text from GitHub repositories, including documentation and Jupyter notebooks.
It also integrates with Microsoft’s Visual Studio Code code editor and, like OpenAI’s ChatGPT, can follow basic instructions (e.g., “create an app UI”) and answer questions about code.
ServiceNow supplied an in-house compute cluster of 512 Nvidia V100 GPUs to train the StarCoder model.
Leandro von Werra, of Hugging Face and a co-lead on StarCoder, claimed that StarCoder matches or outperforms the AI model from OpenAI that was used to power initial versions of Copilot.
Unlike Copilot, the 15-billion-parameter StarCoder was trained over the course of several days on an open-source dataset called The Stack, which has over 19 million curated, permissively licensed repositories and more than six terabytes of code in over 350 programming languages.
Because it’s permissively licensed, code from The Stack can be copied, modified, and redistributed.
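As a rough illustration of what filtering a code dataset by license can look like, the sketch below keeps only repositories whose license appears on an allowlist. The allowlist, the `Repo` type, and the repository names are hypothetical and are not the actual list or tooling used to build The Stack.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical allowlist of SPDX identifiers for permissive licenses.
# The real criteria used for The Stack may differ.
PERMISSIVE_LICENSES = frozenset({"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"})


@dataclass(frozen=True)
class Repo:
    name: str
    license_id: Optional[str]  # None when no license could be detected


def keep_permissive(repos: Iterable[Repo]) -> List[Repo]:
    """Return only the repositories whose license is on the allowlist."""
    return [repo for repo in repos if repo.license_id in PERMISSIVE_LICENSES]


# Made-up sample data for illustration only.
SAMPLE_REPOS = [
    Repo("example/permissive-lib", "MIT"),
    Repo("example/copyleft-tool", "GPL-3.0-only"),
    Repo("example/unlicensed-scripts", None),
]
```

Calling `keep_permissive(SAMPLE_REPOS)` would keep only the MIT-licensed entry; repositories with a copyleft license or no detectable license are dropped.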
StarCoder isn’t open source in the strictest sense. Rather, it’s being released under a licensing scheme, OpenRAIL-M, that includes “legally enforceable” use case restrictions.
The StarCoder code repositories, model training framework, dataset-filtering methods, code evaluation suite, and research analysis notebooks are available on GitHub as of this week.
“At launch, StarCoder will not ship as many features as GitHub Copilot, but with its open-source nature, the community can help improve it along the way as well as integrate custom models,” Leandro von Werra told TechCrunch.
The nonprofit Software Freedom Conservancy, among others, has criticized GitHub and OpenAI for using public source code, not all of which is under a permissive license, to train and monetize Codex.
Introducing: 💫StarCoder
StarCoder is a 15B LLM for code with 8k context and trained only on permissive data in 80+ programming languages. It can be prompted to reach 40% pass@1 on HumanEval and act as a Tech Assistant.
Try it here: https://t.co/4XJ0tn4K1m
Release thread🧵 pic.twitter.com/wZj6B2KKZE
— BigCode (@BigCodeProject) May 4, 2023
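For readers unfamiliar with the "pass@1" figure in the announcement above: pass@k estimates the probability that at least one of k generated solutions to a problem passes that problem's unit tests. The sketch below uses the unbiased estimator commonly applied to HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass. The announcement does not say how many samples BigCode drew, so the numbers in the usage note are made up.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no results given")
    return sum(scores) / len(scores)
```

With k = 1 the estimate reduces to c / n, so a hypothetical problem where 4 of 10 samples pass contributes 0.4; a benchmark score of 40% pass@1 is the average of such per-problem values.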
StarCoder was also trained on JupyterNotebooks and with Jupyter plugin from @JiaLi52524397 it can make use of previous code and markdown cells as well as outputs to predict the next cell.
You can install it here or search on chrome store: https://t.co/JhQEsOqNzr pic.twitter.com/LBV9ScI6Pb
— BigCode (@BigCodeProject) May 4, 2023