Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: StarCoder


Tyler Josephson ⚛️ (May 05 2023 at 12:12):

An open-source LLM trained on code, covering 80 programming languages, including Lean (I think Lean 3).
Paper: https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view
Code: https://huggingface.co/bigcode

Eric Taucher (May 05 2023 at 12:43):

Please don't take this the wrong way; I'm just jumping straight to the facts.

From Table 1 ("Overview of the training data for StarCoder"):

Language Percentage
lean 0.012

(Yes, these are percentages, so 1% would appear as 1.0 in this table.)

Just a few weeks ago I checked StarCoder with regard to Prolog, so I knew to look for this detail.

Language Percentage
prolog 0.001

:frown:

Tyler Josephson ⚛️ (May 05 2023 at 13:27):

Yeah, 0.012% is not a lot. That’s 0.09 GB, so just 90 MB. Not sure how well it’ll perform, but it’s open source, so weights can be fine-tuned.
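
For the curious, here's a quick back-of-the-envelope sketch of that arithmetic (the total corpus size below is inferred from these two numbers, not quoted from the paper):

```python
# Sanity check of the figures above: 0.012% of the training data is 0.09 GB,
# so the implied total corpus is roughly 750 GB and the Lean slice ~90 MB.
lean_fraction = 0.012 / 100   # 0.012% expressed as a fraction
lean_size_gb = 0.09           # GB of Lean code reported in Table 1

implied_total_gb = lean_size_gb / lean_fraction
print(f"Implied total training data: {implied_total_gb:.0f} GB")  # ~750 GB
print(f"Lean slice: {lean_size_gb * 1024:.0f} MB")                # ~92 MB
```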

Scott Morrison (May 07 2023 at 05:46):

Has anyone tried StarCoder? I tried to get started (which is actually pretty easy), but the downloads time out every time.
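
In case it's useful to others hitting the same issue, here is a minimal sketch of downloading the checkpoint with resume support before loading it (this assumes access to the gated bigcode/starcoder repo and recent transformers/huggingface_hub installs; untested here, so no promises it fixes the time-outs):

```python
# Sketch: fetch the StarCoder snapshot with resumable downloads, then load it.
# Requires accepting the model license on the Hub and being logged in
# (e.g. via `huggingface-cli login`); adjust dtype/device to your hardware.
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# snapshot_download caches shards locally and can resume interrupted downloads,
# which tends to be more robust than downloading inside from_pretrained.
local_dir = snapshot_download("bigcode/starcoder", resume_download=True)

tokenizer = AutoTokenizer.from_pretrained(local_dir)
model = AutoModelForCausalLM.from_pretrained(
    local_dir, torch_dtype=torch.float16, device_map="auto"
)

prompt = "lemma add_comm' (a b : ℕ) : a + b = b + a :="
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```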

Zhangir Azerbayev (May 09 2023 at 20:47):

Even though StarCoder's training set is only 0.012% Lean, that's in the same ballpark as models we've already seen exhibit nontrivial Lean ability, like gpt-j, code-davinci-002, and gpt-3.5/4. Note that language models are really good at generalizing to low-resource languages.

Of course, finetuning will improve performance a lot. In two weeks, Tim Dettmers is releasing code that will allow you to finetune StarCoder on a single GPU.
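
As a rough sketch of what finetuning StarCoder on one GPU along those lines could look like (assuming 4-bit quantization plus LoRA adapters via the bitsandbytes and peft libraries; the target module names are a guess for the GPTBigCode architecture, and this is not the code being released):

```python
# Sketch of parameter-efficient finetuning: load the base model quantized to
# 4-bit and train small LoRA adapters on top, so everything fits on one GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Projection names for GPTBigCode are a guess; check model.named_modules().
    target_modules=["c_attn", "c_proj", "c_fc"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
# ...then train with the usual transformers Trainer on your Lean corpus.
```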

Junyan Xu (Jun 17 2023 at 21:44):

Let me highlight more recent developments in models for coding and software development:

  • WizardCoder, surpassing the open-source SOTA by ~20% on HumanEval pass@1 (see the pass@k sketch after this list). Generated/synthetic data is the future. I'm not sure how much Lean data is in there, but their repo says:

At present, our core contributors are preparing the 65B version and we expect to empower WizardLM with the ability to perform instruction evolution itself, aiming to evolve your specific data at a low cost.

  • DIDACT. From its announcement:

The novelty of DIDACT is that it uses the process of software development as the source of training data for the model, rather than just the polished end state of that process, the finished code. By exposing the model to the contexts that developers see as they work, paired with the actions they take in response, the model learns about the dynamics of software development and is more aligned with how developers spend their time.
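
For reference, pass@1 above refers to the standard HumanEval pass@k metric from the Codex paper. A minimal sketch of the unbiased estimator (the example numbers are made up):

```python
# Unbiased pass@k estimator from the Codex/HumanEval paper (Chen et al., 2021):
# with n samples per problem of which c pass the tests,
# pass@k = 1 - C(n - c, k) / C(n, k), averaged over problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 correct -> pass@1 = 37/200
print(pass_at_k(n=200, c=37, k=1))  # 0.185
```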

On the other hand, it seems leanprover-community doesn't have access to GitHub Copilot for PRs yet.

Sid (Jun 18 2023 at 23:00):

Junyan Xu said:

Let me highlight more recent developments in models for coding and software development:

  • WizardCoder, surpassing the open-source SOTA by ~20% on HumanEval pass@1. Generated/synthetic data is the future. I'm not sure how much Lean data is in there, but their repo says:

A couple of possible issues with this evaluation: (1) the initial prompts used for generating more prompts could be leaking data from HumanEval, and (2) ChatGPT is used to generate more prompts and could have knowledge of HumanEval, which is another source of data leakage. A further issue with this methodology is that it seems upper-bounded by how good the LLM used to generate the responses to the prompts is.

Eric Wieser (Jun 19 2023 at 14:27):

(I assume the last part of your message is not intended to be quoted)


Last updated: Dec 20 2023 at 11:08 UTC