From the abstract: “Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.”
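If I read the paper right, the ternarization is an "absmean" scheme: scale each weight matrix by its mean absolute value, round, and clamp to {-1, 0, 1}. A toy sketch of that idea (my own reading, not the authors' code):

```python
import numpy as np

def absmean_ternarize(w, eps=1e-6):
    """Quantize a weight matrix to {-1, 0, 1}:
    scale by the mean absolute value, round to the
    nearest integer, clamp to [-1, 1]."""
    gamma = np.abs(w).mean() + eps            # per-tensor scale
    return np.clip(np.rint(w / gamma), -1, 1), gamma

w = np.array([[0.9, -0.05, -1.3],
              [0.2,  0.7,  -0.6]])
q, gamma = absmean_ternarize(w)
# every entry of q ends up in {-1, 0, 1}
```

Small weights collapse to 0, which is exactly the extra state b1.58 adds over the original binary BitNet.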
Would allow larger models with limited resources. However, this isn't a quantization method you can apply to existing models after the fact; models need to be trained from scratch this way, and so far they only went as far as 3B parameters. The paper isn't that long, and it seems they didn't release the models. It builds on the BitNet paper from October 2023.
“the matrix multiplication of BitNet only involves integer addition, which saves orders of energy cost for LLMs.” (no floating point matrix multiplication necessary)
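To illustrate why the matmul reduces to additions: with weights restricted to {-1, 0, 1}, each product w·x is just +x, -x, or nothing. A deliberately naive sketch (real kernels would pack and vectorize this, of course):

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product for a ternary weight matrix,
    using only additions and subtractions -- no multiplies."""
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                y[i] += x[j]        # +1 weight: add
            elif W[i, j] == -1:
                y[i] -= x[j]        # -1 weight: subtract
            # 0 weight contributes nothing
    return y

W = np.array([[1, 0, -1],
              [0, 1,  1]])
x = np.array([2.0, 3.0, 5.0])
# ternary_matvec(W, x) matches W @ x
```

Integer adders are far cheaper in silicon and energy than floating-point multipliers, which is where the claimed energy savings come from.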
“1-bit LLMs have a much lower memory footprint from both a capacity and bandwidth standpoint”
Edit: additional FAQ published
This is big if true, but we’ll have to see how well it holds up at larger scales.
The brevity of the paper is a bit worrying, but the authors are all very reputable. Several were also contributors to the RetNet and Kosmos-2/2.5 papers.
As far as I understand, their contribution is to apply what has proven to work well in the Llama architecture to what BitNet does, and to add a '0'. Maybe you just don't need that much text to explain it, just the statistics.
They claim it scales like an FP16 Llama model does… So unless their judgement/maths is wrong, it should hold up. I can't comment on that. But I'd like it if it were true…
This is sick. Would this lead to better offline LLMs on mobile?
ollama already lets you run many 7B LLMs on Android with 4-bit quantization.
I think we're already getting there. Lots of newer phones include AI accelerators, and all the companies advertise AI. I don't think those accelerators are made to run LLMs, but anyways. llama.cpp already runs on phones, and the limiting factor seems to be RAM. I've tried Microsoft's "phi-2", quantized and on slow hardware, and it's surprisingly capable for such a small model. Something like a ternary model would significantly cut down on the amount of RAM being used, which would allow loading larger models while also making inference faster, everywhere. So I'd say yes. And it would also allow me to load a more intelligent model on my PC.
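Back-of-envelope on the RAM point, counting weight storage only (activations and KV cache would come on top; a hypothetical 7B model used as the example):

```python
def model_size_gb(params_billion, bits_per_weight):
    """Rough weight-storage estimate in GB (decimal),
    ignoring activations, KV cache, and overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Hypothetical 7B model:
fp16    = model_size_gb(7, 16)    # ~14.0 GB
four    = model_size_gb(7, 4)     # ~ 3.5 GB
ternary = model_size_gb(7, 1.58)  # ~ 1.4 GB
```

So on a phone where a 4-bit 7B model barely fits, a ternary model of roughly 2.5x the parameter count would occupy about the same RAM.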
I think doing away with matrix multiplications is also a big deal, but it has little practical consequence as of today. You'd first need to re-design the chips to take advantage of it, and local inference is typically limited by memory bandwidth, not multiplication speed. At least as far as I understand.
I'd say if this is true, it allows for a big improvement in parameter count for all kinds of use cases. But I've also come to the conclusion that there might be a caveat to that. Maybe the training is prohibitively expensive. I don't really know; at this point there is too much speculation going on and I'm not really an expert.
Yeah I knew about the AI chips being more common but this is a really good write up, thanks!
My mind was already blown that models like Llama work with 4-bit quantization. But this is just insane.
So are more bits less important than more parameters? Would a higher parameter count or a higher bit count matter more if the models ended up the same size?
They claim it performs at 1.58 bits about as well as something with 16 bits. I don't quite get your question. It seems we can do with less precision / different maths and arrive at the same quality. The total parameter count isn't affected; the numbers just don't take 16 bits each anymore, but less.
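(The "1.58" in the name is just the information content of a ternary digit: a value with three possible states carries log2(3) bits.)

```python
import math

# One ternary weight (three states: -1, 0, 1) carries
# log2(3) bits of information -- hence "b1.58".
bits_per_ternary_weight = math.log2(3)
# ≈ 1.585
```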
They said theirs is "comparable with the 8-bit models". It's all tradeoffs. It isn't clear to me where you should allocate your compute/memory budget. I've noticed that full 7B 16-bit models often produce better results for me than some much larger quantized models. It will be interesting to find the sweet spot.
I can’t find that mention of “8-bit models” anywhere in the paper, just by skimming it again I only see references and comparisons to FP16.
I know these discussions from llama.cpp and ggml quantization. With that, you can quantize a model more and more, and it becomes worse the lower the precision gets. You can counter that by using a larger model that was more "intelligent" in the first place… With that you can calculate the sweet spot, what gives you the best quality at a certain compute cost or size… A more degraded bigger model, or a less degraded smaller model…
But we don't have different quantization levels here, just one. And it's also difficult to compare: with ggml you take the same model and quantize it to different levels… We don't have that here; you can't take an existing model, apply this approach, and compare it to the original… You have to train a new model from scratch. And then it's a different model.
I can't find a good analogy here… Maybe it's a bit like asking whether the file size of a JPEG image is more important than its resolution… It's kind of the wrong question. You can compare different compression levels of a JPEG image, or compare the size of the JPEG to a BMP file… It's really not a good analogy, but a BMP file with 20 times the size looks exactly like a smaller JPEG file on the screen. And you can also have a 7B-parameter LLM give better answers than a poor (or older) 13B model. It's neither parameter count nor precision alone.
So if they say they can do with less than a third of the RAM and compute time and simultaneously score a tiny bit higher in the benchmarks, I don't see a tradeoff here.
Generally speaking you can ask: what delivers the best results at a given compute cost? Or the other way around: what has the lowest cost to arrive at a certain quality? But this is kind of a different technique: same parameter count, same results, but significantly lower computing cost at inference.
(And reading all the speculation elsewhere: there might be a different tradeoff. The authors didn't say much about training and only trained fairly small models. A more complex and expensive training process could be the tradeoff.)
Apparently I am an idiot and read the wrong paper. It was the previous paper that mentioned being "comparable with the 8-bit models".
Reading up on the speculation on the internet: there must be a caveat… There is probably a reason why they only trained models up to 3B parameters… I mean, the team has Microsoft's name underneath it, so they should have access to enough GPUs. Maybe the training is super (computationally) expensive.
They say that the models would have to be trained from scratch, and so far that has always been super expensive.
Sure, I meant considerably more expensive than current methods… It's not really a downside if it's only as expensive as other methods, given the huge benefits it has after training is finished (at inference).
If it's just that, the next base/foundation models would surely be designed with this in mind. And companies would soon pick up on it, since the initial investment in training would pay back quickly. And then you have something like an 8x competitive advantage.
Ah, I thought you meant why the researchers themselves hadn't produced any larger models. AFAIK neither MS nor OpenAI has released even a 7B model; they might have larger BitNet models which they only use internally.
Hmm. I meant kind of both. I think them not releasing a model isn't a good sign to begin with. That wouldn't matter if somebody picked it up. (What I read from the paper is that they did some training up to 3B(?!) and then extrapolated in some way to get more measurements without actually training larger models. So even internally they don't seem to have any real larger models. But even the small models don't seem to have been published. I mean, I also don't have any insight into how many GPUs the researchers/companies have sitting around or what they're currently using them for. It's a considerable amount, though.)
It’s only been a few weeks. I couldn’t find a comprehensive test / follow-up of their approach yet. However last week they released some more information: https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
And I found this post from 2 days ago where someone did a small training run and published the loss curve.
And some people have started working on implementations on GitHub. I'm not sure though where this is supposed to go without actual models being available.
Unfortunately it looks like there’s no way to convert an existing LLM to this format (for now). It’ll have to be trained from scratch.