GLM-5.1 is over 200GB even when quantized to 1-bit. Kimi K2.6 is even bigger. A Framework desktop cannot run either of these. Qwen3.6 is significantly smaller and the model weights could fit, but consider the KV-cache you’d need for all of the company’s users, and the throughput required to serve them all.
You’re right that it is within reach for a company but framework desktop makes zero sense for this
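To make the KV-cache point concrete, here is a rough sketch of how cache size grows with context length and user count. It assumes standard grouped-query attention (models with other attention schemes will differ), and the layer count, head count, head size, context length and user count are all made-up illustrative values, not the real configuration of Qwen3.6 or any other model named here.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, tokens, bytes_per_elem=2):
    """KV-cache size for standard (grouped-query) attention.

    The leading 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


# Hypothetical model shape and workload, NOT a real model configuration.
per_token = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, tokens=1)
per_user = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, tokens=32_768)
users = 50

GIB = 1024 ** 3
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_user / GIB:.0f} GiB per user at 32k tokens")
print(f"{users * per_user / GIB:.0f} GiB for {users} concurrent users")
```

Under these assumptions the cache for a few dozen long-context users is already several times larger than a 128GB machine, before counting the weights themselves.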
Qwen3.6 27b beats Claude Opus 4.5 in most benchmarks. Qwen3.6 35b beats Opus 4.5 in a few specific benchmarks, but most benchmarks have Opus 4.5 beating Qwen3.6 35b, although there is not a big gap between Opus 4.5 and Qwen3.6 27b or 35b either way.
deepseek distilled is an alternative that works on more modest hardware.
and i’m not really interested in what claude and chatgpt, mistral and the others are doing, i would never touch those models with a ten foot pole. if i can’t run it it does not get run.
At Q8 it is around 35-40GB I think + memory for required context.
I have a Framework desktop. It gets you around 6 t/s. Not suitable for professional use but for personal use I think it is fine. I do prefer Gemma 4 though, but that comes with similar requirements.
huh, i thought that ryzen ai thing would perform better than that. my 7900xtx regularly gets 30+tps with qwen, up to hundreds with more compressed models.
The post makes the manager seem like a fool, when the real answer is actually “yes” and this manager is actually ahead of the curve. Not by training an LLM from scratch, of course, but instead building an inference server and locally hosting an open-weight LLM. There are several to choose from that can nearly match Claude’s capabilities.
suspiciously sounds like an answer you would get from Claude
It’s not an answer you’d get from Claude — it’s real, organic content:
(🤪 this is a joke)
This can’t possibly be Claude. It’s too vapid and meaningless to be anything but an MBA.
You’re absolutely right! Such intricate collection of words placed in such exact order cannot possibly be generated by an LLM such as me, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us, I mean such as us
Found samsung’s voice to text user.
(Phones give one a google or samsung choice. and samsung is worthless, it tends to endlessly repeat a phrase, like above, but sometimes for much longer, like holding the backspace for a couple of minutes one time.)
The em dash is a nice touch
It’s got everything. Em dash. It’s not X, it’s Y. Emoji bullet points.
Perfect.
I just wish I could have fit a “You’re absolutely right!” in there
You’re absolutely right! I should have included that in my previous statement
Nothing screams LLMs like using emojis instead of bullet points. I can’t figure out how LLMs got that idea though. I never saw that in human writing before people started using ChatGPT for every little goddamn thing.
It could also be like that meme where both ends of the bell curve have the same idea
Honestly IDK why companies, especially medium-big ones, don’t do this. They could plug in RAG with internal/confidential data and have better results and security. I guess the question is what the capital plus maintenance cost is of running such infra for, say, 10k+ employees
I think the issue is also that you need some serious hardware to get good inference speed when your devs are working, but then most of the time this hardware will be under utilized.
That being said you can get good performance from indie inference farms, at a fraction of the cost of the big US labs. I think it’s a great compromise and in a few months the open models will be near parity with opus 4.6 which is really all you need for most tasks.
The same tasks that can fit into 640KB.
Not sure what you’re referring to?
https://www.computerworld.com/article/1563853/the-640k-quote-won-t-go-away-but-did-gates-really-say-it.html
Aha thanks for sharing, that’s a cool anecdote. But i think my point still stands, as there are threshold effects in LLM “intelligence” which don’t directly map to the RAM comparison.
Opus 4.6 is comparable to a mid-level developer. It requires some guidance and will sometimes get things wrong, but is also suitable to work in most business environments: most projects are not that complicated or high stakes in the first place.
In the future you’ll probably have Opus 7.5 or some shit, which will be at a mega-senior level but also considerably more expensive. And given the price difference, companies will suddenly discover that they don’t really need expert level coding at a high price tag, and that a reliable workhorse at a fraction of the cost is largely enough for their needs.
Not really…
Yes, it pays attention to certain details that humans will tend to flub, so it’s better than juniors when it comes to that…
But broadly speaking, it’s a moron. It’s like a junior dev pasting 15 year old stack overflow answers into a project, but better at making it fit in, but still doing pretty dumb approaches.
I spent a bunch of tokens to try to get Opus 4.7 to do a task for me last week. The result had mistakes and the test case that should be near instant took 3 minutes to complete (indicating that a user would be staring at a spinner for 3 minutes). It did save me the trouble of trying to figure out the basic structure of the thing I was going to interact with (the documentation was dense and lacking specific examples, and Opus did output something that let me see how it basically worked in a to-the-point way), but I had to rewrite the “meat” of the task to get correct execution in under a second.
My impression has been less about it being more “senior” over time and more about being able to consistently deliver junior level work for longer amounts of output. Error rate remains problematic so you end up with more to review that in a way tortuously “looks right” for longer. When it digs itself into a hole, it’s very bad at trying to amend the mess that has accumulated.
I mean obviously mileage does vary from project to project and task to task, but i think you might be overestimating mid-level developers. Or you’ve been really lucky with your recruitment! Cause i would describe them just the way you described Opus. Pretty eager, kind of try-hard, decent engineering chops but often misdirected with dumb approaches.
Of course my experience is limited and i’ve never really been in a managing role but i’ve been the adult in a fair number of rooms and i’ve done my share of “grooming sprints” and dispatching tasks.
That being said, there are projects that are horribly resistant to agentic coding. It’s pretty rare as most codebases nowadays are bog standard and rely on roughly the same abstractions, but i’ve seen it happen. It can come from the complexity of the domain, or of the codebase, or from the way documentation and tribal knowledge clash, or a myriad other reasons. Often it’s the kind of project that requires more mature devs and can’t really onboard juniors/mids.
Oh yeah definitely. Once it’s in the hole you better scratch that branch off and restart with more specific instructions cause agents are very “additive”, they don’t often think to remove stuff and change their approach. Again, kind of like mid devs once they’re committed to an implementation plan.
Bigs definitely do, and anyone with confidential data should be.
Because in the feeding frenzy, every company with a product/marketing budget is trying to make the customers pay by the token and companies are doing jack to help “mere mortal” companies get going with this stuff on premise.
You are right that the technical hurdles are not insane to get this going, but most companies don’t know where to begin and there’s no huge marketing blitz telling the business leaders this is realistically on the table and here’s the company you can call to make it happen for you.
Even if you overcame that and proposed a concrete way to get going, you will still probably hit the aversion to capex that has persisted since Amazon told the industry that capex is toxic and you really want all your money to be spent on opex. Big companies like Amazon will take on that scary CapEx for you and your expenses will be nice and just OpEx. Coincidentally, the companies that spend the most on CapEx manage to pull in more revenue and profit than you will ever dream to, but still, remember CapEx is toxic.
Because the people selling the AI want to make sure their customers don’t know about this. It’s all about causing a dependency so they get subscription income forever.
Probably more expensive than the subsidized costs. Hmm…
H100 GPUs cost $25k, and have 80GB of RAM. Kimi k2.6 has 1.1T parameters. Assuming 8 bit quantization, would need 14 GPUs to run a single agent at a time (I’m not sure the cloud models use quantization; it could be double). So, $350k per vibecoding dev on GPUs alone. Life expectancy is ~4 years, so ~90k/year amortized. This is ignoring the significant electrical/HVAC cost of handling 10KW of electricity and heat per vibecoding dev (and tons of other costs as well).
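The back-of-the-envelope math above can be reproduced in a few lines. This is only a sketch of that estimate: the H100 price and memory, the 1.1T parameter count, the 8-bit weights and the 4-year lifetime all come from the comment, while the 700 W per GPU is an added assumption used to sanity-check the ~10KW figure. KV cache and activations are ignored.

```python
import math

# Inputs taken from the comment above.
PARAMS = 1.1e12            # Kimi K2.6 parameter count, as quoted
BITS_PER_WEIGHT = 8        # assumed 8-bit quantization
GPU_MEMORY_GB = 80         # H100 memory
GPU_PRICE_USD = 25_000     # H100 price, as quoted
LIFETIME_YEARS = 4

# Assumption (not from the comment): roughly 700 W per GPU under load.
GPU_POWER_W = 700


def gpus_for_weights(params, bits_per_weight, gpu_memory_gb):
    """GPUs needed just to hold the weights (no KV cache, no activations)."""
    weight_gb = params * bits_per_weight / 8 / 1e9
    return math.ceil(weight_gb / gpu_memory_gb)


gpus = gpus_for_weights(PARAMS, BITS_PER_WEIGHT, GPU_MEMORY_GB)
capex = gpus * GPU_PRICE_USD
per_year = capex / LIFETIME_YEARS
power_kw = gpus * GPU_POWER_W / 1000

print(f"{gpus} GPUs, ${capex:,} up front, ${per_year:,.0f}/year, {power_kw:.1f} kW")
```

The $87,500/year result is what the comment rounds to ~90k. Setting `BITS_PER_WEIGHT` to 16 covers the "it could be double" case.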
Of course, but that’s exactly the problem. OpenAI and Anthropic are preparing to IPO, so they must now demonstrate profits on inference. The time to take advantage of subsidized compute is in the past, and the subscription and per-token prices that they offer for inference are skyrocketing, overwhelming the budgets of companies that somehow did not see this bait-and-switch pricing coming.
No lol. These same hardware requirements would apply to the cloud hosted models as well, so if that’s how it worked, you’re suggesting that Anthropic, OpenAI, Meta, and Google have purchased ~14 H100 GPUs per user that they serve???
That would be literally billions of GPUs, while it is estimated that in 2024, Google’s AI division owned only 26,000 H100 GPUs and Meta owned the most H100 GPUs of any company at 350,000 units. These GPUs have very high throughput for inference and can serve many users, because that is exactly what they have been designed to do.
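To put rough numbers on that objection, the sketch below takes the 14-GPUs-per-model-copy figure from the parent comment and the fleet sizes quoted here, and asks how many copies of the model each fleet could hold if every GPU did nothing but inference. The function name is made up for illustration.

```python
GPUS_PER_REPLICA = 14  # from the parent comment's estimate

# H100 counts as quoted in the comment above (2024 estimates).
fleets = {"Google": 26_000, "Meta": 350_000}


def max_replicas(fleet_size, gpus_per_replica=GPUS_PER_REPLICA):
    """Whole model copies that fit in a fleet, ignoring training and spares."""
    return fleet_size // gpus_per_replica


for owner, size in fleets.items():
    print(f"{owner}: at most {max_replicas(size):,} model copies")
```

If each copy could only serve one user at a time, even the largest quoted fleet would top out at 25,000 simultaneous users, so each copy has to be shared across many requests.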
they absolutely do, yeah
Not per user, but it’s probably a decent rough estimate per vibecoding dev who is continually running agents 8+ hours/day. Some people’s “workflows” involve running multiple parallel agents sometimes, or even a significant portion of the time (using the git worktree feature). I imagine the limit would be serving 10 of these types of “devs.” Of course, there’s batching and stuff that can be done, but I think it still slows everybody else down near linearly. H100s aren’t the only accelerators used for inference; I just chose it as an example. Google has ~5 million H100 equivalent accelerators, Microsoft has 3.5 million, and Amazon has 2.5 million (https://www.networkworld.com/article/4156949/google-owns-the-most-ai-compute-and-it-built-it-its-way.html).
Even so, your numbers are still a tiny fraction of GPU units compared to concurrent users, and the limit you “imagine” is just that, imagined.
And you do need to remember that the majority of the compute at these companies is used for model training and not used for inference.
I’m not a developer and I don’t know a thing about the capabilities of LLMs so this may explain that, but I’m quite surprised that open weight LLMs could actually match Claude.
Yes, the big proprietary cloud models have an edge, but it is narrow and the open-weight models are constantly closing the gap. There is no moat when it comes to AI models and no company has yet discovered some secret special sauce to improve their model significantly over others.
Running the latest and greatest open-weight GLM, Kimi, or Qwen model is basically equivalent to running the previous latest and greatest version of Claude. So if you were happy with Claude then, you’ll basically be happy with an open-weight model now.
Well it’s the speed and processing power, i dont believe you can get anywhere close to cloud claude performance on any standard desktop
Surprisingly, yes you absolutely can with Qwen3.6 35b. Also, a business would be putting together a dedicated inference server to serve many users, not any standard desktop.
I see, but im guessing that OP dumbass literally wants to run llm on their laptops lol
Matching current Claude isn’t possible, but matching Claude from 6-12 months ago should be possible with an open model
Mostly down to frameworks (the bits around the LLM like RAG, memory, prompts, agents etc.) now. The ability to just throw more tokens at the problem is also super important. And you can because you’re just paying for electricity (and CapEx for the hardware), not tokens from companies that are doing pre-IPO monetization (i.e. tokens gonna go up, way up). They’ve been losing money hand over fist to gain market share and pump the idea, that was never going to last.
Pretty sure these AI companies are running at a cost, and due to AI Scaling Laws you hit the accuracy limit a lot sooner with a smaller model so it would probably be both worse and more expensive.
I could see how you might think speedrunning bankruptcy is similar to being “ahead of the curve” in this economy, though.
No that’s not how this works. Inference is cheap and efficient. AI companies are bankrupting themselves with training costs that they need to recoup back by selling inference. Open-weight models have already been trained.
Also, going big in terms of model size shows diminishing marginal returns on accuracy, not efficiency of scale. Smaller models are way more efficient and consistently catch up to the largest models, which is why today’s SOTA 27 billion parameter model competes with yesterday’s SOTA 500+ billion parameter model.
I think they hit a wall in actual returns on performance with pretraining, years ago. Then they started scaling up on post-training/reinforcement learning to continue improvement, but that might be hitting a plateau as well. More recently it looks like they’re relying more heavily on scaling up on inference, which is a significant problem for their long term business models.
If they’re not able to cheaply deliver inference (and charge at a premium), how will they be able to sustain their businesses?
It seems that the most recent, largest models are using a lot more tokens to accomplish the same tasks, so even as token cost drops the actual cost of using the latest models seems to be going up with time (even as performance improves).
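The effect described above is just cost per task = tokens used x price per token. A tiny sketch, with token counts and prices that are entirely hypothetical and chosen only to show how a cheaper per-token price can still mean a more expensive task:

```python
def cost_per_task(tokens_used, usd_per_million_tokens):
    """Cost of one task = tokens consumed x price per token."""
    return tokens_used * usd_per_million_tokens / 1_000_000


# Hypothetical numbers, chosen only to illustrate the effect.
older = cost_per_task(tokens_used=200_000, usd_per_million_tokens=10.0)
newer = cost_per_task(tokens_used=600_000, usd_per_million_tokens=5.0)

print(f"older model: ${older:.2f} per task")
print(f"newer model: ${newer:.2f} per task")
```

Here the per-token price halves but the task uses three times as many tokens, so the bill for the same task goes up by half.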
I definitely agree that they have a big problem on their hands, and are in deep deep trouble. They are in a position where they must sell a service that is very cheap in order to pay for up front costs that were very expensive.
This is also why the release of Deepseek was such a devastating blow to US AI companies. It proved that:
they don’t really have a moat that would lock users into their service, or secret special knowledge that prevents other companies from training competitive models. They’re in a race to the bottom
Deepseek was not only able to train a model of the same caliber, but they were able to do it at a tiny fraction of the cost that US AI companies spent on training US models. Because they spent so much less on training, it means that Deepseek is able to undercut the US companies and offer inference at a much lower price
Tell that to all the Github users that are screaming about the new token based billing. In reality inference on these massive models with big context windows is expensive, but was subsidized so hard, that nobody has an accurate feeling for the cost.
No, it is cheap and efficient. It is relative, and the comparison is to model training. But yeah, it’s not free
Sure it’s much much cheaper than training, but importantly those companies are not recouping anything with inference because it is still more expensive than what they are selling it for.
They are double bankrupting themselves.
At work we run inference for a research project with an open-weights model in the public cloud that another part of my company provides, and we pay around $25 a day for a VM with a single L40S. It’s both slow - despite not even serving concurrent users - and kind of bad in its outputs.
Edit: Interference -> Inference, arguing on the internet after waking up first thing in the morning might not have been the best idea
There’s a big difference between training a model, running a model, and running a model at scale.
A small, self-hosted setup will have lower accuracy and fewer queries per second, and it will have a cost, but the cost will be no more than playing a video game. You’ll still have something surprisingly accurate and responsive for some tasks, like being a wiki interface or something.
Remember that some of these models can run on a standard smartphone, and remember all the hoopla when people found that Chrome was downloading models onto people’s devices.
I am pretty negative on AI but there is a point there. I tried the open-weight local model Gemma 4 31B, and while it likely cannot compete with the best Claude has to offer today, it might be on par with Claude from a year ago, at least for certain applications. With a local model the data stays on your system and you are in control of the costs (no sudden price hikes). But local models aren’t free either: they still guzzle compute, just on your own hardware (or rented hardware).
At least there they have hard numbers, without a CEO dreaming about future possibilities and whatnot.
Yeah, I doubt the manager has thought that far
Hence asking questions
I know for a fact that Dell is coming out with a server appliance to do this. I mean, you can make one yourself right now, but once the OEMs start pumping them out it’s going to be interesting
What kind of hardware would be needed to run such a beast?
a 128GB framework desktop could do that job. it’s increased a bit in price since i last looked at it but €4500 isn’t that much for a company.
Maybe to serve an aggressively quantized model to one very patient user.
i’m running moderately quantized models on 24GB VRAM and getting like 30-40 tokens a second. add a zero to the price and it’s still not a lot for a company.
Sure, but you’re running a very small model compared to what we are talking about.
GLM-5.1 is over 200GB even when quantized to 1-bit. Kimi K2.6 is even bigger. A framework desktop cannot run either of these. Qwen3.6 is significantly smaller and the model weights could fit, but consider the KV-cache you’d need for all of the company’s users, and the throughput required to serve them all.
You’re right that it is within reach for a company but framework desktop makes zero sense for this
isn’t qwen like 40-50GB? that could work i think. performance is okay even quantised down to 10.
And then add 200k context on top
And then add hundreds of users needing to do things in parallel
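To put rough numbers on the context and parallel-user point, here is a back-of-the-envelope sketch for a standard transformer with grouped-query attention: each token stores one key and one value vector per layer per KV head. The layer count, head count and head size below are hypothetical placeholders, not the real config of Qwen or any other model, and this is the worst case where every user fills the whole context. Models with other attention designs, or servers that quantize or share the cache, will need less.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, users: int,
                   bytes_per_value: int = 2) -> int:
    """Worst-case KV-cache size in bytes for `users` full-length contexts."""
    # Factor 2 = one key vector plus one value vector per layer and KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * users


# Hypothetical model shape (NOT taken from any real model card), 16-bit cache.
one_user = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128,
                          context_tokens=200_000, users=1)
hundred_users = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128,
                               context_tokens=200_000, users=100)

print(f"1 user at 200k context:    ~{one_user / 1e9:.1f} GB")
print(f"100 users at 200k context: ~{hundred_users / 1e9:.1f} GB")
```

Under these assumptions the cache for a single full 200k context is already larger than the quantized weights being discussed, which is why memory for the weights alone says little about serving a whole company.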
If it’s a large enough company to have hundreds of users, it can afford several beefy machines tbh
nobody said anything about it being a large company :P
anyway, seems the framework is hampered by a slow gpu so the memory issues are apparently moot.
Qwen3.6 27b beats Claude Opus 4.5 in most benchmarks. Qwen3.6 35b beats Opus 4.5 in a few specific benchmarks, but most benchmarks have Opus 4.5 beating Qwen3.6 35b, although there is not a big gap between Opus 4.5 and Qwen3.6 27b or 35b either way.
we were talking about 3.6.
deepseek distilled is an alternative that works on more modest hardware.
and i’m not really interested in what claude and chatgpt, mistral and the others are doing, i would never touch those models with a ten foot pole. if i can’t run it it does not get run.
At Q8 it is around 35-40GB I think + memory for required context.
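For anyone wanting to sanity-check that kind of figure, a minimal sketch: weight memory is roughly parameter count times bits per weight, divided by 8 to get bytes. This assumes the 35B variant mentioned above and ignores everything except the raw weights. Real quantized files tend to run somewhat larger because block-wise formats store scale factors alongside the weights, and context memory comes on top.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# 35e9 parameters is an assumption based on the "35b" name, not a model card.
for bits in (16, 8, 4):
    print(f"35B params at {bits}-bit: ~{weight_size_gb(35e9, bits):.1f} GB")
```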
I have a Framework desktop. It gets you around 6 t/s. Not suitable for professional use, but for personal use I think it is fine. I do prefer Gemma 4 though, but that comes with similar requirements.
huh, i thought that ryzen ai thing would perform better than that. my 7900xtx regularly gets 30+tps with qwen, up to hundreds with more compressed models.
Yes. It will probably work for 1-2 users at peak.
no… no, they can’t match Claude currently
Might want to update yourself with current benchmarks.
[citation needed]