

Claude’s system prompt leaked at one point; it was a whopping 15K words and included a directive that if it were asked a math question “that you can’t do in your brain” (or very similar language), it should forward it to the calculator module.
Just tried it; Sonnet 4 got even fewer digits right:
425,808 × 547,958 = 233,325,693,264
(correct is 233,324,900,064)
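For anyone who wants to double-check, here’s the arithmetic in plain Python, which of course gets it exactly right:

```python
# The multiplication Sonnet 4 was asked to do.
a = 425_808
b = 547_958

product = a * b
print(product)  # 233324900064 -- the correct answer

# Sonnet 4's answer, for comparison:
sonnet_answer = 233_325_693_264
print(sonnet_answer - product)  # 793200 -- how far off it was
```

Interestingly, the model gets the leading and trailing digits right and garbles the middle, which is a pretty typical failure mode for token-by-token arithmetic.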
I’d love to see benchmarks on exactly how bad at numbers LLMs are, since I’m assuming there’s very little useful syntactic information you can encode in a word embedding that corresponds to a number. I know RAG was notoriously bad at matching facts with their proper year, for instance. And using an LLM as a shopping assistant (“ChatGPT, what’s the best 2K monitor under $500 made after 2020?”) is an incredibly obvious use case that the CEOs who love to claim such-and-such profession will be done as a human endeavor by next Tuesday after lunch won’t even allude to.
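A rough sketch of what such a benchmark harness could look like (the `ask` callable is a placeholder for whatever model API you’d actually wire in; everything here is hypothetical, not an existing benchmark):

```python
import random

def digit_accuracy(model_answer: int, correct: int) -> float:
    """Fraction of digits (left-aligned) that match the correct product."""
    got, want = str(model_answer), str(correct)
    matches = sum(1 for x, y in zip(got, want) if x == y)
    return matches / max(len(got), len(want))

def make_problems(n: int = 100, digits: int = 6, seed: int = 0):
    """Random n-by-n-digit multiplication problems, seeded for repeatability."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)]

def benchmark(ask) -> float:
    """Mean digit accuracy of `ask`, any callable (a, b) -> int."""
    scores = [digit_accuracy(ask(a, b), a * b) for a, b in make_problems()]
    return sum(scores) / len(scores)
```

A ground-truth model (`benchmark(lambda a, b: a * b)`) scores 1.0; Sonnet’s answer above scores about 0.58 on this metric, since only the first five and last two of twelve digits match.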
Anecdotally, it took like a week and a half from the C-suite okaying Copilot for people to start considering googling beneath them and to begin escalating the literal dumbest shit to me just because Copilot was having a hard time with it.