Probably another case of “I don’t want people training AI on my posts/images so I’m nuking my entire online existence”.
Without knowing anything about this model or what it was trained on or how it was trained, it’s impossible to say exactly why it displays this behavior. But there is no “hidden layer” in llama.cpp that allows for “hardcoded”/“built-in” content.
It is absolutely possible for the model to “override pretty much anything in the system context”. Consider any regular “censored” model, and how any attempt at adding system instructions to change/disable this behavior is mostly ignored. This model is probably doing much the same thing except with a “built-in story” rather than a message that says “As an AI assistant, I am not able to …”.
As I say, without knowing anything more about what model this is or what the training data looked like, it’s impossible to say exactly why/how it has learned this behavior or even if it’s intentional (this could just be a side-effect of the model being trained on a small selection of specific stories, or perhaps those stories were over-represented in the training data).
IMO, local LLMs lack the capabilities or depth of understanding to be useful for most practical tasks (e.g. writing code, automation, language analysis). This will heavily skew any local LLM “usage statistics” further towards RP/storytelling (a significant proportion of which will always be NSFW in nature).
The Stable Diffusion 2 base model was trained on what we would today call a “censored” dataset. The Stable Diffusion 1 dataset included NSFW images; the base model doesn’t seem particularly biased towards or away from them, and it can be further trained in either direction because it has a foundational understanding of what those things are.
There doesn’t appear to be a model anywhere, unless that has been published completely separately and not mentioned anywhere in the code documentation.
So… if this doesn’t actually increase the context window or otherwise increase the amount of text that the LLM is able to see/process, then how is it fundamentally different from just “manually” truncating the input to fit the context size, as everyone’s already been doing?
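For reference, the “manual” truncation everyone’s been doing is trivial: keep only the most recent tokens that fit. A minimal sketch (whitespace splitting stands in for a real tokenizer here):

```python
def truncate_to_context(text: str, max_tokens: int) -> str:
    """Keep only the most recent tokens that fit in the context window.

    Uses whitespace "tokens" as a stand-in for a real tokenizer; in
    practice you would tokenize with the model's own tokenizer and
    truncate on token IDs instead.
    """
    tokens = text.split()
    return " ".join(tokens[-max_tokens:])

print(truncate_to_context("one two three four five", 3))  # three four five
```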
I tried getting it to write out a simple melody using MIDI note numbers once. I didn’t think of asking for LilyPond format; I couldn’t think of a text-based music notation format at the time.
It was able to produce a mostly accurate output for a few popular children’s songs. It was also able to “improvise” a short blues riff (mostly keeping to the correct scale, and showing some awareness of/reference to common blues themes), and write an “answer” phrase (which was suitable and made musical sense) to a prompt phrase that I provided.
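For anyone unfamiliar with what this kind of output looks like: a melody in MIDI note numbers is just a list of integers, where middle C is 60. As a hypothetical illustration (not the model’s actual output), the opening of “Twinkle, Twinkle, Little Star” in C major:

```python
# MIDI note numbers: C4 = 60, G4 = 67, A4 = 69
TWINKLE = [60, 60, 67, 67, 69, 69, 67]

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_name(n: int) -> str:
    """Convert a MIDI note number to scientific pitch notation (60 -> C4)."""
    return NOTE_NAMES[n % 12] + str(n // 12 - 1)

print([midi_to_name(n) for n in TWINKLE])
# ['C4', 'C4', 'G4', 'G4', 'A4', 'A4', 'G4']
```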
Someone explain to me why there are so many frameworks focused on LLM-based “agents” (LangChain, {{guidance}}, and now whatever this is) and how these are practically useful, when I have yet to find a model that can even successfully perform a simple database query to answer an easy question (searching for one or two items by keyword, retrieving their quantity, and adding the quantities together if applicable) regardless of the model, prompt template, and function API used.
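To be concrete about how easy the task is: the tool the model needs to call correctly is no more complicated than this (the inventory data and function name are hypothetical, for illustration only):

```python
# Hypothetical inventory, standing in for a real database table
INVENTORY = [
    {"name": "red widget", "quantity": 5},
    {"name": "blue widget", "quantity": 3},
    {"name": "gizmo", "quantity": 7},
]

def search_quantity(keyword: str) -> int:
    """Find items whose name contains the keyword and sum their quantities."""
    return sum(item["quantity"] for item in INVENTORY
               if keyword.lower() in item["name"].lower())

print(search_quantity("widget"))  # 8 (5 red + 3 blue)
```

The model’s job in a function-calling setup is just to pick the right keyword(s), call this, and add the numbers together; in my experience that is where things fall apart.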
To be honest, the same could be said of LLaMA/Facebook (which doesn’t particularly claim to be “open”, yet I don’t see many people criticising Facebook for setting up a potential future marketing “bait and switch” with their LLMs).
They’re only giving these away for free because they aren’t commercially viable. If anyone actually develops a leading-edge LLM, I doubt they will be giving it away for free regardless of their prior “ethics”.
And the chance of a leading-edge LLM being developed by someone other than a company with prior plans to market it commercially is quite small, as they wouldn’t attract the same funding to cover the development costs.
IMO the availability of the dataset is less important than the model, especially if the model is under a license that allows fairly unrestricted use.
Datasets aren’t useful to most people, and they carry more risk of a lawsuit or of being ripped off by a competitor than the model does. Publishing a dataset with copyrighted content is legally grey at best, while the jury is still out on a model trained on that dataset; the model also carries with it some short-term plausible deniability.
There are only a few popular LLMs. A few more if you count variations such as “uncensored” versions. Most of the others tend not to perform well, or don’t differ much from the more popular ones.
I would think that the difference likely comes down to two reasons:
LLMs require more effort in curating the dataset for training. Whereas a Stable Diffusion model can be trained by grabbing a bunch of pictures of a particular subject or style and throwing them in a directory, an LLM requires careful gathering and reformatting of text. If you want an LLM to write dialog for a particular character, for example, you would need to try to find or write a lot of existing dialog for that character, which is generally harder than just searching for images on the internet.
LLMs are already more versatile. For example, most of the popular LLMs will already write dialog for a particular character (or at least attempt to) just by being given a description of the character and possibly a short snippet of sample dialog. Fine-tuning doesn’t give any significant performance improvement in that regard. If you want the LLM to write in a specific style, such as Old English, it is usually sufficient to just instruct it to do so and perhaps prime the conversation with a sentence or two written in that style.
WizardLM 13B (I didn’t notice any significant improvement with the 30B version): tends to be a bit confined to a standard output format at the expense of accuracy (e.g. it will always try to give both sides of an argument, even when there isn’t another side or the question isn’t an argument at all), but it is good for simple questions.
LLaMA 2 13B (not the chat-tuned version): this one takes some practice with prompting, as it doesn’t really understand conversation and won’t know what it’s supposed to do unless you make it clear from contextual clues. But it feels refreshing to use, as the model is (as far as is practical) unbiased/uncensored, so you don’t get all the annoying lectures and such.
Yeah, I think you need to set the contextsize and ropeconfig parameters. The documentation isn’t completely clear and in some places sort of implies that these should be autodetected from the model when using a recent version, but the first thing I would try is setting them explicitly, as this definitely looks like an encoding issue.
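Assuming this is KoboldCpp (where those flags live), setting them explicitly would look something like this. The model filename is a placeholder, and the values are only the common choice for an 8k SuperHOT model (linear rope scale 0.25 with base 10000); check what your particular model expects:

```shell
# Example KoboldCpp invocation; adjust values for your model
python koboldcpp.py your-model-superhot-8k.bin \
  --contextsize 8192 \
  --ropeconfig 0.25 10000
```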
I would guess that this is possibly an issue due to the model being a “SuperHOT” model. This affects the way that the context is encoded and if the software that uses the model isn’t set up correctly for it you will get issues such as repeated output or incoherent rambling with words that are only vaguely related to the topic.
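As I understand it, the “SuperHOT” trick is linear RoPE position scaling: positions are compressed by a fixed factor so that an 8k context maps into the 2k positional range the base model was trained on. If the runtime doesn’t apply the same scaling the model was fine-tuned with, the position encodings are wrong and you get exactly this kind of degraded output. A rough sketch of the idea (the function name is mine, not from any library):

```python
import math

def rope_angle(position: int, dim_pair: int, total_dims: int,
               scale: float = 1.0, base: float = 10000.0) -> float:
    """Rotation angle for one RoPE dimension pair.

    scale < 1 compresses positions, e.g. scale=0.25 maps 8k positions
    into the 2k range the base model saw during pretraining.
    """
    freq = base ** (-2 * dim_pair / total_dims)
    return (position * scale) * freq

# With scale=0.25, position 8000 gets the same angle as unscaled position 2000
assert math.isclose(rope_angle(8000, 0, 128, scale=0.25),
                    rope_angle(2000, 0, 128))
```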
Unfortunately I haven’t used these models myself so I don’t have any personal experience here but hopefully this is a starting point for your searches. Check out the contextsize
and ropeconfig
parameters. If you are using the wrong context size or scaling factor then you will get incorrect results.
It might help if you posted a screenshot of your model settings (the screenshot that you posted is of your sampler settings). I’m not sure if you configured this in the GUI or if the only model settings that you have are the command-line ones (which are all defaults and probably not correct for an 8k model).
TBH my experience with SillyTavern was that it merely added another layer of complexity/confusion to the prompt formatting/template experience, as it runs on top of text-generation-webui anyway. It was easy for me to end up with configurations where e.g. the SillyTavern turn template would be wrapped inside the text-generation-webui one, and it is very difficult to verify what the prompt actually looks like by the time it reaches the model as this is not displayed in any UI or logs anywhere.
For most purposes I have given up on any UI/frontend and I just work with llama-cpp-python directly. I don’t even trust text-generation-webui’s “notebook” mode to use my configured sampling settings or to not insert extra end-of-text tokens or whatever.
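Building the prompt by hand is what makes this approach trustworthy: you can inspect exactly what reaches the model before generation. A minimal sketch using an Alpaca-style template (the right template depends on what your model was fine-tuned on; this is just one common format):

```python
def build_prompt(system: str, user: str) -> str:
    """Assemble an Alpaca-style prompt so the exact text sent to the
    model can be printed/logged before generation."""
    return (f"{system}\n\n"
            f"### Instruction:\n{user}\n\n"
            f"### Response:\n")

prompt = build_prompt("You are a helpful assistant.", "Name a prime number.")
print(prompt)  # inspect exactly what the model will see
# then pass `prompt` to llama-cpp-python for generation
```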
Yes, that makes more sense. I was initially concerned that you were looking to buy a new GPU with more VRAM solely because you were unable to do something that you should already be able to do, and that this would be an unnecessary expense and/or wouldn’t actually fix the problem; you would be somewhat mad at yourself if you found out afterwards that “oh, I just needed to change this setting”.
Fair enough, but if your baseline for comparison is wrong then you can’t make good assessments of the capabilities of different GPUs. And it’s possible that you don’t actually need a new GPU/more VRAM anyway, if your goal is to generate 1024x1024 in Stable Diffusion and run a 13B LLM, both of which I can do with 8 GB of VRAM.
text-generation-webui’s “chat” and “chat-instruct” modes are… weird and badly documented when it comes to using a specific prompt template. If you don’t want to use the notebook mode, use “instruct” mode, set your turn template with the required tags, and include your system prompt in the context box (I forget exactly what it is labelled).
EDIT: Actually, I think text-generation-webui might use <|user|> as a special string meaning “substitute the user prefix set in the box directly above the turn template box”. Why they have a turn template field with “macro” functionality plus separate fields for user and bot prefixes, when you could just put the prefix directly in the turn template, I have no idea. It’s not as though you would ever want or need to change one without the other. But it’s possible that, as a result of this, you can’t actually use <|user|> itself in the turn template…
What sort of issues are you getting trying to generate 1024x1024 images in Stable Diffusion? I’ve generated up to 1536x1024 without issue on a 1070 (although it takes a few minutes) and could probably go even larger (this was in img2img mode which uses more VRAM as well - although at that size you usually won’t get good results with txt2img anyway). What model are you using?
In that case ChatGPT is correct, it cannot work with links. You will need to download the video transcript (subtitles) yourself and ask it to summarise that. This definitely works, people have been doing it for months.
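Fetching the transcript yourself is straightforward with yt-dlp, for example (the URL is a placeholder; --write-auto-subs grabs YouTube’s auto-generated captions when no manual subtitles exist):

```shell
# Download only the English subtitles, not the video itself
yt-dlp --skip-download --write-subs --write-auto-subs \
  --sub-langs en "https://www.youtube.com/watch?v=VIDEO_ID"
```

You can then paste the resulting subtitle text into ChatGPT and ask for a summary.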