Meta Llama 3.1 405B Instruct IQ3_XS or IQ2_KL and a note about numbering

#183
by jhofseth - opened

It disappeared after my download of IQ3_XS got corrupted. LM Studio requires the filename to end in .gguf. That may be part of why filename-ignorant feedback misleadingly prompted you to take the repos down. You could simply rename them:

Meta-Llama-3.1-405B-Instruct.IQ3_XS-00001-of-00004.gguf
Meta-Llama-3.1-405B-Instruct.IQ3_XS-00002-of-00004.gguf
Meta-Llama-3.1-405B-Instruct.IQ3_XS-00003-of-00004.gguf
Meta-Llama-3.1-405B-Instruct.IQ3_XS-00004-of-00004.gguf

Also, I don't know if IQ2_KL is possible, but that might be useful? Anyway, thank you and keep up the great work! :-)

llama.cpp had no support for llama-3.1 rope scaling until it was implemented a few hours ago. Therefore, I am redoing the quants. To use split quants, you need to concatenate them (follow the link to TheBloke's model to see how that is done). Renaming is unlikely to help (see the FAQ for a more in-depth explanation). And due to the size of the model, it will take a while before all quants are available.
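For illustration, a minimal sketch of the concatenation step, assuming the IQ3_XS quant ends up split into four parts under the usual partNofM naming (the actual part count and file names may differ):

cat Meta-Llama-3.1-405B-Instruct.IQ3_XS.gguf.part1of4 \
    Meta-Llama-3.1-405B-Instruct.IQ3_XS.gguf.part2of4 \
    Meta-Llama-3.1-405B-Instruct.IQ3_XS.gguf.part3of4 \
    Meta-Llama-3.1-405B-Instruct.IQ3_XS.gguf.part4of4 \
    > Meta-Llama-3.1-405B-Instruct.IQ3_XS.gguf

On Windows, copy /B with the parts joined by + signs does the same thing.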

mradermacher changed discussion status to closed

TheBloke's guide is kind of bad in my opinion, as it downloads all the parts manually one by one and then concatenates them locally. Doing so is a huge inconvenience and a waste of time. I recommend instead adopting the following command to automatically download them all directly into a concatenated file:

curl -L https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-Instruct-i1-GGUF/resolve/main/Meta-Llama-3.1-70B-Instruct.i1-Q6_K.gguf.part[1-2]of2?download=true > Meta-Llama-3.1-70B-Instruct.i1-Q6_K.gguf

While writing the above example, I noticed that Meta-Llama-3.1-70B and a lot of other Llama 3.1-based models that were quantized before the rope scaling fix will unfortunately likely need to be redone as well.

I'd be happy if somebody wrote a better guide, but note that I don't agree with your assessment - not everybody has the great internet connection we have, and if something goes wrong with your "better" method, you have to start over from scratch. TheBloke's method lets people use their normal resumable download tools, and the split second it takes to concatenate the files (well, with my quants and a modern fs at least) is in no way a huge inconvenience. In fact, it might be much more convenient to click the parts in your browser and use cat rather than type/paste a large curl command line. It's certainly an alternative that could be mentioned, but calling it categorically more convenient is factually wrong :)
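A rough sketch of that route, reusing the 70B Q6_K part names from the curl example above and plain wget (the -c flag resumes a partially downloaded part after a dropped connection):

wget -c "https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-Instruct-i1-GGUF/resolve/main/Meta-Llama-3.1-70B-Instruct.i1-Q6_K.gguf.part1of2"
wget -c "https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-Instruct-i1-GGUF/resolve/main/Meta-Llama-3.1-70B-Instruct.i1-Q6_K.gguf.part2of2"

and then concatenate the two parts with cat once both downloads have finished.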

Your last comment is a bit cryptic - are you talking about quants not done by me? I can't do anything about those, obviously, so mentioning them here is a bit confusing.

> I'd be happy if somebody wrote a better guide, but note that I don't agree with your assessment - not everybody has the great internet connection we have, and if something goes wrong with your "better" method, you have to start over from scratch. TheBloke's method lets people use their normal resumable download tools, and the split second it takes to concatenate the files (well, with my quants and a modern fs at least) is in no way a huge inconvenience. In fact, it might be much more convenient to click the parts in your browser and use cat rather than type/paste a large curl command line. It's certainly an alternative that could be mentioned, but calling it categorically more convenient is factually wrong :)

Sorry, I completely forgot about people with bad internet connections. In that case, TheBloke's method obviously makes more sense, as the web browser lets you resume your download after encountering internet issues. The ability to download the parts in parallel could also speed it up. My AI LXC container has no GUI, so I had to download each file separately using wget. Back then I was also always low on storage and had to create a RAM disk just to concatenate them, so it was likely way more inconvenient for me than for most. The more I think about it, the more I see how, for a normal user on an OS with a graphical user interface and plenty of free disk space, TheBloke's method is far superior.
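For reference, a minimal sketch of such a RAM disk on Linux using tmpfs (the mount point, size and file names here are just placeholders, and mounting needs root):

mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=60G tmpfs /mnt/ramdisk   # size must fit the concatenated quant
cat model.gguf.part1of2 model.gguf.part2of2 > /mnt/ramdisk/model.gguf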

> Your last comment is a bit cryptic - are you talking about quants not done by me? I can't do anything about those, obviously, so mentioning them here is a bit confusing.

I was talking about https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-Instruct-i1-GGUF which I used in the example to demonstrate my method of concatenating while downloading. You already took it down and so are probably already working on requantizing it. Regarding the other models, I already answered in https://huggingface.co/mradermacher/Lumimaid-v0.2-70B-i1-GGUF/discussions/1#66a6b2380246fad8d997777f and it turns out they are fine; it was just me misunderstanding when you started using the new llama.cpp version. Sorry for the confusion.

Right, if your filesystem is not capable of zero-copy, then you might want to use a ramdisk to avoid a physical copy. Let's not talk too much about being so poor that one is forced to concatenate these tiny files entirely in RAM, though. :*)

And yes, if you thought everything quantized recently was without the new rope scaling, that would have been a lot of models - I was collecting models for a while now, only to release a torrent yesterday :)

I'll be so pissed if it turns out to be buggy (cf. https://github.com/ggerganov/llama.cpp/issues/8730 :)

Meta Llama 3.1 405B Instruct
Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks? Be concise.
🕯️ You can make 4 candles with 500 grams of wax (500 ÷ 125 = 4). However, you only have 3 wicks, so you can only make 3 candles. 🤔

Wow, and it only took a 405B.
