ChatWaifu

Run your LLMs on LM Studio instead of Waifu (Better Performance/More Control)
By Chronic-X
Just a quick tutorial for anyone who wants to run their models in LM Studio and connect ChatWaifu to it locally. When Waifu imports settings, it seems to split the load roughly 50/50 between your GPU and CPU. If your GPU can handle loading the full LLM on its own, that split is a major slowdown.

First, follow my other guide for setting up LM Studio. Host it and set your GPU/CPU layers; the more layers you offload to the GPU, the more of your card's VRAM it will use. Set up your LLM, and after it loads in, check your resource monitor under the Performance tab. On your primary GPU you should see most of the dedicated memory in use once the model is loaded, e.g. 7.8GB out of 8GB. Increasing the context size or GPU layer count increases the amount of dedicated GPU memory reserved for generation and processing.
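If you want a rough starting point for how many layers to offload, here's a back-of-the-envelope sketch (plain Python, not an LM Studio API). It assumes the .gguf file size is a fair proxy for weight memory and that layers are roughly equal in size, which is only approximately true; KV cache and overhead come on top. All the numbers below are example values, not measurements.

```python
# Back-of-the-envelope VRAM estimate for picking a GPU layer count.
# Rule of thumb only: file size ~ weight memory, layers ~ equal size.
MODEL_FILE_GB = 8.5   # size of the .gguf on disk (example value)
TOTAL_LAYERS = 40     # layer count LM Studio reports for the model
GPU_LAYERS = 32       # layers you plan to offload
VRAM_GB = 8.0         # dedicated VRAM on your card

weights_on_gpu = MODEL_FILE_GB * GPU_LAYERS / TOTAL_LAYERS
headroom = VRAM_GB - weights_on_gpu
print(f"~{weights_on_gpu:.1f} GB of weights on the GPU, "
      f"~{headroom:.1f} GB left for context (KV cache) and overhead")
```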

I also recommend running in a "Balanced" power mode so that your PC will use any onboard GPU for lesser video tasks. This frees the FULL discrete GPU for LLM generation. If you don't have the full number of layers loaded and your GPU still shows free dedicated memory, increase the layers. If your model loads fine but NOT after you increase the context size, lower the layer count by a couple and then raise the context again. I usually aim for good performance with at least 8-16K context. Anything over 5-6 tokens per second of generation is usable. (You can see your generation speed in LM Studio: send your LLM a message in the "Chat" tab, and when it finishes generating a response, a small list of details appears just under the reply; tok/sec is your generation speed.)
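If you'd rather measure speed from a script than eyeball the Chat tab, here's a minimal sketch that times streamed chunks from LM Studio's OpenAI-compatible server. It assumes the default local address (http://localhost:1234), the Python `requests` package, and a placeholder model name you'd swap for a real ID from /v1/models; each streamed chunk is only roughly one token, so treat the number as an estimate.

```python
# Rough tok/sec benchmark against LM Studio's OpenAI-compatible server.
import json
import time

import requests

BASE_URL = "http://localhost:1234/v1"  # default "Local Server Address"
MODEL = "your-model-name"              # placeholder; list real IDs via /v1/models

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Write two sentences about tea."}],
        "stream": True,
    },
    stream=True,
    timeout=300,
)

chunks = 0
start = None
for line in resp.iter_lines():
    # Streaming responses arrive as server-sent events: "data: {...}" lines.
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    if delta.get("content"):
        if start is None:
            start = time.time()  # start at the first token, skipping prompt processing
        chunks += 1

if start and chunks > 1:
    elapsed = time.time() - start
    print(f"~{chunks / elapsed:.1f} tok/sec over {chunks} chunks")
```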

Now for the setup; this part is easy:

First, subscribe to the ChatGPT API plugin on the Workshop.

Open LM Studio.
Navigate to the "Developer" tab on the left.
Load in your model.
At the bottom right, you'll see the "API Usage" information window. Copy the "Local Server Address". (If you want to sanity-check the server before touching Waifu, see the snippet below.)
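Here's a minimal check, assuming the default address and the `requests` package. It hits the standard OpenAI-compatible /v1/models endpoint, which is presumably the same list Waifu's "Update List" button fetches.

```python
# Confirm the LM Studio server is reachable and see which models it serves.
import requests

BASE_URL = "http://localhost:1234/v1"  # replace with your copied "Local Server Address"

models = requests.get(f"{BASE_URL}/models", timeout=10).json()
for m in models["data"]:
    print(m["id"])  # these IDs should match what shows up in Waifu's dropdown
```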

Open Waifu.
Under the "LLM" tab, choose the ChatGPT API.

Paste the "Local Server Address" into the "URL" field.
Just put a few random letters/numbers in the "API" field; the local server doesn't validate the key, so any placeholder works.
Click "Update List" and your model should show up. Select it in the dropdown menu and then "Apply" the settings.

Talk to your LLM; you should see everything processing in LM Studio when you talk to your waifu. (This may require you to reload the app.)
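If nothing happens on the LM Studio side, you can rule out the server by sending the same kind of request Waifu's ChatGPT API plugin sends. A minimal non-streaming sketch, again assuming the default address and a placeholder model name:

```python
# One-shot test of the chat endpoint the ChatGPT API plugin talks to.
import requests

BASE_URL = "http://localhost:1234/v1"
MODEL = "your-model-name"  # placeholder; use an ID from /v1/models

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```

If this prints a reply but Waifu stays silent, the problem is on the Waifu side (URL, model selection, or the app needing a reload).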

By doing the above I managed to increase tokens per second from 2 to 5.8 on a 15B model.

Keep in mind that while LM Studio is running with your LLM loaded, that dedicated GPU memory stays reserved. This can have a major negative impact on gaming etc. while the model is generating if you don't have the GPU resources to handle both. If you plan to play a low-resource game or something along those lines while running your LLM, load it in with fewer layers so that it doesn't reserve as much of your GPU's resource pool.

   
Comments
Chronic-X  [author] 20 Dec, 2024 @ 5:59pm 
For comparison, I was running marco-o1-uncensor via Waifu and getting 3-5 TPS generation with response times of 30+ seconds. With the same Q08 model loaded into LM Studio, I can assign it 24 layers and give it 32K of context memory while still generating 15+ TPS. That shortens my response times to around 10 seconds.