-
Re the RAM unload idea - remember: Stable Diffusion doesn't run the model once. It runs it as many times as you have steps - so, e.g., on a 20-step image it runs the model 20 times (or more with certain samplers).

Re SSD offload: say you have really fast loading that takes just 1 second to load the whole model - that's 20 seconds added just from loading, without any calculation. And that's a crazy fast hypothetical SSD. Let's take a more practical example: you have an excellent-tier SSD that's really fast, 500 MiB/s, and you're running an SDXL model, whose UNet takes ~5-6 GiB. That's 10-12 seconds to load the model in full. You have 20 steps, so 12 * 20 = 240 seconds, i.e. 4 minutes - just for the loading part, not counting the actual execution... which, if you have a GPU with less than 6 GiB of VRAM, is probably pretty slow too, considering the only GPUs that small are either (A) 3+ generations behind or (B) laptop-class GPUs.

Re splitting the model between two devices: if it's two GPUs in one PC, it is possible - in fact, I have implemented software that does this in the LLM world (oobabooga/text-generation-webui#2100) at (relatively) decent speeds. Slower than one GPU, but not terrible. SD will be worse off than LLMs because of the "it has to do it 20 times" part, except now it has to transfer in both directions, so 40 transfers (2x the number of steps). Between different PCs, this will be... quite slow. Even the fastest way to transfer data over LAN will naturally take noticeably longer than any form of internal data transfer. It should still be faster (on LAN) than SSD offloading, though, as mid-model latent activations should be smaller than the model itself. Outside of LAN, over the internet, transfer speeds will hurt badly unless you have very, very fast internet service. This has also been done with LLMs over the open internet before (e.g. the BLOOM network project), but those projects are horrifically slow.
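For anyone who wants to plug in their own numbers, here is a quick back-of-the-envelope script (just a sketch; the ~6 GiB UNet size, 20 steps, and the storage speeds are the example figures from this thread plus a hypothetical 7 GB/s Gen 4 NVMe, not benchmarks):

```python
# Back-of-the-envelope cost of re-reading the whole model from storage on every step.
# Numbers mirror the examples in this thread: a ~6 GiB SDXL UNet, 20 sampling steps,
# and a few hypothetical storage speeds.

GIB = 1024**3

def per_image_load_overhead(model_bytes: int, read_bytes_per_s: float, steps: int) -> float:
    """Seconds spent just re-reading the model weights for one image."""
    return model_bytes / read_bytes_per_s * steps

model_bytes = 6 * GIB   # SDXL UNet, roughly
steps = 20              # typical sampler step count

scenarios = {
    "hypothetical 1 s/load SSD": model_bytes / 1.0,  # speed chosen so a full load takes 1 s
    "500 MiB/s SATA-class SSD": 500 * 1024**2,
    "7 GB/s PCIe Gen 4 NVMe":   7e9,
}

for name, speed in scenarios.items():
    overhead = per_image_load_overhead(model_bytes, speed, steps)
    print(f"{name:28s} -> {overhead:7.1f} s of pure loading per {steps}-step image")
```

Even the fast NVMe case ends up adding roughly 18 seconds of pure I/O per image on top of the actual compute, which is the core of the objection above.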
-
We are able to split models in half, run the first part on one PC, then transfer the activations and run the rest of the model on another PC. If we only have one PC, we could offload the unused part into system RAM. However, let's imagine that our model is so large that this approach is not feasible, or that we simply don't have enough space in system RAM. Instead, we could read the current model part from storage. This would slow the process down, but at least the model would be able to run at all. In the future, models will become much larger, requiring more system RAM and VRAM. If we consider NVMe storage, the model-part switching could become quite fast. With PCIe Gen 4 we can read at about 7 GB/s, meaning we can transfer roughly 21 GB of data into VRAM in just 3 seconds. With PCIe Gen 5 we will get double that speed. Additionally, NVMe drives are much easier to upgrade than a GPU or its VRAM. And if the CPU becomes a bottleneck, we might be able to use DirectStorage to load the model into VRAM even faster, bypassing the CPU. What do you think about that?
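To make the "only the active part lives in VRAM" idea concrete, here is a minimal PyTorch sketch (the two-block split and the toy layer sizes are invented for illustration; a real diffusion UNet would be split along its actual block structure):

```python
import torch
import torch.nn as nn

# Minimal sketch of sequential offloading: keep only the part of the model that is
# currently executing in VRAM and park the rest in system RAM. The same pattern would
# work with disk instead of RAM, just with the much higher reload cost estimated above.

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins for the two halves of a larger model.
part_a = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))
part_b = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

@torch.no_grad()
def run_split(x: torch.Tensor) -> torch.Tensor:
    # First half: move to the GPU, run, then evict back to system RAM.
    part_a.to(device)
    x = part_a(x.to(device))
    part_a.to("cpu")

    # Second half: only now does it occupy VRAM.
    part_b.to(device)
    x = part_b(x)
    part_b.to("cpu")
    return x.cpu()

x = torch.randn(1, 4096)
# In a diffusion sampler this whole swap would repeat once per step,
# which is why the per-step reload cost dominates.
out = run_split(x)
print(out.shape)
```

Between two PCs, the .to(device) hops for the intermediate tensor would become network transfers of the activations, and with disk instead of system RAM the eviction/reload cost grows by the factors estimated in the script further up.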