-
Re the RAM unload idea - remember: Stable Diffusion doesn't run the model once. It runs it as many times as you have steps - so, e.g., on a 20-step image it runs the model 20 times (or more with certain samplers).

Re SSD offload: say you have really fast loading that takes just 1 second to load the whole model - that's 20 seconds added just from loading, without any calculation. And that's a crazy fast hypothetical SSD. Let's take a more practical example: you have an excellent-tier SSD that's really fast, 500 MiB/s, and you're running an SDXL model, whose UNet takes ~5-6 GiB. That's 10-12 seconds to load the model in full. You have 20 steps, so 12 * 20 = 240 seconds, i.e. 4 minutes - just for the loading part, not counting the actual execution... which, if you have a GPU with less than 6 GiB of VRAM, is probably pretty slow too, considering the only GPUs that small are either (A) 3+ generations behind or (B) laptop-class GPUs.

Re splitting the model between two devices: if it's two GPUs in one PC, it is possible - in fact, I have implemented software that does this in the LLM world (oobabooga/text-generation-webui#2100) at (relatively) decent speeds. Slower than one GPU, but not terrible. SD will be worse off than LLMs because of the "it has to do it 20 times" part, except now it has to transfer in both directions, so 40 transfers (2x the number of steps). Between different PCs, this will be... quite slow. Even the fastest way to transfer data over LAN will naturally take noticeably longer than any form of internal data transfer. It should still be faster (on LAN) than SSD offloading, though, as mid-model latent activations should be smaller than the model itself. Outside of LAN, over the internet, transfer speeds will hurt badly unless you have very, very fast internet service. This has also been done with LLMs over the open internet before (e.g. the BLOOM network project), but those projects are horrifically slow.
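For anyone who wants to plug in their own numbers, here is a quick back-of-the-envelope script (just a sketch; the ~6 GiB UNet size, 20 steps, and the storage speeds are the example figures from this thread plus a hypothetical 7 GB/s Gen 4 NVMe, not benchmarks):

```python
# Back-of-the-envelope cost of re-reading the whole model from storage on every step.
# Numbers mirror the examples in this thread: a ~6 GiB SDXL UNet, 20 sampling steps,
# and a few hypothetical storage speeds.

GIB = 1024**3

def per_image_load_overhead(model_bytes: int, read_bytes_per_s: float, steps: int) -> float:
    """Seconds spent just re-reading the model weights for one image."""
    return model_bytes / read_bytes_per_s * steps

model_bytes = 6 * GIB   # SDXL UNet, roughly
steps = 20              # typical sampler step count

scenarios = {
    "hypothetical 1 s/load SSD": model_bytes / 1.0,  # speed chosen so a full load takes 1 s
    "500 MiB/s SATA-class SSD": 500 * 1024**2,
    "7 GB/s PCIe Gen 4 NVMe":   7e9,
}

for name, speed in scenarios.items():
    overhead = per_image_load_overhead(model_bytes, speed, steps)
    print(f"{name:28s} -> {overhead:7.1f} s of pure loading per {steps}-step image")
```

Even the fast NVMe case ends up adding roughly 18 seconds of pure I/O per image on top of the actual compute, which is the core of the objection above.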
-
We are able to split models in half, run the first part on one PC, then transfer the activations and run the rest of the model on another PC. If we only have one PC, we could offload the unused part into system RAM. However, let's imagine that our model is so large that this approach is not feasible, or that we simply don't have enough space in system RAM. Instead, we could read the current model part from storage. This would slow the process down, but at least the model would be able to run at all. In the future, models will become much larger, requiring more system RAM and VRAM. If we consider NVMe storage, the model-part switching could become quite fast. With PCIe Gen 4 we can read at about 7 GB/s, meaning we can transfer roughly 21 GB of data into VRAM in just 3 seconds. With PCIe Gen 5 we will get double that speed. Additionally, NVMe drives are much easier to upgrade than a GPU or its VRAM. And if the CPU becomes a bottleneck, we might be able to use DirectStorage to load the model into VRAM even faster, bypassing the CPU. What do you think about that?
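To make the "only the active part lives in VRAM" idea concrete, here is a minimal PyTorch sketch (the two-block split and the toy layer sizes are invented for illustration; a real diffusion UNet would be split along its actual block structure):

```python
import torch
import torch.nn as nn

# Minimal sketch of sequential offloading: keep only the part of the model that is
# currently executing in VRAM and park the rest in system RAM. The same pattern would
# work with disk instead of RAM, just with the much higher reload cost estimated above.

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins for the two halves of a larger model.
part_a = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))
part_b = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

@torch.no_grad()
def run_split(x: torch.Tensor) -> torch.Tensor:
    # First half: move to the GPU, run, then evict back to system RAM.
    part_a.to(device)
    x = part_a(x.to(device))
    part_a.to("cpu")

    # Second half: only now does it occupy VRAM.
    part_b.to(device)
    x = part_b(x)
    part_b.to("cpu")
    return x.cpu()

x = torch.randn(1, 4096)
# In a diffusion sampler this whole swap would repeat once per step,
# which is why the per-step reload cost dominates.
out = run_split(x)
print(out.shape)
```

Between two PCs, the .to(device) hops for the intermediate tensor would become network transfers of the activations, and with disk instead of system RAM the eviction/reload cost grows by the factors estimated in the script further up.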