Stable Diffusion Benchmarks: 45 Nvidia, AMD, and Intel GPUs Compared

Stable Diffusion Introduction

Stable Diffusion and other AI-based image generation tools like Dall-E and Midjourney are some of the most popular uses of deep learning right now. Using trained networks to create images, videos, and text is no longer just a theoretical possibility but an everyday reality. While more advanced tools like ChatGPT can require large server installations with lots of hardware for training, running an already-trained network for inference can be done on your PC, using its graphics card. How fast are consumer GPUs for doing AI inference using Stable Diffusion? That’s what we’re here to investigate.

We’ve benchmarked Stable Diffusion, a popular AI image generator, on 45 of the latest Nvidia, AMD, and Intel GPUs to see how they stack up. We’ve been poking at Stable Diffusion for over a year now, and while earlier iterations were more difficult to get running — never mind running well — things have improved substantially. Not all AI projects have received the same level of effort as Stable Diffusion, but this should at least provide a fairly insightful look at what the various GPU architectures can manage with AI workloads given proper tuning and effort.

The easiest way to get Stable Diffusion running is via the Automatic1111 webui project. Except, that’s not the full story. Getting things to run on Nvidia GPUs is as simple as downloading, extracting, and running the contents of a single Zip file. But there are still additional steps required to extract improved performance, using the latest TensorRT extensions. Instructions are at that link, and we’ve previously tested Stable Diffusion TensorRT performance against the base model without tuning, if you want to see how things have improved over time. Now we’re adding results from all the RTX GPUs, from the RTX 2060 all the way up to the RTX 4090, using the TensorRT optimizations.
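
As a rough illustration of the underlying text-to-image workload (not the Automatic1111 webui or the TensorRT path we actually benchmarked), here’s a minimal sketch using Hugging Face’s diffusers library on an Nvidia GPU; the model ID, prompt, and step count are our own assumptions:

```python
# Minimal sketch of Stable Diffusion text-to-image inference using the
# Hugging Face diffusers library. This is NOT the Automatic1111 webui or
# TensorRT setup benchmarked in this article -- it only illustrates the
# kind of workload being measured. Model ID and step count are assumptions.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed SD1.5 checkpoint
    torch_dtype=torch.float16,          # FP16, typical for consumer GPUs
)
pipe = pipe.to("cuda")                  # Nvidia GPU; AMD/Intel need other backends

prompt = "a photo of an astronaut riding a horse"
n_images = 8
start = time.time()
for _ in range(n_images):
    pipe(prompt, height=512, width=512, num_inference_steps=20)
elapsed = time.time() - start
print(f"{n_images / (elapsed / 60):.1f} images per minute")
```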

For AMD and Intel GPUs, there are forks of the A1111 webui available that focus on DirectML and OpenVINO, respectively. We used these webui OpenVINO instructions to get Arc GPUs running, and these webui DirectML instructions for AMD GPUs. Our understanding, incidentally, is that all three companies have worked with the community in order to tune and improve performance and features.
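
The webui forks wrap those backends for you, but for a sense of what they map to at the library level, here’s a hedged sketch using Hugging Face’s optimum wrappers for OpenVINO (Intel) and ONNX Runtime with DirectML (AMD). This is not the code path we benchmarked, and the class names, export flag, and execution-provider string are assumptions based on those libraries’ documented interfaces:

```python
# Rough sketch of the library-level backends behind the webui forks:
# OpenVINO for Intel Arc (via optimum-intel) and DirectML through ONNX
# Runtime for AMD (via optimum plus onnxruntime-directml). The forks the
# article tested wrap similar machinery; the model ID is an assumption.
model_id = "runwayml/stable-diffusion-v1-5"
prompt = "a photo of an astronaut riding a horse"

# Intel Arc: export the model to OpenVINO IR and run it
from optimum.intel import OVStableDiffusionPipeline
ov_pipe = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
image = ov_pipe(prompt, height=512, width=512).images[0]

# AMD: export to ONNX and run with the DirectML execution provider
# (requires the onnxruntime-directml package)
from optimum.onnxruntime import ORTStableDiffusionPipeline
ort_pipe = ORTStableDiffusionPipeline.from_pretrained(
    model_id, export=True, provider="DmlExecutionProvider"
)
image = ort_pipe(prompt, height=512, width=512).images[0]
```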

Whether you’re using an AMD, Intel, or Nvidia GPU, there will be a few hurdles to jump in order to get things running optimally. If you have issues with the instructions in any of the linked repositories, drop us a note in the comments and we’ll do our best to help out. Once you have the basic steps down, however, it’s not too difficult to fire up the webui and start generating images. Note that extra functionality (e.g. upscaling) is separate from the base text-to-image code and would require additional modifications and tuning to extract better performance, so that wasn’t part of our testing.

Additional details are lower down the page, for those who want them. But if you’re just here for the benchmarks, let’s get started.

Stable Diffusion 512×512 Performance

[Chart: Stable Diffusion 512×512 performance, images per minute. Image credit: Tom’s Hardware]

This shouldn’t be a particularly shocking result. Nvidia has been pushing AI technology via Tensor cores since the Volta V100 back in late 2017. The RTX series added the feature in 2018, with refinements and performance improvements each generation (see below for more details on the theoretical performance). With the latest tuning in place, the RTX 4090 ripped through 512×512 Stable Diffusion image generation at a rate of more than one image per second — 75 per minute.

AMD’s fastest GPU, the RX 7900 XTX, only managed about a third of that performance level with 26 images per minute. Even more alarming, perhaps, is how poorly the RX 6000-series GPUs performed. The RX 6950 XT output 6.6 images per minute, well behind even the RX 7600. Clearly, AMD’s AI Matrix accelerators in RDNA 3 have helped improve throughput in this particular workload.

Intel’s current fastest GPU, the Arc A770 16GB, managed 15.4 images per minute. Keep in mind that the hardware’s theoretical performance is quite a bit higher than that of the RTX 2080 Ti (comparing XMX FP16 throughput to Tensor FP16 throughput): 157.3 TFLOPS versus 107.6 TFLOPS. The Arc GPUs thus appear to be delivering less than half of their theoretical performance, which is why benchmarks are the most important gauge of real-world performance.

While there are differences between the various GPUs and architectures, performance largely scales proportionally with theoretical compute. The RTX 4090 was 46% faster than the RTX 4080 in our testing, while in theory it offers 69% more compute performance. Likewise, the 4080 beat the 4070 Ti by 24%, and it has 22% more compute.

The newer architectures aren’t necessarily performing substantially faster. The 4080 beat the 3090 Ti by 10%, while offering potentially 20% more compute. But the 3090 Ti also has more raw memory bandwidth (1008 GB/s compared to the 4080’s 717 GB/s), and that’s certainly a factor. The old Turing generation held up well too: the newer RTX 4070 beat the RTX 2080 Ti by just 12%, on theoretically 8% more compute.
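
To sanity-check that scaling claim, the quick calculation below just restates the relative figures quoted above and compares observed speedup against theoretical compute advantage:

```python
# Observed speedup vs. theoretical compute advantage, using the relative
# figures quoted in the text. A ratio near 1.0 means performance scaled in
# line with compute; below 1.0 means the faster card left some of its
# theoretical advantage on the table.
pairs = {
    "RTX 4090 vs RTX 4080":    (1.46, 1.69),  # 46% faster, 69% more compute
    "RTX 4080 vs RTX 4070 Ti": (1.24, 1.22),  # 24% faster, 22% more compute
    "RTX 4080 vs RTX 3090 Ti": (1.10, 1.20),  # 10% faster, ~20% more compute
    "RTX 4070 vs RTX 2080 Ti": (1.12, 1.08),  # 12% faster, ~8% more compute
}
for name, (speedup, compute) in pairs.items():
    print(f"{name}: {speedup / compute:.2f}x of the expected scaling")
```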

Stable Diffusion 768×768 Performance

[Chart: Stable Diffusion 768×768 performance, images per minute. Image credit: Tom’s Hardware]

Kicking the resolution up to 768×768, Stable Diffusion likes to have quite a bit more VRAM in order to run well. Memory bandwidth also becomes more important, at least at the lower end of the spectrum.

The relative positioning of the various Nvidia GPUs doesn’t shift too much. AMD’s RX 7000-series gained some ground with the RX 7800 XT and above, while the RX 7600 dropped a bit: the 7600 was 36% slower than the 7700 XT at 512×512, but fell to 44% slower at 768×768.

The previous generation AMD GPUs had an even tougher time. The RX 6950 XT didn’t even manage two images per minute, and the 8GB RX 6650 XT, 6600 XT, and 6600 all failed to render even a single image. That’s a bit odd, as the RX 7600 still worked okay with only 8GB of memory, but some other architectural difference was at play.

Intel’s Arc GPUs also lost ground at the higher resolution, or if you prefer, the Nvidia GPUs — particularly the fastest models — put some additional distance between themselves and the competition. The 4090, for example, was 4.9X faster than the Arc A770 16GB at 512×512 images, and that increased to a 6.4X lead with 768×768 images.

We haven’t tested SDXL yet, mostly because its memory demands are even higher than those of 768×768 image generation, and getting it running properly takes more effort. TensorRT support is also missing for Nvidia GPUs, and most likely we’d see quite a few GPUs struggle with SDXL. It’s something we plan to investigate in the future, however, as SDXL’s results are generally preferable to SD1.5 and SD2.1 for higher resolution outputs.

For now, we know that performance will be lower than our 768×768 results. As an example of what to expect, the RTX 4090, generating 1024×1024 images (still using SD1.5), managed just 13.4 images per minute. That’s less than half the speed of 768×768 image generation, which makes sense, as 1024×1024 images have 78% more pixels and the time required seems to scale somewhat faster than the resolution increase.
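
The pixel-count arithmetic behind that claim is simple enough to restate; the sketch below only reuses the numbers from this paragraph:

```python
# Why 1024x1024 is "78% more pixels" than 768x768, and why 13.4 images per
# minute implies the cost grows faster than the pixel count alone.
px_768 = 768 * 768          # 589,824 pixels
px_1024 = 1024 * 1024       # 1,048,576 pixels
print(f"Pixel increase: {px_1024 / px_768 - 1:.0%}")  # ~78%

rate_1024 = 13.4            # RTX 4090 at 1024x1024, images per minute (measured)
# "Less than half the speed of 768x768" means the 768x768 rate exceeded
# 2 * 13.4 = 26.8 images per minute, so per-image time more than doubled
# while the pixel count grew by only ~1.78x.
print(f"Implied 768x768 rate: > {2 * rate_1024:.1f} images per minute")
```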

Picking a Stable Diffusion Model