Skip to content
Main Navigation Puget Systems Logo
  • Solutions
    • Content Creation
      • Photo Editing
        • Recommended Systems For:
        • Adobe Lightroom Classic
        • Adobe Photoshop
        • Stable Diffusion
      • Video Editing & Motion Graphics
        • Recommended Systems For:
        • Adobe After Effects
        • Adobe Premiere Pro
        • DaVinci Resolve
        • Foundry Nuke
      • 3D Design & Animation
        • Recommended Systems For:
        • Autodesk 3ds Max
        • Autodesk Maya
        • Blender
        • Cinema 4D
        • Houdini
        • ZBrush
      • Real-Time Engines
        • Recommended Systems For:
        • Game Development
        • Unity
        • Unreal Engine
        • Virtual Production
      • Rendering
        • Recommended Systems For:
        • Keyshot
        • OctaneRender
        • Redshift
        • V-Ray
      • Digital Audio
        • Recommended Systems For:
        • Ableton Live
        • FL Studio
        • Pro Tools
    • Engineering
      • Architecture & CAD
        • Recommended Systems For:
        • Autodesk AutoCAD
        • Autodesk Inventor
        • Autodesk Revit
        • SOLIDWORKS
      • Visualization
        • Recommended Systems For:
        • Enscape
        • Lumion
        • Twinmotion
      • Photogrammetry & GIS
        • Recommended Systems For:
        • ArcGIS Pro
        • Agisoft Metashape
        • Pix4D
        • RealityCapture
    • AI & HPC
      • Recommended Systems For:
      • Data Science
      • Generative AI
      • Large Language Models
      • Machine Learning / AI Dev
      • Scientific Computing
    • More
      • Recommended Systems For:
      • Compact Size
      • Live Streaming
      • NVIDIA RTX Studio
      • Quiet Operation
      • Virtual Reality
    • Business & Enterprise
      We can empower your company
    • Government & Education
      Services tailored for your organization
  • Products
    • Puget Mobile
      Powerful laptop workstations
      • Puget Mobile 16″
    • Puget Workstations
      High-performance desktop PCs
      • AMD Ryzen
        • Ryzen 9000:
        • Small Form Factor
        • Mini Tower
        • Mid Tower
        • Full Tower
      • AMD Threadripper
        • Threadripper 7000:
        • Mid Tower
        • Full Tower
        • Threadripper PRO 7000WX:
        • Full Tower
      • AMD EPYC
        • EPYC 9004:
        • Full Tower
      • Intel Core Ultra
        • Core Ultra Series 2:
        • Small Form Factor
        • Mini Tower
        • Mid Tower
        • Full Tower
      • Intel Xeon
        • Xeon W-2500:
        • Mid Tower
        • Xeon W-3500:
        • Full Tower
    • Custom Computers
    • Puget Rackstations
      Workstations in rackmount chassis
      • AMD Rackstations
        • Ryzen 7000 / EPYC 4004:
        • R550-6U 5-Node
        • Ryzen 9000:
        • R132-4U
        • Threadripper 7000:
        • T121-4U
        • Threadripper PRO 7000WX:
        • T141-4U
        • T140-5U (Dual 5090s)
      • Intel Rackstations
        • Core Ultra Series 2:
        • C132-4U
        • Xeon W-3500:
        • X131-4U
        • X141-5U
    • Custom Rackmount Workstations
    • Puget Servers
      Enterprise-class rackmount servers
      • Rackmount Servers
        • AMD EPYC:
        • E200-1U
        • E140-2U
        • E280-4U
        • Intel Xeon:
        • X200-1U
    • Comino Grando GPU Servers
    • Custom Servers
    • Puget Storage
      Solutions from desktop to datacenter
      • Network-Attached Storage
        • Synology NAS Units:
        • 4-bay DiskStation
        • 8-bay DiskStation
        • 12-bay DiskStation
        • 4-bay RackStation
        • 12-bay FlashStation
      • Software-Defined Storage
        • Datacenter Storage:
        • 12-Bay 2U
        • 24-Bay 2U
        • 36-Bay 4U
    • Recommended Third Party Peripherals
      Curated list of accessories for your workstation
    • Puget Gear
      Quality apparel with Puget Systems branding
  • Publications
    • Articles
    • Blog Posts
    • Case Studies
    • HPC Blog
    • Podcasts
    • Press
    • PugetBench
  • Support
    • Contact Support
    • Support Articles
    • Warranty Details
    • Onsite Services
    • Unboxing
  • About Us
    • About Us
    • Contact Us
    • Our Customers
    • Enterprise
    • Gov & Edu
    • Press Kit
    • Testimonials
    • Careers
  • Talk to an Expert
  • My Account
  1. Home
  2. /
  3. Hardware Articles
  4. /
  5. AMD Microsoft Olive Optimizations for Stable Diffusion Performance Analysis

AMD Microsoft Olive Optimizations for Stable Diffusion Performance Analysis

Posted on October 31, 2023 (December 8, 2023) by Matt Bach
Always look at the date when you read an article. Some of the content in this article is most likely out of date, as it was written on October 31, 2023. For newer information, see our more recent articles.

Table of Contents

  • Introduction
  • Test Setup
  • AMD Radeon 7900 XTX Performance Gains
  • Radeon 7900 XTX vs RTX 4080 vs RTX 4090 for Stable Diffusion
  • Conclusion

Introduction

Over the last few years, NVIDIA has had a stranglehold on the AI and Machine Learning markets, but AMD has been investing a lot of effort into improving support and performance for their Radeon and Radeon PRO GPUs. A great example is a recent BlackMagic DaVinci Resolve update that improved Neural Engine (AI) performance on AMD GPUs by up to 4.5x.

Today, we want to take a look at a Stable Diffusion optimization that AMD recently published for Automatic 1111 (the most common implementation of Stable Diffusion) on how to leverage Microsoft Olive to generate optimized models for AMD GPUs. AMD claims that this can result in performance improvements of up to 9.9x on AMD GPUs, which is something we had to check out for ourselves! What is interesting about this compared to optimizations we have seen from others like NVIDIA is that it is technically a vendor-agnostic solution that can run on any DX12-capable GPUs (including AMD, NVIDIA, and Intel). For developers, this can be a big deal when writing software that is intended to be used by various types of GPU hardware without having to depend on GPU vendor-specific proprietary solutions.

If you want to follow their guide or get additional background on how it works, we recommend reading the post: [UPDATED HOW-TO] Running Optimized Automatic1111 Stable Diffusion WebUI on AMD GPUs.

Featured image showing AMD Microsoft Olive Optimization for Stable Diffusion Automatic 1111
Image
Open Full Resolution

It is worth noting that AMD is not the only one making performance improvements for Stable Diffusion. NVIDIA also recently released a guide on installing and using a TensoRT extension, which they say should improve performance by almost 3x over the base installation of Automatic 1111, or around 2x faster than using xFormers. This is not as big of a gain as what AMD is touting, but we have a sister article looking at that extension that you can read here: NVIDIA TensorRT Extension for Stable Diffusion Performance Analysis.

While we are primarily looking at the AMD optimizations in this post, we will quickly compare NVIDIA and AMD GPUs at the end of this article using both optimizations to see how each brand compares after everything is said and done.

Test Setup

Threadripper PRO Test Platform

CPU: AMD Threadripper PRO 5975WX 32-Core
CPU Cooler: Noctua NH-U14S TR4-SP3 (AMD TR4)
Motherboard: ASUS Pro WS WRX80E-SAGE SE WIFI
BIOS Version: 1201
RAM: 8x Micron DDR4-3200 16GB ECC Reg.
(128GB total)
GPUs:
AMD Radeon 7900 XTX 24GB
Driver Version: 23.10.2
PSU: Super Flower LEADEX Platinum 1600W
Storage: Samsung 980 Pro 2TB
OS: Windows 11 Pro 64-bit (22621)

AMD Ryzen Test Platform

CPU: AMD Ryzen 9 7950X
CPU Cooler: Noctua NH-U12A
Motherboard: ASUS ProArt X670E-Creator WiFi
BIOS Version: 1602
RAM: 2x DDR5-5600 32GB
(64GB total)
GPUs:
AMD Radeon 7900 XTX 24GB
Driver Version: 23.10.2
PSU: Super Flower LEADEX Platinum 1600W
Storage: Samsung 980 Pro 2TB
OS: Windows 11 Pro 64-bit (22621)

Benchmark Software

Automatic 1111 v1.6 DirectML
Python: 3.10.6
SD Model: v1.5
Microsoft Olive Optimizations for Automatic 1111
SHARK v20231009.984
SD Model: v1.5

We originally intended to test using a single base platform built around the AMD Threadripper PRO 5975WX, but through the course of verifying our results against those in AMD’s blog post, we discovered that the Threadripper PRO platform could sometimes give lower performance than a consumer-based platform like AMD Ryzen or Intel Core. However, Intel Core and AMD Ryzen tended to be within a few percent, so, for testing purposes, it doesn’t matter if we used an AMD Ryzen or Intel Core CPU.

Since it is common for Stable Diffusion to be used with either class of processor, we opted to go ahead and do our full testing on both a consumer- and professional-class platform. We will largely focus on the AMD Ryzen platform results since they tend to be a bit higher, but we will also include and analyze the results on Threadripper PRO.

As noted in the Introduction section, we will be looking specifically at Automatic 1111 and will be testing it in a few different configurations: a base install using the DirectML fork, and an optimized version with Microsoft Olive. In addition, we also wanted to include SHARK since, historically, it is much faster than Automatic 1111 for AMD GPUs. It will be interesting to see if the new optimizations make Automatic 1111 on par, or faster than Shark.

Since changing GPUs requires re-generate the optimized model, we opted to test with just one AMD GPU for now. We decided on the Radeon 7900 XTX, but pro-level cards like the Radeon Pro W7900 should show similar performance uplifts.

For the specific prompts and settings we chose to use, we changed things up from our previous Stable Diffusion testing. While working with Stable Diffusion, we have found that few settings affect raw performance. For example, the prompt itself, CFG scale, and seed have no measurable impact. The resolution does, although since many models are optimized for 512×512, going outside of that sometimes has unintended consequences that are not great from a benchmarking standpoint.

To examine this TensorRT extension, we decided to set up our tests to cover four common situations:

  1. Generating a single high-quality image
  2. Generating a large number of medium-quality images
  3. Generating four high-quality images with minimal VRAM usage
  4. Generating four high-quality images with maximum speed

To accomplish this, we used the following prompts and settings for txt2img:

Prompt: red sports car, (centered), driving, ((angled shot)), full car, wide angle, mountain road

txt2img SettingsTest 1Test 2Test 3Test 4
Steps15050150150
Width512512512512
Height512512512512
Batch Count11041
Batch Size1114
CFG Scale7777
Seed3936349264393634926439363492643936349264

We will note that we also tested using img2img, but the results were identical to txt2img. For the sake of simplicity, we will only include the txt2img results, but know that they also hold true when using an image as the base input as well.

Tower Computer Icon in Puget Systems Colors

Looking for a Content Creation Workstation?

We build computers tailor-made for your workflow. 

Configure a System!
Talking Head Icon in Puget Systems Colors

Don’t know where to start?
We can help!

Get in touch with our technical consultants today.

Talk to an Expert

AMD Radeon 7900 XTX Performance Gains

Automatic 1111 NVIDIA Microsoft Olive Optimizations - 7900 XTX on Ryzen Performance Gains
Automatic 1111 NVIDIA Microsoft Olive Optimizations - 7900 XTX on Threadripper PRO Performance Gains
Automatic 1111 NVIDIA Microsoft Olive Optimizations - 7900 XTX on Ryzen Performance Gains
Automatic 1111 NVIDIA Microsoft Olive Optimizations - 7900 XTX on Threadripper PRO Performance Gains
Previous Next
System Image
Automatic 1111 NVIDIA Microsoft Olive Optimizations - 7900 XTX on Ryzen Performance Gains
Open Full Resolution
Automatic 1111 NVIDIA Microsoft Olive Optimizations - 7900 XTX on Threadripper PRO Performance Gains
Open Full Resolution
Previous Next

Looking at our results, you may first notice that many of our testing prompts seem a bit redundant. As long as we kept the batch size to 1 (the number of images being generated in parallel), the iterations per second (it/s) are pretty much unchanged with each of the three methods we looked at (Automatic 1111 base, Automatic 1111 w/ Olive, and SHARK). This is expected behavior since this metric essentially reports how many “steps” we can complete each second. In more standard GPU performance testing, we would likely do one of these three tests and call it a day, but we wanted to check to ensure there were no gaps with the optimization. If it turned out to only work with a batch count of 1, or only up to a certain number of steps, that is very important information to have.

The second thing you may notice is that there was a lack of results when we turned the batch size up to four. This setting controls how many images are generated in parallel, increasing performance, but also increasing the amount of VRAM needed. Even though the 7900 XTX has 24GB of VRAM, we ran into issues with insufficient VRAM with the base DirectML fork of Automatic 1111 and SHARK. No amount of –lowVRAM switches could get us around this, but luckily, the Microsoft Olive optimization allowed us to do this test without any special configuration beyond what AMD had in their guide.

This means that unlike in our similar article for NVIDIA, we can’t give you any speedup numbers for larger batch sizes. However, with a batch size of one, we saw some amazing performance gains of about 13.4x versus the base Automatic 1111 install on the AMD Ryzen platform, or about 9.1x on AMD Threadripper PRO. That is roughly in line, or better, than AMD’s example results in their guide.

Even with this massive gain, however, it is only enough to bring Automatic 1111 roughly in line with SHARK. We should also note that this is with the 1.5 Stable Diffusion model, which isn’t installed with SHARK automatically. SHARK defaults to a 2.1 model, which, in our testing is about 20% faster than the 1.5 model. However, comparing different models isn’t great as they work differently and give very different results. We would like to compare Automatic 1111 with a newer model in the future, but we had some issues getting the Microsoft Olive optimizations working on 2.1, so that will have to wait for a future article.

No matter how you look at it, this Microsoft Olive optimization makes a massive difference if you want to use an AMD GPU with Automatic 1111. Across both platforms, we saw on average about an 11.3x increase in performance over a basic Automatic 1111 DirectML installation.

Tower Computer Icon in Puget Systems Colors

Looking for a Content Creation Workstation?

We build computers tailor-made for your workflow. 

Configure a System!
Talking Head Icon in Puget Systems Colors

Don’t know where to start?
We can help!

Get in touch with our technical consultants today.

Talk to an Expert

Radeon 7900 XTX vs RTX 4080 vs RTX 4090 for Stable Diffusion

While a performance improvement of 11x is a massive accomplishment that will benefit a huge number of users, there remains the question of whether it is enough for AMD to catch up to NVIDIA. It certainly closes the gap significantly, but with NVIDIA also releasing a guide recently for using a TensorRT extension (which we found increased performance by around 2x over xFormers), this is a lot of ground for AMD to cover.

To show how NVIDIA and AMD stack up right now, we decided to take the results from the “Batch Size: 4” and “Batch Count: 4” tests in this article, as well as our NVIDIA-focused article, to show how the RTX 4090, RTX 4080, and Radeon 7900 XTX compare. We will note that this isn’t a completely apples-to-apples comparison since both NVIDIA and AMD are making optimized models or engines for their GPUs, but that seems to be the near reality for this type of workload.

Image
Open Full Resolution

Unfortunately for AMD, even with the huge performance gains we saw in Automatic 1111, it isn’t enough to bring the Radeon 7900 XTX in line with the RTX 4080. Before NVIDIA’s TensorRT extension was available to provide an additional 2x improvement via their custom path, the performance between the 7900 XTX and the RTX 4080 was very similar (23 vs 24 it/s). But, with it, NVIDIA jumps back out into the lead.

AMD has been doing a lot of great things in the AI/ML space recently, and we have hope that optimizations like this are only the tip of the iceberg from AMD. Olive is interesting because it is vendor-agnostic, so it should work on AMD, Intel, and NVIDIA GPUs; unlike NVIDIA’a TensorRT extension that only works on their GPUs. It also isn’t limited only to Automatic 1111. If it can be implemented by SHARK, we could see AMD jump right back to being on par with NVIDIA.

However, for the moment, with the current optimizations recommended by both NVIDIA and AMD we are still looking at about 2.2x higher performance with an NVIDIA GeForce RTX 4080 over the AMD Radeon 7900 XTX, or 3x higher performance with a more expensive RTX 4090.

Conclusion

The world of Artificial Intelligence and Machine Learning is in constant flux, in part because everyone is looking for ways to expand what it can do, and how fast it can be done. This Microsoft Olive optimization for AMD GPUs is a great example, as we found that it can give a massive 11.3x increase in performance for Stable Diffusion with Automatic 1111.

This huge gain brings the Automatic 1111 DirectML fork roughly on par with historically AMD-favorite implementations like SHARK. It isn’t enough for AMD to catch up to NVIDIA in terms of raw performance for Stable Diffusion (yet), but it shows how much can still be gained through pure software optimizations. While we don’t expect an 11x performance increase like this to be common, we are excited to see what AMD can do in the future. If nothing else, we are looking forward to what this may be able to do if implemented in other Stable Diffusion solutions like SHARK.

If you use Stable Diffusion on an AMD GPU, we would be very interested to hear about your experience with the Microsoft Olive optimations in the comments! And if you are looking for a workstation for any Content Creation workflow, you can visit our solutions pages to view our recommended workstations for various software packages, our custom configuration page, or contact one of our technology consultants for help configuring a workstation that meets the specific needs of your unique workflow.

Tower Computer Icon in Puget Systems Colors

Looking for a Content Creation workstation?

We build computers tailor-made for your workflow. 

Configure a System
Talking Head Icon in Puget Systems Colors

Don’t know where to start?
We can help!

Get in touch with one of our technical consultants today.

Talk to an Expert

Related Content

  • NVIDIA GeForce RTX 5090 & 5080 AI Review
  • Exploring GPU Performance Across LLM Sizes
  • LLM Inference – NVIDIA RTX GPU Performance
  • LLM Inference – Consumer GPU performance
View All Related Content

Latest Content

  • Do Video Editors Need GeForce RTX 50 Series GPUs?
  • Adobe Premiere Pro and After Effects – What’s New In Version 25.2?
  • The Future of LED Walls: Arena & Nuke Stage Go Beyond Game Engines
  • 2025 Tariff Impacts at Puget Systems
View All
Tags: AMD, automatic 1111, GPU, Microsoft Olive, NVIDIA, Radeon RX 7900 XTX

Who is Puget Systems?

Puget Systems builds custom workstations, servers and storage solutions tailored for your work.

We provide:

Extensive performance testing
making you more productive and giving better value for your money

Reliable computers
with fewer crashes means more time working & less time waiting

Support that understands
your complex workflows and can get you back up & running ASAP

A proven track record
as shown by our case studies and customer testimonials

Get Started

Browse Systems

Puget Systems Mobile Laptop Workstation Icon

Mobile

Puget Systems Tower Workstation Icon

Workstations

Puget Systems Rackmount Workstation Icon

Rackstations

Puget Systems Rackmount Server Icon

Servers

Puget Systems Rackmount Storage Icon

Storage

Latest Articles

  • Do Video Editors Need GeForce RTX 50 Series GPUs?
  • Adobe Premiere Pro and After Effects – What’s New In Version 25.2?
  • The Future of LED Walls: Arena & Nuke Stage Go Beyond Game Engines
  • 2025 Tariff Impacts at Puget Systems
  • Z890 vs. B860 vs. H810
View All

Post navigation

 NVIDIA TensorRT Extension for Stable Diffusion Performance AnalysisAI/ML Vocabulary Glossary 
Puget Systems Logo
Build Your Own PC Site Map FAQ
facebook instagram linkedin rss twitter youtube

Optimized Solutions

  • Adobe Premiere
  • Adobe Photoshop
  • Solidworks
  • Autodesk AutoCAD
  • Machine Learning

Workstations

  • Content Creation
  • Engineering
  • Scientific PCs
  • More

Support

  • Online Guides
  • Request Support
  • Remote Help

Publications

  • All News
  • Puget Blog
  • HPC Blog
  • Hardware Articles
  • Case Studies

Policies

  • Warranty & Return
  • Terms and Conditions
  • Privacy Policy
  • Delivery Times
  • Accessibility

About Us

  • Testimonials
  • Careers
  • About Us
  • Contact Us
  • Newsletter

© Copyright 2025 - Puget Systems, All Rights Reserved.