
We bought a 48GB RTX 4090 from China to see if doubling your VRAM really makes that much of a difference with AI workloads. It definitely does, but what's more interesting is how little of that difference the benchmarks actually show.

 

 


@podkall here's the missing vram chips


38 minutes ago, Plouffe said:

We bought a 48GB RTX 4090 from China to see if doubling your VRAM really makes that much of a difference with AI workloads. It definitely does, but what's more interesting is how little of that difference the benchmarks actually show.

 

 

 

14 minutes ago, strange13930 said:

@podkall here's the missing vram chips

Basically just the RTX 6000 ADA Generation 😅



You guys got all the q4 quant wrong in the script and in the timestamps... 

Where did you even get q4_0 from? It's not a quant ollama will offer by default if you download from them, at least for any model that's come out in the past half a year or so.

q4_k_m is a refinement of q4_0 that offers significantly less degradation compared to the full model, at a marginal increase in size.

 

Also if you use ollama as your backend, type "ollama ps" in the command line to get a breakdown of how the model you're running is spread between VRAM and RAM.

 

Example:

[screenshot of 'ollama ps' output showing how the model is split between GPU and CPU]

 

In this case I still had a game running in the background using up VRAM, so even though the 21GB model fits into my 24GB 4090, it wasn't fully running on said 4090.

(Incidentally, this split runs at about 23 tokens/s, compared to the 40 t/s when fully loaded into vram)

 

Also, I don't know if that's something you want to stick your nose into, but 'abliteration' is a very interesting technique for decensoring models.

https://huggingface.co/blog/mlabonne/abliteration

 

 

edit: Also, I'd definitely go with Flux over SD3.5L 😛

edit2: If you do something like this again, I'd recommend putting 64GB of RAM in those systems. 32GB can run out unexpectedly and if it starts offloading to your SSD, system responsiveness goes to basically zero.


@fwdhsenid looks like Open WebUI to me. 

https://openwebui.com/

https://github.com/open-webui/open-webui

 

There are plenty of 'howto' videos to be found if you want to try it yourself and find that the 'how to install' instructions aren't detailed enough.


From what the AI QQ groups and Chinese forums have said, all this is is a 4090 on a 3090 Ti PCB with extra VRAM, not a custom board. There's also a rumoured 192GB one that one Chinese guy supposedly has, and China also has official MSI 3080 20GB cards that Nvidia discontinued after they were already made.


That segue was straight nightmare fuel!

 

Don't do that again please @LinusTech 😂



45 minutes ago, Cxsmo_AI said:

From what the AI QQ groups and Chinese forums have said, all this is is a 4090 on a 3090 Ti PCB with extra VRAM, not a custom board. There's also a rumoured 192GB one that one Chinese guy supposedly has, and China also has official MSI 3080 20GB cards that Nvidia discontinued after they were already made.

As someone who works directly with the factories making and selling these, I can say your statement is only partially correct.

 

There are two versions of a "48GB" 4090. One is a 4090D with 48GB of non-X GDDR6, and that one does use a 3090 Ti PCB. It was more of a proof of concept and was the very first 48GB card, developed in June 2024, also because custom PCB were not invented back then. Then a very small quantity of 4090Ds with GDDR6X was made, but that quickly stopped and production transitioned to the non-D 4090 after more supply channels for the AD102-301 opened up (the AD102-300 could not be used for 48GB mods until much later, in January 2025, when a BIOS issue was fixed).

 

The other version is the original non-D 4090 with 48GB of GDDR6X, and that one uses a custom PCB with upgraded VRMs and MOSFETs. They are all assembled on SMT machines.

 

As for the rumours, 192GB is completely fake, and so is the "96GB" that someone keeps posting on X. The only reason a 48GB card exists is that a certain Nvidia OEM leaked an original, Nvidia-signed BIOS meant for an experimental AD102 card that got cancelled. No such BIOS exists for any other VRAM capacity. It does not work by just "adding more VRAM slots to the PCB" or "changing from 2GB to 4GB dies".

 

Same goes for the 5090: 96GB would be possible once a similar BIOS leaks. It has not yet.

 

Also, for anyone looking to buy one, read carefully and don't get scammed over GDDR6 vs GDDR6X and 4090D vs 4090. The cost difference is HUGE, more than $800 USD.


7 minutes ago, Billy Cao said:

custom PCB were not invented back then

That wording is a laugh 🙂

 

8 minutes ago, Billy Cao said:

Same goes for the 5090: 96GB would be possible once a similar BIOS leaks. It has not yet.

Thinking about it, an RTX Pro 6000 BIOS might work if you're really lucky and the deactivated silicon is flawless? Or does Nvidia burn a fuse or something to physically deactivate parts of the chip?

 

12 minutes ago, Billy Cao said:

Also, for anyone looking to buy one, read carefully and don't get scammed over GDDR6 vs GDDR6X and 4090D vs 4090. The cost difference is HUGE, more than $800 USD.

I suppose the performance is quite significant with LLMs mostly being memory bandwidth dependent?


5 minutes ago, steamrick said:

That wording is a laugh

Yeah, I meant the custom PCB for the 48GB 4090.

 

5 minutes ago, steamrick said:

RTX Pro 6000 BIOS

Tried that long ago, can't get it to work yet. It's not about deactivating silicon; the BIOS recognizes the exact silicon bin.

 

5 minutes ago, steamrick said:

I suppose the performance is quite significant with LLMs mostly being memory bandwidth dependent?

Yes, basically 1:1 proportional to memory bandwidth (a 12% difference: 896GB/s vs 1008GB/s).


I'm so happy Linus showed off ComfyUI! The models aren't easy to use, and ComfyUI is one of the hardest tools to learn. I'm not going to comment much on what Linus could have done differently to get better outputs; it's not much different from how I started out. Practice makes perfect!

 

"Not anyone can turn an image into a real product"

 

Hunyuan 3D (workflow) lets you turn images into STLs that are good enough to be 3D printed! There is a workflow to add textures, but I don't need it if I'm just making STLs. I use it to make custom D&D minis.

[screenshot of a Hunyuan 3D result; Jacket.stl attached]

 

 

3 hours ago, steamrick said:

Also if you use ollama as your backend, type "ollama ps" in the command line to get a breakdown of how the model you're running is spread between VRAM and RAM

  

3 hours ago, fwdhsenid said:

Any idea what UI was used that was shown in the video ?

 

My advice for new users is always LM Studio: all-in-one with an easy UI, recommended models, and easy runtimes that just work.

 

  

1 hour ago, steamrick said:

I suppose the performance is quite significant with LLMs mostly being memory bandwidth dependent?

LLMs are horribly bandwidth bound. In generation you are almost always limited by memory bandwidth and little else. The prefill (what happens before the first token) is more compute bound.
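
To put a rough number on that (my own back-of-the-envelope sketch, not anything from the video): during generation every new token has to stream more or less the whole model through memory, so memory bandwidth divided by model size gives an upper bound on tokens per second. Illustrative Python, using a 4090's bandwidth and a ~17GB q4_K_M model:

# Rough ceiling on decode speed: each generated token reads ~the whole model
# from VRAM, so tokens/s <= memory bandwidth / model size on disk.
bandwidth_gb_s = 1008   # RTX 4090 memory bandwidth, GB/s
model_size_gb = 17      # e.g. a 27B model at q4_K_M

print(f"ceiling: ~{bandwidth_gb_s / model_size_gb:.0f} tokens/s")  # ~59 t/s
# Real-world numbers land well below the ceiling (KV cache reads, kernel
# overhead), which is consistent with the ~40 t/s quoted earlier in the thread.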

 

Diffusion models are really dense; bandwidth matters, but compute matters more. And they can't be easily split between GPUs, unlike LLMs.

  

47 minutes ago, RKPC10 said:

does anybody know what program Linus is using to generate images with the ai locally?

It's ComfyUI. That's what I use, but it's not what I would advise for a new user. It's really meant for people who want to be on the bleeding edge of models and run whatever came out of a Chinese lab the day before. You'll burn yourself out if you start with it: it's hard to run the models, it's hard to build workflows, and it's hard to debug and handle all the dependencies and directories.

 

Instead, my advice is: if you have an Nvidia card, use SD Next.

 

If you are on AMD, tough luck. To start out you can use Amuse and turn on the advanced interface, but it loses 50% to 75% of the performance because of the DirectML ONNX runtime. I use ROCm under WSL and it's hardcore. Supposedly AMD will eventually come out with ROCm binaries that are compatible with Windows.

 


Now that I look, I see quite a few 48GB RTX 4090s on eBay.



Just now, Johan5 said:

How could we possibly know which ones are scams before buying one?

If the seller is kind enough to label it accurately, then you just need to read the description carefully.

 

Another thing: on eBay the price for a GDDR6X 48GB non-D card should be above $4100 USD. If it's around $3800 USD or below, it's definitely the non-X GDDR6 and/or D version.

 

Lastly, if you want more info, you can DM me. eBay charges ridiculous seller fees, so buying direct will be much cheaper.


21 hours ago, KidKid said:

Basically just the RTX 6000 ADA Generation 😅

That one uses regular GDDR6 (non-X) memory, so this 4090 48GB is actually faster due to the extra memory bandwidth and higher power limit/clocks.
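
Rough spec-sheet math for the curious (numbers from memory, so double-check them against the official spec sheets): both cards have a 384-bit bus, so the difference comes down to per-pin data rate.

# Approximate memory bandwidth in GB/s: (bus width in bits / 8) * Gbps per pin
def bandwidth_gb_s(bus_bits, gbps_per_pin):
    return bus_bits / 8 * gbps_per_pin

print(bandwidth_gb_s(384, 20))  # RTX 6000 Ada, 20 Gbps GDDR6  -> 960.0
print(bandwidth_gb_s(384, 21))  # RTX 4090, 21 Gbps GDDR6X     -> 1008.0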

 

19 hours ago, Billy Cao said:

The only reason a 48GB card exists is that a certain Nvidia OEM leaked an original, Nvidia-signed BIOS meant for an experimental AD102 card that got cancelled. No such BIOS exists for any other VRAM capacity. It does not work by just "adding more VRAM slots to the PCB" or "changing from 2GB to 4GB dies".

Oh, that settles the curiosity I had about the vBIOS. I knew it was an official one from Nvidia; I just wondered whether Nvidia left that memory config in there for all 4090s by "accident", or whether it was something else. Seems like it was the latter, thank you for the info.

 

19 hours ago, steamrick said:

Thinking about it, an RTX Pro 6000 BIOS might work if you're really lucky and the deactivated silicon is flawless? Or does Nvidia burn a fuse or something to physically deactivate parts of the chip?

Sadly that doesn't work. It's also why there was never a properly working 48GB 3090. I wish I could just double up the VRAM on my 3090s 😞

 

18 hours ago, 05032-Mendicant-Bias said:

And they can't be easily split between GPUs, unlike LLMs.

Fun fact: they actually can. I have done so, gone through some lengthy discussions about it on other forums, and demoed it. Remember that models such as Flux actually use a transformer for their diffusion steps.

Even the text encoders can be split, but the performance hit from doing so is bigger than for a transformer block because of the residual layers, which mean more data has to be passed between GPUs; then again, text encoders are often small enough to fit on a single GPU anyway.
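
For anyone wondering what "splitting the transformer blocks" looks like mechanically, here's a toy PyTorch sketch of the idea (my own illustration, not the actual setup discussed above): the first half of the blocks sits on one GPU, the second half on another, and the activations hop across once per forward pass. Real DiT/Flux blocks also take timestep and text conditioning, which would have to move along with the activations.

import torch
import torch.nn as nn

# Toy pipeline split of a transformer-style denoiser across two GPUs.
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
    for _ in range(8)
])

half = len(blocks) // 2
for i, blk in enumerate(blocks):
    blk.to("cuda:0" if i < half else "cuda:1")   # first half on GPU 0, rest on GPU 1

def forward(x):
    x = x.to("cuda:0")
    for i, blk in enumerate(blocks):
        if i == half:
            x = x.to("cuda:1")   # single hop between GPUs per forward pass
        x = blk(x)
    return x

out = forward(torch.randn(1, 4096, 1024))   # e.g. flattened latent patches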

 

9 hours ago, Billy Cao said:

Lastly, if you want more info, you can DM me.

Done 🙂 



I have a 4090 and would rather use Ollama (since it would be free), but it sucks ass and it's not like it's uncensored either.



On 6/22/2025 at 10:29 AM, steamrick said:

You guys got all the q4 quant wrong in the script and in the timestamps... 

[…]

There are (more than) a few comments about the demonstrations in this video. I'm Nik from the Lab, the one who helped Plouffe with the demos, and I wanted to share some insight into the decision-making behind this video.

 

First, there were a couple of misspeaks in the video:

  1. Linus says that the gemma3:27b-it-q4_K_M model was bigger than the gemma3:27b-it-q8_0 model. The "size" of a model usually refers to its number of parameters; in this case Linus was referring to the actual size on disk: the q4_K_M model is 17GB while the q8_0 is 30GB (there's a quick sanity check on those numbers sketched below). We'll watch out for this in the future.
  2. Linus, the graphic, and the timestamp call it the q4_0 model rather than by its proper name, q4_K_M. This was how he was referring to it during the shoot, and as above, we'll be more careful to make sure the names of things are pronounced properly.
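
For anyone wondering where those on-disk numbers come from, a quick sanity check (my own approximation; the effective bits-per-weight figures are ballpark values for these GGUF quants, not exact):

# Rough on-disk size: parameters * effective bits per weight / 8.
# q4_K_M works out to roughly ~5 bits/weight once scales and the higher-precision
# tensors are counted; q8_0 is roughly ~8.5 bits/weight.
params = 27e9
for name, bits in [("q4_K_M", 5.0), ("q8_0", 8.5)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
# -> roughly 17 GB and 29 GB, in line with the sizes mentioned above.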

When they were playing with Gemma 3, they should have started a new chat for a fresh context, and we should have shown explicitly on camera what was running on the test benches. Despite this, we achieved what we set out to demonstrate: the difference between 24GB and 48GB with regard to model sizes (as on disk, in GB). For LLMs, the main point was how the model's layers are split when it can't fit into VRAM; in the case of Stable Diffusion, we wanted to show how the increased VRAM allows for bigger batch sizes.
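
To make the layer-splitting part concrete: Ollama builds on llama.cpp, which exposes this directly as a count of layers to offload to the GPU, with the rest running from system RAM on the CPU. A minimal sketch with the llama-cpp-python bindings (the file path and layer count are just placeholders, not what we ran in the video):

from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers go into VRAM; the rest
# stay in system RAM and run on the CPU, which is where the tokens/s drop
# comes from. -1 means "try to offload everything".
llm = Llama(
    model_path="models/gemma-3-27b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=40,
    n_ctx=8192,
)

out = llm("Explain VRAM vs RAM offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])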

 

Regarding the comments about picking bad models: there are higher-quality models, but at the time of writing and filming, Gemma 27B at q4_K_M and q8_0 served our purposes. We weren't concerned about the quality of the output, and frankly, Linus and Plouffe did get some good laughs. Stable Diffusion was chosen for its better name recognition over Flux, not for its quality.

 

We like to use Ollama and OpenWebUI in these scenarios because they are accessible and easy to set up, but there are tons of options for those looking to start playing with AI, such as LM Studio. We aim for videos like these to spark curiosity about the covered topics, and ours shouldn't be the last video you watch on the subject.

 

If anyone is interested in getting set up locally with Ollama and OpenWebUI, check out Network Chuck's video, which has step-by-step instructions along with excellent explanations as he goes.


 

