Samsung Demos In-Memory Processing for HBM2, GDDR6, DDR4, and LPDDR5X

Lightwreather

Summary

If Samsung has its way, the memory chips in future desktop PCs, laptops, or GPUs could think for themselves.

Credit: Samsung

Quote

At Hot Chips 33, Samsung announced that it would extend its processing-in-memory technology to DDR4 modules, GDDR6, and LPDDR5X in addition to its HBM2 chips. Earlier this year, Samsung announced its HBM2 memory with an integrated processor that can compute up to 1.2 TFLOPS for AI workloads, allowing the memory itself to perform operations usually reserved for CPUs, GPUs, ASICs, or FPGAs. Today marks more forward progress with that chip, but Samsung also has more powerful variants on the roadmap with its next-gen HBM3. Given the rise of AI-based rendering techniques, like upscaling, we could even see this tech work its way into gaming GPUs.

Today's announcement reveals the official Aquabolt-XL branding for the HBM2 memory, along with AXDIMM DDR4 sticks and LPDDR5 memory that also come with embedded compute power.

Put simply, the chips have an AI engine injected inside each DRAM bank. That allows the memory itself to process data, meaning that the system doesn't have to move data between the memory and the processor, thus saving both time and power. Of course, there is a capacity tradeoff for the tech with current memory types, but Samsung says that HBM3 and future memories will have the same capacities as normal memory chips.
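To make the data-movement argument concrete, here's a rough back-of-envelope model in Python. The energy-per-bit figures are illustrative assumptions on my part, not Samsung's numbers; the point is only that shuttling data off-chip costs far more than touching it in place.

```python
# Back-of-envelope model (assumed costs, not Samsung's figures):
# compare the energy of moving data to the CPU vs. processing it in the DRAM bank.
PJ_PER_BIT_DRAM_TO_CPU = 20.0   # assumed off-chip transfer cost, pJ/bit
PJ_PER_BIT_IN_BANK = 1.0        # assumed in-bank processing cost, pJ/bit

def transfer_energy_joules(num_bytes, pj_per_bit):
    """Energy to move (or process in place) a buffer of the given size."""
    return num_bytes * 8 * pj_per_bit * 1e-12

buf = 1 << 30  # 1 GiB working set
conventional = transfer_energy_joules(buf, PJ_PER_BIT_DRAM_TO_CPU)
pim = transfer_energy_joules(buf, PJ_PER_BIT_IN_BANK)
print(f"conventional: {conventional:.3f} J, in-memory: {pim:.3f} J, "
      f"ratio: {conventional / pim:.0f}x")
```

Under these made-up constants the in-memory path wins by exactly the ratio of the per-bit costs; the real-world numbers will differ, but the shape of the argument is the same.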

Samsung's Aquabolt-XL HBM-PIM slots right into the company's product stack and works with standard JEDEC-compliant HBM2 memory controllers, so it's a drop-in replacement for standard HBM2 memory. Samsung recently demoed this by swapping its HBM2 memory into a standard Xilinx Alveo FPGA with no modifications to the card, netting a 2.5X system performance gain with a 62% reduction in energy consumption.
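Those two demo figures compound nicely. Folding the 2.5X speedup and 62% energy cut into a single efficiency number (my arithmetic, using only the figures quoted above):

```python
# Fold the Alveo demo figures (2.5x performance, 62% less energy)
# into one performance-per-energy number.
speedup = 2.5
energy_fraction = 1 - 0.62          # the PIM run used 38% of the baseline energy
work_per_joule_gain = speedup / energy_fraction
print(f"~{work_per_joule_gain:.1f}x more work per joule")
```

That works out to roughly 6.6x the work done per joule, which is the metric data centers actually care about.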

While Samsung's PIM tech is already compatible with any standard memory controller, enhanced support from CPU vendors will result in more performance in some scenarios (like not requiring as many threads to fully utilize the processing elements). Samsung tells us that it is testing the HBM2-PIM with an unnamed CPU vendor for use in its future products. Of course, that could be any number of potential manufacturers, be they on the x86 or Arm side of the fence — Intel's Sapphire Rapids, AMD's Genoa, and Arm's Neoverse platforms all support HBM memory (among others).

The company also demoed its AXDIMM, a new acceleration DIMM prototype that performs processing in the buffer chip. Like the HBM2 chip, it can perform FP16 processing using standard TensorFlow and Python code, though Samsung is working feverishly to extend support to other types of software. Samsung says this DIMM type can drop into any DDR4-equipped server with either LRDIMMs or UDIMMs, and we imagine that DDR5 support will follow in due course.

The company says its tests (conducted on a Facebook AI workload) found a 1.8X increase in performance, a 42.6% reduction in energy consumption, and a 70% reduction in tail latency with a 2-rank kit, all of which is very impressive, especially considering that Samsung plugged the DIMMs into a standard server without modifications. Samsung is already testing this in customer servers, so we can expect this tech to come to market in the near future.

Samsung's PIM tech is transferable to any of its memory processes or products, so it has even begun experimenting with PIM memory in LPDDR5 chips, meaning the tech could come to laptops, tablets, and even mobile phones in the future. Samsung is still in the simulation phase with this tech, but its tests of a simulated LPDDR5X-6400 chip claim a 2.3X performance improvement in speech recognition workloads, a 1.8X improvement in transformer-based translation, and a 2.4X increase in GPT-2 text generation. These performance improvements come paired with 3.85X, 2.17X, and 4.35X reductions in power, respectively.
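Taking the simulated LPDDR5X-6400 claims at face value, each speedup and power-reduction pair multiplies into an overall perf-per-watt gain. A quick sketch using only the figures quoted above:

```python
# Combine Samsung's simulated LPDDR5X-6400 claims into perf-per-watt gains:
# (speedup) x (power-reduction factor), per workload.
workloads = {
    "speech recognition": (2.3, 3.85),
    "transformer translation": (1.8, 2.17),
    "GPT-2 text generation": (2.4, 4.35),
}
for name, (speedup, power_cut) in workloads.items():
    print(f"{name}: {speedup * power_cut:.1f}x work per joule")
```

If the simulations hold up in silicon, those are gains of roughly 4x to 10x in efficiency, which is the kind of number that matters most for battery-powered devices.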

This tech is moving rapidly and works with standard memory controllers and existing infrastructure, but it hasn't been certified by the JEDEC standards committee yet, a key hurdle that Samsung needs to jump before seeing widespread adoption. However, the company hopes that the initial PIM spec is accepted into the HBM3 standard later this year.

Speaking of HBM3, Samsung says it will move forward from the FP16 SIMD processing in HBM2 to FP64 in HBM3, meaning the chips will have expanded capabilities. FP16 and FP32 will be reserved for data center use, while INT8 and INT16 will serve the LPDDR5, DDR5, and GDDR6 segments.
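To give a feel for what the jump from FP16 to FP64 buys, here's a tiny demo of the precision gap. This is standard NumPy, nothing PIM-specific; FP16 carries roughly 3 significant decimal digits while FP64 carries about 15.

```python
import numpy as np

# FP16 has a 10-bit mantissa, so increments much smaller than ~0.001
# vanish when added to 1.0; FP64 resolves them easily.
small = 0.0004
half = np.float16(1.0) + np.float16(small)
double = np.float64(1.0) + np.float64(small)
print(half)    # 1.0 -- the increment is lost in half precision
print(double)  # 1.0004
```

This is why FP16 is fine for many AI inference workloads but FP64 is needed before the chips can touch traditional HPC and scientific computing.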

 

Additionally, you lose half the capacity of an 8GB chip if you want the computational power of HBM2 PIM, but there will be no such capacity tradeoff in the future: the chips will have the full standard capacity regardless of their computational capabilities.

 

Samsung will also bring this capability to other types of memory, like GDDR6, and widen the possible applications. CXL support could also be on the horizon. Samsung says its Aquabolt-XL HBM2 chips are available for purchase and integration today, with its other products already working their way through the development pipeline.

 

Who knows, with the rise of AI-based upscaling and rendering techniques, this tech could be more of a game-changer for enthusiasts than we see on the surface. In the future, it's plausible that GPU memory could handle some of the computational workloads to boost GPU performance and reduce energy consumption.

 

My thoughts

Well, this is pretty exciting tech. I've already written a previous tech news post on Samsung's PIM, but seeing it come to other types of memory is pretty nice. I'm not sure how this will benefit the average consumer apart from things like DLSS and XeSS, but those are AI models that have already been trained. That said, for data centers and scientists, where the benefit is more obvious, this might be pretty great news.

Sources

Tom's Hardware

"A high ideal missed by a little, is far better than low ideal that is achievable, yet far less effective"

 

If you think I'm wrong, correct me. If I've offended you in some way tell me what it is and how I can correct it. I want to learn, and along the way one can make mistakes; Being wrong helps you learn what's right.


15 minutes ago, J-from-Nucleon said:

I'm not sure how this'll benefit the average consumer apart from things like DLSS and XeSS, but those are AI that have already been trained.

Here's an article that explains why stuff like this is coming this pretty well: https://www.extremetech.com/computing/325782-for-next-generation-cpus-not-moving-data-is-the-new-1ghz



Great, the AI gibberish again. Kinda weird they didn't mention VR and "Turbo" too while we're in the land of buzzwords.

 

Massive memory sticks already have memory controllers on them. I forgot how they are called, but it's only found on server clusters where you have TB of RAM and every stick has so many modules they need own controller that talks to CPU memory controller so they don't burden it too much. I think Linus even talked about it once.


2 hours ago, WolframaticAlpha said:

Can someone ELI5

offloads tasks from cpu/gpu = moar FPS!

(tldr: you gotta buy new stuffs!)

The direction tells you... the direction

-Scott Manley, 2021

 


4 hours ago, RejZoR said:

Massive memory sticks already have memory controllers on them. I forgot how they are called, but it's only found on server clusters where you have TB of RAM and every stick has so many modules they need own controller that talks to CPU memory controller so they don't burden it too much. I think Linus even talked about it once.

That's barely a controller; it's an extra register (RDIMM). Way back in the DDR2 generation there was Fully Buffered memory (FB-DIMM), not really a controller either.


I do wonder whether we're going to see HBM used in the next consumer GPUs, maybe just for the flagships, given how it's supposed to get cheaper with newer generations while also getting better.

Another thing I'm looking forward to is some low-latency DDR5 kits eventually. It will be a while, yeah; we know some expected speeds, though latencies are another matter.


This should help GPUs in general more than CPUs, at least for consumers. Just look at the lengths both AMD and Nvidia go to for fast memory (Infinity Cache and scorching GDDR6X); this tech can't come to market soon enough.

this is one of the greatest thing that has happened to me recently, and it happened on this forum, those involved have my eternal gratitude http://linustechtips.com/main/topic/198850-update-alex-got-his-moto-g2-lets-get-a-moto-g-for-alexgoeshigh-unofficial/ :')

i use to have the second best link in the world here, but it died ;_; its a 404 now but it will always be here

 


1 hour ago, Doobeedoo said:

I do wonder whether we're going to see HBM used in the next consumer GPUs, maybe just for the flagships, given how it's supposed to get cheaper with newer generations while also getting better.

Not sure about NVIDIA, but probably not for AMD, as they are moving more toward a shared cache pool with cheap GDDR, which seems to be working fine for them while keeping a small bus. If they can also pull off a chiplet design for GPUs, it may get very interesting very fast even without HBM.


Ehhh, I'm not hip on this. I'm sure it's valuable in niche markets where this memory takes form as an "appliance" or requires special calculation acceleration. But in general, IMHO I don't want RAM to "think for itself"; that job belongs to the processor (CPU or GPU).


Just now, StDragon said:

Ehhh, I'm not hip on this. I'm sure it's valuable in niche markets where this memory takes form as an "appliance" or requires special calculation acceleration. But in general, IMHO I don't want RAM to "think for itself"; that job belongs to the processor (CPU or GPU).

Yeah, I can think of only a single practical use in the consumer market, and that's GPUs and upscaling. Otherwise it feels applicable only to CXL, Gen-Z, etc.


7 hours ago, RejZoR said:

Great, the Ai gibberish in it again. Kinda weird they didn't mention VR and "Turbo" too while we're in the land of buzzwords.

 

Massive memory sticks already have memory controllers on them. I forgot how they are called, but it's only found on server clusters where you have TB of RAM and every stick has so many modules they need own controller that talks to CPU memory controller so they don't burden it too much. I think Linus even talked about it once.

You're probably thinking of registered (aka buffered) memory. They have a buffer between the memory controller and the memory chips. It's pretty primitive though and is only there to reduce the electrical load on the memory, not to do any calculations.

 

This on the other hand, is not just buzzwords and "AI gibberish". The chip on these memory sticks actually has quite a lot of compute performance, and they are being tested and deployed in commercial solutions.

If I remember correctly, it is already being deployed in some Xilinx products. I doubt these in-memory processors will be used in consumer products anytime soon though, but in data centers they seem to already have been finding some uses.


My question is: how do you debug what the memory is doing? If it behaves like a normal ram stick it probably doesn't have spare lanes nor standardized protocols for that.


7 minutes ago, Forbidden Wafer said:

My question is: how do you debug what the memory is doing? If it behaves like a normal ram stick it probably doesn't have spare lanes nor standardized protocols for that.

You don't, and that's the problem. It's an abstraction. Now, that might change in future memory standards where you can "pierce the veil", as it were, in order to troubleshoot the underlying hardware directly.

It's the same thing with a HW RAID array. The OS doesn't address the individual disks, just the volume as presented by the HBA. However, the drivers can allow direct access to each disk so you can view and manage them from an application if needed.


6 minutes ago, Forbidden Wafer said:

My question is: how do you debug what the memory is doing? If it behaves like a normal ram stick it probably doesn't have spare lanes nor standardized protocols for that.

Samsung has published the instruction set its controller uses and supports, and has also been in talks with IEEE, so my guess is that they want to standardize it.

But if you want to use it as a drop-in replacement for your current memory, then you probably won't be able to debug it very well. That's just me speculating, though; I'm sure the people designing FPGAs and other circuits can figure something out. Again, I don't think this is for consumers like you or me.


8 minutes ago, The Unknown Voice said:

The next question is:

Will the owners of this forum take a plunge and buy new memory?

 

I know, it's on a need to know basis. The Colonel should de-classify this info, just for MOTF.

 

Arnold always looking to play the part in a new Terminator movie... 🙄 


9 minutes ago, The Unknown Voice said:

The next question is:

Will the owners of this forum take a plunge and buy new memory?

 

I know, it's on a need to know basis. The Colonel should de-classify this info, just for MOTF.

Not only would it have no benefit to the forum, the forum software couldn't use it, and its developers would be extremely unlikely to add support because, well, no benefit.


1 minute ago, StDragon said:

Arnold always looking to play the part in a new Terminator movie... 🙄 

Well, he did proclaim "I'll be back". Good to see he's a man of his word, lol.


1 hour ago, Forbidden Wafer said:

My question is: how do you debug what the memory is doing? If it behaves like a normal ram stick it probably doesn't have spare lanes nor standardized protocols for that.

Anandtech did a live blog of the presentation. One of the questions was about application support: apps need to be recompiled with support for it, but there's no need to modify the source code, so the changes required are at the toolchain level. One example they gave was that TensorFlow Python scripts would work as-is on PIM memory, but TensorFlow itself needed to be recompiled to support it.

 

I don't know how much that answers your question about debugging, but the way I see it, supporting this wouldn't be too hard; it would just require updates to the lower parts of the stack, at least according to Samsung.


10 hours ago, RejZoR said:

Great, the Ai gibberish in it again. Kinda weird they didn't mention VR and "Turbo" too while we're in the land of buzzwords.

 

Massive memory sticks already have memory controllers on them. I forgot how they are called, but it's only found on server clusters where you have TB of RAM and every stick has so many modules they need own controller that talks to CPU memory controller so they don't burden it too much. I think Linus even talked about it once.

It's less accurate to call what you are referring to a "controller", and more accurate to call it a "circuit". As @leadeater and @LAwLz correctly pointed out, what you are describing is a function of registered memory (of which there are typically 3 types/functional methods), and only one could technically be considered a "controller" (fully buffered DIMMs, and even that is a stretch, as the chip itself doesn't dictate what is being written/read, but rather performs a conversion from serial to parallel on an address/data scale). "Circuit" would be universally applicable across all 3 types and would better describe what is physically on registered/FB DIMMs.

 

As for this technology, it would be interesting to see what impact it would have on the CPU's memory controller function if any. Would this allow the memory to theoretically train itself/correct itself in the event that the processor is unable to do so/trains the memory out of spec? If so, that is a serious boon for this technology as far as memory stability is concerned, but it begs the question of what it would do for memory overclocking in general.

 

Currently, memory is dumb. It doesn't know what to do without a controller telling it what to do. If it resides on the DIMM and can alleviate stress from the CPU memory controller or circumvent its function entirely, it would be interesting to see if this results in any gains in performance, specifically operational memory latency if it can process everything on the DIMM without board trace topology mattering at all.

My (incomplete) memory overclocking guide: 

 

Does memory speed impact gaming performance? Click here to find out!

On 1/2/2017 at 9:32 PM, MageTank said:

Sometimes, we all need a little inspiration.

 

 

 


30 minutes ago, MageTank said:

It's less accurate to call what you are referring to a "controller", and more accurate to call it a "circuit". As @leadeater and @LAwLz correctly pointed out, what you are describing is a function of registered memory (of which there are typically 3 types/functional methods), and only one could technically be considered a "controller" (fully buffered DIMMs, and even that is a stretch, as the chip itself doesn't dictate what is being written/read, but rather performs a conversion from serial to parallel on an address/data scale). "Circuit" would be universally applicable across all 3 types and would better describe what is physically on registered/FB DIMMs.

 

As for this technology, it would be interesting to see what impact it would have on the CPU's memory controller function if any. Would this allow the memory to theoretically train itself/correct itself in the event that the processor is unable to do so/trains the memory out of spec? If so, that is a serious boon for this technology as far as memory stability is concerned, but it begs the question of what it would do for memory overclocking in general.

 

Currently, memory is dumb. It doesn't know what to do without a controller telling it what to do. If it resides on the DIMM and can alleviate stress from the CPU memory controller or circumvent its function entirely, it would be interesting to see if this results in any gains in performance, specifically operational memory latency if it can process everything on the DIMM without board trace topology mattering at all.

It's as wrong as calling that bullshit any kind of AI. There's no "AI". It's hardcoded, rigid logic that we've been using since forever to do cache predictions on CPUs. Remember when AMD was advertising Ryzen's cache management as some sort of AI? It's all useless buzzwords. 99% of the stuff people call "AI" is nothing but rigid algorithms designed to do one thing. If there were ever any actual AI, that Ryzen should have learned what I run regularly and optimized performance on its own by 30% over this time. Yet it behaves EXACTLY the same as it did on release day. And while it was a good performer, its magical AI did nothing. Just like this AI does nothing. They just put in place some mechanism that speeds up certain things for data that's commonly shuffled across RAM sticks, and instead of shuffling it on and off and flushing it, they cache it in some clever way. Who cares how exactly; the point is, there is no magical "AI". I wish this BS trend of calling everything "AI" would die already. It's stupid.


7 minutes ago, RejZoR said:

It's as wrong as calling that bullshit any kind of Ai. There's no "Ai". It's hardcoded rigid logic that we've been using since forever to do cache predictions on CPU's. Remember when AMD was advertising Ryzen's cache management as some sort of Ai bullshit? It's all useless buzzwords. 99% of stuff people call "Ai" are nothing but rigid algorithms designed to do one thing. If there was ever any actual Ai, that Ryzen should learn what I run regularly and optimize performance on its own by 30% over this time. Yet it's behaving EXACTLY the same it did on day of release. And while it was good performer it's magical Ai did nothing. Just like this Ai does nothing. They just put in place some mechanism that speeds up certain things for data that's commonly shuffled across RAM sticks and instead of shuffling it on and off and flushing it they cache it in some clever way. Who cares how exactly, the point is, there is no magical "Ai". I wish this BS trend of calling everything "Ai" would die already. It's stupid.

Let me get this straight. I correct your incorrect use of the word "controller" to describe registered DIMM's, and you counter this by going off on a tangent about Samsung's use of AI? I gotta let you know my friend, this changes absolutely nothing, lol.


3 minutes ago, MageTank said:

Let me get this straight. I correct your incorrect use of the word "controller" to describe registered DIMM's, and you counter this by going off on a tangent about Samsung's use of AI? I gotta let you know my friend, this changes absolutely nothing, lol.

You being anal about me calling it a "memory controller" reminds me of someone flipping out about "electrical current" being used as a general term for electrical flow within wires (especially in Slavic languages, afaik); he couldn't shut up about amperes and how power can't be amperes. You remind me of that guy. He stopped when I pointed out that rivers have a current too, and tidal currents are also a thing, and they have nothing to do with electricity...

 

Given I've said (and I'll quote myself for that now):

 

Quote

Massive memory sticks already have memory controllers on them. I forgot how they are called, but it's only found on server clusters where you have TB of RAM and every stick has so many modules they need own controller that talks to CPU memory controller so they don't burden it too much. I think Linus even talked about it once.

you instantly picked up that I'm talking about buffered memory from all the info provided; I just couldn't remember what it was called exactly, so I called it a "memory controller". I forgot LTT isn't a casual place to talk about tech, and that I mistakenly joined a gathering of microcontroller engineers here on the LTT forums, where any mistake is punishable by castration with a jagged RAM stick...


1 minute ago, RejZoR said:

You being anal about me calling it "memory controller" reminds me of someone flipping out about "electrical current" used as general term for electrical flow within wires (especially in Slavic languages afaik) and he couldn't shut up about amperes and how power can't be amperes. You remind me of that guy. He stopped when I mentioned rivers have a current too and tidal currents are also a thing and they have nothing to do with electricity...

 

Given I've said (and I'll quote myself for that now):

 

you instantly picked up I'm talking about buffered memory from all the info provided, I just couldn't remember how it was called exactly so I called it a "memory controller". I forgot LTT isn't a casual place to talk about tech and I mistakenly joined gathering of microcontroller engineers over here at LTT forums and any mistakes are punishable with castration using jagged RAM stick...

I think you are missing the point here. It has nothing to do with you calling it a "memory controller". That part I completely understand; ignorance is rarely intentional, and I can get past that quite easily, especially on a subject as complicated as memory. The part I don't understand is your decision to quote my explanation of controllers vs. circuits and use it as a springboard for a tangent about your personal vendetta against Samsung's use of "AI". If anyone is being anal here, it's the dude shouting about marketing buzzwords he doesn't like at every opportunity he gets.

 

That said, I don't know the other guy you're referring to, but I'll gladly die on this sword and argue until you give it up. It's a slow day at work and HLK testing takes a really long time.

