
Samsung's New HBM2 Memory Thinks for Itself: 1.2 TFLOPS of Embedded Processing Power

Lightwreather

Summary

Today, Samsung announced that its new HBM2-based memory has an integrated AI processor that can push out (up to) 1.2 TFLOPS of embedded computing power, allowing the memory chip itself to perform operations that are usually reserved for CPUs, GPUs, ASICs, or FPGAs. 


Quotes


Today, Samsung announced that its new HBM2-based memory has an integrated AI processor that can push out (up to) 1.2 TFLOPS of embedded computing power, allowing the memory chip itself to perform operations that are usually reserved for CPUs, GPUs, ASICs, or FPGAs. 

The new HBM-PIM (processing-in-memory) chips inject an AI engine inside each memory bank, thus offloading processing operations to the HBM itself. The new class of memory is designed to alleviate the burden of moving data between memory and processors, which is often more expensive in terms of power consumption and time than the actual compute operations.

Samsung says that, when applied to its existing HBM2 Aquabolt memory, the tech can deliver twice the system performance while reducing energy consumption by more than 70%. The company also claims that the new memory doesn't require any software or hardware changes (including to the memory controllers), thus enabling a faster time to market for early adopters. 

Samsung says the memory is already under trials in AI accelerators with leading AI solutions providers. The company expects all validations to be completed in the first half of this year, marking a speedy path to market. 

Each memory bank has an embedded Programmable Computing Unit (PCU) that runs at 300 MHz. The unit is controlled via conventional memory commands from the host to enable in-DRAM processing, and it can execute various FP16 computations. The memory operates in one of two modes: standard mode, in which it behaves as normal HBM2, and FIM mode, for in-memory data processing.
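A loose sketch of the per-bank idea in Python/NumPy. This is purely illustrative: the function name, the accumulator handling, and everything apart from the FP16 datapath and the 16 lanes are assumptions, not Samsung's actual PCU design or command set.

```python
import numpy as np

def pcu_fp16_mac(weights, inputs, acc):
    """Bank-local FP16 multiply-accumulate, as a PCU-style engine might do it.

    Hypothetical model: keeps every operand in FP16 to mirror an
    in-bank FP16 datapath rather than a CPU/GPU's wider registers.
    """
    w = np.asarray(weights, dtype=np.float16)
    x = np.asarray(inputs, dtype=np.float16)
    return (np.asarray(acc, dtype=np.float16) + w * x).astype(np.float16)

# 16 lanes, matching the 16-wide SIMD engine described in Samsung's paper
acc = np.zeros(16, dtype=np.float16)
out = pcu_fp16_mac(np.ones(16), np.full(16, 0.5), acc)
print(out.dtype, out[0])  # float16 0.5
```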

Naturally, making room for the PCU units reduces memory capacity: each PCU-equipped memory die has half the capacity (4Gb) of a standard 8Gb HBM2 die. To help offset that loss, Samsung builds 6GB stacks by combining four 4Gb dies with PCUs and four 8Gb dies without them (as opposed to the 8GB stacks of normal HBM2). 
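The capacity math works out as follows, using just the article's numbers in gigabits:

```python
# Four 4Gb PCU-equipped dies plus four 8Gb plain dies per HBM-PIM stack,
# versus eight 8Gb dies in a normal HBM2 stack.
pim_stack_gbit = 4 * 4 + 4 * 8   # 48 Gb
hbm2_stack_gbit = 8 * 8          # 64 Gb

print(pim_stack_gbit / 8, "GB")   # 6.0 GB per HBM-PIM stack
print(hbm2_stack_gbit / 8, "GB")  # 8.0 GB per normal HBM2 stack
```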

Notably, Samsung's paper and slides refer to the tech as Function-In Memory DRAM (FIMDRAM), but that was an internal codename for the technology that now carries the HBM-PIM brand name. Samsung's examples are based on a 20nm prototype chip that achieves 2.4 Gbps of throughput per pin without increasing power consumption.
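For context, 2.4 Gbps per pin implies the usual Aquabolt-class bandwidth per stack. A quick back-of-envelope check, where the 1024-bit interface width is the HBM2 standard rather than something stated in the article:

```python
gbps_per_pin = 2.4     # per-pin throughput quoted for the prototype
pins_per_stack = 1024  # HBM2's standard per-stack interface width (assumed here)

stack_gbps = gbps_per_pin * pins_per_stack
print(stack_gbps / 8, "GB/s per stack")  # 307.2 GB/s
```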

The paper describes the underlying tech as "Function-In Memory DRAM (FIMDRAM) that integrates a 16-wide single-instruction multiple-data engine within the memory banks and that exploits bank-level parallelism to provide 4× higher processing bandwidth than an off-chip memory solution. Second, we show techniques that do not require any modification to conventional memory controllers and their command protocols, which make FIMDRAM more practical for quick industry adoption."
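The bank-level parallelism the paper describes can be pictured as every bank's 16-wide engine working on its own slice of data at once. A toy model follows; the bank count is an assumption for illustration, and NumPy's vectorized multiply merely stands in for the hardware concurrency:

```python
import numpy as np

N_BANKS = 16  # assumed; only the 16-wide SIMD lanes come from the paper
LANES = 16

# One row per bank: each bank's engine sees only its own 16 lanes.
data = np.arange(N_BANKS * LANES, dtype=np.float16).reshape(N_BANKS, LANES)

# Conceptually all banks apply the operation concurrently, so effective
# processing bandwidth scales with the number of active banks rather
# than with a single shared bus.
result = data * np.float16(2.0)
print(result.shape)  # (16, 16)
```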

Unfortunately, we won't see these capabilities in the latest gaming GPUs, at least for now. Samsung notes that the new memory is destined to satisfy large-scale processing requirements in data centers, HPC systems, and AI-enabled mobile applications. 

 

My thoughts

Well, this is interesting. Memory that can increase system performance without increasing power consumption is an amazing thing. This will be great especially for data-center applications, where power consumption is an important factor, and if (when) it eventually makes its way into consumer products, I think it will be great there too. But as with any new technology, we'll have to see how it holds up in the real world, how cost-effective it is, and how useful it proves.

Sources

https://www.tomshardware.com/news/samsung-hbm2-hbm-pim-memory-tflops

"A high ideal missed by a little, is far better than low ideal that is achievable, yet far less effective"

 

If you think I'm wrong, correct me. If I've offended you in some way tell me what it is and how I can correct it. I want to learn, and along the way one can make mistakes; Being wrong helps you learn what's right.


that's what I'd call real "smart memory"



50 minutes ago, Drama Lama said:

that's what I'd call real "smart memory"

And exactly what AMD has been needing for their GPU.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


That's really cool. I'd like to know what operations and data sizes it can actually process, though. Not all things are equal in this area, so that's kind of important.


I wonder when HBM will get back to consumer cards.

| Ryzen 7 7800X3D | AM5 B650 Aorus Elite AX | G.Skill Trident Z5 Neo RGB DDR5 32GB 6000MHz C30 | Sapphire PULSE Radeon RX 7900 XTX | Samsung 990 PRO 1TB with heatsink | Arctic Liquid Freezer II 360 | Seasonic Focus GX-850 | Lian Li Lanccool III | Mousepad: Skypad 3.0 XL / Zowie GTF-X | Mouse: Zowie S1-C | Keyboard: Ducky One 3 TKL (Cherry MX-Speed-Silver)Beyerdynamic MMX 300 (2nd Gen) | Acer XV272U | OS: Windows 11 |


1 hour ago, Doobeedoo said:

I wonder when HBM will get back to consumer cards.

Only when it gives a lot more performance than GDDR or is a lot cheaper.



1 hour ago, Drama Lama said:

Only when it gives a lot more performance than GDDR or is a lot cheaper.

The former is much more likely to occur than the latter, as I'd imagine the R&D for something like smart processing inside the memory module was not cheap. 

GPU: XFX RX 7900 XTX

CPU: Ryzen 7 7800X3D


Don't we already have HBM2 memory? Wouldn't this just be HBM2 2 memory then?

✨FNIGE✨


I can see this going mainstream, then being exploited, and it turning into a huge incident. Lol

"If a Lobster is a fish because it moves by jumping, then a kangaroo is a bird" - Admiral Paulo de Castro Moreira da Silva

"There is nothing more difficult than fixing something that isn't all the way broken yet." - Author Unknown

Spoiler

Intel Core i7-3960X @ 4.6 GHz - Asus P9X79WS/IPMI - 12GB DDR3-1600 quad-channel - EVGA GTX 1080ti SC - Fractal Design Define R5 - 500GB Crucial MX200 - NH-D15 - Logitech G710+ - Mionix Naos 7000 - Sennheiser PC350 w/Topping VX-1


3 hours ago, SlimyPython said:

Don't we already have HBM2 memory? Wouldn't this just be HBM2 2 memory then?

I read it as HBM choo choo and I'm not even ashamed 

PC: 5600x @ 4.85GHz // RTX 3080 Eagle OC // 16GB Trident Z Neo  // Corsair RM750X // MSI B550M Mortar Wi-Fi // Noctua NH-D15S // Cooler Master NR400 // Samsung 50QN90A // Logitech G305 // Corsair K65 // Corsair Virtuoso //


11 hours ago, Drama Lama said:

Only when it gives a lot more performance than GDDR or is a lot cheaper.

It definitely will give a big improvement in performance, but yeah, price-wise maybe only for flagships.

| Ryzen 7 7800X3D | AM5 B650 Aorus Elite AX | G.Skill Trident Z5 Neo RGB DDR5 32GB 6000MHz C30 | Sapphire PULSE Radeon RX 7900 XTX | Samsung 990 PRO 1TB with heatsink | Arctic Liquid Freezer II 360 | Seasonic Focus GX-850 | Lian Li Lanccool III | Mousepad: Skypad 3.0 XL / Zowie GTF-X | Mouse: Zowie S1-C | Keyboard: Ducky One 3 TKL (Cherry MX-Speed-Silver)Beyerdynamic MMX 300 (2nd Gen) | Acer XV272U | OS: Windows 11 |


57 minutes ago, gabrielcarvfer said:

Each PCU seems to be one or a couple of GPU stream processors based on DOI 10.1109/EPEPS47316.2019.193209 .

 

Samsung made a deal to use AMD graphics in the same year the paper got published, so it wouldn't surprise me if this is where it ended up being used.

Probably won't happen, but it would be interesting if RT and/or image upscaling were done on HBM2 and the GPU core were left to do everything else.


Sounds like a really awful place to get malware.

Not a pro, not even very good.  I’m just old and have time currently.  Assuming I know a lot about computers can be a mistake.

 

Life is like a bowl of chocolates: there are all these little crinkly paper cups everywhere.


12 minutes ago, Bombastinator said:

Sounds like a really awful place to get malware.

Or the perfect place for firmware level crypto mining malware/hijacking lol


On 2/17/2021 at 2:43 AM, leadeater said:

That's really cool. I'd like to know what operations and data sizes it can actually process, though. Not all things are equal in this area, so that's kind of important.

I found this presentation interesting. Check it out. It goes a little bit into the research. 
 

 


My understanding is that anything Turing complete can run literally anything. It may not do it well or easily, but it can do it. There is talk of a "CPU" attached to the memory. Can it be used to run arbitrary things? 

Not a pro, not even very good.  I’m just old and have time currently.  Assuming I know a lot about computers can be a mistake.

 

Life is like a bowl of chocolates: there are all these little crinkly paper cups everywhere.


Well, damn... Can it turn all of the NPCs into tig biddy onee-sans?? PLEASE LET ME HAVE THIS ONE THING!

Cor Caeruleus Reborn v6

Spoiler

CPU: Intel - Core i7-8700K

CPU Cooler: be quiet! - PURE ROCK 
Thermal Compound: Arctic Silver - 5 High-Density Polysynthetic Silver 3.5g Thermal Paste 
Motherboard: ASRock Z370 Extreme4
Memory: G.Skill TridentZ RGB 2x8GB 3200/14
Storage: Samsung - 850 EVO-Series 500GB 2.5" Solid State Drive 
Storage: Samsung - 960 EVO 500GB M.2-2280 Solid State Drive
Storage: Western Digital - Blue 2TB 3.5" 5400RPM Internal Hard Drive
Storage: Western Digital - BLACK SERIES 3TB 3.5" 7200RPM Internal Hard Drive
Video Card: EVGA - 970 SSC ACX (1080 is in RMA)
Case: Fractal Design - Define R5 w/Window (Black) ATX Mid Tower Case
Power Supply: EVGA - SuperNOVA P2 750W with CableMod blue/black Pro Series
Optical Drive: LG - WH16NS40 Blu-Ray/DVD/CD Writer 
Operating System: Microsoft - Windows 10 Pro OEM 64-bit and Linux Mint Serena
Keyboard: Logitech - G910 Orion Spectrum RGB Wired Gaming Keyboard
Mouse: Logitech - G502 Wired Optical Mouse
Headphones: Logitech - G430 7.1 Channel  Headset
Speakers: Logitech - Z506 155W 5.1ch Speakers

 


2 hours ago, ARikozuM said:

Well, damn... Can it turn all of the NPCs into tig biddy onee-sans?? PLEASE LET ME HAVE THIS ONE THING!

No. It's bad for you. And the onee-sans' backs will hurt.

Not a pro, not even very good.  I’m just old and have time currently.  Assuming I know a lot about computers can be a mistake.

 

Life is like a bowl of chocolates: there are all these little crinkly paper cups everywhere.


They are basically going back to old-school hardware design: no fancy software. The Central Navigation Computer in Trident submarines was designed this way, with 1950s technology. It utilized about 900 circuit cards grouped according to instruction set, each set controlled by its own processor. Commands were entered in binary and/or hexadecimal via a row of push buttons on the front panel, and data was accessed via a companion magnetic reel-to-reel tape unit.

A simplistic design, but it allowed for accurate and instantaneous processing of instructions (and very fast repair, by quickly replacing the affected circuit card) to launch Trident nuclear missiles. It seems Samsung has hired an old-school engineer and is bringing back simple, direct-path hardware communication concepts while waving the "look what we thought of" banner.

Could be wrong. JMTC 


But can it run Doom?

Specs: Motherboard: Asus X470-PLUS TUF gaming (Yes I know it's poor but I wasn't informed) RAM: Corsair VENGEANCE® LPX DDR4 3200Mhz CL16-18-18-36 2x8GB

            CPU: Ryzen 9 5900X          Case: Antec P8     PSU: Corsair RM850x                        Cooler: Antec K240 with two Noctura Industrial PPC 3000 PWM

            Drives: Samsung 970 EVO plus 250GB, Micron 1100 2TB, Seagate ST4000DM000/1F2168 GPU: EVGA RTX 2080 ti Black edition


On 2/17/2021 at 11:04 AM, Dabombinable said:

And exactly what AMD has been needing for their GPU.

Not really. AMD proved that with smart design you don't need the very latest memory running on the edge, like GDDR6X; they achieved this with a larger GPU-level cache (Infinity Cache). They've done similar things in the past, like the ring-bus memory on the Radeon X1800 series. Imagination did something similar with the Kyro graphics cards, which did tile-based rendering and Z-buffer compression, a huge thing back in the day when no one else was doing it. Z-buffer compression became a thing years later on the Radeon 7000, and tile-based rendering arrived as late as the GeForce GTX 900 series.


1 minute ago, RejZoR said:

Not really. AMD proved that with smart design you don't need the very latest memory running on the edge, like GDDR6X; they achieved this with a larger GPU-level cache (Infinity Cache). They've done similar things in the past, like the ring-bus memory on the Radeon X1800 series. Imagination did something similar with the Kyro graphics cards, which did tile-based rendering and Z-buffer compression, a huge thing back in the day when no one else was doing it. Z-buffer compression became a thing years later on the Radeon 7000, and tile-based rendering arrived as late as the GeForce GTX 900 series.

Considering that even now VRAM overclocking gets the best results, much like on my old graphics cards (some on a 64-bit bus, some with lower-than-official VRAM clocks).

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL


2 minutes ago, Dabombinable said:

Considering that even now VRAM overclocking gets the best results, much like on my old graphics cards (some on a 64-bit bus, some with lower-than-official VRAM clocks).

Overclocking RAM these days is very hard because there are no visible cues when something is wrong. You can literally set your memory to 50 GHz and it'll just work, but you'll get worse performance, so you have to actually benchmark the memory overclock and judge how high to clock solely by the observed gains or losses in memory performance. That usually means painstakingly watching for tiny framerate changes across repeated benchmark runs, unless you know a test that is heavily memory-bound and would show the changes more vividly. I was using the 3DMark Port Royal test, and the framerate changes were very subtle, at least at the resolutions I used.
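The procedure described above boils down to a noise-aware comparison of repeated runs. A minimal sketch, where the framerate numbers and the 0.5 FPS noise margin are made up for illustration:

```python
import statistics

def overclock_helps(baseline_fps, overclocked_fps, margin=0.5):
    """Accept a memory overclock only if mean FPS improves beyond a noise margin."""
    gain = statistics.mean(overclocked_fps) - statistics.mean(baseline_fps)
    return gain > margin

# Hypothetical repeated benchmark runs at stock vs. aggressive memory clocks
stock = [61.2, 60.8, 61.0, 61.4, 60.9]
aggressive = [60.1, 60.4, 59.9, 60.2, 60.0]

# The higher clock silently *loses* performance, so reject it
print(overclock_helps(stock, aggressive))  # False
```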


On 2/17/2021 at 4:43 AM, leadeater said:

That's really cool. I'd like to know what operations and data sizes it can actually process, though. Not all things are equal in this area, so that's kind of important.

Since it's characterized in TFLOPS and not TOPS, and it's advertised as an "AI accelerator" of sorts, I'd venture to guess that it will wind up being 32-bit floating-point multiply-accumulate operations. Likely, to achieve that much performance, you will have to be doing fused multiply-adds.
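One way to sanity-check the headline 1.2 TFLOPS against the article's figures: at 300 MHz with a 16-wide SIMD engine doing one fused multiply-add (counted as 2 FLOPs, an assumption) per lane per cycle, each PCU supplies 9.6 GFLOPS, so the headline number would need roughly 125 active PCUs across the device. A rough consistency check under assumed conditions, not a claim about the real configuration:

```python
clock_hz = 300e6    # PCU clock from the article
lanes = 16          # 16-wide SIMD from the paper
flops_per_lane = 2  # fused multiply-add counted as two FLOPs (assumed)

per_pcu_flops = clock_hz * lanes * flops_per_lane
pcus_needed = 1.2e12 / per_pcu_flops

print(per_pcu_flops / 1e9, "GFLOPS per PCU")  # 9.6
print(round(pcus_needed))                     # 125
```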

Vector processors are pretty neat, and can increase performance in some applications.

Still, one wonders how an OS might expose an in-memory processor for general use. That will have to be solved pretty neatly before it becomes useful.

ENCRYPTION IS NOT A CRIME

