TSMC reveals their new chip stacking Wafer on Wafer (WoW)

WMGroomAK

During its tech symposium, TSMC unveiled their new Wafer-on-Wafer (WoW) silicon stacking technology which, if implemented, will allow two silicon wafers to be stacked on top of each other to connect chips, as opposed to having multiple dies sit side by side on an interposer.  According to the Overclock3d article, the ideal use case is wafers with chip yields greater than 90%, and to reduce thermal risk the tech is better suited to low-power parts.

 

https://overclock3d.net/news/misc_hardware/tsmc_reveals_wafer-on-wafer_chip_stacking_technology_-_wow/1

Quote

At the TSMC Technology Symposium, the company has unveiled their new Wafer-on-Wafer (WOW) technology, a form of 3D stacking for silicon wafers. The new technique can connect chips on two silicon wafers using through-silicon via (TSV) connections, acting similarly to today's 3D NAND technology. 

 

This technique is different from what we see today with some multi-die silicon, which has multiple dies sitting side-by-side either on top of an interposer or using Intel's EMIB technology. TSMC's WoW technology can connect two dies directly and with minimal data transfer times thanks to the small distance between chips, creating silicon which offers high levels of performance and a smaller overall footprint.  

Notice that this new tech is called Wafer-on-Wafer and not die-on-die: this technique stacks silicon while it is still within its original wafer, which brings both advantages and disadvantages.

 
The advantage here is that this tech can connect two wafers of dies at once. Imagine an alternative method where we connect individual dies in the same way, offering a lot less parallelisation within the manufacturing process and the possibility of higher final costs. 
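A quick way to see that manufacturing win: one aligned wafer-to-wafer bond joins every die pair on the two wafers at once, while a die-on-die flow needs one pick-and-place bond per pair. A toy comparison (the die count is a made-up, hypothetical figure):

```python
# Wafer-on-Wafer vs die-on-die: bonding operations needed to join N die pairs.
# All numbers here are hypothetical, purely for illustration.

dies_per_wafer = 600  # hypothetical die count on a wafer

wow_bond_ops = 1                      # one aligned wafer bond joins every pair at once
die_on_die_bond_ops = dies_per_wafer  # one bond operation per individual die pair

print(f"WoW: {wow_bond_ops} bonding operation for {dies_per_wafer} die pairs")
print(f"Die-on-die: {die_on_die_bond_ops} bonding operations for the same pairs")
```

The parallelism advantage grows linearly with the number of dies per wafer, which is why the cost argument favours bonding at wafer scale.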

 

With Wafer-on-Wafer technology, the problem comes when faulty dies on one layer merge with working chips on the second layer, lowering overall yields. This issue prevents the technology from being viable for silicon that doesn't already offer high yields on a wafer-by-wafer basis. Ideally, chip yields should be 90% or higher to use TSMC's Wafer-on-Wafer technology. 
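The arithmetic behind that 90% figure is worth spelling out: if defects on the two wafers occur independently, the stacked yield is roughly the product of the per-wafer yields. A small sketch (assuming independent defects, which is a simplification):

```python
# Stacked yield: a stacked part works only if BOTH of its dies are good.
# Assuming defects on the two wafers are statistically independent,
# the combined yield is the product of the individual wafer yields.

def stacked_yield(yield_top: float, yield_bottom: float) -> float:
    """Fraction of stacked pairs in which both dies are functional."""
    return yield_top * yield_bottom

for y in (0.99, 0.95, 0.90, 0.80):
    print(f"per-wafer yield {y:.0%} -> stacked yield {stacked_yield(y, y):.2%}")
```

At 90% per wafer the stacked yield already drops to 81%, and at 80% it falls to 64%, which is why the article says the technique only makes sense for wafers that are near-perfect to begin with.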

 

Another potential issue comes when two heat-producing pieces of silicon are stacked on top of each other, creating a situation where heat density could become a limiting factor for stacked silicon. This thermal concern makes WoW-connected chips most suitable for low-power silicon, where heat is less of an issue. 
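The heat-density concern is also simple arithmetic: two dies dissipating power in the footprint of one roughly doubles the power per unit area the cooler has to pull through the top of the stack. A toy illustration (the wattages and die area are hypothetical):

```python
# Power density when stacking: two dies share the footprint of one.
# Wattages and die area below are hypothetical, for illustration only.

def power_density(total_watts: float, footprint_mm2: float) -> float:
    """Power density in W/mm^2 over the package footprint."""
    return total_watts / footprint_mm2

footprint = 100.0  # mm^2, hypothetical die footprint

single_die = power_density(50.0, footprint)      # one 50 W die
stacked = power_density(50.0 + 50.0, footprint)  # two 50 W dies, same footprint
low_power = power_density(5.0 + 5.0, footprint)  # two 5 W mobile-class dies

print(f"single die: {single_die:.2f} W/mm^2")
print(f"stacked:    {stacked:.2f} W/mm^2")  # double the density, same footprint
print(f"stacked low-power: {low_power:.2f} W/mm^2")
```

Stacking two low-power dies keeps the absolute density small even though it still doubles, which matches the article's point that WoW suits low-power silicon.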

 

TSMC currently manufactures graphics cards for both AMD and Nvidia as well as the silicon used for all major games consoles, giving this technology the potential to improve a wide range of future products. The one remaining question is this process' viability when used with high-powered components. 

The direct die-to-die connectivity of WoW technology allows silicon to communicate exceptionally quickly and with minimal latencies, opening up the possibility of chip creation where two dies can be interconnected with few downsides.

It might be interesting to see if AMD and/or Nvidia use this to interconnect their GPUs directly to the memory.  Basically, stick the low power/heat silicon on the bottom and have a direct connection to the high power/heat silicon that's on the top...  It might also be fairly useful for something like image processors and/or upcoming phones.  I doubt that we'll see full GPU dies directly connected to CPU dies anytime soon due to heat/power constraints (although that could be interesting :)). 


8 minutes ago, WMGroomAK said:

During its tech symposium, TSMC unveiled their new Wafer-on-Wafer (WoW) silicon stacking technology which, if implemented, will allow two silicon wafers to be stacked on top of each other to connect chips, as opposed to having multiple dies sit side by side on an interposer.  According to the Overclock3d article, the ideal use case is wafers with chip yields greater than 90%, and to reduce thermal risk the tech is better suited to low-power parts.

 

https://overclock3d.net/news/misc_hardware/tsmc_reveals_wafer-on-wafer_chip_stacking_technology_-_wow/1

It might be interesting to see if AMD and/or Nvidia use this to interconnect their GPUs directly to the memory.  Basically, stick the low power/heat silicon on the bottom and have a direct connection to the high power/heat silicon that's on the top...  It might also be fairly useful for something like image processors and/or upcoming phones.  I doubt that we'll see full GPU dies directly connected to CPU dies anytime soon due to heat/power constraints (although that could be interesting :)). 

I mean the memory on a GPU would be too hot based on what they are saying. Also, if they could do this, wouldn't they also be able to do something similar with CPUs, giving them an L4 cache of sorts? It would be interesting to see if this could allow for something like that. 


11 minutes ago, Brooksie359 said:

I mean the memory on a GPU would be too hot based on what they are saying. Also, if they could do this, wouldn't they also be able to do something similar with CPUs, giving them an L4 cache of sorts? It would be interesting to see if this could allow for something like that. 

They might be able to...  I've been reading EETimes' write-up of the TSMC roadmap, and it seems like WoW is being added to supplement some of the other options they currently offer, such as Chip on Wafer on Substrate (CoWoS).  

 

https://www.eetimes.com/document.asp?doc_id=1333244&page_number=3

Quote

But that’s not all. TSMC introduced two wholly new packaging options.

 

A wafer-on-wafer pack (WoW) directly bonds up to three dice. It was released last week, but users need to ensure that their EDA flows support the bonding technique. It will get EMI support in June.

 

Finally, the foundry roughly described something that it called system-on-integrated-chips (SoICs) using less than 10-micron interconnects to link two dice. Details of the process and its target apps were sketchy for the capability that will be released sometime next year.

Although I have to admit that the rest of their roadmap is fairly interesting, as TSMC is planning on producing 1.1 million wafers/year using the 10/7-nm nodes and beginning production using 5 nm EUV machines sometime in 2020.  


2 minutes ago, Misanthrope said:

How would heat dissipation work here? Intuitively it looks to me as if the bottom chip would just get cooked.

I think that is why they are looking at this primarily for low-power chips, or to have the low-power silicon be the lower wafer in the setup...  


Keep in mind that at least for NAND, higher heat generally improves the lifespan (or at least doesn't substantially decrease it). All those throttling issues with the 950 Pro and other SSDs were due to the controller, not the flash itself.

 

Obviously, DRAM and HBM are different beasts, but I wonder if there's any potential use case for a CPU/GPU with a slower form of storage effectively on-die.


I mean, in theory this sounds interesting, I suppose, but in practice I see it being extremely limited in application, or creating a lot of "magic smoke" leaks.

https://linustechtips.com/main/topic/631048-psu-tier-list-updated/ Tier Breakdown (My understanding)--1 Godly, 2 Great, 3 Good, 4 Average, 5 Meh, 6 Bad, 7 Awful

 


So this seems to function exactly like HBM. HBM runs at a much slower clock frequency than single-layer memory so it won't overheat. With this technology, we might see a massive decrease in frequency, maybe back down to 800-1000 MHz. But if it doubles the performance, who cares? It might even bring cooler GPUs as well. After all, GPUs aren't as intolerant of low frequencies as CPUs, so it might actually work.
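The HBM comparison can be made concrete: bandwidth is roughly interface width times data rate, so a much wider stacked interface can more than make up for a big clock drop. A rough sketch with made-up, hypothetical numbers (not real product specs):

```python
# Wide-and-slow vs narrow-and-fast memory interfaces.
# Bus widths and data rates below are hypothetical, for illustration only.

def bandwidth_gbs(bus_width_bits: int, data_rate_mtps: float) -> float:
    """Peak bandwidth in GB/s: (width in bytes) * (mega-transfers/s) / 1000."""
    return bus_width_bits / 8 * data_rate_mtps / 1000

narrow_fast = bandwidth_gbs(256, 7000)   # GDDR-style: 256-bit bus at 7000 MT/s
wide_slow = bandwidth_gbs(4096, 1000)    # HBM-style: 4096-bit bus at 1000 MT/s

print(f"narrow/fast: {narrow_fast:.0f} GB/s")
print(f"wide/slow:   {wide_slow:.0f} GB/s")
```

The wide, slow interface ends up with more than double the bandwidth here despite the much lower clock, which is the trade Notional is describing.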

 

As for AMD, Navi has had a lot of speculation. Two of the rumours are that it will be a mid-range chip only and that it could be scalable using "glued together" chips, just like EPYC and Threadripper. Again, unlike those CPUs, a GPU could handle the latency of Infinity Fabric a lot better.

I don't think this technology will be ready for Navi, however. But AMD is taping out both CPUs and GPUs on 7 nm right now, with seemingly great yields (unlike a certain company and its dumpster fire called 10 nm).

 

I guess with the stagnation of process node shrinks, we were forced to see either massive chips, glued-together smaller chips, or stacked chips. It's going to be very interesting to see what they can do with actual stacked logic.

Watching Intel have competition is like watching a headless chicken trying to get out of a mine field

CPU: Intel I7 4790K@4.6 with NZXT X31 AIO; MOTHERBOARD: ASUS Z97 Maximus VII Ranger; RAM: 8 GB Kingston HyperX 1600 DDR3; GFX: ASUS R9 290 4GB; CASE: Lian Li v700wx; STORAGE: Corsair Force 3 120GB SSD; Samsung 850 500GB SSD; Various old Seagates; PSU: Corsair RM650; MONITOR: 2x 20" Dell IPS; KEYBOARD/MOUSE: Logitech K810/ MX Master; OS: Windows 10 Pro


1 hour ago, Notional said:

So this seems to function exactly like HBM. HBM runs at a much slower clock frequency than single-layer memory so it won't overheat. With this technology, we might see a massive decrease in frequency, maybe back down to 800-1000 MHz. But if it doubles the performance, who cares? It might even bring cooler GPUs as well. After all, GPUs aren't as intolerant of low frequencies as CPUs, so it might actually work.

 

As for AMD, Navi has had a lot of speculation. Two of the rumours are that it will be a mid-range chip only and that it could be scalable using "glued together" chips, just like EPYC and Threadripper. Again, unlike those CPUs, a GPU could handle the latency of Infinity Fabric a lot better.

I don't think this technology will be ready for Navi, however. But AMD is taping out both CPUs and GPUs on 7 nm right now, with seemingly great yields (unlike a certain company and its dumpster fire called 10 nm).

 

I guess with the stagnation of process node shrinks, we were forced to see either massive chips, glued-together smaller chips, or stacked chips. It's going to be very interesting to see what they can do with actual stacked logic.

 

 

[Image: diagram of the two problem areas, blue and red]

 

Doing a stacked-die approach will still have the same problems as a multi-die approach. APIs must change to hide that latency, and although there is less latency than with an MCM-type approach, GPUs don't automatically hide latency better. They hide latency well when their control silicon can see what all the ALUs, ROPs, GUs, and TMUs (all parts of the GPU) are doing, and with a multi-die or stacked-die approach this isn't always possible. Engines must change and APIs must change. Without that visibility, there is no way to hide latency between the two stacked GPUs.

 

Right now APIs don't allow this, nor do current chip designs, since chip design is also based on how APIs are laid out. We don't have bidirectional communication between the fixed-function units and the control silicon and programmable units (ALUs) at either a hardware or a software level, so those need to change. For that to change, it takes a lot more transistors (and when I say a lot, I mean A LOT lol): more cache, both intra-die (L1, L2) and inter-die (global memory, or memory access fast enough for off-chip data storage, which is not easy to do as it will introduce even more latency), plus the extra control silicon to handle the increased amount of data it must keep track of; or else a full shift to using only compute shaders for everything a GPU does. Neither of these will happen anytime soon.

 

As you can see, this has nothing to do with the stacking tech itself; it has everything to do with the design of the uarch, which will force API changes (or vice versa, or both at the same time).


5 hours ago, Notional said:

So this seems to function exactly like HBM. HBM runs at a much slower clock frequency than single-layer memory so it won't overheat. With this technology, we might see a massive decrease in frequency, maybe back down to 800-1000 MHz. But if it doubles the performance, who cares? It might even bring cooler GPUs as well. After all, GPUs aren't as intolerant of low frequencies as CPUs, so it might actually work.

 

That may seem like a good idea, but GPUs, unlike CPUs, generally run at 100% most of the time (not if you're on the LTT forum, but you get my point), and as said before, even with lower clocks the bottom half cannot dissipate the heat. There would have to be some miracle solution.

.


2 hours ago, asus killer said:

That may seem like a good idea, but GPUs, unlike CPUs, generally run at 100% most of the time (not if you're on the LTT forum, but you get my point), and as said before, even with lower clocks the bottom half cannot dissipate the heat. There would have to be some miracle solution.

There is a good chance that if you are running a water cooler it will be alright, but with standard air coolers you are right, it will overheat. After all, they wouldn't call it a huge innovation if there wasn't a way to cool it.


Wonder if you could externalize the L3 cache to the bottom die to make room for more cores, other logic, and a bigger L2 cache. L3 cache is a very significant portion of the die area, and a purpose-built design for such a thing might not have that big a latency penalty, IF latencies are high for reasons other than just cross-die communication.

 

Intel has already drastically changed their cache architecture and inter-core data flows, and also increased the L2 cache size per core to accommodate it.

[Image: Intel Xeon mesh cache architecture diagram]

 

TSV connections at all the currently indicated points to the L3 die might reduce L3 cache area usage by 50%-70%, maybe? Might be worth it?
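The potential area saving is easy to sketch: if L3 takes some fraction of the logic die and part of it moves to a stacked die, the freed area scales with that fraction. All numbers below are hypothetical, just to put that 50%-70% guess in context:

```python
# Area freed on the logic die if part of the L3 moves to a stacked die.
# Die size and L3 fraction are hypothetical, for illustration only.

def area_freed(die_mm2: float, l3_fraction: float, moved_fraction: float) -> float:
    """mm^2 reclaimed when `moved_fraction` of the L3 area leaves the logic die."""
    return die_mm2 * l3_fraction * moved_fraction

die = 300.0      # mm^2, hypothetical server-class die
l3_share = 0.25  # hypothetical: L3 occupies a quarter of the die

for moved in (0.5, 0.7):
    freed = area_freed(die, l3_share, moved)
    print(f"moving {moved:.0%} of the L3 off-die frees {freed:.1f} mm^2")
```

That reclaimed area is what would be available for extra cores, larger L2, or other logic on the top die.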


1 hour ago, leadeater said:

Wonder if you could externalize the L3 cache to the bottom die to make room for more cores, other logic, and a bigger L2 cache. L3 cache is a very significant portion of the die area, and a purpose-built design for such a thing might not have that big a latency penalty, IF latencies are high for reasons other than just cross-die communication.

 

Intel has already drastically changed their cache architecture and inter-core data flows, and also increased the L2 cache size per core to accommodate it.

[Image: Intel Xeon mesh cache architecture diagram]

 

TSV connections at all the currently indicated points to the L3 die might reduce L3 cache area usage by 50%-70%, maybe? Might be worth it?

You'd probably end up with something akin to the Zeppelin layout that AMD uses, where there is a central connection bridge in the middle of the die design. It might also have some benefit for keeping the heat somewhat separated. 

 

Oddly enough, I was thinking about this not that long ago, but the use case is going to be kind of limited, since you need really high yields and placing parts on top of parts would highly benefit from symmetrical layouts. "Die on Die", i.e. a chiplet technology, is still the key direction, but this could be very interesting for low-power applications. I expect this to show up in the mobile space, given that physical space is at a big premium for those parts.


6 minutes ago, Taf the Ghost said:

You'd probably end up with something akin to the Zeppelin layout that AMD uses, where there is a central connection bridge in the middle of the die design. It might also have some benefit for keeping the heat somewhat separated.

There is no reason to have a central point, and it's actually suboptimal to do that; AMD is only doing it that way because the design is MCM and is connected by a substrate. Die stacking would allow for more connection points without creating a super complicated substrate.

 

If you're trying to push all your cache traffic down a single path then it becomes a bandwidth bottleneck and starts increasing latency. You already see the effects of the increased latency in Intel's Mesh arch for the most distant cores.

 

NAND already has a great many TSV connections between dies; you could do the same at each L3 junction point shown in the mesh diagram in the previous post.

 

[Image: TSV connections between stacked NAND dies]

 

Just keep in mind 3D NAND is currently built on roughly a 30-40 nm process.


I feel like the "better suited for low-power parts" is more of a requirement than a suggestion. Maybe this tech will only take off for phone processors and memory.

CPU - Ryzen Threadripper 2950X | Motherboard - X399 GAMING PRO CARBON AC | RAM - G.Skill Trident Z RGB 4x8GB DDR4-3200 14-13-13-21 | GPU - Aorus GTX 1080 Ti Waterforce WB Xtreme Edition | Case - Inwin 909 (Silver) | Storage - Samsung 950 Pro 500GB, Samsung 970 Evo 500GB, Samsung 840 Evo 500GB, HGST DeskStar 6TB, WD Black 2TB | PSU - Corsair AX1600i | Display - DELL ULTRASHARP U3415W |


7 hours ago, asus killer said:

That may seem like a good idea, but GPUs, unlike CPUs, generally run at 100% most of the time (not if you're on the LTT forum, but you get my point), and as said before, even with lower clocks the bottom half cannot dissipate the heat. There would have to be some miracle solution.

Not by itself. But you can use through-silicon metal rods to move heat. You can have a layer in between the silicon that either dissipates heat or moves it, like with the Polaris controller in Samsung's NVMe Pro drives. So yeah, you can't just slap them together, but that doesn't mean there aren't solutions to the problem.

 

Heck, you can even (maybe) go full liquid cooling inside the chips:

[Image: microfluidic in-chip liquid cooling diagram]

http://electronicpackaging.asmedigitalcollection.asme.org/article.aspx?articleid=2469021

 

Although that does seem quite exotic.



30 minutes ago, Notional said:

Not by itself. But you can use through-silicon metal rods to move heat. You can have a layer in between the silicon that either dissipates heat or moves it, like with the Polaris controller in Samsung's NVMe Pro drives. So yeah, you can't just slap them together, but that doesn't mean there aren't solutions to the problem.

 

Heck, you can even (maybe) go full liquid cooling inside the chips:

[Image: microfluidic in-chip liquid cooling diagram]

http://electronicpackaging.asmedigitalcollection.asme.org/article.aspx?articleid=2469021

 

Although that does seem quite exotic.

I'm going to go way outside my comfort zone here, but don't you have to account for the TSV connections when using some of those methods? 


Link to comment
Share on other sites

Link to post
Share on other sites

3 minutes ago, asus killer said:

I'm going to go way outside my comfort zone here, but don't you have to account for the TSV connections when using some of those methods? 

Absolutely. But not all fluids are conductive.

As for metal, I'm not sure how it works. Maybe holes the TSVs can go through without contact? This is what Samsung's Polaris controllers use:

[Image: Samsung heat-spreader sticker diagram]

 

Of course it's stylized, so it's impossible to know how it actually works.



Oh neat. In the end it seems everything will be interconnected, with things like in-chip cooling tech. I mean, that'd be amazing.

| Ryzen 7 7800X3D | AM5 B650 Aorus Elite AX | G.Skill Trident Z5 Neo RGB DDR5 32GB 6000MHz C30 | Sapphire PULSE Radeon RX 7900 XTX | Samsung 990 PRO 1TB with heatsink | Arctic Liquid Freezer II 360 | Seasonic Focus GX-850 | Lian Li Lanccool III | Mousepad: Skypad 3.0 XL / Zowie GTF-X | Mouse: Zowie S1-C | Keyboard: Ducky One 3 TKL (Cherry MX-Speed-Silver)Beyerdynamic MMX 300 (2nd Gen) | Acer XV272U | OS: Windows 11 |


9 hours ago, asus killer said:

That may seem like a good idea, but GPUs, unlike CPUs, generally run at 100% most of the time (not if you're on the LTT forum, but you get my point), and as said before, even with lower clocks the bottom half cannot dissipate the heat. There would have to be some miracle solution.

Could you not use aluminium- or copper-cored PCB materials to help dissipate the lower heat flux to the backplate? That would help increase thermal dissipation for the lower silicon.

My Folding Stats - Join the fight against COVID-19 with FOLDING! - If someone has helped you out on the forum don't forget to give them a reaction to say thank you!

 

The only true wisdom is in knowing you know nothing. - Socrates
 

Please put as much effort into your question as you expect me to put into answering it. 

 

  • CPU
    Ryzen 9 5950X
  • Motherboard
    Gigabyte Aorus GA-AX370-GAMING 5
  • RAM
    32GB DDR4 3200
  • GPU
    Inno3D 4070 Ti
  • Case
    Cooler Master - MasterCase H500P
  • Storage
    Western Digital Black 250GB, Seagate BarraCuda 1TB x2
  • PSU
    EVGA Supernova 1000w 
  • Display(s)
    Lenovo L29w-30 29 Inch UltraWide Full HD, BenQ - XL2430(portrait), Dell P2311Hb(portrait)
  • Cooling
    MasterLiquid Lite 240
