AMD Hosting Zen Event at HotChips on Tuesday

35 minutes ago, Tomsen said:

Name a few of those MIMD instructions then (just to make sure we are talking about the same things; I don't think you are, though). Games sure have things that can be parallelized; the issue is rather the more serialized procedure DX11 takes. Let me frame it this way: could the same optimizations not be done in scalar code and yield the same benefits?

 

You name a few old wizards and call that proof? You want me to look through all their individual work? Be more clear, or I might as well tell you to open your eyes as proof.

 

Intel has increased the execution port count from 6 to 9 and the ALU count from 3 to 4 in, what, 8 years? At some point, the hardware needs to advance for the software developers to take advantage of it. We have SIMD, which is very useful for SIMD-friendly workloads. You do realize most software uses MIMD? Any multi-threaded software, in fact.

 

Patrick, you clearly don't know what I'm talking about in regards to NetBurst. I'm not talking about introducing 2 ALUs, but rather about having the ALUs run at 2x the clock speed of the CPU, giving the effect of a 0.5-cycle execution for normal 1-cycle operations. Sure, to store the result the instruction latency appeared the same (so as not to mess up the cache subsystem), but intermediate execution was possible, IIRC. That is, you could reuse the value generated in the first 0.5 cycle in the next 0.5 cycle, effectively cutting the data latency in half relative to CPU clocks.

I can back you up on the NetBurst thing. It is one of several reasons why those chips were notoriously hot and problematic.

Sources for the NetBurst info are readily available at Wikipedia, AnandTech, and Tom's Hardware, in case you wonder @patrickjp93.

 

I will say though, in theory, due to the design of DX11 and previous versions, you could technically run a low-level injector system that spread out the data workload prior to feeding it through the CPU. However, it would be a case-by-case implementation.

 


1 hour ago, Tomsen said:

Name a few of those MIMD instructions then (just to make sure we are talking about the same things; I don't think you are, though). Games sure have things that can be parallelized; the issue is rather the more serialized procedure DX11 takes. Let me frame it this way: could the same optimizations not be done in scalar code and yield the same benefits?

 

You name a few old wizards and call that proof? You want me to look through all their individual work? Be more clear, or I might as well tell you to open your eyes as proof.

 

Intel has increased the execution port count from 6 to 9 and the ALU count from 3 to 4 in, what, 8 years? At some point, the hardware needs to advance for the software developers to take advantage of it. We have SIMD, which is very useful for SIMD-friendly workloads. You do realize most software uses MIMD? Any multi-threaded software, in fact.

 

Patrick, you clearly don't know what I'm talking about in regards to NetBurst. I'm not talking about introducing 2 ALUs, but rather about having the ALUs run at 2x the clock speed of the CPU, giving the effect of a 0.5-cycle execution for normal 1-cycle operations. Sure, to store the result the instruction latency appeared the same (so as not to mess up the cache subsystem), but intermediate execution was possible, IIRC. That is, you could reuse the value generated in the first 0.5 cycle in the next 0.5 cycle, effectively cutting the data latency in half relative to CPU clocks.

The problem with DX11 is not having 1 CPU core talking to the GPU. As a matter of fact, that's exactly how it should be. If you're really using your CPU for all it's worth on AI, supporting as many objects as possible, and handling the network, the one core talking to the GPU should be dedicated to that task. The problem with DX11 was the overhead associated with draw calls, requiring asinine marshaling techniques to stuff as much into one call as possible.
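To make the draw-call point concrete, here is a minimal sketch of the classic DX11 batching trick (assuming index and instance buffers plus shaders are already bound; UpdatePerObjectConstants is a hypothetical helper, and error handling is omitted):

// Naive: one draw call per object. The per-call driver/runtime overhead
// is exactly what made DX11 CPU-bound at high object counts.
for (UINT i = 0; i < objectCount; ++i) {
    UpdatePerObjectConstants(context, objects[i]); // hypothetical helper
    context->DrawIndexed(indexCount, 0, 0);
}

// Batched: per-object data moved into an instance buffer, so thousands
// of objects cost a single call into the driver.
context->DrawIndexedInstanced(indexCount, objectCount, 0, 0, 0);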

 

Check out the AVX2 instruction extensions. They're not all SIMD. A few of them are slightly complex compare/manipulate/exchange operations commonly found in transactions, or in bubble sort (the most efficient sort for small N where N fits in cache, btw).
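As an illustrative sketch (not any particular library's code): the branchless compare-exchange primitive that sorting networks, bubble sort's close cousin, are built from, written with the AVX2 min/max intrinsics. Strictly speaking these are still lane-wise SIMD operations, but they implement exactly the compare/exchange pattern described above.

#include <immintrin.h>

// One compare-exchange step over eight int lanes: afterwards every lane
// of *lo is <= the corresponding lane of *hi. Chaining such steps with
// shuffles yields a full in-register sorting network.
static inline void compare_exchange(__m256i* lo, __m256i* hi) {
    __m256i mn = _mm256_min_epi32(*lo, *hi); // lane-wise minima (AVX2)
    __m256i mx = _mm256_max_epi32(*lo, *hi); // lane-wise maxima (AVX2)
    *lo = mn;
    *hi = mx;
}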

 

No, it increased from 2 to 4 in 8 years (Pentium 3/4 to Haswell), and that's still ignoring branch prediction improvements every generation, loop detection (Nehalem), the micro-op cache (Sandy Bridge), MIMD (Haswell), and widening the out-of-order engine by 3 instructions. All of this is paid for in heat. It's not Intel's fault that programmers are the problem when it is handing out software tools on a silver platter to make programmers' lives easier.

 

They're not old wizards. They're current Titans of the industry and are frequent lecturers at CppCon, code::dive, GoingNative, and the International Supercomputing Conference. Andrei Alexandrescu vectorized Facebook's AI, something most of the world thought impossible due to AI's serial nature. Scott Meyers vectorized Prim's algorithm. Stephan T. Lavavej is the chief architect of the Microsoft Visual C++ library and the one who vectorized Windows 10, yes, an operating system.

 

Oh please, that doesn't count, and it's not even remotely possible to pull off now. If you can't get the results to registers or cache in less than a cycle, the effort to produce the result faster is worthless. Programmers are the problem, not Intel.


Just now, Prysin said:

I can back you up on the NetBurst thing. It is one of several reasons why those chips were notoriously hot and problematic.

Sources for the NetBurst info are readily available at Wikipedia, AnandTech, and Tom's Hardware, in case you wonder @patrickjp93.

 

I will say though, in theory, due to the design of DX11 and previous versions, you could technically run a low-level injector system that spread out the data workload prior to feeding it through the CPU. However, it would be a case-by-case implementation.

Also, that's how they got much closer to their 10GHz expectations than most people would think (clocking up to 9GHz, IIRC, with a little overclocking).

 

Sure, you could hack your way through it to distribute the data and workload across threads, but in the end you will still have to serialize and synchronize the data again to feed it through DX11. The amount of work required compared to the gains is a bad ratio.
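For reference, DX11 itself shipped a half-step in this direction: deferred contexts let worker threads record command lists in parallel, yet playback still funnels through the single immediate context, which is exactly the serialization point described above. A rough sketch (COM error handling omitted; RecordSceneChunk is a hypothetical helper):

// Worker thread: record draw commands into a deferred context.
ID3D11DeviceContext* deferred = nullptr;
device->CreateDeferredContext(0, &deferred);
RecordSceneChunk(deferred);                   // hypothetical helper
ID3D11CommandList* cmdList = nullptr;
deferred->FinishCommandList(FALSE, &cmdList);

// Render thread: playback remains one serialized submission point.
immediateContext->ExecuteCommandList(cmdList, FALSE);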


7 hours ago, patrickjp93 said:

No one can, yet that's what Intel would have to do for you people to feel it, because software has stagnated.

People expect magic because of the performance leap in graphics these days with DX12 and Vulkan; when those games are properly coded, it's miles away from what it was, even with cards that are 4 years old, because the software can make better use of the hardware (at least on AMD's side). But these are the same people who think Hyper-Threading is a must-have that replaces real cores, while it clearly doesn't and makes little to no difference, and sometimes even hurts performance when you have a properly parallelized algorithm.

You're right to point out that software has to follow.


57 minutes ago, patrickjp93 said:

The problem with DX11 is not having 1 CPU core talking to the GPU. As a matter of fact, that's exactly how it should be. If you're really using your CPU for all it's worth on AI, supporting as many objects as possible, and handling the network, the one core talking to the GPU should be dedicated to that task. The problem with DX11 was the overhead associated with draw calls, requiring asinine marshaling techniques to stuff as much into one call as possible.

 

Check out the AVX2 instruction extensions. They're not all SIMD. A few of them are slightly complex compare/manipulate/exchange operations commonly found in transactions, or in bubble sort (the most efficient sort for small N where N fits in cache, btw).

 

No, it increased from 2 to 4 in 8 years (Pentium 3/4 to Haswell), and that's still ignoring branch prediction improvements every generation, loop detection (Nehalem), the micro-op cache (Sandy Bridge), MIMD (Haswell), and widening the out-of-order engine by 3 instructions. All of this is paid for in heat. It's not Intel's fault that programmers are the problem when it is handing out software tools on a silver platter to make programmers' lives easier.

 

They're not old wizards. They're current Titans of the industry and are frequent lecturers at CppCon, code::dive, GoingNative, and the International Supercomputing Conference. Andrei Alexandrescu vectorized Facebook's AI, something most of the world thought impossible due to AI's serial nature. Scott Meyers vectorized Prim's algorithm. Stephan T. Lavavej is the chief architect of the Microsoft Visual C++ library and the one who vectorized Windows 10, yes, an operating system.

 

Oh please, that doesn't count, and it's not even remotely possible to pull off now. If you can't get the results to registers or cache in less than a cycle, the effort to produce the result faster is worthless. Programmers are the problem, not Intel.

Patrick, you are now putting words into my mouth and simply assuming things that I didn't state. I'm not saying the issue lies with DX11 only using a single thread to communicate (or rather, dictate) commands and data to the GPU. Rather, the issue is how serialized DX11's whole rendering pipeline is. Overhead is also a part of it, but I could argue that it is the result of such a serialized rendering pipeline (developers trying to work their way around DX11's limits).

 

All AVX2 instructions are SIMD (AVX2 is a SIMD instruction set). You had better name the individual instructions if you want to argue otherwise.

 

Wtf are you on? You are clearly being deceptive, picking a different baseline than me. I clearly stated that it was within the last 8 years (since Nehalem), then you go ahead and pick a baseline from 16+ years ago. Great job, Patrick. If you want to argue against my point, you had better use the same baseline, else the whole thing is moot.

 

They are old wizards, standing on their last legs before retirement. Wizards are also very powerful; I'm not understating their achievements.

Those are far from proof of what I argued. You can't expect the same scalability just by rewriting scalar code as vector code. In many cases you will see performance regressions. You really think Windows 10 is perfectly vectorized? If so, you have plenty to learn.
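One concrete way a mechanical vector rewrite can regress, as a sketch (my example, assuming a Haswell-era core, where _mm256_i32gather_ps is often no faster than plain scalar loads):

#include <immintrin.h>

// Scalar: eight independent cached loads the out-of-order core overlaps easily.
void scatter_sum_scalar(const float* a, const int* idx, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[idx[i]] + 1.0f;
}

// "Vectorized": the gather decomposes into the same loads plus extra
// overhead, so this version can actually lose to the scalar one.
void scatter_sum_avx2(const float* a, const int* idx, float* out, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i*)(idx + i));
        __m256  v    = _mm256_i32gather_ps(a, vidx, 4); // scale = 4 bytes
        _mm256_storeu_ps(out + i, _mm256_add_ps(v, _mm256_set1_ps(1.0f)));
    }
    // (tail elements omitted in this sketch)
}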

 

How does that not count? How is that not remotely possible now? They had it working; they can bloody well just implement it again if they need it.

Obviously the result ended up in the registers, else it wouldn't work, would it? Just pointing out the obvious.


Great. Hopefully that IPC improvement will be as great as expected. What was shown so far looks OK, but I can't wait to see more results.


7 hours ago, Tomsen said:

Patrick, you are now putting words into my mouth and simply assuming things that I didn't state. I'm not saying the issue lies with DX11 only using a single thread to communicate (or rather, dictate) commands and data to the GPU. Rather, the issue is how serialized DX11's whole rendering pipeline is. Overhead is also a part of it, but I could argue that it is the result of such a serialized rendering pipeline (developers trying to work their way around DX11's limits).

 

All AVX2 instructions are SIMD (AVX2 is a SIMD instruction set). You had better name the individual instructions if you want to argue otherwise.

 

Wtf are you on? You are clearly being deceptive, picking a different baseline than me. I clearly stated that it was within the last 8 years (since Nehalem), then you go ahead and pick a baseline from 16+ years ago. Great job, Patrick. If you want to argue against my point, you had better use the same baseline, else the whole thing is moot.

 

They are old wizards, standing on their last legs before retirement. Wizards are also very powerful; I'm not understating their achievements.

Those are far from proof of what I argued. You can't expect the same scalability just by rewriting scalar code as vector code. In many cases you will see performance regressions. You really think Windows 10 is perfectly vectorized? If so, you have plenty to learn.

 

How does that not count? How is that not remotely possible now? They had it working; they can bloody well just implement it again if they need it.

Obviously the result ended up in the registers, else it wouldn't work, would it? Just pointing out the obvious.

*Eyeroll* Now you're nitpicking. Like I said: the micro-op cache, changing the ALU count from 2 to 4 (Nehalem did not have 3), widening the out-of-order engine by 3 instructions (better self-optimization by the CPU, though that's still limited as most code has a branch every 7 lines), increasing the independent execution port count to 9 (less pressure on individual compute units, more flexible coding), including a programmable iGPU for heterogeneous tasks where latency is a limiting factor, and increasing the width of the loop detector by 30%. That has all happened since Nehalem.

 

Lavavej is in his early 30s and Andrei just turned 40. They're not old wizards. You could argue Scott Meyers is, but that's it.

 

It's not possible because the switching speeds of silicon are not maintainable at 8GHz without enormous heat problems, not to mention most transistors need to be very cold to maintain that speed, and that's the requirement to produce a result in half a cycle of the CPU. And it doesn't count because the half cycle saved ended up being swallowed by the requirements of cache and registers.
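As a back-of-the-envelope check on that half-cycle budget (my numbers, assuming a 4GHz core clock purely for illustration):

#include <cstdio>

int main() {
    const double core_ghz = 4.0;                // assumed core clock
    const double cycle_ps = 1000.0 / core_ghz;  // one core cycle in picoseconds
    // A double-pumped ALU must settle a result in half that window.
    std::printf("full cycle: %.0f ps, ALU window: %.0f ps\n",
                cycle_ps, cycle_ps / 2.0);      // 250 ps vs 125 ps at 4GHz
}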

 

If anything, NetBurst is proof I'm right.


It's Tuesday already! Gimme the news!


34 minutes ago, Thony said:

It's Tuesday already! Gimme the news!

You gotta wait until 5:45 P.M. tho :( 


14 hours ago, patrickjp93 said:

*Eyeroll* Now you're nitpicking.

Ok then, let me know what I missed in my previous reply and I'll comment on it.

 

14 hours ago, patrickjp93 said:

Like I said: the micro-op cache

Which had a big influence on the performance improvement we saw in Sandy Bridge.

 

14 hours ago, patrickjp93 said:

changing the ALU count from 2 to 4 (Nehalem did not have 3)

You are not even close to being right. Patrick, take 2 minutes and Google it; you'll find I am right that Nehalem has 3 ALUs.

 

14 hours ago, patrickjp93 said:

widening the out-of-order engine by 3 instructions (better self-optimization by the CPU, though that's still limited as most code has a branch every 7 lines)

I'm having some issues understanding this statement. OoOE width normally isn't described in instructions (as in x86 instructions), and 3 instructions also seems awfully low, to be honest. CPUs know how to deal with branches; it is rather a question of how unpredictable the branch is.
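The predictability point is easy to demonstrate; a minimal sketch of the classic sorted-versus-unsorted experiment (timing code left out):

#include <algorithm>
#include <cstdlib>
#include <vector>

// Sums elements >= 128. With random data the branch is a coin flip and
// mispredicts constantly; after std::sort the same loop runs several
// times faster because the branch becomes predictable.
long long sum_big(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data)
        if (v >= 128)            // predictable only if the data has a pattern
            sum += v;
    return sum;
}

int main() {
    std::vector<int> data(1 << 20);
    for (int& v : data) v = std::rand() % 256;
    long long a = sum_big(data);  // unpredictable branch
    std::sort(data.begin(), data.end());
    long long b = sum_big(data);  // same work, predictable branch
    return a == b ? 0 : 1;        // sums match; only the timing differs
}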

 

14 hours ago, patrickjp93 said:

increasing the independent execution port count to 9 (less pressure on individual compute units, more flexible coding)

I already mentioned this in a previous reply.

 

14 hours ago, patrickjp93 said:

including a programmable iGPU for heterogeneous tasks where latency is a limiting factor

Sadly, the only thing I heard about people working on Intel's iGPU for compute was the buggy drivers, but this was back in 2011 IIRC, so a lot could have changed.

QuickSync is most likely the more-used feature.

 

14 hours ago, patrickjp93 said:

and increasing the width of the loop detector by 30%.

The loop detector is now a part of the µop cache, isn't it? Pretty sure that changed with Sandy.

 

14 hours ago, patrickjp93 said:

Lavavej is in his early 30s and Andrei just turned 40. They're not old wizards. You could argue Scott Meyers is, but that's it.

Scott Meyers is also the only one I had any real knowledge of; the rest I have only heard a few tidbits about.

 

14 hours ago, patrickjp93 said:

It's not possible because the switching speeds of silicon are not maintainable at 8GHz without enormous heat problems, not to mention most transistors need to be very cold to maintain that speed, and that's the requirement to produce a result in half a cycle of the CPU.

Here's the deal with that: you don't need the entire IC to run at the same high frequency, which is why NetBurst ran its ALUs at 2x the frequency.

Sure, at some point heat is going to be an issue, which is why Intel never reached their 10GHz goal.

 

14 hours ago, patrickjp93 said:

And it doesn't count because the half cycle saved ended up being swallowed by the requirements of cache and registers.

I'm not sure I'm understanding your logic. Of course it counts; just because it needs to realign the store order doesn't change the fact that the internal registers update at the same rate as the ALU. So you can absolutely save the half cycle without it being swallowed whatsoever.

 

14 hours ago, patrickjp93 said:

If anything, NetBurst is proof I'm right.

If anything, NetBurst is proof of Intel overestimating a new proof-of-concept without any tools or previous data on such an experiment.

They were overambitious in their implementation for their time, and the benefits were rather diminished as a result.


2 hours ago, DocSwag said:

You gotta wait until 5:45 P.M. tho :( 

Really? That sounds good. I was expecting it like tomorrow. But thank god AMD doesn't do their stuff in the evenings like most other companies.


3 minutes ago, Thony said:

Really? That sounds good. I was expecting it like tomorrow. But thank god AMD doesn't do their stuff in the evenings like most other companies.

If you've seen the slides floating around on the internet yesterday, my guess is those are the slides that will be shown today. They may also show some benchmarks today, though, so who knows.


3 hours ago, DocSwag said:

You gotta wait until 5:45 P.M. tho :( 

Is that Pacific time?


On 8/22/2016 at 9:53 AM, Tomsen said:

Intel has increased the data parallelism of its vector units by a good amount with each new architecture. That helps a lot in certain workloads, and has basically no benefit outside them. Trying to defend it by saying "programmers who can properly optimize code" is stupid; you can't properly optimize all code (far from it, actually) into vector code and expect the same scalability. In the general pipeline, outside of vector data parallelism, what exactly has Intel done to make its CPUs faster? Not much.

Actually, "believe me, I'm great but nobody codes for me" is the story of Bulldozer & co. in a nutshell :P 


1 hour ago, TidaLWaveZ said:

Is that Pacific time?

Yes, since the event is in the Bay Area.


11 minutes ago, vitor_cut said:

Will it be streamed?

 

That's the question I should've been asking instead of what time zone it's in. Looks like there won't be a stream; the website says presentations are only for attendees.


10 minutes ago, vitor_cut said:

Will it be streamed?

 

I don't think so. If you want to find out what happens there, it looks like you'll have to just see what news sites report about it afterwards :/ 


6 hours ago, Tomsen said:

Ok then, let me know what I missed in my previous reply and I'll comment on it.

 

Which had a big influence on the performance improvement we saw in Sandy Bridge.

 

You are not even close to being right. Patrick, take 2 minutes and Google it; you'll find I am right that Nehalem has 3 ALUs.

 

I'm having some issues understanding this statement. OoOE width normally isn't described in instructions (as in x86 instructions), and 3 instructions also seems awfully low, to be honest. CPUs know how to deal with branches; it is rather a question of how unpredictable the branch is.

 

I already mentioned this in a previous reply.

 

Sadly, the only thing I heard about people working on Intel's iGPU for compute was the buggy drivers, but this was back in 2011 IIRC, so a lot could have changed.

QuickSync is most likely the more-used feature.

 

The loop detector is now a part of the µop cache, isn't it? Pretty sure that changed with Sandy.

 

Scott Meyers is also the only one I had any real knowledge of; the rest I have only heard a few tidbits about.

 

Here's the deal with that: you don't need the entire IC to run at the same high frequency, which is why NetBurst ran its ALUs at 2x the frequency.

Sure, at some point heat is going to be an issue, which is why Intel never reached their 10GHz goal.

 

I'm not sure I'm understanding your logic. Of course it counts; just because it needs to realign the store order doesn't change the fact that the internal registers update at the same rate as the ALU. So you can absolutely save the half cycle without it being swallowed whatsoever.

 

If anything, NetBurst is proof of Intel overestimating a new proof-of-concept without any tools or previous data on such an experiment.

They were overambitious in their implementation for their time, and the benefits were rather diminished as a result.

It had 2 ALUs, an FPU, and an SSE vector unit. Read the Nehalem architecture manual.

 

Handling 14 (vs. 11 from SB's days) is about all Intel can do without increasing the pipeline stall rate due to branch misprediction.

 

But Intel has made its predictor 99% accurate in a 4-decision branch and 90% accurate in a 7-decision branch.

 

Intel's graphics drivers are not buggy. They just don't fix bad code for you. Everyone in enterprise who wanted iGPU has been happy to use it since Ivy Bridge, and that includes Amazon, Facebook, Google, and Microsoft. If you adhere strictly to the manuals, you're fine.

 

I forgot QS. Let's add that to the list!

 

No, the loop detector is at an instruction, not µop, level these days. I think the loop can only be 20 instructions wide and can only have branches change internal paths (no function calls), but that is a further innovation since the Sandy days.

 

If you want other names, Chandler Carruth currently works for Google and is in his mid 30s. Howard Hinnant is another Titan active in the industry today. That's not even counting the hundreds of C++ programmers Intel has working for it who never contribute to open conferences like CppCon et al. Cilk Plus proved that almost anything you can loop over, be it objects or functional tasks, can be vectorized with performance gains.
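For a flavor of the kind of loop that claim covers, a stand-in sketch using OpenMP's #pragma omp simd, which makes the same promise Cilk Plus's #pragma simd did (the saxpy kernel is my example, not anything from those conferences):

#include <cstddef>

// Independent iterations, no carried state: the pragma tells the compiler
// it may use full-width SIMD without having to prove independence itself.
// Build with -fopenmp-simd on GCC or Clang.
void saxpy(float a, const float* x, float* y, std::size_t n) {
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}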

 

Internal registers do not update at the same time as the ALU. Digital signaling, anyone? It takes time for the data to come back from the ALU. It doesn't count, because the end result was not really getting anything done in less than a cycle.


Yo @patrickjp93 and @Tomsen, are you two by any chance misunderstanding each other? Because I just searched it up and you two are both right: Nehalem has 2 vector ALUs but 3 integer ALUs. So technically Patrick is right that it has 2, and Tomsen is right that it has 3. Is that by any chance the problem here?


38 minutes ago, DocSwag said:

Yo @patrickjp93 and @Tomsen, are you two by any chance misunderstanding each other? Because I just searched it up and you two are both right: Nehalem has 2 vector ALUs but 3 integer ALUs. So technically Patrick is right that it has 2, and Tomsen is right that it has 3. Is that by any chance the problem here?

Are you getting those numbers by looking at the execution port diagram or the core block diagram? Also, one of your 3 ALUs is an AGU, which is only for indirect memory addressing.


42 minutes ago, patrickjp93 said:

Are you getting those numbers by looking at the execution port diagram or the core block diagram? Also, one of your 3 ALUs is an AGU, which is only for indirect memory addressing.

I only read that off Wikipedia xD so idk.

https://en.m.wikipedia.org/wiki/Nehalem_(microarchitecture)

Quote

3 integer ALU, 2 vector ALU and 2 AGU per core.

 


35 minutes ago, DocSwag said:

I only read that off Wikipedia xD so idk.

https://en.m.wikipedia.org/wiki/Nehalem_(microarchitecture)

 

Well, it's wrong.

 

http://sc.tamu.edu/systems/eos/nehalem.pdf

 

As detailed in this paper, the execution ports list 3 ALU operation possibilities, but there are only 2 ALUs. The reason for having more ports is more flexible out-of-order management and higher independence between instruction paths.


14 hours ago, patrickjp93 said:

It had 2 ALUs, an FPU, and an SSE vector unit. Read the Nehalem architecture manual.

 

Handling 14 (vs. 11 from SB's days) is about all Intel can do without increasing the pipeline stall rate due to branch misprediction.

 

But Intel has made its predictor 99% accurate in a 4-decision branch and 90% accurate in a 7-decision branch.

 

Intel's graphics drivers are not buggy. They just don't fix bad code for you. Everyone in enterprise who wanted iGPU has been happy to use it since Ivy Bridge, and that includes Amazon, Facebook, Google, and Microsoft. If you adhere strictly to the manuals, you're fine.

 

I forgot QS. Let's add that to the list!

 

No, the loop detector is at an instruction, not µop, level these days. I think the loop can only be 20 instructions wide and can only have branches change internal paths (no function calls), but that is a further innovation since the Sandy days.

 

If you want other names, Chandler Carruth currently works for Google and is in his mid 30s. Howard Hinnant is another Titan active in the industry today. That's not even counting the hundreds of C++ programmers Intel has working for it who never contribute to open conferences like CppCon et al. Cilk Plus proved that almost anything you can loop over, be it objects or functional tasks, can be vectorized with performance gains.

 

Internal registers do not update at the same time as the ALU. Digital signaling, anyone? It takes time for the data to come back from the ALU. It doesn't count, because the end result was not really getting anything done in less than a cycle.

It has 3 ALUs. It is shown in the core block diagram, it is in the documentation, and it is just generally accepted as such. Where is it stated otherwise?

 

I've never heard of using x86 instructions as a measurement of OoOE width. That would depend on the workload; you can't just throw out random numbers, unless it is distinctly caused by some kind of bottleneck in the microarchitecture.

 

As for branch predictors and their accuracy, obviously marketing has had its fingers in those numbers; else branches wouldn't be a problem.

 

They were buggy; I can't say anything about how it is today. Good thing there wasn't any bad code: he did end up singling out the issue so he could replicate it and verify it was a bug (it didn't behave as the documentation stated it should). I haven't heard about enterprise being happy with Intel's iGP (only the opposite), and those mega-corporations don't use the same public documentation/information as others do, often relying on things they discovered themselves or were told in confidence.

 

You don't need to add it to the list; it is already covered under "Intel added an iGP with Sandy".

 

http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-3.html

"With the Nehalem architecture, Intel has improved the functionality of the Loop Stream Detector. First of all the buffer is larger—it can now store 28 instructions. But what’s more, its position in the pipeline has changed. In Conroe, it was located just after the instruction fetch phase. It’s now located after the decoders; this new position allows a larger part of the pipeline to be disabled. The Nehalem’s Loop Stream Detector no longer stores x86 instructions, but rather µops."

 

We can both play a game of name-a-wizard, but that doesn't prove anything yet. Because all code is simple loops, right? Also, that is not news. Loops are perfect for vectors because, guess what: a small code sample gets repeated over and over, and if no dependencies are in place there are basically no issues. That is, however, far from normal code.
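To make the dependency point concrete, a sketch (my example): the first loop maps straight onto SIMD lanes; the second carries a value between iterations, so a mechanical scalar-to-vector rewrite buys nothing.

// Independent iterations: each c[i] depends only on a[i] and b[i],
// so this maps directly onto SIMD lanes.
void add(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Loop-carried dependence: out[i] needs out[i-1] first, a serial chain.
// (Prefix sums can be vectorized, but only by restructuring the algorithm,
// not by a mechanical rewrite.) Assumes n > 0.
void prefix_sum(const float* in, float* out, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; ++i)
        out[i] = out[i - 1] + in[i];
}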

 

Patrick, of course the necessary internal registers updated at the same frequency, else it wouldn't work one bit.

Yes, it takes time for the data to come back from the ALU; however, that is a non-issue, the same way it takes time for the data to come into the ALU.

Or are you saying they clocked it at 2x the normal clock speed for nothing, since nothing got accomplished in less than a normal clock cycle?

 


@DocSwag almost a day later and not a single word on what happened on Tuesday. Was AMD not giving out any new info, or what happened?

