looncraz

Member
  • Posts

    99
  • Joined

  • Last visited

  1. When you go to buy a video card, are you concerned about which one has been on the market longer? I've never been, beyond some maturity concerns with bleeding edge stuff. The only issue with being late is that the ideal sales window has closed and the next generation of products will soon eclipse yours. If Vega can deliver good performance at a reasonable price it will sell just as well as it would have if it had been released two months ago, but nowhere near as well as if it had come out prior to the Pascal cards. We can all just hope that AMD has some secret sauce in the mix and that they actually planned to leapfrog Pascal from the outset. If not, Volta may prove to be too much for AMD to effectively counter... which means AMD can't make cheap enough GPUs to compete with a positive product margin.
  2. What we're looking at is not the die, but the CPU package. Rumor has it that there may be a 32 core + GPU + HBM version planned for this same socket... and also eight memory channels. You need room for all of that. The 32 core version is also likely to be using MCM, as Drak3 mentioned, which also takes more room.
  3. It's most certainly not - the idiots looked at misread L3 cache data in a leaked benchmark that reported 64MBx8 and have been claiming that means it has 512MB of L3... rather than realizing the program was simply unable to read the cache configuration of an experimental chip accurately. A 16-core CPU should have 32MB of L3 cache, though, which is still plenty nice.
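A quick sanity check on that 32MB figure, assuming Zen's published layout of 8MB of L3 per 4-core CCX - the snippet below is back-of-the-envelope only:

```cpp
#include <cstdio>

int main() {
    // Zen groups cores into 4-core CCX blocks, each with 8MB of L3.
    const int cores       = 16;
    const int coresPerCcx = 4;
    const int l3PerCcxMB  = 8;

    const int ccxCount   = cores / coresPerCcx;   // 4 CCXs
    const int expectedL3 = ccxCount * l3PerCcxMB; // 32MB total

    // The misread benchmark multiplied 64MB by 8 "instances" instead.
    const int misreadL3 = 64 * 8;                 // 512MB - nonsense

    std::printf("expected L3: %d MB, misread L3: %d MB\n", expectedL3, misreadL3);
}
```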
  4. Intel's bidirectional ring bus. Xeon E5-4650. Yes, if these cores want data another core has, they jump out to the inclusive L3, but imagine what would happen in this arrangement if the L3 were not inclusive - if it were, instead, "mostly" exclusive like Zen's. Sure it's possible, with four cores arranged across from each other. You just run a bus next door and across the way and make requests on the bus in an orderly manner. Then you can access any cache on any core in the CCX with the same average latency (I think they mean that, from Core 0, you can access the Core 1, 2, & 3 caches with the same average latency as each other - not as if those caches were internal to Core 0... that would be impossible). Assuming AMD doesn't do something crazy, such as simply mirroring all of the data used by more than one core... or something more sane like keeping a tag list between CCX blocks. There are unidentified structures adjoining each memory controller that look an awful lot like tag structures. Each CCX could simply be designed to run out to the main bus for data, with these structures knowing whether that data is located in the L3 of another CCX. There are so many ways around these issues it's hilarious. NUMA memory coherency and CCX cache coherency are effectively identical issues. We're still talking about multiple memory stores which require synchronization with each other and with main memory when working with common data. You're making an awfully large number of assumptions about there being weaknesses that haven't been addressed. If they hadn't been, even implementing simple atomics on multi-CCX Zen would be a horribly slow experience, as the CPU would always need to update in-memory structures rather than staying on-die. Worst case is a lazy inter-CCX L3 snoop with no external tags or synchronization controller. That would cost tens of cycles for every global access, including simple - and frequent - things like atomics or volatiles. The very idea that AMD would design a multi-CCX CPU like Zen with multiple L3s and not mitigate the obvious downsides of that is laughable.
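A toy illustration of the tag-structure idea above: a small directory beside the memory controller that records which CCX, if any, holds a line, so a request can be steered to the other CCX's L3 instead of DRAM. The names, the two-CCX layout, and the routing strings are assumptions for illustration only, not AMD's actual fabric logic:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Hypothetical directory entry: which CCX currently holds the line.
enum class Location { NotCached, Ccx0, Ccx1 };

class TagDirectory {
public:
    void record(std::uint64_t lineAddr, Location where) { tags_[lineAddr] = where; }

    // Decide where a requesting CCX should fetch a line from.
    const char* route(std::uint64_t lineAddr, Location requester) const {
        auto it = tags_.find(lineAddr);
        if (it == tags_.end() || it->second == Location::NotCached)
            return "memory";          // nobody has it: go to DRAM
        if (it->second == requester)
            return "local L3";        // already in our own CCX
        return "remote CCX L3";       // snoop/forward from the other CCX
    }

private:
    std::unordered_map<std::uint64_t, Location> tags_;  // tags only, no data
};

int main() {
    TagDirectory dir;
    dir.record(0x1000, Location::Ccx1);   // line cached by CCX1

    std::printf("CCX0 asking for 0x1000 -> %s\n", dir.route(0x1000, Location::Ccx0));
    std::printf("CCX1 asking for 0x1000 -> %s\n", dir.route(0x1000, Location::Ccx1));
    std::printf("CCX0 asking for 0x2000 -> %s\n", dir.route(0x2000, Location::Ccx0));
}
```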
  5. Except that isn't always the case. With a bi-directional ring bus, for example... cores closer to each other can communicate faster, and each hop adds a cycle or two of latency. AMD's slide says: "Every core can access every cache with the same average latency." This, of course, is only relative to the CCX. It also indicates that synchronization may be possible without using the L3 - making it potentially faster than, or at least as fast as, synchronization through a common LLC. I have, actually, to all of the above, though I've never bothered to profile too deeply for performance because I've had the performance I've needed, and I haven't had to work with NUMA at the same time, though the vast majority of my code is highly parallel... I also know to try to work on separate data as much as possible - that reduces locks in software, which also reduces cache issues like you describe. Yes, all non-originating (writer) CCX blocks will stall on dependent data during a write. For all of tens of cycles... assuming they are trying to obtain that data within that same window and can't use the local copy to finish their load - and since these situations are actually governed by software synchronization (or locks), the data in other CCX caches will only need to be refreshed before the lock is released. Even the simplest and most highly dependent of programs, when distributed across multiple cores, will have synchronization in software for shared data. These locks (and their contention) are the major limiting factor for performance in these programs - far outstripping internal CPU synchronization in Zen. Atomics exist to create more granular locking, which allows other work to continue up until an access is needed for the atomically accessed data. It's a way to minimize lock contention, and it's exactly the type of system that relies on full-CPU synchronization (so, on Zen, inter-CCX communication). Yes, those are the up-front costs that will be incurred. It will be interesting to see what AMD has done, if anything, to mitigate this - they clearly won't be going back out to system memory for the data if the data is on another CCX (at least we'd hope, LOL!). Still, if multiple threads are working on the same data, that data is usually persistent for some time. There's also the possibility that AMD has speculative L3 cache loading it hasn't disclosed, which could be used to address this... or, perhaps, that's a Zen+ topic. Fail is a pretty strong term. Sometimes you take a few downsides for some major upsides. If you can quickly tack more cores onto a design without spending tens of millions of dollars, then you've done a great job - even if those cores don't scale as well in groups of four as you'd hope... that didn't stop people from building servers with multiple CPUs or with NUMA... Of course there are situations where Zen will have more of a disadvantage due to this design, but these aren't situations where Zen will be targeted, so it is entirely irrelevant.
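To make the atomics point concrete, here is a minimal sketch (my own illustration, nothing AMD-specific) of atomics as a more granular alternative to a coarse lock - the atomic still has to be kept coherent in every cache holding that line, which is exactly the inter-CCX traffic being discussed:

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::mutex        counterLock;      // coarse lock: every increment contends here
long              lockedCounter = 0;
std::atomic<long> atomicCounter{0}; // fine-grained: contention only on the cache line

void workLocked(int iters) {
    for (int i = 0; i < iters; ++i) {
        std::lock_guard<std::mutex> guard(counterLock);
        ++lockedCounter;
    }
}

void workAtomic(int iters) {
    for (int i = 0; i < iters; ++i)
        atomicCounter.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    const int threads = 4, iters = 100000;
    std::vector<std::thread> pool;

    for (int t = 0; t < threads; ++t) pool.emplace_back(workLocked, iters);
    for (auto& th : pool) th.join();
    pool.clear();

    for (int t = 0; t < threads; ++t) pool.emplace_back(workAtomic, iters);
    for (auto& th : pool) th.join();

    std::printf("locked: %ld, atomic: %ld\n", lockedCounter, atomicCounter.load());
    // Both reach 400000; the difference is how much synchronization traffic
    // (lock acquire/release vs. a single cache-line ping-pong) each one generates.
}
```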
  6. The L3 is not a speculative store in Zen; it's also "mostly exclusive" of the L2 caches, suggesting some data is stored - and synchronized - in multiple caches even within a CCX. Extending that between CCX blocks is critical for performance. (The rest here is purely conjecture.) If there is only one CCX (quad core), then we don't need to do anything special for shared data. With more than one CCX, we have to worry about synchronizing data between each CCX. The most logical place to do this synchronization would be between the CCX blocks themselves, within the data fabric. An external tag store would be all that is needed, as well as some synchronization primitives. Most data doesn't change, so you optimize for the read-only state: create a rapid read-lock acquisition and release scheme, and speculatively execute assuming you acquired the lock, hiding the cost in the majority of cases. When a write occurs to shared data, all read locks must be released and the write lock gained - simple stuff. In hardware, with a centralized synchronization controller, this is as simple as setting a noNewReader bit and stalling other reads until the write is complete in the other CCXs - the originating CCX, however, can begin operating with the new data as soon as it knows no other writer is pending, reducing the impact of the delay - it will only be as long as the time it takes to check the synchronization state and write the data to the local cache(s). Dependent writes to memory will have to stall until the lock succeeds, but this should never take more than a few cycles (there can only be so many readers, they will all take a fixed period of time, and numerous locks can be released simultaneously). These are all rather simple synchronization techniques - very commonly implemented in various forms. I can't say if this is how Zen does it, but that's how I would do it.
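A software analogue of the scheme described above: a write-preferring reader/writer gate where a pending writer sets a "no new readers" flag, drains the existing readers, then proceeds. This is purely a toy model of the idea in the post, not a claim about Zen's actual hardware:

```cpp
#include <atomic>
#include <thread>

// Toy write-preferring reader/writer gate modeling the "noNewReader bit" idea.
// Default (sequentially consistent) atomics keep the flag/counter handshake simple.
class RwGate {
public:
    void readLock() {
        for (;;) {
            while (noNewReaders.load())   // a writer is pending: stall new readers
                std::this_thread::yield();
            readers.fetch_add(1);         // announce ourselves as a reader
            if (!noNewReaders.load())     // still no pending writer? we're in
                return;
            readers.fetch_sub(1);         // a writer slipped in: back off and retry
        }
    }
    void readUnlock() { readers.fetch_sub(1); }

    void writeLock() {
        while (writerActive.exchange(true))  // one writer at a time
            std::this_thread::yield();
        noNewReaders.store(true);            // set the "noNewReader" bit
        while (readers.load() != 0)          // drain readers that got in earlier
            std::this_thread::yield();
    }
    void writeUnlock() {
        noNewReaders.store(false);
        writerActive.store(false);
    }

private:
    std::atomic<bool> noNewReaders{false};
    std::atomic<bool> writerActive{false};
    std::atomic<int>  readers{0};
};

int main() {
    RwGate gate;
    int shared = 0;

    std::thread writer([&] { gate.writeLock(); shared = 42; gate.writeUnlock(); });
    std::thread reader([&] { gate.readLock();  (void)shared; gate.readUnlock(); });

    writer.join();
    reader.join();
}
```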
  7. His numbers are nearly exactly mine (which makes sense given how simple the math is and how many resources are available). In theory, Zen should almost exactly match Haswell... if, AND ONLY IF, that 40% IPC improvement claim relates to a 40% average performance increase per clock. Sounds like I might be confusing things, but there's a very distinct difference: IPC is NOT performance per cycle - it is INSTRUCTIONS per cycle. You can have a 40% increase in IPC and a 5% increase in performance/cycle if that 40% increase is poorly targeted. Likewise, you can have a 40% increase in IPC and see a 100% increase in performance/cycle if the improvement is superbly targeted. However, we know a great deal about Zen's basic design. It is 4- or 8-issue, depending on how you count issue. It has 10 pipelines, and likely three schedulers behind a reorder buffer (ALU, AGU, FPU). It has a 256-bit FPU which uses multiple pipelines at the same time to execute AVX, potentially an issue for AVX performance if not done exactly right. However, it should be exceptional for "legacy" floating point, given how well floating point code usually parallelizes. Certain SMT scenarios could see 50% or more scaling where Intel will see 0% scaling due to pipeline conflicts. ... The list goes on and on. But none of that matters if it is only a 3GHz CPU, though it's supposed to be "closer to 4GHz."
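A contrived example of why a 40% IPC uplift and 40% more performance per clock are not the same thing - the gain depends on which instructions got faster and how often the workload actually uses them. The instruction mix and uplift below are invented purely for illustration:

```cpp
#include <cstdio>

int main() {
    // Hypothetical workload: 90% "common" instructions, 10% "rare" ones.
    const double commonShare = 0.90, rareShare = 0.10;

    // Case A: a 40% throughput uplift, but only on the rare instructions.
    // Case B: the same 40% uplift, but applied to the common instructions.
    const double uplift = 1.40;

    // Time per instruction is the inverse of throughput; weight by the mix.
    const double baseTime  = commonShare * 1.0 + rareShare * 1.0;
    const double caseATime = commonShare * 1.0 + rareShare / uplift;
    const double caseBTime = commonShare / uplift + rareShare * 1.0;

    std::printf("poorly targeted 40%%: %.1f%% faster per clock\n",
                (baseTime / caseATime - 1.0) * 100.0);   // ~2.9%
    std::printf("well targeted 40%%:   %.1f%% faster per clock\n",
                (baseTime / caseBTime - 1.0) * 100.0);   // ~34.6%
}
```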
  8. GCN has been extremely good to AMD. It has kept them competitive for many years and has only just scaled to near its limits. Following the same principal design - with refinements - makes sense, especially as that architecture's full capabilities are only now beginning to be exploited by DirectX 12. They will probably improve per-CU performance, clockrate headroom, the scalar units (doubling their performance will probably be needed heading into the future), and add a next generation of ACEs and asynchronous shaders - just to keep nVidia in the mirror. Add an outsized improvement in tessellation, and that's doubling down where GCN excels while addressing its weak points. They'd also have to ensure that it can scale beyond 64 CUs, or make each CU twice as fast... we've hit the point of diminishing returns for each new CU already.
  9. I see, that's one heck of a size difference (2:1). So if Polaris 10 is 15x15 (~225mm2), that would make Polaris 11 over 900mm2 - which is unrealistic, to say the least. So, if 232mm2 is accurate for Polaris 11 (15.23mm/side), then Polaris 10 is 7.62mm/side (I *love* that number :p)... which is only 58mm2. If we convert to 14nm LPP density (about 2.5x higher than 28nm HPP), we get die-size equivalents of roughly 145mm2 for Polaris 10 and 580mm2 for Polaris 11. These numbers are interesting in how well they line up with current products. The Fiji die is 596mm2. With a new version of GCN, not just a shrink, we should expect higher performance per area. Maybe 15%... maybe even 30%. If we extrapolate that to known products, that would put Polaris 11 15~30% above Fiji, but using less than half the power. If we assume that AMD made some effort to free up the clockrate, additional performance would come from there. And we know that performance scales quite well with HBM, so HBM2 should bring another 5%. These little things push us up to a ~50% improvement over Fiji, while still using less power than a GTX 970. That's a pretty nice increase... and AMD can always make a bigger die when the yields improve. For Polaris 10, using the same method, we see a GPU that does not match the R9 380, but falls about a third short - which is right around GTX 950 territory... and which, sadly, makes sense given AMD's decision to compare it to the GTX 950. This leaves the entire middle range of performance out cold. Polaris 10 will not scale up to reach R9 380 levels, nor will Polaris 11 scale down to 390 levels (without a LOT of die harvesting). Of course, this is all linear math, which is far from accurate when you consider that a larger share of the smaller GPU's area will be taken up by supporting circuitry than of the larger GPU's... but that only hurts the little GPU more... or, since we know Polaris 10 can keep pace with the GTX 950, it may well mean Polaris 11 lands even higher in the performance category. Of course, this is all based on the graphic you showed, which may well not be even close to accurate, and a 'leaked' die size for an unknown 14nm GPU we assume to be Polaris 11... so it's not worth the pixels with which it is written.
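The back-of-the-envelope math above, written out. The 2:1 linear ratio, the 232mm2 figure, and the 2.5x density factor are all assumptions carried over from the post, so the outputs are only as good as those inputs:

```cpp
#include <cstdio>

int main() {
    // Assumption from the post: Polaris 11 is ~2x Polaris 10 per side (so 4x the area).
    const double p11SideMM  = 15.23;                    // sqrt of the leaked 232mm2
    const double p11AreaMM2 = p11SideMM * p11SideMM;    // ~232mm2
    const double p10SideMM  = p11SideMM / 2.0;          // 7.62mm/side
    const double p10AreaMM2 = p10SideMM * p10SideMM;    // ~58mm2

    // Post's assumption: 14nm LPP packs ~2.5x the density of 28nm HPP.
    const double densityGain = 2.5;
    std::printf("Polaris 10: %.0fmm2 -> ~%.0fmm2 28nm-equivalent\n",
                p10AreaMM2, p10AreaMM2 * densityGain);
    std::printf("Polaris 11: %.0fmm2 -> ~%.0fmm2 28nm-equivalent (Fiji is 596mm2)\n",
                p11AreaMM2, p11AreaMM2 * densityGain);
}
```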
  10. Core 2 Quad. Intel also did the same with their original dual cores (Pentium D).
  11. We already know this is a massive change to GCN, including a new ISA. http://www.eteknix.com/amd-prepping-post-gcn-greenland-baffin-ellesmere-gpus/
  12. FinFET pitch requirements and the utilization of 20nm upper layers (and probably other factors) throw a wrench into area scaling with all but Intel's 14nm process.
  13. Anandtech's database numbers are always run with Turbo enabled - that's where you're messing up. The database is for comparing stock performance, which includes Turbo. I've used their database for years to compare against other numbers. Many of their articles will disable Turbo to compare core to core or generational improvements, but they don't put those results in the database. You are right, though, that the Phenom II X6 has Turbo (3.7GHz) that I didn't take into account... however, there are plenty of direct clock-for-clock, no-Turbo comparisons between Phenom II and Bulldozer already done for us... http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/9 Direct IPC comparison (Opteron 6174 vs 6276). There's no comparison, Bulldozer is woefully slow. http://www.pcper.com/reviews/Processors/AMD-FX-Processor-Review-Can-Bulldozer-Unearth-AMD-Victory/FX-versus-Phenom-Perf-0 Direct IPC comparison #2: FX-8150 1C: 700.7; Phenom II 1C: 835.5. Now, for some simple math fun: 700 * 1.1 (Piledriver) = 770; 770 * 1.067 (Steamroller) = 821.59; 821.59 * 1.0985 (Excavator) = 902.52. Which, while just one benchmark, does fully match what I said. http://www.tomshardware.com/reviews/piledriver-k10-cpu-overclocking,3584-7.html Here we see that it takes a ~4GHz Piledriver to match a 3GHz Athlon II X4 640.
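The compounding above, spelled out - starting from the FX-8150's single-core score in the linked test and applying the per-generation IPC uplifts assumed in the post (10% Piledriver, 6.7% Steamroller, 9.85% Excavator):

```cpp
#include <cstdio>

int main() {
    // Single-thread scores from the linked PCPer comparison.
    const double bulldozer = 700.0;   // post rounds 700.7 down
    const double phenomII  = 835.5;

    // Per-generation IPC uplifts as assumed in the post.
    const double piledriver  = bulldozer   * 1.10;    // 770.00
    const double steamroller = piledriver  * 1.067;   // 821.59
    const double excavator   = steamroller * 1.0985;  // ~902.52

    std::printf("Piledriver:  %.2f\n", piledriver);
    std::printf("Steamroller: %.2f\n", steamroller);
    std::printf("Excavator:   %.2f (vs Phenom II at %.1f)\n", excavator, phenomII);
}
```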
  14. I've done an incredible amount of work comparing generational performance - and Phenom II has higher IPC than Bulldozer - by quite a lot, in fact. You are comparing a 4GHz CPU to a 3.3GHz CPU - and they just about break even... you forget that FX has Turbo and Phenom II does not (you are also comparing Piledriver - not Bulldozer...). If Bulldozer had the same IPC as Phenom II along with the clockrates it pushed, it would not have been deemed a failure at all, as it would have given AMD lower power draw, more cores, and more performance all in one go. Piledriver still had lower IPC, with parity not even arriving with Steamroller (though it got closer). Excavator, however, finally has higher IPC than Phenom II, and is quite similar to Penryn (Core 2). I do agree, though, that Bulldozer's IMC was much better than Phenom II's. Having owned both, and done extensive testing, that was one of Bulldozer's best attributes (good luck running high-speed DDR3 with Phenom II).