AMD Hosting Zen Event at HotChips on Tuesday

35 minutes ago, Tomsen said:

Name a few of those MIMD instructions then (just to make sure we are talking about the same things; I don't think you are, though). Games sure have things that can be parallelized; the issue is rather the more serialized procedure DX11 takes. Let me frame it this way: could the same optimizations not be done in scalar code and yield the same benefits?

 

You name a few old wizards and call that proof? You want me to look through all their individual work? Be more clear, or I might as well tell you to open your eyes as proof.

 

Intel has increased the execution port count from 6 to 9 and the ALU count from 3 to 4 in, what, 8 years? At some point, the hardware needs to advance for the software developers to take advantage of it. We have SIMD, which is very useful for SIMD-friendly workloads. You do realize most software uses MIMD? Any multi-threaded software, in fact.

 

Patrick, you clearly don't know what I'm talking about in regards to NetBurst. I'm not talking about introducing 2 ALUs, but rather about having the ALUs run at 2x the clock speed of the CPU, giving the effect of a 0.5-cycle execution for normal 1-cycle operations. Sure, to store the result the instruction latency appeared the same (so as not to mess up the cache subsystem), but intermediate execution was possible, IIRC. That is, you could reuse the value generated in the first 0.5 cycle in the next 0.5 cycle, effectively cutting the data latency in half relative to CPU clocks.

I can back you up on the NetBurst thing. It is one of several reasons why those chips were notoriously hot and problematic.

Sources for the NetBurst info are readily available at Wikipedia, AnandTech, and Tom's Hardware, in case you wonder @patrickjp93.

 

I will say though, in theory, due to the design of DX11 and previous versions, you could technically run a low-level injector system that spread out the data workload prior to feeding it through the CPU. However, it would be a case-by-case implementation.

 


1 hour ago, Tomsen said:

Name a few of those MIMD instructions then (just to make sure we are talking about the same things; I don't think you are, though). Games sure have things that can be parallelized; the issue is rather the more serialized procedure DX11 takes. Let me frame it this way: could the same optimizations not be done in scalar code and yield the same benefits?

 

You name a few old wizards and call that proof? You want me to look through all their individual work? Be more clear, or I might as well tell you to open your eyes as proof.

 

Intel has increased the execution port count from 6 to 9 and the ALU count from 3 to 4 in, what, 8 years? At some point, the hardware needs to advance for the software developers to take advantage of it. We have SIMD, which is very useful for SIMD-friendly workloads. You do realize most software uses MIMD? Any multi-threaded software, in fact.

 

Patrick, you clearly don't know what I'm talking about in regards to NetBurst. I'm not talking about introducing 2 ALUs, but rather about having the ALUs run at 2x the clock speed of the CPU, giving the effect of a 0.5-cycle execution for normal 1-cycle operations. Sure, to store the result the instruction latency appeared the same (so as not to mess up the cache subsystem), but intermediate execution was possible, IIRC. That is, you could reuse the value generated in the first 0.5 cycle in the next 0.5 cycle, effectively cutting the data latency in half relative to CPU clocks.

The problem with DX11 is not having 1 CPU core talking to the GPU. As a matter of fact, that's exactly how it should be. If you're really using your CPU for all it's worth on AI, supporting as many objects as possible, and handling the network, the one core talking to the GPU should be dedicated to that task. The problem with DX11 was the overhead associated with draw calls, requiring asinine marshaling techniques to stuff as much into one call as possible.
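To make the draw-call point concrete, here is a minimal sketch of the classic DX11 batching trick (assuming index and instance buffers plus shaders are already bound; UpdatePerObjectConstants is a hypothetical helper, and error handling is omitted):

// Naive: one draw call per object. The per-call driver/runtime overhead
// is exactly what made DX11 CPU-bound at high object counts.
for (UINT i = 0; i < objectCount; ++i) {
    UpdatePerObjectConstants(context, objects[i]); // hypothetical helper
    context->DrawIndexed(indexCount, 0, 0);
}

// Batched: per-object data moved into an instance buffer, so thousands
// of objects cost a single call into the driver.
context->DrawIndexedInstanced(indexCount, objectCount, 0, 0, 0);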

 

Check out the AVX2 instruction extensions. They're not all SIMD. A few of them are slightly complex compare/manipulate/exchange operations commonly found in transactions, or in bubble sort (the most efficient sort for small N where N fits in cache, btw).
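As an illustrative sketch (not any particular library's code): the branchless compare-exchange primitive that sorting networks, bubble sort's close cousin, are built from, written with the AVX2 min/max intrinsics. Strictly speaking these are still lane-wise SIMD operations, but they implement exactly the compare/exchange pattern described above.

#include <immintrin.h>

// One compare-exchange step over eight int lanes: afterwards every lane
// of *lo is <= the corresponding lane of *hi. Chaining such steps with
// shuffles yields a full in-register sorting network.
static inline void compare_exchange(__m256i* lo, __m256i* hi) {
    __m256i mn = _mm256_min_epi32(*lo, *hi); // lane-wise minima (AVX2)
    __m256i mx = _mm256_max_epi32(*lo, *hi); // lane-wise maxima (AVX2)
    *lo = mn;
    *hi = mx;
}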

 

No, it increased from 2 to 4 in 8 years (Pentium 3/4 to Haswell), and that's still ignoring branch prediction improvements every generation, loop detection (Nehalem), the micro-op cache (Sandy Bridge), MIMD (Haswell), and widening the out-of-order engine by 3 instructions. All of this is paid for in heat. It's not Intel's fault that programmers are the problem when it is handing out software tools on a silver platter to make programmers' lives easier.

 

They're not old wizards. They're current Titans of the industry and are frequent lecturers at CppCon, code::dive, GoingNative, and the International Supercomputing Conference. Andrei Alexandrescu vectorized Facebook's AI, something most of the world thought impossible due to AI's serial nature. Scott Meyers vectorized Prim's algorithm. Stephan T. Lavavej is the chief architect of the Microsoft Visual C++ library and the one who vectorized Windows 10, yes, an operating system.

 

Oh please, that doesn't count, and it's not even remotely possible to pull off now. If you can't get the results to registers or cache in less than a cycle, the effort to produce the result faster is worthless. Programmers are the problem, not Intel.


Just now, Prysin said:

I can back you up on the NetBurst thing. It is one of several reasons why those chips were notoriously hot and problematic.

Sources for the NetBurst info are readily available at Wikipedia, AnandTech, and Tom's Hardware, in case you wonder @patrickjp93.

 

I will say though, in theory, due to the design of DX11 and previous versions, you could technically run a low-level injector system that spread out the data workload prior to feeding it through the CPU. However, it would be a case-by-case implementation.

Also, that's how they got much closer to their 10GHz expectations than most people would think (clocking up to 9GHz, IIRC, with a little overclocking).

 

Sure, you could hack your way through it to distribute the data and workload across threads, but in the end you will still have to serialize and synchronize the data again to feed it through DX11. The amount of work required compared to the gains is a bad ratio.
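For reference, DX11 itself shipped a half-step in this direction: deferred contexts let worker threads record command lists in parallel, yet playback still funnels through the single immediate context, which is exactly the serialization point described above. A rough sketch (COM error handling omitted; RecordSceneChunk is a hypothetical helper):

// Worker thread: record draw commands into a deferred context.
ID3D11DeviceContext* deferred = nullptr;
device->CreateDeferredContext(0, &deferred);
RecordSceneChunk(deferred);                   // hypothetical helper
ID3D11CommandList* cmdList = nullptr;
deferred->FinishCommandList(FALSE, &cmdList);

// Render thread: playback remains one serialized submission point.
immediateContext->ExecuteCommandList(cmdList, FALSE);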


7 hours ago, patrickjp93 said:

No one can, yet that's what Intel would have to do for you people to feel it, because software has stagnated.

People expect magic because of the performance leap in graphics these days with DX12 and Vulkan; when those games are properly coded, it's miles away from what it was, even with cards that are 4 years old, because the software can make better use of the hardware (at least on AMD's side). But these are the same people who think Hyper-Threading is a must-have that replaces real cores, while it clearly doesn't and makes little to no difference, and sometimes even hurts performance when you have a properly parallelized algorithm.

You're right to point out that software has to follow.


57 minutes ago, patrickjp93 said:

The problem with DX11 is not having 1 CPU core talking to the GPU. As a matter of fact, that's exactly how it should be. If you're really using your CPU for all it's worth on AI, supporting as many objects as possible, and handling the network, the one core talking to the GPU should be dedicated to that task. The problem with DX11 was the overhead associated with draw calls, requiring asinine marshaling techniques to stuff as much into one call as possible.

 

Check out the AVX2 instruction extensions. They're not all SIMD. A few of them are slightly complex compare/manipulate/exchange operations commonly found in transactions, or in bubble sort (the most efficient sort for small N where N fits in cache, btw).

 

No, it increased from 2 to 4 in 8 years (Pentium 3/4 to Haswell), and that's still ignoring branch prediction improvements every generation, loop detection (Nehalem), the micro-op cache (Sandy Bridge), MIMD (Haswell), and widening the out-of-order engine by 3 instructions. All of this is paid for in heat. It's not Intel's fault that programmers are the problem when it is handing out software tools on a silver platter to make programmers' lives easier.

 

They're not old wizards. They're current Titans of the industry and are frequent lecturers at CppCon, code::dive, GoingNative, and the International Supercomputing Conference. Andrei Alexandrescu vectorized Facebook's AI, something most of the world thought impossible due to AI's serial nature. Scott Meyers vectorized Prim's algorithm. Stephan T. Lavavej is the chief architect of the Microsoft Visual C++ library and the one who vectorized Windows 10, yes, an operating system.

 

Oh please, that doesn't count, and it's not even remotely possible to pull off now. If you can't get the results to registers or cache in less than a cycle, the effort to produce the result faster is worthless. Programmers are the problem, not Intel.

Patrick, you are now putting words into my mouth and simply assuming things that I didn't state. I'm not saying the issue lies with DX11 only using a single thread to communicate (or rather, dictate) commands and data to the GPU. Rather, the issue is how serialized DX11's whole rendering pipeline is. Overhead is also a part of it, but I could argue that it is the result of such a serialized rendering pipeline (developers trying to work their way around DX11's limits).

 

All AVX2 instructions are SIMD (AVX2 is a SIMD instruction set). You had better name the individual instructions if you want to argue otherwise.

 

Wtf are you on? You are clearly being deceptive, picking a different baseline than me. I clearly stated that it was within the last 8 years (since Nehalem), then you go ahead and pick a baseline from 16+ years ago. Great job, Patrick. If you want to argue against my point, you had better use the same baseline, else the whole thing is moot.

 

They are old wizards, standing on their last legs before retirement. Wizards are also very powerful; I'm not understating their achievements.

Those are far from proof of what I argued. You can't expect the same scalability just by rewriting scalar code as vector code. In many cases you will see performance regressions. You really think Windows 10 is perfectly vectorized? If so, you have plenty to learn.
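One concrete way a mechanical vector rewrite can regress, as a sketch (my example, assuming a Haswell-era core, where _mm256_i32gather_ps is often no faster than plain scalar loads):

#include <immintrin.h>

// Scalar: eight independent cached loads the out-of-order core overlaps easily.
void scatter_sum_scalar(const float* a, const int* idx, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[idx[i]] + 1.0f;
}

// "Vectorized": the gather decomposes into the same loads plus extra
// overhead, so this version can actually lose to the scalar one.
void scatter_sum_avx2(const float* a, const int* idx, float* out, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i*)(idx + i));
        __m256  v    = _mm256_i32gather_ps(a, vidx, 4); // scale = 4 bytes
        _mm256_storeu_ps(out + i, _mm256_add_ps(v, _mm256_set1_ps(1.0f)));
    }
    // (tail elements omitted in this sketch)
}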

 

How does that not count? How is that not remotely possible now? They had it working; they can bloody well just implement it again if they need it.

Obviously the result ended up in the registers, else it wouldn't work, would it? Just pointing out the obvious.


Great. Hopefully that IPC improvement will be as great as expected. What was shown so far looks OK, but I can't wait to see more results.


7 hours ago, Tomsen said:

Patrick, you are now putting words into my mouth and simply assuming things that I didn't state. I'm not saying the issue lies with DX11 only using a single thread to communicate (or rather, dictate) commands and data to the GPU. Rather, the issue is how serialized DX11's whole rendering pipeline is. Overhead is also a part of it, but I could argue that it is the result of such a serialized rendering pipeline (developers trying to work their way around DX11's limits).

 

All AVX2 instructions are SIMD (AVX2 is a SIMD instruction set). You had better name the individual instructions if you want to argue otherwise.

 

Wtf are you on? You are clearly being deceptive, picking a different baseline than me. I clearly stated that it was within the last 8 years (since Nehalem), then you go ahead and pick a baseline from 16+ years ago. Great job, Patrick. If you want to argue against my point, you had better use the same baseline, else the whole thing is moot.

 

They are old wizards, standing on their last legs before retirement. Wizards are also very powerful; I'm not understating their achievements.

Those are far from proof of what I argued. You can't expect the same scalability just by rewriting scalar code as vector code. In many cases you will see performance regressions. You really think Windows 10 is perfectly vectorized? If so, you have plenty to learn.

 

How does that not count? How is that not remotely possible now? They had it working; they can bloody well just implement it again if they need it.

Obviously the result ended up in the registers, else it wouldn't work, would it? Just pointing out the obvious.

*Eyeroll* Now you're nitpicking. Like I said: the micro-op cache, changing the ALU count from 2 to 4 (Nehalem did not have 3), widening the out-of-order engine by 3 instructions (better self-optimization by the CPU, though that's still limited as most code has a branch every 7 lines), increasing the independent execution port count to 9 (less pressure on individual compute units, more flexible coding), including a programmable iGPU for heterogeneous tasks where latency is a limiting factor, and increasing the width of the loop detector by 30%. That has all happened since Nehalem.

 

Lavavej is in his early 30s and Andrei just turned 40. They're not old wizards. You could argue Scott Meyers is, but that's it.

 

It's not possible because the switching speeds of silicon are not maintainable at 8GHz without enormous heat problems, not to mention most transistors need to be very cold to maintain that speed, and that's the requirement to produce a result in half a cycle of the CPU. And it doesn't count because the half cycle saved ended up being swallowed by the requirements of cache and registers.
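As a back-of-the-envelope check on that half-cycle budget (my numbers, assuming a 4GHz core clock purely for illustration):

#include <cstdio>

int main() {
    const double core_ghz = 4.0;                // assumed core clock
    const double cycle_ps = 1000.0 / core_ghz;  // one core cycle in picoseconds
    // A double-pumped ALU must settle a result in half that window.
    std::printf("full cycle: %.0f ps, ALU window: %.0f ps\n",
                cycle_ps, cycle_ps / 2.0);      // 250 ps vs 125 ps at 4GHz
}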

 

If anything, NetBurst is proof I'm right.


It's Tuesday already! Gimme the news!


34 minutes ago, Thony said:

It's Tuesday already! Gimme the news!

You gotta wait until 5:45 P.M. tho :( 


14 hours ago, patrickjp93 said:

*Eyeroll* Now you're nitpicking.

Ok then, let me know what I missed in my previous reply and I'll comment on it.

 

14 hours ago, patrickjp93 said:

Like I said: the micro-op cache

Which had a big influence on the performance improvement we saw in Sandy Bridge.

 

14 hours ago, patrickjp93 said:

changing the ALU count from 2 to 4 (Nehalem did not have 3)

You are not even close to being right. Patrick, take 2 minutes and Google it; you'll find I am right that Nehalem has 3 ALUs.

 

14 hours ago, patrickjp93 said:

widening the out-of-order engine by 3 instructions (better self-optimization by the CPU, though that's still limited as most code has a branch every 7 lines)

I'm having some issues understanding this statement. OoOE width normally isn't described in instructions (as in x86 instructions), and 3 instructions also seems awfully low, to be honest. CPUs know how to deal with branches; it is rather a question of how unpredictable the branch is.
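The predictability point is easy to demonstrate; a minimal sketch of the classic sorted-versus-unsorted experiment (timing code left out):

#include <algorithm>
#include <cstdlib>
#include <vector>

// Sums elements >= 128. With random data the branch is a coin flip and
// mispredicts constantly; after std::sort the same loop runs several
// times faster because the branch becomes predictable.
long long sum_big(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data)
        if (v >= 128)            // predictable only if the data has a pattern
            sum += v;
    return sum;
}

int main() {
    std::vector<int> data(1 << 20);
    for (int& v : data) v = std::rand() % 256;
    long long a = sum_big(data);  // unpredictable branch
    std::sort(data.begin(), data.end());
    long long b = sum_big(data);  // same work, predictable branch
    return a == b ? 0 : 1;        // sums match; only the timing differs
}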

 

14 hours ago, patrickjp93 said:

increasing the independent execution port count to 9 (less pressure on individual compute units, more flexible coding)

I already mentioned this in a previous reply.

 

14 hours ago, patrickjp93 said:

including a programmable iGPU for heterogeneous tasks where latency is a limiting factor

Sadly, the only thing I heard about people working on Intel's iGPU for compute was the buggy drivers, but this was back in 2011 IIRC, so a lot could have changed.

QuickSync is most likely the more-used feature.

 

14 hours ago, patrickjp93 said:

and increasing the width of the loop detector by 30%.

The loop detector is now a part of the µop cache, isn't it? Pretty sure that changed with Sandy.

 

14 hours ago, patrickjp93 said:

Lavavej is in his early 30s and Andrei just turned 40. They're not old wizards. You could argue Scott Meyers is, but that's it.

Scott Meyers is also the only one I had any real knowledge of; the rest I have only heard a few tidbits about.

 

14 hours ago, patrickjp93 said:

It's not possible because the switching speeds of silicon are not maintainable at 8GHz without enormous heat problems, not to mention most transistors need to be very cold to maintain that speed, and that's the requirement to produce a result in half a cycle of the CPU.

Here's the deal with that: you don't need the entire IC to run at the same high frequency, which is why NetBurst ran its ALUs at 2x the frequency.

Sure, at some point heat is going to be an issue, which is why Intel never reached their 10GHz goal.

 

14 hours ago, patrickjp93 said:

And it doesn't count because the half cycle saved ended up being swallowed by the requirements of cache and registers.

I'm not sure I'm understanding your logic. Of course it counts; just because it needs to realign the store order doesn't change the fact that the internal registers update at the same rate as the ALU. So you can absolutely save the half cycle without it being swallowed whatsoever.

 

14 hours ago, patrickjp93 said:

If anything, NetBurst is proof I'm right.

If anything, NetBurst is proof of Intel overestimating a new proof-of-concept without any tools or previous data on such an experiment.

They were overambitious in their implementation for their time, and the benefits were rather diminished as a result.


2 hours ago, DocSwag said:

You gotta wait until 5:45 P.M. tho :( 

Really? That sounds good. I was expecting it like tomorrow. But thank god AMD doesn't do their stuff in the evenings like most other companies.


3 minutes ago, Thony said:

Really? That sounds good. I was expecting it like tomorrow. But thank god AMD doesn't do their stuff in the evenings like most other companies.

If you've seen the slides floating around on the internet yesterday, my guess is those are the slides that will be shown today. They may also show some benchmarks today, though, so who knows.


3 hours ago, DocSwag said:

You gotta wait until 5:45 P.M. tho :( 

Is that Pacific time?


On 8/22/2016 at 9:53 AM, Tomsen said:

Intel has increased the data parallelism of its vector units by a good amount with each new architecture. That helps a lot in certain workloads, and has basically no benefit outside them. Trying to defend it by saying "programmers who can properly optimize code" is stupid; you can't properly optimize all code (far from it, actually) into vector code and expect the same scalability. In the general pipeline, outside of vector data parallelism, what exactly has Intel done to make its CPUs faster? Not much.

Actually, "believe me, I'm great but nobody codes for me" is the story of Bulldozer & co. in a nutshell :P 


1 hour ago, TidaLWaveZ said:

Is that Pacific time?

Yes, since the event is in the Bay Area.


11 minutes ago, vitor_cut said:

Will it be streamed?

 

That's the question I should've been asking instead of what time zone it's in. Looks like there won't be a stream; the website says presentations are only for attendees.


10 minutes ago, vitor_cut said:

Will it be streamed?

 

I don't think so. If you want to find out what happens there, it looks like you'll have to just see what news sites report about it afterwards :/ 


6 hours ago, Tomsen said:

Ok then, let me know what I missed in my previous reply and I'll comment on it.

 

Which had a big influence on the performance improvement we saw in Sandy Bridge.

 

You are not even close to being right. Patrick, take 2 minutes and Google it; you'll find I am right that Nehalem has 3 ALUs.

 

I'm having some issues understanding this statement. OoOE width normally isn't described in instructions (as in x86 instructions), and 3 instructions also seems awfully low, to be honest. CPUs know how to deal with branches; it is rather a question of how unpredictable the branch is.

 

I already mentioned this in a previous reply.

 

Sadly, the only thing I heard about people working on Intel's iGPU for compute was the buggy drivers, but this was back in 2011 IIRC, so a lot could have changed.

QuickSync is most likely the more-used feature.

 

The loop detector is now a part of the µop cache, isn't it? Pretty sure that changed with Sandy.

 

Scott Meyers is also the only one I had any real knowledge of; the rest I have only heard a few tidbits about.

 

Here's the deal with that: you don't need the entire IC to run at the same high frequency, which is why NetBurst ran its ALUs at 2x the frequency.

Sure, at some point heat is going to be an issue, which is why Intel never reached their 10GHz goal.

 

I'm not sure I'm understanding your logic. Of course it counts; just because it needs to realign the store order doesn't change the fact that the internal registers update at the same rate as the ALU. So you can absolutely save the half cycle without it being swallowed whatsoever.

 

If anything, NetBurst is proof of Intel overestimating a new proof-of-concept without any tools or previous data on such an experiment.

They were overambitious in their implementation for their time, and the benefits were rather diminished as a result.

It had 2 ALUs, an FPU, and an SSE vector unit. Read the Nehalem architecture manual.

 

Handling 14 (vs. 11 from SB's days) is about all Intel can do without increasing the pipeline stall rate due to branch misprediction.

 

But Intel has made its predictor 99% accurate in a 4-decision branch and 90% accurate in a 7-decision branch.

 

Intel's graphics drivers are not buggy. They just don't fix bad code for you. Everyone in enterprise who wanted iGPU has been happy to use it since Ivy Bridge, and that includes Amazon, Facebook, Google, and Microsoft. If you adhere strictly to the manuals, you're fine.

 

I forgot QS. Let's add that to the list!

 

No, the loop detector is at an instruction, not µop, level these days. I think the loop can only be 20 instructions wide and can only have branches change internal paths (no function calls), but that is a further innovation since the Sandy days.

 

If you want other names, Chandler Carruth currently works for Google and is in his mid 30s. Howard Hinnant is another Titan active in the industry today. That's not even counting the hundreds of C++ programmers Intel has working for it who never contribute to open conferences like CppCon et al. Cilk Plus proved that almost anything you can loop over, be it objects or functional tasks, can be vectorized with performance gains.
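For a flavor of the kind of loop that claim covers, a stand-in sketch using OpenMP's #pragma omp simd, which makes the same promise Cilk Plus's #pragma simd did (the saxpy kernel is my example, not anything from those conferences):

#include <cstddef>

// Independent iterations, no carried state: the pragma tells the compiler
// it may use full-width SIMD without having to prove independence itself.
// Build with -fopenmp-simd on GCC or Clang.
void saxpy(float a, const float* x, float* y, std::size_t n) {
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}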

 

Internal registers do not update at the same time as the ALU. Digital signaling, anyone? It takes time for the data to come back from the ALU. It doesn't count, because the end result was not really getting anything done in less than a cycle.


Yo @patrickjp93 and @Tomsen, are you two by any chance misunderstanding each other? Because I just searched it up and you two are both right: Nehalem has 2 vector ALUs but 3 integer ALUs. So technically Patrick is right that it has 2, and Tomsen is right that it has 3. Is that by any chance the problem here?


38 minutes ago, DocSwag said:

Yo @patrickjp93 and @Tomsen, are you two by any chance misunderstanding each other? Because I just searched it up and you two are both right: Nehalem has 2 vector ALUs but 3 integer ALUs. So technically Patrick is right that it has 2, and Tomsen is right that it has 3. Is that by any chance the problem here?

Are you getting those numbers by looking at the execution port diagram or the core block diagram? Also, one of your 3 ALUs is an AGU, which is only for indirect memory addressing.


42 minutes ago, patrickjp93 said:

Are you getting those numbers by looking at the execution port diagram or the core block diagram? Also, one of your 3 ALUs is an AGU, which is only for indirect memory addressing.

I only read that off Wikipedia xD so idk.

https://en.m.wikipedia.org/wiki/Nehalem_(microarchitecture)

Quote

3 integer ALU, 2 vector ALU and 2 AGU per core.

 


35 minutes ago, DocSwag said:

I only read that off Wikipedia xD so idk.

https://en.m.wikipedia.org/wiki/Nehalem_(microarchitecture)

 

Well, it's wrong.

 

http://sc.tamu.edu/systems/eos/nehalem.pdf

 

As detailed in this paper, the execution ports list 3 ALU operation possibilities, but there are only 2 ALUs. The reason for having more ports is more flexible out-of-order management and higher independence between instruction paths.


14 hours ago, patrickjp93 said:

It had 2 ALUs, an FPU, and an SSE vector unit. Read the Nehalem architecture manual.

 

Handling 14 (vs. 11 from SB's days) is about all Intel can do without increasing the pipeline stall rate due to branch misprediction.

 

But Intel has made its predictor 99% accurate in a 4-decision branch and 90% accurate in a 7-decision branch.

 

Intel's graphics drivers are not buggy. They just don't fix bad code for you. Everyone in enterprise who wanted iGPU has been happy to use it since Ivy Bridge, and that includes Amazon, Facebook, Google, and Microsoft. If you adhere strictly to the manuals, you're fine.

 

I forgot QS. Let's add that to the list!

 

No, the loop detector is at an instruction, not µop, level these days. I think the loop can only be 20 instructions wide and can only have branches change internal paths (no function calls), but that is a further innovation since the Sandy days.

 

If you want other names, Chandler Carruth currently works for Google and is in his mid 30s. Howard Hinnant is another Titan active in the industry today. That's not even counting the hundreds of C++ programmers Intel has working for it who never contribute to open conferences like CppCon et al. Cilk Plus proved that almost anything you can loop over, be it objects or functional tasks, can be vectorized with performance gains.

 

Internal registers do not update at the same time as the ALU. Digital signaling, anyone? It takes time for the data to come back from the ALU. It doesn't count, because the end result was not really getting anything done in less than a cycle.

It has 3 ALUs. It is shown in the core block diagram, it is in the documentation, and it is just generally accepted as such. Where is it stated otherwise?

 

I've never heard of using x86 instructions as a measurement of OoOE width. That would depend on the workload; you can't just throw out random numbers, unless it is distinctly caused by some kind of bottleneck in the microarchitecture.

 

As for branch predictors and their accuracy, obviously marketing has had its fingers in those numbers; else branches wouldn't be a problem.

 

They were buggy; I can't say anything about how it is today. Good thing there wasn't any bad code: he did end up singling out the issue so he could replicate it and verify it was a bug (it didn't behave as the documentation stated it should). I haven't heard about enterprise being happy with Intel's iGP (only the opposite), and those mega-corporations don't use the same public documentation/information as others do, often relying on things they discovered themselves or were told in confidence.

 

You don't need to add it to the list; it is already covered under "Intel added an iGP with Sandy".

 

http://www.tomshardware.com/reviews/Intel-i7-nehalem-cpu,2041-3.html

"With the Nehalem architecture, Intel has improved the functionality of the Loop Stream Detector. First of all the buffer is larger—it can now store 28 instructions. But what’s more, its position in the pipeline has changed. In Conroe, it was located just after the instruction fetch phase. It’s now located after the decoders; this new position allows a larger part of the pipeline to be disabled. The Nehalem’s Loop Stream Detector no longer stores x86 instructions, but rather µops."

 

We can both play a game of name-a-wizard, but that doesn't prove anything yet. Because all code is simple loops, right? Also, that is not news. Loops are perfect for vectors because, guess what: a small code sample gets repeated over and over, and if no dependencies are in place there are basically no issues. That is, however, far from normal code.
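To make the dependency point concrete, a sketch (my example): the first loop maps straight onto SIMD lanes; the second carries a value between iterations, so a mechanical scalar-to-vector rewrite buys nothing.

// Independent iterations: each c[i] depends only on a[i] and b[i],
// so this maps directly onto SIMD lanes.
void add(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Loop-carried dependence: out[i] needs out[i-1] first, a serial chain.
// (Prefix sums can be vectorized, but only by restructuring the algorithm,
// not by a mechanical rewrite.) Assumes n > 0.
void prefix_sum(const float* in, float* out, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; ++i)
        out[i] = out[i - 1] + in[i];
}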

 

Patrick, of course the necessary internal registers updated at the same frequency, else it wouldn't work one bit.

Yes, it takes time for the data to come back from the ALU; however, that is a non-issue, the same way it takes time for the data to come into the ALU.

Or are you saying they clocked it at 2x the normal clock speed for nothing, since nothing got accomplished in less than a normal clock cycle?

 


@DocSwag almost a day later and not a single word on what happened on Tuesday. Was AMD not giving out any new info, or what happened?

