Blog Entries posted by Mira Yurizaki

  1. Mira Yurizaki
    When you look at the bandwidth of a communication bus or interface such as USB or SATA, or the speed your ISP advertises, you notice that it's often measured in bits per second instead of bytes per second, the figure we're more used to. The common assumption is that companies advertise bits per second because it's a larger number, and larger obviously means better to the average consumer. Confusingly, the shorthand for the two looks similar enough that they're easy to mix up.
     
    Except there's a more likely reason why you see bits per second: at the physical layer of communication, data isn't always 8 bits.
     
    Let's take for instance every embedded system's favorite communication interface: the humble UART (universal asynchronous receiver/transmitter). The physical interface itself is super simple: at minimum all you need is two wires (data and ground), though a system may have three (transmit, receive, ground). However, there are three issues:
    How do you know when a data frame (a byte in this case) has started? What if you were sending a binary 0000 0000? If 0V represents binary 0, the line would look flat the entire time, so how would you know whether you're actually getting data or not?
    How do you know when to stop receiving data? A UART can be set up to accept a certain number of data bits per "character," so it needs to know when to stop receiving.
    Do you want some sort of error detection mechanism?
    To resolve these:
    A bit is used to signal the start of a transmission by being the opposite of whatever value the UART 'rests' at. So if the UART rests at 0, the start bit is whatever represents 1.
    A bit (or more) is used to signal the end of a transmission. This is often the opposite value of the start bit, to guarantee that at least one voltage transition takes place.
    A bit can be used for parity, which is 0 or 1 depending on whether the number of data bits set to 1 is even or odd. Note that error detection mechanisms are optional.
    A common UART setting is 8-N-1: 8 data bits, no parity, 1 stop bit. This means at minimum there are 10 bits on the wire per 8 data bits (the start bit is implied). This can be as high as 13 bits per 9 data bits, such as in 9-Y-2 (9 data bits, with parity, 2 stop bits). So if a UART in an 8-N-1 configuration is transmitting at 1,000 bits per second, the system is only capable of transferring 800 data bits per second, an 80% efficiency rating.
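    To make the arithmetic concrete, here's a small Python sketch (my own illustration; the function name and defaults are made up for this post) that computes the efficiency of a given frame setting:

def uart_efficiency(data_bits=8, parity=False, stop_bits=1):
    """Return (bits on the wire per frame, fraction of the frame that is actual data)."""
    frame_bits = 1 + data_bits + (1 if parity else 0) + stop_bits  # the leading 1 is the start bit
    return frame_bits, data_bits / frame_bits

print(uart_efficiency(8, parity=False, stop_bits=1))  # 8-N-1: (10, 0.8) -> 80% efficient
print(uart_efficiency(9, parity=True, stop_bits=2))   # 9-Y-2: (13, ~0.692) -> ~69% efficient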
     
    Note: Technically it's not proper to express the transmission rate of a UART as "bits per second" but as "baud", which is how many times per second the line can change its voltage level. In some cases, you may want to use more than one voltage level shift per bit, such as when embedding a clock signal; this is what encodings like Manchester code do. But often, baud = bits per second.
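    As a rough illustration of why baud and bits per second can differ, here's a tiny sketch of Manchester coding (using the IEEE 802.3 convention; the other common convention simply flips the two halves). Each data bit costs two signal levels, so the symbol (baud) rate is twice the bit rate:

def manchester_encode(bits):
    # IEEE 802.3 convention: 0 -> high then low, 1 -> low then high
    out = []
    for b in bits:
        out += [1, 0] if b == 0 else [0, 1]
    return out

print(manchester_encode([1, 0, 1, 1]))  # [0, 1, 1, 0, 0, 1, 0, 1] -> 8 symbols for 4 bits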
     
    Another example is PCIe (before 3.0) and SATA. These use an encoding method known as 8b/10b encoding, where 8 bits of data are encoded as a 10-bit sequence. The main reason for doing this is to achieve DC balance: over time, the average voltage of the signal is 0V. This is important because communication lines often have a capacitor acting as a filter. If the average voltage stays above 0V over time, it can charge this capacitor to the point where the line reaches a voltage that causes issues, such as a 0 bit looking like a 1 bit.
     
    In any case, like the UART setting 8-N-1, 8b/10b encoding is 80% efficient.
     
    This is all a long explanation to say that the reason communication lines are expressed in bits per second rather than bytes per second is that bits per second is almost always technically correct, whereas bytes per second is not.
  2. Mira Yurizaki
    As a software developer, I've come across things people say that annoy me, because often they don't match reality:
    Software development is "easy"
    Like any other skill, it looks easy not because it actually is, but because people have built up the experience and skills necessary to simply do it. If it were easy, you, as a layperson, would be able to do it just as easily.
      Software is built from start to finish in one go, e.g.: "Day one patches are dumb"
    In software development land, this is known as the "Waterfall model." In a lot of commercially developed software, this process is almost never used as a whole. It might be used for the software development portion itself, as in the part that's in the hands of the people actually churning out code, but as far as the software project as a whole goes, other models are used. After all, if you're making a game and all the concept art and storyboarding is done, those creative people likely aren't going to be the same people coding the game. That might be the case in a smaller studio, but not in a AAA studio.

    Typically what's done is some variation of incremental build model or iterative build model, which usually ends up going to Agile software development.

    This is why games have things like day-one patches or DLC. Before the game can be formally released, it has to go through a validation process. Instead of having the people working on the game sit on their thumbs, why not have them work on stuff in the meantime and release it later? And since patches and DLC often have a less stringent validation process, they can go through much faster.
      Software is released the moment the last line of code is written and the application is built
    Any developer worth their salt will have a process in place once the final build is made. That process involves testing the heck out of the application to make sure it works, that all the things it needs to do are done, that it doesn't break other things in horrible ways. Only once the final build has passed all these checks can it be released.

    Granted it may not feel like this in certain cases, but it's silly to release the final build without doing some sort of testing.
      "But that problem should've been seen a mile away!"
    Have you ever proofread a paper you wrote for the hundredth time and still somehow missed a simple spelling or grammar error? The same principle applies here.

    This is also on top of some codebases being huge, up to hundreds of thousands or millions of lines of code. Chances are you're not touching every bit of it, but laser-focused on only some parts of it. Or you're so focused on solving one problem that you don't see there's a problem in another area.

    Or basically it's a similar thing that this video attempts to point out.
      "How can X have so many problems?"
    This is sort of an umbrella. An example I can think of is Windows Update. Yes, it's become a sort of meme that Windows updates are unreliable and can break your system, but at the same time, Microsoft has to deal with hundreds of millions of instances of Windows to update, likely with millions of different configurations, not just of hardware but of software as well. To think a 100% success rate should be expected is absurd. Also, given the install base of, say, Windows 10, if we assume the figure given by Microsoft of 900 million "Windows devices" (https://news.microsoft.com/bythenumbers/en/windowsdevices), even a million people affected by a problem is less than 1%. A million people is a lot. Less than 1% of the userbase isn't.

    Basically, the pool of users Microsoft has to deal with is so large, each using their devices uniquely, that the sample size is big enough that the probability of any given problem coming up is basically 100%.

    You try making software that works on nearly a billion different devices with countless combinations used uniquely in each way without a problem.  
  3. Mira Yurizaki
    With NVIDIA's RTX cards out and the company pushing ray tracing, I figured I'd have a look around at what I could find in the graphics community, through blog posts and whatnot, about ray tracing itself. Then, interacting with the community, it seems there are some misunderstandings and perhaps a warped interpretation of what's going on. So this post is a random assortment of thoughts regarding encounters with others discussing this topic and what my input is.
     
    Ray tracing describes a type of algorithm, but it's not necessarily a specific algorithm
    The first thing I encountered when looking through the literature is that what's called "ray tracing" can be vague. Does it describe a specific algorithm, like heapsort or the fast inverse square root, or does it describe a class of algorithms, like sorting or searching? Or to think of it another way, does ray tracing describe something like "storage device" or something like "NAND-based, SATA solid state drive?"
     
    As far as the usage of the term goes, I'm led to believe that ray tracing is describing a type of algorithm. That is, the basic algorithm is shooting some ray out that mimics how a photon is shot out, then tracing it along some path and seeing how it interacts with the world. To that end, I've found several forms of ray tracing that exist:
    Ray Casting: This is the most basic version of ray tracing, where the first thing the ray intersects is what the final output is based on. One could argue this is the basic step of ray tracing in and of itself.
    Ray Marching: The most common implementation has the ray's path generated by spheres that originate at some point. At the first point, a sphere grows until it hits something, then the next point of the ray is at the edge of the sphere in the ray's direction. Then another sphere is generated that grows until it hits something, creating a new point at the edge in the direction of the ray, and so on. An object is considered "hit" when the sphere is small enough. (There's a small sketch of this after the list.)

    (Taken from http://jamie-wong.com/2016/07/15/ray-marching-signed-distance-functions/)
    Path Tracing: When someone talks about ray tracing without any other context, this is the algorithm they're usually referring to. Path tracing attempts to trace the path of a ray from the camera to a light source. On top of this, each sample point uses a ray that's pointed in a random direction. The idea is that the more samples you use, the closer you get to the actual image. Some industry folks may consider "ray tracing" itself to be the original algorithm devised by J. Turner Whitted, while "path tracing" is the algorithm described by Jim Kajiya.
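    Here's a minimal sketch of that ray marching idea (often called sphere tracing) against a signed distance function. The names and the hard-coded sphere are just for this illustration:

import math

def sdf_sphere(p, center=(0.0, 0.0, 5.0), radius=1.0):
    # Signed distance from point p to a sphere's surface
    return math.dist(p, center) - radius

def ray_march(origin, direction, sdf, max_steps=64, hit_eps=1e-3, max_dist=100.0):
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        d = sdf(p)              # radius of the largest "safe" sphere around p
        if d < hit_eps:         # the sphere got small enough: we hit something
            return t
        t += d                  # step to the edge of that sphere along the ray
        if t > max_dist:
            break
    return None                 # no hit

print(ray_march((0, 0, 0), (0, 0, 1), sdf_sphere))  # ~4.0 (sphere surface sits at z = 4)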
     
    Ray tracing also solves a problem with rasterization
    What rasterized rendering does today is take in a ton of information about the scene and then throw a lot of it away before working on it. One of the first things it throws away is all of the geometry the camera cannot see. Next, pieces of the scene are built up one by one. They are either added on top of each other right away, like in a forward renderer, or assembled on the side to be combined later, like in a deferred renderer. (https://gamedevelopment.tutsplus.com/articles/forward-rendering-vs-deferred-rendering--gamedev-12342)
     
    One issue with traditional rendering is also the order in which things are rendered. This can lead to weird artifacts like light spilling onto areas where there's no obvious light source, like in these examples:

     

     
    By using ray tracing, the rays bring back information of what's visible, what isn't visible, and how light can indirectly affect other objects in a realistic manner.
     
    Real-time ray tracing isn't a relatively new thing for games
    The funny thing is, ray tracing has already been used in games for some time. Some games, like Guerrilla Games' Killzone Shadow Fall, use ray tracing to do screen-space lighting (slide 84), mostly in reflections and what appears to be ambient occlusion.
  4. Mira Yurizaki
    A common complaint or concern I hear about Task Manager is that what it reports doesn't seem very useful, or that people wish what it reported was something more meaningful. A commonly cited one is CPU % utilization, which people have commented doesn't really represent what the CPU is actually doing.
     
    After stewing on this, I thought of a better way to think about Task Manager: as a status report to a company's department head. For example, take a company's head of software development. They're interested in the progress of each project or program, whether people are busy or not, how much money is being spent, etc. To take a real-life example, say you're the software department head for Apple. You'd care about things like how iOS, iPadOS, and macOS are doing, what their staff allocation is and how busy those staff are (because this may mean you need more or fewer people), and how much money each of those projects is spending. What you wouldn't care about are things like what each individual person is working on or what minor hiccups there are in the development process. A department head doesn't care about the little things; they care about the big picture.
     
    In the same vein, Task Manager should be used to look at every process and see what's taking up your hardware resources. Hardware resources that execute something, like the CPU and GPU, should only indicate how busy they are. Resources that store something, like RAM and storage, should only indicate how much stuff they hold (though storage is a strange one since it also has a queue, so it's more useful to show how busy that is rather than how full it is). Everything has its little details, but it's still important to remember: Task Manager is for, as its name implies, managing tasks. Knowing the intricate details of, say, how the CPU is being used isn't useful in this regard. It's only useful for someone who's developing software and wants to know how they can optimize it. Likewise in a company, there's probably someone who looks at how people are developing software, and it's their job to make sure things run smoothly.
  5. Mira Yurizaki
    This is a statement I see prevalent in tech circles: an installation of Windows can't be booted on a computer other than the one it was installed on. The most commonly cited reason is that the drivers installed for that particular machine will conflict and cause issues with the hardware in another computer. This is false, if only because I've not only successfully booted into Windows installed for another machine, but managed to do work in it. I used this method to fix someone else's computer, and even when the drive was put back in its original machine, it ran just fine.
     
    Okay, an anecdote isn't the best way to disprove this. So let's think about this using what we know about the booting process for computers:
    1. System powers on, loads up the BIOS/UEFI, sets up basic I/O
    2. System applies settings to configure the hardware and the motherboard chipset
    3. Other hardware is detected, if present, and possibly configured for bare minimum operation
    4. System looks at the boot device table and checks each entry for a bootloader. The first bootloader encountered is run.
    5. If the bootloader points to an OS, it begins the OS startup process
    6. The OS loads the kernel and does a hardware detection check. As part of the hardware detection check, it sees which driver for said hardware is available.
    7. After the kernel is loaded and the hardware is configured, the rest of the OS gets set up until the user can finally interact with it
    Step 6 leads me to believe that if an OS installed for one system is booted on another with different hardware, it's not going to break, assuming there's some driver for that hardware. For modern operating systems, there's going to be a generic driver for most of the core devices. So how can Windows not boot or "break" if the driver set installed on it won't even be loaded in the first place, because the hardware for said driver isn't detected? I guess if you consider "break" to be "not running as intended," sure. But that's kind of a stretch.
     
    Further Reading
    http://www.c-jump.com/CIS24/Slides/Booting/Booting.html
    https://www.cs.rutgers.edu/~pxk/416/notes/02-boot.html
    https://en.wikipedia.org/wiki/Booting
    https://en.wikipedia.org/wiki/Ntdetect.com
  6. Mira Yurizaki
    One of the more touted features of applications using DirectX 12 and Vulkan is "asynchronous compute." Over the years it has gained a certain mystique, with people saying one thing or another and creating a sort of stir of information. In this blog we'll go over what asynchronous compute is, where it came from, and how NVIDIA and AMD handle it.
     
    What exactly is asynchronous compute?
    Instead of trying to find an answer on the interwebs, since this term didn't really exist much before 2014 or so, let's look at the term itself, break it apart, and see what we can come up with. The "compute" part is easy: it's some kind of task that's computing something. The term itself suggests it's generic as well. Then comes the "asynchronous" part. This term has a few meanings in computer hardware and software:
    If hardware is synchronous, its operations are driven by a clock cycle. Likewise, asynchronous hardware is not driven by a clock cycle.
    In software, if an operation is synchronous, each step is performed in order. If an operation is asynchronous, tasks can be completed out of order.
    Since it doesn't make sense for "asynchronous compute" to be about clock signals, the second definition applies. To expand on this, say an application needs to load data from the hard drive. It can then do one of the following:
    Pause until the hard drive comes back with data, which could take a while.
    Set the request aside as a separate task: when it comes time to read data from the hard drive, make the request and go to sleep. This way other tasks can run, and when the data becomes available, do something with it as soon as possible.
    In the first case, the application ensures that the order of tasks is met, but this can cause it to perform poorly because it won't do anything else in the meantime. In the second case, the application doesn't ensure the order of tasks is met, but it can do other things rather than idle around.
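    As a toy sketch of that second approach (asyncio.sleep standing in for the slow disk; none of this is from any particular engine or library):

import asyncio

async def read_from_disk():
    await asyncio.sleep(1.0)          # pretend the drive takes 1 second to respond
    return b"file contents"

async def do_other_work():
    for i in range(4):
        print("doing other work", i)
        await asyncio.sleep(0.25)

async def main():
    read_task = asyncio.create_task(read_from_disk())  # the request is set aside as a task
    await do_other_work()                              # other work runs in the meantime
    data = await read_task                             # pick the data up once it's ready
    print("got", data)

asyncio.run(main())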
     
    Putting it together in the context of graphics cards, then it stands to reason that "asynchronous compute" is allowing the graphics card to do compute tasks out of some order so that the GPU can better utilize its resources. So what did the APIs do in order to enable this?
     
    Multiple command queues to the rescue!
    What DirectX 12[1] and Vulkan[2] introduced was multiple command queues to feed the GPU with work to do. To understand why this is necessary, let's look at the old way of doing things.
     
    The irony of graphics rendering is that while we consider it to be a highly parallel operation, determining the final color of an actual pixel is, in fact, a serial operation.[3] And because it's a serial operation, the operations could be submitted into one queue. However, as graphics got more and more advanced, more and more tasks were added to this queue. Still, the overall basic rendering operation was "render this thing from start to finish." Once the GeForce 8 and Radeon HD 2000 series introduced the idea of a GPU that can perform generic "compute" tasks, it was found that some operations could be offloaded to the compute pipeline rather than riding along with the graphics pipeline. One such technique that can be offloaded to compute shaders is ambient occlusion.[4] It can do this because only one type of output from the graphics pipeline is needed: the depth buffer.
     
    But this presents a problem. Depth buffers are generated pretty early in the rendering pipeline, which means the GPU could work on the ambient occlusion any time from then up until it needs to composite the final output. But with a single queue, there's no way for the GPU to squeeze this work in, even if the graphics pipeline has bubbles where resources are not being used.
     
    To illustrate this further, let's have an example:
     
    In this, there are ten resources available. The first three tasks take up 9 of them. The GPU then looks at task 4, sees it doesn't fit, and stops here. As noted before, actually rendering a pixel is a serial process, and it's not a good idea to jump around in the queue to figure out what will fit. After all, what if task 5 is dependent on task 4? No assumptions can be made. But let's pretend task 5 isn't dependent; then how do we get this onto the GPU if we can't re-order the queue? We make another queue.
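    As a toy model of what that second queue buys us (my own sketch; real hardware schedulers are far more involved), give each task a "cost" in resource slots, fill the in-order queue first, then top off from the second, independent queue:

RESOURCES = 10

def schedule(queue1, queue2):
    """Each queue is a list of (name, cost). Returns the tasks issued this cycle."""
    free = RESOURCES
    issued = []
    # Fill from queue 1 strictly in order; stop at the first task that doesn't fit.
    while queue1 and queue1[0][1] <= free:
        name, cost = queue1.pop(0)
        free -= cost
        issued.append(name)
    # Queue 2 is independent, so its tasks can soak up the leftover resources.
    while queue2 and queue2[0][1] <= free:
        name, cost = queue2.pop(0)
        free -= cost
        issued.append(name)
    return issued

q1 = [("Task 1-1", 3), ("Task 1-2", 3), ("Task 1-3", 3), ("Task 1-4", 4)]
q2 = [("Task 2-1", 1)]
print(schedule(q1, q2))  # ['Task 1-1', 'Task 1-2', 'Task 1-3', 'Task 2-1'] -> fully saturated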
     
     
    The GPU will fill itself with Tasks 1-1, 1-2, and 1-3 like in the last example, and when it sees Task 1-4 can't fit, it looks in Queue 2 and sees Task 2-1 can fit. Now the GPU has reached saturation. Although I'm not sure if real GPUs issue work exactly like this, the overall takeaway is that the introduction of additional queues allows tasks to become independent and lets the GPU schedule its resources accordingly. So how do AMD and NVIDIA handle these queues?
     
    How AMD Handles Multiple Queues
    The earliest design that could take advantage of multiple queues was GCN, and with GCN it's fairly straightforward. What feeds the execution units in GCN are command processors. One of them is the graphics command processor, which only handles graphics tasks. The others are the Asynchronous Compute Engines (ACEs), which handle scheduling all other tasks. The only difference between the two is which parts of the GPU they can reserve for work.[5] The graphics command processor can reserve all execution units, including the fixed-function units meant for graphics. The ACEs can only reserve the shader units themselves. A typical application could have three queues: one for graphics work, one for compute work, and one for data requests. The graphics command processor obviously focuses on the graphics queue, while the ACEs focus on the compute and data request queues.
     
    My only issue with how AMD presented their material on asynchronous compute is the diagram they used to illustrate it:

     
    This works from a timeline viewpoint, but not from a resource usage viewpoint. To better understand it from a resource usage point of view, GCN and RDNA can be thought of like a CPU with support for multiple threads, although in this case the "threads" are wavefronts, or groups of threads. Each ACE can be thought of as a thread. Then you can draw it out like this:[6]

     
    How NVIDIA Handles Multiple Queues
    The way NVIDIA handles multiple queues isn't as straightforward. To start, let's go back to what could be called the last major revision of GPU design NVIDIA did, which would be the Kepler GPU. There were a few striking things NVIDIA did compared to the previous GPU, Fermi:
    Doubling the shader units
    Halving the shader clock speed
    Simplifying the scheduler
    The third point is where things get interesting. This is the image that NVIDIA provided:[7]

     
    What's probably most interesting is that Kepler uses a "software pre-decode," which sounds like software, that is the drivers, is now doing the scheduling. But that's not what happened. What happened is that NVIDIA found that operations take a predictable amount of time to complete[7]:
     
    ... then they removed the part in hardware that does all of the dependency checking. The scheduler is the same otherwise between Fermi and Kepler[7]:
    The so-called GigaThread engine still exists, even in Turing[8]
     
    However, this doesn't exactly answer the question "does NVIDIA support multiple command queues?", which is the basic requirement for asynchronous compute. So why bring it up? To establish that NVIDIA's GPUs still schedule work in hardware, contrary to the popular belief that they schedule in software and therefore cannot support asynchronous compute. It's just that the work that comes in is streamlined by the drivers to make the scheduler's job easier. Not that it would matter anyway, since the basic requirement to support asynchronous compute is to read from multiple command queues.
     
    So what was the issue? Kepler doesn't actually support simultaneous compute and graphics execution. It can do one or the other, but not both at the same time. It wasn't until 2nd-generation Maxwell that NVIDIA added the ability for the hardware to run both graphics and compute at the same time.[9] So yes, technically Kepler and 1st-generation Maxwell do not benefit from multiple command queues, as they cannot run compute and graphics tasks simultaneously.
     
    Some other things of note
     
    The biggest takeaway from introducing multiple command queues is this, from Futuremark:[10]
     
    But ultimately, because this debate often has AMD proponents claiming NVIDIA can't do asynchronous compute, AMD themselves said this (emphasis added):[11]
     
    Thus if one of the huge promoters of asynchronous compute is saying this is the basic requirement, then it doesn't matter what the GPU has as long as it meets this requirement.
     
    Sources, references, and other reading
    [1] https://docs.microsoft.com/en-us/windows/win32/direct3d12/design-philosophy-of-command-queues-and-command-lists
    [2] https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch02s02.html
    [3] https://www.khronos.org/opengl/wiki/Rendering_Pipeline_Overview
    [4] http://developer.download.nvidia.com/assets/gamedev/files/sdk/11/SSAO11.pdf
    [5] https://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/5
    [6] https://commons.wikimedia.org/wiki/File:Hyper-threaded_CPU.png
    [7] https://www.nvidia.com/content/PDF/product-specifications/GeForce_GTX_680_Whitepaper_FINAL.pdf
    [8] https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
    [9] https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/9
    [10] https://benchmarks.ul.com/news/a-closer-look-at-asynchronous-compute-in-3dmark-time-spy
    [11] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Asynchronous-Shaders-White-Paper-FINAL.pdf
  7. Mira Yurizaki
    On this entry, GeForce RTX! And not just one thing but the whole thing, or at least as much as I can possibly remember. As with the Discussing Myths series, it's about taking assertions or claims that people have made and poking at them, presenting an argument against the assertion or claim.
     
    RTX is only about ray tracing
    Perhaps the "RT" in "RTX" makes it confusing, but RTX is a technology feature set that encompasses what Turing can do. In addition to hardware accelerated ray tracing, it also encompasses DLSS.
     
    DXR was something only NVIDIA collaborated with Microsoft to create
    AMD contributed as well, stating even they were "working closely with Microsoft": https://hexus.net/tech/news/graphics/116354-amd-nvidia-working-closely-microsoft-dxr-api/
     
    Ray tracing hardware takes up a lot of space on the die
    The first thing to note is we don't really have anything resembling a die shot of Turing other than this:

     
    Note that this is from NVIDIA themselves. But if we were to try to map it to the block diagram:
     
    And if we pretend that the die shot sort of resembles this, then figuring out where the RT core is, it's probably the part highlighted in red:

     
    That's not a whole lot of space. Some quick and dirty math comparing the area of the red highlighted region to the entire image comes out to about 3% of the space.
     
    BUT, the thing to remember is that nobody has really verified whether the die shot NVIDIA provided is the actual Turing die.
     
    RTX is pointless because ray tracing can be done "in software"
    By "in software", this means that dedicated hardware for ray tracing isn't necessary. But of course ray tracing can be done without hardware acceleration. Anything can be done without hardware acceleration. But then it's missing the point: "acceleration." RTX can do ray tracing faster. Though there is one caveat to this: the application has to explicitly give the "hint" that it wants to use hardware acceleration. Otherwise to the GPU, it looks like some other generic workload.
     
    So far I've yet to see a direct comparison of the same application using ray tracing that targets both hardware-accelerated RT and software RT. From what I can find, it's usually one or the other. What's most damning is that AMD, for whatever reason, won't even enable the DXR fallback layer on their cards, even though there's nothing preventing those cards from running it.
     
    As an aside, one of the foundations of modern GPUs today, a feature called hardware transform and lighting (hardware T&L), met with similar resistance. 3dfx infamously said that hardware T&L was not necessary as long as you had a fast enough CPU. And for the most part, at least for games at the time, they were right:
     
    However once games started taking advantage of hardware T&L, 3dfx's offerings were no longer looking all that great: https://www.anandtech.com/show/580/6
     
    DLSS is a super sampler/image quality improver
    To poke at the first point, it's not a super sampler. Tom's Hardware did an analysis of what's going on behind the scenes and found that DLSS renders internally at a lower resolution, then upscales it. That's not what super sampling is, which is taking more samples than needed for a given output pixel. The only "super sampling" part about it is that the so-called "ground truth" images used to train the AI are super sampled.
     
    However, to call it an image quality improver is also, I would argue, not correct. Something that attempts to improve image quality, I would argue, keeps the output at the same resolution as (or lower than) the original source image. Since DLSS is an upscaler, it fails to meet that criterion.
     
  8. Mira Yurizaki
    A list of guides I posted somewhere on the site, just in case I post more than the 10 URL limit for profiles (plus that'd get wild anyway)
     
     
    A guide to how to identify if you have a CPU bottleneck and see how much it can affect you.
     
    An explanation on HyperThreading.
     
    It also answers the question "Why is it bad to have no page file?"
    Not really a guide, but might be helpful
     
     
    Not something I wrote, but I think it's useful to share in this post:
  9. Mira Yurizaki
    This is one I often see and often question what's really going on with: the idea that Task Manager is halfway broken, that what it reports is inaccurate and shouldn't be trusted, and that you should trust some other program instead! The reason this assertion puzzles me is that it's a system tool Microsoft readily makes available. I'm also pretty certain that companies, especially major hardware manufacturers, make sure Task Manager reports at least something resembling what they intend, because otherwise they'll be the ones getting the call of "why doesn't this thing work!?"
     
    So here are a few things I've found people commonly ask about that make them think Task Manager is broken in some way.
     
    Task Manager reports memory usage is really high, but when I add it up, it isn't anywhere near that
    People often go to this page to see what their memory usage is:

     
    And then they try to do something like adding up the values in either the Processes or Details tab. Then they find that whatever number they come up with doesn't match the number on this page. The first thing to know is that this page does show exactly what you expect: how much memory is in use. The problem is that the information needed to get to this number is hidden by default or isn't intuitive.
     
    For example, whenever someone reports that their memory usage is high but they don't have anything open, I ask them if their "non-paged pool" is high. Normally this value doesn't get very high, but sometimes leaky drivers will start eating into it and inflate it. In one instance I found VirtualBox also adds to this value. For some insight into what this is, the non-paged pool is memory that must not be paged out, i.e., put in the pagefile on your storage drive. Typically system processes and drivers use this space.
     
    But otherwise, another thing people try to do is add up the values in the Processes tab:

     
    The same thing happens: they don't get a number anywhere near what "In use" on the Memory page reports. This value is the app's so-called private working set, the data in memory that is accessible only to that app. In addition to this space, apps have a shareable memory region that can be used by other apps. This can be shared libraries or data they're willing to let some other app use. To get a value that's closer to the "In use" value, you have to look at the working set, which can be found in the Details tab:

     
    Note that you have to show it first by right-clicking on a column header, selecting "Show Columns", then checking "Working Set". I've also included "Memory (shared working set)". Adding up the numbers in the "Working Set" column will get you a number that's more in line with what's in use.
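    If you'd rather script the same arithmetic, here's a sketch using the third-party psutil package: sum each process's working set (psutil exposes it as rss on Windows) and compare it to the "In Use" figure. Since shared pages get counted in every working set they appear in, expect the sum to come out a bit high:

import psutil

total_working_set = 0
for proc in psutil.process_iter():
    try:
        total_working_set += proc.memory_info().rss   # rss == working set on Windows
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass  # processes can exit or deny access while we iterate

in_use = psutil.virtual_memory().used
print(f"Sum of working sets: {total_working_set / 2**30:.2f} GiB")
print(f"'In Use' reported:   {in_use / 2**30:.2f} GiB")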
     
    Task Manager doesn't show GPU usage even though I'm running a game
    I covered this already in a topic:
    The short of it is, Task Manager only shows four graphs by default, but there is actually a plethora of categories to choose from. If you click on the header for a graph, you can change what it shows. It's very likely the game is using something on the GPU that shows up under another graph.
     
    Task Manager can't report CPU speed properly
    Another common complaint I see is that Task Manager can't report the CPU speed properly. This is a little vague, since there are two values Task Manager reports: "Base speed" and "Speed," which is the current speed. Base speed is determined by the OS using the processor's CPUID information. The OS likely has a database of information regarding the processors it supports, so it can map CPUID information to specs, or the CPUID may contain enough information on its own to make that determination. As an example, Intel has a table of which CPUID values map to which processor family: https://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers. Also see ways to determine clock speed in Windows XP/2003.
     
    "Speed" itself I'm not sure where it's getting from for certain, but it's likely taking the fastest clocked core. However, where that data is being determined is up for grabs. A very common way to do frequency counting is to peg the processor to a high performance state for a block of time while monitoring a counter on it. Then when the block of time is up, you can estimate the clock speed based on the difference between the value the counter started at an the value you end up with, divided by the time the block lasted for. Though this can be inaccurate if not handled carefully. Supposedly this is how CPUID's HWMonitor does things (https://arstechnica.com/civis/viewtopic.php?f=8&amp;t=1370429).
     
    Otherwise you can make a guess based on other things in the system, like monitoring the multiplier and using that against the base clock. But if those things aren't straightforward, or the manufacturer hasn't provided a specific formula to report the clock speed correctly, then it's not really the fault of the reporting software per se.
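    For what it's worth, here's a sketch of simply reading what the OS itself reports, via the third-party psutil package. Depending on the platform, "current" may track the live clock or may just mirror the base speed, which is part of why different tools disagree with each other:

import psutil

freq = psutil.cpu_freq()  # may be None on platforms where this isn't exposed
if freq:
    print(f"current: {freq.current:.0f} MHz, min: {freq.min:.0f} MHz, max: {freq.max:.0f} MHz")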
     
    Disk Usage is high, but read/write speeds are low!
    This value is how much time the disk spent servicing some request, i.e., how much time it was not idle. If we take a look at this screenshot, we can see the utilization is based on the "Active time" value:

     
    But if we go to Resource Manager and take a look at it, we see a tool tip that states "Percentage of time the disk was not idle" for Active Time:

     
    (This is about as much as I can think of, but feel free to present other things that seem off)
  10. Mira Yurizaki
    Since I seem to talk about assertions that people throw around left and right, I thought I'd start (yet) another blog series, Discussing Myths, where I take an assertion and explain why it's a myth, or at the very least shed more light on it so people know what's going on beyond the surface.
     
    So for an introduction post, I figure a shotgun approach to various assertions with shorter responses is in order. If you have more you want me to look at, feel free to drop a reply!
     
    Hardware utilization is how much the hardware is actually being used
    Hardware utilization is a statistic that often appears in system monitors and the like, showing a percentage of how much some piece of hardware, like the CPU or GPU, is being used. And "being used" is usually interpreted as something like "how much of the resources in the hardware are being used," or something similarly vague. Like if a task is only using the FPU on a CPU core, maybe that counts as 50%.
     
    Utilization usually isn't that intuitive. For hardware that executes something, like CPUs and GPUs, utilization actually measures the inverse: "how long did this hardware spend not doing anything at all?" The OS usually has an idle task that gets run whenever some other app or task doesn't need to run. It's likely this way for historical reasons.
     
    Computer systems are designed such that apps think they own the computer once they run. Of course, that's not really the case but it makes it easy for application development so they don't have to worry about the specifics. After all, applications don't know what they're running on. This is a carry-over from the early days of computing, where mainframes would use time-sharing to divvy up when a user or application can run on the system.
     
    Making said monitors able to figure out how much of the hardware is actually being used would be pretty invasive. I wrote more about it over at
     
    You can remove the page file without issue
    There may be an issue depending on how much RAM you have and how much memory applications want. When an application makes a request for memory, the OS will reserve some space but not actually turn over that space to the application until the application actually uses it. This reservation is called a memory commit. Windows, and likely other OSes, will reserve more space than the app requested so that the next time the app wants more memory, there's a contiguous block of it ready.
     
    What makes removing the page file a problem is that it reduces the virtual memory address space. The amount of memory space the OS has to work with is the amount of physical RAM plus the page file. So if you have 8GB of RAM and another 8GB of page file space, the total virtual memory space is 16GB; removing the page file reduces this to 8GB. Since committed memory counts against the virtual memory space even if it isn't actually in use, you can have a situation where the commit charge fills up what physical RAM can back. Then when an application wants more memory, the system may complain it ran out of memory even though memory usage isn't at 100%.
     
    You need to uninstall drivers to update them (you also need to use DDU to uninstall them)
    I feel this one propagated because someone had a problem updating, found that uninstalling the drivers, either using the uninstaller or DDU (a tool that scrubs all traces of said driver from the computer), fixed it. However, I think this is only necessary if you have a problem after updating drivers. It shouldn't be something you do all the time. In fact, the author of DDU specifically stated:
     
     
    From what I've seen with updating drivers, the older one will either be uninstalled as part of the install process or be overwritten outright. As the driver's filename doesn't change from version to version, a simple overwrite will be enough to update it.
     
    You always need to update drivers
    You don't. First figure out what the update is bringing. If it's a security update or it adds something you really could use, then yes, you should update. If not, leave your drivers alone. If your system was fine before the driver update, it's not going to be miles better after.
     
    This is especially the case with video card drivers. While there are performance improvements, most times they only affect specific games. So the same thing applies here: unless there's a specific game it lists as having a performance update, there's a feature you want to try out, or there's a security update, you don't have to update your drivers.
     
    On another note, this also applies to You always need to update firmware/BIOS/UEFI
     
     
  11. Mira Yurizaki
    I've been stewing on this for a while (and I kind of didn't want to stomp on @Arika S's) but I figured... why not go for it? So here's the AMA if you want to ask me a question, any question! Yes you can ask anything and I will answer. But just so you're aware of the "rules" about this:
    The answer that I give may not be the answer you want.
    Until it gets added to this post, I'll accept the question. But once the question is added to this post, I will ignore future repeats of that question.
    If you want to ask me something, you have the following options:
    Via PM
    Commenting directly on this blog post
    Happy asking away!
  12. Mira Yurizaki
    The title might be a little strange to anyone who's even remotely familiar with performance bottlenecks. But rather than explain things at a higher level, where all of the CPU and GPU usage comparisons are done, this explains them at a lower level. That is, not only what is going on, but why it happens.
    How Do Performance Bottlenecks Work?
    To understand how performance bottlenecks work, particularly in games, it's important to understand the general flow of a game from a programming standpoint. Taken in its simplest form, the steps to running a game are:
    Process game logic
    Render image
    Of course, we can expand this out to be more detailed:
    Process inputs
    Update the game's state (like weather or something)
    Process AI
    Process physics
    Process audio
    Render image
    The notable thing is that rendering the image is one of the last steps. Why? Because the output image represents the current state of the game world; it doesn't make much sense to display an older state of the game world. However, to get to the point where the game can render the image, the CPU needs to process the previous steps (though physics may be offloaded to the GPU). This means that if this portion of processing takes too much time, it limits the maximum number of frames that can be rendered in a second. For example, if these steps take 2ms to complete, then the expected maximum frame rate is 500 FPS. But if these steps take 30ms to complete, then the expected maximum frame rate is about 33 FPS.
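    A stripped-down sketch of that loop (the class and method names are placeholders for this post, not any particular engine's API):

class Game:
    frames_to_run = 3
    def process_inputs(self):     pass
    def update_world_state(self): pass  # weather, scripted events, etc.
    def process_ai(self):         pass
    def process_physics(self):    pass  # may be offloaded to the GPU in practice
    def process_audio(self):      pass
    def render(self):             print("frame rendered")

def game_loop(game):
    for _ in range(game.frames_to_run):
        game.process_inputs()
        game.update_world_state()
        game.process_ai()
        game.process_physics()
        game.process_audio()
        game.render()  # rendering only happens after everything above finishes

game_loop(Game())

# If everything before render() takes 2 ms, the cap is 1000 / 2 = 500 FPS;
# if it takes 30 ms, the cap is 1000 / 30, roughly 33 FPS.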
     
    The Issue of Game Flow: Game time vs. Real time
    If a developer plans on having a game run on multiple systems, there's a fundamental problem: how do you make sure the game appears to run at the same speed no matter the hardware? That is, how do you make game time match real time regardless of hardware? If you design a game such that each loop step is 10ms of real time, then you need to make sure the hardware runs the loop in 10ms or less, otherwise game time will creep away from real time. Likewise, if the hardware can process the loop in less than 10ms, you need to make sure the processor doesn't immediately work on the next state of the game world, otherwise game time will run faster than real time.
    To do that, developers find ways of syncing the game so it matches up with real time.
     
    Unrestricted Processing (i.e., no syncing)
    This runs the game without pausing to sync up to real time. While it's simple, this means if the game isn't running on a system it was designed for, game time will never match up to real time. Early DOS games used this method.

     
    This chart shows a comparison against an "ideal timeline" where the designer wanted 1 frame to be 100ms of real time. The next timeline is the game run on a faster system, which completes intervals faster. This results in more frames being pushed out, and now game time is faster than real time. That is, in the unrestricted timeline, 1.7 seconds of game time has passed but is being squeezed into 1 second of real time. The result is that the game runs faster than real time.
     
    Fixed Interval Processing, With an Image Rendered Every Tick
    The loop is run at a fixed interval. If the CPU is done early, the CPU idles for the rest of the interval. However, if the CPU takes too long, processing spills into the next interval to be completed, and then it idles. Note that the image is not rendered until the CPU is done with its work. If the CPU is late, the GPU simply displays the last frame it rendered.

    In this chart, we have a scenario where the CPU took too long to process game logic and so it spills into the next interval. If a frame is meant to represent 100ms of game time, this scenario completed 8 frames, resulting in a game time of 0.8s over a real-time period of 1s. The result is the game runs slower in real-time. Note: this is not how V-Sync works. V-Sync is a forced synchronization on the GPU. That is, the GPU will render the frame anyway, but will wait until the display is done showing the previous frame before presenting it.
     
    Chances are for 8-bit and 16-bit systems, if it isn't using unrestricted time syncing, it's using this. A convenient source of a time interval is the screen's refresh rate. Modern game consoles and other fixed-configuration hardware may also still use this because it's still easy to implement. If such a game gets ported to the PC and its time syncing wasn't updated, this can cause issues if a 60FPS patch is applied.
     
    Here's a video showing how the SNES used this method of syncing:
     
    Variable Intervals
    Instead of demanding a fixed interval, why not have an interval that's determined by the performance of the CPU? While this can guarantee that game time matches real time, it presents a problem: now the game's physics isn't consistent. The thing with physics is that a lot of formulas examine a change over time. For example, velocity is change in position over time. This means that if you have two different intervals where things are updated, you can get two different outcomes.


     
    Say for example we have an object traveling at 10m/s. If we have two intervals, one 100ms (or 1m per tick) and the other 50ms (or 0.5m per tick), the object will be in the same place at any given time as long as nothing happens to it. But let's say the object is about to impact a wall, and the collision detection assumes that if the object either touches or is "inside" the wall by the next interval, it's a collision. Depending on how far the object is from the wall and how thick the wall is, the object in the longer-interval game may appear to have traveled right through the wall because of where it ends up on the next interval.
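    Here's a toy version of that wall example with made-up numbers (object starting at x = 0, moving at 10 m/s, wall spanning x = 1.45 m to 1.55 m):

def simulate(dt, speed=10.0, wall=(1.45, 1.55), duration=0.3):
    x, t = 0.0, 0.0
    while t < duration - 1e-9:
        x += speed * dt           # advance the object by one tick's worth of travel
        t += dt
        if wall[0] <= x <= wall[1]:
            return f"dt={dt}: collision at x={x:.2f}"
    return f"dt={dt}: no collision, ended at x={x:.2f} (stepped right over the wall)"

print(simulate(0.05))  # 0.5 m per tick: the 1.50 m step lands inside the wall -> collision
print(simulate(0.10))  # 1.0 m per tick: steps land at 1.00 m and 2.00 m -> tunnels through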
     
    Another issue is that because physics is calculated using floating point numbers, their inherent errors compound with more calculations. This means the faster-interval game may come to a different result because it's doing more calculations and accumulating more error.
     
    Essentially, the physics and interaction of the game are no longer predictable. This has obvious issues in multiplayer, though it can also change how single player game mechanics work.
     
    Fixed Intervals, but Drop Frames if the CPU Needs to Catch Up
    The game is run at a fixed interval, but instead of requiring the GPU to render an image after every game tick, if the CPU is late, it doesn't tell the GPU to render a frame and instead uses the free time to catch up. Once the CPU is caught up, the GPU is allowed to render the next image. The idea is that the CPU should be able to catch up at some point because the load varies during the game, and that load isn't keeping the game permanently in the "CPU needs to catch up" phase. This allows systems of varying performance to run the game while keeping up with real time.

    Modern game engines use this method since it allows for stability and determinism while still allowing flexibility in rendering times. For the most part, the interval comes from a timer, so that when it expires, servicing it becomes a high priority.
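    A rough sketch of that kind of loop (close to the "fixed update time step, variable rendering" pattern in the Game Programming Patterns link under Further Reading; the Game class and render function here are stand-ins for this post):

import time

TICK = 1.0 / 60.0  # each game tick represents 1/60 s of game time

class Game:
    def __init__(self, ticks_to_run=30):
        self.ticks_left = ticks_to_run
    @property
    def running(self):
        return self.ticks_left > 0
    def update(self, dt):
        self.ticks_left -= 1  # stand-in for processing one tick of game logic

def render(game):
    pass  # stand-in for issuing draw commands to the GPU

def run(game):
    previous = time.perf_counter()
    lag = 0.0
    while game.running:
        now = time.perf_counter()
        lag += now - previous      # real time that still needs to be simulated
        previous = now
        while lag >= TICK and game.running:
            game.update(TICK)      # catch up tick by tick, without rendering
            lag -= TICK
        render(game)               # render only once the simulation has caught up

run(Game())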
     
    Looking at a few scenarios: What happens when the GPU can complete its task faster or slower relative to when the CPU does?
    The following will be looking at a few scenarios on when game-time ticks are processed and when frames are rendered. One thing to keep in mind is that a rendering queue is used so that if the CPU does take a while, the GPU can at least render something until the CPU is less busy. You might know this option as "render ahead" or similar. With a longer render queue, rendering can be smooth at the expense of latency. With a shorter queue, latency is much shorter but the game may stutter if the CPU can't keep up.
     
    With that in mind, the charts used have the following characteristics:
    Processing a game tick or rendering a frame is represented by a color. The colors are supposed to match up with each other. If the CPU or GPU is not doing anything, this time is represented by a gray area.
    Assume the render queue can hold 3 commands.
    For the queue representation, frames enter from the right and go to the left. They exit to the left as well:

    The render queue will be shown at each interval, rather than when the CPU finishes up processing a game tick.
    CPU can complete its work within an interval, GPU is slower
    This is a case where the CPU can complete its work without spilling over into the next interval, but the GPU takes much longer to generate a frame.

     
     

    Note that 10 game ticks were generated, but only 5 frames were rendered. The queue also filled up at the end, and since the GPU couldn't get to the next frame, the queue had to drop one. In this case, the second purple frame was queued up but had to be dropped at the end since the GPU could not get to it fast enough.
     
    This is also why the GPU cannot bottleneck the CPU: the CPU still processes subsequent game ticks without waiting for the GPU to be done. However, if fixed interval processing with an image rendered every tick is used, then the GPU can bottleneck the CPU. But since most PC games don't use that method, we can assume it's not an issue.
     
    CPU can complete its work within an interval, GPU is faster
    In this case, the GPU can easily render a frame, so the GPU is ready to render a new frame as soon as the current game tick is processed.

    (Note: technically the queue would be empty the entire time)
     
    This is not a CPU bottleneck condition however, as the game is designed around a fixed interval. Some game developers may design their game loop so it runs at a much faster interval than 60Hz so that high-end GPUs don't have idle time like this. But if the GPU can keep up with this and the interval is a lower frequency, then performance can be smooth or stuttering, depending on the CPU processing times.
     
    Some games may allow the CPU to generate GPU commands to render “in-between” frames and use the time between the render command and the last game tick to represent how much to move objects. Note that these extra frames are superfluous, meaning your actions during them have no impact on the state of the game itself until the next game-tick.
     

    CPU cannot complete its work within an interval, GPU is faster
    In this scenario, the CPU has issues keeping up with processing the game logic, so it drops the frame in favor of catching up. This has the effect of not queuing up a frame in the first place and the GPU is stuck repeating the last frame it rendered. However, when the CPU catches up and queues up a frame, it's representing the last game tick.

     

    (Note: technically the queue would be empty the entire time)
    In this case, because the first green frame took too long, it doesn't get queued and so the GPU continues to show the yellow frame. The CPU catches up on the first red frame which the GPU will render. A similar thing happens on the game tick of the second yellow frame.
     
    Out of all the scenarios, this one is probably the least favorable. Notice how frames can clump up with longer pauses between them. This causes the feeling of the game stuttering.
     
    CPU cannot complete its work within an interval, GPU is slower
    This is a case where the CPU has trouble completing every tick within an interval and the GPU has issues rendering a frame within an interval as well:


     
    This scenario, depending on how fast the frame rate actually is, may not be as bad as it looks, as it spreads the frames out over time.
     
    Further Reading
    For further reading, I picked up most of this information from:
    https://bell0bytes.eu/the-game-loop/
    https://gameprogrammingpatterns.com/game-loop.html
    https://docs.microsoft.com/en-us/windows/desktop/api/dxgi/nf-dxgi-idxgidevice1-setmaximumframelatency
     
    https://www.reddit.com/r/nvidia/comments/821n66/maximum_prerendered_frames_what_to_set_it_to/?depth=2
  13. Mira Yurizaki
    This is a frequently asked question from people who are curious about programming. So here's the short answer: it depends.
     
    In my experience with various programming languages such as assembly, C, C++, C#, Java, JavaScript, Python, TI BASIC, Visual Basic, Ruby, Bash scripting, Windows batch scripting, and even Brainfuck (though this was a curiosity), the language itself doesn't really matter. Over time you learn a lot of things that carry over to other languages, and you find that most languages share the same basic characteristics. There are other characteristics that can help aid in making applications, but there's nothing that, without anything else taken into consideration, makes one "better" than the other. A programming language is what I'd call an implementation detail. Meaning, it doesn't matter what language you use, you can probably realize what you want.
     
    But you've shown an interest in programming, and obviously you need a programming language to start this journey of learning how to code! So for the sake of putting down a language, what should you learn? Well to ask another question: what are you interested in doing? This will help narrow down what you should focus on because certain categories of applications prefer one language over another for arbitrary reasons. For example, want to get into web app development? Start learning HTML, CSS, and JavaScript. You may not have to use the last two, but it certainly will help. Want to get into Android app programming? Start with Java. iOS app programming? Swift. Windows app programming? C#. Don't know? Just pick a language and go from there.
     
    However, if you're fresh to programming, I would argue not to care so much about the nuances of the language. I'd argue that any language worth its salt will allow you to do the following:
    Create symbols (or names) to represent data
    Freely operate on that data using basic math and bit-wise operations
    Allow for conditional control, using things like (or similar to) if-statements and loops
    Allow for controlling where the program can go, using things like function calls or jumps
    And many widely used programming languages have these features.
     
    Okay, maybe you're still wracked with decision paralysis. If you were to ask me which one to use to start off your journey into the world of programming, and I'm sure I'll draw the ire of a few people, I would have to go with Python. Why? A couple of reasons.
     
    The first is the tool chain, so to speak. I don't believe the journey to programming should start off with the entire process of building the program. It's nice to know, but anything that gets in the way of the person jumping right into coding adds resistance. They likely just want to code and see results right away which is encouraging and can build even more curiosity. While you can select another language that has an IDE, those can be quite intimidating to tread through. You could argue "if they get into programming, they should be using industry standard tools to leverage the experience if they want to make this into a job", okay. But that's like telling a kid who's interested in cinematography to start with a $30,000 RED camera so they can get experience on the industry standard, not their smartphone because "who makes serious professional films using a smart phone?"
     
    I digress though. So what makes Python's tool chain great for beginners? To start, it has an interpreter. This makes it much quicker to jump into programming than say using C. If all you want to do is print the standard "Hello World!", this is literally what you have to do in Python:
- Open a command line
- Type in python to start the interpreter
- Type in print("Hello world!") at the prompt
Doing the same thing in C would easily take twice as many steps. Whether you think so or not, this can be intimidating for some people. And if you go "well, if this intimidates them, then they shouldn't be programming," well, going back to the cinematography example: if buying a $30,000 RED camera is intimidating, or even getting something like a beefy computer with one of the widely used video editing suites, should they stop pursuing their dreams?
     
    And when you're ready to move onto making Python files, you don't need to do anything different. It's just invoking python and the file in question.
     
Secondly, Python's multi-paradigm flexibility allows you to adjust how you want to code. You can start off procedural, which I'd argue is quite intuitive. If you want to do object oriented programming (OOP), Python supports that. You can group functionality into modules. There's no memory management to think about. Data types, while important to know, don't have to be explicitly declared. When I started working with Python, I was surprised how easy it was to work with and get something done.
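To make that concrete, here's a tiny sketch (the function and class names are made up purely for illustration) of the same idea written procedurally and as a class; Python happily accepts either style:

# Procedural style: just a function operating on plain data.
def area_of_rectangle(width, height):
    return width * height

# Object-oriented style: the same idea wrapped in a class.
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height

print(area_of_rectangle(3, 4))   # 12
print(Rectangle(3, 4).area())    # 12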
     
However, Python isn't a perfect language. No language is. It has its downsides as well:
- Python is an interpreted language, so if performance is your goal, Python isn't for you. However, I'd argue that while the pursuit of performance is fine, prove your app works before making performance your goal.
- While Python doesn't require you to explicitly say what type a variable is (known as dynamic typing), this can cause some trip-ups if you're not careful at best, and at worst you may not even know what type of data a variable is supposed to hold (though if you need to, you can just have Python spit out the variable's type). And since Python doesn't check data types until the script is running, you may run into issues where you try to do something with two incompatible types, like adding a number to a string, and the program throws an error and stops as a result (see the sketch below).
- The way Python handles certain OOP concepts is not intuitive, but I'd argue you shouldn't be touching OOP until you've done some reasonably complex apps.
But to get the basics down, Python offers a fairly low bar of entry. And once you have the basics down, you can move on to more advanced topics and other languages.
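As a quick illustration of that dynamic-typing trip-up, here's a minimal sketch (the variable names are made up) of the kind of error you only discover at runtime, and how to ask Python what a variable actually holds:

count = 5
label = "items"

# Adding a number to a string raises an error, but only when this line runs.
try:
    total = count + label
except TypeError as err:
    print("Oops:", err)   # e.g. unsupported operand type(s) for +: 'int' and 'str'

# If you're unsure what a variable holds, ask Python directly.
print(type(count))   # <class 'int'>
print(type(label))   # <class 'str'>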
     
At the end of the day though: the language doesn't matter; what's important is to know the basics that programming in general requires. However, if you have a goal in mind of what applications you want to make, it might be better to start learning the languages used in that field first.
  14. Mira Yurizaki
    There was one bit I should add from this last blog:
Once you've gathered up the data, how do you use it? The biggest trouble with the way PerfMon gathers GPU data is that, while it can do per-process data gathering, it doesn't actually capture the process you're interested in unless it's already running. While that's fine for observing without logging, creating a Data Collector Set that captures the process ahead of time is impossible. PerfMon uses Process IDs, or PIDs, and the game's PID will change every time it's run. So in PerfMon you're forced to capture data from every process using the GPU at the start of the capture.
     
This presents a problem, because what you end up getting is something that looks like this:

     
Only some of this data is actually useful, so we have to figure out which parts aren't. However, in this case there's a semi-obvious choice: the solid green line going up and down during the benchmark. If you hover over it with the mouse, it tells you what it is:

     
This narrows down the PID of the game and tells you how to filter the results. In this case, the PID is 3892. If we filter the "Instance" column for 3892, we get:

     
And you can double-check that it's the game by checking GPU memory usage. Now remove every other "Instance" that isn't for PID 3892 to clean up the data. Once you're done with that, you can right click on the graph, select "Save Data As...", select "Text File (Comma Delimited)" as the file type, and save. Now you can use your favorite spreadsheet application to process this data.
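If you'd rather not prune columns by hand in a spreadsheet, a small script can do the same filtering. This is only a sketch, assuming the exported CSV's column headers contain the PID (the file names here are made up); adjust it to whatever your capture actually looks like:

import csv

PID = "pid_3892"   # the PID from the example above; yours will differ every run

with open("gpu_capture.csv", newline="") as f:   # example input file name
    rows = list(csv.reader(f))

header = rows[0]
# Keep the timestamp column (index 0) plus any counter whose header mentions
# the PID we care about. This assumes the instance names embed "pid_NNNN",
# which is how the GPU Engine counters appeared on my system.
keep = [0] + [i for i, name in enumerate(header) if PID in name]

with open("gpu_capture_filtered.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([row[i] for i in keep])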
  15. Mira Yurizaki
    A bonus update to the Software Enigma Machine I did way back when. This time, I went back and refactored some code because I felt it needed it.
     
    The Outline
Part 1: What is the Enigma Machine? Why did I choose to implement the Enigma machine? Before Programming Begins: Understanding the theory of the Enigma Machine; Finding a programming environment
Part 2: Programming the features: Rotors, Rotor housing
Part 3: Plug Board
Part 4: GUI
If you'd like to look at the source code, it's uploaded on GitHub.
     
    What needed refactoring?
One of the things I did originally was place each individual element on the Full version of the GUI, which, to recap, looked like this:

     
The problem was that placing everything down and updating all of the properties took forever. But that isn't the only way to add GUI elements to a Windows Form; you can add them programmatically! So I went about figuring out how to do this, and after a few minutes of experimentation, I came up with this GUI in the designer:
     

     
    The idea here is that I have "anchor points" for the lights, keys, and plug board. Based on these anchor points I can then build the rest of the elements as needed. Picking on the "lights", the old way of doing things was this snippet of code:
private void setupLampLabels()
{
    lampLabels = new List<Label>();
    lampLabels.Add(ALightLabel);
    lampLabels.Add(BLightLabel);
    lampLabels.Add(CLightLabel);
    lampLabels.Add(DLightLabel);
    lampLabels.Add(ELightLabel);
    lampLabels.Add(FLightLabel);
    lampLabels.Add(GLightLabel);
    lampLabels.Add(HLightLabel);
    lampLabels.Add(ILightLabel);
    lampLabels.Add(JLightLabel);
    lampLabels.Add(KLightLabel);
    lampLabels.Add(LLightLabel);
    lampLabels.Add(MLightLabel);
    lampLabels.Add(NLightLabel);
    lampLabels.Add(OLightLabel);
    lampLabels.Add(PLightLabel);
    lampLabels.Add(QLightLabel);
    lampLabels.Add(RLightLabel);
    lampLabels.Add(SLightLabel);
    lampLabels.Add(TLightLabel);
    lampLabels.Add(ULightLabel);
    lampLabels.Add(VLightLabel);
    lampLabels.Add(WLightLabel);
    lampLabels.Add(XLightLabel);
    lampLabels.Add(YLightLabel);
    lampLabels.Add(ZLightLabel);
}
Ick. All this was doing was adding each object to the lampLabels list. By switching it to a programmatically generated one, the code now looks like this:
static string[] QWERTY_LINE = {"QWERTYUIO", "ASDFGHJK", "ZXCVBNMLP"};

private void setupLightLabels()
{
    int row = 0;
    lampLabels = new List<Label>();
    setupLight(QLightLabel, row++);
    setupLight(ALightLabel, row++);
    setupLight(ZLightLabel, row++);
    lastLight = QLightLabel;
    lampLabels.Sort((x, y) => String.Compare(x.Text, y.Text));
}

private void setupLight(Label BaseLabel, int Row)
{
    const int LABEL_SPACING = 36;
    lampLabels.Add(BaseLabel);
    for (int Col = 1; Col < QWERTY_LINE[Row].Length; Col++)
    {
        Label lampLabel = new Label();
        lampLabel.Font = BaseLabel.Font;
        lampLabel.TextAlign = ContentAlignment.MiddleCenter;
        lampLabel.BackColor = Color.Gray;
        lampLabel.Parent = fullMethodTabPage;
        lampLabel.Location = new Point(BaseLabel.Location.X + (Col * LABEL_SPACING), BaseLabel.Location.Y);
        lampLabel.Size = new Size(30, 30);
        lampLabel.Text = QWERTY_LINE[Row][Col].ToString();
        lampLabel.Name = string.Format("{0}LightLabel", QWERTY_LINE[Row][Col]);
        lampLabels.Add(lampLabel);
    }
}
What this does is build a row of lights based on an anchor point. In this case, it builds off the Q, A, and Z lights. This makes a few things much easier to do now:
- If I want to change how the letters are laid out, I can update the QWERTY_LINE array.
- If I want to change where and how the lights themselves are laid out, I only have to move their anchor points around; I don't have to move a bunch of GUI widgets.
- The spacing it generates is uniform, too.
This does come with the caveat that each string in the QWERTY_LINE array has to start with an anchor letter.
     
The second refactoring I did was to how the plug board shuffling works. Before, it was this confusing mess that, to be honest, I can't really explain anymore off the top of my head:
public void ShuffleWiring()
{
    int entries = mapping.Count;
    Dictionary<string, string>.KeyCollection keys = mapping.Keys;

    for (int i = 0; i < entries; i++)
    {
        int randEntry = rng.Next(entries + 1);
        int counter = 0;
        string firstLetter = "";
        string secondLetter = "";

        foreach (string randomKey in keys)
        {
            firstLetter = randomKey;
            counter++;
            if (counter == randEntry)
                break;
        }

        randEntry = (rng.Next() % entries);
        counter = 0;

        foreach (string randomKey in keys)
        {
            secondLetter = randomKey;
            counter++;
            if (counter == randEntry)
                break;
        }

        ChangeWiring(firstLetter, secondLetter);
    }
}
I found one key thing that made this method much shorter: you can take a dictionary's keys and turn them into an array if you use System.Linq. Now this method becomes:
public void ShuffleWiring()
{
    string[] keys = mapping.Keys.ToArray();

    for (int i = 0; i < keys.Length; i++)
    {
        string firstLetter = keys[i];
        string secondLetter = keys[rng.Next(keys.Length)];
        ChangeWiring(firstLetter, secondLetter);
    }
}
And it's much easier to understand.
     
I also did some other miscellaneous refactoring, such as folding utility methods that only had one caller into the method that called them. I also got rid of this method:
private void resetLights()
{
    foreach (Label lamp in lampLabels)
        lamp.BackColor = Color.Gray;
}
I didn't like this because it was referenced a few times, and each time it was, it looped through all 26 lights just to reset the background color. So instead of doing that, I keep a lastLight object that holds a reference to the last light that was touched. Then every call to this method was replaced with lastLight.BackColor = Color.Gray;
  16. Mira Yurizaki
    Since I've been doing some tests lately involving how applications use the video card, I thought I'd write down the process of gathering this data and presenting it. After all, any sufficiently "scientific" test should be repeatable by others, and being repeatable means knowing what to do!
     
    What data am I gathering?
    CPU Utilization
This is to see how the CPU is being used by the application. The higher the usage overall, the more likely it is to bottleneck the GPU. I may omit this altogether if I'm not interested in CPU utilization.

GPU engine usage
A "GPU engine" is something Microsoft calls a part of the GPU that handles a certain task. Which engines are available depends on the GPU manufacturer. The two I'm primarily interested in are the graphics and compute engines, because these two show how the execution portions of the GPU are being used by the application. This can only be used for informational purposes, i.e., there is no "lower/higher is better" value.

VRAM usage
Since Windows Vista, Microsoft has implemented virtual memory for video memory on a system level. This allows me to look at three elements: Committed VRAM (how much was requested to be reserved), Dedicated VRAM usage (how much is used on the video card itself), and Shared VRAM usage (which is VRAM usage in system memory). Like GPU engine usage, this can only be used for informational purposes.

Frame Time
This is the amount of time between frames. As long as VSync or frame limiting is not used, this should represent how long it took to render the frame. The inverse of this is what I call "instantaneous FPS": the time between the current and last frame normalized over a second. I call it "instantaneous" since true FPS would require counting all of the frames over a second.

What data am I not gathering?
    I'm not looking at temperatures, clock speeds, and fan speeds. These are aspects of hardware that don't reflect how the application is using it.
     
    What tools am I using?
    Performance Monitor (PerfMon)
PerfMon gathers CPU utilization, GPU engine usage, and VRAM usage. Other tools like GPU-Z and MSI Afterburner cannot gather this data, at least not the specific aspects I'm looking for. The other thing is that PerfMon can gather data per-application, meaning the data I gather is specifically from the application in question rather than from the system as a whole.

FRAPS
While FRAPS is old (the last update was in 2013) and the overlay no longer seems to work in DX12 applications, its benchmark functionality still works. This lets me gather data about frame times. Note that FRAPS counts a frame as when one of the display buffers flips. This poses a limitation when VSync is enabled but the application is not triple buffered, or when frame rate limiting is used.

How do I use these tools?
    PerfMon takes some setting up:
1. Open it by going to Control Panel -> All items -> Administrative Tools -> Performance Monitor. Open it as an Administrator, otherwise you won't be able to do the other steps.
2. Select "Data Collector Sets" in the left pane.
3. Right click "User Defined" in the right pane and select New -> Data Collector Set.
4. In the wizard that pops up, name the Data Collector Set and choose "Create manually (Advanced)".
5. On the next page, select "Create data logs" and check "Performance counter".
6. On the next page, click the "Add..." button, then select the following:
   - GPU Engine -> Utilization for All Instances
   - GPU Memory -> Committed, Dedicated, and Shared memory for All Instances
   - If doing CPU utilization, Processor -> "% Processor Time" for All Instances
7. The next page asks where you want to save these logs.
8. When you want to start the data collection, select the set you created and press the green triangle in the toolbar at the top. To stop collecting data, press the black square. Note: PerfMon gathers GPU data from the apps using the GPU that are running when the collection starts. If the app isn't running when you start collecting, it won't gather data for that app.
9. To open the log, go to where you said to save the data and double click on it.
10. The data collected for each app is keyed by process ID. Unless you figured the PID out ahead of time, the best way I've found is to plot all of the 3D or graphics engines and see which one's activity looks like the app's. Then I sort by name and remove the data from the other process IDs.
11. Once the data has been filtered, right click on the graph and select "Save Data".
12. Save it as a "Text File - Comma Separated Values (CSV)".
Once you have the data in CSV format, you should be able to manipulate it using spreadsheet apps like Microsoft Excel or Open/Libre Office Calc.
     
FRAPS requires pressing F11 (or whatever the benchmark hotkey is set to) to start, then pressing it again to stop. FRAPS saves the data as CSV files. The items of interest are the frame times and the MinMaxAvg data. Frame times do require additional work, as FRAPS records a timestamp in milliseconds from the start of the run rather than the time between frames.
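For what it's worth, turning those cumulative timestamps into per-frame times (and the "instantaneous FPS" described earlier) is just a matter of taking differences. A minimal sketch, assuming the second column of the frametimes CSV is the cumulative time in milliseconds (the file name is made up):

import csv

with open("benchmark frametimes.csv", newline="") as f:   # example file name
    reader = csv.reader(f)
    next(reader)                                     # skip the header row
    timestamps = [float(row[1]) for row in reader]   # assumed: column 2 = cumulative ms

# The difference between consecutive timestamps is the time spent on each frame.
frame_times = [curr - prev for prev, curr in zip(timestamps, timestamps[1:])]

# "Instantaneous FPS": each frame time normalized over a second.
instantaneous_fps = [1000.0 / ft for ft in frame_times if ft > 0]

print("worst frame time (ms):", max(frame_times))
print("lowest instantaneous FPS:", min(instantaneous_fps))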
     
    What other tools did I consider and why weren't they used?
    EVGA Precision X
Polls system-wide stats. Also, while it has a frame rate counter, it samples over a period, which can mask hiccups (and it's likely just the inverse of FPS). While higher sampling rates can be used, I noticed this adds significant load to the GPU.

GPU-Z
Polls system-wide stats.

MSI Afterburner
Polls system-wide stats. May also have the same issues as EVGA Precision X.
  17. Mira Yurizaki
This was brought up in a conversation about how Windows reports processor utilization, whether for the CPU or the GPU. You may be tempted to think that processor utilization is, or should be, the percentage of the processor's resources being used over time, rather than something based on idle time, which is the time the CPU spends running the OS's idle task. In a previous blog post I touched on this, but I don't think the reasoning I gave painted the entire picture. Basically I said:
     
     
Or rather, this was looking at it from the point of view that utilization should be when the CPU is executing something "useful," whatever that means. So let's expand on why the OS reports processor utilization based on idle time instead of something that may make more sense, like actual CPU resource utilization.
     
    Backup: What is the "Idle process?"
The idle process is a special process that runs on the OS. It's always there, and it runs whenever the OS has nothing else scheduled for the CPU to do; that is, none of the other apps have work ready because they're all waiting on something. So the processor utilization the OS reports is effectively the percentage of time over some sampling period that the idle process was not the one running.
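In other words, the bookkeeping is simple. A toy sketch with made-up numbers:

# Over a sampling period, utilization is just whatever fraction of the time
# the idle process was NOT the one running.
sample_period_ms = 1000.0
idle_time_ms = 650.0          # made-up number: how long the idle process ran

utilization = 100.0 * (1.0 - idle_time_ms / sample_period_ms)
print(f"CPU utilization: {utilization:.0f}%")   # CPU utilization: 35%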
     
    Of note, the idle process isn't something that runs only NOPs (no operation) forever or goes right to sleep, but often takes care of background tasks. In portable systems, it may put the system into a low power mode once certain conditions are met.
     
How do you even define "utilization" from a resource usage standpoint?
    Let's take a look at the block diagram for AMD's Zen microarchitecture (I don't know why I keep using it, but it's a nice diagram. Also from https://en.wikichip.org/wiki/File:zen_block_diagram.svg)

     
    There are a lot of things here you could count as being used:
- How much of the cache is not dirty?
- How many instructions is the decoder working on?
- How many uOPs are being queued up?
- How full is the retire queue?
- How many schedulers are being pinged with something to do?
- Those schedulers have queues too; how full are they?
- How many execution units are being used?
- etc., etc., etc.
So if you wanted to know how much of the CPU's resources are being used, there's a lot to keep tabs on. If you really wanted the details of each of the above, that would require silicon to keep track of it all. Even though this silicon likely wouldn't take up a lot of space, I'm pretty sure people would rather have that space put toward something else.
     
However, there lies another problem: you'll almost never reach 100% utilization. The first reason is that unless you write completely linear code from start to finish, you'll have gaps due to branching. Another is that mixing integer and floating point work together is a hassle, so typically some code is integer-only and some is floating-point-only; you'll likely never saturate both the integer and FP sides of the CPU at once. On the opposite end, you'll never reach 0% utilization either, at least not completely: 0% would mean the CPU isn't running, period, and the CPU is always running something.
     
    It's about the perspective
It's important to note that when displaying information from a system, you have to take into account who's viewing it and when it's necessary to show it. Take a typical car. The car measures and keeps track of many more parameters than its instrument panel shows: oil pressure, battery voltage, tire pressure, oxygen content in the intake, whether the brakes are locking, whether the tires are slipping, etc. But none of this is shown as a value on the instrument panel; the most you get is a warning light that comes on only when there's a problem. Heck, even in my car there's no gauge for engine temperature, which is something many cars used to have. For the average Joe, if all of these values were somewhere on the instrument panel, it would be information overload. Most of these values have no meaning to the person unless there's an actual problem. Worse yet, they may misinterpret some of the values as dangerous even though there's nothing wrong with the car: 220F sounds pretty hot, but that's a normal temperature for oil once the car is warmed up.
     
Similarly with a computer: knowing all of this information may be useful, but for most people using the computer, these aren't values that really matter. It doesn't even have to be from the user's perspective, but from the software's as well. What would happen if the average Joe saw that the execution units on the CPU weren't at 100%? What does "uOP queue" even mean? Certain values aren't necessarily bad per se, and even if users knew what these values meant, they couldn't do anything about them anyway.
     
So the two major perspectives I want to go over as far as CPU utilization is concerned are the operating system's and the application's points of view.
     
    The Operating System's job is to service applications
The job of the operating system is to manage the resources available to the computer so that the applications running on it can share those resources. In order to do so, it has to look at everything from the system point of view, meaning the hardware and software in total, not just a single piece. Once an application has a resource, the OS largely doesn't care what the application does with it, and assumes the application will either get whatever it wanted done or tell the OS it doesn't need the resource anymore. Or to put it another way: a manager has employees to look after, but the manager's job is to assign tasks to them. The manager largely doesn't care how a task gets done (assuming there are no stringent company requirements), as long as it gets done.
     
    The application just wants to use the CPU, so it cares how it's used
The way most modern operating systems allocate the CPU to applications is by giving each application a slice of time to run. Once the time slice is up or the application releases control of the CPU, the OS takes over and finds something else to put on the CPU. So from this viewpoint, it's important to maximize the usage of the CPU as much as you can for the tasks you need to complete.
     
However, if the OS is only going to report utilization based on the percentage of time spent running the idle process, how can you figure out how an app is using the CPU? After all, that's useful for finding performance choke points. There is a way: profilers. However, profilers are rather intrusive, since they interact with many intimate parts of the software and possibly the CPU. I alluded to this in my previous post: profiling the application's CPU usage all the time to get more accurate results can cause a much larger performance degradation, all for information that doesn't add much value once the application is shipped.
     
    On the flipside, does the application care how other applications are using the CPU? Not really. After all, if my application is on the CPU, it should be the only thing on the CPU. Even in a simultaneous multithreading environment, I might care what the other app is doing, but even if I did know what was going on, my app can't do anything about it.
     
    The information can also be misinterpreted
Going back to the car example from earlier: even if the OS could provide processor resource usage, this information can be misinterpreted by the average end user. Say we had two processors, one with a smaller instruction queue than the other. Give both processors the same task and of course they'll report different utilization values. Someone who doesn't know better could interpret that to mean one processor works harder than the other because its instruction queue is fuller percentage-wise, or that one processor is worse than the other for the same reason. It doesn't really matter: if they both complete the task in the same time, they're both equally good.

    What do most end-users really care about?
    For most end-users, you have to ask yourself what information they would really care about. For instance, what's a better metric when troubleshooting system issues? Knowing that your uOP queue is 80% full and your integer schedulers have 10 entries each or knowing that App A is hogging up most of the CPU time and that's probably slowing everyone else down?
     
Or, for perhaps a better example a user commonly encounters: Wi-Fi. The typical user doesn't care what protocol is being used, how fast the link speed is, or what the exact signal strength is. The typical user only cares about two things: 1. How strong is the signal, on a scale I can understand? and 2. Am I connected?
  18. Mira Yurizaki
I came across the news that Fallout 76 has a problem: you can edit a value in one of the .ini files that affects the frame rate, and the physics engine is tied to that, with the end result that players are able to move faster simply by rendering more frames. Obviously this is a problem, but why do developers design physics engines around a frame rate? Because you need a rate of change, and the frame rate is a convenient source of that rate.
     
A lot of equations in physics are over time, meaning they measure something that happens between two points in time. Take velocity: it's the distance between two points in space divided by the time it took to travel that distance. Acceleration is the rate at which an object's velocity changes between two points in time. The problem with physics simulations in games (and perhaps in general) is that everything is being calculated for an instant of time. You might know the input parameters, like how fast an object is going, but you won't know how its velocity will change in the future, even if you knew all of the factors affecting it, because you need one more thing: how long is the object going to be subjected to those effects? To put it another way, it's like asking: I have an object going 5 meters per second, it's experiencing an acceleration of -1 m/s^2, what's its velocity going to be? I don't know, because I don't know how long the object will be experiencing that acceleration.
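To make the frame rate dependence concrete, here's a minimal sketch (the numbers and the hard-coded timestep are made up for illustration, not how any particular engine is written) of what happens when a game moves objects by a fixed amount every frame:

# If the engine assumes a target frame rate and moves an object by a fixed
# amount each frame, then rendering more frames per second means more physics
# updates per second, i.e. the object simply moves faster.

def run(frames_per_second, seconds=1.0, speed=5.0, assumed_dt=1.0 / 60.0):
    position = 0.0
    for _ in range(int(frames_per_second * seconds)):
        position += speed * assumed_dt     # dt is hard-coded, not measured
    return position

print(run(60))    # ~5.0 m travelled in one second, as intended
print(run(120))   # ~10.0 m travelled in the same second: twice as fast

# The usual fix is to measure the real elapsed time between updates and use
# that as dt (or run the physics on its own fixed-rate loop, decoupled from
# rendering), so the result no longer depends on how fast frames are drawn.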
     
What Bethesda likely does is run the physics simulation at a reasonably frequent fixed rate, and cap the frame rate to that rate as well. This may also cap the input sampling rate. So why would they do this? Because rendering faster than the rate at which the physics are simulated would mean rendering extra frames that don't add any new information. This may have been a deliberate design choice, because other developers don't seem to care, like Ubisoft with Assassin's Creed:
    The cloth runs at "half frame rate" because the physics engine for this isn't ran as fast the graphics rendering is, and so you have this weird effect with the cloth seems to move at a disjointed rate.
     
So when you uncap the frame rate in a Bethesda game, you're effectively telling the game to run the physics engine at a faster rate, which is great!... except that everything else about the game was designed around this one value, and it affects how everything else behaves. Really, the solution to Bethesda's problem is to not make this value changeable in the first place. Or, you know, design a physics engine that isn't dependent on the frame rate.
  19. Mira Yurizaki
    Note: This is a copypasta of a reply I did to a topic.
     
I think the VRAM thing is more complicated than "[Game] uses X amount of VRAM, therefore you need more than X amount of VRAM these days" for performance. I've been reading around the internet that games will request more VRAM than they actually need and may never use it, much like how apps may overshoot how much memory they request (yes, this is actually a thing: https://blogs.msdn.microsoft.com/oldnewthing/20091002-00/?p=16513). Though in a lot of cases where I've seen VRAM usage reported, the game tends to use the same amount regardless of how much VRAM is available, e.g., a game uses around 4.5GB of VRAM whether the card has 6GB, 8GB, or 16GB.
     
    So what about the case where a game uses roughly the same amount of VRAM regardless and there isn't enough? I'm not convinced there's a huge issue here. So here's an example (from https://www.techspot.com/article/1600-far-cry-5-benchmarks/ )
     

     
    Given that Far Cry 5 uses around 3GB of VRAM at 1080p, this may not be much of an interesting result to look at but for the record:
     
Now the 1440p and 4K benchmarks should be more interesting. Clearly Far Cry 5 will use more VRAM than the GTX 1060 3GB has.
Yet strangely enough, performance, even in the min FPS results, isn't tanking hard and remains in line with the expected performance delta from the GTX 1060 6GB. In fact, even the GT 1030 only sees a roughly linear drop-off in performance despite having 2GB of VRAM (1440p is basically 2x the pixels of 1080p, and 2160p is 4x).
     
    And for all the research I'm willing to do, I came across a PcPer article interviewing one of NVIDIA's VPs of engineering, with the most interesting bit being:
     
tl;dr: reported VRAM usage may not actually be indicative of what the game needs.
  20. Mira Yurizaki
When it comes to making a program for a computer, while there are a plethora of languages and ways to make an application, there are only two basic types of programming languages and two basic types of applications. This post will cover what those types are.
     
    Programming Language Types
    There are two basic types of programming languages: low level and high level.
     
    Low level languages are the "hardware" languages, which are typically defined into two more types:
- Machine language, which is literally the binary values being fed into the machine. A programmer can either write out the 0s and 1s or some other representation of the numerical values, typically hexadecimal. Machine language instructions consist of primarily two parts:
  - An opcode, which maps a number to an operation, e.g., 1 is add, 2 is subtract, 3 is jump somewhere else.
  - An operand, which tells the machine which datum or data to operate on.
- Assembly language, which gives the opcodes mnemonic names and allows operands to have names as well, but otherwise maps more or less directly to machine language.
Low level languages tend to be architecture specific. Even if two architectures contain the same operations, their opcodes may be different. And even if two architectures use the same assembly mnemonic, they may handle it differently, especially with the operands. Due to this architecture-specific nature, it's rare to write programs by hand in a low level language; it's typically reserved for architecture-specific optimizations or when you need to get down into the depths of software baked into the hardware. The advantage of using a low level language is that this is the fastest your program will operate, since you're talking directly to the hardware.
     
High level languages attempt to provide a human-readable way of representing program code. So instead of writing mov 10, x, you can write x = 10 or x is 10. This typically comes at the cost of speed, due to the need to translate what the higher level language is trying to accomplish. Also, depending on the language, it may bar you from some higher-performance features that are handy if used correctly but disastrous if used incorrectly. One could also argue that some high level languages are really "mid-level" languages, in that they lack many modern conveniences and map fairly directly to assembly language. For example, C is sometimes dubbed a "mid-level" language because of how bare-bones it is and how easily it compiles into a low level language. This is in contrast to, say, Python, which comes with many more features and doesn't readily compile into a low level language.
     
No matter the language type, all of them except machine code have to be turned into some form of machine-readable code (though machine code may be represented as a text file, such as Intel HEX, which has to be read by a loader to be runnable). For those looking for specific terms: if the source code being turned into machine code is assembly language, the program that does this is called an assembler. If the source code is a high level language, there are various ways of converting it closer to machine-readable code, with the three main types being:
- Ahead-of-Time compiling (AoT): the source code is compiled into an executable for loading and running. This generally allows for the fastest execution. Examples of normally AoT-compiled languages are C and C++.
- Just-In-Time compiling (JIT): the source code is compiled into an intermediate form, then when the program is run, this intermediate form is compiled to machine code as needed. Examples of normally JIT-compiled languages are Java and the .NET family (C# and VB.NET).
- Interpreting: the source code is read and executed line by line. To help speed up the process, it may be JIT compiled instead. Examples of interpreted languages are BASIC, Python, and JavaScript.
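As a small aside you can try yourself: even "interpreted" CPython quietly compiles source code into an intermediate bytecode form that its virtual machine then executes, and the standard dis module will show it to you. A minimal sketch:

import dis

def add_ten(x):
    return x + 10

# Print the bytecode CPython generated for the function above.
dis.dis(add_ten)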
    Computer Language Types
    You might be familiar with a lot of things that are "languages" that tell a computer what do, but these break down into various categories based on what their intended role is. The main ones you usually encounter are:
- Programming Language: This describes, obviously enough, computer programs. While this is a broad description, I tend to think of a program as something that runs directly on the hardware, i.e., the instructions represent something the actual hardware is capable of doing.
- Scripting Language: Scripting languages differ from programming languages in that they are meant to be run from within a program, to automate tasks or manipulate something about the program itself. A way to think of this is JavaScript on web pages: the JavaScript is run by a web browser to manipulate the web application, but not all of it represents what the machine itself is supposed to do, e.g., a machine doesn't know what a button is, but a web browser does.
- Configuration Languages: These are data containers that store, well, configurations that programs use to set parameters. A popular modern example is the JSON format, due to being human readable and easily turned into binary data.
- Markup Languages: These describe a document and how it should look. The name comes from "marking up" a paper during editing.
For all intents and purposes, you could argue all of these are some sort of "programming" language, since you are telling a computer what to do. But if you want to be a snob about what counts as a "real" programming language: if a language is Turing complete, then it can express any program conceivable. Virtually all programming and scripting languages are Turing complete. So what isn't Turing complete? Configuration and markup languages, some query-based languages like SQL, and some other ways of changing the behavior of an action, such as regular expressions.
     
    Computer Application Types
    Computer applications break down into two main types:
     
    Application Program
    An application program, or just application, provides a service to the user. Examples include web browsers, document editors, and media players. As you might guess, this is where the term "app" comes from. While applications today are typically written in higher level languages, earlier ones had to be written in lower level languages.
     
    System Program
A system program provides a service to applications. Examples include firmware, hardware drivers, and to a strong degree, operating systems. They typically do not contain any sort of user-facing interface, instead relying on the user to edit configuration files to change the program's behavior. If a system program does contain a user interface, it may be the only program running on the system (such as the UEFI/BIOS settings) or the interface may be decoupled from the core components of the program itself (such as the OS's GUI environment being separate from the kernel).
     
  21. Mira Yurizaki
    UPDATE: I've edited this blog too many times because I always think I'm done, but then another idea comes up. *sigh* But I should be done now.
     
    With AMD's semi-recent announcement of their server processors using the so-called "Chiplet" design, I thought it'd be a good idea to talk about how this could affect other processor types. People have pointed to GPUs being the next logical step, but I've been hesitant to jump on that and this blog is to discuss why.
     
    An Explanation: What is the Chiplet Design?
    To understand the chiplet design, it's useful to understand how many processors are designed today. Typically they're designed using the so-called monolithic approach, where everything about the processor is built onto a single piece of silicon. The following is an example of a quad core design:

     
    Everything going to the processor has to go through an I/O section of the chip. Primarily this handles talking to main memory, but modern processors also have other I/O built in like PCI Express lanes or display compositors (the GPU would be considered a separate thing). From there, it goes through a typically much faster inter-processor bus where the processor's cores talk among each other and through the I/O.
     
    What the chiplet design does is separate the cores and I/O section into different chips.

The advantage here is that if one part of the processor turns out bad, the entire processor doesn't have to be thrown away. But it doesn't stop there: as long as the I/O section can support more processor core chiplets, you can scale it out to however many you want. Something like this:

    This is obviously a great design. You need more cores? Just throw on another chiplet!
     
So what's the problem with GPUs adopting this? It's the expectations of what each type of processor is designed to take care of, and their core designs reflect that.
     
    A Comparison Between a CPU Core and a GPU Core
    At the heart of a processing unit of any sort is the "core", which I will define as a processing unit containing a memory interface, a "front-end" containing an instruction decoder and scheduler, and a "back-end" containing the execution units. A CPU core tends to have a complicated front-end and a back-end with a smaller number of execution units, while a GPU tends to have a simpler or smaller front-end with a much larger back-end. To put it visually:
     

    Block Diagram of an AMD Zen 1 CPU Core
     

    Block Diagram of an AMD Fiji GPU Core. Each "ACE" is a Front-End Unit and Each "Shader Engine" is a Back-End Unit
     
    They are designed this way because of the tasks they're expected to complete. A CPU is expected to perform a randomized set of instructions in the best way it can from various tasks with a small amount of data. A GPU is expected to perform a smaller number of instructions, specifically built and ordered, on a large amount of data.
     
    From the previous section about chiplet design, you might be thinking to yourself: "Well can't the Fiji GPU core have the stuff on the left side (HBM + MC) and the right side (Multimedia Accelerators, Eyefinity, CrossFire XDMA, DMA, PCIe Bus Interface) separated into its own chip?" Well let's take a look at what the Fiji GPU die looks like (taken from https://www.guru3d.com/news-story/amd-radeon-r9-fiji-die-shot-photo.html)
     
     

     
The big part in the middle is all of the ACEs, the Graphics Command Processor, and the Shader Engines from the block diagram. This takes up roughly, if I had to guess, 72% of the die itself. Not only that, aside from everything on the right side of the block diagram, this GPU core still needs everything from the left side, i.e. all of the HBM and MC parts; something needs to feed the main bit of the GPU with data, and this is a hungry GPU. To put it another way, a two-chiplet design would look very similar to the dual-GPU, single-card designs of years past, like the R9 Fury Pro Duo:

    But Wouldn't Going to 7nm Solve This Issue?
While it's tempting to think that smaller nodes mean smaller dies, the thing with GPUs is that adding more execution units increases performance, because the work they solve is what's known as embarrassingly parallel: it's trivial to split the work up across more units, just more pixels per second to crunch. This isn't the case with the CPU, where instructions are almost never guaranteed to be orderly and predictable, the basic ingredient for parallel tasks. So while adding more transistors per CPU core hasn't always been viable, it has been for GPUs, and so the average die size of a GPU hasn't gone down as transistors have gotten smaller:

    Transistor count, die size, and fabrication process for the highest-end GPU of a generation for AMD GPUs (Data sourced from Wikipedia)
     
    Since AMD has had weird moments, let's take a look at its competitor, NVIDIA:

    Transistor count, die size, and fabrication process for the highest-end* GPU of a generation for NVIDIA GPUs (Data sourced from Wikipedia)
     
    Notes:
- G92 is considered its own generation due to appearing in two video card series.
- The GTX 280 and GTX 285 were included due to being the same GPU with a die shrink.
- TITANs were not included since the Ti versions are more recognizable and are the same GPU.
    But the trend is the same: the average die size for the GPUs has remained fairly level.
     
Unfortunately, transistor count for CPUs isn't as straightforward as it is for GPUs. Over the years, processors have integrated more and more things into the package. We can't even compare, say, an AMD Bulldozer transistor count to an AMD Ryzen transistor count, because Ryzen integrates more features like extra PCIe lanes and the entirety of what used to be the "northbridge", among other things. With that in mind, it's still nice to see some data for where things have ended up overall:

    Transistor count, die size, and fabrication process for various processors (Data from Wikipedia)
     
One just has to keep in mind that at various points, processors started to integrate features that aren't related to the front-end, back-end, or memory interface, so if you counted only the core-related parts, processors from that point on would actually have a lower transistor count and thus die size.
     
    How about separating the front-end from the back end?
This is a problem because the front-end needs to know how to allocate its resources, which are the back-end. Separating them introduces latency due to the increased distance, and overhead because of the constant need to figure out what exactly is going on. To put it another way: is it more efficient to have your immediate supervisor in a building across town, or in the same building you work in? Plus, the front-end doesn't take up a lot of space on the GPU anyway.
     
    What About Making Smaller GPUs?
    So instead of making large GPUs with a ton of execution units, why not build smaller GPUs and use those as the chiplets? As an example, let's take NVIDIA's GTX 1080:

     
    Compare this to the GTX 1050/1050 Ti (left) and the GT 1030 (right):
      
     
    With this, you could take away the memory and PCI Express controllers and move them to an I/O chip, and just duplicate the rest as many times as you want. Except now you have SLI, which has its problems that need to be addressed.
     
    The Problem with Multi-GPU Rendering
    The idea of multi-GPU rendering is simple: break up the work equally and have each GPU work on the scene. If it's "embarrassingly" easy to break up the rendering task, wouldn't this be a good idea? Well, it depends on what's really being worked on. For example, let's take this scene:

    Approximate difficulty to render this scene: Green = Easy, Yellow = Medium, Red = Hard
     
    The areas are color coded more or less to approximate the "difficulty" of rendering it. How would you divide this up evenly so that every GPU has an equal workload? Let's say we have four GPU chiplets.
     
Obviously splitting this scene into quadrants won't work, because one of the chiplets will be burdened by the large amount of red in the top right while another sits around doing almost nothing taking care of the top left. And because you can't composite the image until everything is done, the GPU taking care of the top right portion will be the bottleneck. Another option may be to have the chiplets work on whole frames in succession, though this becomes an issue with more chiplets since you can't render ahead too far, and this sort of rendering is what causes microstuttering in multi-GPU systems. Lastly, we could have the chiplets each render the entire scene at a reduced resolution but offset a bit, or divvy the scene up by, say, alternating pixels. This could minimize the workload imbalance, but someone still has to composite the final image, and there could be a lot of data passing back and forth between the chiplets, possibly increasing bandwidth requirements more than necessary. That's also not counting another aspect GPUs have taken on lately, general compute tasks, and then there's the question of VR, which is sensitive to latency.
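To put some toy numbers on that imbalance, here's a small sketch with a made-up per-pixel "cost" grid (a real renderer can't know this ahead of time, which is part of the problem), comparing a simple screen split against interleaving the pixels between two GPUs:

# Made-up costs for a tiny 4x4 "frame": 1 = easy to render, 9 = hard.
cost = [
    [1, 1, 1, 9],
    [1, 1, 5, 9],
    [1, 1, 5, 5],
    [1, 1, 1, 1],
]

# Split the screen into a left half and a right half, one GPU each.
halves = [
    sum(row[c] for row in cost for c in range(2)),
    sum(row[c] for row in cost for c in range(2, 4)),
]

# Checkerboard the pixels between the two GPUs instead.
checker = [0, 0]
for r, row in enumerate(cost):
    for c, value in enumerate(row):
        checker[(r + c) % 2] += value

print(halves)    # [8, 36]  -- one GPU does almost all the work
print(checker)   # [20, 24] -- much more even, but both GPUs touch the whole frame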
     
    Ultimately the problem with graphics rendering is that it's time sensitive. Whereas tasks for CPUs often have the luxury of "it's done when it's done" and the pieces of data they're working on are independent from beginning to end, graphics rendering doesn't enjoy the same luxuries. Graphics rendering is "the sooner you get it done, the better" and "everyone's writing to the same frame buffer"
     
    What about DirectX 12 and Vulkan's multi-GPU support?
With the advent of DirectX 12 and (possibly) Vulkan adding effective multi-GPU support, we may be able to overcome the issues described above. However, that requires developer support, and not everyone's on board with either API. You may want them to be, but a lot of game developers would rather spend their time getting the game done than optimizing it for performance, sad to say.
     
Plus it would present issues for backwards compatibility. Up until this point, we've had games designed around the idea of a single GPU and only sometimes more than one. And while some games may perform well enough on multiple GPUs, many others won't, and running those older games on a chiplet design may result in terrible performance. You could perhaps work around this by using tools like NVIDIA Inspector to create a custom SLI profile, but doing that for every game would get old fast. Technology is supposed to make our lives easier, and that certainly wouldn't.
     
    But who knows? Maybe We'll Get Something Yet
    Only time will tell though if this design will work with GPUs, but I'm not entirely hopeful given the issues.
  22. Mira Yurizaki
    I had two burning questions in my head:
- What happens if the fan is pointing down?
- Are the intake fans creating much of a difference?
The new configs are:
- Intake at 30% with the other fans on a custom fan curve
- Fan pointing down, not running CAM
- Fan pointing down, running CAM with fans at 50%
So here's all the data compiled. Regarding the charts: instead of using maximum clock speed (all of the runs were within 1% of each other), I used maximum temperature, which is a far more useful statistic. I also found a better way to represent the performance cap data. Instead of showing bars and counts, the counts are normalized to 100% and each performance cap reason takes a chunk of that. This makes it much easier to visualize how much of the time the GPU spent under each reason.
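For reference, the normalization itself is trivial; a quick sketch with made-up counter names and counts:

# Made-up performance cap reason counts from a GPU-Z log.
cap_counts = {"Power": 140, "Vrel": 320, "Util": 30, "Thermal": 0, "Vmax": 0}

total = sum(cap_counts.values())
shares = {reason: 100.0 * count / total for reason, count in cap_counts.items()}

for reason, share in shares.items():
    print(f"{reason}: {share:.1f}%")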
     
Spoiler alert: running with the fan pointing down and without CAM running performed the worst, even running into Thermal performance caps. However, using CAM to turn the fans up to 50% seemed to help a lot. The 30% intake fans performed very well, which makes me wonder if this is just a problem of air circulation within the case rather than a problem of trying to push air in.
     
    Why bother with gathering this data?
One could think this only affects my configuration, but I think it affects any micro-ATX case with a tempered glass side panel and a PSU shroud. If anything, the results from this data gathering could be generalized for other people who want to tweak their setup for the best cooling performance.
     


     

     


     
     
     
  23. Mira Yurizaki
This is a follow-up to my blog on the airflow mod I made. It was brought to my attention that the sound card might play a role in how well my video card stays cool. My presumption was that it wasn't doing much to affect the cooling potential, because the issue was moving hot air away from the video card area, and while the video card does pull air in from the rear, there seemed to be enough airflow that the sound card wouldn't make much of a difference.
     
So today I decided to test whether or not the sound card had an impact, along with retesting whether or not the airflow mod had an impact. I also pondered the numbers I got earlier and wondered if another factor had something to do with them: the intake from the front and the exhaust in the rear and top.
    The Setup
    Taking all of these combinations together, I've narrowed down the parameters to:
    Fan speed of the top and rear of the case (they're both tied to the same fan controller channel) Fan speed of the two front intake fans If NZXT CAM was running or not If EVGA Precision XOC was running or not Whether or not the airflow mod fan was installed Whether or not the sound card was installed To avoid having a bajillion combinations to test, I've eliminated the following variables by setting them to a default
- If NZXT CAM is running, both the exhaust and intake fans are at 50%.
- If NZXT CAM is not running, the controller likely keeps them fixed at a lower RPM. Due to my observation the other day of the new LED fans "blinking", I can tell that without NZXT CAM running they're not being varied, because no blinking occurs.
- EVGA Precision XOC will always be running with a custom fan profile.
Basically, the only variables I'm testing are:
- NZXT CAM running or not
- Airflow mod fan installed or not
- Sound card installed or not
The following is the software setup when EVGA Precision XOC and NZXT CAM are running:

     
    Also of note, the ambient air temperature in the room was about 68°F or 20°C
     
    What I'll be doing
- Run the Final Fantasy XIV Stormblood Benchmark three times, with about a 10 second delay between runs. The logging captures each configuration, but not each individual run, i.e., the same log file is used for all three runs. This benchmark is used because this game is a frequent use case for me and it lasts about 4-5 minutes per run. I did not want to run something like FurMark because that's not a realistic test, and I'm not interested in workloads I won't be subjecting the video card to.
- Use GPU-Z to log GPU clock speed, temperature, and performance cap reason.
And the results
    Table Format
    (Note that some of the formatting didn't carry over)

     
    Charts of Interest
     
    Average Clock Speed (higher is better)

    Maximum Clock Speed (higher is better)

    Average GPU temperature (lower is better)

    Performance Cap Reason Counts
This one needs explaining. The goal is to have zero Thermal hits. Vmax should not be encountered because I haven't allowed the card to push past the default voltage limits. Util just means the GPU wasn't given a hard enough workload. The other two, Power and Vrel, are for hitting the TDP and for not being able to clock higher due to hitting the stock voltage limit. Now, is one better than the other? For the purposes of this test, actually, there is: power delivery circuitry becomes less efficient as it heats up. It's not by much, but when you have hardware that's being pushed to the edge, that "not by much" becomes "actually an appreciable amount". This is why it's important to cool your VRM if you plan on pushing overclocks.
     
    tl;dr, better is:
- 0 Thermal
- Lower Power
- Higher Vrel
This chart only has Power and Vrel, as no Thermal reasons were hit and the other two, Vmax and Util, don't matter.

     
    Conclusions
Despite my initial presumption, removing the sound card was as effective at decreasing the temperature of the video card as adding the airflow fan mod, which suggests more airflow was reaching the rear video card fan. Having both the airflow mod and no sound card resulted in significantly better cooling. I'm not sure if NZXT CAM was a factor, since in some cases it helped a lot and in others it didn't seem to help at all.
     
    Some other conclusions I can make regarding this:
- microATX cases, at least ones of this size and configuration, are not ideal for an "open air" style video card cooler if you plan to use another expansion card. This may be mitigated by turning your intake fans up higher if you don't want to lose the expansion card or do an airflow mod.
- Having a mid-case air circulator helps regardless.
- They make L brackets for case fans!
  24. Mira Yurizaki
    Let's start with the most basic of questions when it comes to programming a computer: What is a computer? It seems a silly question for a device we take for granted but it's important to understand what it is if you want to program for one.
     
To put it simply, a computer is a calculating machine. That's it. Its sole purpose is to compute things. A computer need not be electronic either, as there were mechanical calculating machines such as Charles Babbage's Analytical Engine, with programmable machines around as early as the 19th century. Before the advent of the fast calculating machines we're used to, there were teams of people, called computers, who produced math tables. And much like the electronic computers of today, human computers were simply given a set of instructions and were not allowed to deviate from them.
     
While I won't go into a deep dive into the history of machine computers, even with the advent of fully electronic and then fully digital computers (analog computers were a thing!), the computer's job for a long while was primarily computation: everything from military artillery tables and accounting, to even predicting who would win the 1952 US presidential election. Eventually this evolved into letting computers control machines directly, instead of showing a value to a human and having them control the machine. This was mostly useful in aerospace, where many minute inputs in a small period of time could often correct an airplane or spacecraft better than a human could by feel. But as computers became more powerful, they grew to controlling more and more things, until finally we have our modern, electronic, digital computer.
  25. Mira Yurizaki
After stewing on it for a while, I decided to start a blog of guides for the budding programmer. This will cover the basics and some intermediate topics, and if I feel like sharing something more advanced, that may crop up once in a while. This series is intended to give a crash course on programming and concepts in general, and while I may use a programming language for examples, I must emphasize that the language does not matter here. The programming language is just the means with which to get things done.
     
    Please note that this blog isn't purely for software based topics. This will include some hardware based topics as well because understanding hardware influences how you should design software at times, even if you have an OS that theoretically takes care of everything for you.
     
    As some background of myself:
    My educational background is in Computer Engineering. For those that don't know, the simplified explanation is Computer Engineering is a combined discipline of both Computer Science and Electrical Engineering. Where the Computer Scientist focuses mostly on software and Electrical Engineer focuses mostly on hardware, the Computer Engineer looks at both in a way to develop computer systems.

So basically that means I have at least a basic background in electrical engineering on top of my software development skills. And while I haven't had a need to hash out circuit designs, just knowing the basics of electronics has helped me understand computer systems better.
      I'm entering the 10th year of my career as a software engineer/developer/whatever they call it these days. My development experience is in embedded systems  (or from a high level view, simple computer systems, often not running an OS) and application software.  
    While I do have a list of topics in mind, feel free to suggest something in this post as a comment and I'll see what I can do.