Background On Parallelism and HPC

patrickjp93


It may come as a shock to some that just over 15 years ago CPUs had only a single core, and running two threads on one core simultaneously was an idea still in its infancy. Nowadays dual-core chips are practically the ubiquitous minimum across the civilized world. Even so, to this day most consumer software runs on a single thread and targets only the oldest instruction sets of a given platform. There are reasons for this, such as making software which can be sold to the most people with a reliable minimum performance metric that is concrete and easy to calculate. Whether these are good reasons is easy to argue back and forth, though I think it's safe to tell the board of sales to shove it in a post-Sandy Bridge era.

With Windows 10 very likely to finally shove Windows XP and Windows 7 users from their moorings, and with the new OS requiring the presence of newer instruction sets, we can finally guarantee a much better minimum performance based purely on instruction-level parallelism, gained through compiler optimization of code and the out-of-order processing (OOOP) unit in x86 CPUs made by Intel and AMD. This, however, is limited by the skill of coders to write code that is implicitly parallel: code which can be rapidly analyzed in small sections and is obviously concurrent to a compiler or to the CPU's OOOP engine, with little to no dependence on the immediately preceding code. In the microarchitecture world this is superscalar pipelining, where instructions are ordered so they execute in a shingled fashion: one instruction can be fetched while another is decoded, another is executed, and the result of yet another is written back.

There are all manner of data and control hazards in most code which are not obvious to the naked eye, even to seasoned programmers who haven't taken a course or read material on the subject of High Performance Computing (HPC). HPC is not to be confused with Scientific Computing, Parallel Computing, or especially Heterogeneous Computing, though those sectors of the computer science world actively use HPC concepts to squeeze every last drop of performance from a given system's hardware while increasing power efficiency. High Performance Computing is about maximizing the performance of a single computing resource through code restructuring. This is by no means an easy task in any code base intricate enough to be used in production systems.
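To make "implicitly parallel" concrete, here is a minimal C++ sketch; the function names and data are my own illustration, not taken from any particular code base. The first loop carries a dependence from one iteration to the next, so neither the compiler nor the OOOP engine can overlap the additions. The second loop's iterations are fully independent, so it pipelines and vectorizes readily.

```cpp
#include <cstddef>

// Loop-carried dependence: each addition needs the previous result,
// so the additions retire more or less one after another.
double sum_serial(const double* a, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i];              // acc depends on the previous iteration
    return acc;
}

// Independent iterations: c[i] depends only on a[i] and b[i], so the
// compiler can vectorize this loop and the pipeline can keep several
// iterations in flight at once.
void add_arrays(const double* a, const double* b, double* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];       // no dependence between iterations
}
```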

So what is easy by comparison? Frankly, it's thread-level parallelism. Not every task is internally concurrent, and thus not every task is appropriate for multithreading; even for those which are, the overhead of launching and killing threads may entirely eclipse the perceived gains for small instances of a given problem. There are primarily two forms of parallelism focused on in modern computing paradigms: data-level parallelism and task-level parallelism. Data-level parallelism is practically self-explanatory. Any task which can concurrently search or transform a set of data is a good candidate for data-parallel optimization. This can come in the form of multiple threads working on different portions of the data set simultaneously, SIMD instructions iterating over the data set in chunks in a single thread, or (in the most optimal performance case) a combination of the two. Generally, this is the easiest parallel code for programmers to build.
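As a rough illustration of the thread-based flavor of data-level parallelism, here is a small C++ sketch; the thread count, chunking scheme, and scaling operation are placeholders I've chosen for the example. Each thread transforms its own disjoint slice of the array, so the threads never touch the same element.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each worker scales one disjoint range of the data.
void scale_chunk(std::vector<double>& data, std::size_t begin,
                 std::size_t end, double factor) {
    for (std::size_t i = begin; i < end; ++i)
        data[i] *= factor;        // no other thread touches this range
}

// Split the array across num_threads workers, then wait for all of them.
void scale_parallel(std::vector<double>& data, double factor,
                    unsigned num_threads = 4) {
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back(scale_chunk, std::ref(data), begin, end, factor);
    }
    for (auto& w : workers) w.join();   // all slices done before returning
}
```

The combined case mentioned above would have each worker also use SIMD instructions over its own slice, rather than processing one element at a time.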

Task-level parallelism is probably the most obvious, even if not the easiest, form of parallelism to implement. If two tasks can run concurrently and are both required to complete before a third task can begin, those first two tasks are prime candidates for task-parallel optimization. In simplest terms, throw each task into its own forked thread, let them execute, receive back the results, kill the spawned threads, and move on to task 3. You would think this isn't difficult at all to figure out, and frankly I would agree, but there are instances where tasks 1 and 2 touch the same resource. If either one is modifying it while the other is analyzing or trying to modify it in a different way, the results become unpredictable. This is an instance of a "race condition." This is where parallelism gets tricky, but the truth is that if one is careful, race conditions can be avoided for most things one can implement task-level parallel code for. If I want to save the temporary state of a file to disk, I can certainly do that while the data is being manipulated. It's a temp state, and the temporal locality (how close in time the snapshot is) is "close enough" that a loss of power or system error would cause only a tiny loss of data. Therefore, why should I lock the user out of modifying the file while it's being saved? If it's a large file, this is highly inconvenient. Solution: separate editing and saving into two threads, aka very simple task-level parallelism.
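Here is a bare-bones C++ sketch of that fork-join pattern; the task bodies are stand-ins for real work, and the names are my own. Tasks 1 and 2 run concurrently and share no mutable state, so there is no race condition, and task 3 only starts once both results are back.

```cpp
#include <future>
#include <iostream>

int task_one() { return 40; }   // stand-in for one independent computation
int task_two() { return 2; }    // stand-in for another independent computation

int main() {
    // Fork tasks 1 and 2 into their own threads of execution.
    std::future<int> r1 = std::async(std::launch::async, task_one);
    std::future<int> r2 = std::async(std::launch::async, task_two);

    // Task 3 cannot begin until both results have been received.
    int combined = r1.get() + r2.get();
    std::cout << "task 3 consumes: " << combined << '\n';
    return 0;
}
```

Passing results back through futures, rather than having both tasks write into shared variables, is one simple way to sidestep the race conditions described above.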

In future entries I intend to show that task-level parallelism is easy to implement thanks to standards and tools, developed and maintained by various non-profits, which can be applied ubiquitously to any architecture from the major and intermediate players of the microprocessor industry. As a consequence, I will argue that code should be developed to be parallel wherever possible, and that the era of excuses not to is over.
