General Software Development: September 2007

A quick words on various processor architecture, taken as is from wikipedia. More for my refrence than anything else.
There are three major processor architectures- Superscalar, VLIW and SIMD.

Superscalar:
A superscalar CPU architecture implements a form of parallelism called Instruction-level parallelism within a single processor. It thereby allows faster CPU throughput than would otherwise be possible at the same clock rate. A superscalar architecture executes more than one instruction during a single pipeline stage by pre-fetching multiple instructions and simultaneously dispatching them to redundant functional units on the processor.
The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy is the difference between scalar and vector arithmetic. A superscalar processor is sort of a mixture of the two. Each instruction processes one data item, but there are multiple redundant functional units within each CPU so that multiple instructions can be processing separate data items concurrently.
Superscalar CPU design emphasizes improving the instruction dispatcher accuracy, and allowing it to keep the multiple functional units in use at all times. This has become increasingly important as the number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU, a modern design like the PowerPC 970 includes four ALUs and two FPUs, as well as two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system as a whole will suffer.
A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, as both pipelined CPUs and Multicore CPUs also achieve that, but via different methods.
In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to redundant functional units contained inside a single CPU. Therefore a superscalar processor can be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread.

Limitations:Available performance improvement from superscalar techniques is limited by two key areas:
The degree of intrinsic parallelism in the instruction stream, i.e. limited amount of instruction-level parallelism, and The complexity and time cost of the dispatcher and associated dependency checking logic. Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of the other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; d = a + f might not be runnable in parallel, depending on the order in which the instructions complete as they move through the units.
As the number of simultaneously issued instructions increases, the cost of dependency checking increases extremely rapidly. This is exacerbated by the need to check dependencies at run time and at the CPU's clock rate. This cost includes additional logic gates required to implement the checks, and time delays through those gates. Research shows the gate cost in some cases may be nk gates, and the delay cost k2logn, where n is the number of instructions in the processor's instruction set, and k is the number of simultaneously dispatched instructions. In mathematics, this is called a combinatoric problem involving permutations.
Even though the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, as there is no assurance otherwise and failure to detect a dependency would produce incorrect results.
No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of functional units (e.g, ALUs), the burden of checking instruction dependencies grows so rapidly that the achievable superscalar dispatch limit is fairly small. -- likely on the order of five to six simultaneously dispatched instructions.
However even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus the degree of intrinsic parallelism in the code stream forms a second limitation.

VLIW:Very Long Instruction Word or VLIW refers to a CPU architecture designed to take advantage of instruction level parallelism (ILP). A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by executing different sub-steps of sequential instructions simultaneously (this is pipelining), or even executing multiple instructions entirely simultaneously as in superscalar architectures. Further improvement can be achieved by executing instructions in an order different from the order they appear in the program; this is called out-of-order execution.
These three techniques all come at a cost: increased hardware complexity. Before executing any operations in parallel, the processor must verify that the instructions do not have interdependencies. There are many types of interdependencies, but a simple example would be a program in which the first instruction's result is used as an input for the second instruction. They clearly cannot execute at the same time, and the second instruction can't be executed before the first. Modern out-of-order processors use significant resources in order to take advantage of these techniques, since the scheduling of instructions must be determined dynamically as a program executes based on dependencies.
The VLIW approach, on the other hand, executes operation in parallel based on a fixed schedule determined when programs are compiled. Since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware that the three techniques described above require. As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.
In superscalar designs, the number of execution units is invisible to the instruction set. Each instruction encodes only one operation. For most superscalar designs, the instruction width is 32 bits or less.
In contrast, one VLIW instruction encodes multiple operations; specifically, one instruction encodes at least one operation for each execution unit of the device. For example, if a VLIW device has five execution units, then a VLIW instruction for that device would have five operation fields, each field specifying what operation should be done on that corresponding execution unit. To accommodate these operation fields, VLIW instructions are usually at least 64 bits in width, and on some architectures are much wider.
For example, the following is an instruction for the SHARC. It simultaneously does a floating-point multiply, a floating-point add, and two autoincrement loads. All of this fits into a single 48-bit instruction.
f12=f0*f4, f8=f8+f12, f0=dm(i0,m3), f4=pm(i8,m9);Since the earliest days of computer architecture, some CPUs have added several additional arithmetic logic units (ALUs) to run in parallel. Superscalar CPUs use hardware to decide which operations can run in parallel. VLIW CPUs use software (the compiler) to decide which operations can run in parallel. Because the complexity of instruction scheduling is pushed off onto the compiler, the hardware's complexity can be substantially reduced.
A similar problem occurs when the result of a parallelisable instruction is used as input for a branch. Most modern CPUs "guess" which branch will be taken even before the calculation is complete, so that they can load up the instructions for the branch, or (in some architectures) even start to compute them speculatively. If the CPU guesses wrong, all of these instructions and their context need to be "flushed" and the correct ones loaded, which is time-consuming.
This has led to increasingly complex instruction-dispatch logic that attempts to guess correctly, and the simplicity of the original RISC designs has been eroded. VLIW lacks this logic, and therefore lacks its power consumption, possible design defects and other negative features.
In a VLIW, the compiler uses heuristics or profile information to guess the direction of a branch. This allows it to move and preschedule operations speculatively before the branch is taken, favoring the most likely path it expects through the branch. If the branch goes the unexpected way, the compiler has already generated compensatory code to discard speculative results in order to preserve program semantics.

SIMD:In computing, SIMD (Single Instruction, Multiple Data) is a technique employed to achieve data level parallelism.DSPsA separate class of processors exist for this sort of task, commonly referred to as Digital Signal Processors, or DSPs. The main difference between SIMD-capable CPU's and DSP is that the latter are complete processors with their own (often difficult to use) instruction set, whereas SIMD-extentions rely on the general-purpose portions of the CPU to handle the program details, and the SIMD instructions handle the data manipulation only. DSPs also tend to include instructions to handle specific types of data, sound or video for instance, whereas SIMD systems are considerably more general purpose. DSP's generally operate in Scratchpad RAM driven by DMA transfers initiated from the host system - and are unable to access external memory. Some DSP include SIMD instruction sets. The inclusion of SIMD units in general purpose processors has supplanted the use of DSP chips in computer systems, though they continue to be used in embedded applications. A sliding scale exists - the Cell's SPU's and the Ageia Physics Processing Unit could be considered half way between CPU's & DSP, in that they are optimized for numeric tasks & operate in local store, but they can autonomously control their own transfers so are in effect true CPU's.
AdvantagesAn application that may take advantage of SIMD is one where the same value is being added (or subtracted) to a large number of data points, a common operation in many multimedia applications. One example would be changing the brightness of an image. Each pixel of an image consists of three values for the brightness of the red, green and blue portions of the color. To change the brightness, the R G and B values are read from memory, a value is added (or subtracted) from it, and the resulting value is written back out to memory.
With a SIMD processor there are two improvements to this process. For one the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "get this pixel, now get the next pixel", a SIMD processor will have a single instruction that effectively says "get lots of pixels" ("lots" is a number that varies from design to design). For a variety of reasons, this can take much less time than "getting" each pixel individually, as in a traditional CPU design.
Another advantage is that SIMD systems typically include only those instructions that can be applied to all of the data in one operation. In other words, if the SIMD system works by loading up eight data points at once, the add operation being applied to the data will happen to all eight values at the same time. Although the same is true for any superscalar processor design, the level of parallelism in a SIMD system is typically much higher.
DisadvantagesMany SIMD designers are hampered by design considerations outside their control. One of these considerations is the cost of adding registers for holding the data to be processed. Ideally one would want the SIMD units of a CPU to have their own registers, but many are forced for practical reasons to re-use existing CPU registers - typically the floating point registers. These tend to be 64-bits in size, smaller than optimal for SIMD use, as well as leading to problems if the code attempts to use both SIMD and normal floating point instructions at the same time - at which point the units fight over the registers. Such a system was used in Intel's first attempt at SIMD, MMX, and the performance problems were such that the system saw very little use. However, recent x86 processor designs from Intel and AMD (as of November 2006, or several months prior) have eliminated the problems of shared SIMD and floating-point math registers, by providing a new, separate bank of SIMD registers. Still, in most cases the programmer doesn't know which processor model his code will be run on. Packing and unpacking data to/from SIMD registers can be time-consuming in some applications, reducing the efficiency gained. If each datum (say, an 8-bit value) needs to be gathered/dispersed separately rather than loading an entire register in one operation, it is advisable to reorganize the data if possible, or consider not using SIMD at all. Though recently there has been a flurry of research activities into techniques for efficient compilation for SIMD, much remains to be done. For that matter, the state-of-the-art for SIMD, from a compiler perspective, is hardly comparable to that for vector processing. Because of the way SIMD works, the data in the registers must be well-aligned. Even for simple stream processing like convolution this can be a challenging task. Not all algorithms suit vectorization.

General Software Development

Thursday, September 27, 2007

Processor Architectures

Blog Archive

Researcher/Developer

Related Links

Useful Links