Improving Performance

Parallelism

  • Widen the basic word length of the machine
    • 8 bit
    • 16 bit
    • 32 bit
    • 64 bit
  • Vector execution
    • Execute a single instruction on multiple pieces of data (see the sketch after this list)
  • Parallelise instructions
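
As a concrete illustration (assuming an x86 CPU with SSE; other ISAs have analogous extensions such as Arm NEON), a single _mm_add_ps instruction adds four floats at once:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        __m128 va = _mm_loadu_ps(a);    /* load 4 floats into one vector register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb); /* ONE instruction adds all 4 lanes */
        _mm_storeu_ps(c, vc);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]); /* 11 22 33 44 */
        return 0;
    }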

Instruction-level Pipelining

Deep Pipeline

The depth of the pipeline is increased (more stages) to achieve higher clock frequencies.

Limitations

  • Stage delay cannot be arbitrarily reduced
  • Each pipeline register adds its own delay (see the rough model below)
  • The pipeline flush penalty grows: a flush discards more in-flight instructions
  • The memory hierarchy can stall execution
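
A rough model (illustrative, not from the source): if the total combinational logic delay is T_logic and each pipeline register adds an overhead t_reg, an N-stage pipeline has a cycle time of roughly

    T_cycle ≈ T_logic / N + t_reg

so the clock frequency saturates near 1 / t_reg however deep the pipeline becomes, while the cost of a flush grows with N.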

CPU with Parallel Processing

Multiple execution units can operate simultaneously by executing groups of instructions in parallel

(Software) VLIW Architecture

VLIW - Very Long Instruction Word

  • Issuing more instructions in parallel raises problems:
    • May create more data hazards
    • Forwarding in the pipelined datapath becomes hard
    • Identifying parallel instructions is not easy
  • More aggressive scheduling is required


Example: the SUB.D instruction does not use the previous value of the F8 register. To increase performance, we can change the register written by SUB.D, which removes the false data dependency and allows the instructions to run in parallel.
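
The original figure is not reproduced here; below is a minimal C-style sketch of the same renaming idea (the register choices follow the classic textbook WAR example, not necessarily the figure):

    /* Before: ADD.D reads F8, then SUB.D overwrites F8 (WAR hazard).
       SUB.D does not use the old value of F8 - it only writes it. */
    f6 = f8 + f2;    /* ADD.D F6, F8, F2   (reads F8)  */
    f8 = f10 - f14;  /* SUB.D F8, F10, F14 (writes F8) */

    /* After renaming SUB.D's destination to a fresh register, the
       write-after-read dependency disappears and both instructions
       can be scheduled in parallel. */
    f6  = f8 + f2;   /* ADD.D F6,  F8,  F2  */
    f16 = f10 - f14; /* SUB.D F16, F10, F14 */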

(Hardware) Superscalar Architecture

Dynamic Scheduling

TL;DR - Each execution unit has its own queue

The hardware issue component in the processor schedules instructions to different parallel execution units

  • Track instruction dependencies to allow instruction execution as soon as all operands are available
  • Rename registers to avoid WAR and WAW hazards

Issue

  • Get next instruction from the queue
  • Issue the instruction, together with any operands already available in the register file, to a matching free reservation station entry; if none is free, stall

Execute

  • Execute ready instructions in the reservation stations
  • Monitor the CDB (Common Data Bus) for the operands of not-ready instructions
  • The execution unit idles until a ready instruction is available

Write Result

  • Results from the EU are sent through the CDB to destinations
    • Reservation station
    • Memory load buffers
    • Register file
  • The write operations to the destinations should be controlled to avoid data hazards

Special data structures in the register file, reservation stations and memory buffers are used to detect and eliminate hazards

Reservation Station



The register state table holds, for each register, a link (tag) to the execution unit that will produce the register's value
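
A minimal C sketch of this bookkeeping (field names follow the classic Tomasulo formulation; sizes are arbitrary):

    #define NUM_RS 8

    /* One reservation-station entry. */
    typedef struct {
        int    busy;   /* entry in use?                               */
        int    op;     /* operation to perform                        */
        double vj, vk; /* operand values (valid once qj/qk are 0)     */
        int    qj, qk; /* tags of the entries that will produce the
                          operands; 0 means the value is already in v */
    } RS;

    RS     rs[NUM_RS];
    double regfile[32];
    int    reg_status[32]; /* per register: tag of the unit that will
                              write it, or 0 if the file is current   */

    /* Write Result: broadcast (tag, value) on the common data bus;
       every waiting consumer compares tags and captures the value.  */
    void cdb_broadcast(int tag, double value) {
        for (int r = 0; r < NUM_RS; r++) {
            if (rs[r].busy && rs[r].qj == tag) { rs[r].vj = value; rs[r].qj = 0; }
            if (rs[r].busy && rs[r].qk == tag) { rs[r].vk = value; rs[r].qk = 0; }
        }
        for (int i = 0; i < 32; i++)
            if (reg_status[i] == tag) { regfile[i] = value; reg_status[i] = 0; }
    }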

Dynamic Execution with Speculation

  • Issue, Execute, Write Result, and COMMIT

The commit step allows instructions to execute out of order, but forces them to commit (update architectural state) in the correct program order
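
In practice the commit step is usually built around a reorder buffer (ROB); a minimal sketch (a simplification, not the full mechanism):

    #define ROB_SIZE 64

    typedef struct {
        int    valid; /* entry allocated, in program order?  */
        int    done;  /* has the result arrived from the EU? */
        int    dest;  /* destination register                */
        double value; /* result waiting to be committed      */
    } RobEntry;

    RobEntry rob[ROB_SIZE];
    int head = 0; /* oldest un-committed instruction */

    /* Retire finished instructions strictly from the head, so the
       architectural register file is updated in program order even
       when execution completed out of order. */
    void commit(double regfile[]) {
        while (rob[head].valid && rob[head].done) {
            regfile[rob[head].dest] = rob[head].value;
            rob[head].valid = 0;
            head = (head + 1) % ROB_SIZE;
        }
    }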

Thread Level

Multithreaded Processors

When one thread cannot proceed due to an operation delay (e.g. a memory access taking a long time), the processor can switch to another thread

Hardware level multithreading

Hardware thread switching has much less overhead than a software context switch

  • Fast switching between threads
  • Requires extra resources: replicated registers, PC, etc.

Fine-grained Multithreading

Round-robin approach

  • Switch threads after each cycle
  • If one thread stalls, another is executed

Coarse-grained Multithreading

  • Only switch threads on long stalls (e.g. an L2 cache miss)

Simultaneous Multithreading (SMT)

  • A variation of hardware multithreading that uses the resources of a superscalar architecture
  • Exploits both instruction-level parallelism and thread-level parallelism
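
For example, Intel's Hyper-Threading is a commercial SMT implementation that presents one physical core to the operating system as two logical processors.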

System Level - GPUs

Graphics Processing Units are processors developed for processing large amounts of data at once (e.g. all the pixels on a screen)

Typical Tasks

  • HSR - Hidden Surface Removal (Remove hidden parts of a 3D object to be shown on a [2D] screen)
  • Shading - Making a flat object look more 3D-like
  • Texture Mapping - Providing high frequency details, surface texture, colour information

Many tasks require a huge degree of parallelism; however, the individual work items are typically independent (they do not rely on each other)
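For example, the colour of one pixel can usually be computed without reference to any other pixel, so millions of pixels can be processed at the same time.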

Example: a powerful but single-threaded processor

Example: cheap, but with multiple processors


SIMD

Since multiple processors are performing the same instruction, just on different data fragments, the instructions can be shared (same fetcher/decoder).

Each execution unit has its own local memory, and they all share a larger memory

If a stall occurs (e.g. data is not yet available), the processors can switch to another thread and continue execution

Remarks

  • Use many cheap cores and run them in parallel
    • Easier than improving a single core by n times
  • Pack cores full of ALUs and share instruction streams across groups of data sets
    • e.g. SIMD vector instructions
  • Avoid long stalls by interleaving execution of many threads

Note: CUDA

CUDA (Compute Unified Device Architecture) is NVIDIA's platform to use their GPUs for arbitrary operations that require parallel computation.

Programs are written in a C-like programming language, and all forms of GPU parallelism are unified as the CUDA thread.
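
A minimal CUDA sketch (the kernel and variable names are my own, not from the source): each CUDA thread computes one output element, which is how the per-pixel/per-element parallelism above maps onto the hardware:

    #include <cstdio>

    /* One instruction stream, many data fragments: every thread runs
       this kernel, selecting its own element via its thread index. */
    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        /* Launch enough 256-thread blocks to cover all n elements. */
        add<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]); /* 3.000000 */
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }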