Implications of Performance
There is no good rule to pick the optimal target clock frequency.
For this course: start with a clock period of 10ns
HLS will attempt to optimise both the clock-cycle count and the clock speed.
Operational Chaining
The high-level synthesiser will look at the operations in a function and attempt to reduce the number of clock cycles required, possibly at the cost of a lower clock frequency.
As we lengthen the clock period (i.e. slow the clock frequency), we can chain more operations together and pack more functionality into each cycle.
Code Hoisting
A code optimisation that moves computation shared by several paths (e.g. both branches of a conditional) out of those paths, removing redundant logic from uncommon code paths.
Loop Fission
Split a loop into multiple loops. Each resulting loop can then be treated and optimised independently (and the loops can run in parallel).
Loop Unrolling
By default, HLS synthesises loops sequentially, creating a datapath that executes once per iteration of the loop. Unrolling replicates the loop body [within the same loop] and splits the work across the copies.
We can insert the directive #pragma HLS unroll factor=2 to automate this.
If we don't specify a factor argument, the loop will be unrolled completely. This maximises the hardware resource usage (and can take a long time to synthesise). For complete unrolling, the bounds of the loop need to be statically defined (i.e. known at compile time).
If the unroll factor does not evenly divide the loop bound (or the bound is variable), HLS must insert an exit check so the extra copies of the body are not executed on the final trip.
Loop Pipelining
Overlapping of executions (where possible).
We can use the directive #pragma HLS pipeline II=2, which will attempt to achieve an II of 2. If we don't specify the II argument, HLS will attempt to minimise the II (targeting II=1).
Loop Performance Metrics
- Iteration latency - number of cycles it takes to perform one iteration of the loop body
- Loop latency - number of cycles to complete the entire execution of the loop, plus one cycle to determine that the loop is finished / for a writeback
- Vivado HLS reports the loop latency prior to the writeback
- Initiation interval (II) - number of cycles before the next iteration of the loop can start
- A higher II value can potentially increase the maximum operating frequency (fMAX) without a decrease in throughput
Bit-width Optimisation
See here
Loop Interchange / Pipeline-interleaved Processing
Swapping the order of nested loop variables to reduce repeated lookups and improve memory access patterns.
See here
Function Pipelining
When pipelining a function, all loops contained in the function are unrolled; this is a requirement for pipelining.
Pipelining loops gives you an easy way to control resources, with the option of partially unrolling the design to meet performance.
False Dependencies
For operations on block RAM (which is dual-ported), we can read at one address while writing at another in the same cycle, provided the two addresses x0 and x1 are independent; if the tool cannot prove this, it must alternate between reads and writes to complete the operation.
What if they are not actually independent? For instance, we might know that the source of data never produces two consecutive pieces of data that actually have the same bin. What do we do now? If we could give this extra information to the HLS tool, then it would be able to read at location x1
while writing at location x0
because it could guarantee that they are different addresses. In Vivado® HLS, this is done using the dependence directive.
To overcome this deficiency, you can use the DEPENDENCE directive to provide Vivado HLS with additional information about the dependencies.
Inter: Specifies that the dependency is between different iterations of the same loop.
If this is specified as FALSE, it allows Vivado HLS to perform operations in parallel if the loop is pipelined, unrolled, or partially unrolled, and prevents such concurrent operation when specified as TRUE.
Intra: Specifies a dependence within the same iteration of a loop, for example an array being accessed at the start and at the end of the same iteration.
When intra dependencies are specified as FALSE, Vivado HLS may move operations freely within the loop, increasing their mobility and potentially improving performance or area. When the dependency is specified as TRUE, the operations must be performed in the order specified.