Having determined the overall structure of the DSP firmware in the previous entry, we started work on the UKF port. This is likely to be the most computationally intensive component, involving large numbers of double-precision matrix operations and high update rates, so our target was approx 40% single-core utilisation (about 500,000 cycles per iteration, with 1000 iterations per second).
After making a few minor modifications to get the latest version of Eigen compiling under TI’s Code Composer Studio compiler for the C6600-series DSPs, and splitting the UKF source into separate files in order to avoid exceeding the Windows 32-bit per-process memory limit during compilation, I ran a simple UKF test to check performance.
Without optimisation, a single iteration completed in 43 million cycles;
with all optimisation settings turned on, it took approx 20 million cycles.
For comparison, a single iteration of the same test compiled with Clang at
-O3 and running on a Core i7 completes in under 200,000
So, it was clear that while Eigen performs amazingly well on compilers with strong template and inlining support, TI’s compiler wasn’t able to get good performance out of it. Rather than attempt to learn both low-level DSP programming and the Eigen backend simultaneously, potentially only to find that the performance issues were inherent to the C6600 compiler’s handling of templates or something else outside our control, we decided to use the Eigen UKF implementation as a reference and implement an alternative version in C using the existing C interface to the Eigen-based UKF code.
This was relatively straightforward, although the C version is less flexible and twice as long; results are very nearly identical between the C and Eigen implementations, and the minor differences in output are largely due to the differences in matrix inverse implementation and the order in which sigma points are averaged to derive the a priori mean.
The naïve C implementation, using entirely cache-oblivious operations and
the most straightforward matrix multiply, Cholesky decomposition and matrix
inverse algorithms, performed within 10% of the Eigen version at
-O3, and with optimisation disabled, about four times as fast.
However, after compiling for the C6657 and running a few iterations of the same test in the cycle-approximate simulator for the C6674 (same core, but 4 instead of 2), it became clear that the UKF was still taking 2.5 million cycles to execute. Much better than the previous 20 million, but still a long way from the 500,000 upper limit.
At that point I started looking more deeply into TI’s optimisation and compiler manuals (which are great), and realised I’d essentially made the worst possible implementation choices for the DSP architecture at almost every point, and I’d been given a false sense of accomplishment by the general brilliance of the Clang compiler and x86 architecture.
It turns out that to get good performance on the C6657 (and really any of
TI’s C6000-series DSPs) it’s important to make heavy use of the optimised loop
handling built into the processors, as well as the automatic
scheduling/pipelining functionality in the compiler. I found pretty much
every time I performed the same operation four or more times, I could get
a 100%+ improvement in performance by putting it into a
and using the
MUST_ITERATE pragma to give the compiler
information about expected iteration counts. Looking at the assembly output,
with a bit of help the compiler is able to get very close to the theoretical
2 double-precision multiply-accumulates per loop iteration, so after moving
everything into fairly tight loops, adding some specialised vector outer
product operations to replace matrix multiplications, and switching to
optimised versions of the square root, inverse square root and divide
operations the UKF was down to worst-case performance of 521,000 cycles per
iteration with asserts and debugging symbols enabled (10–20% performance hit).
Further optimisation will require improvements to be made to the general
matrix multiply function and/or the Cholesky decomposition and matrix inverse
functions, which are a bit trickier to implement correctly than the changes
I’ve made to this point (although still quite straightforward in the scheme
of things). Another potential option for improvements in performance and
stability would be alternative formulations such as the square-root UKF.
Since we’re running the UKF in double-precision mode, and the DSP supports two double-precision multiply-accumulates per cycle compared with eight in single-precision, it’s likely that the single-precision NMPC code will be much easier to get running well.
Next I’ll be writing the code to interface the DSP with the two I/O boards, as well as the IPC and logging routines. Although the DSP is likely to perform far worse at this than an x86 CPU, the other end of the I/O board interfaces is a pair of AVR32UC3Cs running at 48MHz and approx 30% CPU utilisation, so it’s not going to be an issue for the DSP to keep up.
The in-progress FCS source code is available at sfwa/fcs.
The C66x-optimised UKF code is in the ccs-c66x directory within the UKF repository, sfwa/ukf.