Performance Hints for Signal-Processing Applications

The MIPS processors, particularly the newer ones, can execute a substantial amount of real-time signal processing code. But as with all processors, writing efficient code requires an understanding of the processor and compiler architectures. This list is intended to give you some things to think about.

First, my disclaimer. I've based this list upon a number of "gotchas" that I and other audio & image-processing developers have hit. This is by no means a complete list, but it's hopefully got some useful hints. It's a new list and I'm constantly refining it and adding things. If you have your own "gotchas" or suggestions, please send them in!

Thanks to Bruce Karsh, Gints Klimanis, Chris Pirazzi, and Bryan James for their review and contributions.

Doug Scott
(cook@sgi.com)


1. There are few "magic bullets", only guidelines and trade-offs. What speeds up one app may slow down another if the applications are architected differently. Take the time to understand and profile your application to confirm whether these guidelines actually help.
2. Consider the data type you need for each operation. Don't assume that integer types are faster than floating-point; this is often not true. Here are some gross generalizations about what's best for different types of operations:
	Bit-Swizzling & Logical Ops	Integer only
	Addition			Integer faster than FP (older processors)
					FP faster than integer (R5K/R10K/R8K and newer)
	General Multiplication		FP faster than integer
	Division			FP is faster. But if you can replace
					division by multiplication, do so; 
					multiplication is faster.
	Conditionals			Integer is faster

From this, you can sort of see that numerical computations are best done in floating-point, and logical operations are best done in integer. From a computation perspective, floating-point usually has a much larger dynamic range than integer, so if you are scaling or mixing data in your computations, you seldom need to check the results for overflow.

Another thing to consider is that the FP unit has its own register set. When you use FP, you effectively get a much larger register set to use, because your logical code uses the integer registers, whereas the computations use the FP registers. With more registers, some loops may require fewer load and store operations.

Finally, when you use floating-point instructions, you take advantage of more parallelism in the processor. For example, the R5K can do an integer and floating-point operation simultaneously. If you use floating-point, the compiler will usually find an integer operation (like a load or store) to perform simultaneously. The R10K allows even more instruction-level parallelism.

In general, for the typical signal-processing application, which requires a lot of multiply/adds of variables, floating-point offers superior performance. But as always, it depends upon the architecture of the application; your mileage may vary.

This is all somewhat processor-dependent. For example, the R4400 is better at floating-point than the R4600. If your code is targeted mainly at the R4600, the optimal balance of FP to integer operations will differ from that on the R4400. But most of the newer processors provide superior floating-point; in general, that's the way to go for numerical computations.

2a. Another note: one common speed-up for integer operations is to use table-lookup, where possible. This often works well for image processing, where the input is 8- or 10-bit data and the tables are small enough to stay in the cache. For audio operations, table-lookup is less common, since audio data is typically 16-24 bits.
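
For instance, here's a minimal sketch of the table-lookup idea for 8-bit pixels (the gamma curve and the names build_lut/apply_lut are invented for illustration):

	/* Build a 256-entry table once, then process pixels with one load
	 * each instead of recomputing the curve per pixel. */
	#include <math.h>

	static unsigned char lut[256];

	void build_lut(double gamma)
	{
		int i;
		for (i = 0; i < 256; i++)
			lut[i] = (unsigned char)(255.0 * pow(i/255.0, gamma) + 0.5);
	}

	void apply_lut(unsigned char *pix, int n)
	{
		int i;
		for (i = 0; i < n; i++)
			pix[i] = lut[pix[i]];
	}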


3. Avoid unnecessary type conversion. This is a trade-off with #2. Conversion between integer and floating-point can be expensive (particularly FP->int, which sometimes requires limiting). If your application accepts and produces integer data and only does a small amount of signal-processing, the cost of conversion to FP may outweigh the gains in performance from using FP.

Also, be aware: in some cases, the compiler may do type conversion "behind your back," particularly by promoting single-precision to double-precision floating-point (hint #7).
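
As a hedged sketch of the idea (the function and names are invented): do all of a sample's arithmetic in a single FP pass, including the limiting, so each sample pays for only one int->FP and one FP->int conversion.

	/* Mix two 16-bit inputs with gains in FP; note the limiting before
	 * the single FP->int conversion.  Illustrative only. */
	void gain_and_mix(short *a, short *b, short *out, int n,
			  float ga, float gb)
	{
		int i;
		for (i = 0; i < n; i++) {
			float s = (float)a[i] * ga + (float)b[i] * gb;
			if (s > 32767.0f) s = 32767.0f;
			if (s < -32768.0f) s = -32768.0f;
			out[i] = (short)s;
		}
	}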


4. Consider the instruction set (ISA) you want to use. If you're writing in C, selecting an instruction set requires just a compiler flag, e.g. -mips2. There are currently 4 MIPS instruction sets (MIPS 1-4). Each set is a superset of the previous. Roughly speaking, here's how they work:
	MIPS 1		R3000 instructions
	MIPS 2		R4000 instructions
	MIPS 3		R4000 instructions -- introduces 64-bit operations
	MIPS 4		R8000/R10000 instructions --
			notably floating-point multiply/add and conditional FP move

I've noticed that one big benefit of going from MIPS1->MIPS2 is a reduction in the cost of FP->integer conversion. For MIPS1, the compiler generates some inline code to perform this; for MIPS2, that code is replaced by a single instruction. In general, though, you should not expect a huge performance gain going MIPS1->MIPS2.

MIPS3 instructions essentially introduce native 64-bit integer operations. These can be useful in some cases. One example I've seen is a phase accumulator (e.g. from a wavetable synthesizer). The operations on this are typically add (to add the delta from one sample to the next), convert to integer (to compute the sample array index), comparisons (to determine whether or not the loop-point has passed), and subtractions (to bring the accumulator back to the beginning of the loop). Looking in the table under hint #2, we see that this probably works faster with integer data types (fixed-point). However, 32-bit fixed-point may not have enough precision for the task. Using MIPS3 would enable you to have a native 64-bit fixed-point phase accumulator (coded as a "long long" type in C).
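
Here's a minimal sketch of such an accumulator, assuming a 32.32 fixed-point split (the names and layout are mine, not a canonical implementation):

	/* 64-bit fixed-point phase accumulator using "long long" (MIPS3). */
	typedef long long fix64;		/* 32.32 fixed-point */

	static fix64 phase, delta;		/* position and per-sample step */
	static fix64 loop_start, loop_end;	/* loop points, in 32.32 */

	int next_index(void)
	{
		phase += delta;			/* add the delta */
		if (phase >= loop_end)		/* passed the loop point? */
			phase -= loop_end - loop_start;	/* wrap back */
		return (int)(phase >> 32);	/* integer sample index */
	}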

The MIPS4 instruction set provides some nice opportunities for signal-processing applications. It provides an FP multiply/add instruction and floating-point conditional-move instructions. The latter allow you to write branch-free limiting code, which is a big win in some cases (see #5).
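
As a hedged sketch, a clamp written with conditional expressions like the one below gives a MIPS4 compiler the chance to use conditional FP moves instead of branches; disassemble (hint 4a) to confirm yours actually does.

	/* Limiting without explicit if/else branches. */
	float clamp(float s, float lo, float hi)
	{
		s = (s < lo) ? lo : s;
		s = (s > hi) ? hi : s;
		return s;
	}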

With each instruction set, you may get a performance gain at the cost of compatibility with older MIPS-architecture processors. In other words, if you compile -mips2, you may see some speedup, but you lose R3000 compatibility. This is a trade-off you will have to consider. Some people maintain multiple versions of a binary using different ISAs. (If your application uses "inst" for installation, you can have "inst" automatically select the appropriate binary for the end-user's machine.)

MIPS3 and MIPS4 are only supported on IRIX 6.X, and MIPS4 requires an R8K or R10K processor.


4a. Disassemble your code and look at it. Use the "dis" command on your actual binary, not the code generated by "cc -S" (the compiler lies: the assembly listing often substitutes higher-level macros for the more complex instruction sequences that actually occur in the binary itself).

There's really no substitute for this. Looking at the source code and thinking about the algorithms at a high level is a first step, but in the end, if you need to squeeze every bit of performance out of your code, you need to examine the actual instructions that the CPU will execute.


5. Avoid branches where possible in critical code. If you can replace a bit of frequently executed code with an equivalent branch-free piece of code, it may run in fewer cycles even though it has more instructions. This is because the processors are pipelined, and for some of them, a branch requires several extra cycles to refill the pipeline.

Another reason to avoid branches is that many of the compiler optimizations are limited to basic blocks, i.e. the region between two branches. Removing branches often gives the compiler a better chance to optimize the code. Finally, branches inside loops reduce the effectiveness of loop unrolling.
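
One common transformation, sketched below with invented names: hoist a loop-invariant branch out of the inner loop, so each loop body becomes a single basic block.

	/* Test "mode" once, not once per sample. */
	void process(float *buf, int n, int mode, float gain)
	{
		int i;
		if (mode == 0) {
			for (i = 0; i < n; i++)
				buf[i] *= gain;		/* branch-free body */
		} else {
			for (i = 0; i < n; i++)
				buf[i] = -buf[i] * gain;
		}
	}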


6. Work with the cache. Though the processor is very fast, the memory system is relatively slow. If you don't get good cache performance, your application will suck wind. A good rule of thumb is the order-of-magnitude rule: think of each successive level of cache, going from the processor out to main memory, as 10 times as slow as the previous level. Suppose an instruction that hits the primary cache takes 1 cycle. The same instruction may require 10 cycles if it misses the primary cache and hits the secondary cache, and 100 cycles if it misses the secondary cache and requires a cache-line fetch from main memory. The effect becomes more pronounced as processors get faster. Cache effects can thus make a huge difference in program performance.

Organize your code and your data to optimize your use of the cache. Minimize your inner loops and working sets of data. For example, in image-processing, "tiling" is a common practice. Suppose a number of consecutive operations are to be performed on an image. The "obvious" way would be to perform each operation on the entire image, then proceed to the next operation. However, assuming the image doesn't fit in the cache, this gets the worst possible cache performance, since the image has to be reloaded from main memory every time it is touched. A far superior approach is to operate on small "tiles" which fit in the cache: perform all the operations on one tile, then proceed to the next.
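
Here's a minimal sketch of the tiling idea (the operations and tile size are invented; tune the tile to your cache):

	#define TILE 4096	/* bytes per tile; pick so a tile fits in cache */

	void process_image(unsigned char *img, int size)
	{
		int t, i, n;
		for (t = 0; t < size; t += TILE) {
			n = (size - t < TILE) ? size - t : TILE;
			for (i = 0; i < n; i++)		/* operation 1 */
				img[t+i] = 255 - img[t+i];
			for (i = 0; i < n; i++)		/* operation 2 */
				img[t+i] >>= 1;
		}
	}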

There are lots of other tricks to improve cache performance, but hopefully this short example will get you thinking. If you need data in a critical loop, try to make sure it's in the cache every time the loop is executed.


7. Watch out for floating-point promotion. Double-precision arithmetic is typically more costly than single-precision, so if you don't need it, make sure you're not using it. In some cases the compiler will promote operands to double-precision "behind your back." Here are some things to look for. Check out the "-float" option on the C compiler, which instructs the compiler (in K&R mode) not to promote operands to double-precision. It's also a very good idea to use function prototypes for functions taking "float" arguments -- otherwise the arguments will be promoted to double-precision (even with the -float option). Finally, if a floating-point constant is to be single-precision, write it in the form "0.0f", i.e. explicitly single-precision, to avoid promotion to double-precision.
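
A small sketch pulling these together (the function is invented for illustration):

	/* The prototype keeps the "float" arguments single-precision, and
	 * the "f" suffix keeps the constant single-precision. */
	float attenuate(float s, float g);

	float attenuate(float s, float g)
	{
		return s * g * 0.5f;	/* "0.5" alone would be a double */
	}
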
8. Don't assume that assembly language is inherently better than C code. Write your algorithm in C unless you find by disassembly that the compiler is missing some possible optimizations, or unless there's some special trick you need that can really only be expressed in assembly. In most cases you'll find that the compiler does a really good job.

Moreover, the optimal code scheduling differs from processor to processor. By writing in C (or another high-level language), and telling the compiler to generate code for the appropriate target architecture, you make it much easier to optimize for different architectures.


9. Don't assume that pointer arithmetic is inherently better than array indexing. The normal addressing mode of the MIPS processor (a fixed offset from a register) lends itself nicely to array indexing. Consider the following two loops:
	int array[50], i;

	for (i=0; i < 50; i++) {
		array[i]=0;
	}

	int *p=array;
	for (i=0; i < 50; i++) {
		*p++=0;
	}

Both are functionally equivalent if the value of p is not used later in the second case. Some people shy away from the first approach on the mistaken belief that the address generation for the array indexing is computationally expensive. In fact, the compiler generates code for the first version which has 20% fewer instructions per iteration than the code for the second version. Why? The compiler often generates better code for simpler expressions, because it can better understand what the code is trying to accomplish. In the first case, the compiler sees that the code is looping on an array index. It calculates the end address of the array (200 from the base, since it is an integer array), and generates a loop which has only one variable -- the address in the array. "i" as such is nowhere to be seen. The loop requires four instructions per iteration:

  [test1.c:   5] 0x4009a8:      248500c8        addiu   a1,a0,200
  [test1.c:   5] 0x4009ac:      24840004        addiu   a0,a0,4
  [test1.c:   5] 0x4009b0:      0085082b        sltu    at,a0,a1
  [test1.c:   5] 0x4009b4:      1420fffd        bne     at,zero,0x4009ac
  [test1.c:   6] 0x4009b8:      ac80fffc        sw      zero,-4(a0)

The second example is too "smart" for the compiler. The compiler can't quite figure out what the programmer is trying to accomplish, so it generates two loop variables, one for "i" and one for "p" (even though "p" is never used after the loop). This example requires 5 instructions per iteration, since it has to increment two loop variables.

  [test2.c:   4] 0x4009a8:      24020032        li      v0,50
  [test2.c:   6] 0x4009ac:      00002025        move    a0,zero
  [test2.c:   6] 0x4009b0:      24840001        addiu   a0,a0,1
  [test2.c:   6] 0x4009b4:      0082082a        slt     at,a0,v0
  [test2.c:   7] 0x4009b8:      ac600000        sw      zero,0(v1)
  [test2.c:   6] 0x4009bc:      1420fffc        bne     at,zero,0x4009b0
  [test2.c:   7] 0x4009c0:      24630004        addiu   v1,v1,4

(If you're not familiar with MIPS assembly, note that the instruction after the branch [bne] is executed before the branch actually occurs. This is the so-called "branch delay slot".)

Some folks have tested numerous ways of stating loops to see which ones the compiler compiles into the fastest code. This is an interesting experiment, but the results may not hold from compiler release to compiler release. However, the general principle holds: don't be afraid to state your loops in simple terms. Often the compiler does best with the simplest constructs.


10. Watch out for expressions hidden in macros. Know what's a macro and what's a function. An expression passed to a macro may be evaluated multiple times; this causes a performance hit when the extra evaluations aren't required for correct operation of the macro.
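
For example (SQUARE and filter are invented here), a classic offender is a macro that expands its argument twice:

	#define SQUARE(x) ((x) * (x))

	extern float filter(float s);

	float power(float s)
	{
		float t = filter(s);	/* evaluate the expensive call once */
		return SQUARE(t);	/* SQUARE(filter(s)) would call it twice */
	}
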
11. Consider unrolling your loops. Often the compiler will do this for you, but you can sometimes glean a little more performance by doing it yourself. Loop unrolling works as follows. Consider this loop:
	for (i=0; i < x; i++) {
		array[i]=0;
	}

We've seen above (#9) that the compiler generates somewhat smart code for this. For each iteration of the loop, the pointer is incremented and tested against its end value. But we can cut down on the number of instructions per array item if we write the loop like this:

	for (i=0; i < x; i+=4) {
		array[i]=0;
		array[i+1]=0;
		array[i+2]=0;
		array[i+3]=0;
	}

This loop is faster than the first one. The trick is that the pointer increment and test-against-end-value need only be done for every 4 array elements, instead of for every array element. There's no extra computation overhead associated with the (i+n) indices, because these make use of the MIPS addressing mode: the compiler can determine the fixed offset (represented by n) from the base address represented by i. Also, since there are fewer iterations, there are fewer branches. As noted above, a pipelined processor likes to execute instructions linearly (#5). The compiler-generated code looks like:

  [test3.c:   6] 0x4009a8:      248500c8        addiu   a1,a0,200
  [test3.c:   6] 0x4009ac:      24840010        addiu   a0,a0,16
  [test3.c:   6] 0x4009b0:      0085082b        sltu    at,a0,a1
  [test3.c:   7] 0x4009b4:      ac80fff0        sw      zero,-16(a0)
  [test3.c:   8] 0x4009b8:      ac80fff4        sw      zero,-12(a0)
  [test3.c:   9] 0x4009bc:      ac80fff8        sw      zero,-8(a0)
  [test3.c:   6] 0x4009c0:      1420fffa        bne     at,zero,0x4009ac
  [test3.c:  10] 0x4009c4:      ac80fffc        sw      zero,-4(a0)

Note that this snippet assumes that x is divisible by 4, unlike the original loop. There's a way around this for general values of x: split the loop into two loops. The first is not unrolled, and processes (x & 3) elements. The remaining number of elements is guaranteed to be divisible by 4, and the second, unrolled loop handles the bulk of the elements. If x is going to be very small, it's not worth the overhead, but for most values of x, this approach is a big win.
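
Continuing the example, here's one way the split might look (a sketch, not compiler output):

	int head = x & 3;		/* 0..3 leftover elements */
	for (i = 0; i < head; i++)	/* not unrolled */
		array[i] = 0;
	for (; i < x; i += 4) {		/* x - head is divisible by 4 */
		array[i]   = 0;
		array[i+1] = 0;
		array[i+2] = 0;
		array[i+3] = 0;
	}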

So why did we choose 4? First, it's a power of two, which makes the modulo operation cheap (x & 3, as seen above). It's also fairly small. As the loop-unrolling block size becomes larger, you get diminishing returns. At some point, things actually slow down because the code no longer gets nice cache behavior. Common values for the block size are 4, 8, and sometimes 16.

Experiment with your code to see what works well. As always, if it's a critical loop, disassemble it to see what the compiler is doing for you.

A critical thing to note is that the compiler will usually do loop unrolling for you, unless it determines that there is little benefit to doing so. One reason it may decide this is that pointers inside the loop may be aliased. See #12.


12. Understand pointer aliasing. This is really important for writing good signal-processing code. Here is a new little blurb called Performance Implications of Pointer Aliasing.
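
To give the flavor of the problem (a hedged sketch; see the blurb for the real story): if the compiler must assume two pointers can overlap, it has to load and store through them conservatively, which defeats unrolling and scheduling.

	/* If "in" and "out" might alias, each store to out[i] could modify
	 * in[], so the compiler can't freely reorder or unroll this loop. */
	void mix(float *in, float *out, int n, float g)
	{
		int i;
		for (i = 0; i < n; i++)
			out[i] += in[i] * g;
	}
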
13. Watch out for floating-point exceptions. These can really hurt your performance. Check out the excellent little blurb that Chris Pirazzi has written, available from the audio apps page.

[...more to come...]