Copying Data Quickly

By Chris Pirazzi and Micheal Minakami

Word/Doubleword Alignment Helps

Whenever you need to copy large buffers of data, keep in mind that SGI systems can copy data more efficiently if the source and destination buffers are properly aligned, since the code can use lw/ld and sw/sd instructions to copy multiple bytes at a time. For a copy from address src to address dst, the relative alignment of the pointers determines the efficiency:

   (src % 2) == (dst % 2) - probably won't help
   (src % 4) == (dst % 4) - definitely will help
   (src % 8) == (dst % 8) - often helps even more
   even better alignment   - usually same as 8
Referring to the 4-byte alignment case on some Indigo2-class machines, Micheal Minakami tells us that "when copying between buffers with the same alignment, throughput is approximately 400% greater than when copying between buffers with mismatched alignments." He also notes that "the memory allocation routines such as malloc return double-word (64-bit aligned) buffers." Entries in VLBuffers and DMbufferpools are also always at least 8-byte aligned.
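When you don't control the buffers' alignment in advance, a copy routine can still exploit the table above by checking whether the two pointers agree mod 8. Here is a minimal sketch (fast_copy is a hypothetical name, and C99's uint64_t stands in for IRIX's __uint64_t):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, using 64-bit loads/stores (ld/sd)
 * when src and dst share the same 8-byte alignment -- the fast case
 * from the table above. */
static void fast_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* The doubleword path only works when the pointers agree mod 8. */
    if (((uintptr_t)d % 8) == ((uintptr_t)s % 8)) {
        /* Copy leading bytes until both pointers are 8-byte aligned. */
        while (n > 0 && ((uintptr_t)d % 8) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Main loop: one 64-bit load and one 64-bit store per word. */
        while (n >= 8) {
            *(uint64_t *)d = *(const uint64_t *)s;
            d += 8;
            s += 8;
            n -= 8;
        }
    }
    /* Trailing bytes, or the whole buffer in the mismatched case. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

In the mismatched case this degenerates to a byte loop; a production routine would also handle the (src % 4) == (dst % 4) case with 32-bit words.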

Avoid Cache Thrashing

Another performance gotcha occurs if you copy between buffers that are too strictly aligned in virtual or physical address: cache thrashing. To check for and fix this problem, you will need to know about the primary and secondary CPU caches on your target platforms. The MIPS processor manuals, available in bookstores and online, list the details (cache type, cache size, and cache line size).

SGI primary caches are virtually indexed, meaning that for a cache_size byte cache whose lines are each cache_line_size bytes long, a reference to a virtual address V will map to cache line number (V%cache_size)/cache_line_size.

Say you're copying from a buffer A to a buffer B, and both buffers are 32k-aligned. Say your computer has a cache_size==32k primary data cache, and that the cache is direct-mapped, meaning each cache line can store data for only one address. If the inner loop of your copy looks like "B[i]=A[i]", then the math above tells us that &A[i] and &B[i] will always map to the same cache line. Thus each reference to A[i] and B[i] in the inner loop will miss in the primary cache, severely hurting performance.

To solve this, you need to reduce the number of cache misses by reducing the number of times the same cache line gets reused for a different address. If you can control the addresses of buffers A and B, then allocating A and B such that:

  (A%cache_size)/cache_line_size != (B%cache_size)/cache_line_size.
will minimize cache thrashing. If you can't control the addresses of A and B (for example, if you're copying between VLBuffers or DMbuffers), then you should try to read a whole cache line at a time from buffer A into registers, and then write a whole cache line at a time from registers into buffer B. For cache_line_size == 32 bytes, this might look something like:
  __uint64_t *A, *B;  /* source and destination, 8-byte aligned */
  int i, max;         /* max = buffer length in 64-bit words */
  assert((max % 4) == 0);
  for (i = 0; i < max; i += 4) {
      __uint64_t tmp0, tmp1, tmp2, tmp3;
      tmp0 = A[i+0];
      tmp1 = A[i+1];
      tmp2 = A[i+2];
      tmp3 = A[i+3];
      B[i+0] = tmp0;
      B[i+1] = tmp1;
      B[i+2] = tmp2;
      B[i+3] = tmp3;
  }
You should disassemble any C code written like this to make sure the compiler did the right thing, or write assembly yourself.
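The first workaround above, choosing buffer addresses so A and B fall on different cache lines, can be sketched as follows. The cache numbers are hypothetical Indigo2-class values (check your processor manual), and alloc_unaliased is a made-up helper name:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical Indigo2-class primary data cache numbers; consult the
 * MIPS processor manual for the values on your machine. */
#define CACHE_SIZE      (32 * 1024)
#define CACHE_LINE_SIZE 32

/* Primary cache line that virtual address v maps to:
 * (V % cache_size) / cache_line_size, as in the text. */
static uintptr_t cache_line_of(const void *v)
{
    return ((uintptr_t)v % CACHE_SIZE) / CACHE_LINE_SIZE;
}

/* Allocate a buffer whose start maps to a different cache line than
 * src.  One extra line of padding suffices, because advancing an
 * address by CACHE_LINE_SIZE always changes its line number.  The raw
 * malloc pointer is returned through raw_out so the caller can free
 * it. */
static void *alloc_unaliased(const void *src, size_t n, void **raw_out)
{
    char *b = malloc(n + CACHE_LINE_SIZE);
    *raw_out = b;
    if (b != NULL && cache_line_of(b) == cache_line_of(src))
        b += CACHE_LINE_SIZE;
    return b;
}
```

This only separates the buffers' starting lines; for a long copy the two pointers still sweep the cache in step, so in practice you would offset B by several lines (or half the cache) relative to A rather than just one.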

If your primary cache is 2-way set-associative instead of direct-mapped, then a copy operation ("B[i]=A[i]") will not thrash in the cache, but an operation involving three pointers ("C[i]=A[i]+B[i]") will, and the solution is the same.

SGI secondary caches are indexed by physical address, and it is possible for them to thrash in the same way. Because secondary caches are so much larger, and because a given range of virtual memory (say, our buffers A and B) sometimes maps to discontiguous physical pages, secondary cache thrashing is not as common. On O2 (mvp), it happens that video buffers are made up of 64k-aligned physical memory chunks of 64k in size, and so O2 customers notice this problem more often. Since a normal user-mode app cannot control the physical address of any memory it allocates, the only workaround is the second one above: try to read a whole secondary cache line into registers, and then write from registers into a whole secondary cache line.

Kernel Data Copying

The kernel can bcopy() faster than you can from user-mode on some SGI systems, using some protected instructions. The speedup is typically around 150%. So don't be surprised if copies in the kernel (for example, buffered I/O calls as described in Software Methods for Disk I/O) exceed the performance of the best user-mode bcopy() benchmark that you can cook up.
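One way to see this gap is to measure the best user-mode rate you can reach and compare it against kernel-assisted copies such as buffered read()/write(). A rough sketch of the user-mode side (copy_mb_per_sec is a hypothetical helper; memcpy stands in for bcopy, and clock() timing is coarse, so use large buffers and many passes):

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Rough user-mode copy benchmark: returns MB/s achieved by memcpy
 * over a buffer of the given size, averaged over several passes.
 * Results vary with buffer size because of the cache effects
 * discussed above. */
static double copy_mb_per_sec(size_t bytes, int passes)
{
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    double secs, rate = 0.0;
    clock_t t0, t1;
    int p;

    if (src != NULL && dst != NULL) {
        memset(src, 1, bytes);      /* touch pages before timing */
        t0 = clock();
        for (p = 0; p < passes; p++)
            memcpy(dst, src, bytes);
        t1 = clock();
        secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (secs > 0.0)
            rate = ((double)bytes * passes) / (1024.0 * 1024.0) / secs;
    }
    free(src);
    free(dst);
    return rate;    /* 0.0 means allocation failed or timer too coarse */
}
```

Run it with buffers both smaller and larger than your secondary cache to see how much of the headline number is really cache bandwidth rather than memory bandwidth.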