Memory Speed Benchmarks

By Brian Beach.

NOTE (from Chris Pirazzi): These statistics were collected in early 1997. By the time you read this, clock speed and other system enhancements will have changed the numbers significantly. We recommend you read the text and run the provided test program on the modern platforms you are interested in. In particular, you will find that the R10000 O2 numbers you get from a recent system are significantly better than what is shown here.

This page contains benchmark results for several different SGI computers. For applications that access memory intensively, these numbers can help in tuning the program to attain the highest possible performance on each machine. These tests are designed to mimic the memory activity of programs that do simple processing of large blocks of data, for example running a simple filter on a video-sized image buffer and putting the result into another buffer.

The numbers were gathered with memspeed.c++, a simple test program. The program allocates 2MB memory blocks, and scans through them from beginning to end. The 2MB size was chosen because it is bigger than the cache on all of the machines tested. Reading, writing, and copying are all measured, using a simple for loop that iterates through the memory. The inside of the loop for each of these operations is:

read: tmp = *src++;
write: *dst++ = 0;
copy: *dst++ = *src++;

The program includes three different versions of each test, one for each of three data types: uint8_t, uint32_t, and uint64_t. It also runs the tests for both cached and uncached memory.
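The three loop bodies and the per-type variants can be sketched as follows. This is a minimal illustration, not the actual memspeed.c++ source: the helper names (readBlock, writeBlock, copyBlock, readRateMBps), the use of std::chrono for timing, and the choice of 1MB = 1024*1024 bytes are all assumptions, and the sketch only touches ordinary cached memory.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers; none of these names come from memspeed.c++.
constexpr std::size_t kBlockBytes = 2 * 1024 * 1024;  // 2MB block, as in the text

// read: tmp = *src++;
// The volatile sink keeps the compiler from deleting the loads.
template <typename T>
T readBlock(const T* src, std::size_t count) {
    volatile T tmp = 0;
    for (std::size_t i = 0; i < count; ++i) tmp = *src++;
    return tmp;
}

// write: *dst++ = 0;
template <typename T>
void writeBlock(T* dst, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) *dst++ = 0;
}

// copy: *dst++ = *src++;
template <typename T>
void copyBlock(T* dst, const T* src, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) *dst++ = *src++;
}

// Time `passes` read passes over one 2MB block and return MB/s
// (assuming 1MB = 1024*1024 bytes).
template <typename T>
double readRateMBps(int passes) {
    assert(passes > 0);
    std::vector<T> block(kBlockBytes / sizeof(T), T(1));
    auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) readBlock(block.data(), block.size());
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return passes * (kBlockBytes / (1024.0 * 1024.0)) / elapsed.count();
}
```

Calling readRateMBps<std::uint8_t>(10), readRateMBps<std::uint32_t>(10), and readRateMBps<std::uint64_t>(10) would give one figure per data type, comparable in spirit to the "read cached" rows below.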

The program was compiled with "-n32 -mips4", which produces code that takes advantage of the 64-bit load and store operations, and unrolls the loops pretty well. Here is the loop for reading 64-bit integers:

.BB57.testReadWrite__GPvPCc:          # 0x1d0
        .loc        1 130 3
         addiu $2,$2,64                       # [0] 
        ld $0,-72($2)                         # [1]  id:21
        ld $0,-64($2)                         # [2]  id:21
        ld $0,-56($2)                         # [3]  id:21
        ld $0,-48($2)                         # [4]  id:21
        ld $0,-40($2)                         # [5]  id:21
        ld $0,-32($2)                         # [6]  id:21
        ld $0,-24($2)                         # [7]  id:21
        bne $8,$2,.BB57.testReadWrite__GPvPCc # [8]       
        ld $0,-16($2)                         # [8]  id:21

The test program runs at the maximum non-degrading priority, and locks itself into memory, to avoid random effects from other applications. All of the tests were run on quiescent systems so that I/O interrupts would not affect the results.
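Locking a process into memory can be sketched with the POSIX mlockall call. This is a portable stand-in, not necessarily the call the test program makes on IRIX; the helper names are hypothetical, the priority setting is not shown, and the call commonly fails without sufficient privileges or a large enough locked-memory limit.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

// Hypothetical helper: pin all current and future pages of this process
// in RAM so page faults do not disturb the timing loops.
// Returns true on success; on failure prints the reason and returns false.
bool lockProcessMemory() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == 0) return true;
    std::fprintf(stderr, "mlockall failed: %s\n", std::strerror(errno));
    return false;
}

// Undo the lock once the measurements are finished.
void unlockProcessMemory() {
    munlockall();
}
```

A benchmark would typically call lockProcessMemory() once before allocating its buffers and treat a false return as a warning rather than a fatal error.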

Brief Review: Caching

A basic knowledge of the operation of the cache is important if you want to understand the numbers below. All of the machines profiled use a write-back cache, not a write-through cache. This means that before a cache line is loaded, the previous contents of that line must first be written back to memory if they are dirty.

This is what happens when loading from cached memory:

    if the given address is not in the cache then
        write back the cache line (if it's dirty)
        load the cache line that contains the address
    endif

    read from the cache

This is what happens when writing to cached memory:

    if the given address is not in the cache then
        write back the cache line (if it's dirty)
        load the cache line that contains the address
    endif

    write to the cache

For this test program, reading is faster than writing. The blocks are larger than the cache, and the read test does nothing but read, so after the first pass through the cache every cache line is clean when it comes time to replace it, and no write-back is needed. Writing is slower because after the first pass through the cache every cache line is dirty when it is replaced, so its previous contents must be written back to memory before the new line is loaded.
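The difference between the read and write cases can be made concrete with a toy model of the pseudocode above. This is only a sketch: it assumes a direct-mapped cache with 32-byte lines and 8-byte accesses, which is not meant to match any of the machines tested, and the names ToyCache and streamBlock are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy direct-mapped write-back cache that only counts memory traffic.
struct ToyCache {
    struct Line { bool valid = false; bool dirty = false; std::size_t tag = 0; };
    std::size_t lineBytes;
    std::vector<Line> lines;
    std::size_t loads = 0;       // cache lines fetched from memory
    std::size_t writeBacks = 0;  // dirty lines flushed to memory

    ToyCache(std::size_t cacheBytes, std::size_t lineBytes_)
        : lineBytes(lineBytes_), lines(cacheBytes / lineBytes_) {
        assert(!lines.empty());
    }

    void access(std::size_t addr, bool isWrite) {
        std::size_t lineNo = addr / lineBytes;
        Line& l = lines[lineNo % lines.size()];
        std::size_t tag = lineNo / lines.size();
        if (!l.valid || l.tag != tag) {            // address not in the cache
            if (l.valid && l.dirty) ++writeBacks;  // write back if dirty
            ++loads;                               // load the new line
            l.valid = true;
            l.dirty = false;
            l.tag = tag;
        }
        if (isWrite) l.dirty = true;               // a write dirties the line
    }
};

// Run `passes` sequential passes of 8-byte accesses over one block.
ToyCache streamBlock(std::size_t cacheBytes, std::size_t blockBytes,
                     bool isWrite, int passes) {
    ToyCache cache(cacheBytes, 32);
    for (int p = 0; p < passes; ++p)
        for (std::size_t addr = 0; addr < blockBytes; addr += 8)
            cache.access(addr, isWrite);
    return cache;
}
```

With a 512KB cache and a 2MB block, two read passes load 131072 lines and write back none, while two write passes load the same 131072 lines and also write back 114688 dirty lines. The extra write-back traffic is what the model suggests makes the cached write numbers lower than the cached read numbers.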

On some machines (such as O2), uncached writes are pipelined in the memory controller if they're the same width as the memory. The CPU does not have to wait after issuing such a write. This is why the 64-bit uncached writes are so much faster than anything else on the O2.

Summary of Results

Here are the numbers (in megabytes per second) for 64-bit read and write access:

                         read cached  write cached  read uncached  write uncached
O2 180MHz R5K PC             ???          ???            ???            ???
O2 180MHz R5K SC            74.281       58.704         21.774        180.378
O2 150MHz R10K SC           55.968       48.809         10.704        127.330
Octane 195MHz R10K SC      270.135      201.890        272.189        200.687
Onyx2 190MHz R10K SC       259.539      142.712        253.769        144.453
Indigo2 200MHz R4K SC       71.537       57.981         10.758         19.460

O2: 180MHz R5000 32+32PC

I still haven't found one of these to test on.

O2: 180MHz R5000 32+32PC 512SC

    read cached
        bytes:        52.875 MB/s
        32-bit ints:  71.479 MB/s
        64-bit ints:  74.281 MB/s

    write cached
        bytes:        49.021 MB/s
        32-bit ints:  58.678 MB/s
        64-bit ints:  58.704 MB/s

    read uncached
        bytes:         2.714 MB/s
        32-bit ints:  10.893 MB/s
        64-bit ints:  21.774 MB/s

    write uncached
        bytes:         5.525 MB/s
        32-bit ints:  22.099 MB/s
        64-bit ints: 180.378 MB/s

    copy   cached to   cached
        bytes:        18.097 MB/s
        32-bit ints:  28.380 MB/s
        64-bit ints:  31.388 MB/s
        bcopy:        32.665 MB/s

    copy   cached to uncached
        bytes:         5.177 MB/s
        32-bit ints:  17.477 MB/s
        64-bit ints:  51.948 MB/s
        bcopy:        52.333 MB/s

    copy uncached to   cached
        bytes:         2.487 MB/s
        32-bit ints:   8.816 MB/s
        64-bit ints:  15.213 MB/s
        bcopy:        15.833 MB/s

    copy uncached to uncached
        bytes:         1.786 MB/s
        32-bit ints:   7.142 MB/s
        64-bit ints:  16.517 MB/s
        bcopy:        19.641 MB/s

O2: 150MHz R10000 32+32PC 1024SC

    read cached
        bytes:        41.966 MB/s
        32-bit ints:  51.072 MB/s
        64-bit ints:  55.968 MB/s

    write cached
        bytes:        41.017 MB/s
        32-bit ints:  46.793 MB/s
        64-bit ints:  48.809 MB/s

    read uncached
        bytes:         1.338 MB/s
        32-bit ints:   5.353 MB/s
        64-bit ints:  10.704 MB/s

    write uncached
        bytes:         5.625 MB/s
        32-bit ints:  22.492 MB/s
        64-bit ints: 127.330 MB/s

    copy   cached to   cached
        bytes:        15.066 MB/s
        32-bit ints:  20.802 MB/s
        64-bit ints:  22.518 MB/s
        bcopy:        23.856 MB/s

    copy   cached to uncached
        bytes:         5.124 MB/s
        32-bit ints:  16.499 MB/s
        64-bit ints:  40.900 MB/s
        bcopy:        43.502 MB/s

    copy uncached to   cached
        bytes:         1.300 MB/s
        32-bit ints:   4.848 MB/s
        64-bit ints:   8.739 MB/s
        bcopy:         8.714 MB/s

    copy uncached to uncached
        bytes:         1.078 MB/s
        32-bit ints:   4.310 MB/s
        64-bit ints:   9.176 MB/s
        bcopy:         9.975 MB/s

Octane: 2x195MHz R10000 32+32PC 1024SC

    read cached
        bytes:       124.430 MB/s
        32-bit ints: 233.586 MB/s
        64-bit ints: 270.135 MB/s

    write cached
        bytes:       118.115 MB/s
        32-bit ints: 174.691 MB/s
        64-bit ints: 201.890 MB/s

    read uncached
        bytes:       123.799 MB/s
        32-bit ints: 223.918 MB/s
        64-bit ints: 272.189 MB/s

    write uncached
        bytes:       117.887 MB/s
        32-bit ints: 172.916 MB/s
        64-bit ints: 200.687 MB/s

    copy   cached to   cached
        bytes:        29.699 MB/s
        32-bit ints:  59.917 MB/s
        64-bit ints:  76.208 MB/s
        bcopy:       139.126 MB/s

    copy   cached to uncached
        bytes:        46.065 MB/s
        32-bit ints: 103.399 MB/s
        64-bit ints: 121.507 MB/s
        bcopy:       156.203 MB/s

    copy uncached to   cached
        bytes:        48.635 MB/s
        32-bit ints:  99.496 MB/s
        64-bit ints: 109.698 MB/s
        bcopy:       159.957 MB/s

    copy uncached to uncached
        bytes:        25.418 MB/s
        32-bit ints:  62.988 MB/s
        64-bit ints:  88.254 MB/s
        bcopy:       158.487 MB/s

Onyx2: 4x190MHz R10000 32+32PC 4096SC

NOTE: Multiple processors don't speed up this benchmark.

NOTE: Buffer size changed to 8MB for this run.

    read cached
        bytes:       112.749 MB/s
        32-bit ints: 222.683 MB/s
        64-bit ints: 259.539 MB/s

    write cached
        bytes:        85.087 MB/s
        32-bit ints: 130.116 MB/s
        64-bit ints: 142.712 MB/s

    read uncached
        bytes:       114.255 MB/s
        32-bit ints: 224.054 MB/s
        64-bit ints: 253.769 MB/s

    write uncached
        bytes:        84.775 MB/s
        32-bit ints: 126.155 MB/s
        64-bit ints: 144.453 MB/s

    copy   cached to   cached
        bytes:        26.531 MB/s
        32-bit ints:  49.233 MB/s
        64-bit ints:  61.999 MB/s
        bcopy:       124.973 MB/s

    copy   cached to uncached
        bytes:        38.786 MB/s
        32-bit ints:  85.670 MB/s
        64-bit ints:  93.844 MB/s
        bcopy:       139.424 MB/s

    copy uncached to   cached
        bytes:        38.785 MB/s
        32-bit ints:  83.630 MB/s
        64-bit ints:  91.274 MB/s
        bcopy:       145.113 MB/s

    copy uncached to uncached
        bytes:        23.243 MB/s
        32-bit ints:  54.961 MB/s
        64-bit ints:  70.418 MB/s
        bcopy:       141.284 MB/s

Indigo2: 200MHz R4000 16+16PC 1024SC

NOTE: This machine does not have 64-bit loads and stores.

NOTE: I think that the cached-to-cached copy rate is so low because the Indigo2, unlike the other machines tested, does not have a 2-way associative cache. This means that if the block being read and the block being written are lined up in the cache (that is, they map to the same cache lines), then every read and write must re-load the cache line.
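The effect described in this note can be illustrated by counting misses for the copy loop in a toy cache model. The sketch below is an assumption-laden illustration, not a model of the real Indigo2: the function name copyMisses is hypothetical, the 32-byte line size and LRU replacement are assumed, and only 1-way (direct-mapped) and 2-way caches are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count cache misses for the loop "*dst++ = *src++" over 8-byte elements
// in a toy cache with `ways` ways per set and LRU replacement.
std::size_t copyMisses(std::size_t cacheBytes, std::size_t lineBytes,
                       std::size_t ways, std::size_t srcBase,
                       std::size_t dstBase, std::size_t blockBytes) {
    assert(ways == 1 || ways == 2);
    const std::size_t sets = cacheBytes / (lineBytes * ways);
    const std::size_t kEmpty = static_cast<std::size_t>(-1);
    // For each set, slot 0 is the most recently used line number.
    std::vector<std::size_t> tags(sets * ways, kEmpty);
    std::size_t misses = 0;

    auto touch = [&](std::size_t addr) {
        std::size_t lineNo = addr / lineBytes;
        std::size_t* set = &tags[(lineNo % sets) * ways];
        if (set[0] == lineNo) return;            // hit in the MRU slot
        if (ways == 2 && set[1] == lineNo) {     // hit in the LRU slot
            set[1] = set[0];
            set[0] = lineNo;
            return;
        }
        ++misses;                                // miss: replace the LRU line
        if (ways == 2) set[1] = set[0];
        set[0] = lineNo;
    };

    for (std::size_t off = 0; off < blockBytes; off += 8) {
        touch(srcBase + off);  // the read half of the copy
        touch(dstBase + off);  // the write half of the copy
    }
    return misses;
}
```

With a 1MB cache, a source at address 0, and a destination exactly 2MB away, copying 64KB gives 16384 misses in the direct-mapped case (every single access misses) but only 4096 misses in the 2-way case (one per cache line per buffer). In the direct-mapped case the evicted destination lines are also dirty, so each miss additionally costs a write-back.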

    read cached
        bytes:        31.829 MB/s
        32-bit ints:  62.903 MB/s
        64-bit ints:  71.537 MB/s

    write cached
        bytes:        29.881 MB/s
        32-bit ints:  55.780 MB/s
        64-bit ints:  57.981 MB/s

    read uncached
        bytes:         2.670 MB/s
        32-bit ints:  10.675 MB/s
        64-bit ints:  10.758 MB/s

    write uncached
        bytes:         4.864 MB/s
        32-bit ints:  19.460 MB/s
        64-bit ints:  19.460 MB/s

    copy   cached to   cached
        bytes:         0.410 MB/s
        32-bit ints:   1.597 MB/s
        64-bit ints:   3.060 MB/s
        bcopy:        15.852 MB/s

    copy   cached to uncached
        bytes:         4.685 MB/s
        32-bit ints:  17.040 MB/s
        64-bit ints:  17.046 MB/s
        bcopy:        17.029 MB/s

    copy uncached to   cached
        bytes:         2.425 MB/s
        32-bit ints:   8.640 MB/s
        64-bit ints:   8.876 MB/s
        bcopy:         9.038 MB/s

    copy uncached to uncached
        bytes:         1.837 MB/s
        32-bit ints:   7.346 MB/s
        64-bit ints:   7.182 MB/s
        bcopy:         6.983 MB/s