Software Methods for Disk I/O

By Chris Pirazzi. Information provided by folks throughout the company, including Tony Barnes, Doug Doucette, Bill Earl, Jeremy Higdon, Brad Juskiewicz, Michael Minakami, Ted Malone, Jim Montine, Rob Novak, Dave Olson, Scott Pritchett, Paul Spencer, Adam Sweeney, and Vince Uttley.

Once you have your connection to or from video, you need to choose a way to get that data from or to the disk. This varies from trivial to hard depending on your platform's memory system and your disks. Try out the methods on this page in order. When you've got a solution that works on your target platforms, stop: you're done!

Since we're focusing on disk I/O, we'll assume you have the operations to get and release video fields described in Basic Uncompressed Video I/O Code.

Many of the sections below use the terminology defined in Concepts and Terminology for Disks and Filesystems.

Buffered I/O

The most obvious way to get video to disk is to "just write it:"
{
  p = videoport_open(direction, 20, 0);
  fd = open(filename, O_RDWR|O_TRUNC|O_CREAT, 0644);
  for(;;)
    {
      videoport_wait_for_space_or_data(p);
      f = videoport_get_one_field(p);
      rc = write(fd, f->pixels, nbytes);
      videoport_put_one_field(p, f);
    }
}

The most obvious way to get video from disk is to "just read it:"

{
  p = videoport_open(direction, 20, 0);
  fd = open(filename, O_RDONLY, 0644);
  for(;;)
    {
      videoport_wait_for_space_or_data(p);
      f = videoport_get_one_field(p);
      rc = read(fd, f->pixels, nbytes);
      videoport_put_one_field(p, f);
    }
}

Working code for this is in justdoit.c.

This standard usage of read()/write() is called buffered I/O because it will go through the UNIX kernel's buffer cache. When you read() data, the kernel will DMA the necessary disk data into the kernel buffer cache, and then copy the requested data into your buffer. When you write() data, the kernel will copy your data into the buffer cache and return to you. The cached data will eventually be DMAed to the disk.

DMA (Direct Memory Access) is the process by which a hardware device (like a video I/O system or a SCSI disk controller) pulls data out of main memory or puts data into main memory. Usually the CPU can do other things while hardware DMAs are in progress, even things that use main memory. The kernel buffer cache is just an area of main memory which the kernel manages. The term "kernel buffer cache" does not refer to the CPU's data cache (which we discuss below under Data Cache Concerns), nor to the RAM on the disk drives themselves (which we discuss below under Commands and Disks).

The buffer cache is absolutely crucial to the performance of most normal UNIX applications, which want to do small reads and writes on platforms where memory copy speed far exceeds disk bandwidth. The buffer cache lets you read() and write() any amount of data to any part of a file using any memory buffer, whereas the underlying hardware is much more restrictive. The hardware only lets you DMA whole disk blocks at a time, and the hardware puts alignment restrictions on the disk starting address and the DMA memory buffer.

However, video disk I/O is not a normal UNIX application. It requires unusually high bandwidths.

On platforms where the CPU and memory system are fast enough to copy video data between the buffer cache and your buffer in real-time, you can use buffered I/O. If you are writing code for high-end and possibly even mid-end SGI systems, you should try this technique before spending time on the more elaborate techniques below.

When you're trying out buffered I/O, keep in mind that SGI systems can copy data much more efficiently if the source and destination buffers are properly aligned, as described in Copying Data Quickly. You can assume that the kernel's buffers are 8-byte aligned. Therefore, the copy will be efficient if the pointer you pass to read() or write() is 8-byte aligned. Fortunately, malloc() returns 8-byte-aligned pointers, and entries in VLBuffers and DMbufferpools are always at least 8 byte aligned.

On some SGI systems, the kernel can bcopy() faster than you can from user mode, because it can use certain protected instructions. So actual buffered I/O performance may exceed the best user-mode bcopy() benchmark you can cook up.

Even if your system has the juice for buffered I/O, we have found that sometimes the kernel's algorithm for when to discard buffered read data, and when to write back buffered modified data, is not ideal for video applications. You can control this decision somewhat with calls such as fsync() or fdatasync() and with various kernel tunable variables.
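For example, one simple way to keep the kernel from accumulating too much dirty data (and then writing it back at an inconvenient moment) is to flush periodically from your capture loop. This is only a sketch in the style of the samples above; the flush-every-N-fields policy and the interval value are ours, so tune them (or use fsync()) to suit your application:

{
  int fields_written = 0;
  int flush_interval = 10;   /* fields between flushes; an arbitrary example value */

  for(;;)
    {
      videoport_wait_for_space_or_data(p);
      f = videoport_get_one_field(p);
      rc = write(fd, f->pixels, nbytes);
      videoport_put_one_field(p, f);

      if ((++fields_written % flush_interval) == 0)
        fdatasync(fd);   /* push the buffered data out to disk now */
    }
}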

If buffered I/O works for you, then stop right here. You're done.

Direct I/O

On platforms where the CPU and memory system are not fast enough to copy video data between the buffer cache and your buffer in real-time, you need an alternative---direct I/O.

Direct I/O lets you bypass the buffer cache and its associated memory copy by providing buffers that the kernel uses directly as the target of a DMA operation. You use direct I/O by specifying the O_DIRECT flag to open(), or by enabling the FDIRECT flag using the F_SETFL fcntl(). Direct I/O works on local EFS and XFS filesystems.
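A minimal sketch of the two ways to enable it (the filename and the other open() flags are just placeholders):

{
  /* method 1: ask for direct I/O when you open the file */
  fd = open(filename, O_RDWR|O_DIRECT|O_TRUNC|O_CREAT, 0644);

  /* method 2: switch an already-open descriptor to direct I/O */
  int flags = fcntl(fd, F_GETFL);
  fcntl(fd, F_SETFL, flags | FDIRECT);
}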

In a way, direct I/O makes your disk system work like your video system. vlGetActiveRegion() and dmBufferMapData() point you directly to the area of memory into which or out of which the video hardware does DMA. So if you pass those pointers into direct I/O read(), write(), readv(), or writev(), you can move data between your video hardware and your disk hardware with no copies on the CPU.

A direct I/O write() will not return until the DMA operation has completed. Contrast this with buffered write(), which is little more than a copy into the buffer cache. A direct I/O read() or write() will always generate a DMA; the OS makes no attempt to cache read data.

You can access a file using both direct I/O and buffered I/O at the same time, and the kernel will automatically keep both views of the file consistent. For example, the kernel will invalidate a region of the buffer cache when you modify its underlying disk storage with direct I/O.

As you might guess from the section above, direct I/O puts the duty of meeting the hardware's constraints on you. The F_DIOINFO fcntl() tells you what constraints you must follow for a given file descriptor:

{
  int fd = open("filename", O_DIRECT|..., 0644);
  struct dioattr dioinfo;

  /* get direct I/O constraints for this fd */
  fcntl(fd, F_DIOINFO, &dioinfo); 

  ... 

  void *data = pointer to data you're going to read()/write();
  int nbytes = size of that data;

  /* verify data and nbytes for direct I/O read()/write() */

  printf("buffers can be between %d and %d bytes\n",
         dioinfo.d_miniosz, dioinfo.d_maxiosz);
  assert(nbytes >= dioinfo.d_miniosz);
  assert(nbytes <= dioinfo.d_maxiosz);

  printf("buffers must be a multiple of %d bytes long\n",
         dioinfo.d_miniosz);
  assert((nbytes % dioinfo.d_miniosz) == 0);
  
  printf("file position must be on %d-byte boundary\n",
         dioinfo.d_miniosz);
  assert((lseek64(fd,0,SEEK_CUR) % dioinfo.d_miniosz) == 0);

  printf("memory buffer must be on %d-byte boundary\n",
         dioinfo.d_mem);
  /* uintptr_t is integer that's pointer-sized (see inttypes.h) */
  assert((((uintptr_t)data) % dioinfo.d_mem) == 0);

  read(fd, data, nbytes);  or  write(fd, data, nbytes);
}
Working code that reads or writes video with direct I/O is in dio.c.

Although you must open a raw device file without O_DIRECT, and although you cannot use fcntl(F_DIOINFO) on a raw device file, I/O to a raw device file has all the characteristics of direct I/O:

IRIX 6.2 (with patch 1429) and all future OSes (natively) support direct I/O readv()/writev(). We will see later why readv()/writev() can sometimes help you solve disk throughput problems by letting you transfer more than one field or frame at a time. In this case, the constraints are a little more strict:

{
  int fd = open("filename", O_DIRECT|..., 0644);
  struct dioattr dioinfo;
  int vector_chunksize;
  int vector_memalign;

  /* get direct I/O constraints for this fd */
  fcntl(fd, F_DIOINFO, &dioinfo); 

  /* additional constraints for readv()/writev() usage */
  vector_chunksize = max(dioinfo.d_miniosz, getpagesize());
  vector_memalign = max(dioinfo.d_mem, getpagesize());

  ...
  assert(NVECS <= sysconf(_SC_IOV_MAX));
  struct iovec iov[NVECS] = desired pointers and lengths;
  int totalbytes = 0;

  /* verify iov[] for direct I/O readv()/writev() */

  for(i=0; i < NVECS; i++)	
    {
      void *data = iov[i].iov_base;
      int nbytes = iov[i].iov_len;

      assert((((uintptr_t)data) % vector_memalign) == 0);
      assert((nbytes % vector_chunksize) == 0);

      totalbytes += nbytes;
    }

  assert(totalbytes >= dioinfo.d_miniosz);
  assert(totalbytes >= vector_chunksize); /* redundant, but hey */
  assert(totalbytes <= dioinfo.d_maxiosz);

  assert((lseek64(fd,0,SEEK_CUR) % dioinfo.d_miniosz) == 0);

  readv(fd, iov, NVECS);  or  writev(fd, iov, NVECS);
}
As you can see, the memory alignment and I/O size constraints become at least as strict as the system's pagesize (4k or 16k depending on the platform and OS release). The file position and maximum I/O size constraints are the same.

Working code that reads or writes video with direct I/O readv()/writev() is in vector.c.

As of IRIX 6.3 and 6.4, readv() and writev() are not supported for raw device files. If this is important to you, please contact your local support office.

Satisfying the Constraints of Both Disk and Video

The direct I/O examples above referred to "data" and "nbytes." Since you are reading or writing uncompressed video, your "data" pointers will come from vlGetActiveRegion() or dmBufferMapData(), and your "nbytes" byte counts will come from vlGetTransferSize() or dmBufferGetSize(). The VL will align and pad those buffers as necessary to meet the video hardware DMA constraints. But you also need "data" and "nbytes" to satisfy the disk hardware DMA constraints (ie, the direct I/O constraints). How can you do this if the VL allocates the buffers?

Here's how you finesse that problem:

Clearly there should be a supported and device-independent way to add the constraints to VLBuffer and DMbufferpool creation explicitly. Perhaps there will be one day.

Direct I/O Constraints and Portability

If you intend to write a file which can be read one or more items (fields or frames) at a time into VL or DM buffers using direct I/O read() or readv(), then you need to place each item in the file in such a way that the reader can meet the reader's direct I/O constraints. Specifically, using the variable names from the sample code above:

If your application produces files which will only be read back by your application on the same type of SGI machine, disk setup, and OS release, then the direct I/O constraints for reading and writing will be the same and this is easy: just call fcntl(fd, F_DIOINFO, &dioinfo) when writing the file to get your constraints.

But if your application produces files which the user could transport to a different type of SGI machine, move to a different disk, or access after installing a new OS release, the direct I/O constraints at write time and read time may not match. You must choose a set of constraints which works for all target configurations.

Unfortunately, there are no published guarantees from SGI about the range of many of the direct I/O constraints or the system page size. We recommend that you use these constraints in your video writing program:

And of course we recommend that you place items together in the file where possible so the reader can use readv(). For reasons we'll explain below, there is a point at which larger contiguous groups of items will not help the reader. Also, contiguous video items sometimes cause more disk seeks when fetching audio data. Make sure you choose the right tradeoff.

It is unlikely that you will run into trouble with these values, but we can offer no guarantee: we recommend that you continue to call fcntl(fd, F_DIOINFO, &dioinfo) and getpagesize() and use those constraints or fall back to buffered I/O if they are stricter than those above.
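A sketch of that runtime check follows, in the style of the samples above. Here file_alignment is a hypothetical value: the alignment and padding you assumed when the file was written. If this machine's constraints turn out to be stricter, the sketch falls back to buffered I/O by reopening without O_DIRECT:

{
  /* file_alignment: the alignment/padding assumed by the file's writer */
  struct dioattr dioinfo;
  int use_direct = 1;
  int fd = open(filename, O_RDONLY|O_DIRECT);

  if (fd < 0 ||
      fcntl(fd, F_DIOINFO, &dioinfo) < 0 ||
      dioinfo.d_mem > file_alignment ||
      dioinfo.d_miniosz > file_alignment ||
      getpagesize() > file_alignment)   /* pagesize matters for readv()/writev() */
    {
      if (fd >= 0)
        close(fd);
      fd = open(filename, O_RDONLY);    /* fall back to buffered I/O */
      use_direct = 0;
    }
}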

Our suggested constraints can be wasteful of disk space and bandwidth. If you want to cut down on the waste, here are some tips:

You can probably decrease your waste significantly if you have some control over your user's software and hardware configuration.

If your application has the further constraint that your files need to be transportable to non-SGI machines, but still read efficiently on SGI machines, then we recommend you write the data to a file yourself and then use the SGI Movie Library to add a QuickTime header to the file. We'll describe this process in detail in Getting Raw Data in and out of the Movie Library. Despite its humble origins, the QuickTime file format is now mature enough that you can write video fields or frames anywhere in the file and tell QuickTime where they are. Furthermore, a file reader can query the location of each field or frame and read the data itself.

If your application reads files with video, and wants to see whether it can use direct I/O readv() or writev() on the entire file, it needs to check the file offset of each field in the file. For many applications this will work fine. For applications which handle particularly long files, this might be a problem. Some developers have requested a single tag which the application could read once to learn the worst-case alignment and padding of all of the video fields in a particular QuickTime track. SGI may recommend that Apple add such a tag to the QuickTime file format, but we'd still like more feedback from developers about what exact information that tag needs to carry.

Data Cache Concerns

Guides to programming video often tell you to call vlBufferAdvise() or use the "cacheable" argument to dmBufferSetPoolDefaults() in a particular way, but don't tell you when or why. This section takes a peek at what's going on underneath the covers to help you understand what these functions do, and when and why they will help you or hurt you. Most of the details discussed here are not things you have to program (in fact, they'll probably change on future hardware!). But understanding the concepts here should help you write video apps which perform well.

Basic Problem

All SGI platforms rely heavily for performance on the data cache (sometimes two caches) between the CPU and main memory. The data cache does for main memory what the buffer cache does for disk. The data cache translates the CPU's attempts to load or store single words into cache-line-size fetches and writebacks of main memory. That way, programs which load and store in only a cache-sized subset of main memory execute much more quickly, requiring an expensive trip to main memory only for the initial cache line fetches and the final cache line writebacks. See Memory Speed Benchmarks for more on how the cache works.

Alas, there is a hitch. Devices such as video, disk or graphics DMA to or from main memory. They are not aware of the state of the data cache. This can lead to cache coherency problems.

The same coherency issues would exist if we were doing direct disk I/O instead of video I/O. If we were doing buffered disk I/O, the coherency issues would exist, but they would be the operating system's problem (you may still have to deal with the performance consequences).

Coherency Solutions and Their Costs

Fortunately, SGI systems shield you from needing to deal with this problem in one of two ways: high-end platforms have I/O cache coherency hardware, and the remaining platforms have the kernel perform software I/O coherency operations (such as a cache writeback-invalidate) around each DMA. Generally, the I/O cache coherent memory systems of high-end SGI platforms are so fast that you can ignore the coherency overhead. Interesting note: 486/Pentium-style PCs also have I/O cache coherency hardware, but the performance of the memory system is so low compared with video bandwidths that you cannot ignore it, and often must find a way to get around the coherency hardware!

For the other SGI systems, the cost of the software I/O coherency operations varies greatly. For a region of memory which is bigger than the secondary data cache (which video fields or frames often are), the worst-case cost of a writeback-invalidate ranges from 2.6ms on an R5000PC O2, to 4.1ms on an R5000PC/SC Indy, to 4.2ms on an R5000SC O2, to 7.3ms on an R4400SC Indy, to 8.1ms on an early-model R10000 O2. You can measure the writeback-invalidate cost on your system using dmGetUST(3dm) and cacheflush(2). For video/disk applications, this cost will generally appear as system or interrupt time (the yellow or red bar of gr_osview).

"Just Skip It" for Uncompressed Direct Video Disk I/O

Now comes the good part: if all you are doing is DMAing data into main memory using video or disk, and then DMAing data out of main memory using disk or video, then you don't care whether the data cache is consistent with main memory or not! You will not be accessing the data at all using the CPU. If the OS knew this, then it could skip most or all of the cache operations.

That is just the hint you give by calling vlBufferAdvise(VL_BUFFER_ADVISE_NOACCESS) or passing cacheable==FALSE to dmBufferSetPoolDefaults(). Specifying NOACCESS will do the right thing on all platforms (including those with I/O cache coherency hardware, where the right thing may be nothing!). We recommend ignoring the return value. On some platforms where the NOACCESS feature is not supported (Indigo2 IMPACT R10000 for example), vlBufferAdvise() will fail, which is pretty much the best the program can do on that platform.
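For reference, here is roughly what giving that hint looks like in each API. This is only a sketch: svr, path, node, nfields, nbufs, bufsize, and mapped are placeholders for whatever your application already uses, and (as noted above) we simply ignore the vlBufferAdvise() return value:

{
  /* classic VL buffering API: hint that the CPU will never touch the data */
  VLBuffer buf = vlCreateBuffer(svr, path, node, nfields);
  vlBufferAdvise(buf, VL_BUFFER_ADVISE_NOACCESS);   /* ignore the return value */

  /* DMbuffer API: ask for an uncacheable pool (count, size, and mapped
     are whatever your application needs) */
  DMparams *plist;
  dmParamsCreate(&plist);
  dmBufferSetPoolDefaults(plist, nbufs, bufsize,
                          DM_FALSE,   /* cacheable */
                          mapped);
}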

IRIX 6.2 exhibits a vlBufferAdvise() bug which is visible on sirius platforms (Onyx, Challenge). This bug may have been present in releases before IRIX 6.2. If you request VL_BUFFER_ADVISE_ACCESS or if you do not call vlBufferAdvise() (since ACCESS is the default), this should introduce no performance penalty on high-end systems with I/O cache coherency hardware. But the IRIX 6.2 VL will perform software cache coherency operations anyway, harmlessly but wastefully burning up CPU cycles. The workaround is to call vlBufferAdvise(VL_BUFFER_ADVISE_NOACCESS) on sirius platforms even if you intend to access the data. This workaround will only work on sirius, and may not work on releases later than IRIX 6.2.

On platforms without I/O cache coherency hardware, there are some cases where a buffer can be accessible (mapped) but references to that buffer bypass the cache (uncached). This is only supported on some SGI platforms. Each load or store instruction must touch main memory, and may be very expensive. You can achieve this mode by passing cacheable==FALSE and mapped==TRUE to dmBufferSetPoolDefaults() or using cachectl(2) on a non-VL buffer. The mode is not available with VLBuffers. This mode may be useful in cases where you want to touch only a few words per field (for example if you want to parse the VITC out of each field) but write the rest to disk. Stay away from this if you want cross-platform software.

Don't forget that these caching optimizations only apply when you're doing direct disk I/O. The OS will need to touch your data with the CPU when you're doing buffered disk I/O.

Residency Concerns

In the last section, we saw how to make the OS skip part of its normal routine for preparing a buffer for DMA. Another thing the OS must generally do before a buffer can be DMAed is to make sure the entire buffer is resident (ie, that physical pages exist for the buffer's entire virtual address range). The OS does this by touching and locking every page, an operation which can take milliseconds per field in the worst case on some platforms. It must also unlock locked pages when the DMA is done. You can avoid this cost by making sure your buffers are resident before you first use them for I/O.

For VLBuffers and DMbufferpools this is already done---these buffers are always resident. For non-VL buffers, you can use the mpin() system call to assure that a buffer always has underlying physical pages.
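For a buffer you allocate yourself, the pinning step might look like the following sketch. The memalign() call is just one way to satisfy the direct I/O memory alignment constraint (dioinfo is from the earlier F_DIOINFO sample), and error handling is minimal:

{
  /* allocate a buffer that meets the direct I/O alignment constraint,
     then pin it so its pages stay resident */
  void *buf = memalign(dioinfo.d_mem, bufsize);

  if (mpin(buf, bufsize) < 0)
    perror("mpin");     /* can fail if you exceed your resident-memory limits */

  ... use buf for direct I/O read()/write() ...

  munpin(buf, bufsize);
  free(buf);
}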

Many of SGI's drivers (video, disk, and graphics) have fast path checks which skip the whole residency procedure if the buffer is marked as pinned.

Keeping Your Disk Drives Busy Reading or Writing

Ok, so now you're using direct I/O with no copy, you've got the cache licked, and you're pinned down. What more could you need?

If field-at-a-time direct I/O works for you, then stop right here. You're done.

If you still do not get enough disk throughput, then you need to start thinking about how your requests to read and write video data are performing on your disk system. You are probably not keeping all your disk drives busy reading or writing all the time, which means you're wasting potential throughput. Read on!

While you are recording or playing back uncompressed video, a disk in your system which is not reading or writing data is probably doing one of these things:

The remainder of this document will describe various ways of eliminating these sources of dead time and thus improving your overall disk throughput.

Terminology

We will refer to one call to read(), write(), readv(), or writev() as a file I/O or I/O. The I/O size is the buffer size for read()/write(), or the sum of the iov_len fields of all iovcnt iovec entries for readv()/writev().

Many of the sections below use the terminology defined in Concepts and Terminology for Disks and Filesystems.

If you've read this far then you're using direct I/O or raw I/O. Here are all the layers that could be involved in your I/O, from highest to lowest:

All your I/Os pass through physio. From there your I/O jumps to the XFS layer if you are writing to a filesystem (/mydisk/myfile), the XLV data subvolume layer if you are writing to an XLV logical volume's raw device file (/dev/rdsk/xlv/myvolume), or the disk partition layer if you are writing to a partition's raw device file (/dev/rdsk/dks?d?s?).

You will learn more about each layer as you read on.

Disk Seek, Disk Retry, and Disk Recal Issues

The worst case seek time of modern disks ranges from 5 milliseconds to tens of milliseconds, so you must make sure that your disk accesses are sufficiently close that the seek cost doesn't rob you of your needed throughput. We'll examine the sources of disk seeks from the bottom up.

When your computer issues a read or write request to a disk over a bus, it specifies the starting location for the operation as a "logical block number." The drive is actually divided up into "physical sectors," and the drive firmware maps logical block numbers to physical sectors. Normally, it will map logical blocks of increasing address to contiguous physical sectors. But real disk drives have surface defects (up to 1% of a drive's sectors may be bad before it is considered broken). The firmware of many drives will automatically remap logical blocks which would point to defective sectors to other (sometimes distant) sectors. This can introduce extra seeks. You can query the drive for a map of defective physical sectors, but there is no sure-fire way to determine how the drive will map any given logical block number to a physical sector.

When a drive hits a bad sector which it has not previously marked as defective, it will retry the operation a number of times (0 or higher) which you can control with fx. In many cases, the error is recoverable with one retry. The default on SGI-shipped drives is to retry at least once. After a certain drive-specific (and not queryable) number of failed retries, the drive will assume the head is miscalibrated and perform a thermal recalibration operation. If you set the drive retry count high enough, retries and recals together can prevent any useful disk activity from proceeding on the disk for up to several seconds. If the number of retries you requested all fail and the drive still cannot return the correct data (read case) or write the data with possible sector reallocation (write case), then the I/O operation you issued will fail with an error like EIO. If you crank down disk retries in order to bound the worst-case disk I/O execution time, be aware that your I/Os may fail much more often. You can only disable retries on disks which you access with raw device files or XFS realtime subvolumes, since the data and log sections of XFS filesystems must remain consistent for the filesystem to work (unrecoverable errors in an XFS log section cause nasty console messages and/or system panics, for example).

Drives on some SGI systems are configured to perform recals once every 10-20 minutes, regardless of whether a failure has occurred. These periodic recals prevent useful disk activity for a maximum of 50ms for today's drives (4 or 5 times that much for older drives). The newest SGI configurations only recalibrate when a sector read or write actually fails.

From now on, this document will ignore seeks due to remapped blocks (it will assume logical blocks map contiguously onto physical sectors), retries, and recals. In practice, some apps use the default retry count and successfully ignore the effect of these anomalies, and some place requirements on the condition of the disk.

Seek and Retry Issues Above the Disk Level

We now move up to the lowest software levels, the disk partition and the XLV logical volume. At this point we run into a terminology clash. When you open up a partition's raw device file (/dev/rdsk/dks?d?s?) and you lseek(2) to a particular address, much of our documentation will refer to this as a "physical address." When you open up a raw device file for an XLV logical volume (/dev/rdsk/xlv/*, giving you the logical volume's data subvolume) and you lseek(2) to a particular address, much of our documentation will refer to this as a "logical address." These definitions of "logical" and "physical" are stacked on top of the disk definitions above. So a /dev/rdsk/dks?d?s? physical address is the same thing as a disk's logical block number.

An obvious way to avoid seeks at the partition and logical volume level is to access their raw device sequentially. As we will see in the discussion of disk commands below, when you access a disk partition or XLV data subvolume in a sequential manner using its raw device file, you are accessing logical blocks on the underlying disks in a sequential manner.

If there is some serious error on your SCSI bus such as a timeout or a SCSI reset, or if there is some failure in a particular SCSI device such as unit attention, media error, hardware error, or aborted command, the dksc driver (which manages access to partitions by you (/dev/rdsk/dks?d?s?), XLV, and XFS) will retry an I/O several times. If all the retries fail, your I/O will fail. These serious failure conditions can incapacitate your bus or device for seconds. They do not occur in normal operation, so this document will ignore them.

Finally we get to XFS. When you write to a file in an XFS filesystem, the filesystem decides where to place the data within the raw partition or volume, and it may be forced to fragment a file. Currently there is no XFS mechanism to guarantee contiguous allocation of a file. One practical way to create contiguous files is to start with an empty filesystem, and pre-allocate the amount of contiguous space you will need in one or more files using mkfile(1) or the F_RESVSP fcntl(). Unlike F_ALLOCSP, this XFS fcntl doesn't zero out the allocated space and so is fast. After preallocating contiguous files, you can use your disk normally. This method is similar to the raw partition method, since part of your disk becomes dedicated to one use, but the dedicated part can be accessed by any normal UNIX program without going through the awkward /dev/rdsk mechanism.
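A preallocation call might look like the following sketch; nbytes_needed and the filename are placeholders for your application's values:

{
  /* reserve (but do not zero) space for the file up front */
  struct flock fl;
  int fd = open("/mydisk/capturefile", O_RDWR|O_CREAT, 0644);

  fl.l_whence = SEEK_SET;       /* l_start is relative to the start of the file */
  fl.l_start  = 0;
  fl.l_len    = nbytes_needed;  /* how many bytes of space to reserve */

  if (fcntl(fd, F_RESVSP, &fl) < 0)
    perror("F_RESVSP");
}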

You can use the xfs_bmap(1M) tool or the F_GETBMAP and F_FSGETXATTR fcntl(2)s to check whether and how a specified XFS file is fragmented across your disk.
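As a rough sketch of the fcntl() route (see fcntl(2) and xfs_bmap(1M) for the exact semantics of struct getbmap), an extent-count check might look like this. The first array element describes the range you want mapped, and the kernel fills in the following elements with one extent apiece, in 512-byte units:

{
  struct getbmap bmap[32];   /* header plus room for 31 extents */

  memset(bmap, 0, sizeof(bmap));
  bmap[0].bmv_offset = 0;
  bmap[0].bmv_length = -1;   /* -1 means "through the end of the file" */
  bmap[0].bmv_count  = sizeof(bmap)/sizeof(bmap[0]);

  if (fcntl(fd, F_GETBMAP, bmap) < 0)
    perror("F_GETBMAP");
  else
    printf("file maps to %d extent(s); 1 means contiguous\n",
           bmap[0].bmv_entries);
}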

There is currently no XFS filesystem reorganizer (fsr). The fsr program rearranges existing file data to create large contiguous free sections. EFS, the previous SGI filesystem, supported fsr, but the days of EFS are numbered. We don't recommend building a product around that filesystem. If your application requires filesystem reorganization or additional contiguous file support from XFS, please post questions and your data to an SGI newsgroup or contact your local field office.

Unless otherwise specified, the rest of this document will assume that the vast majority of your disk access is sequential, and so seek time is never a significant limiting factor. Some of the optimizations we will describe here only work under that assumption.

Disk Commands

Surprisingly, seeks and the other anomalies usually turn out to be the least of your problems when debugging uncompressed video disk I/O applications. A much more subtle performance problem occurs when your disks stop doing useful work because they starve for write requests or read requests. To understand the many ways this can happen and how to avoid them, we need to take another look under the covers.

The OS accomplishes file I/O by issuing commands to the relevant disks. A command is a data transfer between the computer (the host) and one disk on one bus. During a command, the computer reads from or writes to a sequential set of logical blocks on the disk. Each command looks like one DMA on the host side and involves various bus acquisition and data transfer procedures on the bus side. The host adapter is the piece of hardware on the host which implements that DMA by communicating over the bus. At any given time, the host can be executing multiple commands to different disks, on a single bus or multiple busses.

When you use buffered I/O, the OS chooses when to issue disk commands and how large they are. With direct I/O (including I/O to a raw device file, which is inherently direct):

Soon, we'll show you exactly how the OS transforms your I/O into one or more disk commands. But first, we'll motivate this discussion by describing a performance characteristic of commands.

The Effect of Command Size and Spacing

In order to complete a command, your computer has to spend a certain amount of CPU time setting up the command, then your host adapter has to occupy the bus for a certain amount of time starting, executing, and ending the command, and then your computer has to spend a certain additional amount of CPU time shutting down the command. This CPU setup and shutdown cost includes the cache coherency and residency operations we described earlier in this document. If you've read this far in the document, you've probably already eliminated those operations by making your buffers uncacheable and pinning them down.

But there are other CPU costs, and there are the bus usage costs. Some of these costs scale linearly with the size of the command. Some of these costs are fixed per command. If your system has significant per-command fixed costs, you can increase your overall data throughput if you make your commands as few and as large as possible. This simple diagram assumes each of your I/Os generates one command to the same disk:

In the upper case, we pay several extra fixed CPU and bus costs to shut down the first I/O and set up the second I/O.

This diagram is not to scale. For uncompressed video disk applications, where each command transfers hundreds of kilobytes and where you have eliminated the cache coherency and residency operations, the CPU and bus setup costs are typically a tiny fraction of the useful work.

Because of uncertainties in UNIX process scheduling, we may also waste some additional time, shown as a gap between the two I/Os, between when the first I/O completes and when our application runs and issues the second I/O. These delays may be a significant fraction of the useful work done by each command. You may be able to reduce this average time with some of the hacks described in Seizing Higher Scheduling Priority.

Say you can somehow assure that the costs between the two I/Os (bus overhead to complete and issue commands, CPU overhead, UNIX process scheduling) will be no more than a few tens or hundreds of microseconds. In some cases, this small delay in the host response may trigger a multi-millisecond delay in the disk response which is called a missed revolution. We'll discuss missed revs more when we talk about disks below.

As the diagram shows, combining the two commands into one reduces the overall cost and brings us to our result sooner. We also eliminate one command boundary, and thus one opportunity for a missed rev.

How can you increase your command size in video disk I/O programs? Typical video applications have their video data stored in field- or frame-sized buffers, so it would seem that roughly 300k or 600k is the greatest possible I/O size. Fortunately, UNIX offers an extremely simple I/O interface, readv() and writev(), which lets you do one contiguous I/O to a file from several buffers which are not contiguous in memory. As of IRIX 6.2, readv() and writev() are supported on direct I/O file descriptors. As discussed in detail in Direct I/O above, the direct I/O constraints become a little more strict when using direct I/O readv()/writev(), but they are still within the parameters of VLBuffers and DMbufferpools. So your video recording or playback program can do disk I/O in multiples of V fields or frames.

Working code that reads or writes video with direct I/O readv()/writev() is in vector.c.

Choosing V can sometimes be tricky. The readv() and writev() calls can accept at most sysconf(_SC_IOV_MAX) vectors. This is 16 on all systems as of 9/3/97, but it could get larger on future OSes. One of the directio constraints, dioinfo.d_maxiosz, limits the total number of bytes per readv()/writev(). Ideally, you would want the largest value of V which satisfies these conditions. But V is also equal to the number P used in determining how much memory buffering to use in your videoport, as described under "How Much Memory Buffering Do I Need?" in Basic Uncompressed Video I/O Code. A high value of V may lead you to allocate a prohibitively large amount of memory, depending on your application and target platforms. You need to choose the best tradeoff.
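Putting those limits together, a starting point for choosing V might look like the following sketch, where itemsize is the (padded) size of one field or frame and max_items_in_ram is a placeholder for your application's memory budget:

{
  int iov_max = (int)sysconf(_SC_IOV_MAX);
  int V = dioinfo.d_maxiosz / itemsize;   /* keep each readv()/writev() legal */

  if (V > iov_max)
    V = iov_max;              /* can't pass more vectors than this per call */
  if (V > max_items_in_ram)
    V = max_items_in_ram;     /* respect your videoport memory budget */
  if (V < 1)
    V = 1;
}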

I/Os to Commands

Now we'll tell you which commands the OS will generate for any I/O of yours. This diagram (valid up to IRIX 6.3 and 6.4) shows you the layers of kernel software your raw I/O or direct I/O goes through where the I/O may be split, constrained, or generate other I/Os. The diagram shows all four ways of accessing your disks:

UNIX geek note: the requests which we call I/Os are actually called uio_ts or buf_ts inside a UNIX kernel, depending on the level at which they occur. The distinction is not important for this discussion. The requests which we call commands are called scsi_request_ts inside IRIX.

This document makes many simplifying assumptions about your XLV volume, which you can read about in Concepts and Terminology for Disks and Filesystems. This diagram leaves out the kernel buffer cache since we are talking about direct I/O (and since raw I/O is inherently direct).

All raw or direct I/O requests pass through a layer called physio. This is the layer which does the cache and residency operations described above. You should have already rendered these operations harmless if you've read this far in the document. The physio layer imposes an overall hard limit on I/O size of min(maxdmasz, 2^24 bytes). This guarantees that the I/O will not exceed hardware and software limits in lower layers. One of the direct I/O constraints (dioinfo.d_maxiosz) gets its value in part from this limitation. The maxdmasz kernel tunable variable defaults to 1MB. In some cases (most likely with disk arrays), you may want to issue I/Os bigger than 1MB. To do this, apply the maxdmasz tune shown in Hardware/System Setup for Uncompressed Video I/O. The 2^24 value comes from the number of bits available in the SCSI command SGI uses for read and write commands (nope, encapsulation is not a technique used in a UNIX kernel!).

In the next two sections, we'll follow your I/O through the XFS and XLV levels to see the cases where it can be split into several I/Os or generate other I/Os.

I/Os to Commands: XFS Level

The first and most obvious place your I/O can be split is at the filesystem level, if your file is not contiguous on disk. As we stated in Disk Seek, Disk Retry, and Disk Recal Issues above, the rest of this document will assume that this split is rare enough to have a negligible effect, so we ignore it unless otherwise specified.

When you use an XFS filesystem instead of a raw device file, the kernel occasionally needs to update the filesystem's log section and the metadata in the filesystem's data section. The kernel will automatically generate these I/Os. Their frequency depends on the frequency of your I/Os. Although these log and metadata updates involve a miniscule number of bytes compared to your video data, they may generate the occasional seek on your disks and so their cost may be significant. There is no easy way to judge what kind of performance impact they will have. Some developers use a rule of thumb that the log and metadata updates will eat up 10% of the throughput achievable from the raw device file, but we're not so sure this is a good rule of thumb. For example, if you use XLV to place your log section on a different set of disks, the impact of the log updates may become negligible.

I/Os to Commands: XLV Level

As shown in Concepts and Terminology for Disks and Filesystems, XLV takes your I/O and maps it through the logical subvolume, plex, and volume element levels onto physical addresses on your partitions. Note that the XLV/dksc and the disk each have their own definition for "logical" and "physical:" XLV/dksc sit atop the disk, and dksc's "physical disk address" is the same thing as the disk's "logical block number" (as described above). This section uses the XLV/dksc definitions.

Given our assumptions from that document, the mapping from logical subvolume to plex is one-to-one, and there is only one plex (no redundancy) for each logical subvolume. So this level passes your I/O through unchanged.

A plex may contain several volume elements. If your I/O spans two or more volume elements, XLV will split it up. Since your volume elements should be vast in size compared to a typical I/O, we will assume this event is rare enough to be insignificant, and ignore it.

If your volume elements are single partitions, the mapping from volume elements to partitions is also one-to-one.

The interesting stuff happens when you have a striped volume element with partitions on different disks. XLV will split your I/O into stripe-unit-size pieces and execute the pieces concurrently on each of the volume element's partitions. It is crucial to understand exactly where XLV will split your I/O. Say you have a striped volume element with N=4 underlying partitions, and a stripe unit of S bytes. XLV will use the following fixed scheme to map logical positions within the volume element to physical positions on the volume element's partitions:

As you access the volume element from beginning to end, you access:

Note that the beginning of a stripe unit occurs at offset 0*S, 1*S, 2*S, ... within each partition, and the beginning of a stripe occurs at offset 0*N*S, 1*N*S, 2*N*S, ... within the volume element.
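In code, that round-robin layout works out to arithmetic like the following sketch, which maps a byte offset within the volume element to a partition number and an offset within that partition (offset, S, and N are as above; the variable names are ours):

{
  off64_t unit   = offset / S;                  /* which stripe unit, counting from 0 */
  int     disk   = unit % N;                    /* which partition that unit lands on */
  off64_t stripe = unit / N;                    /* which stripe (row of units) it is in */
  off64_t within = stripe * S + (offset % S);   /* offset within that partition */
}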

With this fixed addressing scheme, the striped combination of N disks should give you an overall throughput that is N times the throughput of the slowest disk in the stripe (since your I/O does not complete until all the I/Os it generates have completed).

To make your app use the striped volume element as efficiently as possible, you want each of your I/Os to generate the fewest and largest possible I/Os on the underlying disks.

The first obvious thing is that this scheme will never generate an I/O larger than S bytes. So if you choose S very small (say 16k), then your I/O will get split up into an enormous number of small I/Os and you will lose performance in bus and CPU overhead.

You want to choose the largest value of S that will still give you parallelism, given your application's usage pattern. For example, say your app reads or writes one 256k field at a time using direct I/O. If you choose S==1MB, then each of your I/Os will execute on at most two, and usually one, of the disks in the stripe set! You will not have achieved the desired N-fold throughput increase.

For an application which issues one direct I/O (read(), write(), readv(), writev(), ...) of W bytes at a time, a good value of S to use would be W/N. This would tend to give you the smallest number of largest-sized I/Os to your underlying disks.

Things are more complicated for an application that does asynchronous I/O, which will be defined below. In this case, we need a more general definition of W: "the number of bytes of I/O which the application has outstanding at any given time." Then the formula is the same as above.

Depending on the constraints of your application, you may be able to tweak W by changing your app, or tweak S by telling your users to create their logical volume in a certain way, or both.

Another performance issue on striped volumes is stripe alignment. Say you perform an I/O of size N*S. How many I/Os will that generate?

If your I/O is stripe-aligned, then you will generate N=4 I/Os (each generated I/O is outlined in yellow):

These I/Os will execute in parallel on each disk, and you will have hopefully achieved a near-fourfold performance improvement.

But if your I/O is not stripe-aligned, you will generate N+1=5 I/Os (the first four I/Os are outlined in yellow, and the fifth is outlined in green for clarity):

Remember that your I/O (ie, your call to direct I/O read(), write(), readv() or writev()) will not return until all the commands it generated have finished. If there is a significant fixed cost associated with commands in your system, the fact that there are two, smaller commands on disk 1 may hurt your overall performance:

How can you assure that your I/Os are stripe aligned?
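However you arrange it, the property you are after is simple arithmetic: the position (relative to the start of the volume element) at which each I/O begins should be a multiple of the stripe width N*S. A sketch of the check, assuming you are accessing the raw device file (for an XFS file, the file's placement on the volume also matters):

{
  off64_t stripe_width = (off64_t)N * S;
  off64_t pos = lseek64(fd, 0, SEEK_CUR);   /* position of the next I/O */

  if ((pos % stripe_width) != 0)
    fprintf(stderr, "warning: I/O at %lld is not stripe aligned\n",
            (long long)pos);
}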

I/Os to Commands: Host Adapter Driver Level

Finally, at the bottom level of the I/O splitting diagram, we have the host adapter. The overall hard limitation on I/O size which we saw at the physio layer accommodates constraints at the host adapter level, since it is the host adapter that sets up DMAs and sends SCSI commands.

In Hardware/System Setup for Uncompressed Video I/O, we specified some platform-specific tunes you could apply to your system, and promised to explain when they were useful later. Now is the time:

Exploiting Parallelism

Coalescing commands and reducing inter-command overhead is not the only way to increase throughput. Each CPU, each bus, and each disk on your system is a separate entity. In order to complete one command, each must do some work in a certain order. But in many cases you can overlap the CPU, bus, or disk part of one command with another part of another command, and end up with the result of both commands sooner.

XLV striping is the most obvious example of this. An I/O to a striped volume gets split up into several commands which execute concurrently on several disks. We discussed XLV striping above.

You can exploit parallelism even when issuing commands to one disk. Since your CPU and your bus are separate entities, you can often increase your throughput by overlapping the CPU part of one command with the bus part of another command, as shown here (as above, we assume that each of your I/Os generates one command to the same disk):

This is called asynchronous I/O. In this case, you still end up paying the same CPU cost and the same bus cost, and you still have the same gap between the start of each I/O due to UNIX process scheduling. But you take advantage of your ability to pay some of these costs simultaneously, and so you get to your result sooner.

Another way to think of it: asynchronous I/O lets your app give the OS advance warning of upcoming commands. The OS can use this advance warning to pay its fixed per-command setup costs while the bus and disk are busy. Then, when the bus and disk become free, the OS can issue the next command most efficiently. As we'll see later, the OS can often extend this advance warning into the host adapter hardware layer, and you and the OS can extend the advance warning all the way down to the disk itself using something called disk command tag queueing. The lower you go, the more opportunities for parallelism you can exploit. With the right combination of methods, you can even reduce or eliminate missed revolutions. We'll describe these lower-down methods of parallelism later.

If you've read this far in the document, then you're using direct I/O or raw I/O. These forms of I/O do not return until all the commands they generate have completed. To take advantage of asynchronous I/O for video, you will need a way to have multiple direct or raw I/O requests outstanding. In other words, you need a way to tell the OS what I/O you are going to issue next while one is currently executing.

The POSIX party line way to do this is the aio library (man aio_read(3), aio_write(3)). The aio method is less code for you, but aio lacks a selectable file descriptor which you can block on until the I/O completes (instead, it uses signals or other-threadly callbacks), and aio does not support readv()/writev().
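If you do go the aio route, a minimal sketch of one asynchronous direct I/O write might look like this. Error handling is trimmed; fd, f, and nbytes are as in the earlier samples, file_offset is a placeholder for where this field belongs in the file, and the buffer and offset must still meet the direct I/O constraints:

{
  struct aiocb cb;
  const struct aiocb *list[1];

  memset(&cb, 0, sizeof(cb));
  cb.aio_fildes = fd;
  cb.aio_buf    = f->pixels;     /* must meet the direct I/O memory alignment */
  cb.aio_nbytes = nbytes;
  cb.aio_offset = file_offset;   /* placeholder: where this field goes in the file */

  aio_write(&cb);                /* returns immediately; the I/O is now pending */

  ... go dequeue and submit the next field ...

  list[0] = &cb;
  aio_suspend(list, 1, NULL);    /* block until this particular I/O completes */
  if (aio_error(&cb) == 0)
    rc = aio_return(&cb);        /* byte count, as read()/write() would return */
}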

You may instead want to roll your own asynchronous I/O code by issuing I/Os to the same file simultaneously from separate sproc(2) threads or pthreads. If you can only open() the file once, you need to use pread()/pwrite() from each thread so that the seek and read/write operation are atomic in each thread. Otherwise, we recommend that you open() the file once per thread so that each thread has its own file descriptor with its own current file position ("file pointer," in lseek(2) lingo). The dup(2) and dup2(2) system calls will create another file descriptor with the same current file position; you do not want to use them. Because there is no preadv() or pwritev(), opening the file once per thread is the only way you can use readv()/writev() and asynchronous I/O together.

Working code that reads or writes video with direct I/O readv()/writev() in an asynchronous manner (not using the aio library) is in async.c.

Be warned: there are few things that will twist, complicate, add bugs to, and reduce the readability of your code more expertly than asynchronous I/O. To understand why, consider that asynchronous I/Os do not necessarily complete in the order you issue them. Think about how your application manages buffers of video data now. For example, if you are using the classic VL buffering API (see What Are the SGI Video-Related Libraries?), the vlPutFree() and vlPutValid() calls do not take a VLInfoPtr argument. They assume that you want to free or enqueue the oldest item that you have dequeued or allocated (respectively). To get the most out of asynchronous I/O, you will want to keep each of your threads or aiocb_ts busy, so you may need to free or enqueue items in a different order than you dequeue or allocate items (respectively). In Basic Uncompressed Video I/O Code, we presented a wrapper around the classic VL buffering API which gave you this ability, albeit with some memory waste. This shortcoming of the classic VL buffering API is one of the main reasons why the DMbuffer APIs exist. The DMbuffer APIs let you free or enqueue items in any order with optimal memory usage, since you pass in a DMbuffer handle to dmBufferFree() or vlDMBufferSend().

Even with the DMbuffer APIs, asynchronous I/O is messy because it involves coordination between processes. We strongly recommend you investigate all other options before paying the development cost for asynchronous I/O.

Commands and Your Host Adapter

When you use asynchronous I/O, you give the OS advance warning about the next command on a disk while the previous command on that disk is executing. This allows time for your I/O to pass through physio and all the other kernel layers, and time for the kernel to prepare data structures needed to issue the I/O to the host adapter. Depending on the host adapter on your platform, the system can prepare for the impending completion of the previous command to different degrees. Both hinv and /var/sysgen/system/irix.sm tell you which host adapter you have.

This performance difference can be significant for video disk I/O applications in terms of missed revolutions, which we'll describe below.

Commands and Disks

Ok, so now you know exactly how I/Os map onto commands. Plus you know that keeping your commands large and closely spaced, and exploiting parallelism where possible will help keep your disks busy. Three optimizations at the disk level---disk readahead, disk write buffering, and disk command tag queueing---also help us achieve this goal. To understand these you need to know a little more about how data gets between the host adapter and the disks on a bus.

Once a read or write command between the computer (the host) and a disk begins, the disk will begin to shuttle chunks of data at bus speeds between the host and some RAM on the disk itself. Typical modern disks have a few hundred kilobytes of RAM. Disk arrays may have significantly more. A disk will only occupy the bus when it is shuttling one of these chunks to or from the host. Then the disk will let go of the bus ("disconnect" in SCSI terms) and begin the much slower process of transferring data to or from its media. Several commands to different disks on the same bus can be in progress at once. Each disk takes turns refilling its RAM (in the write case) or emptying its RAM (in the read case) over the bus. On the host side, the host adapter implements the DMA issued by the OS using these chunks of data shuttling in and out on the bus.

This diagram shows one complete write and read command:

The row marked "command" represents the interval between when the OS programs the host adapter to initiate the command to the disk on the bus and when the host adapter notifies the OS that the disk has reported the command complete. The upper arrow shows how the host DMAs into (write command) and out of (read command) the fast RAM on the hard drive. The lower arrow shows how the (much slower) media side of the disk writes from the RAM to the media (write command) or reads from the media to the RAM (read command). The row marked "disk RAM fill level" shows how the disk's RAM fills and empties as the host and media sides of the drive access it.

How does the disk decide when to request more data from the host (write command) or push more data to the host (read command)? This is controlled by a set of drive parameters which you can set with fx, including a high (read command) or low (write command) water mark and some bus ownership timeouts. Unfortunately, there is no way to query the size in bytes of the disk's RAM (the high and low water marks above are percentages). You have to get this data from the drive manufacturer.

Note that the drive holds the bus for only a small percentage of the command. You can create antisocial drives by disabling drive disconnect using fx, but this is unlikely to be useful for video disk I/O. We'll assume all your drives have disconnects enabled.

The row marked "command completeness on the disk's media" indicates what percent of the data for this command has actually been read from the media or written to the media. Finally, the row "media busy" is the really important one: it says whether the disk is actively reading or writing on the media. If we are forced to access the disk's physical sectors non-contiguously, then from now on we will also count time spent doing these seeks as "yes." If we're doing our job right, then this row will be "yes" all the time.

To put it another way, our job is:

Disk Optimizations

Notice how the disk's media goes idle at the end of the commands above. Disk readahead, disk write buffering, and disk command tag queueing help to eliminate that idle time by giving the disk something to do next before it's finished with what it's doing now:

Disk readahead and disk write buffering can sometimes give you performance boosts even if you aren't using any of the higher-level optimizations described above. Disk command tag queueing usually only helps for video disk I/O if you are also using asynchronous I/O.

You can see if these options are enabled on a disk using the label/show/all command of fx.

You can enable and disable these options on a disk using the label/set/ commands in fx. You must run fx in expert mode (-x) to do this. The label/set options in fx are extremely dangerous, so be careful when you use them not to hose your disk.

Disk Readahead

A read command begins when the host notifies the disk of the request and ends when the disk transfers the last chunk of requested data over the bus to the host.

The disk may already be reading data off its media when the read command begins, if it has readahead enabled. When a disk with readahead enabled has just finished reading data off its media to satisfy a given read command from the host, it will immediately continue to read from the next location on the media, on the theory that the host is about to request that data as part of the next command. If the host is accessing the disk in a sequential manner (as is generally the case for uncompressed video disk I/O), then readahead allows the disk to do non-stop useful work even though the host may not be able to react to the completion of one command by starting another one quickly. This diagram shows the optimization pictorially:

In this case, if the two commands are sequential, then there is no idle media time.

Readahead is on by default on all SGI-shipped disks. Some disks have fancy readahead algorithms which can track several different threads of sequential access.

Disk Write Buffering

A write command begins when the host notifies the disk of the request and provides the first chunk of data to write. Write buffering, a disk setting familiar to uncompressed video developers, determines when a write command ends:

Here is a diagram of write buffering:

In this diagram, the drive's RAM begins to accept data for a given command while it is still draining data to its media for the previous command. Some drives are not this smart, and will begin to request data just as soon as the media is done with the previous command. Either way, the host has some extra time to react to the completion of each command by issuing the next command without the media going idle. For smart drives, the host also has some extra time to react to the drive's request for data for the next command before the media goes idle.

Like readahead, write buffering allows a disk to do non-stop useful work even if the host cannot immediately react to the completion of one write command by issuing the next write command.

Unlike readahead, write buffering has a nasty side effect. A disk with write buffering enabled can get into a situation where it has multiple write commands pending. There is no efficient way for the host to guarantee that only one write is pending at a time. The disk is allowed to reorder those writes (presumably to reduce seek time and increase throughput). That is, the disk need not modify the media in the order in which the commands were issued. If there is a power failure, it is possible that later commands made it to the disk media but earlier commands did not. This is a major problem for journalling filesystems such as XFS, which rely on the ability to guarantee the ordering of their write commands on the disk media to keep the disk in a consistent state at all times, even if a power failure occurs. XFS assumes that writes to any XFS data or log section will be ordered, and so disks containing those sections cannot have write buffering enabled; this assumption is what allows XFS to recover a disk instantly after a power failure, without a tedious fsck step. Because of this clash, write buffering is disabled by default on all SGI-shipped disks. Writes to an XFS realtime section may be unordered, so disks containing only XLV realtime subvolumes may have write buffering enabled, since the realtime section contains no metadata.

But worry not: it turns out there is something just as good, command tag queueing, which works for all XFS sections and is enabled on all SGI-shipped disks. Read on!

Disk Command Tag Queueing

Disk command tag queueing is an extension of the SCSI protocol which allows the host adapter to queue up multiple commands (up to 256 if the disk supports it) on a disk. It is a logical extension of asynchronous I/O and the queued commands in the adp78/ql host adapter firmware described above. It is supported by the wd95, adp78, and ql host adapters.

Disk command tag queueing is superior to readahead in that the disk does not need to speculate about the next location to read from. If the host issues non-sequential reads, the disk immediately issues a seek rather than wasting time reading from the wrong location on the media.

Disk command tag queueing is superior to write buffering in that the host can enqueue new commands at any time (even before it could with write buffering), and the host gets notified when each command is done on the media.

The disk notifies the host when each enqueued write command has completed on the disk's media. Therefore, disk command tag queueing does not suffer the write ordering problem described for write buffering above. The host can guarantee write ordering by issuing special SCSI tagged commands that are marked as ordered, or by simply assuring that at most one command with ordering constraints is pending on the disk at any given time.

Missed Revolutions

As we hinted above, even a small delay (tens or hundreds of microseconds) in issuing the next command may trigger a multi-millisecond delay called a missed revolution.

To understand missed revolutions, remember that most disk devices (including the insides of disk arrays) use spinning platters. A disk gets a chance to read or write a particular track only once per mechanical revolution. For a 7200 RPM disk, that's once every 8 milliseconds. Say you're doing 128k commands to a 6 MB/sec disk: each command takes about 20ms.

Consider these cases:

In general, the host cannot determine the mapping between logical blocks and physical sectors on a disk, so it cannot even predict which of the cases above will occur. The important thing is to see that large potential penalties (relative to the 20ms command size) can arise even if the host can react in tens or hundreds of microseconds to the end of one command by issuing the next. These penalties would appear on the diagrams above as an unusual delay between the host issuing a command and the disk's media beginning to do useful reading or writing work. Interestingly, this is also what the seek penalty for non-sequential disk access looks like.