Advising the Linux Kernel on File I/O

In this fifth part to a seven-part series on Linux I/O file system calls, you’ll learn how to give advice to the Linux kernel, and more. This article is excerpted from chapter four of the book Linux System Programming: Talking Directly to the Kernel and C Library, written by Robert Love (O’Reilly, 2007; ISBN: 0596009585). Copyright © 2007 O’Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O’Reilly Media.

Advice for Normal File I/O

In the previous subsection, we looked at providing advice on memory mappings. In this section, we will look at providing advice to the kernel on normal file I/O. Linux provides two interfaces for such advice-giving: posix_fadvise() and readahead().

The posix_fadvise() System Call

The first advice interface, as its name suggests, is standardized by POSIX 1003.1-2003:

  #include <fcntl.h>

  int posix_fadvise (int fd,
                     off_t offset,
                     off_t len,
                     int advice);

A call to posix_fadvise() provides the kernel with the hint advice on the file descriptor fd in the interval [offset,offset+len). If len is 0, the advice applies to the range [offset, length of file]. Common usage is to specify 0 for both len and offset, applying the advice to the entire file.

The available advice options are similar to those for madvise(). Exactly one of the following should be provided for advice:

POSIX_FADV_NORMAL
   The application has no specific advice to give on this
   range of the file. It should be treated as normal.

POSIX_FADV_RANDOM
   The application intends to access the data in the
   specified range in a random (nonsequential) order.

POSIX_FADV_SEQUENTIAL
   The application intends to access the data in the
   specified range sequentially, from lower to higher
   addresses.

POSIX_FADV_WILLNEED
   The application intends to access the data in the
   specified range in the near future.

POSIX_FADV_NOREUSE
   The application intends to access the data in the
   specified range in the near future, but only once.

POSIX_FADV_DONTNEED
   The application does not intend to access the pages
   in the specified range in the near future.

As with madvise() , the actual response to the given advice is implementation-specific—even different versions of the Linux kernel may react dissimilarly. The following are the current responses:

POSIX_FADV_NORMAL
   The kernel behaves as usual, performing a moderate
   amount of readahead.

POSIX_FADV_RANDOM
   The kernel disables readahead, reading only the
   minimal amount of data on each physical read
   operation.

POSIX_FADV_SEQUENTIAL
   The kernel performs aggressive readahead, doubling
   the size of the readahead window.

POSIX_FADV_WILLNEED
   The kernel initiates readahead to begin reading into
   memory the given pages.

POSIX_FADV_NOREUSE
   Currently, the behavior is the same as for
   POSIX_FADV_WILLNEED ; future kernels may perform
   an additional optimization to exploit the “use once”
   behavior. This hint does not have an madvise()
   complement.

POSIX_FADV_DONTNEED
   The kernel evicts any cached data in the given range
   from the page cache. Note that this hint, unlike the
   others, is different in behavior from its madvise()
   counterpart.

As an example, the following snippet instructs the kernel that the entire file represented by the file descriptor fd will be accessed in a random, nonsequential manner:

  int ret;

  ret = posix_fadvise (fd, 0, 0, POSIX_FADV_RANDOM);
  if (ret) {
          errno = ret;
          perror ("posix_fadvise");
  }

On success, posix_fadvise() returns 0. On failure, it returns a nonzero error code. Unlike most system calls, posix_fadvise() does not set errno; the error code is the return value itself, which is why the example above assigns it to errno before calling perror(). The possible error values are:

EBADF
   The given file descriptor is invalid.

EINVAL
   The given advice is invalid, the given file descriptor
   refers to a pipe, or the specified advice cannot be
   applied to the given file.

The readahead() System Call

The posix_fadvise() system call is new to the 2.6 Linux kernel. Before its introduction, the readahead() system call was available to provide behavior identical to that of the POSIX_FADV_WILLNEED hint. Unlike posix_fadvise(), readahead() is a Linux-specific interface:

  #define _GNU_SOURCE  /* readahead() is Linux-specific */
  #include <fcntl.h>

  ssize_t readahead (int fd,
                     off64_t offset,
                     size_t count);

A call to readahead() populates the page cache with the region [offset,offset+count) of the file represented by the file descriptor fd.

On success, readahead() returns 0. On failure, it returns -1, and errno is set to one of the following values:

EBADF
   The given file descriptor is invalid.

EINVAL
   The given file descriptor does not map to a file that
   supports readahead.

Advice Is Cheap

A handful of common application workloads can readily benefit from a little well-intentioned advice to the kernel. Such advice can go a long way toward mitigating the burden of I/O. With hard disks being so slow, and modern processors being so fast, every little bit helps, and good advice can go a long way.

Before reading a chunk of a file, a process can provide the POSIX_FADV_WILLNEED hint to instruct the kernel to read the file into the page cache. The I/O will occur asynchronously, in the background. When the application ultimately accesses the file, the operation can complete without generating blocking I/O.

Conversely, after reading or writing a lot of data—say, while continuously streaming video to disk—a process can provide the POSIX_FADV_DONTNEED hint to instruct the kernel to evict the given chunk of the file from the page cache. A large streaming operation can continually fill the page cache. If the application never intends to access the data again, this means the page cache will be filled with superfluous data, at the expense of potentially more useful data. Thus, it makes sense for a streaming video application to periodically request that streamed data be evicted from the cache.

A process that intends to read in an entire file can provide the POSIX_FADV_SEQUENTIAL hint, instructing the kernel to perform aggressive readahead. Conversely, a process that knows it is going to access a file randomly, seeking to and fro, can provide the POSIX_FADV_RANDOM hint, instructing the kernel that readahead will be nothing but worthless overhead.

Synchronized, Synchronous, and Asynchronous Operations

Unix systems use the terms synchronized, nonsynchronized, synchronous, and asynchronous freely, without much regard to the fact that they are confusing—in English, the differences between “synchronous” and “synchronized” do not amount to much!

A synchronous write operation does not return until the written data is—at least—stored in the kernel’s buffer cache. A synchronous read operation does not return until the read data is stored in the user-space buffer provided by the application. On the other side of the coin, an asynchronous write operation may return before the data even leaves user space; an asynchronous read operation may return before the read data is available. That is, the operations may only be queued for later. Of course, in this case, some mechanism must exist for determining when the operation has actually completed, and with what level of success.

A synchronized operation is more restrictive and safer than a merely synchronous operation. A synchronized write operation flushes the data to disk, ensuring that the on-disk data is always synchronized vis-à-vis the corresponding kernel buffers. A synchronized read operation always returns the most up-to-date copy of the data, presumably from the disk.

In sum, the terms synchronous and asynchronous refer to whether I/O operations wait for some event (e.g., storage of the data) before returning. The terms synchronized and nonsynchronized, meanwhile, specify exactly what event must occur (e.g., writing the data to disk).

Normally, Unix write operations are synchronous and nonsynchronized; read operations are synchronous and synchronized. For write operations, every combination of these characteristics is possible, as Table 4-1 illustrates.

Table 4-1. Synchronicity of write operations

Synchronous, synchronized
   Write operations do not return until the data is
   flushed to disk. This is the behavior if O_SYNC is
   specified during file open.

Synchronous, nonsynchronized
   Write operations do not return until the data is
   stored in kernel buffers. This is the usual behavior.

Asynchronous, synchronized
   Write operations return as soon as the request is
   queued. Once the write operation ultimately executes,
   the data is guaranteed to be on disk.

Asynchronous, nonsynchronized
   Write operations return as soon as the request is
   queued. Once the write operation ultimately executes,
   the data is guaranteed to at least be stored in
   kernel buffers.

Read operations are always synchronized, as reading stale data makes little sense. Such operations can be either synchronous or asynchronous, however, as illustrated in Table 4-2.

Table 4-2. Synchronicity of read operations

Synchronous, synchronized
   Read operations do not return until the data, which
   is up-to-date, is stored in the provided buffer. This
   is the usual behavior.

Asynchronous, synchronized
   Read operations return as soon as the request is
   queued, but when the read operation ultimately
   executes, the data returned is up-to-date.

In Chapter 2, we discussed how to make writes synchronized (via the O_SYNC flag), and how to ensure that all I/O is synchronized as of a given point (via fsync() and friends). Now, let’s look at what it takes to make reads and writes asynchronous.

Asynchronous I/O

Performing asynchronous I/O requires kernel support at the very lowest layers. POSIX 1003.1-2003 defines the aio interfaces, which Linux fortunately implements. The aio library provides a family of functions for submitting asynchronous I/O and receiving notification upon its completion:

  #include <aio.h>

  /* asynchronous I/O control block */
  struct aiocb {
          int aio_fildes;         /* file descriptor */
          off_t aio_offset;       /* file offset */
          int aio_lio_opcode;     /* operation to perform */
          int aio_reqprio;        /* request priority offset */
          volatile void *aio_buf; /* pointer to buffer */
          size_t aio_nbytes;      /* length of operation */
          struct sigevent aio_sigevent; /* signal number and value */

          /* internal, private members follow... */
  };

  int aio_read (struct aiocb *aiocbp);
  int aio_write (struct aiocb *aiocbp);
  int aio_error (const struct aiocb *aiocbp);
  int aio_return (struct aiocb *aiocbp);
  int aio_cancel (int fd, struct aiocb *aiocbp);
  int aio_fsync (int op, struct aiocb *aiocbp);
  int aio_suspend (const struct aiocb * const cblist[],
                   int n,
                   const struct timespec *timeout);

Thread-based asynchronous I/O

Linux only supports aio on files opened with the O_DIRECT flag. To perform asynchronous I/O on regular files opened without O_DIRECT, we have to look inward, toward a solution of our own. Without kernel support, we can only hope to approximate asynchronous I/O, giving results similar to the real thing.

First, let’s look at why an application developer would want asynchronous I/O:

  1. To perform I/O without blocking
  2. To separate the acts of queuing I/O, submitting I/O to the kernel, and receiving notification of operation completion

The first point is a matter of performance. If I/O operations never block, the overhead of I/O reaches zero, and a process need not be I/O-bound. The second point is a matter of procedure, simply a different method of handling I/O.

The most common way to reach these goals is with threads (scheduling matters are discussed thoroughly in Chapters 5 and 6). This approach involves the following programming tasks:

  1. Create a pool of “worker threads” to handle all I/O.
  2. Implement a set of interfaces for placing I/O operations onto a work queue.
  3. Have each of these interfaces return an I/O descriptor uniquely identifying the associated I/O operation.
  4. In each worker thread, grab I/O requests from the head of the queue and submit them, waiting for their completion.
  5. Upon completion, place the results of the operation (return values, error codes, any read data) onto a results queue.
  6. Implement a set of interfaces for retrieving status information from the results queue, using the originally returned I/O descriptors to identify each operation.

This provides similar behavior to POSIX’s aio interfaces, albeit with the greater overhead of thread management.

Please check back next week for the continuation of this article.
