Configuring and Optimizing Your I/O Scheduler

In this conclusion to a seven-part series on Linux I/O file system calls, you’ll learn how to select, configure, and optimize your I/O scheduler. This article is excerpted from chapter four of the book Linux System Programming: Talking Directly to the Kernel and C Library, written by Robert Love (O’Reilly, 2007; ISBN: 0596009585). Copyright © 2007 O’Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O’Reilly Media.

Selecting and Configuring Your I/O Scheduler

The default I/O scheduler is selectable at boot time via the iosched kernel command-line parameter. Valid options are as, cfq, deadline, and noop. The I/O scheduler is also runtime-selectable on a per-device basis via /sys/block/device/queue/scheduler, where device  is the block device in question. Reading this file returns the current I/O scheduler; writing one of the valid options to this file sets the I/O scheduler. For example, to set the device hda to the CFQ I/O Scheduler, one would do the following:

  # echo cfq > /sys/block/hda/queue/scheduler

The directory /sys/block/device/queue/iosched contains files that allow the administrator to retrieve and set tunable values related to the I/O scheduler. The exact options depend on the current I/O scheduler. Changing any of these settings requires root privileges.
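
For example, assuming the hda device and the CFQ I/O Scheduler from the example above, an administrator might inspect and adjust these tunables as follows (the slice_idle tunable and the values shown are illustrative; the available filenames vary with the scheduler and kernel version):

  # cat /sys/block/hda/queue/scheduler
  noop anticipatory deadline [cfq]
  # cat /sys/block/hda/queue/iosched/slice_idle
  8
  # echo 0 > /sys/block/hda/queue/iosched/slice_idle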

A good programmer writes programs that are agnostic to the underlying I/O subsystem. Nonetheless, knowledge of this subsystem can surely help one write optimal code.

Optimizing I/O Performance

Because disk I/O is so slow relative to the performance of other components in the system, yet I/O is such an important aspect of modern computing, maximizing I/O performance is crucial.

Minimizing I/O operations (by coalescing many smaller operations into fewer larger operations), performing block-size-aligned I/O or using user buffering (see Chapter 3), and taking advantage of advanced I/O techniques, such as vectored I/O, positional I/O (see Chapter 2), and asynchronous I/O, are important steps to always consider when system programming.

The most demanding mission-critical and I/O-intense applications, however, can employ additional tricks to maximize performance. Although the Linux kernel, as discussed previously, utilizes advanced I/O schedulers to minimize dreaded disk seeks, user-space applications can work toward the same end, in a similar fashion, to further improve performance.

Scheduling I/O in user space

I/O-intensive applications that issue a large number of I/O requests and need to extract every ounce of performance can sort and merge their pending I/O requests, performing the same duties as the Linux I/O scheduler. *

Why perform the same work twice, if you know the I/O scheduler will sort requests block-wise, minimizing seeks, and allowing the disk head to move in a smooth, linear fashion? Consider an application that submits a large number of unsorted I/O requests. These requests arrive in the I/O scheduler’s queue in a generally random order. The I/O scheduler does its job, sorting and merging the requests before sending them out to the disk—but the requests start hitting the disk while the application is still generating I/O and submitting requests. The I/O scheduler is able to sort only a small set of requests—say, a handful from this application, and whatever other requests are pending—at a time. Each batch of the application’s requests is neatly sorted, but the full queue, and any future requests, are not part of the equation.

Therefore, if an application is generating many requests—particularly if they are for data all over the disk—it can benefit from sorting the requests before submitting them, ensuring they reach the I/O scheduler in the desired order.

A user-space application is not bestowed with access to the same information as the kernel, however. At the lowest levels inside the I/O scheduler, requests are already specified in terms of physical disk blocks. Sorting them is trivial. But, in user space, requests are specified in terms of files and offsets. User-space applications must probe for information, and make educated guesses about the layout of the filesystem.

Given the goal of determining the most seek-friendly ordering given a list of I/O requests to specific files, user-space applications have a couple of options. They can sort based on:

  1. The full path
  2. The inode number
  3. The physical disk block of the file

Each of these options involves a tradeoff. Let’s look at each briefly.

Sorting by path. Sorting by the pathname is the easiest, yet least effective, way of approximating a block-wise sort. Due to the layout algorithms used by most filesystems, the files in each directory—and thus the directories sharing a parent directory—tend to be adjacent on disk. The probability that files in the same directory were created around the same time only amplifies this characteristic.

Sorting by path, therefore, roughly approximates the physical locations of files on the disk. It is definitely true that two files in the same directory have a better chance of being located near each other than two files in radically different parts of the filesystem. The downside of this approach is that it fails to take into account fragmentation: the more fragmented the filesystem, the less useful is sorting by path. Even ignoring fragmentation, a path-wise sort only approximates the actual block-wise ordering. On the upside, a path-wise sort is at least somewhat applicable to all filesystems. No matter the approach to file layout, temporal locality suggests a path-wise sort will be at least mildly accurate. It is also an easy sort to perform.
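
For illustration, a path-wise sort is little more than a lexicographic sort of the pathname strings. The following minimal sketch (the compare_path() and sort_paths() helpers are inventions for this example, not part of any standard interface) orders an array of pathnames with qsort() and strcmp():

  #include <stdlib.h>
  #include <string.h>

  /* qsort() comparator for an array of pathname strings */
  static int compare_path (const void *a, const void *b)
  {
          const char *const *x = a;
          const char *const *y = b;

          return strcmp (*x, *y);
  }

  /* sort nr pathnames in place, roughly approximating their on-disk order */
  void sort_paths (char **paths, size_t nr)
  {
          qsort (paths, nr, sizeof (char *), compare_path);
  }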

Sorting by inode. Inodes are Unix constructs that contain the metadata associated with individual files. While a file’s data may consume multiple physical disk blocks, each file has exactly one inode, which contains information such as the file’s size, permissions, owner, and so on. We will discuss inodes in depth in Chapter 7. For now, you need to know two facts: that every file has an inode associated with it, and that the inodes are assigned unique numbers.

Sorting by inode is better than sorting by path, assuming that this relation:

  file i’s inode number < file j’s inode number

implies, in general, that:

  physical blocks of file i < physical blocks of file j

This is certainly true for Unix-style filesystems such as ext2 and ext3. Anything is possible for filesystems that do not employ actual inodes, but the inode number (whatever it may map to) is still a good first-order approximation.

Obtaining the inode number is done via the stat() system call, also discussed in Chapter 7. Given the inode associated with the file involved in each I/O request, the requests can be sorted in ascending order by inode number.

Here is a simple program that prints out the inode number of a given file:

  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <sys/stat.h>

  /*
   * get_inode - returns the inode of the file associated
   * with the given file descriptor, or -1 on failure
   */
  int get_inode (int fd)
  {
          struct stat buf;
          int ret;

          ret = fstat (fd, &buf);
          if (ret < 0) {
                  perror ("fstat");
                  return -1;
          }

          return buf.st_ino;
  }

  int main (int argc, char *argv[])
  {
          int fd, inode;

          if (argc < 2) {
                  fprintf (stderr, "usage: %s <file>\n", argv[0]);
                  return 1;
          }

          fd = open (argv[1], O_RDONLY);
          if (fd < 0) {
                  perror ("open");
                  return 1;
          }

          inode = get_inode (fd);
          printf ("%d\n", inode);

          return 0;
  }

The get_inode() function is easily adaptable for use in your programs.
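
As a minimal sketch of such an adaptation, the following fragment tags each pending request with its file's inode number and sorts the batch with qsort(). The io_request structure, compare_key(), and sort_requests() are hypothetical helpers invented for this example; get_inode() is the function from the listing above:

  #include <stdlib.h>
  #include <sys/types.h>

  /* a hypothetical pending I/O request */
  struct io_request {
          int fd;        /* open file descriptor of the target file */
          off_t offset;  /* file-relative offset of the request */
          size_t len;    /* length of the request, in bytes */
          int key;       /* sort key: here, the file's inode number */
  };

  /* qsort() comparator: ascending order by sort key */
  static int compare_key (const void *a, const void *b)
  {
          const struct io_request *x = a, *y = b;

          if (x->key < y->key)
                  return -1;
          if (x->key > y->key)
                  return 1;
          return 0;
  }

  /* tag each request with its inode number, then sort the whole batch */
  void sort_requests (struct io_request *reqs, size_t nr)
  {
          size_t i;

          for (i = 0; i < nr; i++)
                  reqs[i].key = get_inode (reqs[i].fd);

          qsort (reqs, nr, sizeof (struct io_request), compare_key);
  }

A request whose get_inode() call fails ends up with a key of -1 and simply sorts to the front of the batch, which is harmless for this purpose.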

Sorting by inode number has a few upsides: the inode number is easy to obtain, is easy to sort on, and is a good approximation of the physical file layout. The major downsides are that fragmentation degrades the approximation, that the approximation is just a guess, and that the approximation is less accurate for non-Unix filesystems. Nonetheless, this is the most commonly used method for scheduling I/O requests in user space.

Sorting by physical block. The best approach to designing your own elevator algorithm, of course, is to sort by physical disk block. As discussed earlier, each file is broken up into logical blocks, which are the smallest allocation units of a filesystem. The size of a logical block is filesystem-dependent; each logical block maps to a single physical block. We can thus find the number of logical blocks in a file, determine what physical blocks they map to, and sort based on that.

The kernel provides a method for obtaining the physical disk block from the logical block number of a file. This is done via the ioctl() system call, discussed in Chapter 7, with the FIBMAP command:

  ret = ioctl (fd, FIBMAP, &block);
  if (ret < 0)
          perror ("ioctl");

Here, fd is the file descriptor of the file in question, and block is the logical block whose physical block we want to determine. On successful return, block is replaced with the physical block number. The logical blocks passed in are zero-indexed and file-relative. That is, if a file is made up of eight logical blocks, valid values are 0 through 7.

Finding the logical-to-physical-block mapping is thus a two-step process. First, we must determine the number of blocks in a given file. This is done via the stat() system call. Second, for each logical block, we must issue an ioctl() request to find the corresponding physical block.

Here is a sample program to do just that for a file passed in on the command line:

  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  /*
   * get_block - for the file associated with the given fd, returns
   * the physical block mapping to logical_block
   */
  int get_block (int fd, int logical_block)
  {
          int ret;

          ret = ioctl (fd, FIBMAP, &logical_block);
          if (ret < 0) {
                  perror ("ioctl");
                  return -1;
          }

          return logical_block;
  }

  /*
   * get_nr_blocks - returns the number of logical blocks
   * consumed by the file associated with fd
   */
  int get_nr_blocks (int fd)
  {
          struct stat buf;
          int ret;

          ret = fstat (fd, &buf);
          if (ret < 0) {
                  perror ("fstat");
                  return -1;
          }

          return buf.st_blocks;
  }

  /*
   * print_blocks - for each logical block consumed by the file
   * associated with fd, prints to standard out the tuple
   * "(logical block, physical block)"
   */
  void print_blocks (int fd)
  {
          int nr_blocks, i;

          nr_blocks = get_nr_blocks (fd);
          if (nr_blocks < 0) {
                  fprintf (stderr, "get_nr_blocks failed!\n");
                  return;
          }

          if (nr_blocks == 0) {
                  printf ("no allocated blocks\n");
                  return;
          } else if (nr_blocks == 1)
                  printf ("1 block\n\n");
          else
                  printf ("%d blocks\n\n", nr_blocks);

          for (i = 0; i < nr_blocks; i++) {
                  int phys_block;

                  phys_block = get_block (fd, i);
                  if (phys_block < 0) {
                          fprintf (stderr, "get_block failed!\n");
                          return;
                  }
                  if (!phys_block)
                          continue;

                  printf ("(%u, %u) ", i, phys_block);
          }

          putchar ('\n');
  }

  int main (int argc, char *argv[])
  {
          int fd;

          if (argc < 2) {
                  fprintf (stderr, "usage: %s <file>\n", argv[0]);
                  return 1;
          }

          fd = open (argv[1], O_RDONLY);
          if (fd < 0) {
                  perror ("open");
                  return 1;
          }

          print_blocks (fd);

          return 0;
  }

Because files tend to be contiguous, and it would be difficult (at best) to sort our I/O requests on a per-logical-block basis, it makes sense to sort based on the location of just the first logical block of a given file. Consequently, get_nr_blocks() is not needed, and our applications can sort based on the return value from:

  get_block (fd, 0);

The downside of FIBMAP is that it requires the CAP_SYS_RAWIO capability—effectively, root privileges. Consequently, nonroot applications cannot make use of this approach. Further, while the FIBMAP command is standardized, its actual implementation is left up to the filesystems. While common filesystems such as ext2 and ext3 support it, a more esoteric beast may not. The ioctl() call fails with EINVAL if FIBMAP is not supported.

Among the pros of this approach, however, is that it returns the actual physical disk block at which a file resides, which is exactly what you want to sort on. Even if you sort all I/O to a single file based on the location of just one block (the kernel’s I/O scheduler sorts each individual request on a block-wise basis), this approach comes very close to the optimal ordering. The root requirement, however, is a bit of a nonstarter for many.
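
As a sketch of how an application might apply this, the following fragment reuses get_block() and get_inode() from the listings above, along with the hypothetical io_request structure and compare_key() comparator from the earlier inode-sorting example. It keys each request on the physical location of its file's first logical block, and falls back to inode numbers for the whole batch if FIBMAP is unavailable (for example, because the process lacks CAP_SYS_RAWIO):

  /* tag each request with a seek-friendly key, then sort the batch */
  void sort_requests_by_block (struct io_request *reqs, size_t nr)
  {
          int use_fibmap = 1;
          size_t i;

          /* probe FIBMAP once; if it fails, fall back to inode sorting */
          if (nr && get_block (reqs[0].fd, 0) < 0)
                  use_fibmap = 0;

          for (i = 0; i < nr; i++)
                  reqs[i].key = use_fibmap ?
                          get_block (reqs[i].fd, 0) :
                          get_inode (reqs[i].fd);

          qsort (reqs, nr, sizeof (struct io_request), compare_key);
  }

Falling back for the entire batch, rather than per request, keeps the sort keys comparable: mixing physical block numbers with inode numbers in a single sorted run would produce a meaningless ordering.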

Conclusion

Over the course of the last three chapters, we have touched on all aspects of file I/O in Linux. In Chapter 2, we looked at the basics of Linux file I/O—really, the basis of Unix programming—with system calls such as read(), write(), open(), and close(). In Chapter 3, we discussed user-space buffering and the standard C library’s implementation thereof. In this chapter, we discussed various facets of advanced I/O, from the more-powerful-but-more-complex I/O system calls to optimization techniques and the dreaded performance-sucking disk seek.

In the next two chapters, we will look at process management: creating, destroying, and managing processes. Onward! 


* One should apply the techniques discussed here only to I/O-intensive, mission-critical applications. Sorting the I/O requests—assuming there is even anything to sort—of applications that do not issue many such requests is silly and unneeded.
