Home arrow BrainDump arrow Page 2 - Configuring and Optimizing Your I/O Scheduler

Scheduling I/O in user space - BrainDump

In this conclusion to a seven-part series on Linux I/O file system calls, you'll learn how to select, configure, and optimize your I/O scheduler. This article is excerpted from chapter four of the book Linux System Programming: Talking Directly to the Kernel and C Library, written by Robert Love (O'Reilly, 2007; ISBN: 0596009585). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

TABLE OF CONTENTS:
  1. Configuring and Optimizing Your I/O Scheduler
  2. Scheduling I/O in user space
  3. Conclusion
By: O'Reilly Media
Rating: starstarstarstarstar / 4
January 08, 2009

print this article
SEARCH DEV SHED

TOOLS YOU CAN USE

advertisement

I/O-intensive applications that issue a large number of I/O requests and need to extract every ounce of performance can sort and merge their pending I/O requests, performing the same duties as the Linux I/O scheduler. *

Why perform the same work twice, if you know the I/O scheduler will sort requests block-wise, minimizing seeks, and allowing the disk head to move in a smooth, linear fashion? Consider an application that submits a large number of unsorted I/O requests. These requests arrive in the I/O scheduler’s queue in a generally random order. The I/O scheduler does its job, sorting and merging the requests before sending them out to the disk—but the requests start hitting the disk while the application is still generating I/O and submitting requests. The I/O scheduler is able to sort only a small set of requests—say, a handful from this application, and whatever other requests are pending—at a time. Each batch of the application’s requests is neatly sorted, but the full queue, and any future requests are not part of the equation.

Therefore, if an application is generating many requests—particularly if they are for data all over the disk—it can benefit from sorting the requests before submitting them, ensuring they reach the I/O scheduler in the desired order.

A user-space application is not bestowed with access to the same information as the kernel, however. At the lowest levels inside the I/O scheduler, requests are already specified in terms of physical disk blocks. Sorting them is trivial. But, in user space, requests are specified in terms of files and offsets. User-space applications must probe for information, and make educated guesses about the layout of the filesystem.

Given the goal of determining the most seek-friendly ordering given a list of I/O requests to specific files, user-space applications have a couple of options. They can sort based on:

  1. The full path
  2. The inode number
  3. The physical disk block of the file

Each of these options involves a tradeoff. Let’s look at each briefly.

Sorting by path. Sorting by the pathname is the easiest, yet least effective, way of approximating a block-wise sort. Due to the layout algorithms used by most filesystems, the files in each directory—and thus the directories sharing a parent directory—tend to be adjacent on disk. The probability that files in the same directory were created around the same time only amplifies this characteristic.

Sorting by path, therefore, roughly approximates the physical locations of files on the disk. It is definitely true that two files in the same directory have a better chance of being located near each other than two files in radically different parts of the filesystem. The downside of this approach is that it fails to take into account fragmentation: the more fragmented the filesystem, the less useful is sorting by path. Even ignoring fragmentation, a path-wise sort only approximates the actual block-wise ordering. On the upside, a path-wise sort is at least somewhat applicable to all filesystems. No matter the approach to file layout, temporal locality suggests a path-wise sort will be at least mildly accurate. It is also an easy sort to perform.

Sorting by inode. Inodes are Unix constructs that contain the metadata associated with individual files. While a file’s data may consume multiple physical disk blocks, each file has exactly one inode, which contains information such as the file’s size, permissions, owner, and so on. We will discuss inodes in depth in Chapter 7. For now, you need to know two facts: that every file has an inode associated with it, and that the inodes are assigned unique numbers.

Sorting by inode is better than sorting by path, assuming that this relation:

  file i's inode number < file j's inode number

implies, in general, that:

  physical blocks of file i < physical blocks of file j

This is certainly true for Unix-style filesystems such as ext2 and ext3. Anything is possible for filesystems that do not employ actual inodes, but the inode number (whatever it may map to) is still a good first-order approximation.

Obtaining the inode number is done via thestat()system call, also discussed in Chapter 7. Given the inode associated with the file involved in each I/O request, the requests can be sorted in ascending order by inode number.

Here is a simple program that prints out the inode number of a given file:

  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <sys/stat.h>

  /*
  
* get_inode - returns the inode of the file associated
  
* with the given file descriptor, or -1 on failure
  
*/
  int get_inode (int fd)
  {
         
struct stat buf;
          int ret;

          ret = fstat (fd, &buf);
          if (ret < 0) {
                 
perror ("fstat");
                 
return -1;
         
}

          return buf.st_ino;
  }

  int main (int argc, char *argv[])
  {
         
int fd, inode;

          if (argc < 2) {
                  
fprintf (stderr, "usage: %s <file>\n", argv[0]);
                 
return 1;
         
}

          fd = open (argv[1], O_RDONLY);
         
if (fd < 0) {
                 
perror ("open");
                 
return 1;
         
}

          inode = get_inode (fd);
          printf ("%d\n", inode);

          return 0;
  }

Theget_inode()function is easily adaptable for use in your programs.

Sorting by inode number has a few upsides: the inode number is easy to obtain, is easy to sort on, and is a good approximation of the physical file layout. The major downsides are that fragmentation degrades the approximation, that the approximation is just a guess, and that the approximation is less accurate for non-Unix filesystems. Nonetheless, this is the most commonly used method for scheduling I/O requests in user space.

Sorting by physical block. The best approach to designing your own elevator algorithm, of course, is to sort by physical disk block. As discussed earlier, each file is broken up into logical blocks, which are the smallest allocation units of a filesystem. The size of a logical block is filesystem-dependent; each logical block maps to a single physical block. We can thus find the number of logical blocks in a file, determine what
physical blocks they map to, and sort based on that.

The kernel provides a method for obtaining the physical disk block from the logical block number of a file. This is done via theioctl()system call, discussed in Chapter 7, with the FIBMAPcommand:

  ret = ioctl (fd, FIBMAP, &block);
  if (ret < 0)
         
perror ("ioctl");

Here,fdis the file descriptor of the file in question, andblockis the logical block whose physical block we want to determine. On successful return,block is replaced with the physical block number. The logical blocks passed in are zero-indexed and file-relative. That is, if a file is made up of eight logical blocks, valid values are 0 through 7.

Finding the logical-to-physical-block mapping is thus a two-step process. First, we must determine the number of blocks in a given file. This is done via thestat()
system call. Second, for each logical block, we must issue anioctl()request to find the corresponding physical block.

Here is a sample program to do just that for a file passed in on the command line:

  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  /*
  
* get_block - for the file associated with the given fd, returns
  
* the physical block mapping to logical_block
  
*/
  int get_block (int fd, int logical_block)
  {
         
int ret;

          ret = ioctl (fd, FIBMAP, &logical_block);
         
if (ret < 0) {
                 
perror ("ioctl");
                 
return -1;
         
}

          return logical_block;
  }

  /*
  
* get_nr_blocks - returns the number of logical blocks
  
* consumed by the file associated with fd
   */
  int get_nr_blocks (int fd)
  {
         
struct stat buf;
          int ret;

          ret = fstat (fd, &buf);
         
if (ret < 0) {
                 
perror ("fstat");
                 
return -1;
         
}

          return buf.st_blocks;
  }

  /*
  
* print_blocks - for each logical block consumed by the file
  
* associated with fd, prints to standard out the tuple
  
* "(logical block, physical block)"
  
*/
  void print_blocks (int fd)
  {
         
int nr_blocks, i;

          nr_blocks = get_nr_blocks (fd);
         
if (nr_blocks < 0) {
                  fprintf (stderr, "get_nr_blocks failed!\n");
                  return;
         
}

          if (nr_blocks == 0) {
                  printf ("no allocated blocks\n");
                  return;
         
} else if (nr_blocks == 1)
                  printf ("1 block\n\n");
          else
                  printf ("%d blocks\n\n", nr_blocks);

          for (i = 0; i < nr_blocks; i++) {
                  int phys_block;

                  phys_block = get_block (fd, i);
                 
if (phys_block < 0) {
                          fprintf (stderr, "get_block failed!\n");
                          return;
                 
}
                  if (!phys_block)
                          continue;

                  printf ("(%u, %u) ", i, phys_block);
          }

          putchar ('\n');
  }

  int main (int argc, char *argv[])
  {
          int fd;

          if (argc < 2) {
                  fprintf (stderr, "usage: %s <file>\n", argv[0]);
                  return 1;
         
}

          fd = open (argv[1], O_RDONLY);
          if (fd < 0) {
                 
perror ("open");
                 
return 1;
          }

          print_blocks (fd);

          return 0;
  }

Because files tend to be contiguous, and it would be difficult (at best) to sort our I/O requests on a per-logical-block basis, it makes sense to sort based on the location of just the first logical block of a given file. Consequently,get_nr_blocks()is not needed, and our applications can sort based on the return value from:

  get_block (fd, 0);

The downside ofFIBMAPis that it requires theCAP_SYS_RAWIOcapability—effectively, root privileges. Consequently, nonroot applications cannot make use of this approach. Further, while theFIBMAPcommand is standardized, its actual implementation is left up to the filesystems. While common systems such as ext2 and ext3 support it, a more esoteric beast may not. Theioctl()call will returnEINVALifFIBMAPis not supported.

Among the pros of this approach, however, is that it returns the actual physical disk block at which a file resides, which is exactly what you want to sort on. Even if you sort all I/O to a single file based on the location of just one block (the kernel’s I/O scheduler sorts each individual request on a block-wise basis), this approach comes very close to the optimal ordering. The root requirement, however, is a bit of a nonstarter for many.



 
 
>>> More BrainDump Articles          >>> More By O'Reilly Media
 

blog comments powered by Disqus
escort Bursa Bursa escort Antalya eskort
   

BRAINDUMP ARTICLES

- Apple Founder Steve Jobs Dies
- Steve Jobs` Era at Apple Ends
- Google's Chrome Developer Tool Updated
- Google's Chrome 6 Browser Brings Speed to th...
- New Open Source Update Fedora 13 is Released...
- Install Linux with Knoppix
- iPad Developers Flock To SDK 3.2
- Managing a Linux Wireless Access Point
- Maintaining a Linux Wireless Access Point
- Securing a Linux Wireless Access Point
- Configuring a Linux Wireless Access Point
- Building a Linux Wireless Access Point
- Migrating Oracle to PostgreSQL with Enterpri...
- Demystifying SELinux on Kernel 2.6
- Yahoo and Microsoft Create Ad Partnership

Developer Shed Affiliates

 


Dev Shed Tutorial Topics: