Advising the Linux Kernel on File I/O
In this fifth part of a seven-part series on Linux I/O file system calls, you'll learn how to give advice to the Linux kernel, and more. This article is excerpted from chapter four of the book Linux System Programming: Talking Directly to the Kernel and C Library, written by Robert Love (O'Reilly, 2007; ISBN: 0596009585). Copyright © 2007 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.
Advice for Normal File I/O
In the previous subsection, we looked at providing advice on memory mappings. In this section, we will look at providing advice to the kernel on normal file I/O. Linux provides two interfaces for such advice-giving: posix_fadvise() and readahead().
The posix_fadvise() System Call
The first advice interface, as its name suggests, is standardized by POSIX 1003.1-2003:
#include <fcntl.h>

int posix_fadvise (int fd,
                   off_t offset,
                   off_t len,
                   int advice);
A call to posix_fadvise() provides the kernel with the hint advice on the file descriptor fd in the interval [offset,offset+len). If len is 0, the advice will apply to the range [offset,length of file]. Common usage is to specify 0 for len and offset, applying the advice to the entire file.
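The range rules above can be sketched with three small wrappers. This is only an illustration: the helper names advise_whole_file(), advise_tail(), and advise_range() are hypothetical and not part of any library, and the feature-test macro is an assumption about building against glibc.

```c
#define _XOPEN_SOURCE 600   /* asks glibc to declare posix_fadvise() */
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: offset 0 and len 0 cover the entire file. */
int advise_whole_file (int fd, int advice)
{
        return posix_fadvise (fd, 0, 0, advice);
}

/* Hypothetical helper: len 0 extends the range from offset to end of file. */
int advise_tail (int fd, off_t offset, int advice)
{
        return posix_fadvise (fd, offset, 0, advice);
}

/* Hypothetical helper: advise only on the range [offset, offset+len). */
int advise_range (int fd, off_t offset, off_t len, int advice)
{
        return posix_fadvise (fd, offset, len, advice);
}
```

For instance, advise_tail (fd, 4096, POSIX_FADV_SEQUENTIAL) would apply the hint to everything from byte 4096 onward, however long the file happens to be.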
The available advice options are similar to those for madvise(). Exactly one of the following should be provided for advice:
POSIX_FADV_NORMAL
The application has no specific advice to give on this
range of the file. It should be treated as normal.
POSIX_FADV_RANDOM
The application intends to access the data in the
specified range in a random (nonsequential) order.
POSIX_FADV_SEQUENTIAL
The application intends to access the data in the
specified range sequentially, from lower to higher
offsets.
POSIX_FADV_WILLNEED
The application intends to access the data in the
specified range in the near future.
POSIX_FADV_NOREUSE
The application intends to access the data in the
specified range in the near future, but only once.
POSIX_FADV_DONTNEED
The application does not intend to access the pages
in the specified range in the near future.
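As a sketch of how one of these hints might be used, the following function announces a front-to-back scan before reading a file. The function name scan_sequentially() and the 4 KB buffer size are assumptions made for illustration, and interrupted reads are not retried. Because the advice is only a hint, the sketch ignores the result of posix_fadvise() and reads the file either way.

```c
#define _XOPEN_SOURCE 600   /* asks glibc to declare posix_fadvise() */
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a file from start to end and return its size
 * in bytes, or -1 if the file cannot be opened or read.
 */
long long scan_sequentially (const char *path)
{
        char buf[4096];
        long long total = 0;
        ssize_t n;
        int fd;

        fd = open (path, O_RDONLY);
        if (fd == -1)
                return -1;

        /* A hint only: the read loop is correct even if this call fails. */
        (void) posix_fadvise (fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        while ((n = read (fd, buf, sizeof (buf))) > 0)
                total += n;

        close (fd);
        return n == -1 ? -1 : total;
}
```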
As with madvise(), the actual response to the given advice is implementation-specific; even different versions of the Linux kernel may react dissimilarly. The following are the responses at the time of writing:
POSIX_FADV_NORMAL
The kernel behaves as usual, performing a moderate
amount of readahead.
POSIX_FADV_RANDOM
The kernel disables readahead, reading only the
minimal amount of data on each physical read
operation.
POSIX_FADV_SEQUENTIAL
The kernel performs aggressive readahead, doubling
the size of the readahead window.
POSIX_FADV_WILLNEED
The kernel initiates readahead to begin reading into
memory the given pages.
POSIX_FADV_NOREUSE
Currently, the behavior is the same as for
POSIX_FADV_WILLNEED; future kernels may perform
an additional optimization to exploit the “use once”
behavior. This hint does not have an madvise()
complement.
POSIX_FADV_DONTNEED
The kernel evicts any cached data in the given range
from the page cache. Note that this hint, unlike the
others, is different in behavior from its madvise()
counterpart.
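The POSIX_FADV_DONTNEED response is often paired with a flush. On Linux, pages that are still dirty are left in the cache, so a program that wants its freshly written data evicted would typically write it out first. The sketch below assumes that pattern; the helper name flush_and_drop() is hypothetical.

```c
#define _XOPEN_SOURCE 600   /* asks glibc to declare posix_fadvise() */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write out any dirty data for fd, then ask the
 * kernel to drop the file's cached pages. Returns 0 on success or an
 * error number on failure.
 */
int flush_and_drop (int fd)
{
        /* Dirty pages are not evicted, so flush them to disk first. */
        if (fdatasync (fd) == -1)
                return errno;

        return posix_fadvise (fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

A program streaming a large file once, such as a backup tool, might call this after finishing with each file so that its data does not crowd out more useful cache contents.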
As an example, the following snippet instructs the kernel that the entire file represented by the file descriptor fd will be accessed in a random, nonsequential manner:
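int ret;

/* posix_fadvise() returns the error number itself; errno is not set */
ret = posix_fadvise (fd, 0, 0, POSIX_FADV_RANDOM);
if (ret != 0)
        fprintf (stderr, "posix_fadvise: %s\n", strerror (ret));

On success, posix_fadvise() returns 0. On failure, it returns the error number directly rather than returning -1 and setting errno. The possible values are: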
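EBADF
The given file descriptor is invalid.
EINVAL
The given advice is invalid, or the specified advice
cannot be applied to the given file. Linux kernels
before 2.6.16 also reported this error when the file
descriptor referred to a pipe.
ESPIPE
The given file descriptor refers to a pipe.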
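Since advice never changes what a program reads or writes, one possible design is to treat it as best-effort. The sketch below relies on the POSIX rule that posix_fadvise() reports failure through its return value, which is an error number, rather than through errno. The wrapper name advise_best_effort() is hypothetical, and the choice to tolerate EINVAL and ESPIPE is an assumption about what a caller might want; it would also hide a mistyped advice constant.

```c
#define _XOPEN_SOURCE 600   /* asks glibc to declare posix_fadvise() */
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical wrapper: returns 0 when the hint was accepted or simply
 * does not apply to this descriptor, and the error number otherwise
 * (for example EBADF).
 */
int advise_best_effort (int fd, int advice)
{
        int err = posix_fadvise (fd, 0, 0, advice);

        /* Pipes report ESPIPE on current kernels; older ones used EINVAL. */
        if (err == EINVAL || err == ESPIPE)
                return 0;

        return err;
}
```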
The readahead() System Call
The posix_fadvise() system call is new to the 2.6 Linux kernel. Before that, the readahead() system call was available to provide behavior identical to the POSIX_FADV_WILLNEED hint. Unlike posix_fadvise(), readahead() is a Linux-specific interface:
#define _GNU_SOURCE
#include <fcntl.h>

ssize_t readahead (int fd,
                   off64_t offset,
                   size_t count);
A call to readahead() populates the page cache with the region [offset,offset+count) from the file descriptor fd.
On success, readahead() returns 0. On failure, it returns -1, and errno is set to one of the following values:
EBADF
The given file descriptor is invalid.
EINVAL
The given file descriptor does not map to a file that
supports readahead.
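A minimal sketch of calling readahead() follows, assuming a glibc system where _GNU_SOURCE must be defined before any header is included for the declaration to be visible. The helper name prefetch_range() is hypothetical. Unlike posix_fadvise(), failure here is reported the conventional way, with -1 and errno.

```c
#define _GNU_SOURCE         /* readahead() is a Linux-specific extension */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: ask the kernel to pull [offset, offset+count) of
 * the file into the page cache. Returns 0 on success, or -1 with errno
 * set (EBADF or EINVAL) on failure.
 */
int prefetch_range (int fd, off64_t offset, size_t count)
{
        if (readahead (fd, offset, count) == -1)
                return -1;

        return 0;
}
```

Code that must also build on other POSIX systems could issue posix_fadvise (fd, offset, count, POSIX_FADV_WILLNEED) instead, which the text above describes as equivalent.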