Using mmap() for Advanced File I/O - Giving Advice on a Mapping
Linux provides a system call named madvise() to let processes give the kernel advice and hints on how they intend to use a mapping. The kernel can then optimize its behavior to take advantage of the mapping’s intended use. While the Linux kernel dynamically tunes its behavior, and generally provides optimal performance without explicit advice, providing such advice can ensure the desired caching and readahead behavior for some workloads.
A call to madvise() advises the kernel on how to behave with respect to the pages in the memory map starting at addr and extending for len bytes:
#include <sys/mman.h>
int madvise (void *addr,
size_t len,
int advice);
Linux accepts a len of 0, but the call then has no pages to act on and applies no advice. The parameter advice delineates the advice, which can be one of:
MADV_NORMAL
The application has no specific advice to give on this range of memory. It should be treated as normal.
MADV_RANDOM
The application intends to access the pages in the specified range in a random (nonsequential) order.
MADV_SEQUENTIAL
The application intends to access the pages in the specified range sequentially, from lower to higher addresses.
MADV_WILLNEED
The application intends to access the pages in the specified range in the near future.
MADV_DONTNEED
The application does not intend to access the pages in the specified range in the near future.
What the kernel actually does in response to this advice is implementation-specific: POSIX dictates only the meaning of the advice, not any potential consequences. The current 2.6 kernel behaves as follows in response to the advice values:
MADV_NORMAL
The kernel behaves as usual, performing a moderate amount of readahead.
MADV_RANDOM
The kernel disables readahead, reading only the minimal amount of data on each physical read operation.
MADV_SEQUENTIAL
The kernel performs aggressive readahead.
MADV_WILLNEED
The kernel initiates readahead, reading the given pages into memory.
MADV_DONTNEED
The kernel frees any resources associated with the given pages, and discards any dirty and not-yet-synchronized pages. Subsequent accesses to the mapped data will cause the data to be paged in from the backing file.
Typical usage is:
int ret;

ret = madvise (addr, len, MADV_SEQUENTIAL);
if (ret < 0)
        perror ("madvise");
This call instructs the kernel that the process intends to access the memory region [addr,addr+len) sequentially.
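To put that call in context, here is a rough sketch of a routine that maps a whole file, gives the sequential hint, and then reads every byte in order. The function name sum_file_sequential() is hypothetical, and the choice to treat a failed madvise() as non-fatal is an assumption of this sketch: the advice is only a hint, so the read can still proceed without it.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the file at path read-only, advise the
 * kernel that it will be read sequentially, and add up its bytes.
 * Returns 0 and stores the total in *sum on success, -1 on failure.
 */
int sum_file_sequential (const char *path, uint64_t *sum)
{
        struct stat sb;
        unsigned char *p;
        uint64_t total = 0;
        off_t i;
        int fd;

        fd = open (path, O_RDONLY);
        if (fd < 0)
                return -1;

        /* mmap() rejects zero-length mappings, so give up on empty files */
        if (fstat (fd, &sb) < 0 || sb.st_size == 0) {
                close (fd);
                return -1;
        }

        p = mmap (NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close (fd);             /* the mapping remains valid after close */
        if (p == MAP_FAILED)
                return -1;

        /* The advice is only a hint: report a failure, but keep going */
        if (madvise (p, sb.st_size, MADV_SEQUENTIAL) < 0)
                perror ("madvise");

        for (i = 0; i < sb.st_size; i++)
                total += p[i];

        munmap (p, sb.st_size);
        *sum = total;
        return 0;
}
```

Because mmap() returns a page-aligned address, the start of the mapping can be handed to madvise() without any adjustment.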
Readahead
When the Linux kernel reads files off the disk, it performs an optimization known as readahead. That is, when a request is made for a given chunk of a file, the kernel also reads the following chunk of the file. If a request is subsequently made for that chunk—as is the case when reading a file sequentially—the kernel can return the requested data immediately. Because disks have track buffers (basically, hard disks perform their own readahead internally), and because files are generally laid out sequentially on disk, this optimization is low-cost.
Some readahead is usually advantageous, but the best results depend on how much readahead is performed. A sequentially accessed file may benefit from a larger readahead window, while for a randomly accessed file readahead may be nothing but overhead.
As discussed in “Kernel Internals” in Chapter 2, the kernel dynamically tunes the size of the readahead window in response to the hit rate inside that window. More hits imply that a larger window would be advantageous; fewer hits suggest a smaller window. The madvise() system call allows applications to influence the window size right off the bat.
On success, madvise() returns 0. On failure, it returns -1, and errno is set appropriately. The following are valid errors:
EAGAIN
An internal kernel resource (probably memory) was unavailable. The process can try again.
EBADF
The region exists, but does not map a file.
EINVAL
The parameter len is negative, addr is not page-aligned, the advice parameter is invalid, or the pages were locked or shared with MADV_DONTNEED.
EIO
An internal I/O error occurred with MADV_WILLNEED.
ENOMEM
The given region is not a valid mapping in this process’ address space, or MADV_WILLNEED was given, but there is insufficient memory to page in the given regions.
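Two of these errors lend themselves to simple handling in code. The sketch below rounds the start of a range down to a page boundary to avoid the alignment case of EINVAL, and retries when EAGAIN is reported. The wrapper name advise_range() and the limit of three attempts are assumptions made for illustration, not anything madvise() prescribes.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: apply advice to the pages covering
 * [addr, addr+len). The start is rounded down to a page boundary
 * and the length grown to compensate, since madvise() rejects an
 * unaligned addr with EINVAL. EAGAIN is retried a few times.
 * Returns 0 on success, or -1 with errno set by madvise().
 */
int advise_range (void *addr, size_t len, int advice)
{
        uintptr_t page = (uintptr_t) sysconf (_SC_PAGESIZE);
        uintptr_t start = (uintptr_t) addr & ~(page - 1);
        size_t span = len + ((uintptr_t) addr - start);
        int tries;

        for (tries = 0; tries < 3; tries++) {
                if (madvise ((void *) start, span, advice) == 0)
                        return 0;
                if (errno != EAGAIN)
                        break;  /* not transient: let the caller decide */
        }
        return -1;
}
```

Note that rounding down widens the advised range to whole pages, so the advice also covers any bytes that share a page with the requested range. That is harmless for readahead hints, but deserves care with MADV_DONTNEED, which discards the contents of the pages it is given.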