Linux provides a system call named madvise() to let processes give the kernel advice and hints on how they intend to use a mapping. The kernel can then optimize its behavior to take advantage of the mapping’s intended use. While the Linux kernel dynamically tunes its behavior, and generally provides optimal performance without explicit advice, providing such advice can ensure the desired caching and readahead behavior for some workloads.
A call to madvise() advises the kernel on how to behave with respect to the pages in the memory map starting at addr, and extending for len bytes:
#include <sys/mman.h>
int madvise (void *addr, size_t len, int advice);
On Linux, a len of 0 is accepted, and the call returns successfully without applying any advice. The parameter advice specifies the advice, which can be one of:
MADV_NORMAL The application has no specific advice to give on this range of memory. It should be treated as normal.
MADV_RANDOM The application intends to access the pages in the specified range in a random (nonsequential) order.
MADV_SEQUENTIAL The application intends to access the pages in the specified range sequentially, from lower to higher addresses.
MADV_WILLNEED The application intends to access the pages in the specified range in the near future.
MADV_DONTNEED The application does not intend to access the pages in the specified range in the near future.
The actual changes in behavior that the kernel makes in response to this advice are implementation-specific: POSIX dictates only the meaning of the advice, not any potential consequences. The current 2.6 kernel behaves as follows in response to the advice values:
MADV_NORMAL The kernel behaves as usual, performing a moderate amount of readahead.
MADV_RANDOM The kernel disables readahead, reading only the minimal amount of data on each physical read operation.
MADV_SEQUENTIAL The kernel performs aggressive readahead.
MADV_WILLNEED The kernel initiates readahead, reading the given pages into memory.
MADV_DONTNEED The kernel frees any resources associated with the given pages, and discards any dirty and not-yet-synchronized pages. Subsequent accesses to the mapped data will cause the data to be paged in from the backing file.
Typical usage is:
int ret;

ret = madvise (addr, len, MADV_SEQUENTIAL);
if (ret < 0)
        perror ("madvise");
This call tells the kernel that the process intends to access the memory region [addr, addr+len) sequentially.
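To put the call in context, the following is a minimal sketch of mapping a file, advising sequential access, and then reading every byte in order. The helper name sum_file_sequential() is hypothetical and exists only for illustration; the _DEFAULT_SOURCE define is an assumption about glibc, where it keeps madvise() declared under strict compiler modes.

```c
#define _DEFAULT_SOURCE   /* assumption: glibc; keeps madvise() visible under strict -std= modes */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * sum_file_sequential() is a hypothetical helper, not a library function.
 * It maps the file at path, hints sequential access, and adds up every byte.
 * Returns 0 and stores the result in *sum on success, or -1 on failure.
 */
int sum_file_sequential (const char *path, unsigned long *sum)
{
        struct stat sb;
        unsigned char *p;
        unsigned long total = 0;
        off_t i;
        int fd;

        fd = open (path, O_RDONLY);
        if (fd < 0) {
                perror ("open");
                return -1;
        }

        if (fstat (fd, &sb) < 0) {
                perror ("fstat");
                close (fd);
                return -1;
        }

        /* mmap() rejects a zero-length mapping, so an empty file sums to 0 */
        if (sb.st_size == 0) {
                close (fd);
                *sum = 0;
                return 0;
        }

        p = mmap (NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close (fd);   /* the mapping remains valid after the descriptor is closed */
        if (p == MAP_FAILED) {
                perror ("mmap");
                return -1;
        }

        /* The hint is only advice: report a failure, but keep going */
        if (madvise (p, sb.st_size, MADV_SEQUENTIAL) < 0)
                perror ("madvise");

        /* Touch the pages from lower to higher addresses, as advised */
        for (i = 0; i < sb.st_size; i++)
                total += p[i];

        if (munmap (p, sb.st_size) < 0)
                perror ("munmap");

        *sum = total;
        return 0;
}
```

A failed madvise() is treated as nonfatal here because the advice affects only performance, not the correctness of the reads that follow.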
Readahead
When the Linux kernel reads files off the disk, it performs an optimization known as readahead. That is, when a request is made for a given chunk of a file, the kernel also reads the following chunk of the file. If a request is subsequently made for that chunk—as is the case when reading a file sequentially—the kernel can return the requested data immediately. Because disks have track buffers (basically, hard disks perform their own readahead internally), and because files are generally laid out sequentially on disk, this optimization is low-cost.
Some readahead is usually advantageous, but optimal results depend on the question of how much readahead to perform. A sequentially accessed file may benefit from a larger readahead window, while a randomly accessed file may find readahead to be worthless overhead.
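When an application knows its access pattern ahead of time, it can request readahead explicitly rather than rely on the kernel's window. The sketch below walks a mapping in chunks, using MADV_WILLNEED on the next chunk and MADV_DONTNEED on the chunk just consumed. The function scan_in_chunks() and its pages_per_chunk parameter are hypothetical names chosen for this illustration, and the approach assumes a read-only, file-backed mapping whose start address came from mmap() and is therefore page-aligned.

```c
#define _DEFAULT_SOURCE   /* assumption: glibc; keeps madvise() visible under strict -std= modes */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * scan_in_chunks() is a hypothetical helper. It sums the bytes of a mapping
 * chunk by chunk, prefetching the next chunk and releasing the finished one.
 * map must be page-aligned. Do not use this on data that must survive the
 * scan in memory: MADV_DONTNEED discards the pages once they are consumed.
 */
unsigned long scan_in_chunks (unsigned char *map, size_t len,
                              size_t pages_per_chunk)
{
        size_t page = (size_t) sysconf (_SC_PAGESIZE);
        size_t chunk, off, i;
        unsigned long total = 0;

        if (pages_per_chunk == 0)
                pages_per_chunk = 1;
        chunk = page * pages_per_chunk;   /* chunk starts stay page-aligned */

        for (off = 0; off < len; off += chunk) {
                size_t n = (len - off < chunk) ? len - off : chunk;
                size_t next = off + chunk;

                /* Ask the kernel to start reading the following chunk */
                if (next < len) {
                        size_t next_n = (len - next < chunk) ? len - next : chunk;

                        if (madvise (map + next, next_n, MADV_WILLNEED) < 0)
                                perror ("madvise (MADV_WILLNEED)");
                }

                for (i = 0; i < n; i++)
                        total += map[off + i];

                /* This chunk will not be read again, so let the kernel drop it */
                if (madvise (map + off, n, MADV_DONTNEED) < 0)
                        perror ("madvise (MADV_DONTNEED)");
        }

        return total;
}
```

Whether this beats the kernel's own readahead depends on the workload and the storage device, so it is worth measuring before adopting it.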
On success, madvise() returns 0. On failure, it returns -1, and errno is set appropriately. The following are valid errors:
EAGAIN An internal kernel resource (probably memory) was unavailable. The process can try again.
EBADF The region exists, but does not map a file.
EINVAL The parameter len is negative, addr is not page-aligned, the advice parameter is invalid, or the pages were locked or shared with MADV_DONTNEED.
EIO An internal I/O error occurred with MADV_WILLNEED.
ENOMEM The given region is not a valid mapping in this process's address space, or MADV_WILLNEED was given, but there is insufficient memory to page in the given regions.
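The error codes above suggest a small wrapper, sketched here under some assumptions. The name advise_range(), the retry limit of three, and the wording of the messages are all hypothetical. The wrapper rounds addr down to a page boundary to avoid the alignment case of EINVAL, retries on EAGAIN, and reports the remaining errors.

```c
#define _DEFAULT_SOURCE   /* assumption: glibc; keeps madvise() visible under strict -std= modes */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * advise_range() is a hypothetical wrapper around madvise().
 * It widens the range so that it starts on a page boundary, retries a few
 * times on EAGAIN, and returns 0 on success or -1 with errno preserved.
 * Widening is harmless for hints such as MADV_SEQUENTIAL, but with
 * MADV_DONTNEED it would also cover the start of the first page, so pass
 * an aligned address in that case.
 */
int advise_range (void *addr, size_t len, int advice)
{
        uintptr_t page = (uintptr_t) sysconf (_SC_PAGESIZE);
        uintptr_t start = (uintptr_t) addr & ~(page - 1);
        size_t span = len + (size_t) ((uintptr_t) addr - start);
        int tries = 3;   /* arbitrary limit chosen for this sketch */
        int ret, saved;

        do {
                ret = madvise ((void *) start, span, advice);
        } while (ret < 0 && errno == EAGAIN && --tries > 0);

        if (ret < 0) {
                saved = errno;   /* fprintf() may overwrite errno */
                switch (saved) {
                case EINVAL:
                        fprintf (stderr, "madvise: invalid length or advice\n");
                        break;
                case ENOMEM:
                        fprintf (stderr, "madvise: range not mapped, or out of memory\n");
                        break;
                case EBADF:
                        fprintf (stderr, "madvise: region does not map a file\n");
                        break;
                default:
                        errno = saved;
                        perror ("madvise");
                        break;
                }
                errno = saved;
        }

        return ret;
}
```

A caller can treat a -1 return as a lost optimization rather than a fatal error, since the mapping itself remains usable.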
Please check back next week for the continuation of this article.