Using mmap() for Advanced File I/O

In this fourth part of a seven-part series on Linux I/O file system calls, you will learn how to use mmap(). It is excerpted from chapter four of the book Linux System Programming: Talking Directly to the Kernel and C Library, written by Robert Love (O’Reilly, 2007; ISBN: 0596009585). Copyright © 2007 O’Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O’Reilly Media.

Mapping Example

Let’s consider a simple example program that uses mmap() to print a file chosen by the user to standard out:

  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/mman.h>

  int main (int argc, char *argv[])
  {
          struct stat sb;
          off_t len;
          char *p;
          int fd;

          if (argc < 2) {
                  fprintf (stderr, "usage: %s <file>\n", argv[0]);
                  return 1;
          }

          fd = open (argv[1], O_RDONLY);
          if (fd == -1) {
                  perror ("open");
                  return 1;
          }

          /* obtain the file's size and type */
          if (fstat (fd, &sb) == -1) {
                  perror ("fstat");
                  return 1;
          }

          if (!S_ISREG (sb.st_mode)) {
                  fprintf (stderr, "%s is not a file\n", argv[1]);
                  return 1;
          }

          /* map the whole file, read-only */
          p = mmap (0, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED) {
                  perror ("mmap");
                  return 1;
          }

          /* the mapping remains valid after the descriptor is closed */
          if (close (fd) == -1) {
                  perror ("close");
                  return 1;
          }

          for (len = 0; len < sb.st_size; len++)
                  putchar (p[len]);

          if (munmap (p, sb.st_size) == -1) {
                  perror ("munmap");
                  return 1;
          }

          return 0;
  }

The only unfamiliar system call in this example should be fstat(), which we will cover in Chapter 7. All you need to know at this point is that fstat() returns information about a given file. The S_ISREG() macro can check some of this information, so that we can ensure that the given file is a regular file (as opposed to a device file or a directory) before we map it. The behavior of nonregular files when mapped depends on the backing device. Some device files are mmap-able; other nonregular files are not mmap-able, and attempting to map them fails with errno set to EACCES.

The rest of the example should be straightforward. The program is passed a filename as an argument. It opens the file, ensures it is a regular file, maps it, closes it, prints the file byte-by-byte to standard out, and then unmaps the file from memory.

Advantages of mmap()

Manipulating files via mmap() has a handful of advantages over the standard read() and write() system calls. Among them are:

  1. Reading from and writing to a memory-mapped file avoids the extraneous copy that occurs when using the read() or write() system calls, where the data must be copied to and from a user-space buffer.
  2. Aside from any potential page faults, reading from and writing to a memory-mapped file does not incur any system call or context switch overhead. It is as simple as accessing memory.
  3. When multiple processes map the same object into memory, the data is shared among all the processes. Read-only and shared writable mappings are shared in their entirety; private writable mappings have their not-yet-COW (copy-on-write) pages shared.
  4. Seeking around the mapping involves trivial pointer manipulations. There is no need for the lseek() system call.

For these reasons, mmap() is a smart choice for many applications.
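To illustrate the fourth point, here is a minimal sketch of "seeking" within a mapping using nothing but pointer arithmetic. The helper name read_at() and its parameters are hypothetical, not part of any library; it assumes map points to a region of map_len readable bytes, such as one returned by mmap().

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy n bytes starting at offset out of a mapping that is map_len
 * bytes long (hypothetical helper). No lseek() or read() is involved;
 * the "seek" is just map + offset.
 * Returns 0 on success, or -1 if the request falls outside the mapping.
 */
int read_at (const char *map, size_t map_len,
             size_t offset, void *buf, size_t n)
{
        /* written this way so that offset + n cannot overflow */
        if (offset > map_len || n > map_len - offset)
                return -1;

        memcpy (buf, map + offset, n);
        return 0;
}
```

With read() the same access would typically take an lseek() followed by a read(), or a single pread(); here the only cost beyond the copy is a possible page fault.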

Disadvantages of mmap()

There are a few points to keep in mind when using mmap():

  1. Memory mappings are always an integer number of pages in size. Thus, the difference between the size of the backing file and an integer number of pages is “wasted” as slack space. For small files, a significant percentage of the mapping may be wasted. For example, with 4 KB pages, a 7 byte mapping wastes 4,089 bytes.
  2. The memory mappings must fit into the process’ address space. With a 32-bit address space, a very large number of various-sized mappings can result in fragmentation of the address space, making it hard to find large free contiguous regions. This problem, of course, is much less apparent with a 64-bit address space.
  3. There is overhead in creating and maintaining the memory mappings and associated data structures inside the kernel. This overhead is generally obviated by the elimination of the double copy mentioned in the previous section, particularly for larger and frequently accessed files.

For these reasons, the benefits of mmap() are most greatly realized when the mapped file is large (and thus any wasted space is a small percentage of the total mapping), or when the total size of the mapped file is evenly divisible by the page size (and thus there is no wasted space).
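The slack described above is easy to compute. The following is a small sketch, with hypothetical helper names, that reproduces the 7-byte example and queries the real page size through sysconf() rather than assuming 4 KB.

```c
#include <stdio.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Bytes of slack at the end of the final page when len bytes are
 * mapped using pages of page_size bytes (hypothetical helper).
 */
size_t mapping_slack (size_t len, size_t page_size)
{
        size_t remainder = len % page_size;

        return remainder ? page_size - remainder : 0;
}

/* Print the slack for a len-byte mapping on the running system. */
void print_slack (size_t len)
{
        long page_size = sysconf (_SC_PAGESIZE);

        if (page_size == -1) {
                perror ("sysconf");
                return;
        }

        printf ("%zu-byte mapping wastes %zu bytes\n",
                len, mapping_slack (len, (size_t) page_size));
}
```

For instance, mapping_slack (7, 4096) yields 4089, matching the figure given in the first point, while a file whose size is an exact multiple of the page size yields 0.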

Resizing a Mapping

Linux provides the mremap() system call for expanding or shrinking the size of a given mapping. This function is Linux-specific:

  #define _GNU_SOURCE

  #include <unistd.h>
  #include <sys/mman.h>

  void * mremap (void *addr, size_t old_size,
                 size_t new_size, unsigned long flags);

A call to mremap() expands or shrinks the mapping in the region [addr,addr+old_size) to the new size new_size. The kernel can potentially move the mapping at the same time, depending on the availability of space in the process’ address space and the value of flags.

The opening [ in [addr,addr+old_size) indicates that the region starts with (and includes) the low address, whereas the closing ) indicates that the region stops just before (does not include) the high address. This convention is known as interval notation.

The flags parameter can be either 0 or MREMAP_MAYMOVE, which specifies that the kernel is free to move the mapping, if required, in order to perform the requested resizing. A large resizing is more likely to succeed if the kernel can move the mapping.

On success, mremap() returns a pointer to the newly resized memory mapping. On failure, it returns MAP_FAILED, and sets errno to one of the following:

 

EAGAIN
   The memory region is locked, and cannot be resized.

EFAULT
   Some pages in the given range are not valid pages in
   the process’ address space, or there was a problem
   remapping the given pages.

EINVAL
   An argument was invalid.

ENOMEM
   The given range cannot be expanded without moving
   (and MREMAP_MAYMOVE was not given), or there is
   not enough free space in the process’ address space.
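As a rough illustration of the call and its error handling, the sketch below creates a one-page anonymous mapping, stores a string in it, and grows it to two pages. The function name grow_demo() is hypothetical, and the use of MAP_ANONYMOUS (covered elsewhere in the book, not in this excerpt) is an assumption made to keep the example free of files.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* must be defined before any #include */
#endif

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/*
 * Create a one-page anonymous mapping, write to it, and grow it to
 * two pages (hypothetical demo function).
 * Returns the resized mapping, or NULL on failure.
 */
char * grow_demo (void)
{
        long page = sysconf (_SC_PAGESIZE);
        char *p, *q;

        if (page == -1)
                return NULL;

        p = mmap (NULL, page, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror ("mmap");
                return NULL;
        }
        strcpy (p, "hello");

        q = mremap (p, page, 2 * page, MREMAP_MAYMOVE);
        if (q == MAP_FAILED) {
                perror ("mremap");
                munmap (p, page);       /* the old mapping is still intact */
                return NULL;
        }

        /* p must not be used from here on: the mapping may have moved */
        q[2 * page - 1] = '!';          /* the second page is now usable */
        return q;
}
```

The existing contents survive the resize whether or not the kernel moves the mapping, which is why any pointers into the old region have to be recomputed from the returned address.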

Libraries such as glibc often use mremap() to implement an efficient realloc(), which is an interface for resizing a block of memory originally obtained via malloc(). For example:

  void * realloc (void *addr, size_t len)
  {
          size_t old_size = look_up_mapping_size (addr);
          void *p;

          p = mremap (addr, old_size, len, MREMAP_MAYMOVE);
          if (p == MAP_FAILED)
                  return NULL;
          return p;
  }

This would only work if all malloc() allocations were unique anonymous mappings; nonetheless, it stands as a useful example of the performance gains to be had. The example assumes the programmer has written a look_up_mapping_size() function.

The GNU C library does use mmap() and family for performing some memory allocations. We will look at that topic in depth in Chapter 8.

Changing the Protection of a Mapping

POSIX defines the mprotect() interface to allow programs to change the permissions of existing regions of memory:

  #include <sys/mman.h>

  int mprotect (const void *addr,
                size_t len,
                int prot);

A call to mprotect() will change the protection mode for the memory pages contained in [addr,addr+len), where addr is page-aligned. The prot parameter accepts the same values as the prot given to mmap(): PROT_NONE, PROT_READ, PROT_WRITE, and PROT_EXEC. These values are not additive; if a region of memory is readable, and prot is set to only PROT_WRITE, the call will make the region only writable.

On some systems, mprotect() may operate only on memory mappings previously created via mmap(). On Linux, mprotect() can operate on any region of memory.

On success, mprotect() returns 0. On failure, it returns -1, and sets errno to one of the following:

EACCES

The memory cannot be given the permissions requested by prot. This can happen, for example, if you attempt to set the mapping of a file opened read-only to writable.

EINVAL

The parameter addr is invalid or not page-aligned.

ENOMEM

Insufficient kernel memory is available to satisfy the request, or one or more pages in the given memory region are not a valid part of the process’ address space.

Synchronizing a File with a Mapping

POSIX provides a memory-mapped equivalent of the fsync() system call that we discussed in Chapter 2:

  #include <sys/mman.h>

  int msync (void *addr, size_t len, int flags);

A call to msync() flushes back to disk any changes made to a file mapped via mmap(), synchronizing the mapped file with the mapping. Specifically, the file or subset of a file associated with the mapping starting at memory address addr and continuing for len bytes is synchronized to disk. The addr argument must be page-aligned; it is generally the return value from a previous mmap() invocation.

Without invocation of msync(), there is no guarantee that a dirty mapping will be written back to disk until the file is unmapped. This is different from the behavior of write(), where a buffer is dirtied as part of the writing process, and queued for writeback to disk. When writing into a memory mapping, the process directly modifies the file’s pages in the kernel’s page cache, without kernel involvement. The kernel may not synchronize the page cache and the disk anytime soon.

The flags parameter controls the behavior of the synchronizing operation. It is a bitwise OR of the following values:

MS_ASYNC

Specifies that synchronization should occur asynchronously. The update is scheduled, but the msync() call returns immediately without waiting for the writes to take place.

MS_INVALIDATE

Specifies that all other cached copies of the mapping be invalidated. Any future access to any mappings of this file will reflect the newly synchronized on-disk contents.

MS_SYNC

Specifies that synchronization should occur synchronously. The msync() call will not return until all pages are written back to disk.

Either MS_ASYNC or MS_SYNC must be specified, but not both.

Usage is simple:

  if (msync (addr, len, MS_ASYNC) == -1)
          perror ("msync");

This example asynchronously synchronizes (say that 10 times fast) to disk the file mapped in the region [addr,addr+len).

On success, msync() returns 0. On failure, the call returns -1, and sets errno appropriately. The following are valid errno values:

EINVAL

The flags parameter has both MS_SYNC and MS_ASYNC set, a bit other than one of the three valid flags is set, or addr is not page-aligned.

ENOMEM

The given memory region (or part of it) is not mapped. Note that Linux will return ENOMEM, as POSIX dictates, when asked to synchronize a region that is only partly unmapped, but it will still synchronize any valid mappings in the region.

Before version 2.4.19 of the Linux kernel, msync() returned EFAULT in place of ENOMEM.
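Putting the pieces together, the following sketch modifies a file through a shared writable mapping and then flushes it with MS_SYNC. The function name overwrite_and_sync() is hypothetical, and the code assumes the file at path already exists and is at least n bytes long, since touching mapped pages that lie wholly beyond the end of the file would raise SIGBUS.

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

/*
 * Overwrite the first n bytes of the file at path through a shared
 * mapping, then synchronously flush the change (hypothetical helper).
 * Assumes the file is at least n bytes long and n is nonzero.
 * Returns 0 on success, -1 on failure.
 */
int overwrite_and_sync (const char *path, const char *data, size_t n)
{
        char *p;
        int fd, ret = 0;

        fd = open (path, O_RDWR);
        if (fd == -1) {
                perror ("open");
                return -1;
        }

        p = mmap (NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror ("mmap");
                close (fd);
                return -1;
        }
        close (fd);     /* the mapping outlives the descriptor */

        memcpy (p, data, n);    /* dirties the file's pages */

        /* do not return until the dirty pages have been written back */
        if (msync (p, n, MS_SYNC) == -1) {
                perror ("msync");
                ret = -1;
        }

        if (munmap (p, n) == -1) {
                perror ("munmap");
                ret = -1;
        }

        return ret;
}
```

MS_SYNC is used here because the caller wants to know the data has been written back before continuing; swapping in MS_ASYNC would only schedule the writeback.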

Giving Advice on a Mapping

Linux provides a system call named madvise() to let processes give the kernel advice and hints on how they intend to use a mapping. The kernel can then optimize its behavior to take advantage of the mapping’s intended use. While the Linux kernel dynamically tunes its behavior, and generally provides optimal performance without explicit advice, providing such advice can ensure the desired caching and readahead behavior for some workloads.

A call to madvise() advises the kernel on how to behave with respect to the pages in the memory map starting at addr, and extending for len bytes:

  #include <sys/mman.h>

  int madvise (void *addr,
               size_t len,
               int advice);

If len is 0, the kernel will apply the advice to the entire mapping that starts at addr. The parameter advice delineates the advice, which can be one of:

MADV_NORMAL
   The application has no specific advice to give on this
   range of memory. It should be treated as normal.

MADV_RANDOM
   The application intends to access the pages in the
   specified range in a random (nonsequential) order.

MADV_SEQUENTIAL
   The application intends to access the pages in the
   specified range sequentially, from lower to higher
   addresses.

MADV_WILLNEED
   The application intends to access the pages in the
   specified range in the near future.

MADV_DONTNEED
   The application does not intend to access the pages
   in the specified range in the near future.

The actual behavior modifications that the kernel takes in response to this advice are implementation-specific: POSIX dictates only the meaning of the advice, not any potential consequences. The current 2.6 kernel behaves as follows in response to the advice values:

MADV_NORMAL
   The kernel behaves as usual, performing a moderate
   amount of readahead.

MADV_RANDOM
   The kernel disables readahead, reading only the 
   minimal amount of data on each physical read
   operation.

MADV_SEQUENTIAL
   The kernel performs aggressive readahead.

MADV_WILLNEED
   The kernel initiates readahead, reading the given
   pages into memory.

MADV_DONTNEED
   The kernel frees any resources associated with the
   given pages, and discards any dirty and not-yet-
   synchronized pages. Subsequent accesses to the
   mapped data will cause the data to be paged in from
   the backing file.

Typical usage is:

  int ret;

  ret = madvise (addr, len, MADV_SEQUENTIAL);
  if (ret < 0)
          perror ("madvise");

This call instructs the kernel that the process intends to access the memory region [addr,addr+len) sequentially.


Readahead

When the Linux kernel reads files off the disk, it performs an optimization known as readahead. That is, when a request is made for a given chunk of a file, the kernel also reads the following chunk of the file. If a request is subsequently made for that chunk—as is the case when reading a file sequentially—the kernel can return the requested data immediately. Because disks have track buffers (basically, hard disks perform their own readahead internally), and because files are generally laid out sequentially on disk, this optimization is low-cost.

Some readahead is usually advantageous, but optimal results depend on the question of how much readahead to perform. A sequentially accessed file may benefit from a larger readahead window, while a randomly accessed file may find readahead to be worthless overhead.

As discussed in “Kernel Internals” in Chapter 2, the kernel dynamically tunes the size of the readahead window in response to the hit rate inside that window. More hits imply that a larger window would be advantageous; fewer hits suggest a smaller window. The madvise() system call allows applications to influence the window size right off the bat.


On success, madvise() returns 0. On failure, it returns -1, and errno is set appropriately. The following are valid errors:

EAGAIN
   An internal kernel resource (probably memory) was
   unavailable. The process can try again.

EBADF
   The region exists, but does not map a file.

EINVAL
   The parameter len is negative, addr is not page-aligned,
   the advice parameter is invalid, or the pages were
   locked or shared with MADV_DONTNEED.

EIO
   An internal I/O error occurred with MADV_WILLNEED.

ENOMEM 
   The given region is not a valid mapping in this
   process’ address space, or MADV_WILLNEED was
   given, but there is insufficient memory to page in the
   given regions.

Please check back next week for the continuation of this article. 
