The MMAP System Call in Linux

In this third part of a seven-part series on Linux I/O file system calls, you’ll learn how to use the mmap() system call, which will give you some flexibility when handling files. This article is excerpted from chapter four of the book Linux System Programming: Talking Directly to the Kernel and C Library, written by Robert Love (O’Reilly, 2007; ISBN: 0596009585). Copyright © 2007 O’Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O’Reilly Media.

Mapping Files into Memory

As an alternative to standard file I/O, the kernel provides an interface that allows an application to map a file into memory, meaning that there is a one-to-one correspondence between a memory address and a word in the file. The programmer can then access the file directly through memory, identically to any other chunk of memory-resident data—it is even possible to allow writes to the memory region to transparently map back to the file on disk.

POSIX.1 standardizes, and Linux implements, the mmap() system call for mapping objects into memory. This section will discuss mmap() as it pertains to mapping files into memory to perform I/O; in Chapter 8, we will visit other applications of mmap().

mmap()

A call to mmap() asks the kernel to map len bytes of the object represented by the file descriptor fd, starting at offset bytes into the file, into memory. If addr is included, it indicates a preference to use that starting address in memory. The access permissions are dictated by prot, and additional behavior can be given by flags:

  #include <sys/mman.h>

  void * mmap (void *addr,
               size_t len,
               int prot,
               int flags,
               int fd,
               off_t offset);

The addr parameter offers a suggestion to the kernel of where best to map the file. It is only a hint; most users pass 0. The call returns the actual address in memory where the mapping begins.

The prot parameter describes the desired memory protection of the mapping. It may be either PROT_NONE, in which case the pages in this mapping may not be accessed (making little sense!), or a bitwise OR of one or more of the following flags:

PROT_READ
   The pages may be read.

PROT_WRITE
   The pages may be written.

PROT_EXEC
   The pages may be executed.

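The desired memory protection must not conflict with the open mode of the file. For example, if the program opens the file read-only, prot must not specify PROT_WRITE for a shared mapping (a private, copy-on-write mapping never writes back to the file, so Linux permits PROT_WRITE there).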
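
The conflict is easy to provoke. The following is a minimal sketch, not code from the book: the helper name try_writable_map_of_readonly_fd() and the scratch path under /tmp are assumptions made for illustration. It reopens a file read-only, asks for a shared writable mapping, and reports the errno that mmap() sets, which on Linux should be EACCES:

```c
#define _XOPEN_SOURCE 700       /* for mkstemp() and ftruncate() */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a scratch file, reopen it read-only, and
 * request a shared, writable mapping of it. Returns the errno left by
 * mmap(), 0 if the call unexpectedly succeeds, or -1 if setup fails.
 */
int try_writable_map_of_readonly_fd (void)
{
        char path[] = "/tmp/mmap-prot-XXXXXX";  /* assumed scratch location */
        void *p;
        int fd, err = 0;

        fd = mkstemp (path);
        if (fd == -1)
                return -1;
        if (ftruncate (fd, 4096) == -1) {
                close (fd);
                unlink (path);
                return -1;
        }
        close (fd);

        /* Reopen the same file, this time read-only. */
        fd = open (path, O_RDONLY);
        if (fd == -1) {
                unlink (path);
                return -1;
        }

        /* PROT_WRITE conflicts with O_RDONLY for a shared mapping. */
        p = mmap (NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                err = errno;
        else
                munmap (p, 4096);

        close (fd);
        unlink (path);
        return err;
}
```

Dropping PROT_WRITE from the request should let the same call succeed.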


Protection Flags, Architectures, and Security

While POSIX defines four protection bits (read, write, execute, and stay the heck away), some architectures support only a subset of these. It is common, for example, for a processor to not differentiate between the actions of reading and executing. In that case, the processor may have only a single “read” flag. On those systems, PROT_READ implies PROT_EXEC. Until recently, the x86 architecture was one such system.

Of course, relying on such behavior is not portable. Portable programs should always set PROT_EXEC if they intend to execute code in the mapping.

The reverse situation is one reason for the prevalence of buffer overflow attacks: even if a given mapping does not specify execution permission, the processor may allow execution anyway.

Recent x86 processors have introduced the NX (no-execute) bit, which allows for readable, but not executable, mappings. On these newer systems, PROT_READ no longer implies PROT_EXEC.


The flags argument describes the type of mapping, and some elements of its behavior. It is a bitwise OR of the following values:

MAP_FIXED

Instructs mmap() to treat addr as a requirement, not a hint. If the kernel is unable to place the mapping at the given address, the call fails. If the address and length parameters overlap an existing mapping, the overlapped pages are discarded and replaced by the new mapping. As this option requires intimate knowledge of the process address space, it is nonportable, and its use is discouraged.

MAP_PRIVATE

States that the mapping is not shared. The file is mapped copy-on-write, and any changes made in memory by this process are not reflected in the actual file, or in the mappings of other processes.

MAP_SHARED

Shares the mapping with all other processes that map this same file. Writing into the mapping is equivalent to writing to the file. Reads from the mapping will reflect the writes of other processes.

Either MAP_SHARED or MAP_PRIVATE must be specified, but not both. Other, more advanced flags are discussed in Chapter 8.
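
To make the difference between the two flags concrete, here is a small sketch under stated assumptions: the helper file_byte_after_mapped_store() and its scratch file are hypothetical. It also relies on the Linux behavior that a store through a shared mapping is immediately visible to pread() on the same file. The helper stores a byte through a mapping and then reads the file back through the descriptor:

```c
#define _XOPEN_SOURCE 700       /* for mkstemp() and pread() */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a one-byte scratch file holding 'A', store
 * 'B' through a mapping created with the given flags (MAP_PRIVATE or
 * MAP_SHARED), and return the byte that pread() then finds in the file.
 * Returns -1 if any step fails.
 */
int file_byte_after_mapped_store (int flags)
{
        char path[] = "/tmp/mmap-flags-XXXXXX"; /* assumed scratch location */
        char c = 'A';
        char *p;
        int fd, ret = -1;

        fd = mkstemp (path);
        if (fd == -1)
                return -1;
        unlink (path);          /* the open descriptor keeps the file alive */

        if (write (fd, &c, 1) != 1)
                goto out;

        p = mmap (NULL, 1, PROT_READ | PROT_WRITE, flags, fd, 0);
        if (p == MAP_FAILED)
                goto out;

        p[0] = 'B';             /* store through the mapping */

        /* Read the file itself, bypassing the mapping. */
        if (pread (fd, &c, 1, 0) == 1)
                ret = c;

        munmap (p, 1);
out:
        close (fd);
        return ret;
}
```

With MAP_PRIVATE the file should still hold 'A', because the store landed in a copy-on-write page; with MAP_SHARED the file should now hold 'B'.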

When you map a file descriptor, the file’s reference count is incremented. Therefore, you can close the file descriptor after mapping the file, and your process will still have access to it. The corresponding decrement of the file’s reference count will occur when you unmap the file, or when the process terminates.
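
A short sketch of that behavior follows. The helper name mapping_survives_close() and the scratch file are assumptions for illustration only. It closes the descriptor immediately after mmap() returns and then reads the file's contents through the mapping:

```c
#define _XOPEN_SOURCE 700       /* for mkstemp() */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a scratch file, close its descriptor, and then
 * read through the mapping. Returns 1 if the contents are still intact,
 * 0 if they are not, and -1 if setup fails.
 */
int mapping_survives_close (void)
{
        char path[] = "/tmp/mmap-close-XXXXXX"; /* assumed scratch location */
        const char msg[] = "still here";
        size_t len = sizeof (msg);
        char *p;
        int fd, same;

        fd = mkstemp (path);
        if (fd == -1)
                return -1;
        unlink (path);

        if (write (fd, msg, len) != (ssize_t) len) {
                close (fd);
                return -1;
        }

        p = mmap (NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        close (fd);             /* the mapping holds its own reference */
        if (p == MAP_FAILED)
                return -1;

        same = (memcmp (p, msg, len) == 0);

        munmap (p, len);        /* drops the mapping's reference */
        return same;
}
```

Closing early like this keeps the descriptor table small in programs that map many files.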

As an example, the following snippet maps the file backed by fd, beginning with its first byte, and extending for len bytes, into a read-only mapping:

  void *p;

  p = mmap (0, len, PROT_READ, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED)
          perror ("mmap");

Figure 4-1 shows the effects of the parameters supplied to mmap() on the mapping between a file and a process’ address space.


Figure 4-1.  Mapping a file into a process’ address space

The page size

The page is the smallest unit of memory that can have distinct permissions and behavior. Consequently, the page is the building block of memory mappings, which in turn are the building blocks of the process address space.

The mmap() system call operates on pages. Both the addr and offset parameters must be aligned on a page-sized boundary. That is, they must be integer multiples of the page size.

Mappings are, therefore, integer multiples of pages. If the len parameter provided by the caller is not aligned on a page boundary (perhaps because the underlying file’s size is not a multiple of the page size), the mapping is rounded up to the next full page. The bytes inside this added memory, between the last valid byte and the end of the mapping, are zero-filled. Any read from that region will return zeros. Any writes to that memory will not affect the backing file, even if it is mapped as MAP_SHARED. Only the original len bytes are ever written back to the file.
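
The zero-filled tail can be observed directly. The sketch below is illustrative only: nonzero_bytes_in_tail() is a hypothetical helper, and the page size is taken from sysconf(_SC_PAGESIZE). It maps a short file and counts any nonzero bytes between the end of the file and the end of the last mapped page:

```c
#define _XOPEN_SOURCE 700       /* for mkstemp() */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a scratch file of len bytes of 'x', map it,
 * and count the nonzero bytes found between the end of the file and the
 * end of the last mapped page. Returns -1 if setup fails.
 */
long nonzero_bytes_in_tail (size_t len)
{
        char path[] = "/tmp/mmap-tail-XXXXXX";  /* assumed scratch location */
        long page = sysconf (_SC_PAGESIZE);
        size_t pg, mapped, i;
        long count = 0;
        char *p;
        int fd;

        if (page <= 0 || len == 0)
                return -1;
        pg = (size_t) page;

        fd = mkstemp (path);
        if (fd == -1)
                return -1;
        unlink (path);

        /* Fill the file one byte at a time; fine for a small sketch. */
        for (i = 0; i < len; i++) {
                if (write (fd, "x", 1) != 1) {
                        close (fd);
                        return -1;
                }
        }

        p = mmap (NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        close (fd);
        if (p == MAP_FAILED)
                return -1;

        /* The kernel rounds the mapping up to a whole number of pages. */
        mapped = (len + pg - 1) / pg * pg;
        for (i = len; i < mapped; i++)
                if (p[i] != 0)
                        count++;

        munmap (p, mapped);
        return count;
}
```

For a 10-byte file, the count should come back as 0: everything past the tenth byte of the page reads as zero.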

sysconf(). The standard POSIX method of obtaining the page size is with sysconf(), which can retrieve a variety of system-specific information:

  #include <unistd.h>

  long sysconf (int name);

A call to sysconf() returns the value of the configuration item name, or -1 if name is invalid. On error, the call sets errno to EINVAL. Because -1 may be a valid value for some items (e.g., limits, where -1 means no limit), it may be wise to clear errno before invocation, and check its value after.

POSIX defines _SC_PAGESIZE (and a synonym, _SC_PAGE_SIZE) to be the size of a page, in bytes. Therefore, getting the page size is simple:

  long page_size = sysconf (_SC_PAGESIZE);
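
The advice about clearing errno can be wrapped in a small function. This is one possible sketch; the name query_sysconf() and its return convention are assumptions, not part of any standard interface:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: look up the configuration item name and store the
 * result in *value. Returns 0 on success, including the legitimate answer
 * of -1 ("no limit"), and -1 if sysconf() reported an error.
 */
int query_sysconf (int name, long *value)
{
        errno = 0;                      /* clear before the call */
        *value = sysconf (name);
        if (*value == -1 && errno != 0)
                return -1;              /* a real error, such as EINVAL */
        return 0;
}
```

A caller that only wants the page size can pass _SC_PAGESIZE and check for a zero return before using the value.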

getpagesize(). Linux also provides the getpagesize() function:

  #include <unistd.h>

  int getpagesize (void);

A call to getpagesize() will likewise return the size of a page, in bytes. Usage is even simpler than sysconf():

 

  int page_size = getpagesize ();

Not all Unix systems support this function; it’s been dropped from the 1003.1-2001 revision of the POSIX standard. It is included here for completeness.

PAGE_SIZE. The page size is also stored statically in the macro PAGE_SIZE, which is defined in <asm/page.h>. Thus, a third possible way to retrieve the page size is:

  int page_size = PAGE_SIZE;

Unlike the first two options, however, this approach retrieves the system page size at compile-time, and not runtime. Some architectures support multiple machine types with different page sizes, and some machine types even support multiple page sizes themselves! A single binary should be able to run on all machine types in a given architecture; that is, you should be able to build it once and run it everywhere. Hard-coding the page size would nullify that possibility. Consequently, you should determine the page size at runtime. Because addr and offset are usually 0, this requirement is not overly difficult to meet.

Moreover, future kernel versions will likely not export this macro to user space. We cover it in this chapter due to its frequent presence in Unix code, but you should not use it in your own programs. The sysconf() approach is your best bet.
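
When offset is not 0, the runtime page size is what makes the alignment rule workable. The following sketch shows one way to map a window that starts at an arbitrary file offset by rounding the offset down to a page boundary. The struct window type and the functions map_window() and window_self_check() are hypothetical names introduced here for illustration:

```c
#define _XOPEN_SOURCE 700       /* for mkstemp() and pwrite() */
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for a mapping that starts mid-page. */
struct window {
        void *base;             /* page-aligned address from mmap() */
        size_t map_len;         /* length to hand back to munmap() */
        char *data;             /* address of the requested file offset */
};

/* Map len bytes of fd starting at any offset. Returns 0 or -1. */
int map_window (int fd, off_t offset, size_t len, struct window *w)
{
        long page = sysconf (_SC_PAGESIZE);
        off_t aligned;
        size_t slack;

        if (page <= 0 || offset < 0 || len == 0)
                return -1;

        aligned = offset - (offset % page);     /* round down to a page */
        slack = (size_t) (offset - aligned);    /* bytes skipped in page */

        w->map_len = len + slack;
        w->base = mmap (NULL, w->map_len, PROT_READ, MAP_PRIVATE,
                        fd, aligned);
        if (w->base == MAP_FAILED)
                return -1;

        w->data = (char *) w->base + slack;
        return 0;
}

/* Writes 'Z' at an unaligned offset and reads it back via the window. */
int window_self_check (void)
{
        char path[] = "/tmp/mmap-window-XXXXXX"; /* assumed scratch file */
        long page = sysconf (_SC_PAGESIZE);
        struct window w;
        off_t target;
        int fd, ok;

        fd = mkstemp (path);
        if (fd == -1)
                return -1;
        unlink (path);

        target = (off_t) page + 100;            /* deliberately unaligned */
        if (pwrite (fd, "Z", 1, target) != 1 ||
            map_window (fd, target, 1, &w) == -1) {
                close (fd);
                return -1;
        }
        close (fd);

        ok = (w.data[0] == 'Z');
        munmap (w.base, w.map_len);
        return ok;
}
```

Keeping base and map_len alongside data matters because munmap() needs the page-aligned address, not the adjusted pointer.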

Return values and error codes

On success, a call to mmap() returns the location of the mapping. On failure, the call returns MAP_FAILED, and sets errno appropriately. A call to mmap() never returns 0.

Possible errno values include:

EACCES
   The given file descriptor is not a regular file, or the
   mode with which it was opened conflicts with prot
   or flags.

EAGAIN
   The file has been locked via a file lock.

EBADF
   The given file descriptor is not valid.

EINVAL
   One or more of the parameters addr, len, or offset
   is invalid.

ENFILE
   The system-wide limit on open files has been
   reached.

ENODEV
   The filesystem on which the file to map resides does
   not support memory mapping.

ENOMEM
   The process does not have enough memory.

EOVERFLOW
   The result of addr+len exceeds the size of the
   address space.

EPERM
   PROT_EXEC was given, but the filesystem is mounted
   noexec.
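
In practice, a caller compares the return value against MAP_FAILED and then inspects errno. The sketch below uses a hypothetical helper, try_mmap(), to show that pattern. It saves errno right away, since later library calls may overwrite it:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper: attempt a read-only shared mapping. Returns 0 on
 * success (after unmapping again), or the errno value that mmap() set.
 */
int try_mmap (int fd, size_t len, off_t offset)
{
        void *p;
        int err;

        p = mmap (NULL, len, PROT_READ, MAP_SHARED, fd, offset);
        if (p == MAP_FAILED) {
                err = errno;    /* save before another call changes it */
                fprintf (stderr, "mmap: %s\n", strerror (err));
                return err;
        }

        munmap (p, len);
        return 0;
}
```

Passing an invalid descriptor such as -1, for instance, should yield EBADF from the list above.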

Associated signals

Two signals are associated with mapped regions:

SIGBUS

This signal is generated when a process attempts to access a region of a mapping that is no longer valid—for example, because the file was truncated after it was mapped.

SIGSEGV

This signal is generated when a process attempts to write to a region that is mapped read-only.
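
Both signals can be provoked on purpose. The following sketch is an illustration rather than a recommended recovery strategy: the helper provoke_mapping_signal() is hypothetical, and jumping out of a fault handler with siglongjmp() is used here only to keep the demonstration alive. It maps a one-page file read-only and then either writes to it or truncates the file and reads from it:

```c
#define _XOPEN_SOURCE 700       /* for mkstemp(), sigaction(), sigsetjmp() */
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;
static volatile sig_atomic_t caught_signal;

/* Record which signal arrived, then escape the faulting access. */
static void on_fault (int signo)
{
        caught_signal = signo;
        siglongjmp (fault_env, 1);
}

/*
 * Hypothetical helper. With truncate_first == 0, write to a read-only
 * mapping; otherwise truncate the file and read from the mapping.
 * Returns the signal caught, 0 if none arrived, or -1 if setup fails.
 */
int provoke_mapping_signal (int truncate_first)
{
        char path[] = "/tmp/mmap-signal-XXXXXX"; /* assumed scratch file */
        struct sigaction sa, old_segv, old_bus;
        long page = sysconf (_SC_PAGESIZE);
        volatile char *p;
        int fd;

        fd = mkstemp (path);
        if (fd == -1)
                return -1;
        unlink (path);
        if (page <= 0 || ftruncate (fd, page) == -1) {
                close (fd);
                return -1;
        }

        p = mmap (NULL, page, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                close (fd);
                return -1;
        }

        memset (&sa, 0, sizeof (sa));
        sa.sa_handler = on_fault;
        sigemptyset (&sa.sa_mask);
        sigaction (SIGSEGV, &sa, &old_segv);
        sigaction (SIGBUS, &sa, &old_bus);

        caught_signal = 0;
        if (sigsetjmp (fault_env, 1) == 0) {
                if (truncate_first) {
                        /* The mapped page now lies past the end of file. */
                        if (ftruncate (fd, 0) == 0) {
                                char c = p[0];
                                (void) c;
                        }
                } else {
                        p[0] = 'x';     /* the mapping is read-only */
                }
        }

        /* Restore the previous handlers and clean up. */
        sigaction (SIGSEGV, &old_segv, NULL);
        sigaction (SIGBUS, &old_bus, NULL);
        munmap ((void *) p, page);
        close (fd);
        return caught_signal;
}
```

On Linux, the write should raise SIGSEGV and the read after truncation should raise SIGBUS.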

munmap()

Linux provides the munmap() system call for removing a mapping created with mmap():

 

  #include <sys/mman.h>

  int munmap (void *addr, size_t len);

A call to munmap() removes any mappings that contain pages located anywhere in the process address space starting at addr, which must be page-aligned, and continuing for len bytes. Once the mapping has been removed, the previously associated memory region is no longer valid, and further access attempts result in a SIGSEGV signal.

Normally, munmap() is passed the return value and the len parameter from a previous invocation of mmap().

On success, munmap() returns 0; on failure, it returns -1, and errno is set appropriately. The only standard errno value is EINVAL, which specifies that one or more parameters were invalid.

As an example, the following snippet unmaps any memory regions with pages contained in the interval [addr, addr+len):

  if (munmap (addr, len) == -1)
          perror ("munmap");
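
Putting mmap() and munmap() together, a complete read-only pass over a file might look like the sketch below. The function count_byte_in_file() is a hypothetical example, and it uses fstat() to learn the file's length, an assumption about how len would typically be obtained:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical example: count how often the byte wanted occurs in the
 * file at path by scanning a read-only mapping of the whole file.
 * Returns the count, or -1 on error.
 */
long count_byte_in_file (const char *path, char wanted)
{
        struct stat sb;
        const char *p;
        long count = 0;
        off_t i;
        int fd;

        fd = open (path, O_RDONLY);
        if (fd == -1) {
                perror ("open");
                return -1;
        }

        if (fstat (fd, &sb) == -1) {
                perror ("fstat");
                close (fd);
                return -1;
        }

        if (sb.st_size == 0) {  /* mmap() rejects a zero-length mapping */
                close (fd);
                return 0;
        }

        p = mmap (NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
                perror ("mmap");
                close (fd);
                return -1;
        }
        close (fd);             /* the mapping outlives the descriptor */

        for (i = 0; i < sb.st_size; i++)
                if (p[i] == wanted)
                        count++;

        /* Pass back the same address and length that mmap() was given. */
        if (munmap ((void *) p, sb.st_size) == -1) {
                perror ("munmap");
                return -1;
        }

        return count;
}
```

Calling count_byte_in_file() with '\n' as the second argument, for example, counts the newline characters in a text file.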

