Linux Files and the Event Poll Interface

In this second part of a seven-part series on Linux I/O file system calls, you will learn about the event poll interface. This article is excerpted from chapter four of the book Linux System Programming: Talking Directly to the Kernel and C Library, written by Robert Love (O’Reilly, 2007; ISBN: 0596009585). Copyright © 2007 O’Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O’Reilly Media.

The Event Poll Interface

Recognizing the limitations of both poll() and select(), the 2.6 Linux kernel introduced the event poll (epoll) facility. While more complex than the two earlier interfaces, epoll solves the fundamental performance problem shared by both of them, and adds several new features.

Both poll() and select() (discussed in Chapter 2) require the full list of file descriptors to watch on each invocation. The kernel must then walk the list of each file descriptor to be monitored. When this list grows large—it may contain hundreds or even thousands of file descriptors—walking the list on each invocation becomes a scalability bottleneck.

Epoll circumvents this problem by decoupling the monitor registration from the actual monitoring. One system call initializes an epoll context, another adds monitored file descriptors to or removes them from the context, and a third performs the actual event wait.
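The three steps can be sketched end to end as follows. This is a minimal illustration rather than a canonical pattern: the helper name epoll_lifecycle_demo() is hypothetical, and a pipe stands in for whatever descriptor a real program would monitor.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the read end of a freshly written
 * pipe is reported readable, 0 if it is not, and -1 on any error.
 */
int epoll_lifecycle_demo (void)
{
        struct epoll_event event = { .events = EPOLLIN };
        struct epoll_event ready;
        int pipefd[2], epfd, nr, result = -1;

        if (pipe (pipefd) < 0)
                return -1;

        /* Step 1: initialize an epoll context. */
        epfd = epoll_create (1);
        if (epfd < 0)
                goto out_pipe;

        /* Step 2: register the read end of the pipe with the context. */
        event.data.fd = pipefd[0];
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pipefd[0], &event) < 0)
                goto out_epfd;

        /* Make the pipe readable so that the wait has something to report. */
        if (write (pipefd[1], "x", 1) != 1)
                goto out_epfd;

        /* Step 3: wait up to one second for an event. */
        nr = epoll_wait (epfd, &ready, 1, 1000);
        if (nr >= 0)
                result = (nr == 1 && ready.data.fd == pipefd[0]);

out_epfd:
        close (epfd);
out_pipe:
        close (pipefd[0]);
        close (pipefd[1]);
        return result;
}
```

The registration in step 2 happens once; only step 3 would be repeated in an event loop, which is what spares the kernel from re-reading the full descriptor list on every wait.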

Creating a New Epoll Instance

An epoll context is created via epoll_create():

  #include <sys/epoll.h>

  int epoll_create (int size);

A successful call to epoll_create() instantiates a new epoll instance, and returns a file descriptor associated with the instance. This file descriptor has no relationship to a real file; it is just a handle to be used with subsequent calls using the epoll facility. The size parameter is a hint to the kernel about the number of file descriptors that are going to be monitored; it is not the maximum number. Passing in a good approximation can result in better performance on kernels that honor the hint, but the exact number is not required; since Linux 2.6.8 the kernel ignores the value, although it must still be greater than zero. On error, the call returns -1, and sets errno to one of the following:

EINVAL
   The size parameter is not a positive number.

ENFILE
   The system has reached the limit on the total number
   of open files.

ENOMEM
   Insufficient memory was available to complete the
   operation.

A typical call is:

  int epfd;

  epfd = epoll_create (100); /* plan to watch ~100 fds */
  if (epfd < 0)
          perror ("epoll_create");

The file descriptor returned from epoll_create() should be destroyed via a call to close() after polling is finished.

Controlling Epoll

The epoll_ctl() system call can be used to add file descriptors to and remove file descriptors from a given epoll context:

  #include <sys/epoll.h>

  int epoll_ctl (int epfd,
                 int op,
                 int fd,
                 struct epoll_event *event);

The header <sys/epoll.h> defines the epoll_event structure as:

 

  struct epoll_event {
          __u32 events; /* events */
          union {
                  void *ptr;
                  int fd;
                  __u32 u32;
                  __u64 u64;
          } data;
  };

A successful call to epoll_ctl() controls the epoll instance associated with the file descriptor epfd. The parameter op specifies the operation to be taken against the file associated with fd. The event parameter further describes the behavior of the operation.

Here are valid values for the op parameter:

EPOLL_CTL_ADD

Add a monitor on the file associated with the file descriptor fd to the epoll instance associated with epfd, per the events defined in event.

EPOLL_CTL_DEL

Remove a monitor on the file associated with the file descriptor fd from the epoll instance associated with epfd.

EPOLL_CTL_MOD

Modify an existing monitor of fd with the updated events specified by event.

The events field in the epoll_event structure lists which events to monitor on the given file descriptor. Multiple events can be bitwise-ORed together. Here are valid values:

EPOLLERR

An error condition occurred on the file. This event is always monitored, even if it’s not specified.

EPOLLET

Enables edge-triggered behavior for the monitor of the file (see the upcoming section “Edge- Versus Level-Triggered Events”). The default behavior is level-triggered.

EPOLLHUP

A hangup occurred on the file. This event is always monitored, even if it’s not specified.

EPOLLIN

The file is available to be read from without blocking.

EPOLLONESHOT

After an event is generated and read, the file is automatically no longer monitored. A new event mask must be specified via EPOLL_CTL_MOD to reenable the watch.

EPOLLOUT

The file is available to be written to without blocking.

EPOLLPRI

There is urgent out-of-band data available to read.
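The EPOLLONESHOT flag is the least intuitive of these values, so here is a small sketch of the re-arm cycle it requires. The function name oneshot_demo() is hypothetical, and a pipe is assumed as the monitored file purely for illustration.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns 1 if a one-shot watch fires once, stays
 * silent until it is re-armed, and fires again after EPOLL_CTL_MOD;
 * returns 0 if the sequence differs, and -1 on error.
 */
int oneshot_demo (void)
{
        struct epoll_event event = { .events = EPOLLIN | EPOLLONESHOT };
        struct epoll_event ready;
        int pipefd[2], epfd, first, second, third, result = -1;

        if (pipe (pipefd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0)
                goto out_pipe;

        event.data.fd = pipefd[0];
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pipefd[0], &event) < 0)
                goto out_epfd;
        if (write (pipefd[1], "x", 1) != 1)
                goto out_epfd;

        /* The first wait reports the event and disables the watch. */
        first = epoll_wait (epfd, &ready, 1, 100);

        /* The byte is still unread, but the disabled watch stays quiet. */
        second = epoll_wait (epfd, &ready, 1, 0);

        /* Re-arm the watch by supplying an event mask again. */
        if (epoll_ctl (epfd, EPOLL_CTL_MOD, pipefd[0], &event) < 0)
                goto out_epfd;
        third = epoll_wait (epfd, &ready, 1, 100);

        result = (first == 1 && second == 0 && third == 1);

out_epfd:
        close (epfd);
out_pipe:
        close (pipefd[0]);
        close (pipefd[1]);
        return result;
}
```

One-shot watches are commonly used when several threads share one epoll instance, since the disabled watch keeps a second thread from being woken for a descriptor that the first thread is still servicing.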

The data field inside the epoll_event structure is for the user’s private use. The contents are returned to the user upon receipt of the requested event. The common practice is to set event.data.fd to fd, which makes it easy to look up which file descriptor caused the event.
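When a program needs more than the descriptor number, the ptr member can carry a pointer to per-descriptor state instead. The sketch below assumes a hypothetical struct watch_ctx and two hypothetical helpers; none of these names belong to the epoll API. Because data is a union, setting ptr means fd is no longer available, so the context stores the descriptor itself.

```c
#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical per-descriptor context; not part of the epoll API. */
struct watch_ctx {
        int fd;
        const char *label;
};

/* Registers ctx->fd for reading and attaches ctx to the watch. */
int watch_with_ctx (int epfd, struct watch_ctx *ctx)
{
        struct epoll_event event = { .events = EPOLLIN };

        /* data is a union, so ptr is used in place of fd. */
        event.data.ptr = ctx;
        return epoll_ctl (epfd, EPOLL_CTL_ADD, ctx->fd, &event);
}

/* Waits for one event; returns its context, or NULL on timeout or error. */
struct watch_ctx *wait_for_ctx (int epfd, int timeout)
{
        struct epoll_event ready;

        if (epoll_wait (epfd, &ready, 1, timeout) != 1)
                return NULL;
        return ready.data.ptr;
}
```

The context must outlive the watch: the kernel stores only the pointer value, so freeing the structure while the descriptor is still registered would leave a dangling pointer in later events.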

Upon success, epoll_ctl() returns 0. On failure, the call returns -1, and sets errno to one of the following values:

EBADF

epfd is not a valid epoll instance, or fd is not a valid file descriptor.

EEXIST

op was EPOLL_CTL_ADD, but fd is already associated with epfd.

EINVAL

epfd is not an epoll instance, epfd is the same as fd, or op is invalid.

ENOENT

op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, but fd is not associated with epfd.

ENOMEM

There was insufficient memory to process the request.

EPERM

fd does not support epoll.
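These error codes can drive program logic as well as diagnostics. As one possible use, the hypothetical helper below tries EPOLL_CTL_ADD first and falls back to EPOLL_CTL_MOD when the kernel reports EEXIST, so the caller need not track whether fd is already registered. The name epoll_add_or_mod() is an assumption made for this sketch.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>

/*
 * Hypothetical helper: watch fd for the given events, whether or not
 * fd is already registered with epfd. Returns 0 on success, or -1
 * with errno set by epoll_ctl() on failure.
 */
int epoll_add_or_mod (int epfd, int fd, uint32_t events)
{
        struct epoll_event event = { .events = events };

        event.data.fd = fd;

        if (epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &event) == 0)
                return 0;

        /* Any error other than "already registered" is a real failure. */
        if (errno != EEXIST)
                return -1;

        /* fd is already being watched; update its event mask instead. */
        return epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &event);
}
```

The mirror-image approach, trying EPOLL_CTL_MOD first and falling back on ENOENT, works equally well; which order is cheaper depends on whether new registrations or updates are more common in the program.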

As an example, to add a new watch on the file associated with fd to the epoll instance epfd, you would write:

  struct epoll_event event;
  int ret;

  event.data.fd = fd; /* return the fd to us later */
  event.events = EPOLLIN | EPOLLOUT;

  ret = epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &event);
  if (ret)
          perror ("epoll_ctl");

To modify an existing event on the file associated with fd on the epoll instance epfd, you would write:

  struct epoll_event event;
  int ret;

  event.data.fd = fd; /* return the fd to us later */
  event.events = EPOLLIN;

  ret = epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &event);
  if (ret)
          perror ("epoll_ctl");

Conversely, to remove an existing event on the file associated with fd from the epoll instance epfd, you would write:

  struct epoll_event event;
  int ret;

  ret = epoll_ctl (epfd, EPOLL_CTL_DEL, fd, &event);
  if (ret)
          perror ("epoll_ctl");

Note that the event parameter can be NULL when op is EPOLL_CTL_DEL, as there is no event mask to provide. Kernel versions before 2.6.9, however, erroneously check for this parameter to be non-NULL. For portability to these older kernels, you should pass in a valid non-NULL pointer; it will not be touched. Kernel 2.6.9 fixed this bug.

Waiting for Events with Epoll

The system call epoll_wait() waits for events on the file descriptors associated with the given epoll instance:

  #include <sys/epoll.h>

  int epoll_wait (int epfd,
                  struct epoll_event *events,
                  int maxevents,
                  int timeout);

A call to epoll_wait() waits up to timeout milliseconds for events on the files associated with the epoll instance epfd. Upon success, events points to memory containing epoll_event structures describing each event, up to a maximum of maxevents events. The return value is the number of events, or -1 on error, in which case errno is set to one of the following:

EBADF
   epfd is not a valid file descriptor.

EFAULT
   The process does not have write access to the
   memory pointed at by events.

EINTR
   The system call was interrupted by a signal before it
   could complete.

EINVAL
   epfd is not a valid epoll instance, or maxevents is
   equal to or less than 0.

If timeout is 0, the call returns immediately, even if no events are available, in which case the call will return 0. If the timeout is -1, the call will not return until an event is available.

When the call returns, the events field of the epoll_event structure describes the events that occurred. The data field contains whatever the user set it to before invocation of epoll_ctl().
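Of the errors listed above, EINTR is usually not a real failure, and many programs simply retry. The wrapper below is one way to do that; the name epoll_wait_retry() is hypothetical. Note the simplification it makes: each retry restarts with the full timeout, so a call interrupted repeatedly can wait longer in total than timeout milliseconds.

```c
#include <errno.h>
#include <sys/epoll.h>

/*
 * Hypothetical wrapper: behaves like epoll_wait(), but retries when
 * the call is interrupted by a signal. The timeout is not adjusted
 * for time already spent waiting.
 */
int epoll_wait_retry (int epfd,
                      struct epoll_event *events,
                      int maxevents,
                      int timeout)
{
        int nr;

        do {
                nr = epoll_wait (epfd, events, maxevents, timeout);
        } while (nr < 0 && errno == EINTR);

        return nr;
}
```

Called with a timeout of 0, the wrapper acts as a nonblocking check for pending events; called with -1, it blocks until at least one event arrives, regardless of how many signals are delivered in the meantime.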

A full epoll_wait() example looks like this:

  #define MAX_EVENTS   64

  struct epoll_event *events;
  int nr_events, i, epfd; /* epfd must refer to an existing epoll instance */

  events = malloc (sizeof (struct epoll_event) * MAX_EVENTS);
  if (!events) {
          perror ("malloc");
          return 1;
  }

  nr_events = epoll_wait (epfd, events, MAX_EVENTS, -1);
  if (nr_events < 0) {
          perror ("epoll_wait");
          free (events);
          return 1;
  }

  for (i = 0; i < nr_events; i++) {
          /* events is an unsigned 32-bit mask, hence %u */
          printf ("event=%u on fd=%d\n",
                  events[i].events,
                  events[i].data.fd);

          /*
           * We now can, per events[i].events, operate on
           * events[i].data.fd without blocking.
           */
  }

  free (events);

We will cover the functions malloc() and free() in Chapter 8.

Edge- Versus Level-Triggered Events

If the EPOLLET value is set in the events field of the event parameter passed to epoll_ctl(), the watch on fd is edge-triggered, as opposed to level-triggered.

Consider the following events between a producer and a consumer communicating over a Unix pipe:

  1. The producer writes 1 KB of data onto a pipe.
  2. The consumer performs an epoll_wait() on the pipe, waiting for the pipe to contain data, and thus be readable.

With a level-triggered watch, the call to epoll_wait() in step 2 will return immediately, showing that the pipe is ready to read. With an edge-triggered watch, this call will not return until after step 1 occurs. That is, even if the pipe is readable at the invocation of epoll_wait(), the call will not return until the data is written onto the pipe.
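The difference is easiest to observe when data is left unread after a notification. The sketch below uses a slightly different sequence from the one above: the pipe is registered first, two bytes are written, one event is collected, one byte is read, and then epoll_wait() is called a second time without blocking. The function name second_wait_count() is hypothetical.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns the number of events reported by a
 * second, nonblocking epoll_wait() while one unread byte remains in
 * the pipe, or -1 on error. Pass 0 for a level-triggered watch or
 * EPOLLET for an edge-triggered one.
 */
int second_wait_count (uint32_t mode)
{
        struct epoll_event event = { .events = EPOLLIN | mode };
        struct epoll_event ready;
        int pipefd[2], epfd, result = -1;
        char byte;

        if (pipe (pipefd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0)
                goto out_pipe;

        event.data.fd = pipefd[0];
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pipefd[0], &event) < 0)
                goto out_epfd;

        /* Two bytes arrive; the first wait reports them in either mode. */
        if (write (pipefd[1], "ab", 2) != 2)
                goto out_epfd;
        if (epoll_wait (epfd, &ready, 1, 100) != 1)
                goto out_epfd;

        /* Consume only one byte, so the pipe remains readable. */
        if (read (pipefd[0], &byte, 1) != 1)
                goto out_epfd;

        /* Nothing new has been written since the first notification. */
        result = epoll_wait (epfd, &ready, 1, 0);

out_epfd:
        close (epfd);
out_pipe:
        close (pipefd[0]);
        close (pipefd[1]);
        return result;
}
```

With a level-triggered watch the second wait reports the pipe again, because it is still readable. With an edge-triggered watch it reports nothing, because no new data has arrived since the previous notification.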

Level-triggered is the default behavior. It is how poll() and select() behave, and it is what most developers expect. Edge-triggered behavior requires a different approach to programming, commonly utilizing nonblocking I/O, and careful checking for EAGAIN.
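In practice, that approach usually means putting the descriptor into nonblocking mode and, on each notification, reading until the kernel returns EAGAIN. The following is a minimal sketch; set_nonblocking() and drain_fd() are hypothetical helper names, and a real program would process the data rather than discard it.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: adds O_NONBLOCK to the flags of fd. */
int set_nonblocking (int fd)
{
        int flags = fcntl (fd, F_GETFL, 0);

        if (flags < 0)
                return -1;
        return fcntl (fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: reads from a nonblocking fd until no data is
 * left. Returns the number of bytes consumed, or -1 on a real error.
 * For brevity, end-of-file and EAGAIN are not distinguished.
 */
ssize_t drain_fd (int fd)
{
        char buf[4096];
        ssize_t n, total = 0;

        for (;;) {
                n = read (fd, buf, sizeof (buf));
                if (n > 0) {
                        total += n; /* data would be processed here */
                        continue;
                }
                if (n == 0)
                        return total; /* end of file */
                if (errno == EINTR)
                        continue; /* interrupted; try again */
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                        return total; /* drained; safe to wait again */
                return -1;
        }
}
```

Draining to EAGAIN matters with an edge-triggered watch because stopping early leaves data in the buffer with no further notification due until new data arrives.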

The terminology comes from electrical engineering. A level-triggered interrupt is issued whenever a line is asserted. An edge-triggered interrupt is caused only during the rising or falling edge of the change in assertion. Level-triggered interrupts are useful when the state of the event (the asserted line) is of interest. Edge-triggered interrupts are useful when the event itself (the line being asserted) is of interest.

Please check back next week for the continuation of this article
