Apache and the Internet

This article introduces those new to networking to Apache, the Hypertext Transfer Protocol (HTTP), and the basics of system administration. It is excerpted from chapter one of Peter Wainwright’s book Pro Apache (Apress, 2004; ISBN: 1590593006).

THIS CHAPTER IS an introduction to both Apache and the concepts that underlie it; that is, the Hypertext Transfer Protocol (HTTP) and the basics of networking and the Internet. It’s aimed at those totally new to Apache and Web servers in general. This chapter is introductory in nature, so if you’re familiar with system administration or are well read on Internet subjects, you might want to skip ahead to Chapter 2.

In this chapter, I’ll also discuss the most important criteria to consider when choosing server hardware. Although it’s quite easy to install Apache manually, you’ll also look at dedicated server solutions for those looking for ready-made solutions with vendor support. Finally, I’ll round off the chapter by presenting some of the graphical configuration tools available for Apache installations.

Apache: The Anatomy of a Web Server

In this section, I’ll introduce some of the basic concepts behind Web servers. You’ll also look at how Apache works and why it has become the Web server of choice on the Internet.

The Apache Source

Apache is the most popular Web server software on the Internet. The true secret of Apache’s success is that the source code is freely available. This means that anyone who wants to add features to their Web server can start with the Apache code and build on it. Indeed, some of Apache’s most important modules began as externally developed projects. mod_vhost_alias and mod_dav are both good examples.

To encourage this kind of external development, all binary distributions now come with a complete copy of the source code that’s ready to build. Examining the source code can be instructive and educational, and sometimes, it can even turn up a bug— such is the power of open peer review. When a bug is found in Apache, anyone can post a fix for it to the Internet and notify the Apache development team. This produces rapid development of the server and third-party modules, as well as faster fixes for any bugs discovered. It’s also a core reason for its reputation as a secure Web server.

The Apache License

Like the majority of source code available on the Internet, Apache is covered by a license permitting its distribution. Unlike the majority of source code, however, Apache uses its own license rather than the GNU General Public License (GPL). Apache’s license is considerably more relaxed than the GPL; it permits a much broader range of commercial applications and makes only a few basic provisions.

Generally, if you intend to use Apache for your own purposes, you don’t have anything to worry about. If you intend to distribute, rebadge, or sell a version of Apache or a product that includes Apache as a component, the license becomes relevant. This is an approximation and shouldn’t be taken as generally applicable—if in doubt, read the license. The license for Apache is actually quite short and easily fits on a single page. It’s included in every source and binary distribution and is reproduced for convenience in Online Appendix C.

Keep in mind also that there are several third-party products that build on Apache, and those products have additional licenses of their own. Apache’s license may not apply to them, or it may apply only in part. Apache may be free, but proprietary extensions of it may not be.

Support for Apache

Apache isn’t directly supported by the Apache Software Foundation (ASF), although it’s possible to submit bugs and problem reports to them if all other avenues of information have been exhausted. As with most open-source projects, the best source of support is the informative but informal online community. For many applications, this is sufficient because Apache’s reliability record is such that emergency support issues don’t often arise.

In particular, Apache servers don’t need the emergency fixes that are common for certain other Windows-based Web servers. Given that Apache is more popular than all other Web servers combined, this says a lot about its resiliency (popularity statistics are available at http://www.netcraft.com/survey/).

However, if support is a concern, there are a few options available:

IBM: IBM’s WebSphere product line uses Apache as its core component on AIX, Linux, Solaris, and Windows NT. IBM offers support on its own version of Apache, which it calls the IBM HTTPD Server.

Apple: Apple Computers integrated Apache into both its MacOS X Server and MacOS X desktop operating systems as a standard system component. Because MacOS X is based on a BSD Unix derivative, Apache on MacOS X is remarkably unchanged from a typical BSD or Linux installation.

Hewlett-Packard: Hewlett-Packard offers its Apache-based Web server v.2.0.0 for HP-UX 11.0 and 11i (PA-RISC).

SuSE and Red Hat: The vendors of Linux-based distributions that incorporate Apache (for example, SuSE and Red Hat) offer support on their products, including support for Apache. As with most support services, the quality of this varies from vendor to vendor. Fortunately, and especially where Linux is concerned, researching the reliability of vendors online is easy; there’s usually no shortage of people offering their opinion.

ISPs and so on: Many Internet Service Providers (ISPs) and system integrators provide Apache support. You can find a list of these on the Apache Web site at http://www.apache.org/info/support.cgi. The number of ISPs that offer Apache-based servers has grown considerably in the past few years. The Apache services offered by ISPs include dedicated servers and colocation, virtual servers, and hosted accounts. Different ISP packages offer varying degrees of control over Apache. Some allow only minor configuration for a virtual host on a server administered by the ISP, and others provide a dedicated server over which you have complete control. The trade-off between convenience and flexibility is one you have to make.

This article is excerpted from Pro Apache by Peter Wainwright (Apress, 2004; ISBN  1590593006). Check it out at your favorite bookstore today. Buy this book now.

How Apache Works

Apache doesn’t run like a user application such as a word processor. Instead, it runs behind the scenes, providing services for other applications that communicate with it, such as a Web browser.

NOTE In Unix terminology, applications that provide services rather than directly communicate with users are called daemons. Apache runs on Windows NT, where the same concept is known as a service. Windows 95/98 and Windows ME aren’t capable of running Apache as a service; it must be run from the command line (the MS-DOS prompt or the Start menu’s Run command), even though Apache doesn’t interact with the user once it’s running.

Apache is designed to work over a network, so Apache and the applications that talk to it don’t have to be on the same computer. These applications are generically known as clients. Of course, a network can be defined as anything from a local intranet to the whole Internet, depending on the server’s purpose and target audience. I’ll cover networks in more detail later in this chapter.

The most common kind of client is of course a Web browser; most of the time when I say client, I mean browser. However, there are several important clients that aren’t browsers. The most important are Web robots and crawlers that index Web sites, but don’t forget streaming media players, news ticker applications, and other desktop tools that query Internet servers for information. Web proxies are also a kind of client because they forward requests for other clients.

The main task of a Web server is to translate a request into a response suitable for the circumstances at the time. When the client opens communication with Apache, it sends Apache a request for a resource. Apache either provides that resource or provides an alternative response to explain why the request couldn’t be fulfilled. In many cases, the resource is a Hypertext Markup Language (HTML) Web page residing on a local disk, but this is only the simplest option. It can be many other things, too—an image file, the result of a script that generates HTML output, a Java applet that’s downloaded and run by the client, and so on.

Apache uses HTTP to talk with clients. It’s a request/response protocol, which means that it defines how clients make requests and how servers respond to them: Every HTTP communication starts with a request and ends with a response. The Apache executable takes its name from the protocol, and on Unix systems is generally called httpd, short for HTTP daemon. I’ll discuss the basics of HTTP later in this chapter; the details are, more or less, the rest of the book.

Running Apache: Unix vs. Windows

Apache was originally written to run on Unix servers, and today it’s most commonly found on Linux, BSD derivatives, Solaris, and other Unix platforms. Since Apache was ported to Windows 95 and NT, it has made substantial inroads against the established servers from Microsoft and other commercial vendors—a remarkable achievement given the marketing power of those companies in the traditionally proprietary world of Windows applications.

Because of its Unix origins, Apache 1.3 was never quite as good on Windows as it was on Unix, but with Apache 2, programmers have completely redesigned the core of the Apache server. One major change is the abstraction of platform-specific implementation details into the Apache Portable Runtime (APR), and the server’s core processing logic has been moved into a separate module, known as a Multi Processing Module (MPM). As a result, Apache runs faster and more reliably on Windows because of an MPM dedicated to those platforms. NetWare, BeOS, and OS/2 also benefit from an MPM tuned to their platform-specific needs.

Apache runs differently on Unix systems than on Windows. When you start Apache 1.3 on Unix, it creates (or forks) several new child processes to handle Web server requests. Each new process created this way is a complete copy of the original Apache process. Apache 2 provides this behavior in the prefork MPM, which is designed to provide Apache 1.3 compatibility.

Windows doesn’t have anything resembling the fork system call, so Apache was extensively rewritten to use the native Windows threads. Theoretically, this is a much more efficient and lightweight solution because threads can share resources (thereby reducing their memory requirements). It also allows more intelligent switching between tasks by the operating system. However, Apache 1.3 used the Windows POSIX emulation layer (a Unix compatibility standard) to implement threads, which meant that it never ran as well as it theoretically would have. Apache 2 uses native Windows threads directly, courtesy of the APR, and accordingly runs much more smoothly.

Thread support in Apache 2 for the Unix platform is found in the worker, leader, threadpool, and perchild MPMs, which provide different processing models depending on your requirements. The new architecture, coupled with the benefits of threaded programming, provides a welcome boost in performance and also reduces the differences between Windows and Unix, thus simplifying future development work on both platforms.
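The practical difference between forked processes and threads is that threads share one address space. The following Python sketch is only an illustration of that idea and is not how Apache or its MPMs are implemented; the names handle_request, serve_all, and hit_counts are hypothetical. A small pool of worker threads handles pretend requests and records each hit in a single shared dictionary.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Shared state: every worker thread sees this same dictionary. Forked
# child processes would each work on their own copy instead.
hit_counts = {}
hit_lock = threading.Lock()


def handle_request(path):
    """Pretend to serve a request and record the hit in shared memory."""
    with hit_lock:  # protect the shared dictionary from concurrent updates
        hit_counts[path] = hit_counts.get(path, 0) + 1
    return f"200 OK {path}"


def serve_all(paths, workers=4):
    """Hand each request to a small pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(handle_request, paths))
```

Calling serve_all(["/index.html", "/about.html", "/index.html"]) leaves a count of 2 for /index.html in hit_counts, because all workers updated the same object.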

NOTE Apache is more stable on Windows NT, 2000, and XP than on Windows 9x and ME because the implementation of threads is cleaner on the former. To run Apache on Windows with any degree of reliability, choose an NT-derived platform because it allows Apache to run as a system service.

However, if reliability and security are a real concern, you should consider only a Unix server, for the sake of both Apache and the server itself. Additionally, new versions of Apache stabilize much faster on Unix than Windows, so choose Unix to take advantage of improvements to the server as soon as possible.

Apache is capable of running on many operating systems, in most cases straight from an installed binary distribution, notably OS/2, 680x0, PowerPC-based Macs (both pre–MacOS X and post–MacOS X), BeOS, and NetWare. MacOS X is remarkable in that it’s almost entirely unremarkable; it’s only a Unix variant. Apache 2 provides MPMs for Unix, Windows, OS/2, BeOS, and NetWare as standard, all of which I’ll cover in Chapter 9. Other MPMs are also in development. I won’t cover these additional MPMs in depth, but you can find more information on the ASF Web site at http://www.apache.org/.


Configuring Apache

Apache is set up through configuration files in which directives can be written to control Apache’s behavior. Apache supports an impressive number of directives, and each module that’s added to the server provides more.

The approach Apache takes to configuration makes it extremely versatile and gives the administrator comprehensive control over the features and security provided by the server. It gives Apache a major edge over its commercial rivals, which don’t offer nearly the same degree of flexibility and extensibility. It’s also one of the reasons for Apache’s slightly steeper learning curve, but the effort is well worth the reward of almost complete control over every aspect of the Web server’s operation.

The drawback to Apache’s versatility is that, unlike commercial offerings, there’s currently no complete solution for configuring Apache with a Graphical User Interface (GUI) editor; for involved tasks, you must edit the configuration by hand. That said, there are some credible attempts at creating a respectable configuration tool to work with Apache’s power and flexibility. Depending on your requirements, one of these might prove adequate for your needs. More information is available in the “Using Graphical Configuration Tools” section in Chapter 2. Keep in mind that many configuration tools handle only the most common configuration tasks, so the more advanced your needs become, the more you’ll find yourself editing the configuration directly. The fact that you’re editing in a GUI editor’s window doesn’t alter the fact that the GUI can help you only so much.

Most of this book is concerned with solving problems using Apache’s configuration directives. I introduce the most important ones in Chapter 4 and more advanced ones throughout the rest of the book in the context of the features they provide. However, I’ll also take some time to consider peripheral issues. For example, Chapter 3 covers building Apache from source when you can apply some additional configuration not available at any other time. Chapters 10 and 11 cover Web server security from the point of view of Apache and the server as a whole. This is also a configuration issue, but it’s one that extends outside merely configuring Apache.

Understanding Modules

One of Apache’s greatest strengths is its modular structure. The main Apache executable contains only a core set of features. Everything else is provided by modules (as shown in Figure 1-1), which can either be built into Apache or be loaded dynamically when Apache is run. Apache 2 takes this concept even further, removing platform-specific functionality to MPMs and subdividing monolithic modules such as mod_proxy and mod_cache into core and specific implementation submodules. This allows you to pick and choose precisely the functionality you want. It also provides an extensible architecture for new proxy and cache types.

Consequently, the Web server administrator can choose which modules to include and exclude when building Apache from the source code, and unwanted functionality can be removed. That makes the server smaller, less demanding of memory, and less prone to misconfiguration. Therefore, the server is that much more secure. Conversely, modules not normally included in Apache can be added and enabled to provide extra functionality.

Apache also allows modules to be added so you don’t have to rebuild Apache each time you want to add new functionality. Adding a new module involves simply installing it and then restarting the running Apache server—nothing else is necessary. To support added modules, Apache consumes a little more memory than otherwise, and the server starts more slowly because it has to load modules from disk. This is a minor downside but possibly an important one when high performance is a requirement. Additionally, the supplied apxs tool enables you to compile and add new modules from the source code to your server using the same settings that were used to build Apache itself.


Figure 1-1. Apache and module interaction

There’s a vast array of third-party modules for Apache. Some of them, such as mod_fastcgi, provide specific additional features of great use in extending Apache’s power. With mod_fastcgi, Apache can cache CGI scripts in a way that makes them respond better to users and consume fewer system resources.

Other modules provide major increases in power and flexibility. For example, mod_perl integrates a complete Perl interpreter into Apache, allowing it to use the whole range of software available for Perl. Some previously third-party modules have even been added to the Apache 2 distribution as permanent features, notably mod_dav. Probably the biggest new entry, however, is the cornerstone of Secure Sockets Layer (SSL) support in Apache—mod_ssl. This module eliminates a host of inconvenient issues (including the need to patch Apache’s source code) that Web server administrators had to deal with in Apache 1.3.

It’s this flexibility, Apache’s stability and performance, and the availability of its source code that makes it the most popular Web server software on the Internet.


The Hypertext Transfer Protocol

HTTP is the underlying protocol that all Web servers and clients use. Whereas HTML defines the way that Web pages are described, HTTP is concerned with how clients request information and how servers respond to them.

HTTP usually works beneath the surface, but a basic understanding of how HTTP works can be useful to the Web server administrator when diagnosing problems and dealing with security issues. This information is also useful because many of Apache’s features are HTTP-related as well.

The HTTP/1.1 protocol is defined in detail in RFC 2616, which can be accessed in text form at http://www.w3.org/Protocols/rfc2616/rfc2616.txt. Although this is a technical document, it’s both shorter and much more readable than might be expected. Administrators are encouraged to at least glance at it, and those who expect to use Apache’s more advanced features will want to keep a printed copy handy. Portable Document Format (PDF) versions are also available.

HTTP is a request/response stateless protocol, which means that the dialogue between a Web client (which may or may not be a browser) and server consists of a request from the client, a response from the server, and any necessary intermediate processing. After the response, the communication stops until another request is received. The server doesn’t anticipate further communication after the immediate request is complete, unlike other types of protocols that maintain a waiting state after the end of a request.

HTTP Requests and Responses

The first line of an HTTP request consists of a method that describes what the client wants to do, a Uniform Resource Identifier (URI) indicating the resource to be retrieved or manipulated, and an HTTP version. This is followed by a number of headers that modify the request in various ways, for example, to make it conditional on certain criteria or to specify a hostname (required in HTTP/1.1). On receipt of the request and any accompanying headers, the server determines a course of action and responds to the request. A typical request for an HTML document might be this:

GET /index.html HTTP/1.1
Host: www.alpha-complex.com

TIP Using the telnet command or a similar command line connection utility, you can connect to a running server and type the request in by hand to see the request and response directly. For example, type telnet localhost 80 and then press Enter twice to send the request after typing both lines. See Chapter 2 for more about using telnet.
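The same request can be assembled in a few lines of Python, which makes the wire format explicit: a request line, headers, and a blank line, each terminated by a carriage return and line feed. This is a minimal sketch, not a full HTTP client. The function names build_request and fetch are hypothetical, and the Connection: close header is an assumption added here so that the server ends the connection after one response.

```python
import socket


def build_request(method, uri, host):
    """Return the raw bytes of a minimal HTTP/1.1 request."""
    lines = [
        f"{method} {uri} HTTP/1.1",  # request line: method, URI, version
        f"Host: {host}",             # Host is required in HTTP/1.1
        "Connection: close",         # assumption: one request per connection
        "",                          # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, uri="/", port=80, timeout=5.0):
    """Send a GET request over TCP and return the whole raw response."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request("GET", uri, host))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

With a server running locally, fetch("localhost", "/index.html") would return the status line, headers, and body as one byte string, much as the telnet session displays them.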

Successful requests return a status code of 200 and the requested information, prefixed by the server’s response headers. A typical set of response headers for an Apache server looks something like this:

HTTP/1.1 200 OK
Date: Mon, 28 Jul 2003 16:22:41 GMT
Server: Apache/2.0.46 (Unix)
Last-Modified: Mon, 28 Jul 2003 16:22:41 GMT
ETag: "d456-68-248fdd00"
Accept-Ranges: bytes
Content-Length: 104
Content-Type: text/html; charset=ISO-8859-1

The status line, which contains the protocol type and success code, appears first, followed by the date and some information about the server. Next are the rest of the response headers, which vary according to the server and request. The most important is the Content-Type header, which tells the client what to do with the response. The Content-Length header lets the client know how long the body of the response is. The Date, ETag, and Last-Modified headers are used in caching.
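To make the structure of a response concrete, here is a small Python sketch that splits a block of response headers into the status line and a dictionary of headers. The function name parse_response_head and the SAMPLE constant are hypothetical; the sample text reuses a few of the headers from the example response. Header names are lowercased because HTTP treats them as case-insensitive.

```python
def parse_response_head(head):
    """Split raw response headers into (version, status, reason, headers)."""
    lines = head.strip().splitlines()
    # Status line: protocol version, numeric code, then the reason phrase.
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(status), reason, headers


SAMPLE = """\
HTTP/1.1 200 OK
Date: Mon, 28 Jul 2003 16:22:41 GMT
Server: Apache/2.0.46 (Unix)
Content-Length: 104
Content-Type: text/html; charset=ISO-8859-1
"""
```

Parsing SAMPLE yields the status 200 and a dictionary in which headers["content-type"] holds the value the client uses to decide how to treat the body.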

If an error occurs, an error code and reason are returned on the status line:

HTTP/1.1 404 Not Found

It’s also possible for the server to return a number of other codes in certain circumstances, for example, redirection.
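Status codes are grouped by their first digit, so a client can decide how to react to a code it doesn’t know in detail. The sketch below shows that grouping as defined by HTTP/1.1; the names STATUS_CLASSES and status_class are hypothetical.

```python
# The first digit of an HTTP status code identifies its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Return the class of a status code, or "unknown" if out of range."""
    return STATUS_CLASSES.get(code // 100, "unknown")
```

For example, status_class(200) gives "success" and status_class(404) gives "client error".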

Understanding HTTP Methods

Methods tell the server what kind of request is being made. The examples shown in Table 1-1 are truncated to illustrate the nature of the request and response. A real Apache server will likely send far more headers than these, as illustrated by the sample responses.

In HTTP/1.1, the methods shown in Table 1-2 are also supported.

Table 1-1. Basic HTTP Methods

GET
  Function: Get a header and resource from the server. A blank line separates the header and resource.
  Request:
    GET /index.html HTTP/1.0
  Response:
    HTTP/1.1 200 OK
    Date: Mon, 28 Jul 2003 17:02:08 GMT
    Server: Apache/2.0.46 (Unix)
    Content-Length: 1776
    Content-Type: text/html; charset=ISO-8859-1
    Connection: close

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html> …
    </html>

HEAD
  Function: Return the header that would be returned by a GET method, but don’t return the resource itself. Note that the content length is returned even though there’s no content.
  Request:
    HEAD /index.html HTTP/1.0
  Response:
    HTTP/1.1 200 OK
    Date: Mon, 28 Jul 2003 17:01:13 GMT
    Server: Apache/2.0.46 (Unix)
    Content-Length: 1776
    Content-Type: text/html; charset=ISO-8859-1
    Connection: close

POST
  Function: Send information to the server. The server’s response can contain confirmation that the information was received. The server must be configured to respond appropriately to a POST, for example, with a CGI script.
  Request:
    POST /cgi-bin/search.cgi HTTP/1.0
    Content-Length: 46

    query=alpha+complex&casesens=false&cmd=submit
  Response:
    HTTP/1.1 201 CREATED
    Date: Mon, 28 Jul 2003 17:02:20 GMT
    Server: Apache/2.0.46 (Unix)
    Content-Type: text/html; charset=ISO-8859-1
    Connection: close

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HTML> … </HTML>
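The POST example in Table 1-1 carries its data in the request body as URL-encoded form fields. As a rough sketch, the Python below builds such a request and derives the Content-Length header from the encoded body rather than hard-coding it. The function name build_post is hypothetical, and the Content-Type header is an assumption added here because it is the usual label for form-encoded data.

```python
from urllib.parse import urlencode


def build_post(uri, fields):
    """Build a raw HTTP/1.0 POST request with a form-encoded body."""
    body = urlencode(fields)  # spaces become "+", pairs are joined with "&"
    head = (
        f"POST {uri} HTTP/1.0\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        # The encoded body is plain ASCII, so its length in characters
        # equals its length in bytes.
        f"Content-Length: {len(body)}\r\n"
        "\r\n"  # blank line separates the headers from the body
    )
    return head + body
```

Calling build_post("/cgi-bin/search.cgi", {"query": "alpha complex", "casesens": "false", "cmd": "submit"}) produces a body of query=alpha+complex&casesens=false&cmd=submit.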

Table 1-2. Additional HTTP Methods

OPTIONS
  Function: Return the list of methods allowed by the server. This is of particular relevance to WebDAV servers, which support additional methods defined in RFC 2518.
  Request:
    OPTIONS * HTTP/1.1
    Host: www.alpha-complex.com
  Response:
    HTTP/1.1 200 OK
    Date: Mon, 28 Jul 2003 16:54:55 GMT
    Server: Apache/2.0.46 (Unix)
    Allow: GET, HEAD, POST, OPTIONS, TRACE
    Content-Length: 0
    Content-Type: text/plain; charset=ISO-8859-1

TRACE
  Function: Trace a request to see what the server actually sees. This displays what the request looks like after it has passed through any intermediate proxies. It may also be directed at an intermediate proxy by the Max-Forwards header to discover information about intermediate servers. For more information on TRACE, see RFC 2616.
  Request:
    TRACE * HTTP/1.1
    Host: www.alpha-complex.com
  Response:
    HTTP/1.1 200 OK
    Date: Mon, 28 Jul 2003 17:09:18 GMT
    Server: Apache/2.0.46 (Unix)
    Content-Type: message/http; charset=ISO-8859-1

    TRACE * HTTP/1.1
    Host: www.alpha-complex.com

DELETE
  Function: Delete a resource on the server. In general, the server should not allow DELETE methods, so attempting to use it should produce a response like that given in the example. The exception is WebDAV servers, which do implement DELETE.
  Request:
    DELETE /document.html HTTP/1.1
    Host: www.alpha-complex.com
  Response:
    HTTP/1.1 405 Method Not Allowed
    Date: Mon, 28 Jul 2003 17:24:37 GMT
    Server: Apache/2.0.46 (Unix) DAV/2
    Allow: GET, HEAD, OPTIONS, TRACE
    Content-Type: text/html; charset=ISO-8859-1

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HTML><HEAD>
    <TITLE>405 Method Not Allowed</TITLE>
    </HEAD><BODY>
    <H1>Method Not Allowed</H1>
    <P>The requested method DELETE is not allowed for the URL /document.html.</P>
    </BODY></HTML>

PUT
  Function: Create or change a file on the server. In general, the server should not allow PUT methods because POST is generally used instead. PUT implies a direct relationship between the URI in the PUT request and the same URI in a subsequent GET, but this is not implied by POST. Again, WebDAV servers may implement PUT.
  Request:
    PUT /newfile.txt HTTP/1.1
    Host: www.alpha-complex.com
    Content-Type: text/plain
    Content-Length: 63

    This is the contents of a file we want to create on the server
  Response:
    HTTP/1.1 201 CREATED
    Date: Mon, 28 Jul 2003 17:30:12 GMT
    Server: Apache/2.0.46 (Unix) DAV/2
    Content-Type: text/html; charset=ISO-8859-1

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html>…</HTML>

CONNECT
  Function: Enable proxies to switch to a tunneling mode for protocols like SSL. See the AllowCONNECT directive in Chapter 8 for more details.
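The Allow header returned for OPTIONS, and alongside a 405 response, is a comma-separated list of method names. A client could use it to check a method before sending a request, as in this minimal sketch. The function names allowed_methods and is_allowed are hypothetical; the comparison is case-sensitive because HTTP method names are.

```python
def allowed_methods(allow_header):
    """Turn an Allow header value into a set of method names."""
    return {name.strip() for name in allow_header.split(",") if name.strip()}


def is_allowed(method, allow_header):
    """Report whether the server listed the method in its Allow header."""
    return method in allowed_methods(allow_header)
```

Given the Allow value from the DELETE example in Table 1-2, is_allowed("DELETE", "GET, HEAD, OPTIONS, TRACE") is false, which matches the 405 response.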

Understanding URIs

A URI is a textual string that identifies a resource, either by name, by location, or by any other format that can be understood by the server. URIs are defined in RFC 2396.

The URI is usually a conventional Uniform Resource Locator (URL) as understood by a browser, of which the simplest possible form is the forward slash (/). Any valid URI on the server can be specified here, for example:

/index.html
/centralcontrol/bugreport.htm:80
http://www.alpha-complex/images/ultraviolet/photos/outside.jpg

If the method doesn’t require a specific resource to be accessed, the asterisk (*) URI can be used. The OPTIONS example in Table 1-2 just shown uses the asterisk. Note that for these cases, it’s not incorrect to use a valid URI, just redundant.
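The three forms of request URI can be told apart mechanically. The sketch below uses Python’s standard urllib.parse.urlsplit to do so; the function name describe_uri and the labels it returns are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Classify a request URI as asterisk, absolute, or path-only."""
    if uri == "*":
        # The asterisk addresses the server itself rather than a resource.
        return {"kind": "asterisk", "host": "", "path": ""}
    parts = urlsplit(uri)
    return {
        # An absolute URI carries a scheme such as http.
        "kind": "absolute" if parts.scheme else "path-only",
        "host": parts.netloc,
        "path": parts.path,
    }
```

For /index.html the result has an empty host and the path /index.html, while the full http:// form from the list above yields the host www.alpha-complex and the path that follows it.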


Understanding the HTTP Protocol

The protocol version is one of the following:

  • HTTP/0.9

  • HTTP/1.0

  • HTTP/1.1

In practice, nothing ever sends HTTP/0.9 because the protocol argument itself was introduced with HTTP/1.0 to distinguish 1.0 requests from 0.9 requests. HTTP/0.9 is assumed if the client doesn’t send a protocol, but only GET can work this way because HTTP/0.9 defined no other method.
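The rule that a missing protocol implies HTTP/0.9 can be sketched as a small request-line parser. This is an illustration only, with a hypothetical function name parse_request_line, and it is not taken from Apache’s own parsing code.

```python
def parse_request_line(line):
    """Split a request line into (method, uri, version)."""
    parts = line.split()
    if len(parts) == 3:
        method, uri, version = parts
    elif len(parts) == 2:
        # No protocol argument was sent, so HTTP/0.9 is assumed.
        method, uri = parts
        version = "HTTP/0.9"
    else:
        raise ValueError(f"malformed request line: {line!r}")
    return method, uri, version
```

So parse_request_line("GET /index.html") reports HTTP/0.9, while the same line with HTTP/1.1 appended reports that version unchanged.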

HTTP Headers

HTTP headers (also known as HTTP header fields) can pass with HTTP messages in either direction between client and server. Any header can be sent if both ends of the connection agree about its meaning, but HTTP defines only a specific subset of headers.

Recognized HTTP headers are divided into three groups:

Request headers are sent by clients to the server to add information or modify the nature of the request. The Accept-Language header, for example, informs the server of the languages the client accepts, which Apache can use for content negotiation.

Response headers are sent by the server to clients in response to requests. Standard headers generally sent by Apache include Date and Connection.

Entity headers may be sent in either direction and add descriptive information (also called meta information) about the body of an HTTP message. HTTP requests are permitted to use entity headers only for methods that allow a body, that is, PUT and POST. Requests with bodies are obliged to send a Content-Length header to inform the server how large the body is. Servers may instead send a Transfer-Encoding header but must otherwise send a Content-Length header. In addition to content headers, which also include Content-Language, Content-Encoding, and the familiar Content-Type, two useful entity headers are Expires and Last-Modified. Expires tells browsers and proxies how long a document remains valid; Last-Modified enables a client to determine if a cached document is current. To illustrate how this is useful, consider a proxy with a cached document for which a request arrives. The proxy first sends the server a HEAD request and looks for a Last-Modified header in the response. If it finds one and it’s no newer than the cached document, the proxy doesn’t bother to request the document from the server but sends the cached version instead. (See mod_expires in Chapter 4 for more details.)
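The proxy’s decision described above comes down to comparing two dates. Here is a minimal Python sketch of that comparison, assuming both dates arrive in the HTTP date format shown in the earlier response examples. The function name cached_copy_is_current is hypothetical, and a real proxy would weigh other headers, such as Expires, as well.

```python
from email.utils import parsedate_to_datetime


def cached_copy_is_current(last_modified, cached_at):
    """Return True if the cached copy is at least as new as the original.

    last_modified: value of the Last-Modified header, or None if absent.
    cached_at: HTTP date string recording when the copy was cached.
    """
    if last_modified is None:
        # Without a Last-Modified header there is nothing to compare,
        # so the safe choice is to fetch the document again.
        return False
    modified = parsedate_to_datetime(last_modified)
    return modified <= parsedate_to_datetime(cached_at)
```

A document last modified at 16:22:41 GMT and cached at 17:00:00 GMT the same day is reported as current, so the proxy could answer from its cache.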

Online Appendix H gives a full list of recognized HTTP headers.

Networking and TCP/IP

Although a computer can work in isolation, it’s generally more useful to connect it to a network. For a Web server to be accessible, it needs to be connected to the outside world.

To network two or more computers together, some kind of communication medium is required. In an office this is usually something such as Ethernet, with a network card installed in each participating computer and connecting cables. Wireless networking cards and hubs are another increasingly common option. Wired or not, however, hardware alone isn’t enough. Although it’s still possible to get away with sending data as-is on a serial connection, computers sharing a network with many other computers need a more advanced protocol for defining how data is transmitted, delivered, received, and acknowledged.

Transmission Control Protocol/Internet Protocol (TCP/IP) is one of several such protocols for communicating between computers on a network, and it’s the protocol predominantly used on the Internet. Others include Token Ring (which doesn’t run on Ethernet) and SPX/IPX (which does), both of which are generally used in corporate intranets.

Definitions

TCP/IP is two protocols, one built on top of the other. As the lower level, IP routes data between sender and recipient by splitting the data into packets and attaching a source and destination address to each packet.

There are now two versions of IP available. The older, and most common, is IPv4 (IP version 4). This is the protocol on which the bulk of the Internet still operates, but it’s now beginning to show its age. Its successor is IPv6, which extends the addressing range from 32 to 128 bits, adds support for mobile IP and quality-of-service determination, and provides optional authentication and encryption of network connections. The security part of the protocol is large enough in its own right that it’s published in a separate specification known as IPSec, and it’s the basis of Virtual Private Networks (VPNs).
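The jump from 32 to 128 bits is easy to check with Python’s standard ipaddress module. In this short sketch the function names address_bits and address_space are hypothetical, and the two addresses mentioned afterward come from ranges reserved for documentation.

```python
import ipaddress


def address_bits(text):
    """Return the width in bits of the address family of the given address."""
    return ipaddress.ip_address(text).max_prefixlen


def address_space(text):
    """Return how many distinct addresses that address family can express."""
    return 2 ** address_bits(text)
```

Here address_bits("192.0.2.1") is 32 and address_bits("2001:db8::1") is 128, so the IPv6 address space is 2 to the power of 96 times larger than that of IPv4.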

TCP relies on IP to handle the details of getting data from one point to another. On top of this, TCP provides mechanisms for establishing connections, ensuring that data arrives in the order that it was sent, and handling data loss, errors, and recovery. TCP defines a handshake protocol to detect network errors and defines its own set of envelope information, including a sequence number, which it adds to the packet of data IP sends.
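The role of the sequence number can be shown with a toy model. The sketch below reorders segments that arrive out of order and discards duplicates; it is a simplification and not real TCP, where sequence numbers count bytes rather than whole segments. The function name reassemble is hypothetical.

```python
def reassemble(segments):
    """Rebuild the original data from (sequence_number, payload) pairs.

    The pairs may arrive in any order and may contain duplicates.
    """
    seen = {}
    for seq, payload in segments:
        # Keep the first copy of each segment and ignore retransmissions.
        seen.setdefault(seq, payload)
    # Joining in sequence order restores the order in which data was sent.
    return b"".join(seen[seq] for seq in sorted(seen))
```

Feeding in the segments (2, b"lo "), (1, b"hel"), (3, b"world") returns b"hello world", whatever order they arrived in.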

TCP isn’t the only protocol that uses IP. Also part of the TCP/IP protocol suite is User Datagram Protocol (UDP). Unlike TCP, which is a reliable and connection-oriented protocol, UDP is a connectionless and nonguaranteed protocol used for noncritical transmissions and broadcasts, generally for messages that can fit into one packet. Because it doesn’t check for successful transmission or correct sequencing, UDP is useful in situations where TCP would be too unwieldy, such as Internet broadcasts and multiplayer games. It’s also the basis of peer-to-peer networks such as Gnutella (http://www.gnutella.com/), which implement their own specialized error detection and retransmission protocols.

TCP/IP also includes Internet Control Message Protocol (ICMP), which is used by TCP/IP software to communicate messages concerning the protocol itself, such as a failure to connect to a given host. ICMP is intended for use by the low-level TCP/IP protocol and is rarely intended for user-level applications.

NOTE The TCP, UDP, ICMP, and IP protocols are defined in the following RFCs: UDP: 768, IP: 791, ICMP: 792, TCP: 793. See Online Appendix A for a complete list of useful RFCs and other documents.

Packets and Encapsulation

I mentioned earlier that IP and TCP both work by adding information to packets of data that are then transmitted between hosts. To really understand TCP/IP, it’s helpful to know a little more about IP and TCP.

When an application sends a block of data—a file or a page of HTML—TCP splits the data into packets for transmission. The Ethernet standard defines a maximum packet size of 1500 bytes. (On older networks, the hardware might limit packets to a size of 576 bytes.) When establishing a connection, TCP/IP determines how large a packet is allowed to be. Even if the local network can handle 1500 bytes, the destination or an intermediate network might not. Unless an intermediate step can perform packet splitting, the whole communication will have to drop down to the lowest packet size.

Once TCP knows what the packet size is, it encapsulates each block of data destined for a packet with a TCP header that contains a sequence number, source and destination ports, and a checksum for detecting errors. This header is like the address on an envelope, with the data packet as the enclosed letter.

IP then adds its own header to the TCP packet, in which it records the source and destination IP addresses so intermediate stages know how to route the packet. It also adds a protocol type to identify the packet as TCP, UDP, ICMP, or some other protocol, and another checksum. If you’re using IPv6, the packet can be signed to authenticate the sender and encrypted for transmission.

Furthermore, if the packet is to be sent over an Ethernet network, Ethernet adds yet another header containing the source and destination Ethernet addresses for the current link in the chain, a type code, and another checksum. The reason for this is that while IP records the IP addresses of the sending and receiving hosts in the header, Ethernet uses the Ethernet addresses of the network interfaces for each stage of the packet’s trip. Each protocol works at a closer range than the one it encapsulates, describing shorter and shorter hops in the journey from source to destination.
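The nesting of headers described above can be sketched in a few lines of Python. This is a toy illustration only: the header layouts, the field sizes, the function names, and the sum-based checksum are simplified assumptions and don't match the real TCP, IP, or Ethernet wire formats. The Ethernet checksum is left out entirely.

```python
import struct

def toy_checksum(data: bytes) -> int:
    # Stand-in checksum (sum of bytes modulo 65536); the real protocols differ.
    return sum(data) & 0xFFFF

def wrap_tcp(payload: bytes, src_port: int, dst_port: int, seq: int) -> bytes:
    # Toy TCP header: source port, destination port, sequence number, checksum.
    header = struct.pack("!HHIH", src_port, dst_port, seq, toy_checksum(payload))
    return header + payload

def wrap_ip(segment: bytes, src_ip: bytes, dst_ip: bytes, protocol: int = 6) -> bytes:
    # Toy IP header: source and destination addresses, a protocol type
    # (6 is the number assigned to TCP), and a checksum over the addresses.
    header = struct.pack("!4s4sBH", src_ip, dst_ip, protocol,
                         toy_checksum(src_ip + dst_ip))
    return header + segment

def wrap_ethernet(packet: bytes, src_mac: bytes, dst_mac: bytes,
                  type_code: int = 0x0800) -> bytes:
    # Toy Ethernet header: hardware addresses for the current link and a
    # type code (0x0800 identifies IPv4).
    return struct.pack("!6s6sH", dst_mac, src_mac, type_code) + packet

data = b"GET / HTTP/1.0\r\n\r\n"
segment = wrap_tcp(data, src_port=40000, dst_port=80, seq=1)
packet = wrap_ip(segment, bytes([192, 168, 1, 2]), bytes([192, 168, 1, 1]))
frame = wrap_ethernet(packet, bytes(6), bytes.fromhex("0010a4fe0968"))

# Each layer adds its own envelope around the one above it.
print(len(data), len(segment), len(packet), len(frame))
```

The original data sits untouched at the end of the frame; each layer only prepends its own envelope.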

Both IP and TCP add 20 bytes of information to a data packet, all of which has to fit inside the 1500-byte limit imposed by Ethernet. So the maximum size of data that can fit into a TCP/IP packet is actually 1460 bytes. Of course, if IP is running over a serial connection instead of Ethernet, it isn’t necessarily limited to 1500 bytes for a packet. Other protocols may impose their own limitations.
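The arithmetic is simple enough to check directly. The following sketch assumes 20-byte IP and TCP headers with no options, as in the text; the function names are made up for illustration.

```python
import math

ETHERNET_LIMIT = 1500  # maximum Ethernet packet size, in bytes
IP_HEADER = 20         # bytes added by IP (no options assumed)
TCP_HEADER = 20        # bytes added by TCP (no options assumed)

def max_payload(packet_limit: int = ETHERNET_LIMIT) -> int:
    # Data bytes left once both headers have been accounted for.
    return packet_limit - IP_HEADER - TCP_HEADER

def packets_needed(data_size: int, packet_limit: int = ETHERNET_LIMIT) -> int:
    # Number of packets required to carry data_size bytes of data.
    return math.ceil(data_size / max_payload(packet_limit))

print(max_payload())            # 1460
print(max_payload(576))         # 536, for the older 576-byte limit
print(packets_needed(100_000))  # 69
```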

ACKs, NAKs, and Other Messages

The bulk of TCP transmissions are made up of data packets, as I just described. However, IP makes no attempt to ensure that the packet reaches its destination, so TCP requires that the destination send an Acknowledged message (ACK) to tell the sending host that the message arrived. ACKs are therefore nearly as common as data messages, and in an ideal network, exactly as many ACKs occur as data messages. If something is wrong with the packet, TCP requires the destination to send a Not Acknowledged message (NAK) instead.

In addition to data, ACKs, and NAKs, TCP also defines synchronization (SYN), for establishing connections, and FIN, for ending them. The client requests a connection by sending a SYN message to a server, which establishes or denies the connection by sending an ACK or NAK, respectively. When either end of the connection wants to end it, it sends a FIN message to indicate it no longer wants to communicate. Figure 1-2 illustrates this process.


Figure 1-2. TCP communication messages

There are, therefore, three eventualities the sending host can expect:

  • The destination host receives a packet, and if the packet is the one it expected or a new connection, it sends an ACK.

  • The packet’s checksum doesn’t match or the sequence number of the packet is wrong, so the destination sends a NAK to inform the host that it needs to send the packet again.

  • The destination doesn’t send anything at all. In this case, TCP eventually decides that the packet or the response got lost and sends it again.

Several kinds of Denial of Service (DoS) attacks exploit aspects of TCP/IP to attempt to tie up servers unnecessarily. One such attack is the SYN flood, when many SYN packets are sent to a server, but the acceptance of the requested connections is never acknowledged by the client. Clearly, a little understanding of TCP packets can be of more than just academic interest. Actually doing something about such attacks is one of the topics of Chapter 10.

This article is excerpted from Pro Apache by Peter Wainwright (Apress, 2004; ISBN  1590593006). Check it out at your favorite bookstore today. Buy this book now.

The TCP/IP Network Model

TCP and IP form two layers in a hierarchy of protocols stretching from the application at the top to the hardware at the bottom. The TCP/IP network model is a simplified version of the OSI seven-layer networking model, which it resembles but isn’t completely compliant with. Although the OSI model is often compared to TCP/IP in network references, the comparison is next to useless because nothing else entirely complies with OSI either. An understanding of TCP/IP on its own is far more valuable. TCP/IP is a four-level network hierarchy, built on top of the hardware and below the application. Figure 1-3 shows a simplified stack diagram.


Figure 1-3. Four-layer network model

The Data Link level is shown as a single level, but in practice it often contains multiple levels. However, the point of TCP/IP is that you don’t need to care. For example, in a typical communication between a Web server and client, the layers might look like the following: at the server, connected to an Ethernet network (see Figure 1-4) and at the client, a user on a dial-up network account (see Figure 1-5).


Figure 1-4. TCP/IP layers on a typical Web server


Figure 1-5. TCP/IP layers on a client communicating with a Web server

In this case, an additional protocol, the Point-to-Point Protocol (PPP), has been used to enable IP to work over the basic serial protocol used between modems. It breaks the bottom data link layer into two layers.

When the user asks for a Web page through his or her browser, the browser generates the request using HTTP. It’s then transmitted over a TCP-initiated connection using IP to route the packet containing the request to a gateway across a serial connection using PPP.

IP routes the packet through potentially many intermediate servers. The address information in the packet tells each intermediate server where the packet needs to go next.

At the server, the network interface sees a packet whose IP address identifies it as for the server. The server pulls the packet off the network and sends it up to TCP, which sees that it’s a connection request and acknowledges it. A little later, the network sees a data packet that’s again sent up to TCP, which identifies it as being for the connection just established. It acknowledges the data packet, strips off the envelope information, and presents the enclosed HTTP request to Apache.

Apache processes the request and sends a response back to the client, working its way down the hierarchy again and back across the Internet to the client.

If instead you were trying to manage a mail system on a Unix e-mail server, the protocol layers would look like Figure 1-6.


Figure 1-6. TCP/IP layers on a mail server

As you can see, the only difference is the top-level protocol and the application you use—TCP/IP handles everything else.

Non-IP Protocols

There are several other protocols that run directly over Ethernet and don’t use IP. For example, the Address Resolution Protocol (ARP) is used on Ethernet networks to deduce the Ethernet address of a network interface from its IP address. Rival protocols such as SPX/IPX also run on Ethernet without involving IP. The design of Ethernet allows all these protocols to coexist peacefully.

Very few of these protocols are found on the Internet because the majority of them aren’t capable of making the journey from source to destination in more than one hop—this is what IP provides. Therefore, protocols that need it, such as TCP or UDP, are built on top of it rather than independently.

IP Addresses and Network Classes

Each host in a TCP/IP network needs to have a unique IP address assigned to it by the network administrators. In addition, if the host is to communicate over the Internet, it needs to have a unique IP address across the whole of the Internet as well.

IPv4 addresses are 32-bit numbers, usually written as 4 bytes, or octets, with a value between 0 and 255, separated by periods—for example, 192.168.20.181.
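As a quick illustration, Python's standard ipaddress module can show that the dotted form and the 32-bit number are the same value. The variable names are chosen here for the example.

```python
import ipaddress

addr = ipaddress.IPv4Address("192.168.20.181")

as_int = int(addr)          # the address as a single 32-bit number
octets = list(addr.packed)  # the same address as four separate octets

print(as_int)               # 3232240821
print(octets)               # [192, 168, 20, 181]
print(f"{as_int:032b}")     # all 32 bits, most significant first

# Rebuilding the number by hand from the four octets gives the same value.
rebuilt = (192 << 24) | (168 << 16) | (20 << 8) | 181
print(rebuilt == as_int)    # True
```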

IPv6 addresses are 128-bit numbers, represented as colon-separated blocks of hexadecimal numbers—for example, fe80::910:a4ff:aefe:9a8. The observant will notice that there aren’t enough digits to make up a 128-bit address. This is because a number of zeros have been compressed into the space occupied by the double colon, so you don’t have to list them explicitly. This number is intended to be only partially under your control; part of it is derived from the Ethernet address of the network interface. This allows automatic allocation of IPv6 addresses and mobile IP networking, one of the design goals of IPv6. IPv6 is discussed in more detail later in the chapter.
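To see the compressed zeros explicitly, the example address can be expanded with Python's standard ipaddress module. This is only a way of inspecting the notation, not something IPv6 requires.

```python
import ipaddress

addr = ipaddress.IPv6Address("fe80::910:a4ff:aefe:9a8")

print(addr.exploded)       # fe80:0000:0000:0000:0910:a4ff:aefe:09a8
print(addr.compressed)     # fe80::910:a4ff:aefe:9a8
print(addr.is_link_local)  # True: addresses starting fe80 are link-local
```

The double colon stands for the three all-zero blocks that follow fe80.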

The total range of IP addresses is partitioned into regions within which different classes of networks reside. The rest of the Internet considers IP addresses within a network class to be part of the same network, and it expects to use one point of contact, called a gateway, to route packets to hosts inside that network.

In addition, certain IP addresses (the first, all 0s, and the last, all 255s) in each class are considered special, so there aren’t quite as many addresses for hosts as you might expect. I’ll discuss these special addresses in a moment.

The IPv4 address space, which is still the addressing scheme on the Internet, is nominally divided into regions of class A, class B, and class C networks for the purposes of allocation.

  • Class A networks, of which there are very few, occupy the address range whose first number is between 1 and 126. The first number only is fixed, and the total number of possible hosts in a class A network is 16,777,214.

  • Class B networks occupy the range from 128 to 191. Both the first and second numbers are fixed, giving a total of 16,382 possible class B networks, each with a possible 65,534 hosts.

  • Class C networks are the smallest, occupying the range 192 to 223. The first three numbers are fixed, making more than two million class C networks available, but each one is capable of having only 254 hosts.

  • The range from 224 to 254 is reserved in the TCP/IP specification.
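The class boundaries listed above translate directly into a lookup on the first number of the address. The following is a minimal sketch; address_class and HOSTS_PER_NETWORK are names invented for this example.

```python
def address_class(address: str) -> str:
    # Classify an IPv4 address by its first octet, using the ranges above.
    first = int(address.split(".")[0])
    if 1 <= first <= 126:
        return "A"
    if 128 <= first <= 191:
        return "B"
    if 192 <= first <= 223:
        return "C"
    return "reserved or special"

# Hosts per network: every combination of the unfixed bits, minus the
# all-0s and all-255s addresses, which are special.
HOSTS_PER_NETWORK = {"A": 2**24 - 2, "B": 2**16 - 2, "C": 2**8 - 2}

for example in ("16.1.2.3", "181.18.4.5", "192.168.20.181", "127.0.0.1"):
    cls = address_class(example)
    print(example, cls, HOSTS_PER_NETWORK.get(cls, "n/a"))
```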

The IPv6 address space is divided similarly but across a wider range: 6 octets (48 bits) are fixed, with the remaining 10 (80 bits) assigned to the local network.

Special IP Addresses

Certain IP addresses get special treatment from TCP/IP networks. Within a network class, an address of 0s denotes an anonymous source address when the host doesn’t know what its IP address is, which is a rare occurrence. An address of all 255s is a broadcast address for the network (all hosts on the network may receive a broadcast). The netmask isn’t strictly an address; it defines which addresses in an IP address range are considered directly connected (that is, on the same network segment). Addresses that differ in the part covered by the netmask are on different networks and must use gateways and routers to communicate.

Depending on the network class, the number of 0s or 255s varies, as the three example networks in Table 1-3 illustrate.

Table 1-3. IP Address Classes

Class  Anonymous     Broadcast       Netmask
A      16.0.0.0      16.255.255.255  255.0.0.0
B      181.18.0.0    181.18.255.255  255.255.0.0
C      192.168.32.0  192.168.32.255  255.255.255.0
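The three rows of the table can be reproduced with Python's standard ipaddress module, which derives the broadcast address from a network address and netmask. The describe function is a name made up for this sketch.

```python
import ipaddress

def describe(network_address: str, netmask: str):
    # Return (anonymous address, broadcast address, usable host count).
    net = ipaddress.IPv4Network(f"{network_address}/{netmask}")
    usable_hosts = net.num_addresses - 2  # exclude all-0s and all-255s
    return str(net.network_address), str(net.broadcast_address), usable_hosts

examples = [
    ("16.0.0.0", "255.0.0.0"),         # class A
    ("181.18.0.0", "255.255.0.0"),     # class B
    ("192.168.32.0", "255.255.255.0"), # class C
]
for address, mask in examples:
    print(describe(address, mask))
```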

Because broadcasts are connectionless—the originating host sends the data to any host capable of receiving it—they’re done using UDP. IPv6 works differently than IPv4 in this respect and doesn’t support broadcasting. Instead, it uses multicasting. For simplicity, I’ll skip this and stick to IPv4 for this discussion.

There are also a few IP address ranges that networking hardware such as routers treat differently. Addresses within these ranges are considered private, and packets for them are never transmitted outside the local network by routers. For this reason, these addresses make good choices for testing networks or for intranets that’ll never be directly connected to the Internet. Table 1-4 shows the complete list of private IP address ranges.

 
Table 1-4. Reserved IP Address Blocks Defined by RFC 1918

Class  Private Networks
A      10.0.0.0
B      172.16.0.0 to 172.31.0.0
C      192.168.0.0 to 192.168.255.0
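A script that needs to know whether an address falls in one of these private blocks can test it as follows. The sketch writes the three ranges in prefix notation (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16), and is_rfc1918 is a name chosen for the example.

```python
import ipaddress

# The three private ranges from the table, written as network prefixes.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_rfc1918(address: str) -> bool:
    # True if the address lies inside one of the private blocks.
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in RFC1918_BLOCKS)

for example in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "192.168.20.181"):
    print(example, is_rfc1918(example))
```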

Another special IP address is the loopback address, 127.0.0.1, which refers to the local host (often given the name localhost, appropriately enough). Use this to access servers running on the local machine.

Mail servers use other addresses in the 127 network to identify open relays and other undesirable mail origins. Services such as MAPS, ORDB, ORBZ, and Spews all operate Domain Name System (DNS) query servers that return an address in the 127 network when the originating IP address is blacklisted. This works because the address isn’t legal, which makes it an effective way for a yes or no query to be made from a DNS server. This is a nonstandard use of TCP/IP addressing standards but an effective one.


Netmasks and Routing

IP addresses are made up of two parts: the network address on the left and the local host address to the right. The network classes A, B, and C correspond to networks with an exact number of octets, but you can use a netmask (sometimes called a subnet mask) to divide the network and local address at points of your choosing using binary arithmetic. This tells you whether two hosts are local to each other or on different networks. The netmask is a fundamental attribute of the network interface, just like the IP address. A server can use it to determine whether a destination IP address is local and can be contacted directly or must be reached indirectly through an intermediate router.

The netmask is a number that looks like an IP address but isn’t. It defines which binary bits form the network part of the address. It’s all 1s to the left of the dividing line between network and local host and all 0s to the right. A netmask with a 1 to the right of a 0 is invalid and illegal. To get the network address for an IP address, the netmask is combined with it using a bitwise AND operation. This gives you 181.18.0.0 for the network address in the class B example.
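Both points can be sketched in Python: a valid netmask is a run of 1 bits followed by a run of 0 bits, and the network address is the bitwise AND of address and netmask. The helper names below are invented for the example, and 181.18.77.9 is an arbitrary host picked from the class B example network.

```python
def to_int(dotted: str) -> int:
    # Convert dotted notation such as 255.255.0.0 to a 32-bit integer.
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def to_dotted(value: int) -> str:
    # Convert a 32-bit integer back to dotted notation.
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

def is_valid_netmask(mask: str) -> bool:
    # Valid only if no 0 bit is ever followed by a 1 bit.
    return "01" not in f"{to_int(mask):032b}"

def network_address(address: str, mask: str) -> str:
    # Bitwise AND keeps the network bits and clears the host bits.
    return to_dotted(to_int(address) & to_int(mask))

print(is_valid_netmask("255.255.0.0"))                # True
print(is_valid_netmask("255.0.255.0"))                # False
print(network_address("181.18.77.9", "255.255.0.0"))  # 181.18.0.0
```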

The netmask attached to an IP address determines which network block (A, B, or C) the address belongs to, as well as the size of the host address space. For example, the netmask for a class C address space is 255.255.255.0. The octets with a value of 255 indicate the class: one octet of 255 followed by zeros indicates a class A, two octets of 255 indicate a class B, and three octets of 255, as in the example, indicate a class C.

Simply put, the netmask does exactly what it sounds like it does: It masks the net or network. In the example just shown, the class C host is determined solely by its last octet value; therefore, the first three octets are network-related. Using this knowledge lets you create the netmask of 255.255.255.0, or N.N.N.H, where N is network and H is host.

If two IP addresses map to the same network address after being joined by an AND with the netmask, they’re on the same network; if not, they’re on different networks. IPv6 netmasks are no different from their IPv4 counterparts, just longer and less interesting to look at.

For example, note the three hosts with IP addresses shown in Table 1-5.

Table 1-5. Example Hosts on Different Networks

IP Address   Host
192.168.1.1  Host A
192.168.1.2  Host B
192.168.2.1  Host C

If you define a netmask of 255.255.255.0 for the network interfaces on each host, Host A and Host B will be assumed to be on the same network. If Host A sends a packet to Host B, TCP/IP will attempt to send it directly. However, Host B can’t send a packet to Host C directly because the netmask stipulates that 192.168.1 and 192.168.2 are different networks. Instead it’ll send the packet to a gateway. Each host is configured with the IP address of at least one gateway to which it sends packets it can’t deliver itself.

If, however, you define a netmask of 255.255.0.0, all three hosts will be considered to be on the same network. In this case, Host A will be able to send to Host C directly, assuming they’re connected to the same physical network. When you experience routing problems on a network, a badly configured netmask is often the cause, particularly if Host A can connect to Host B, but not vice versa.
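The decision each host makes for Hosts A, B, and C can be sketched as follows. The gateway address 192.168.1.254 and the function names are assumptions made up for the example; real routing tables hold more than a single gateway.

```python
import ipaddress

def same_network(a: str, b: str, netmask: str) -> bool:
    # Two hosts are local to each other if AND-ing each address with the
    # netmask yields the same network address.
    mask = int(ipaddress.IPv4Address(netmask))
    net_a = int(ipaddress.IPv4Address(a)) & mask
    net_b = int(ipaddress.IPv4Address(b)) & mask
    return net_a == net_b

def next_hop(source: str, destination: str, netmask: str, gateway: str) -> str:
    # Deliver directly to local hosts; hand everything else to the gateway.
    return destination if same_network(source, destination, netmask) else gateway

HOST_A, HOST_B, HOST_C = "192.168.1.1", "192.168.1.2", "192.168.2.1"
GATEWAY = "192.168.1.254"  # hypothetical gateway address

print(next_hop(HOST_A, HOST_B, "255.255.255.0", GATEWAY))  # 192.168.1.2
print(next_hop(HOST_B, HOST_C, "255.255.255.0", GATEWAY))  # 192.168.1.254
print(next_hop(HOST_A, HOST_C, "255.255.0.0", GATEWAY))    # 192.168.2.1
```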

IP is responsible for ensuring that a packet addressed to a particular host gets delivered to that host. By dividing the address space into logical networks, the task of finding a host becomes much simpler—instead of having to know every host on the Internet, a host needs to know only a list of gateways and pick the one that’s the next logical step on the route. The identity of the next stop is then fed to the underlying protocol (for example, Ethernet) so that the packet is forwarded to it for onward delivery. In Ethernet’s case, the identity is the Ethernet address of the gateway on the local network. The gateway carries out the same procedure using its own list of gateways and so on until the packet reaches the final gateway and its destination.

NOTE For a practical example of netmasks in action, see the sample ifconfig output given later in the chapter. This shows two local addresses and an external Ethernet interface partitioned by a netmask to force them onto separate networks.

Web Services: Well-Known Ports

When a client contacts a server, it’s generally because the client wants to use a particular service—e-mail or File Transfer Protocol (FTP), for example. To differentiate between services, TCP implements the concept of ports, allowing a single network interface to provide many different services. When a client makes a network connection request to a server, it specifies not only the IP address of the server it wants to contact as required by IP, but also a port number.
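The pairing of IP address and port number is visible in any socket program. The sketch below runs a tiny server and client in the same process over the loopback address; it asks the operating system for any free port rather than a well-known one, and loopback_demo is a name invented for the example.

```python
import socket

def loopback_demo(message: bytes = b"hello"):
    # Server side: bind to the loopback address; port 0 asks the operating
    # system to pick any free port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        host, port = server.getsockname()

        # Client side: a connection request names both the address and the port.
        with socket.create_connection((host, port)) as client:
            conn, peer = server.accept()
            with conn:
                client.sendall(message)
                received = conn.recv(1024)
    return port, peer, received

port, peer, received = loopback_demo()
print("server port:", port)
print("client address and port:", peer)
print("received:", received)
```

The client end of the connection has a port number too, chosen by the operating system, which is how replies find their way back to the right program.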

By default, HTTP servers such as Apache serve port 80, which is the standard port number for HTTP. When a connection request arrives for port 80, the operating system knows that Apache is watching that port and directs the communication to it. Each standard network service and protocol has an associated port that clients may connect to for that service, be it HTTP, FTP, telnet, or another service.

The standard list of port numbers is defined under Unix in a file called /etc/services, which lists all the allocated port numbers. The corresponding file under Windows is called Services and is located in the Windows installation directory under C:\WINNT\system32\drivers\etc. In fact, the operating system and the various daemons responsible for providing services already know what ports they use. Other applications use /etc/services to refer to a service by name instead of by number. /etc/services also specifies which protocol (TCP or UDP) a service uses; many services handle both TCP and UDP connections. The following is a short list of some of the most common port numbers, extracted from a typical /etc/services file:

ftp 21/tcp # File Transfer Protocol
finger 79/tcp # Finger Daemon
www 80/tcp http # WorldWideWeb HTTP
www 80/udp # HyperText Transfer Protocol
pop-2 109/tcp postoffice # Post Office Protocol
pop-2 109/udp # Version 2
pop-3 110/tcp # Post Office Protocol
pop-3 110/udp # Version 3
nntp 119/tcp readnews untp # USENET News Transfer Protocol
ntp 123/tcp # Network Time Protocol
ntp 123/udp #
imap2 143/tcp imap # Interactive Mail Access
imap2 143/udp imap # Protocol V2
snmp 161/udp # Simple Net Management Protocol
imap3 220/tcp # Interactive Mail Access
imap3 220/udp # Protocol V3
https 443/tcp # Secure HTTP
https 443/udp # Secure HTTP
uucp 540/tcp uucpd # Unix to Unix Copy

Of particular interest in this list are the HTTP port at 80 and the HTTPS port at 443. Note that both UDP and TCP connections are acceptable on these ports. How they’re handled when used on a given port depends on the program handling them. Just because a service is listed doesn’t mean that the server will respond to it. Indeed, there are plenty of good reasons not to respond to some services: telnet, FTP, SNMP, POP-3, and finger are all entirely unrelated to serving Web pages and can be used to weaken server security.
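The format of /etc/services is simple enough to parse by hand. The following sketch reads a few lines in that format from a string rather than from the real file, so that it behaves the same on any system; parse_services and SAMPLE are names made up for the example.

```python
SAMPLE = """\
ftp 21/tcp # File Transfer Protocol
www 80/tcp http # WorldWideWeb HTTP
snmp 161/udp # Simple Net Management Protocol
https 443/tcp # Secure HTTP
"""

def parse_services(text: str) -> dict:
    # Map (service name or alias, protocol) to a port number.
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the comment, then split
        if len(fields) < 2:
            continue  # blank or comment-only line
        name, port_and_protocol, *aliases = fields
        port, protocol = port_and_protocol.split("/")
        for label in (name, *aliases):
            table[(label, protocol)] = int(port)
    return table

services = parse_services(SAMPLE)
print(services[("http", "tcp")])   # 80, found through the alias of www
print(services[("https", "tcp")])  # 443
```

Python's standard socket.getservbyname function performs the same kind of lookup against the system's own services database.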

On Unix systems, port numbers below 1024 are reserved for system services and aren’t useable by programs run by nonprivileged users. For Apache to run on port 80, the standard HTTP port, it has to be started by root or at system startup by the operating system. Nonprivileged users can still run an Apache server as long as they configure Apache to use a port number of 1024 or higher. On Windows, no such security conditions exist.

Internet Daemon: The Networking Super Server

Not every service supplied by a host is handled by a constantly running daemon. Because that would be very wasteful of system resources, Unix runs many of its services through the Internet daemon (inetd), a super server that listens to many different ports and starts a program to deal with connections as it receives them.

One such service is FTP, which usually runs on port 21. Unlike Apache, which usually runs stand-alone and appears as several httpd processes, there’s no ftpd process running under normal conditions. However, inetd is looking at port 21, and when it receives a TCP connection request, it starts a copy of ftpd to handle the connection. Once started, ftpd negotiates its own private connection with the client, allowing inetd to get back to listening. Once the communication is over—in FTP’s case, when the file is transferred or aborted—the daemon exits.

Apache 1.3 has a configuration directive, ServerType, which allows it to run either as a stand-alone service or to be invoked by inetd, as FTP is. In this configuration, there are no httpd processes running until inetd receives a connection request for port 80 (or on whatever port inetd has been configured to start Apache). inetd then runs httpd and gives it the incoming connection, allowing Apache to handle the request. Because a separate invocation of Apache is started for each individual client connection, and each invocation lasts only for as long as it takes to satisfy the request, this is a hideously inefficient way to run Apache, which is why almost all Apache configurations are stand-alone. Consequently, Apache 2 removes the option entirely.

inetd isn’t without its problems. As the central coordinating daemon for many lesser networking services, it’s one of the biggest sources of network security breaches. The daemon itself isn’t insecure, but it implements services such as telnet that are. As a result, many Web server administrators choose to disable it entirely because none of the services it manages are necessary for a Web server. More recent Unix distributions come with an improved daemon called xinetd that builds in additional security measures, but in most cases there are still no compelling reasons to enable it. See Chapter 10 for more information on this topic.


The Future: IPv6

The current IP protocol, IPv4, uses four 8-bit numbers to make up IP addresses, allowing for 2^32 (about 4.3 billion) possible addresses. Even allowing for anonymous and broadcast addresses, that’s theoretically enough to give one to almost every person on the planet and certainly everyone with a computer. Unfortunately, because of the way all these addresses are divided up into A, B, and C class networks, IP addresses are in danger of running out.

The solution to this is IPv6, version 6 of the IP protocol, which makes provisions for 128-bit addresses instead of the current 32 bits. Whereas IPv4 addresses are generally written as four decimal numbers separated by periods, IPv6 addresses are written as eight four-digit hexadecimal numbers separated by colons. Within each block, leading zeros can be omitted, and one run of consecutive all-zero blocks can be replaced by a double colon for brevity, so an IPv6 address could look like fe80::910:a4ff:aefe:9a8, which is short for fe80:0000:0000:0000:0910:a4ff:aefe:09a8. This will allow a mind-boggling 2^128 possible IP addresses.

IPv6 also introduces support for several other important features. One is quality-of-service information, which allows for the prioritizing of data across a network. This allows servers to handle HTTP traffic with a higher priority than, for example, e-mail. Another is authentication and encryption, which is provided for by IPSec, the security specification built into the IPv6 protocol.

NOTE IPSec at its simplest is a replacement for SSL, but it’s capable of much more, including the authentication and secure delivery of individual packets of information. It’s the basis of modern VPNs and is well worth investigation by companies looking to extend their private intranets securely to remote offices and mobile computers.

IPv6 support is now commonly available for most platforms, but Linux and BSD have had it the longest. Commercial platforms caught up more recently. Apache 2 now supports IPv6 addresses in all directives that deal with the network, notably Listen, VirtualHost, allow, and deny. Implementation of IPv6 networks is still happening slowly, though, despite the advantages that it offers.

However, adoption of IPv6 will gain critical mass only when enough servers support it. Therefore, consider adding IPv6 to Apache’s configuration, and if you’re hosting a server at an ISP, encourage the ISP to add support for IPv6 as well. If the ISP can’t yet support IPv6, hassle them until they do or move to one that does. Apache 2 will automatically build in support for IPv6 if it’s compiled on an operating system that supports it.

IPv6 is essentially a separate network running alongside IPv4. The principal network supporting IPv6 during its setup and deployment is known as the IPv6 backbone (6bone), and access points to it are available in most countries. There are three ways to get an IPv6 address and become part of the IPv6 network:

  • Get a 6bone address through an ISP. These addresses are ultimately assigned by 6bone.

  • Get a production IPv6 address from an ISP with a production IPv6 top-level network identifier. The International Regional Internet Registry (RIR) assigns these addresses.

  • Use an IPv6 to IPv4 tunnel to connect a local IPv4 address to an external IPv6 address. Addresses in this range start with 2002, followed by the IPv4 address of the router on the local network; the remaining bits form the local portion of the IPv6 address and are allocated by the ISP.
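For the tunnel case, the way the IPv4 address is embedded after 2002 can be sketched with Python's standard ipaddress module. The address 192.0.2.1 is a documentation-only example, and six_to_four_prefix is a name invented for the sketch.

```python
import ipaddress

def six_to_four_prefix(ipv4: str) -> ipaddress.IPv6Network:
    # Build the 48-bit prefix: 2002 followed by the 32 bits of the IPv4 address.
    v4 = int(ipaddress.IPv4Address(ipv4))
    prefix = (0x2002 << 112) | (v4 << 80)
    return ipaddress.IPv6Network((prefix, 48))

prefix = six_to_four_prefix("192.0.2.1")
print(prefix)  # 2002:c000:201::/48

# Going the other way, the module can recover the embedded IPv4 address.
embedded = ipaddress.IPv6Address("2002:c000:201::1").sixtofour
print(embedded)  # 192.0.2.1
```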

You can find more information on 6bone and IPv6, as well as detailed instructions on how to get established on an IPv6 network, at http://www.6bone.net/. Note especially the page on how to join 6bone.

Networking Tools

Administering a network is a complex process too involved to discuss here, but some aspects of administration from a performance and security point of view are discussed in Chapters 8 and 10. However, there are a few utilities that a Web server administrator might sometimes find useful when troubleshooting a server. Unix is generally better equipped than most other operating systems for this kind of analysis because it evolved hand-in-hand with the Internet and is the predominant operating system for implementing Internet systems.

Displaying the Configuration

ifconfig is a standard utility on any Unix system and deals with network interface configuration (if is short for interface). You can use it to display the current configuration of a network interface. A privileged user can also use it to change any parameter of a network interface, be it an Ethernet card, a serial PPP link, or the loopback interface. For example, to display the configuration of all network interfaces on the host, use this:

$ /sbin/ifconfig -a

On Windows, use the analogous ipconfig command:

> ipconfig /all

On a host with one Ethernet interface, this might produce something such as the following, showing two interfaces and two aliases of the loopback interface:

eth0 Link encap:Ethernet HWaddr 00:10:A4:FE:09:68 
     inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.128 
     inet6 addr: fe80::910:a4ff:aefe:9a8/10 Scope:Link 
     UP BROADCAST NOTRAILERS RUNNING MTU:1500 Metric:1 
     RX packets:112 errors:0 dropped:0 overruns:0 frame:0 
     TX packets:14 errors:0 dropped:0 overruns:0 carrier:0 
     collisions:0 txqueuelen:100
     RX bytes:9109 (8.8 Kb) TX bytes:5658 (5.5 Kb)

lo   Link encap:Local Loopback 
     inet addr:127.0.0.1 Mask:255.0.0.0
     inet6 addr: ::1/128 Scope:Host
     UP LOOPBACK RUNNING MTU:16436 Metric:1
     RX packets:1540 errors:0 dropped:0 overruns:0 frame:0 
     TX packets:1540 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:0
     RX bytes:231276 (225.8 Kb) TX bytes:231276 (225.8 Kb)

lo:1 Link encap:Local Loopback 
     inet addr:192.168.1.131 Mask:255.255.255.128 
     UP LOOPBACK RUNNING MTU:16436 Metric:1

lo:2 Link encap:Local Loopback 
     inet addr:192.168.1.132 Mask:255.255.255.128 
     UP LOOPBACK RUNNING MTU:16436 Metric:1

The first interface is an Ethernet card with its own unique fixed Ethernet address assigned by the manufacturer, plus an IP address and netmask, which are configurable. This particular interface is on a server with IPv6 support, so it has both IPv4 and IPv6 addresses assigned to it by the operating system. The IPv4 address also has a netmask that puts it on a class C network and a broadcast address that’s a combination of the IP address and netmask. ifconfig also shows that the interface is up and running and capable of broadcasts, and it provides a set of statistics about the activity of the interface.

NOTE The Maximum Transmission Unit (MTU) is 1500—the maximum for Ethernet.

The second is the local loopback interface. Because it’s a loopback device and doesn’t depend on any actual hardware, it has neither an Ethernet address nor a broadcast address. Because Ethernet’s packet limit doesn’t apply to the loopback interface, it can get away with packets of up to 16,436 bytes. Because all data must loop back, the amount received is the same as the amount sent. If it weren’t, something strange would be happening.

The third and fourth interfaces are IP aliases, which are a feature of some modern operating systems that allows several IP addresses to be assigned to the same interface and produce virtual interfaces. These particular aliases are for the loopback address, but you could alias the Ethernet interface, too, if you wanted to respond to several external IP addresses on the same server.

Note that the addresses of aliases don’t need to be related to the primary interface’s address. In this case, however, the aliases have addresses on the same class C network as the Ethernet interface. Because the loopback aliases and the Ethernet interface must nevertheless be treated as being on different networks, the netmask is set so that a final octet value of 0-127 is considered separate from 128-255. The aliased interfaces are 131 and 132, so they’re seen as separate from the Ethernet interface, which has a final octet of 1. This is essential to prevent real network traffic from being sent to purely local network addresses, and vice versa.
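The effect of the 255.255.255.128 netmask can be checked with Python’s standard ipaddress module. This is only a sketch: the alias address and netmask are taken from the ifconfig output, but the Ethernet address 192.168.1.1 is an assumption based on the final octet of 1 mentioned in the text.

```python
import ipaddress

# Alias address and netmask as shown for lo:1 in the ifconfig output.
alias = ipaddress.ip_interface("192.168.1.131/255.255.255.128")

# Assumed Ethernet address (final octet of 1 on the same class C network).
ethernet = ipaddress.ip_address("192.168.1.1")

# The subnet the alias belongs to: the upper half of the class C range.
print(alias.network)

# Whether the Ethernet address falls inside the alias subnet.
print(ethernet in alias.network)
```

The netmask splits the class C range in two, so the alias lands in the half covering final octets 128-255, and an address ending in 1 is outside it.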

Of course, the command-line arguments and output of ifconfig can vary from system to system. Use man ifconfig to bring up the manual page for ifconfig on your system.

Monitoring a Network

In addition to ifconfig, netstat is another standard Unix tool that’s useful for monitoring a network. It can extract many different kinds of information about all network interfaces or just one. A short rundown of some of the arguments netstat uses will give you an idea of how to use this tool (see Table 1-6).

Argument   Effect
(none)     Display open connections (sockets)
-a         Also show listening and non-listening sockets
-c         Redisplay selected table continuously
-i         Display network interfaces
-n         Display IP addresses, don’t resolve names
-r         Display network routes
-s         Display network statistics
-v         Provide verbose information

Table 1-6. Command Line Arguments for netstat

netstat supports many more arguments, especially for the default (open network connections) table—see http://snowhite.cis.uoguelph.ca/course_info/27420/netstat.html for details.

Examining the Packets

snoop and tcpdump both enable an administrator to examine the packets being sent on a network. snoop is available on Solaris, and tcpdump is a free tool of similar capability available on Linux and FreeBSD. (tcpdump can be used on any platform that can build it because the source code is freely available.)

Both tools allow packets to be examined as they appear on the network. Various options allow packets to be filtered according to source IP address and port, destination IP address and port, protocol, message type, and so on. For example, Apache’s communications could be monitored on port 80, filtered down to data packets.

Note that it isn’t necessary to be on the server to do this. Any computer connected to the same network as the server will do, but for security reasons Unix usually requires a user to be privileged before allowing them to spy on the network.

Pinging the Server

ping, the simplest and handiest network tool of them all, sends out an ICMP message to a remote hostname or IP address to establish that it’s both present and reachable and reports the time taken for the round-trip. Most versions of ping also allow the remote server to be pinged at regular intervals—handy for preventing a network connection from timing out and disconnecting.

Testing the Handling Capacity of the Network and Server

spray, a variant of ping whose name may vary from system to system, floods a destination server with ping packets to test the handling capacity of the network and server. The higher the percentage of packets that reach the destination, the better the network. This is an unfriendly thing to do to a network that’s handling real network traffic, so you should use it with caution.

Diagnosing Problems

traceroute is useful for diagnosing problems with establishing network connections, for example, in cases where ping fails to reach the remote server. traceroute sends probe packets with steadily increasing time-to-live values and uses the ICMP replies returned by every intermediate step to map the route from the host to the destination. Across the Internet, this can return upward of 20 lines in some cases.

traceroute is particularly useful when diagnosing problems surrounding failed connections because it can sometimes pinpoint where along the line the connection attempt is failing. It can also be useful for determining incorrectly configured or faulty systems in the network. Again, see http://www.stopspam.org/usenet/mmf/man/traceroute.html for more information.

Server Hardware

When choosing server hardware for your Web site, there are several issues to consider, especially whether to buy hardware at all. See the section “Get Someone Else to Do It” at the end of this chapter for more information.

Supported Platforms

Apache runs on a wide range of platforms. Typically, it runs on Unix systems, of which the most popular are the free Unix-like operating systems, Linux and FreeBSD. MacOS X is also popular, if only because every machine shipped with OS X includes an Apache installation by default.

Apache also runs on Windows NT, but Apache 1.3 doesn’t run quite as smoothly on NT as it does on Unix. Apache 2 is better suited to Windows and provides improved performance and resiliency. There are also efforts to port Apache to other platforms in case you have a specific preference for one that’s as yet unsupported.

Corporations that have service contracts and care about support should opt for the most relevant platform, assuming it runs Apache and performance and stability issues aren’t a concern. For anyone on a budget, a cheap PC with Linux or FreeBSD is economical, and both platforms have a good record for stability. Building an inexpensive cluster out of a selection of cheap servers is also more practical than it might appear at first. Simple clustering can be done using nothing more than Apache and a name server. For a simple server with undemanding Web sites, even an old 486 can be perfectly adequate. If you have old hardware to spare and want to put off a purchasing decision until you have a better idea of what you’ll need, older machines that have been retired from desktop use can fit the bill nicely. Alternatively, you can buy a cheap PC for development in the interim.

When it comes to free software, Linux and FreeBSD are both popular choices. The main difference between them is that FreeBSD is slightly more stable and has faster networking support, but Linux has vastly more software available for it. The distinction is slight, however, because Linux is easily stable enough for most Web applications, and porting software from Linux to FreeBSD is usually not difficult.

If stability is of paramount importance, and you don’t intend to install much additional software, choose FreeBSD. If you plan to install additional packages for database support, security, or e-commerce, Linux is probably preferable. Other BSD variants that are popular for Web servers are OpenBSD, NetBSD, and of course MacOS X.

As of writing, the following platforms fully support Apache:

AIX         A/UX                BS2000/OSD
BSDI        DGUX                DigitalUnix
FreeBSD     HP-UX               IRIX
Linux       MacOS X             NetBSD
NetWare     OpenBSD             OS/2
OSF/1       QNX                 ReliantUnix
UnixWare    Windows 9x and ME   Windows NT, 2000, and XP

Basic Server Requirements

If you’re in a homogeneous environment such as a company, it makes sense to use the same kind of equipment for your server as you use elsewhere, if only to preserve the sanity of the system administrators and make network administration simpler.

However, this isn’t as important a consideration as it might seem. If your server isn’t strongly connected to the rest of the company intranet (for example, if it doesn’t require access to a database), it’s a good idea to isolate the server from your intranet entirely for security. Because there’s no communication between the Web server and other servers, compatibility issues don’t arise.

Apache will run on almost anything, so unless you have a specific reason to buy particular vendor hardware, any reliable low-cost or medium-cost PC will do the job. Stability is far more important than brand.

Using Dedicated Hardware

One point that is still worth mentioning: Run Apache on its own dedicated hardware. Given the demands that a Web server can impose on a server’s Central Processing Unit (CPU), disk, and network, and given that Apache will run on very cheap hardware, there’s no reason not to buy a dedicated server for Web sites and avoid sharing resources with other applications. It’s also not a good idea to use a computer that hosts important applications and files for a public-access Web site.

Using High-Performance/High-Reliability Servers

For demanding applications, consider using a multiprocessor system. With expandable systems, you can scale up the server with additional processors or memory as demand on it increases.

Alternatively, and possibly preferably from both an expense and reliability point of view, you can cluster several independent machines together as a single virtual server. Several ready-made solutions exist to do this, and you can also build custom clusters. I cover clustering in Chapter 7.

Memory

You can never have too much memory. The more you have, the more data can be cached for quick access. This applies not only to Apache but to any other processes you run on the server.

You need the amount of memory that allows the server and any attendant processes to run without resorting to virtual memory. If the operating system runs out of memory, it’ll have to temporarily move data out of memory to disk (also known as swapping). When that data is needed again, it has to be swapped in and something else swapped out, unless memory has been freed in the meantime.

Clearly this is inefficient; it holds up the process that needs the data and ties up the disk and processor. If the data being swapped is the Web server’s cache or frequently accessed database tables, the performance impact can be significant.

To calculate how much memory you need, add the amount of memory each application needs and use the total. This is at best an inexact science, so the rule of thumb remains: Add more memory. Ultimately, only analyzing the server in operation will tell if you have enough memory.

The vmstat tool on most Unix systems is one way to monitor how much the server is overrunning its memory and how much time it’s spending on swapping. Similar tools are available for other platforms. Windows NT has a very good tool called perfmon (Performance Monitor).

An operating system that handles memory efficiently is also important (see the “Operating System Checklist” section later in this chapter).

Network Interface

CPU performance and plenty of memory by themselves won’t prevent a bottleneck if the Input/Output (I/O) performance of the system (frequency of access to the interface card and hard disk) is insufficient.

In an intranet, very high demands are made of and from the network and interface card. Here, an older 10Base2 or 10BaseT connection can easily become a problem. A 10Base network can cope with a maximum throughput of six to eight megabits per second, and a Web server accessed at a rate of 90 hits per second will soon reach this limit.
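The arithmetic behind these figures can be sketched in a few lines of Python. The function name bytes_per_hit is hypothetical, and the calculation assumes that one megabit is 1,000,000 bits and ignores protocol overhead, so treat the result as a rough guide only.

```python
def bytes_per_hit(throughput_mbit_per_s: float, hits_per_s: float) -> float:
    """Average number of bytes each hit can use before the link saturates."""
    # Convert megabits per second to bytes per second (8 bits per byte).
    bytes_per_second = throughput_mbit_per_s * 1_000_000 / 8
    return bytes_per_second / hits_per_s


# The throughput range and hit rate quoted in the text.
for mbit in (6, 8):
    size = bytes_per_hit(mbit, 90)
    print(f"{mbit} Mbit/s at 90 hits/s: about {size:.0f} bytes per hit")
```

In other words, at 90 hits per second the limit is reached once the average response is somewhere between roughly 8KB and 11KB, which is not large for a page with images.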

100BaseT network cards and cabling are now negligibly more expensive, so there’s no reason to invest in 10Base networking unless you have a legacy 10Base network that you can’t easily replace. Even in this case, dual 10/100Base network cards are a better option because you can always upgrade the rest of the network later. For the most demanding applications, Gigabit Ethernet is also available, but it costs considerably more to implement.

Provided that other computers don’t unduly stretch the network, a normal Ethernet card will in most cases be sufficient, as long as it’s not the cheapest card available at a cut-price computer store. Note that many lower-end cards don’t use features such as Direct Memory Access (DMA), also called bus mastering, so they perform significantly worse even though they’re “compatible” with more expensive ones.

Using Dual Network Connections

Fitting two network cards and assigning them different IP addresses on different networks is an excellent approach for servers, especially if you intend to connect them both to an ISP. The external network interface is then used exclusively for Web server access, and the internal network interface links to the database server or backup systems, allowing you to process database requests and make backups without affecting the external bandwidth. Similarly, a busy Web site won’t affect bandwidth on the internal network.

Dual network interfaces have an additional security benefit: By isolating the internal and external networks and eliminating any routing between them, it becomes relatively easy to deny external users access to the internal network. For example, if you have a firewall, you can put it between the internal network interface and the rest of the network, which leaves the server outside but everything else inside.

Internet Connection

If the server is going to be on the Internet, you need to give careful consideration both to the type of connection you use and the capabilities of the ISP that’ll provide it.

Here are some questions to ask when considering an ISP:

  • Are they reliable?

  • Do they have good connectivity with the Internet (who are they peered with, and do they have redundant circuits)?

  • Are you sharing bandwidth with many other customers?

  • If so, do they offer a dedicated connection?

If you’re running a site with an international context (for example, if you run all the regional sites of an international company from one place), find out the answers to the following, as well:

  • Do they have good global connectivity?

  • Does technical support know the answer to all these questions when called?

Note that just because an ISP can offer high bandwidth doesn’t necessarily mean that users on the Internet can utilize that bandwidth—that depends on how well connected the ISP is to its peers and the Internet backbone in general. Many ISPs rely on one supplier for their own connectivity, so if that supplier is overloaded, your high bandwidth is useless to you and your visitors, even if the ISP’s outgoing bandwidth is theoretically more than adequate.

Hard Disk and Controller

Fast hard disks and a matching controller definitely make sense for a Web server, and a SCSI system is infinitely preferable to Integrated Device Electronics (IDE) if performance is an issue.

For frequently accessed Web sites, it also makes sense to use several smaller disks rather than one large hard disk. If, for instance, you operate one large database or several large virtual servers, store each one’s data on its own disk for superior access performance, because one hard disk can read from only one place at one time.

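RAID 0 (striping) can also be used to increase the performance from a disk array. Combining it with RAID 1 (mirroring) for redundancy can be an effective way of improving server performance. This combination is known as RAID 0+1, RAID 1+0, or RAID 10. The names are often used interchangeably, although strictly speaking RAID 0+1 mirrors two striped sets, whereas RAID 1+0 (RAID 10) stripes across mirrored pairs. Either way, it can be expensive.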

Operating System Checklist

For the server to run effectively (that is, be both stable and efficient), the hosting operating system needs to be up to the task. I have discussed operating systems in reference to Apache’s supported platforms, and I mentioned that as a server platform, Unix is generally preferred for Apache installations. Whatever operating system you choose, it should have all the following features to some degree:

Stability: The operating system should be reliable and capable of running indefinitely without recourse to rebooting. Bad memory management is a major cause of long-term unreliability.

Security: The operating system should be resistant to all kinds of attack, including DoS attacks (which tie up system resources and prevent legitimate users from getting service), and have a good track record of security. Security holes that are discovered should be fixed rapidly by the responsible authority. Note that rapidly means days, not weeks.

Performance: The operating system should use resources effectively by handling networking without undue load on the rest of the operating system and performing task-switching efficiently. Apache in particular runs multiple processes to handle incoming connections; inefficient switching causes a performance loss. If you plan to run on a multiprocessor system, Symmetric Multi Processor (SMP) performance is also a key issue to consider.

Maintenance: The operating system should be easy to upgrade or patch for security concerns, shouldn’t require rebooting or being taken offline to perform anything but significant upgrades, and shouldn’t require that the whole system be rebooted to maintain or upgrade just one part of it.

Memory: The operating system should use memory effectively, avoid swapping unless absolutely necessary and then swap intelligently, and have no memory leaks that tie up memory uselessly. (Leaky software is one of the biggest causes of unreliable Web servers. For example, until recently, Windows NT has had a very bad record in this department.) However, leaky applications are also problematic. Fortunately, Apache isn’t one of them, but it used to be less stellar in this regard than it is now.

License: The operating system shouldn’t come with strings attached that may compromise your ability to run a secure server. Some vendors, even large and very well-known ones, have been known to insert new clauses in license agreements that must be agreed to in order to apply critical security patches. Some of the terms and conditions in these licenses grant permission for the vendor to modify or install software at will over the Internet. This is a clear security concern, not to mention a confidentiality issue for systems handling company or client information, so any vendor with a track record of this kind of behavior should be eliminated from consideration, irrespective of how well they score (or claim to score) otherwise.

Third-party modules can be more of a problem, but Apache supplies the MaxRequestsPerChild directive to forcibly restart Apache processes periodically, preventing unruly modules from misbehaving too badly. If you plan to use large applications such as database servers, you should check their records, too.

Redundancy and Backup

If you’re planning to run a server of any importance, you should give some attention to how you intend to recover the server if, for whatever reason, it dies. For example, you may have a hardware failure, or you might get cracked and have your data compromised. A RAID array is a good first line of defense, but it can be expensive. It also keeps the backup in the server itself, which isn’t much comfort if the server happens to catch fire and explode. (Yes, this actually happens.)

A simple backup solution is to equip the server with a DAT drive or other mass storage device and configure the server to automatically copy the relevant files to tape at regular scheduled times. This is easy to set up even without specialist backup software; on a Unix platform, a simple cron job will do this for you.

A better solution is to back up across an internal network, if you have one. This would allow data to be copied off the server to a backup server that could stand in when the primary server goes down. It also removes the need for manual intervention because DAT tapes don’t swap by themselves.

If the server is placed on the Internet (or even if it isn’t), you should take precautions against the server being compromised. If this happens, there is only one correct course of action: Replace everything from reliable backups. That includes reinstalling the operating system, reconfiguring it, and reinstalling the site or sites from backups. If you’re copying to a single backup medium every day and don’t spot a problem before the next backup occurs, you have no reliable backup the following day. The moral is to keep multiple, dated backups.
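As an illustration of keeping multiple, dated backups, here is a minimal Python sketch that could be run from a daily cron job. The function name make_dated_backup, the site- file prefix, the paths in the usage comment, and the default of seven retained archives are all assumptions made for this example. It’s one possible approach, not a replacement for a proper backup tool.

```python
import datetime
import pathlib
import tarfile


def make_dated_backup(source_dir, backup_dir, keep=7, today=None):
    """Archive source_dir under a dated name and keep only the newest archives."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    source = pathlib.Path(source_dir)
    target = pathlib.Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Name the archive after the date so that each day gets its own file.
    stamp = (today or datetime.date.today()).isoformat()
    archive = target / f"site-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(source), arcname=source.name)

    # ISO dates sort chronologically, so the oldest archives come first.
    archives = sorted(target.glob("site-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()
    return archive


# Hypothetical usage; adjust the paths to match your own server:
# make_dated_backup("/usr/local/apache/htdocs", "/backup/web", keep=7)
```

Because each archive carries its date, a problem that goes unnoticed for a day or two doesn’t overwrite the last good copy. The value of keep should be large enough to cover how long a compromise might go undetected.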

There are several commercial tools for network backups, and your choice may be influenced by the server’s environment—the corporate backup strategy most likely can extend to the server, too. Free options include obvious but crude tools such as FTP or NFS to copy directory trees from one server to another. (Unless you have a commandingly good reason to do so, you should probably not ever have NFS enabled on the server because this could compromise its security.)

A better free tool for making backups is rsync, which is an intelligent version of the standard Unix rcp (remote copy) command that copies only the differences between directory hierarchies. Better still, it can run across an encrypted connection supplied by OpenSSH (secure shell), another free tool. If you need to make remote backups of the server’s files across the Internet, you should seriously consider this approach. (I cover both rsync and ssh in Chapter 10.) On the subject of free tools, another more advanced option worth noting is the Concurrent Versions System (CVS). More often applied to source code, it works well on HTML files, too. (For more information on CVS, see http://www.cvshome.org/.)

A final note about backups across the network: Even if you use a smart backup system that knows how to make incremental backups of the server, a large site can still mean a large quantity of data. If the server is also busy, whenever a backup is performed this data will consume bandwidth that would otherwise be put toward handling browser requests, so it pays to plan backups and schedule them appropriately. If you have a lot of data to copy, consider doing it in stages (on a per-directory basis, for example) and definitely do it incrementally. Having dual network connections, backing up on the internal one, and leaving the external one for HTTP requests is a definite advantage here.

Specific Hardware Solutions

Many vendors now sell hardware with Apache or an Apache derivative preinstalled, coupled with administrative software to simplify server configuration and maintenance. At this point, all these solutions are Unix-based, predominantly Linux. Several ISPs are also offering some of these solutions as dedicated servers for purchase or hire.

Larger vendors include HP, Dell, Sun, and of course IBM, as well as a diverse list of smaller companies. The list of vendors is growing all the time—the Linux VAR HOWTO at http://en.tldp.org/HOWTO/VAR-HOWTO.html (and other places) has some useful pointers.

Get Someone Else to Do It

As an alternative to setting up a server yourself—with all the attendant issues of reliability, connectivity, and backups this implies—you can buy or hire a dedicated server at an ISP, commonly known as colocation.

The advantage of this is that the ISP handles all the issues involving day-to-day maintenance, while you still get all the flexibility of a server that belongs entirely to you. You can even rebuild and reconfigure Apache as you want it because you have total control of the server. This also means you have total control over wrecking the server, so you still need a Web server administrator even though the server isn’t physically present.

The disadvantage is that you’re physically removed from the server. If it has a serious problem, you may be unable to access it to find out what the problem is. The ISP will most likely also impose bandwidth restrictions, which you should be aware of. You’re also reliant on the ISP’s service, so checking out their help desk before signing up is recommended.

Note that services vary from one ISP to another—some will back up the server files automatically; others will not. As with most things on the Internet, it pays to check prospective ISPs by looking them up on discussion lists and Usenet newsgroups.

Caveat emptor!

NOTE More introductory material is available at http://httpd.apache.org/docs/misc/FAQ.html#what.

Summary

In this chapter, I covered the basic concepts of what a Web server is and introduced you to Apache. There are many reasons that Apache is a popular Web server, including the important fact that it’s free. The best form of support for Apache is the informative and informal support of the online community that’s very active in developing and maintaining it.

I also discussed how Apache works on Unix and Windows as well as some networking tools such as ifconfig, netstat, snoop, tcpdump, ping, spray, and traceroute. In the latter part of the chapter, I covered the basic server requirements and some specific hardware solutions for your Web server.

In the next chapter, I’ll cover installing Apache and configuring it as a basic Web server.

This article is excerpted from Pro Apache by Peter Wainwright (Apress, 2004; ISBN 1590593006).
