Professional Apache

This excerpt from Wrox Press Ltd.'s Professional Apache covers Chapter 8 – Improving Apache's Performance. It tells you how to configure Apache for peak performance using caching and clustering, plus much more. Buy it on Amazon.com now!

This manuscript is Chapter 8, Improving Apache's Performance, from the Wrox Press book Professional Apache.

Professional Apache is the book for anybody who needs to get the most out of the Apache web server. It provides the information needed to apply Apache to real-world problems, using extensive examples. The book covers:

  • The Apache 1.3.x server, including new features in Apache 1.3.9
  • Getting Apache installation right – whether you download or build from source
  • Configuring and tuning Apache to suit the web site you want to create
  • Setting up Apache to deliver dynamic content efficiently and securely
  • Adding SSL encryption support to your Apache server
  • Extending Apache with third-party modules

For further details about the book, and other books in our range, visit the Wrox Press Web Site.

Getting the best performance possible out of a web server is a prime concern for many administrators. Apache has a justifiably good reputation for performance, but that doesn’t absolve the administrator of responsibility for making sure Apache is working at its best in their particular circumstances.

Before reading this chapter, administrators serious about performance should rebuild Apache from source on the platform on which it is to run, for the reasons described in Chapter 3. It pays to pick and choose modules carefully, only including those which are necessary, and then to build Apache statically. Not only does this remove the need for mod_so, it also makes Apache a few percent faster in operation. Coupled with platform optimizations, this can make a significant difference to Apache’s performance even before other considerations.
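
For example, a static build that drops a couple of modules might look like the following sketch – which modules to disable depends entirely on the site, and the full procedure is described in Chapter 3:

CFLAGS="-O2" ./configure --prefix=/usr/local/apache \
  --disable-module=userdir --disable-module=autoindex
make
make install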

Apache defines several directives that are directly related to performance, controlling the operation of the server at the process and protocol levels. In addition, many aspects of Apache’s configuration that are not directly performance related can have either a positive or negative effect on performance, depending on how we define them. Being aware of these is an important part of configuring Apache if performance is a concern.

Rather than improving the performance of Apache itself, we can also use Apache to improve the performance of other web servers (which can themselves be Apache servers, of course) by setting it up as an intermediate proxy, also known as a reverse proxy.

Eventually, a point will be reached when no amount of performance tuning is going to make much difference. When this point is reached, there are two possible solutions: migrate to more powerful hardware or, more interestingly, add more low-power servers and create a cluster. Apache’s proxying capabilities combined with a little ingenuity in rewriting URLs give us one excellent way of creating such a cluster with very little effort.

In this chapter we will look at:

  • Using Apache’s core performance directives
  • Configuring Apache for better performance
  • Setting up Apache as a web proxy
  • Enabling caching on Apache proxy servers
  • Clustering web servers for reliability and performance

Apache’s Performance Directives

Aside from general configuration issues that affect Apache’s performance, which we discuss in the next section, Apache has a number of directives for tuning the server’s performance in different circumstances. These fall into two main groups:

Process-level directives that control the number of Apache processes (or threads, on Windows) that Apache starts and maintains as a pool to handle incoming requests.

Protocol-level directives that control how Apache manages the connection with clients and how long it will wait for activity before deciding to close a connection itself.

In turn, process-level directives divide into two groups, depending on the platform that Apache is running on. All but one are effective only on UNIX systems; the remaining one is only effective on Windows platforms.

Controlling Apache Processes on UNIX

Apache runs on UNIX platforms as a pre-forking server. This means that on startup it creates a pool of child processes ready to handle incoming client requests. As requests are processed, Apache tries to make sure that there are at least a few spare servers running for subsequent requests. Apache provides three directives to control the pool:

StartServers <number> (default 5)

This determines the number of child processes Apache will create on startup. However, since Apache controls the number of processes dynamically depending on the server activity, this does not have very much effect, and there is not much to gain by varying it. In particular, if MinSpareServers is higher, it will cause Apache to spawn additional processes immediately.

MinSpareServers <number> (default 5)

This sets the minimum number of Apache processes that must be available at any one time; if processes become busy with client requests, Apache will start up new processes to keep the pool of available servers at the minimum value. Because of Apache’s algorithm for starting servers on demand, raising this value is mostly only meaningful for handling large numbers of simultaneous requests rapidly; for sites with millions of hits per day, the following is appropriate:


MinSpareServers 32

MaxSpareServers <number> (default 10)

This sets the maximum number of Apache processes that can be idle at one time; if many processes are started to handle a peak in demand and then the demand tails off, this directive will ensure that excessive numbers of processes will not remain running. This value should be equal to or higher than MinSpareServers to be meaningful. Sites with a million or more hits per day can use the following as a reasonable value:


MaxSpareServers 64

These directives used to be a lot more significant than they are now. Since version 1.3, Apache has had a very responsive algorithm for handling incoming requests, starting from 1 and rising to a maximum of 32 new processes each second until all client requests are satisfied. The objective is to avoid the performance cost of starting excessive numbers of processes all at once unless it is actually necessary. The server starts with one, then doubles the number of new processes started each second, so only if Apache is genuinely experiencing a sharp rise in demand will it start multiple new processes.

The consequence of this strategy is that Apache’s dynamic handling of the server pool is actually quite capable of handling large swings in demand. Adjusting these directives has little actual effect on Apache’s performance on anything other than extremely busy sites, and it is usually satisfactory to stay with the default of:


StartServers 5
MinSpareServers 5
MaxSpareServers 10

Apache has another two directives related to the control of processes:

MaxClients <number> (default 256)

Irrespective of how busy Apache gets, it will never create more processes than the limit set by MaxClients, either to maintain the pool of spare servers or to handle actual requests. Clients that try to connect when all processes are busy will get Server Unavailable error messages. For this reason, the value of MaxClients should not be set too low; for example:


MaxClients 100

Setting MaxClients lower helps to increase performance of the client requests that succeed, at the cost of causing some client requests to fail. It is therefore a double-edged tool; needing to use it suggests that the server either needs to be tuned for performance elsewhere, upgraded, or clustered.

The maximum number of clients is set to 256 by default, which is the internal limit built into the Apache server binary. To override this limit requires two things: first, ensuring that the platform will allow the process to spawn more than 256 processes, then rebuilding Apache with HARD_SERVER_LIMIT set. The simplest way to do this is to set CFLAGS in the environment before running configure, as explained in Chapter 3. However, we can also edit src/Configuration and add the definition to EXTRA_CFLAGS by hand.
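
As a sketch, assuming we want to raise the limit to 512, the environment approach looks like this:

CFLAGS="-DHARD_SERVER_LIMIT=512" ./configure --prefix=/usr/local/apache
make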

Note that as well as determining the maximum number of processes, MaxClients also determines the size of the scoreboard file required on some platforms (see Chapter 2 for details on the scoreboard file). Apache loads this into memory, so a large value causes Apache to use a little more memory, even if the limit is not reached.

MaxRequestsPerChild <number> (default 0)

This limits the maximum number of requests a given Apache process will handle before voluntarily terminating. The object of this is to prevent memory leaks from causing Apache to consume increasing quantities of memory; while Apache is well behaved in this respect, the underlying platform might not be. Normally this is set to zero, meaning that processes will never terminate themselves:


MaxRequestsPerChild 0

This is the best value to choose if we are confident that there are no, or at least no significant, memory leaks to cause problems. (Tools such as ps, top and vmstat are useful in monitoring memory usage and spotting possible leaks).

A low value for this directive will cause performance problems as Apache will be frequently terminating and restarting processes. A more reasonable value for platforms that have memory leak problems is 1000 or 10000:


MaxRequestsPerChild 10000

If Apache is already running enough servers according to the MinSpareServers directive, this also helps to thin out the number of processes running after Apache has been through a busy period. Otherwise, Apache will start a new process to replace the one that just terminated each time a process reaches its maximum request threshold.

Ultimately the UNIX version of Apache will also run in a multi-threaded mode, at which point the ThreadsPerChild directive below will also be significant.

Controlling Apache Processes on Windows

On Windows platforms, Apache does not fork; consequently, the directives for controlling the number of processes or their lifetime have no effect. Instead, Apache runs as a multi-threaded process, which is theoretically more efficient, although the relative immaturity of the threaded implementation on Windows means that Apache is not as stable there.

The number of simultaneous connections a Windows Apache server is capable of is configured with the ThreadsPerChild directive, which is analogous to both the StartServers and MaxClients directives for UNIX and defaults to 50:


ThreadsPerChild 50

Since there is only one child, this limits the number of connections to the server as a whole. For a busy site, we can raise this to a maximum of 1024, which is the limit built in to the server:


ThreadsPerChild 1024

It is possible to raise this limit higher by adjusting HARD_SERVER_LIMIT as described for the MaxClients directive above.

Protocol-Related Performance Directives

In addition to controlling the server pool, Apache also provides some directives to control performance-related issues at the TCP/IP and HTTP protocol levels. Since these are platform independent, they work regardless of the platform Apache is running on:

SendBufferSize <bytes>

This directive determines the size of the output buffer used in TCP/IP connections and is primarily useful for queuing data for connections where the latency (that is, the time it takes for a packet to get to the remote end and for the acknowledgement message to come back) is high. For example, 32 kilobyte buffers can be created with:


SendBufferSize 32768

Each TCP/IP buffer created by Apache will be sized to this value, one per client connection, so a large value has a significant effect on memory consumption, especially for busy sites.

KeepAlive <on|off>

Persistent connections were first introduced by Netscape in the HTTP/1.0 era. HTTP/1.1 developed this idea further and used a different mechanism. Both approaches are enabled by the KeepAlive directive, which allows a client to make multiple sequential HTTP requests on the same connection, if the client indicates that it is capable of doing so and would like to. The default behavior is to enable persistent connections, equivalent to:


KeepAlive on

There are few reasons to disable this; if a client is not capable of persistent connections, it will generally not ask for them. The exception is some Netscape 2 browsers that claim to be able to handle persistent connections but in fact have a bug that prevents them from detecting when Apache drops its end of the connection. For this reason, Apache ships with a default configuration that contains BrowserMatch directives to set special variables to disable persistent connections in some cases – see “Apache’s Environment” in Chapter 4 for more details.
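
The relevant lines in the stock configuration look similar to the following (check the distributed httpd.conf for the exact set):

BrowserMatch "Mozilla/2" nokeepalive
BrowserMatch "MSIE 4\.0b2;" nokeepalive downgrade-1.0 force-response-1.0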

KeepAlive allows a much more rapid dialog between a client and the server to take place, at the cost of preventing the attached server process from handling any other requests until the client disconnects. To deal with this issue, Apache provides two additional directives to handle the lifetime of a persistent connection:

KeepAliveTimeout <seconds>

This directive specifies the amount of time an Apache process (or thread, under Windows) will wait for a client to issue another HTTP request before closing the connection and returning to general service. This should be a relatively short value, and the default is 15 seconds, equivalent to:


KeepAliveTimeout 15

This value should be a little larger than the maximum time we expect the server to spend generating and sending a response – very short for static pages, longer if the site’s main purpose is dynamically generated information – plus a few seconds for the client to react. It does not pay to make this value too large. If a client does not respond in time, it must make a new connection, but it is otherwise unaffected and the server process is freed for general use in the meantime.

MaxKeepAliveRequests <number>

Regardless of the time-out value, persistent connections will also automatically terminate when the number of requests specified by MaxKeepAliveRequests is reached. In order to maintain server performance, this value should be kept high, and the default is accordingly 100:


MaxKeepAliveRequests 100

Setting this value to zero will cause persistent connections to remain active forever, so long as the time-out period is not exceeded and the client does not disconnect. This is a little risky since it makes the server vulnerable to denial-of-service attacks, so a high but finite value is preferable.
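
Putting the persistent connection directives together, a combined configuration might look like this (the values are illustrative, not recommendations):

KeepAlive on
KeepAliveTimeout 10
MaxKeepAliveRequests 200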

TimeOut

This is a catchall directive that determines how long Apache will allow an HTTP connection to remain open when it appears to be inactive, as determined by the following criteria:

  • The time between a connection being established and a GET request being received. This does not affect persistent connections, for which KeepAliveTimeout is used instead.

  • The time since the last packet of data was received on a POST or PUT HTTP request.

  • The time since the last ACK (acknowledgement) response was received if the server is waiting for more.

Since these three values are rather different in nature, it is expected that they will at some point in the future become separate directives. For now they are all handled by the one value set by TimeOut. The default value for TimeOut is 5 minutes, which is equivalent to:


TimeOut 300

This is far more than should ever be necessary and is set this way because the timer is not guaranteed to be reset for every kind of activity, specifically some packet-level triggers, due to legacy code. If we’re willing to accept the possible occasional disconnection, we can set this to a much lower value:


TimeOut 60

If the value is set too low, requests that genuinely take a long time to process may get disconnected. File uploads performed with POST or PUT can also be detrimentally affected by a low time-out value if we expect to upload large files across links that can suffer performance problems at peak periods (such as transatlantic connections).

ListenBacklog

Connection requests from clients collect in a queue until an Apache process becomes free to service them. The maximum length of the queue is controlled with the ListenBacklog directive, which has a default value of 511. If we wanted to change it, we could use something like:


ListenBacklog 1023

There is rarely any need to alter this value, however. If the queue is filling up because Apache is failing to process requests fast enough, performance improvements elsewhere are more beneficial than allowing more clients to queue. In addition, many operating systems will reduce this value to a system limit.

HTTP Limit Directives

In addition to the protocol-related directives mentioned above, Apache supplies four directives to limit the size of the HTTP requests made by clients. These principally prevent clients from abusing the server’s resources and causing denial-of-service problems and are therefore also relevant to server security. The directives are:

LimitRequestBody

This limits the size of the body of an HTTP request (as sent with a PUT or POST method). The default value is zero, which translates to unlimited. The maximum value is 2147483647, or 2 gigabytes. If a client sends a body in excess of the body limit, the server responds with an error rather than servicing the request.

We can use this value to prevent abnormally large posts from clients by limiting the body size to a reasonable value. For example, suppose we have a script that accepts input from an HTML form via POST. If we know the maximum size of the submission from the filled-out form is guaranteed to be less than 10 kilobytes, we could say:


LimitRequestBody 10240

This presumes that we don’t have any other scripts on the server that might validly receive a larger HTTP request body, of course.

LimitRequestFields

This limits the number of additional headers that can be sent by a client in an HTTP request, and defaults to 100. In real life, the number of headers a client might reasonably be expected to send is around 20, although this value can creep up if content negotiation is being used. A large number of headers may be an indication of a client making abnormal or hostile requests of the server. A lower limit of 50 headers can be set with:


LimitRequestFields 50

LimitRequestFieldSize

This limits the maximum length of an individual HTTP header sent by the client, including the initial header name. The default (and maximum) value is 8190 characters. We can set this to limit headers to a maximum length of 100 characters with:


LimitRequestFieldSize 100

LimitRequestLine

This limits the maximum length of the HTTP request itself, including the HTTP method, URL, and protocol. The default limit is 8190 characters; we can reduce this to 500 characters with:


LimitRequestLine 500

In effect, this directive limits the size of the URL that a client can request, so it must be set large enough for clients to access all the valid URLs on the server, including the query string sent by GET requests. Setting this value too low can prevent clients from sending the results of HTML forms to the server when the form method is set to GET.
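
As an illustrative sketch, a server that handles only small forms and short URLs might tighten all four limits at once (the values are assumptions to adapt, not recommendations):

LimitRequestBody 102400
LimitRequestFields 50
LimitRequestFieldSize 1000
LimitRequestLine 500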

©1999 Wrox Press Limited, US and UK.

Configuring Apache for Better Performance

Many aspects of Apache’s general configuration can have important performance implications if set without regard to their processing cost.

Directives That Affect Performance

There are a large number of directives that can have a beneficial or adverse effect on performance, depending on how they are used. Some of these are obvious; others rather less so:

DNS and Host Name Lookups

Any use of DNS significantly affects Apache's performance. In particular, use of the following two directives should be avoided if possible:

HostNameLookups on/off/double

This allows Apache to log information based on the host name rather than the IP address, but it is very time consuming, even though Apache caches DNS results for performance. Log analyzers like Analog, discussed in Chapter 9, do their own DNS resolution when it comes to generating statistics from the log at a later point, so there is little to be gained from forcing the running server to do it on the fly.
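
Unless host names are genuinely needed at request time, we can leave lookups disabled (the default in Apache 1.3):

HostNameLookups off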

UseCanonicalName on/off/dns

With the dns setting, this causes Apache to deduce the name of a server from its IP address, rather than generate it from the ServerName and Port directives (UseCanonicalName on) or simply accept the client-supplied value (UseCanonicalName off). This can be occasionally useful for things like mass virtual hosting with mod_vhost_alias. Because it only caches the names of hosts being served by Apache rather than the whole Internet, it is less demanding than HostNameLookups, but even so, if it is avoidable, avoid it.

In addition, any use of a host name, whole or partial, may cause DNS lookups to take place, either from name to IP address or IP address to name. This affects the allow and deny directives in mod_access, ProxyBlock, NoProxy and NoCache in mod_proxy, and so on.

Following Symbolic Links and Permission Checking

Apache can be told to follow or refuse to follow symbolic links with the FollowSymLinks option. Unless enabled, each time Apache retrieves a file or runs a CGI script, it must spend extra time checking the entire path, from the root directory down, to see if any parent directories (or the file itself) are symbolic links.

Alternatively, if symbolic links are enabled with SymLinksIfOwnerMatch, Apache will follow links, but only if the ownership of the link is the same as that of the server (or virtual host, in the case of suEXEC). This also causes Apache to check the entire path for symbolic links and, in addition, check that the ownership of the link is valid.

For maximum performance, always specify FollowSymLinks and never SymLinksIfOwnerMatch:


Options FollowSymLinks

However, these options exist to improve security, and this strategy is the most permissive, which may be unpalatable to administrators more worried about security than performance.

Caching Dynamic Content

Normally, Apache will not send information to proxies telling them to cache documents if they have been generated dynamically. The burden on the server can therefore be considerably reduced by using mod_expires to force an expiration time onto documents, even if it is very short:


ExpiresActive on
ExpiresByType text/html A600

This configuration would be suitable for a server that updates an information page, like a stock market price page, every ten minutes – even if the page expires in a time as short as this, if the page is frequently accessed, we save ourselves a lot of hits if clients can get the page from a proxy instead.

Even so, some proxies will not accept documents they think are generated dynamically, requiring us to fool them by disguising CGI scripts as ordinary HTML:


RewriteEngine on
RewriteRule ^(/pathtocgi/.*)\.html$ $1.cgi [T=application/x-httpd-cgi]

Caching Negotiated Content

HTTP/1.1 clients already have sufficient information to know how and when to cache documents delivered by content negotiation. HTTP/1.0 proxies however do not, so to make them cache negotiated documents we can use:


CacheNegotiatedDocs

This can have unexpected side effects if we are a multilingual site, however, since clients may get the wrong page. It should therefore be used with caution, if at all. The number of HTTP/1.0 clients affected is small and decreasing, so this can usually be ignored.

A different aspect of content negotiation arises when the configured directory index is specified without a suffix (thereby causing content negotiation to be performed on it). Since index files are very common URLs for clients to retrieve, it is always better to specify a list, even if most of the files don't exist, than to have Apache generate an on-the-fly map with MultiViews. For example, don't put:


DirectoryIndex index

Instead put something like:


DirectoryIndex index.html index.htm index.shtml index.cgi

Logging

One of the biggest users of disk and CPU time is logging. It therefore pays not to log information that we don't care about or, if we really want to squeeze performance from the server, not to log at all. It is inadvisable to do without an error log, but we can disable the access log by simply not defining one. Otherwise, we can minimize the level of logging with the LogLevel directive:


LogLevel error

An alternative approach is to put the log on a different server, either by NFS mounting the logging directory onto the web server or, preferably, using the system log daemon to do it for us. NFS is not well known for its performance, and it introduces security risks by making other servers potentially visible to users on the web server.
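
Apache can send its error log to the system log daemon directly; a minimal example, with the facility name being our own choice:

ErrorLog syslog:local7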

Session Tracking

Any kind of session tracking is time consuming, first because Apache is responsible for checking for cookies and/or URL elements and setting them if missing, and second, because for tracking to be useful, it has to be logged somewhere, creating additional work for Apache. The bottom line is not to use modules like mod_usertrack or mod_session unless absolutely necessary, and even then to use Directory, Location, or Files directives to limit their scope, as shown below.
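
For example, assuming mod_usertrack is installed and only one part of the site actually needs tracking (the path here is purely illustrative):

<Directory /home/www/alpha-complex/shop>
CookieTracking on
</Directory>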

.htaccess Files

If AllowOverride is set to anything other than None, Apache will check for directives in .htaccess files for each directory from the root all the way down to the directory in which the requested resource resides, after aliasing has been taken into account. This can be extremely time consuming since Apache does this check every time a URL is requested, so unless absolutely needed, always put:


# AllowOverride is directory scope only, so we use the root directory
<Directory />
AllowOverride None
</Directory>

This also has the side effect of making the server more secure. Even if we do wish to allow overrides in particular places, this is a good directive to have in the server-level configuration to prevent Apache searching all the directories from the root down. By enabling overrides only in the directories that are needed, Apache will only search a small part of the pathname, rather than the whole chain of directories.
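
For example, to permit only authentication-related overrides in one subtree (the path is illustrative):

<Directory /home/www/alpha-complex/users>
AllowOverride AuthConfig
</Directory>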

Extended Status

mod_status allows an extended status page to be generated if the ExtendedStatus directive is set to on. However, this causes Apache to make two calls to the operating system for time information on each and every client request. Time calls are one of the most expensive system calls on any platform, so this can cause significant performance loss, especially as the directive is only allowed at the server level and not on a per-virtual host basis. The solution is simply not to enable ExtendedStatus.

Rewriting URLs

Any use of mod_rewrite's URL rewriting capabilities can cause significant performance loss, especially for complex rewriting strategies. The RewriteEngine directive can be specified on a per-directory or per-virtual host basis, so it is worth enabling and disabling mod_rewrite selectively if the rules are complex and needed only in some cases.
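
Where rewriting is needed in only one area, we can enable it just there; a minimal sketch with an invented rule and path:

<Directory /home/www/alpha-complex/legacy>
RewriteEngine on
RewriteRule ^old\.html$ new.html [R]
</Directory>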

In addition, certain rules can cause additional performance problems by making internal HTTP requests to the server. Pay special attention to the NS flag, and be wary of using the -F and especially -U conditional tests.

Large Configuration Files

Lastly, the mere fact of a configuration file being large can cause Apache to respond more sluggishly. Modules like mod_rewrite can benefit performance by reducing the number of lines needed to achieve a desired effect. The mod_vhost_alias module is also particularly useful for servers that need to host large numbers of virtual hosts.

Performance Tuning CGI

Any script or application intended for use as a CGI script should already be written with performance in mind; this means not consuming excessive quantities of memory or CPU time, generating the output as rapidly as possible, and caching if at all possible the results, so they can be returned faster if the conditions allow it.

In addition, Apache defines three directives for controlling what CGI scripts are allowed to get away with:

  • RLimitCPU – controls how much CPU time is allowed
  • RLimitMEM – controls how much memory can be allocated
  • RLimitNPROC – controls how many CGI instances can run simultaneously
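
As a sketch, all three take a soft limit and an optional hard limit – CPU seconds, bytes of memory, and a process count respectively (the numbers below are assumptions, not recommendations):

RLimitCPU 60 120
RLimitMEM 16777216 33554432
RLimitNPROC 20 30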

All these directives are described in more detail in Chapter 6. A better approach is to write dynamic content applications in a more efficient way to take better advantage of Apache. The most obvious option is FastCGI, also covered in Chapter 6. Perl programmers will also want to check out mod_perl in Chapter 11.

Additional Directives for Tuning Performance

Although not part of the standard Apache executable, there are several modules, both included with Apache and third-party, designed to improve server performance in various ways:

MMapFile

MMapFile is supplied by mod_mmap_static, an experimental UNIX-specific module supplied as standard with Apache but not compiled or enabled by default. When active, it allows nominated files to be memory mapped, if the UNIX platform supports it. Memory-mapped files are kept in memory permanently, allowing Apache to deliver them to clients rapidly, without retrieving them from the disk first. For example, to map the index page and a banner logo so they are stored in memory, we might put:


MMapFile /home/www/alpha-complex/index.html /home/www/alpha-complex/banner.gif

This will only work for files that are static and present on the filing system – dynamically generated content and CGI scripts will not work with MMapFile. To cache CGI scripts, use the FastCGI module or mod_perl and Apache::Registry (for Perl scripts).

The MMapFile directive is not flexible in its syntax and does not allow wildcards. There is also no MMapDirectory equivalent to map groups of files at once.

It is important to realize that once a file is mapped, it will never be retrieved from disk again, even if it changes. Apache must be restarted (preferably gracefully with apachectl graceful) for changed files to be remapped into memory.

mod_bandwidth

mod_bandwidth is available from the contributed modules archive on any Apache mirror, and in addition, a current version can be found at http://www.cohprog.com/. It provides Apache with the ability to limit the amount of data sent out per second based on the domain or IP address of the remote client or, alternatively, the size of the file requested.

Bandwidth limits may also be used to divide available bandwidth according to the number of clients connecting, allowing a service to be maintained to all clients even if there is theoretically insufficient bandwidth for them.

As it is a contributed module, mod_bandwidth is not enabled by default and needs to be added to Apache in the usual way – by either rebuilding the server or building and installing it as a dynamic module with the apxs utility. Once installed, bandwidth limiting can be enabled with:


BandWidthModule on

Bandwidth limits are configured to work on a per-directory basis, allowing a server to customize different parts of a web site with different bandwidth restrictions. For example, we can limit bandwidth usage on the non-secure part of a web site, ensuring that traffic to our secure online ordering system always has bandwidth available to it.

Limiting Bandwidth Based on the Client

Once enabled, bandwidth limitations may be set with:


<Directory />
BandWidth localhost 0
BandWidth friendly.com 4096
BandWidth 192.168 512
BandWidth all 2048
</Directory>

This tells Apache not to limit local requests (potentially from CGI scripts) by setting a value of 0, to allow the favored friendly.com domain 4k per second, to limit internal network clients to 512 bytes per second, and to allow all other hosts 2k per second with the special all keyword. The order is important, as the first matching directive will be used: if the friendly.com domain resolved to the network address 192.168.30.0, it would be overridden by the directive for 192.168 if it were placed after it. Similarly, if a client from 192.168.0.0 happened to be in the friendly.com domain, it would get 4k access.

Limiting Bandwidth Based on File Size

Bandwidth limits can also be set on file size with the LargeFileLimit directive, allowing large files to be sent out more gradually than small ones. This can be invaluable when large file transfers are being carried out on the same server as ordinary static page requests. If a LargeFileLimit and BandWidth directive apply to the same URL then the smaller of the two is selected.

LargeFileLimit takes two parameters, a file size in kilobytes and a transfer rate. Several directives can be cascaded to produce a graded limit; for example:


<Directory /home/www/alpha-complex>
LargeFileLimit 50 8192
LargeFileLimit 1024 4096
LargeFileLimit 2048 2048
</Directory>

This tells Apache not to limit files smaller than 50kb, generally corresponding to HTML pages and small images, to limit files from 50kb up to 1Mb to 8kb per second, and to limit files between 1Mb and 2Mb to 4kb per second. Files larger than 2Mb are limited to 2kb per second. As with the BandWidth directive, order is important – the first directive that has a file size greater than the file requested will be used, so directives must be given in smallest-to-largest order to work.

If more than one client is connected at the same time, mod_bandwidth also treats the bandwidth limits as proportional values and divides the available bandwidth between clients based on their limit values; if ten clients all connect under a total bandwidth limit of 4096 bytes per second, each client gets roughly 410 bytes per second allocated to it.

Minimum Bandwidth and Dividing Bandwidth Between Clients

Bandwidth is normally shared between clients by mod_bandwidth, based on their individual bandwidth settings. So, if two clients both have bandwidth limits of 4k per second, mod_bandwidth divides it between them, giving each client 2k per second. However, the allocated bandwidth will never drop below the minimum bandwidth set by MinBandWidth, which defaults to 256 bytes per second:


MinBandWidth all 256

MinBandWidth takes a domain name or IP address as a first parameter with the same meaning as BandWidth. Just as with BandWidth, it is also applied in order with the first matching directive being used:


<Directory />
BandWidth localhost 0
BandWidth friendly.com 4096
MinBandWidth friendly.com 2096
BandWidth 192.168 512
BandWidth all 2048
MinBandWidth all 512
</Directory>

Bandwidth allocation can also be disabled entirely, using a special rate of -1. This causes the limits defined by BandWidth and LargeFileLimit to be taken literally, rather than relative values to be applied in proportion when multiple clients connect. To disable all allocation specify:


MinBandWidth all -1

In this case, if ten clients all connect with a limit of 4096 bytes per second, mod_bandwidth will allow 4096 bytes per second for all clients, rather than dividing the bandwidth between them.

Transmission Algorithm

mod_bandwidth can transmit data to clients based on two different algorithms. Normally it parcels data into packets of 1kb and sends them as often as the allowed bandwidth permits: if the bandwidth available after allocation is only 512 bytes per second, a 1kb packet is sent out approximately every two seconds.

The alternative mode is set with the directive BandWidthPulse, which takes a value in microseconds as a parameter. When this is enabled, mod_bandwidth sends a packet after each interval, irrespective of the size. For example, to set a pulse rate of one second, we would put:


BandWidthPulse 1000000

This means that for a client whose allocated bandwidth is 512 bytes per second, a 512-byte packet is sent out once per second. The advantage of this is smoother communication, especially when the load becomes very high and the gap between packets gets large. The disadvantage is that the proportion of bandwidth dedicated to network overhead, as opposed to actual data transmission, increases.


Proxying

In addition to its normal duties, Apache is also capable of operating as a proxy, either exclusively or combined with serving normal web sites from the local server.

Proxies are intermediate servers that stand between a client and a remote server and make requests to the remote server on behalf of the client. The objective is twofold: first, a caching proxy can keep a copy of a suitable document so that the next time a client asks for it, the proxy can deliver it from the cache without contacting the remote server; second, a proxy allows clients and servers to be logically isolated from each other, so security can be placed between them to ensure no unauthorized transactions take place.

In this section, we concentrate on Apache’s proxy-related features, before going on to discuss caching and more developed examples in the next section.

Installing and Enabling Proxy Services

Apache’s proxy functionality is encapsulated in mod_proxy, an optional module supplied as standard with Apache. This primarily implements HTTP/1.0-style proxying but has recently gained some HTTP/1.1 features such as support for Via headers. To enable it, either recompile the server statically or compile it as a dynamic module and include it into the server configuration as described in Chapter 3. Note that the dynamic proxy module is called libproxy.so not mod_proxy.so.

Once installed, proxy operation is simply enabled by specifying the ProxyRequests directive:


ProxyRequests on

We can also switch off proxy services again with:


ProxyRequests off

This directive can go only in the server-level configuration or, more commonly, in a virtual host. However, we can configure proxy behavior based on the requested URL using a Directory tag.

Normal Proxy Operation

Proxies on a network can work in two directions, forward and reverse, and may also operate in both modes at once.

A forward proxy relays requests from clients on the local network and caches pages from other sites on the Internet, reducing the amount of data transferred on external links; this is a popular application for companies that need to make efficient use of their external bandwidth.

A reverse proxy relays requests from clients outside the local network and caches pages from the local web sites, reducing the load on the servers.

When a client is configured to use a proxy to fetch a remote HTTP or FTP URL, it contacts the proxy, giving the complete URL, including the protocol and remote domain name. The proxy server then checks to see if it is allowed to relay this request and, if so, fetches the remote URL on behalf of the client and returns it. If the proxy is set up to cache documents and the document is cacheable, it also stores it for future requests.

A proxy server with dual network interfaces makes a very effective firewall; external clients connect to one port, and internal clients to the other. The proxy relays requests in and out according to its configuration and deals with all connection requests. Since there are no direct connections between the internal network and the rest of the world, security is much improved.

Configuring Apache as a Proxy

In order for Apache to function as a proxy, the only required directive is ProxyRequests, which enables Apache for both forward and reverse proxying – it makes no distinction as to whether the client or remote server is internal or external, since Apache is not aware of the network topology.

Once proxying is enabled, requests for URLs that the Apache server is locally responsible for are served as normal, but requests for URLs on hosts that do not match any of the hosts that are running on that server cause Apache to attempt to retrieve the URL itself as a client and pass the response back to the client.

Rather bizarrely, we can test that a proxy server is working by having it proxy a web site served by the same Apache server. Because the server will serve its own content directly, we have to put the proxy on a different port number – say 8080:


Port 80
Listen 80
Listen 8080

User httpd
Group httpd

# dynamic servers load modules here…

ServerName www.alpha-complex.com
ServerAdmin webmaster@alpha-complex.com
DocumentRoot /home/www/alpha-complex
ErrorLog logs/main_error
TransferLog logs/main_log

<VirtualHost 204.148.170.3:8080>
ServerName proxy.alpha-complex.com
ProxyRequests On
ErrorLog logs/proxy_error
TransferLog logs/proxy_log
</VirtualHost>

If we test this configuration without telling the client to use the proxy and ask for http://www.alpha-complex.com/, we get the standard home page as expected and a line in the access log main_log that looks like this:


127.0.0.1 - - [27/Aug/1999:17:09:30 +0100] "GET http://www.alpha-complex.com/" 200 1030

If we now configure the client to use www.alpha-complex.com, port 8080, as a proxy server, we get a corresponding line in main_log:


127.0.0.1 - - [27/Aug/1999:17:50:21 +0100] "GET / HTTP/1.0" 200 103

followed almost immediately by a line in the proxy log:


127.0.0.1 - - [27/Aug/1999:17:50:21 +0100] "GET http://www.alpha-complex.com:8080/" 200 103

What has happened here is that the proxy has received the request on port 8080, stripped out the domain name, and issued a forward HTTP request to that domain on port 80, the default port for HTTP requests. The main server gets this request and responds to it, returning the index page to the proxy, which then returns it to the client.

From this it might appear that enabling proxy functionality in a virtual host overrides the default behavior, which would be to serve the page directly, since the virtual host inherits the DocumentRoot directive from the main server. If the ProxyRequests directive were not present, this is what we would expect to happen. However, the truth is a little more involved. If we ask for the URL http://www.alpha-complex.com:8080/, we get the index page, served directly by the virtual host, without the proxy. If we look in the proxy_log file we see:


127.0.0.1 - - [27/Aug/1999:17:50:21 +0100] "GET http://www.alpha-complex.com:8080/" 200 103

But no corresponding line appears in main_log, indicating that the proxy server actually served the page directly. Why is this? It becomes clear if we remember how Apache matches URLs to virtual hosts. The virtual host inherits the settings of the main server, so the actual configuration of the proxy looks like this:


<VirtualHost 204.148.170.3:8080>
Port 80
User httpd
Group httpd
ServerAdmin webmaster@alpha-complex.com
DocumentRoot /home/www/alpha-complex
ServerName proxy.alpha-complex.com
ProxyRequests On
ErrorLog logs/proxy_error
TransferLog logs/proxy_log
</VirtualHost>

The Listen directives are not inherited, since they are not valid in containers. The User and Group directives are only inherited if suEXEC is in use. Otherwise, they have no effect.

When we configured our client to use the proxy and asked for the URL without a port number, the virtual host received the request but was unable to satisfy it, because the default http port is 80, not 8080. It therefore could not satisfy the request itself and had to use the proxy functionality to make a request for http://www.alpha-complex.com on port 80. This request is picked up by the server but no longer matches the virtual host on port 8080, and so is received by the main server, which satisfies the request. The response is then sent out by Apache in the guise of the main server, back to itself in the guise of the virtual host, which then returns the page to the client.

However, when we asked for the index page on port 8080, the virtual host could satisfy that request because it can receive requests made for port 8080. It has a valid DocumentRoot directive, so it serves the page directly to the client without forwarding the request itself.

Note that if we put a ProxyRequests on directive into the server-level configuration, every virtual host becomes a proxy server and will happily serve proxy requests for any URL it can’t satisfy itself. This is interesting, but not necessarily useful behavior. To make a proxy available only when and how we want it, we can customize the scope and operation of the proxy with both Directory and VirtualHost containers.

URL Matching with Directory Containers

As mentioned previously, when a client is configured to use a server as a proxy, it sends the server a URL request including the protocol and domain name (or IP address) of the document it desires.

Apache defines a special variant of the Directory container to allow proxy servers to be configured conditionally based on the URL using the prefix proxy: in the directory specification. Just as with normal Directory containers, the actual URL can be wildcarded, so the simplest container can match all possible proxy requests with:


<Directory proxy:*>
… directives for proxy requests only …
</Directory>

With this directive present, ordinary URL requests will be served by the main site, whereas proxy requests will be served according to the configuration inside the Directory container. This allows us to insert host or user authentication schemes that only apply when the server is used as a proxy, as opposed to a normal web server.
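
For example, to demand user authentication only for proxy requests (the user file location is illustrative):

<Directory proxy:*>
AuthType Basic
AuthName "Proxy Access"
AuthUserFile /usr/local/apache/conf/proxy.users
require valid-user
</Directory>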

We can also be more specific. The proxy module by default proxies HTTP, FTP, and Secure HTTP (SSL) connections, which correspond to the protocol identifiers http:, ftp:, and https:. We can therefore define protocol specific directory containers on the lines of:


<Directory proxy:http:*>
… proxy directives for http …
</Directory>

<Directory proxy:ftp:*>
… proxy directives for ftp …
</Directory>

We can extend the URL in the container as far as we like to match specific hosts or wildcarded URLs:


<Directory proxy:*/www.alpha-complex.com/*>
… proxy directives for www.alpha-complex.com …
</Directory>

When a client makes a request by any protocol to www.alpha-complex.com, the directives in this container are applied to the request; we can put proxy cache directives here, allow and deny directives to control access, and so on. Here’s a complete virtual host definition with host-based access control:


<VirtualHost 204.148.170.3:8080>
ServerName proxy.alpha-complex.com
ErrorLog logs/proxy_error
TransferLog logs/proxy_log
ProxyRequests on
CacheRoot /usr/local/apache/cache

# limit use of this proxy to hosts on the local network
<Directory proxy:*>
order deny,allow
deny from all
allow from 204.148.170
</Directory>
</VirtualHost>

We’ve added a CacheRoot directive to implement a cache. We’d normally want to specify a few more directives than this, as we will see in the next section, but this will work. We’ve also added a Directory container allowing the use of this proxy by hosts on the local network only; this makes the proxy available for forward proxying but barred from performing reverse proxying – external sites cannot use it as a proxy for www.alpha-complex.com.

Blocking Sites via the Proxy

It is frequently desirable to prevent a proxy from relaying requests to certain remote servers; this is especially true for proxies that are primarily designed to cache pages for rapid access. We can block access to sites with the ProxyBlock directive; for example:


ProxyBlock www.badsite.com baddomain.dom badword

This directive causes the proxy to refuse to retrieve URLs from hosts with names that contain any of these text elements. In addition, when Apache starts it tries out each parameter in the list with DNS to see if it resolves to an IP address; if so, the IP address is also blocked.

Note this is not the directive to use to counter the effects of a ProxyRemote directive, so a server will satisfy requests to hosts it serves itself rather than forward them to the remote proxy – for that, use NoProxy.

Localizing Remote URLs and Hiding Servers from View

Rather than simply passing on URLs for destinations that are not resolvable locally, a server can also map the contents of a remote site into a local URL using the ProxyPass directive. Unlike all the other directives of mod_proxy, this works even for hosts that are not proxy servers and does not require that ProxyRequests has been set to on.

For example, suppose we had three internal servers www.alpha-complex.com, users.alpha-complex.com, and secure.alpha-complex.com. Instead of allowing access to all three, we could map the users and secure web sites so they appear to be part of the main web site by adding these two directives to the configuration for www.alpha-complex.com:


ProxyPass /users/ http://users.alpha-complex.com/
ProxyPass /secure/ http://secure.alpha-complex.com/secure-part/

As mentioned above, we don’t need to specify ProxyRequests on for this to work.

We can also create what looks like a real web site, but is in fact just a proxy by mapping the URL /. This allows us to hide a real web site behind a proxy firewall without external users being aware of any unusual activity:


ProxyPass / http://realwww.intranet.alpha-complex.com

In order for this subterfuge to work, we also have to take care of redirections that the internal server realwww.intranet.alpha-complex.com might send in response to the client request.

Without intervention, this may pass the real name of the internal server to the client, causing the proxy to be bypassed or the request to simply fail in the case of a firewall. Fortunately, we can use ProxyPassReverse, which rewrites the Location: header of a redirection received from the internal host so it matches the proxy rather than the internal server. The rewritten response then goes to the client, which is none the wiser.

ProxyPassReverse takes exactly the same arguments as the ProxyPass directive it parallels:


ProxyPass / http://realwww.intranet.alpha-complex.com
ProxyPassReverse / http://realwww.intranet.alpha-complex.com

In general, wherever we put a ProxyPass directive, we probably want to put a ProxyPassReverse directive, too.

This feature is intended primarily for reverse proxies where external clients are asking for documents on local servers. It is unlikely to be useful for forward proxying scenarios.

Redirecting Requests to a Remote Proxy

Rather than satisfy all proxy requests itself, a proxy server can be configured to use other proxies with the ProxyRemote directive, making use of already cached information, rather than contacting the destination server directly. ProxyRemote takes two parameters: a URL prefix and a remote proxy to contact when the requested URL matches that prefix. For example:


ProxyRemote http://www.mainsite.com http://mirror.mainsite.com:8080

This causes any request URL that starts with http://www.mainsite.com to be forwarded to a mirror site on port 8080 instead. The URL prefix can be as short as we like, so we can instead proxy all HTTP requests with:


ProxyRemote http http://http.proxy.remote.com

We can also proxy ftp in the same way (assuming the proxy server is listening on port 21, the ftp port):


ProxyRemote ftp ftp://ftp.ftpmirror.com

Alternatively, we can encapsulate FTP requests in HTTP messages with:


ProxyRemote ftp http://http.ftpmirror.com

Finally, we can just redirect all requests to a remote proxy with a special wildcard symbol:


ProxyRemote * http://proxy.remote.com

It is possible to specify several ProxyRemote directives, in which case Apache will run through them in turn until it reaches a match. More specific remote proxies must therefore be listed first to avoid being overridden by more general ones:


ProxyRemote http://www.mainsite.com http://mirror.mainsite.com:8080
ProxyRemote http http://http.proxy.remote.com
ProxyRemote * http://other.proxy.remote.com

Note that the only way to override a ProxyRemote once it is set is via the NoProxy directive. This is useful for enabling local clients to access local web sites on proxy servers; the proxy will satisfy the request locally rather than automatically ask the remote proxy – see “Proxies and Intranets” later in the chapter.

Proxy Chains and the Via: header

HTTP/1.1 defines the Via: header, which proxy servers automatically add to returned documents en route from the remote destination to the client that requested them. A document that passes through proxies A, B, and C on its way back to the client thus arrives with Via: headers for C, B, and A, in that order.

Some clients can choke on Via: headers, however, and there are sometimes reasons to disguise the presence of a proxy – security being one of them. For this reason, Apache allows us to control how Via: headers are processed by proxy servers with the ProxyVia directive, which takes one of four parameters:

ProxyVia off (default)

The proxy does not add a Via: header to the HTTP response, but allows any existing Via: headers through untouched. This effectively hides the proxy from sight.

ProxyVia on

The proxy adds a conventional Via: header to say that the document was relayed by it.

ProxyVia full

The proxy adds a Via: header and in addition appends the Apache server version.

ProxyVia block

The proxy strips all Via: headers from the response and does not add one for itself.

Note that the default setting of ProxyVia is off, so a proxy will not add a Via: header unless we specifically ask it to.

ProxyVia is occasionally confused with the ProxyRemote directive – although its name suggests that ProxyVia has something to do with relaying requests onward, that job is actually performed by ProxyRemote.

Proxies and Intranets

Defining remote proxies is useful for processing external requests, but presents a problem when it comes to serving documents from local servers to local clients. Making the request via an external proxy is at best unnecessary and time consuming, and at worst will cause a request to fail entirely if the proxy server is set up on a firewall that denies the remote proxy access to the internal site.

We can disable proxying for particular hosts or domains with the NoProxy directive to enable a list of whole or partial domain names and whole or partial IP addresses to be served locally. For example, if we wanted to use our web server as a forward proxy for internal clients but still allow web servers on the local 204.148 network, we could specify the following directives:


ProxyRequests on
ProxyRemote * http://proxy.remoteserver.com:8080
NoProxy 204.148
ProxyDomain .alpha-complex.com

This causes the server to act as a proxy for requests to all hosts outside the local network and relay all such requests to proxy.remoteserver.com. Local hosts, including virtual hosts on the web server itself, are served directly, without consulting the remote proxies.

NoProxy also accepts whole or partial hostnames and a bitmask for subnets, so the following are all valid:


NoProxy 204.148.0.0/16 internal.alpha-complex.com intranet.net

A related problem comes from the fact that clients on a local network don’t need to fully qualify the name of the server they want if it is in the same domain; i.e., instead of a URL of http://www.alpha-complex.com, they can put http://www. This can cause problems for proxies, since the shortened name will not match parameters in other proxy directives like ProxyPass or NoProxy. To fix this, the proxy can be told to append a domain name to incomplete host names with ProxyDomain, as shown in the example above. Since the specified domain is literally appended, it is important to include a dot at the start:


ProxyDomain .domain.of.proxy

Handling Errors

When a client receives a server-generated document like an error message after making a request through a proxy (or chain of proxies), it is not always clear whether the remote server or a proxy generated the document. To help clarify this, Apache provides the core directive ServerSignature, which is allowed in any scope and generates a footer line with details of the server. This footer is appended to any document generated by the proxy server. The directive takes one of three parameters:

off (default)

Appends no additional information

on

Appends a line with the server name and version number

email

As on, but additionally appends a mailto: URL with the ServerAdmin email address


For example, to generate a full footer line with an administrator's email address, we would put:

ServerSignature email

Now error documents generated by the proxy itself have a line appended identifying the proxy as the source of the error, while documents retrieved from the remote server (be they server generated or otherwise) are passed through as is.

This directive is not technically proxy-related, since it can be used by non-proxy servers too; however, its primary application is in proxy configurations.

Tunneling Other Protocols

Proxying is mainly directed towards the HTTP and FTP protocols, and either http: or ftp: URLs can be specified for directives that use URLs as arguments. In addition, mod_proxy will also accept HTTP CONNECT requests from clients that wish to connect to a remote server via a protocol other than HTTP or FTP.

When the proxy receives a CONNECT request, it compares the port used to a list of allowed ports. If the port is allowed, the proxy makes a connection to the remote server specified on the same port number and maintains the connection to both remote server and client, relaying data, until one side or the other closes their link.

By default, Apache accepts CONNECT requests on ports 443 (https) and 563 (snews). These ports can be overridden with the AllowCONNECT directive, which takes a list of port numbers as a parameter. For example, Apache can be told to proxy https and telnet connections by specifying port 443 and port 23, the telnet port:


AllowCONNECT 443 23

A CONNECT request from a client that uses a telnet: or https: URL will then be proxied. To test a telnet proxy, we can go to the command line and telnet to the proxy:


telnet proxy.alpha-complex.com 8080

Then enter a CONNECT request for a host:


CONNECT remote.host:23 HTTP/1.0

And press Return twice.

If the proxy allows the request, the remote host will be contacted on port 23 and a telnet session started, producing a login prompt.

Tuning Proxy Operations

The ProxyReceiveBufferSize directive specifies a network buffer size for HTTP and FTP transactions and takes a number of bytes as a parameter. If defined, it has to be greater than 512 bytes; for example:


ProxyReceiveBufferSize 4096

If a buffer size of zero is specified, Apache uses the default buffer size of the operating system. Adjusting the value of ProxyReceiveBufferSize may improve (or worsen) the performance of the proxy.
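For example, to fall back explicitly to the operating system's default, a minimal sketch:

# use the operating system's default socket buffer size
ProxyReceiveBufferSize 0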

mod_proxy also defines a number of directives to control how, where, and for how long documents are cached; we'll discuss these in the next section.

Squid – A High-Performance Proxy Alternative

Apache’s mod_proxy is adequate for small-to-medium web sites, but for more intensive duty its performance is lacking. An alternative proxy server is Squid, which is specifically designed to handle multiple requests and high loads.

As well as HTTP, it also handles and caches FTP, GOPHER, WAIS and SSL requests, and runs on AIX, Digital UNIX, FreeBSD, HP-UX, Irix, Linux, NetBSD, Nextstep, SCO, and Solaris – but not Windows or Macintosh.

Squid is open source and freely available from http://squid.nlanr.net, which also contains support documentation, a user guide and FAQ, and the Squid mailing list archives.


Caching

One of the primary reasons for establishing a proxy server is to cache documents retrieved from remote hosts. Both forward and reverse proxies can benefit from caching. Forward proxies reduce the bandwidth demands of clients accessing servers elsewhere on the internet by caching frequently accessed pages, which is invaluable for networks with limited bandwidth to the outside world. Reverse proxies, conversely, cache frequently accessed pages on a local server so that it is not subjected to constant requests for static pages when it has more important dynamic queries to process.

Enabling Caching

Caching is not actually required by proxy servers and is not enabled by the use of the ProxyRequests directive. Rather, caching is implicitly enabled by defining the directory under which cached files are to be stored with CacheRoot:


CacheRoot /usr/local/apache/proxy/

Other than the root directory for caching, mod_proxy provides two other directives for controlling the layout of the cache:

CacheDirLevels: defines the number of subdirectories that are created to store cached files. The default is three. To change it to six we can put:


CacheDirLevels 6

CacheDirLength: defines the length of the directory names used in the cache. The default is 1. It is inadvisable to use names longer than 8 on Windows systems due to the problems of long file names on these platforms.

These two directives are reciprocal – a single letter directory name leaves relatively few permutations for Apache to run through, so a cache intended to store a lot of data will need an increased number of directory levels. Conversely, a longer directory name allows many more directories per level, which can be a performance issue if the number of directories becomes large, but allows a shallower directory tree.
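To make the trade-off concrete, here is a hedged sketch of the two extremes (the path and values are illustrative, not prescriptive): short names need a deep tree, while longer names allow a shallow one:

# deep, narrow tree: single-character directory names, six levels
CacheRoot /usr/local/apache/proxy
CacheDirLength 1
CacheDirLevels 6

# or a shallow, broad tree: two-character names need fewer levels
#CacheDirLength 2
#CacheDirLevels 3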

Setting the Cache Size

Probably the most important parameter to set for a proxy cache is its size. The default cache size is only 5 kilobytes, so we would usually increase it with the CacheSize directive, which takes a number of kilobytes as a parameter. To set a 100MB cache, we would put:


CacheSize 102400

However, this in itself means nothing unless Apache is also told to trim down the size of the cache when it exceeds this limit. This is called garbage collection and is governed by the CacheGcInterval directive, which schedules a time period in hours between scans of the cache. To scan and trim down the cache once a day, we would put:


CacheGcInterval 24

The chosen value is a compromise between performance and disk space – if we have a quiet period once a day, it makes sense to trim the cache every 24 hours, but we also have to make sure that the cache can grow above its limit for a day without running into disk space limitations.

We can also schedule much more frequent garbage collection by using a decimal number:


# trim the cache every 75 minutes

CacheGcInterval 1.25

# trim the cache every 12 minutes

CacheGcInterval 0.2

Without a CacheGcInterval directive, the cache will never be trimmed and will continue to grow indefinitely. This is almost certainly a bad idea, so CacheGcInterval should always be set on caching proxies.

Delivering Cached Documents and Expiring Documents from the Cache

Apache will only deliver documents from the cache to clients if they are still valid; otherwise, it will fetch a new copy from the remote server and cache it in place of the expired version. Apache also trims the cache based on the validity of the documents in it. Each time the period specified by CacheGcInterval elapses, Apache scans the cache looking for expired documents.

The expiry time of a document can be set in five ways:

  • HTTP/1.1 defines the Expires: header that a server can use to tell a proxy how long a document is considered valid.

  • We can set a maximum time after which all cached documents are considered invalid irrespective of the expiry time set in the Expires: header.

  • HTTP documents that do not specify an expiry time can have one estimated based on the time they were last modified.

  • Non-HTTP documents can have a default expiry time set for them.

  • Both HTTP/1.0 and HTTP/1.1 hosts may send a header telling the proxy whether or not a document can be cached, though the header differs between the two.

The maximum time after which a document automatically expires is set by CacheMaxExpire, which takes a number of hours as an argument. The default period is one day, or 24 hours, which is equivalent to the directive:


CacheMaxExpire 24

    To change this to a week we would put:


CacheMaxExpire 168

This time period defines the absolute maximum time a file is considered valid, starting from the time it was stored in the cache. Although other directives can specify shorter times, longer times will always be overridden by CacheMaxExpire.

HTTP documents that do not carry an expiry header can have an estimated expiry time set using the CacheLastModifiedFactor directive. This gives the document an expiry time equal to the time since the file was last modified, multiplied by the specified factor. The factor can be a decimal value, so to set an expiry time of half the age of the document, we would put:


    CacheLastModifiedFactor 0.5

    If the calculated time exceeds the maximum expiration time set by CacheMaxExpire, the maximum expiration time takes precedence, so outlandish values that would result from very old documents are avoided. Likewise, if a factor is not set at all, the document expires when it exceeds the maximum expiry time.

    The HTTP protocol supports expiry times directly, but other protocols do not. In these cases, a default expiry time can be specified with CacheDefaultExpire, which takes a number of hours as a parameter. For example, to ensure that cached files fetched with FTP expire in three days, we could put:


    CacheDefaultExpire 72

    For this directive to be effective, it has to specify a time period shorter than CacheMaxExpire; if no default expiry time is set, files fetched with protocols other than HTTP automatically expire at the time limit set by CacheMaxExpire.

    A special case arises when the proxy receives a content-negotiated document from an HTTP/1.0 source. HTTP/1.1 provides additional information to let a proxy know how valid a content-negotiated document is, but HTTP/1.0 does not. By default, Apache does not cache documents from HTTP/1.0 sources if they are content negotiated unless they come with a header telling Apache it is acceptable to do so. If the remote host is running Apache, it can add this header with the CacheNegotiatedDocs directive – see “Content Negotiation” in Chapter 4 for more details.
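If the origin server is itself an Apache server, a single line in its configuration, sketched here, marks content-negotiated documents as cacheable by proxies:

# on the remote (origin) Apache server, not the proxy:
# allow proxies to cache content-negotiated documents
CacheNegotiatedDocs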

    Caching Incomplete Requests

    Sometimes a client will disconnect from a proxy before it has finished transferring the requested document from the remote server. Ordinarily, Apache will discontinue transferring the document and discard what it has already transferred unless it has already transferred over 90 percent. This percentage can be changed with CacheForceCompletion, which takes a number between 0 and 100 as a percentage. For example, to force the proxy to continue loading a document and cache it if 75 percent or more of it has already been transferred we would put:


    CacheForceCompletion 75

    A setting of 0 is equivalent to the default, 90. A setting of 100 means Apache will not cache the document unless it completely transfers before the client disconnects.

    Disabling Caching for Selected Hosts, Domains, and Documents

    Just as NoProxy defines hosts, domains, or words that cause matching URLs not to be passed to remote proxies, NoCache causes documents from hosts, domains, or words that match the URL to remain uncached. For example:


    NoCache interactive.alpha-complex.com uncacheddomain.net badword

This will cause the proxy to avoid caching any document from interactive.alpha-complex.com, any host in the domain uncacheddomain.net, and any domain name with the word badword anywhere in it. If any parameter to NoCache resolves to a unique IP address via DNS, Apache will make a note of it at startup and also avoid caching any URL that equates to the same IP address. Caching can also be disabled completely with a wildcard:


    NoCache *

    This is equivalent to commenting out the corresponding CacheRoot directive.
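Pulling the directives of this section together, a hedged starting point for a caching proxy configuration might look like this (paths and values are illustrative, not prescriptive):

# enable proxying and caching
ProxyRequests on
CacheRoot /usr/local/apache/proxy
# 100MB cache, trimmed once a day
CacheSize 102400
CacheGcInterval 24
# expiry policy
CacheMaxExpire 24
CacheLastModifiedFactor 0.5
CacheDefaultExpire 12
# never cache documents from the interactive host
NoCache interactive.alpha-complex.com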


Fault Tolerance and Clustering

When web sites become large and busy, issues of reliability and performance become more significant. It can be disastrous if the server of an important web site like an online storefront or a web-hosting ISP falls over, and visitors are put off by sites that are sluggish and hard to use.

    Both these problems can be solved to a greater or lesser extent in two basic ways:

• We can make our servers more powerful, adding more memory and faster disks, or upgrading to a faster processor or a multiprocessor system. This is simple, but potentially expensive.
    • We can install more servers and distribute the load of client requests between them. Because they are sharing the load, the individual servers do not have to be expensive power servers, just adequate to the job.

    Multiple servers are an attractive proposition for several reasons: They can be cheap and therefore easily replaceable, individual servers can fall over without the web site becoming unavailable, and increasing capacity is just a case of adding another server without needing to open up or reconfigure an existing one.

    However, we can’t just dump a bunch of servers on a network and expect them to work as one. We need to make them into a cluster, so that external clients do not have to worry about, and preferably aren’t aware of, the fact that they are talking to a group of servers and not just one.

There are two basic approaches to clustering, DNS load sharing and web server clustering, and several solutions in each. Which we choose depends on exactly what we want to achieve and how much money we are prepared to spend to achieve it. We'll first look at DNS solutions before going on to look at true web clusters and a home-grown clustering solution using Apache.

In favor of the DNS approach is the fact that it works not just for web servers, but for FTP archives or any other kind of network server, since it is protocol independent.

    Backup Server Via Redirected Secondary DNS

    The simplest of the DNS configuration options, this approach allows us to create a backup server for the primary web server by taking advantage of the fact that all domain names have at least two nominated name servers, a primary and a secondary, from which their IP address can be determined.

    Ordinarily, both name servers hold a record for the name of the web server with the same IP address:


    www.alpha-complex.com. IN A 204.148.170.3

    However, there is no reason why the web server cannot be the primary name server for itself. If we set up two identical servers, we can make the web server its own primary name server and give the secondary server a different IP address for the web server. For example:


    www.alpha-complex.com. IN A 204.148.170.203

    In normal operation, the IP address of the web server is requested by other name servers directly from the web server’s own DNS service. If for any reason the web server falls over, however, the primary name server will no longer be available and DNS requests will resort to the secondary name server. This returns the IP address of the backup server rather than the primary so client requests will succeed.

The Time To Live (TTL) setting of the data served by the primary DNS server on the web server needs to be set to a low value like 30 minutes, or external name servers will cache the primary web server’s IP address and not request an update from the secondary name server in a timely fashion, making the web server apparently unavailable until the DNS information expires. Zone file TTLs are given in seconds, so we can give the A record a time to live of 30 minutes by altering it to:


www.alpha-complex.com. 1800 IN A 204.148.170.3

There are several caveats to this scheme: session tracking, user authentication, and cookies are likely to get confused when the IP address switches to the backup server, and no provision is made for load sharing; the backup server is never accessed until the primary server becomes unavailable, no matter how busy the primary might be. Note also that unavailable means totally unavailable: if the httpd daemon crashes but the machine is still capable of DNS resolution, the switch will not take place.

    Load Sharing with Round-Robin DNS

Since version 4.9, BIND, the Internet name daemon that runs the bulk of the world's DNS servers, has provided a configuration called round-robin DNS. This was an early approach to load sharing between servers and still works today. It works by specifying multiple IP addresses for the same host:


    www.alpha-complex.com. 60 IN A 204.148.170.1

    www.alpha-complex.com. 60 IN A 204.148.170.2

    www.alpha-complex.com. 60 IN A 204.148.170.3

    When a DNS request for the IP address for www.alpha-complex.com is received, BIND returns one of these three addresses and makes a note of it. The next request then gets the next IP address in the file and so on until the last one, after which BIND returns to the first address again. Subsequent requests will therefore get IP addresses in the order: 204.148.170.1, 204.148.170.2, 204.148.170.3, 204.148.170.1 …

Just as with the backup server approach, we have to deal with the fact that other name servers will cache the response they get from us, thwarting the round-robin. To stop this, we set a short time-to-live value; TTLs in zone files are given in seconds, so the 60 values in the records above mean one minute.

    We can specify a lower value, but this causes more DNS traffic in updates, which improves the load sharing on our web servers at the expense of increasing the load on our name server.

    The attraction of round-robin DNS is its simplicity – we only have to add a few lines to one file to make it work (two files if you include the secondary name server). It also works for any kind of server, not just web servers. The drawback is that this is not true load balancing, only load sharing – the round-robin takes no account of which servers are loaded and which are free or even which are actually up and running.

    Hardware Load Balancing

    Various manufacturers such as Cisco have load balancing products for networks that cluster servers at the TCP/IP level. These are highly effective but can also be expensive.

    Clustering with Apache

Apache provides a simple but clever way to cluster servers using features of mod_rewrite and mod_proxy together. This gets around DNS caching problems by hiding the cluster behind a proxy server, and because it uses Apache, it is totally free, of course.

    To make this work, we have to nominate one machine to be a proxy server, handling requests to several back-end servers on which the web site is actually located. The proxy takes the name www.alpha-complex.com, and we call our back-end servers www1 to www6.

The solution comprises two parts:

    • Using mod_rewrite to randomly select a back-end server to service the client request.
    • Using mod_proxy‘s ProxyPassReverse directive to disguise the URL of the back-end server so clients are compelled to direct further requests through the proxy.

    Part one makes use of the random text map feature of mod_rewrite, which was developed primarily to allow this solution to work. We create a map file containing a single line:


# /usr/local/apache/rewritemaps/cluster.txt
#
# Random map of back-end web servers
www www1|www2|www3|www4|www5|www6

    When used, this map will take the key www and randomly return one of the values www1 to www6.

    We now write some mod_rewrite directives into the proxy server’s configuration to make use of this map to redirect URLs to a random server:


# switch on URL rewriting
RewriteEngine on

# define the cluster servers map
RewriteMap cluster rnd:/usr/local/apache/rewritemaps/cluster.txt

# rewrite the URL if it matches the web server host
RewriteRule ^http://www\.(.*)$ http://${cluster:www}.$1 [P,L]

# forbid any URL that doesn't match
RewriteRule .* - [F]

    Depending on how sophisticated we want to be, we can make this rewrite rule a bit more advanced and cope with more than one cluster at a time:

    Map file:


    www www1|www2|www3|www4|www5|www6

    secure secure-a|secure-b

    users admin.users|normal.users

    Rewrite Rule:


# rewrite the URL based on the hostname asked for. If nothing matches,
# default to 'www1':
RewriteRule ^http://([^.]+)\.(.*)$ http://${cluster:$1|www1}.$2 [P,L]

We can even have the proxy cluster both HTTP and FTP servers, so long as clients are configured to send their FTP requests through the proxy as well:

    Map file:


www www1|www2|www3|www4|www5|www6
ftp ftp|archive|attic|basement

    Rewrite Rule:


# rewrite the URL based on the protocol and hostname asked for:
RewriteRule ^(http|ftp)://[^.]+\.(.*)$ $1://${cluster:$1}.$2 [P,L]

Part two makes use of mod_proxy to rewrite URLs generated by the back-end servers as a result of a redirection. Without this, clients would receive redirection responses with Location: headers pointing at www1 to www6 rather than www. We can fix this with ProxyPassReverse:


ProxyPassReverse / http://www1.alpha-complex.com/
ProxyPassReverse / http://www2.alpha-complex.com/
...
ProxyPassReverse / http://www6.alpha-complex.com/

    A complete Apache configuration for creating a web cluster via proxy would look something like this:


# Apache Server Configuration for Clustering Proxy
#
### Basic Server Setup

# The proxy takes the identity of the web site...
ServerName www.alpha-complex.com
ServerAdmin webmaster@alpha-complex.com
ServerRoot /usr/local/apache
DocumentRoot /usr/local/apache/proxysite
ErrorLog /usr/local/apache/proxy_error
TransferLog /usr/local/apache/proxy_log
User nobody
Group nobody

# dynamic servers load their modules here...

# don't waste time on things we don't need
HostnameLookups off

# this server is only for proxying, so switch off everything else
<Directory />
Options None
AllowOverride None
</Directory>

# allow a local client to access the server status
<Location />
order deny,allow
deny from all
allow from 127.0.0.1
SetHandler server-status
</Location>

### Part 1 - Rewrite

# switch on URL rewriting
RewriteEngine on

# Define a log for debugging, but set the log level to zero to disable it
# for performance
RewriteLog logs/proxy_rewrite
RewriteLogLevel 0

# define the cluster servers map
RewriteMap cluster rnd:/usr/local/apache/rewritemaps/cluster.txt

# rewrite the URL if it matches the web server host
RewriteRule ^http://www\.(.*)$ http://${cluster:www}.$1 [P,L]

# forbid any URL that doesn't match
RewriteRule .* - [F]

### Part 2 - Proxy

ProxyRequests on
ProxyPassReverse / http://www1.alpha-complex.com/
ProxyPassReverse / http://www2.alpha-complex.com/
ProxyPassReverse / http://www3.alpha-complex.com/
ProxyPassReverse / http://www4.alpha-complex.com/
ProxyPassReverse / http://www5.alpha-complex.com/
ProxyPassReverse / http://www6.alpha-complex.com/

# We don't want caching, preferring to let the back-end servers take the
# load, but if we did:
#
#CacheRoot /usr/local/apache/proxy
#CacheSize 102400

    Because this works at the level of an HTTP/FTP proxy rather than lower level protocols like DNS or TCP/IP, we can also have the proxy cache files and use it to bridge a firewall, allowing the cluster to reside on an internal and protected network.

The downside of this strategy is that it does not intelligently distribute the load. We could fix this by replacing the random map file with an external mapping program that attempts to make intelligent guesses about which servers are most suitable; the program should be very simple, though, so as not to adversely affect performance, since it will be called for every client request.
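mod_rewrite supports exactly this through its prg: map type: the program is started with the server, reads one lookup key per line from standard input, and writes the chosen value (or NULL for no match) to standard output. On the Apache side, the wiring might look like this minimal sketch; the chooser program's path is hypothetical:

# use an external chooser program instead of the rnd: map; mod_rewrite
# feeds it the key ('www') on stdin and reads the selected back-end
# server name ('www1' to 'www6') from its stdout
RewriteMap cluster prg:/usr/local/apache/rewritemaps/chooser

# the rewrite rule itself is unchanged
RewriteRule ^http://www\.(.*)$ http://${cluster:www}.$1 [P,L]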

    Other Clustering Solutions

There are several commercial and free clustering solutions available on the Internet. Here are a few that might be of interest if none of the solutions above is sophisticated enough:

    Eddie

The Eddie Project is an open-source initiative sponsored by Ericsson to develop advanced clustering solutions for Linux, FreeBSD, and Solaris; a Windows NT version is under development.

There are two packages available: an enhanced DNS server that takes the place of the BIND daemon and performs true load balancing, and an intelligent HTTP gateway that allows web servers to be clustered across disparate networks. A sample Apache configuration is included with the software, and binary RPM packages are available for x86 Linux systems.

    Eddie is available from http://www.eddieware.org/.

    TurboCluster

    TurboCluster is a freely available clustering solution developed for TurboLinux: http://community.turbolinux.com/cluster/.

    Sun Cluster

Solaris users will most probably be interested in Sun's own clustering application; however, this is not a free or open product. See http://www.sun.com/clusters/.

    Freequalizer

    Freequalizer is a freely available version of Equalizer, produced by Coyote Point Systems, designed to run on a FreeBSD server (Equalizer, the commercial version, runs on its own dedicated hardware). GUI monitoring tools are available as part of the package.

    Freequalizer is available from http://www.coyotepoint.com.

    ©1999 Wrox Press Limited, US and UK.
