Professional Apache - Proxying
In addition to its normal duties, Apache is also capable of operating as a
proxy, either as a dedicated proxy server or in combination with serving normal
web sites from the local server.
Proxies are intermediate servers that stand between a client and a remote
server and make requests to the remote server on behalf of the client. The
objective is twofold. First, a caching proxy can keep a copy of a suitable
document so that the next time a client asks for it, the proxy can deliver it
from the cache without contacting the remote server. Second, a proxy allows
clients and servers to be logically isolated from each other, so security
measures can be placed between them to ensure no unauthorized transactions can
take place.
In this section, we concentrate on Apache's proxy-related features, before
going on to discuss caching and more developed examples in the next
section.
Installing and Enabling Proxy Services
Apache's proxy functionality is encapsulated in mod_proxy, an optional
module supplied as standard with Apache. This primarily implements
HTTP/1.0-style proxying but has recently gained some HTTP/1.1 features such as
support for Via headers. To enable it, either rebuild the server with the module
statically linked or compile it as a dynamic module and include it in the server
configuration as described in Chapter 3. Note that the dynamic proxy module is
called libproxy.so, not mod_proxy.so.
Once installed, proxy operation is simply enabled by specifying the
ProxyRequests directive:
ProxyRequests on
We can also switch off proxy services again with:
ProxyRequests off
This directive can go only in the server-level configuration or, more
commonly, in a virtual host. However, we can configure proxy behavior based on
the requested URL using a Directory container.
Normal Proxy Operation
Proxies on a network can work in two directions, forward and reverse, and may
also operate in both modes at once.
A forward proxy relays requests from clients on the local network and caches
pages from other sites on the Internet, reducing the amount of data transferred
on external links; this is a popular application for companies that need to make
efficient use of their external bandwidth.
A reverse proxy relays requests from clients outside the local network and
caches pages from the local web sites, reducing the load on the servers.
When a client is configured to use a proxy to fetch a remote HTTP or FTP URL,
it contacts the proxy, giving the complete URL, including the protocol and
remote domain name. The proxy server then checks to see if it is allowed to
relay this request and, if so, fetches the remote URL on behalf of the client
and returns it. If the proxy is set up to cache documents and the document is
cacheable, it also stores it for future requests.
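The difference between a direct request and one sent through a proxy shows up in
the request line. The following Python sketch is an illustration only, not
anything Apache provides; the function name request_line is hypothetical, and it
simply builds the HTTP/1.0 request line a client would send in each case.

```python
from urllib.parse import urlsplit

def request_line(url, via_proxy):
    """Return the HTTP/1.0 request line a client would send for url."""
    if via_proxy:
        # A proxy needs the complete URL, including protocol and host name.
        target = url
    else:
        # A direct request names only the path; the host is implied by
        # the connection the client has already made.
        parts = urlsplit(url)
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
    return "GET " + target + " HTTP/1.0"

direct = request_line("http://www.alpha-complex.com/", via_proxy=False)
proxied = request_line("http://www.alpha-complex.com/", via_proxy=True)
```

Here direct carries only the path, while proxied carries the whole URL, which is
what lets the proxy work out where to send the request.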
A proxy server with dual network interfaces makes a very effective firewall;
external clients connect to one port, and internal clients to the other. The
proxy relays requests in and out according to its configuration and deals with
all connection requests. Since there are no direct connections between the
internal network and the rest of the world, security is much
improved.
Configuring Apache as a Proxy
In order for Apache to function as a proxy, the only required directive is
ProxyRequests, which enables Apache for both forward and reverse proxying - it
makes no distinction about whether the client or remote server is internal or
external, since Apache is not aware of the network topology.
Once proxying is enabled, requests for URLs that the Apache server is locally
responsible for are served as normal, but requests for URLs on hosts that do not
match any of the hosts that are running on that server cause Apache to attempt
to retrieve the URL itself as a client and pass the response back to the
client.
Rather bizarrely, we can test that a proxy server is working by using it to
proxy a web site served by the same Apache server. Because the server will serve
its own content directly, we have to put the proxy on a different port number -
say 8080:
Port 80
Listen 80
Listen 8080
User httpd
Group httpd
# dynamic servers load modules here...
ServerName www.alpha-complex.com
ServerAdmin webmaster@alpha-complex.com
DocumentRoot /home/www/alpha-complex
ErrorLog logs/main_error
TransferLog logs/main_log
<VirtualHost 204.148.170.3:8080>
ServerName proxy.alpha-complex.com
ProxyRequests On
ErrorLog logs/proxy_error
TransferLog logs/proxy_log
</VirtualHost>
If we test this configuration without telling the client to use the proxy and
ask for http://www.alpha-complex.com/, we get the standard home page as expected
and a line in the access log main_log that looks like this:
127.0.0.1 - - [27/Aug/1999:17:09:30 +0100] "GET / HTTP/1.0" 200 103
If we now configure the client to use www.alpha-complex.com, port 8080 as a
proxy server, we get the same line in main_log:
127.0.0.1 - - [27/Aug/1999:17:50:21 +0100] "GET / HTTP/1.0" 200 103
followed almost immediately by a line in the proxy log:
127.0.0.1 - - [27/Aug/1999:17:50:21 +0100] "GET http://www.alpha-complex.com/" 200 103
What has happened here is that the proxy has received the request on port
8080, stripped out the domain name, and issued a forward HTTP request to that
domain on port 80, the default port for HTTP requests. The main server gets this
request and responds to it, returning the index page to the proxy, which then
returns it to the client.
From this it might appear that enabling proxy functionality in a virtual host
overrides the default behavior which would be to serve the page directly, since
the virtual host inherits the DocumentRoot directive from the main server. If
the ProxyRequests directive were not present this is what we would expect to
happen. However, the truth is a little more involved. If we ask for the URL
http://www.alpha-complex.com:8080/, we get the index page, served directly by
the virtual host, without the proxy. If we look in the proxy_log file we
see:
127.0.0.1 - - [27/Aug/1999:17:50:21 +0100] "GET http://www.alpha-complex.com:8080/" 200 103
There is no corresponding line in main_log, however, indicating that the proxy
virtual host actually served the page directly. Why is this? The answer is
simple if we remember how Apache matches URLs to virtual hosts. The virtual host
inherits the settings of the main server, so the actual configuration of the
proxy looks like this:
<VirtualHost 204.148.170.3:8080>
Port 80
User httpd
Group httpd
ServerAdmin webmaster@alpha-complex.com
DocumentRoot /home/www/alpha-complex
ServerName proxy.alpha-complex.com
ProxyRequests On
ErrorLog logs/proxy_error
TransferLog logs/proxy_log
</VirtualHost>
The Listen directives are not inherited, since they are not
valid in containers. The User and Group directives are only inherited if suEXEC
is in use. Otherwise, they have no effect.
When we configured our client to use the proxy and asked for the URL without
a port number, the virtual host received the request but was unable to satisfy
it itself, because the default HTTP port is 80, not 8080. It therefore had to
use the proxy functionality to make a request for http://www.alpha-complex.com
on port 80. This request is picked up by the server but does not match the
virtual host on port 8080, and so is received by the main server, which
satisfies the request. The response is then sent out by Apache in the guise of
the main server, back to itself in the guise of the virtual host, which then
returns the page to the client.
However, when we asked for the index page on port 8080, the virtual host
could satisfy that request because it can receive requests made for port 8080.
It has a valid DocumentRoot directive, so it serves the page directly to the
client without forwarding the request itself.
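The two cases in this walkthrough can be modeled in a few lines of Python. This
is only a sketch of the reasoning above, not of Apache's internals: the function
name trace and the constants are hypothetical, and the only local host name it
knows about is the one used in the example.

```python
from urllib.parse import urlsplit

LOCAL_NAMES = {"www.alpha-complex.com"}  # assumption: names served by this Apache
PROXY_PORT = 8080   # port of the virtual host with ProxyRequests on
MAIN_PORT = 80      # port of the main server

def trace(url):
    """Describe how a request sent to the proxy virtual host is handled."""
    parts = urlsplit(url)
    port = parts.port or 80  # 80 is the default port for http URLs
    local = parts.hostname in LOCAL_NAMES
    if local and port == PROXY_PORT:
        # The URL names the virtual host itself, so no proxying is needed.
        return ["virtual host serves the page directly"]
    steps = ["virtual host forwards the request as a proxy"]
    if local and port == MAIN_PORT:
        steps.append("main server serves the page to the proxy")
    else:
        steps.append("remote server serves the page to the proxy")
    steps.append("virtual host returns the page to the client")
    return steps

without_port = trace("http://www.alpha-complex.com/")
with_port = trace("http://www.alpha-complex.com:8080/")
```

The first call goes through three steps and involves the main server; the second
is answered by the virtual host alone, which matches the log entries seen
earlier.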
Note that if we put a ProxyRequests on directive into the server-level
configuration, every virtual host becomes a proxy server and will happily serve
proxy requests for any URL it can't satisfy itself. This is interesting, but not
necessarily useful behavior. To make a proxy available only when and how we want
it, we can customize the scope and operation of the proxy with both Directory
and VirtualHost containers.
URL Matching with Directory Containers
As mentioned previously, when a client is configured to use a server as a
proxy, it sends the server a URL request including the protocol and domain name
(or IP address) of the document it desires.
Apache defines a special variant of the Directory container to allow proxy
servers to be configured conditionally based on the URL using the prefix proxy:
in the directory specification. Just as with normal Directory containers, the
actual URL can be wildcarded, so the simplest container can match all possible
proxy requests with:
<Directory proxy:*>
... directives for proxy requests only ...
</Directory>
With this directive present, ordinary URL requests will be served by the main
site, whereas proxy requests will be served according to the configuration
inside the Directory container. This allows us to insert host or user
authentication schemes that only apply when the server is used as a proxy, as
opposed to a normal web server.
We can also be more specific. The proxy module by default proxies HTTP, FTP,
and Secure HTTP (SSL) connections, which correspond to the protocol identifiers
http:, ftp:, and https:. We can therefore define protocol-specific Directory
containers along the lines of:
<Directory proxy:http:*>
... proxy directives for http ...
</Directory>
<Directory proxy:ftp:*>
... proxy directives for ftp ...
</Directory>
We can extend the URL in the container as far as we like to match specific
hosts or wildcarded URLs:
<Directory proxy:*/www.alpha-complex.com/*>
... proxy directives for www.alpha-complex.com ...
</Directory>
When a client makes a request by any protocol to www.alpha-complex.com, the
directives in this container are applied to the request; we can put proxy cache
directives here, allow and deny directives to control access, and so on. Here's
a complete virtual host definition with host-based access control:
<VirtualHost 204.148.170.3:8080>
ServerName proxy.alpha-complex.com
ErrorLog logs/proxy_error
TransferLog logs/proxy_log
ProxyRequests on
CacheRoot /usr/local/apache/cache
# limit use of this proxy to hosts on the local network
<Directory proxy:*>
order deny,allow
deny from all
allow from 204.148.170
</Directory>
</VirtualHost>
We've added a CacheRoot directive to implement a cache. We'd normally want to
specify a few more directives than this, as we will see in the next section, but
this will work. We've also added a directory container allowing the use of this
proxy by hosts on the local network only; this makes the proxy available for
forward proxying but barred from performing reverse proxying - external sites
cannot use it as a proxy for www.alpha-complex.com.
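To get a feel for which proxy: containers apply to a given request, we can
approximate the wildcard matching with Python's shell-style fnmatch module. This
is a rough sketch under the assumption that * matches any run of characters; the
helper name matching_containers is hypothetical, and Apache's own matching rules
may differ in detail.

```python
from fnmatch import fnmatchcase

def matching_containers(url, patterns):
    """Return the proxy: patterns that match a proxied URL."""
    target = "proxy:" + url  # proxy requests are matched with this prefix
    return [p for p in patterns if fnmatchcase(target, p)]

patterns = [
    "proxy:*",
    "proxy:http:*",
    "proxy:ftp:*",
    "proxy:*/www.alpha-complex.com/*",
]

http_hits = matching_containers(
    "http://www.alpha-complex.com/index.html", patterns)
ftp_hits = matching_containers("ftp://ftp.example.com/file.txt", patterns)
```

The HTTP request matches the catch-all, the http: container, and the
host-specific container, whereas the FTP request matches only the catch-all and
the ftp: container.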
Blocking Sites via the Proxy
It is frequently desirable to prevent a proxy from relaying requests to
certain remote servers; this is especially true for proxies that are primarily
designed to cache pages for rapid access. We can block access to sites with the
ProxyBlock directive; for example:
ProxyBlock www.badsite.com baddomain.dom badword
This directive causes the proxy to refuse to retrieve URLs from hosts with
names that contain any of these text elements. In addition, when Apache starts
it tries out each parameter in the list with DNS to see if it resolves to an IP
address; if so, the IP address is also blocked.
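The substring matching described here is easy to picture in code. The sketch
below is illustrative only: is_blocked is a hypothetical helper, the
case-insensitive comparison is an assumption, and the DNS lookup that Apache
performs at startup is deliberately left out.

```python
def is_blocked(host, block_list):
    """Return True if host contains any of the blocked text elements."""
    host = host.lower()  # assumption: host names are compared without regard to case
    return any(item.lower() in host for item in block_list)

# The parameters from the ProxyBlock example above.
block_list = ["www.badsite.com", "baddomain.dom", "badword"]
```

Note how broad the third entry is: any host name containing the text badword
anywhere is refused, not just a host of that exact name.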
Note that ProxyBlock is not the directive to use to counter the effects of a
ProxyRemote directive. To make a server satisfy requests to hosts it serves
itself, rather than forward them to the remote proxy, use NoProxy.
Localizing Remote URLs and Hiding Servers from View
Rather than simply passing on URLs for destinations that are not resolvable
locally, a server can also map the contents of a remote site into a local URL
using the ProxyPass directive. Unlike the other directives of mod_proxy, this
works even for hosts that are not proxy servers and does not require that
ProxyRequests be set to on.
For example, suppose we had three internal servers www.alpha-complex.com,
users.alpha-complex.com, and secure.alpha-complex.com. Instead of allowing
access to all three, we could map the users and secure web sites so they appear
to be part of the main web site by adding these two directives to the
configuration for www.alpha-complex.com:
ProxyPass /users/ http://users.alpha-complex.com/
ProxyPass /secure/ http://secure.alpha-complex.com/secure-part/
As mentioned above, we don't need to specify ProxyRequests on for this to
work.
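The mapping these two directives set up amounts to a prefix substitution. As a
minimal sketch (the names PROXY_PASS and map_local_path are hypothetical, and
this models only the URL translation, not the fetching):

```python
# (local prefix, remote prefix) pairs, taken from the two directives above.
PROXY_PASS = [
    ("/users/", "http://users.alpha-complex.com/"),
    ("/secure/", "http://secure.alpha-complex.com/secure-part/"),
]

def map_local_path(path, rules=PROXY_PASS):
    """Return the remote URL for a local path, or None if no rule applies."""
    for local_prefix, remote_prefix in rules:
        if path.startswith(local_prefix):
            # Swap the local prefix for the remote one and keep the rest.
            return remote_prefix + path[len(local_prefix):]
    return None  # not mapped: the server handles the path itself
```

A request for /users/fred/index.html is therefore fetched from
http://users.alpha-complex.com/fred/index.html, while /index.html is not mapped
and is served locally.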
We can also create what looks like a real web site, but is in fact just a
proxy by mapping the URL /. This allows us to hide a real web site behind a
proxy firewall without external users being aware of any unusual activity:
ProxyPass / http://realwww.intranet.alpha-complex.com/
In order for this subterfuge to work, we also have to take care of
redirections that the internal server realwww.intranet.alpha-complex.com might
send in response to the client request.
Without intervention, this may pass the real name of the internal server to
the client, causing the proxy to be bypassed or the request to simply fail in
the case of a firewall. Fortunately, we can use ProxyPassReverse, which rewrites
the Location: header of a redirection received from the internal host so it
matches the proxy rather than the internal server. The rewritten response then
goes to the client, which is none the wiser.
ProxyPassReverse takes exactly the same arguments as the ProxyPass directive
it parallels:
ProxyPass / http://realwww.intranet.alpha-complex.com/
ProxyPassReverse / http://realwww.intranet.alpha-complex.com/
In general, wherever we put a ProxyPass directive, we probably want to put a
ProxyPassReverse directive, too.
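The effect of ProxyPassReverse on a redirection can be sketched as follows. This
is an illustration under stated assumptions rather than the module's actual
code: rewrite_location is a hypothetical helper, and
http://www.alpha-complex.com is assumed to be the public name of the proxy.

```python
def rewrite_location(location, local_prefix, remote_prefix, proxy_base):
    """Rewrite a redirect target so it points at the proxy, not the internal server."""
    if location.startswith(remote_prefix):
        # Replace the internal server's prefix with the proxy's own URL.
        return proxy_base + local_prefix + location[len(remote_prefix):]
    return location  # redirects to other hosts are left alone

new_location = rewrite_location(
    "http://realwww.intranet.alpha-complex.com/docs/",
    "/",
    "http://realwww.intranet.alpha-complex.com/",
    "http://www.alpha-complex.com",  # assumed public name of the proxy
)
```

After the rewrite the client is redirected to http://www.alpha-complex.com/docs/
and never sees the internal host name.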
This feature is intended primarily for reverse proxies where external clients
are asking for documents on local servers. It is unlikely to be useful for
forward proxying scenarios.
Redirecting Requests to Remote Proxy
Rather than satisfy all proxy requests itself, a proxy server can be
configured to use other proxies with the ProxyRemote directive, making use of
already cached information, rather than contacting the destination server
directly. ProxyRemote takes two parameters: a URL prefix and a remote proxy to
contact when the requested URL matches that prefix. For example:
ProxyRemote http://www.mainsite.com http://mirror.mainsite.com:8080
This causes any request URL that starts with http://www.mainsite.com to be
forwarded to a mirror site on port 8080 instead. The URL prefix can be as short
as we like, so we can instead proxy all HTTP requests with:
ProxyRemote http http://http.proxy.remote.com
We can also proxy ftp in the same way (assuming the proxy server is listening
on port 21, the ftp port):
ProxyRemote ftp ftp://ftp.ftpmirror.com
Alternatively, we can encapsulate FTP requests in HTTP messages with:
ProxyRemote ftp http://http.ftpmirror.com
Finally, we can just redirect all requests to a remote proxy with a special
wildcard symbol:
ProxyRemote * http://proxy.remote.com
It is possible to specify several ProxyRemote directives, in which case
Apache will run through them in turn until it reaches a match. More specific
remote proxies must therefore be listed first to avoid being overridden by more
general ones:
ProxyRemote http://www.mainsite.com http://mirror.mainsite.com:8080
ProxyRemote http http://http.proxy.remote.com
ProxyRemote * http://other.proxy.remote.com
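Why the order matters becomes obvious if we model the first-match rule directly.
The following sketch is not Apache's implementation; the names REMOTE_PROXIES
and pick_remote_proxy are hypothetical, and a bare word such as http is assumed
to match the URL's protocol exactly.

```python
# (match, remote proxy) pairs in the order they appear in the configuration.
REMOTE_PROXIES = [
    ("http://www.mainsite.com", "http://mirror.mainsite.com:8080"),
    ("http", "http://http.proxy.remote.com"),
    ("*", "http://other.proxy.remote.com"),
]

def pick_remote_proxy(url, rules=REMOTE_PROXIES):
    """Return the remote proxy of the first rule that matches url."""
    for match, remote in rules:
        if match == "*":
            return remote  # wildcard: matches every request
        if "://" in match:
            matched = url.startswith(match)        # URL prefix
        else:
            matched = url.split(":", 1)[0] == match  # bare protocol name
        if matched:
            return remote
    return None  # no rule matched: contact the destination directly

# Listing the general rule first hides the more specific ones.
wrong_order = list(reversed(REMOTE_PROXIES))
```

With the rules in the order shown, a request for www.mainsite.com goes to the
mirror; with wrong_order, every request ends up at other.proxy.remote.com.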
Note that the only way to override a ProxyRemote once it is set is via the
NoProxy directive. This is useful for enabling local clients to access local web
sites on proxy servers; the proxy will satisfy the request locally rather than
automatically ask the remote proxy - see "Proxies and Intranets" later in the
chapter.
Proxy Chains and the Via: Header
HTTP/1.1 defines the Via: header, which proxy servers automatically add to
returned documents en route from the remote destination to the client that
requested them. A document that a client requests through proxies A, B, and C
thus comes back with Via: headers for C, B, and A, in that order.
Some clients can choke on Via: headers, however, and there are sometimes
reasons to disguise the presence of a proxy - security being one of them. For
this reason, Apache allows us to control how Via: headers are processed by proxy
servers with the ProxyVia directive, which takes one of four parameters:
| ProxyVia off (default) | The proxy does not add a Via: header to the HTTP response, but allows any existing Via: headers through untouched. This effectively hides the proxy from sight. |
| ProxyVia on | The proxy adds a conventional Via: header to say that the document was relayed by it. |
| ProxyVia full | The proxy adds a Via: header and in addition appends the Apache server version. |
| ProxyVia block | The proxy strips all Via: headers from the response and does not add one for itself. |
Note that the default setting of ProxyVia is off, so a proxy will not add a
Via: header unless we specifically ask it to.
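The four settings can be compared side by side with a small model of what
happens to the Via: headers as a response passes through one proxy. This is a
sketch only: apply_proxy_via is a hypothetical helper, and the exact header
format used here (a protocol version, the proxy name, and the server version in
parentheses) is an assumption made for illustration.

```python
def apply_proxy_via(via_headers, mode, proxy_name, server_version="Apache/1.3"):
    """Return the Via: header values after a response passes through one proxy."""
    if mode == "block":
        return []                    # strip everything, add nothing
    if mode == "off":
        return list(via_headers)     # pass existing headers through untouched
    entry = "1.0 " + proxy_name      # assumed format of a Via: entry
    if mode == "full":
        entry += " (" + server_version + ")"
    if mode in ("on", "full"):
        return list(via_headers) + [entry]
    raise ValueError("unknown ProxyVia setting: " + mode)

# A response travelling back through proxies C, B, and A, all set to "on".
chain = []
for name in ("C", "B", "A"):
    chain = apply_proxy_via(chain, "on", name)
```

After the loop, chain holds entries for C, B, and A in that order. Setting any
one proxy in the chain to block would discard the entries collected up to that
point.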
ProxyVia is occasionally confused with the ProxyRemote directive - although
its name suggests that ProxyVia has something to do with relaying requests
onward, that job is actually performed by ProxyRemote.
Proxies and Intranets
Defining remote proxies is useful for processing external requests, but
presents a problem when it comes to serving documents from local servers to
local clients. Making the request via an external proxy is at best unnecessary
and time consuming, and at worst will cause a request to fail entirely if the
proxy server is set up on a firewall that denies the remote proxy access to the
internal site.
We can disable remote proxying for particular hosts or domains with the NoProxy
directive, which takes a list of whole or partial domain names and whole or
partial IP addresses to be served directly. For example, if we wanted to use our
web server as a forward proxy for internal clients but still have web servers on
the local 204.148 network served directly, we could specify the following
directives:
ProxyRequests on
ProxyRemote * http://proxy.remoteserver.com:8080
NoProxy 204.148
ProxyDomain .alpha-complex.com
This causes the server to act as a proxy for requests to all hosts outside
the local network and relay all such requests to proxy.remoteserver.com. Local
hosts, including virtual hosts on the web server itself, are served directly,
without consulting the remote proxies.
NoProxy also accepts whole or partial hostnames and subnets with a bitmask, so
the following are all valid:
NoProxy 204.148.0.0/16 internal.alpha-complex.com intranet.net
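The different forms a NoProxy parameter can take are easier to compare in code.
The sketch below is an approximation, not Apache's matching logic:
served_directly is a hypothetical helper, it does no DNS lookups, and it assumes
that a name matches both itself and any host beneath it.

```python
import ipaddress

def _as_ip(text):
    """Return an IP address object, or None if text is a host name."""
    try:
        return ipaddress.ip_address(text)
    except ValueError:
        return None

def served_directly(host, no_proxy):
    """Return True if host matches any NoProxy entry."""
    host = host.lower()
    ip = _as_ip(host)
    for entry in no_proxy:
        entry = entry.lower()
        if "/" in entry:
            # Subnet with a bitmask, such as 204.148.0.0/16.
            if ip is not None and ip in ipaddress.ip_network(entry, strict=False):
                return True
        elif entry.replace(".", "").isdigit():
            # Whole or partial dotted address, such as 204.148.
            if ip is not None and (host == entry or host.startswith(entry.rstrip(".") + ".")):
                return True
        else:
            # Whole or partial host name (assumed to cover subdomains too).
            name = entry.lstrip(".")
            if ip is None and (host == name or host.endswith("." + name)):
                return True
    return False

entries = ["204.148.0.0/16", "internal.alpha-complex.com", "intranet.net"]
```

With these entries, 204.148.170.3 and www.intranet.net are served directly,
while a request for an outside host still goes to the remote proxy.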
A related problem comes from the fact that clients on a local network don't
need to fully qualify the name of the server they want if it is in the same
domain, i.e., instead of a URL of http://www.alpha-complex.com, they can put
http://www. This can cause problems for proxies, since the shortened name will
not match parameters in other Proxy directives like ProxyPass or NoProxy. To fix
this, the proxy can be told to append a domain name to incomplete host names
with ProxyDomain, as shown in the example above. Since the specified domain is
literally appended, it is important to include a dot at the start:
ProxyDomain .domain.of.proxy
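The reason for the leading dot is plain once the appending is written out. This
is a minimal sketch, with qualify as a hypothetical helper that models only the
name completion and nothing else about how the proxy responds:

```python
def qualify(host, proxy_domain):
    """Append proxy_domain to a host name that has no domain part."""
    if "." in host:
        return host  # already qualified (or an IP address)
    return host + proxy_domain  # literal append, so the dot must be supplied

good = qualify("www", ".alpha-complex.com")
bad = qualify("www", "alpha-complex.com")  # dot forgotten
```

Here good is www.alpha-complex.com, whereas bad comes out as
wwwalpha-complex.com, which is not the host we wanted.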
Handling Errors
When a client receives a server-generated document like an error message
after making a request through a proxy (or chain of proxies), it is not always
clear whether the remote server or a proxy generated the document. To help
clarify this, Apache provides the core directive ServerSignature, which is
allowed in any scope and generates a footer line with details of the server.
This footer is appended to any document generated by the proxy server. The
directive takes one of three parameters:
| off (default) | Appends no additional information |
| on | Appends a line with the server name and version number |
| email | As on, but additionally appends a mailto: URL with the ServerAdmin email address |
For example, to generate a full footer line with an
administrator's email address, we would put:
ServerSignature email
Now error documents generated by the proxy itself have a line appended
identifying the proxy as the source of the error, while documents retrieved from
the remote server (be they server generated or otherwise) are passed through as
is.
This directive is not technically proxy-related, since it can be used by
non-proxy servers, too; however, its primary application is in proxy
configurations.
Tunneling Other Protocols
Proxying is mainly directed towards the HTTP and FTP protocols, and either
http: or ftp: URLs can be specified for directives that use URLs as arguments.
In addition, mod_proxy will also accept HTTP CONNECT requests from
clients that wish to connect to a remote server via a protocol other than HTTP
or FTP.
When the proxy receives a CONNECT request, it compares the port used to a
list of allowed ports. If the port is allowed, the proxy makes a connection to
the remote server specified on the same port number and maintains the connection
to both remote server and client, relaying data, until one side or the other
closes their link.
By default, Apache accepts CONNECT requests on ports 443 (https) and 563
(snews). These ports can be overridden with the AllowCONNECT directive, which
takes a list of port numbers as a parameter. For example, Apache can be told to
proxy https and telnet connections by specifying port 23, the telnet port, and
port 443:
AllowCONNECT 443 23
A CONNECT request from a client that uses a telnet: or https: URL will then
be proxied. To test a telnet proxy, we can go to the command line and telnet to
the proxy:
telnet proxy.alpha-complex.com 8080
Then enter a CONNECT request for a host:
CONNECT remote.host:23 HTTP/1.0
And press Return twice.
If the proxy allows the request, the remote host will be contacted on port 23
and a telnet session started, producing a login prompt.
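The same manual test can be scripted. The sketch below is one possible way to do
it with Python's standard socket module; the function names are hypothetical,
ALLOWED_PORTS merely mirrors the AllowCONNECT example above, and the host names
in the usage note are the placeholders used in this section.

```python
import socket

ALLOWED_PORTS = {443, 23}  # mirrors "AllowCONNECT 443 23"

def connect_allowed(port, allowed=ALLOWED_PORTS):
    """Model the proxy's check of the requested port against its allowed list."""
    return port in allowed

def build_connect_request(host, port):
    """Return the bytes of a CONNECT request, ending with the blank line."""
    # The trailing empty line corresponds to pressing Return twice.
    return ("CONNECT %s:%d HTTP/1.0\r\n\r\n" % (host, port)).encode("ascii")

def open_tunnel(proxy_host, proxy_port, host, port, timeout=10):
    """Connect to the proxy, send CONNECT, and return (socket, first reply bytes)."""
    sock = socket.create_connection((proxy_host, proxy_port), timeout=timeout)
    sock.sendall(build_connect_request(host, port))
    reply = sock.recv(4096)  # the proxy's status line, possibly followed by tunnel data
    return sock, reply
```

Calling open_tunnel("proxy.alpha-complex.com", 8080, "remote.host", 23) performs
the steps above in one go; if the proxy accepts the request, whatever is read
from the returned socket afterwards comes from the remote host.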
Tuning Proxy Operations
The ProxyReceiveBufferSize directive specifies a network buffer size for HTTP
and FTP transactions and takes a number of bytes as a parameter. If defined, it
has to be greater than 512 bytes; for example:
ProxyReceiveBufferSize 4096
If a buffer size of zero is specified, Apache uses the default buffer size of
the operating system. Adjusting the value of ProxyReceiveBufferSize may improve
(or worsen) the performance of the proxy.
mod_proxy also defines a number of directives to control how, where,
and for how long documents are cached, and we'll discuss these in the next
section.
Squid - A High-Performance Proxy Alternative
Apache's mod_proxy is adequate for small-to-medium web sites, but for
more intensive duty, its performance is lacking. An alternative proxy server is
Squid, which is specifically designed to handle multiple requests and high
loads.
As well as HTTP, it also handles and caches FTP, GOPHER, WAIS and SSL
requests, and runs on AIX, Digital UNIX, FreeBSD, HP-UX, Irix, Linux, NetBSD,
Nextstep, SCO, and Solaris - but not Windows or Macintosh.
Squid is open source and freely available from http://squid.nlanr.net, which
also contains support documentation, a user guide and FAQ, and the Squid mailing
list archives.
©1999 Wrox Press Limited, US and UK.