Getting Started with Apache 2.0 Part II

In this second article in a three-part series, you will learn how to customize the log files generated by the Apache Web server, and much more.

Last time, I showed you how to install Apache on your own machine from scratch, and explained the configuration directives that help run the Web server. You also learned how to configure the latest versions of PHP and Apache to work together – an exciting combination, though I must admit that there are conflicting opinions on this subject.

Today, I’ll explain the configuration of the “main server.” I opted to skip this topic in the last article. Next, I’ll talk about the different log files generated by the Apache Web server and how you can customize them to your requirements. Finally, we’ll see how to create “Virtual Hosts” that give us the ability to run multiple websites on a single instance of the Apache Web server.

So, what’re you waiting for? Let’s get cracking!

One Server, One Website

In the previous article, I gave you a quick overview of the “httpd.conf” configuration file but I stopped short of explaining the section that dealt with the “main server” configuration due to space constraints. So, let me talk about it before I run out of bytes again.

By default, Apache is configured to serve a single website, and the “main server” configuration section encompasses “directives” that drive this default website. Let me explain them one-by-one:

ServerAdmin webmaster@mysite.com

The “ServerAdmin” directive allows you to specify an e-mail address for the administrator of the Web server. This comes in handy if an error occurs on the server: Apache will display this e-mail address on the Web page for further action by the visitor who encountered the error.

Humans being humans, website visitors cannot be expected to remember the bunch of numbers that form the IP address when most fail to even remember their own anniversaries. This is where the use of human-friendly “domain names” and the “ServerName” Apache directive comes in handy, as it is easier to remember a URL such as www.mysite.com than its IP address. Take a look:

ServerName www.mysite.com:80

As seen above, this directive allows you to explicitly specify the name for your Web server and the port on which it runs. However, if you do not have a valid host name for your server, you can also opt to list the IP address of the Web server with this directive.
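If you go the IP address route, the directive might look something like the sketch below. Note that the address is purely an assumption on my part; substitute the one assigned to your own machine.

```apache
# Hypothetical address: replace it with the IP address of your own server
ServerName 192.168.1.10:80
```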

DocumentRoot /usr/local/apache/htdocs

This directive specifies the location of the “root” folder of the Web server. By default, all requests are served from the files located under this folder.

Next, I’ll talk about the “Directory” directive – this special directive allows you to specify a list of features for each file-system directory on the Web server. Consider the following snippet:

<Directory />
    Options FollowSymLinks
    AllowOverride None
</Directory>

Let us start with the root folder on the server. For obvious security reasons, the default Apache configuration enforces high security for this folder by disabling most features. However, you can customize these very features for subsequent folders located under the root folder. Let me start with the root folder of the “Web server” as specified in the “DocumentRoot” directive:

<Directory /usr/local/apache/htdocs>
    Options Indexes FollowSymLinks
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

The “Options” directive allows you to list the features that you would like to enable for the directory in question; in the above example, you have the “DocumentRoot” folder under scrutiny. For example, the above snippet enables two of them: “Indexes,” which directs the Web server to list the contents of the directory in the absence of an index page (for example, index.htm) and “FollowSymLinks,” which, as the name suggests, instructs the Web server to follow any “symbolic” links that may be located in the current folder.

There are other features that you can enable with the “Options” directive inside a “Directory” block, including the “ExecCGI” option that allows you to execute CGI scripts on your Web server, the “Includes” option that allows you to use Server-Side Include (SSI) commands in your Web pages, and much more. For a complete list, please review the following URL: http://httpd.apache.org/docs-2.0/mod/core.html#options
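To make the “ExecCGI” and “Includes” options a little more concrete, here is a minimal sketch. The two folder names (“cgi-demo” and “ssi-demo”) are hypothetical, and I am assuming that the modules for CGI and SSI support are loaded on your server.

```apache
# Hypothetical folder under the DocumentRoot used in this article
<Directory /usr/local/apache/htdocs/cgi-demo>
    # Permit the execution of CGI scripts in this folder only
    Options ExecCGI
    # Treat files ending in .cgi as CGI scripts
    AddHandler cgi-script .cgi
    AllowOverride None
</Directory>

# Hypothetical folder whose .shtml pages may contain SSI commands
<Directory /usr/local/apache/htdocs/ssi-demo>
    Options Includes
    # Serve .shtml files as HTML and pass them through the SSI filter
    AddType text/html .shtml
    AddOutputFilter INCLUDES .shtml
</Directory>
```

Keep in mind that an “Options” line written without “+” or “-” prefixes replaces the options inherited from the parent folder instead of adding to them.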

Next, the “AllowOverride” directive enables the use of a folder-specific “.htaccess” file, which in turn can contain its own set of Apache directives to drive access and security on a per-folder basis.

Note that the “directives” that you can use in this “.htaccess” file are governed by the value specified for the “AllowOverride” directive above. If this is set to “None” – as in the code listing above – the “.htaccess” files are totally ignored. For other options, I recommend that you read the following documentation: http://httpd.apache.org/docs-2.0/mod/core.html#allowoverride.
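Here is a rough sketch of how the two pieces could fit together, assuming that you only want “.htaccess” files to control authentication. The “private” folder and the location of the password file are hypothetical names that I have made up for this illustration.

```apache
# In httpd.conf: .htaccess files under this (hypothetical) folder
# may use authentication directives, and nothing else
<Directory /usr/local/apache/htdocs/private>
    AllowOverride AuthConfig
</Directory>

# In /usr/local/apache/htdocs/private/.htaccess
AuthType Basic
AuthName "Private Area"
# Password file created with the htpasswd utility; the path is an assumption
AuthUserFile /usr/local/apache/passwd/passwords
Require valid-user
```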

Finally, you have the “Order,” “Allow” and “Deny” directives. These three directives are provided by the “mod_access” Apache module and allow you to implement a fine-grained access policy across the entire Web server. The “Order” directive lists the “order” in which the “Allow” and “Deny” directives are evaluated, whereas the “Allow” and “Deny” directives allow you to specify a domain, an IP address (even a partial one) or a network/netmask pair – thereby permitting or restricting access to specific clients respectively. For more information on this “mod_access” module and its associated directives, hop over to http://httpd.apache.org/docs-2.0/mod/mod_access.html
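As an illustration, the following sketch locks down a folder for everyone except the machines on an internal network. The “intranet” folder and both network addresses are assumptions of mine, so adjust them to suit your own setup.

```apache
# Hypothetical folder that only internal machines may access
<Directory /usr/local/apache/htdocs/intranet>
    # Deny rules are evaluated first; Allow rules can then override them
    Order deny,allow
    Deny from all
    # A partial IP address and a network/netmask pair (both hypothetical)
    Allow from 192.168.1
    Allow from 10.1.0.0/255.255.0.0
</Directory>
```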

Now, you must be wondering why I’ve opted to talk about these “main server” specific directives now. The reason is simple. Later in the article, I’ll show you how to configure a single Web server to run several websites, and you should be able to re-use these directives for each website.

The Apache Log Files

You’ve probably read (or heard) the fairy tale in which Hansel and Gretel left breadcrumbs to find their way back home. But, what on earth does this bedtime story have to do with the topic that I am currently discussing – the Apache log files?

Frankly, not much!

To be honest, it was an attempt to highlight the similar purpose of two diverse actions: the breadcrumbs dropped by the two characters in the fairy tale and the log files generated by the Apache Web server. The former ensures that the children (in the story) were able to return home safely after their escape and the latter allows a webmaster to learn more about the visitors based on the data (i.e. breadcrumbs) left behind after every visit to the website. And if that’s not enough, the error log files can help developers resolve nasty errors that occur on the server.

Generally speaking, Apache can be configured to generate two types of log files. The first records every request made to the Web server and the second logs all the errors encountered by the Web server such as the infamous “404 – File Not Found” error or the notorious “500 – Internal Server Error” and many more!

By default, the Apache Web server is configured to generate several versions of these log files, but only two are active when you start it for the first time. In order to give you a better picture, allow me to explain the first log-related configuration directive from the “httpd.conf” file:

ErrorLog logs/error_log

As the name suggests, the “ErrorLog” directive allows you to specify the name of the “error” log file. This path can either be relative to the server root (as above) or represent the absolute path to the file on your system. Note that the default name (of the log file) will vary with the OS platform that you use; you can always modify it here.
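For comparison, here is what the absolute-path flavor might look like. The folder below is a hypothetical location, and I am assuming that it already exists and that Apache is allowed to write to it.

```apache
# Absolute path to the error log (hypothetical location)
ErrorLog /var/log/apache/error_log

# Alternative: hand the messages to the system logger instead of a file,
# on platforms that support it
#ErrorLog syslog:user
```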

Now, let us take a peek at the contents of the error log file generated by my local Apache instance:

[Sat Dec 18 15:11:54 2004] [error] [client 127.0.0.1] request failed: error reading the headers
[Sat Dec 18 15:15:34 2004] [error] [client 127.0.0.1] File does not exist: /usr/local/apache/htdocs/library/styles.css

Now let’s decipher these entries. The first value represents the date and time the error occurred, the second indicates the “error level” of the current entry (more on this later), the third column is the IP address of the client machine and finally, we have a message that attempts to describe the nature of the error. Note that the system location of the file, instead of its web path, is written into the log file.

The next directive de-mystifies the “error level” term used above. Aptly named “LogLevel,” it can be assigned any one value from a pre-defined list and allows you to control the errors (based on their severity) that you wish to record in your “error” log file. Take a look:

#
# LogLevel: Control the number of messages logged to the
# error.log.
# Possible values include: debug, info, notice, warn,
# error, crit, alert, emerg.
#
LogLevel warn

I have deliberately copied the comments that precede this directive in the configuration file – you’ll notice that they list the different values that you can assign to the “LogLevel” directive. The names of the different error levels give a good indication of the nature of the errors recorded at each level, and each level also records the messages of all the more severe levels. It is generally recommended that you set this value to “debug” on development servers and to “error” on production ones.

For complete information on all of the error levels listed above, you can review the official Apache documentation at: http://httpd.apache.org/docs-2.0/mod/core.html#loglevel

Here’s a little tip (excerpted from the official documentation) before we move to the next section: you can use the following command on most *NIX systems to monitor your error log file on a continuous basis:

$ tail -f /usr/local/apache/logs/error_log

Who Are You?

In the previous section, I showed you how to configure the error log file generated by the Apache Web server. While this helps a developer during the development and maintenance phases of a project, it may not be very useful to a Web master. The latter wants to analyze the traffic and the nature of visitors that visit his website, not errors. Fortunately, for those requirements, he has the Apache “access” log file.

Now it’s time to review the directives that govern these “access” log file(s):

CustomLog logs/access_log common

The syntax of the “CustomLog” directive is similar to that of the “ErrorLog” directive: you have to specify the name and the path (absolute or relative) of the log file. The only difference is the presence of the “common” keyword at the end of the line – this represents the “nickname” for the log entry format that you would like to use.

Yes, you can define your own custom format (as well as “nicknames”) using the “LogFormat” directive. There are four pre-defined formats listed in the default configuration file:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent

In general, the syntax of the “LogFormat” directive looks something like this:

LogFormat LOG_FILE_ENTRY_FORMAT  NICKNAME

While the syntax describing the format appears complex at first glance, it will make sense once you understand what each symbol in the format string stands for. Consider the following “LogFormat” entry:

LogFormat "%h %l %u %t \"%r\" %>s %b" common

The keyword “common” represents its nickname and is used in the “CustomLog” directive to refer to this format; we’ve already seen that above.

Now, let me concentrate on the format string itself, where each symbol has a very specific purpose. In order to make things easier, let me paste a sample entry from my local “access_log” file:

127.0.0.1 - root [18/Dec/2003:12:52:43 +0530] "GET /phpmyadmin/db_details_structure.php?lang=en-iso-8859-1&server=1&db=industry HTTP/1.1" 200 28406

Next, let me map each symbol from the format string to the actual entry in the above log file snippet:

  • The “%h” (value in log file: 127.0.0.1 ) represents the IP address of the client machine (in most cases).

  • The “%l” (value in log file: - ) is replaced by the RFC 1413 identity of the client. However, the official documentation states that this value is “highly unreliable and should almost never be used except on tightly controlled internal networks.” Note that a “hyphen,” i.e. the “-” symbol is used by the Web server to indicate that it could not retrieve a value for a particular parameter.

  • The “%u” (value in log file: root) indicates the username of the user accessing the Web server using HTTP  authentication. Often, this value is not recorded as most visitors are anonymous to the Web server.

  • The “%t” (value in log file: 18/Dec/2003:12:52:43 +0530) represents the date and time of the request. Note that you can customize the format of the time stamp with “%{format}t,” where the format uses the syntax of the strftime() C function.

  • The “%r” (value in log file: GET /phpmyadmin/db_details_structure.php?lang=en-iso-8859-1&server=1&db=industry HTTP/1.1) is replaced by the first line of the request sent by the client machine. Along with the request method (GET or POST, for example) and the protocol, the entry also lists all the parameters sent in the query string, as seen above, for a GET request.

  • The “%>s” (value in log file: 200) represents the HTTP status code and is very useful to programmers and Web masters. Some of the common values listed in this column are 200 (indicating a successful response), 404 (indicating that the requested file was not found) and 500 (indicating an error occurred during the execution of the requested script).

    You can view a list of all HTTP status codes at the following URL: http://www.w3.org/Protocols/rfc2616/rfc2616.txt

  • The “%b” symbol (value in log file: 28406) represents the size of the response in bytes, excluding the HTTP headers; this gives an indication of the bandwidth used by the website.

Note that if you wish to insert quotes in the log files, you have to escape them with a backslash in the log format string. For example, writing \"%r\" in the format string encloses the request line within quotes in the log file.

And that’s not all – there are many more symbols that you can use in your format string. Here are some important ones:

  • The “%A” symbol will display the local IP address.

  • The “%B” will display the number of bytes sent to the client, excluding the size of the HTTP headers. It differs from “%b” only in that it logs “0” instead of a hyphen when no bytes are sent. This is useful if you want to get an accurate picture of the bandwidth used by the different elements of your website such as images, style sheets, and so forth.

  • The “%{VARNAME}e” will list the contents of the “VARNAME” environment variable.

  • The “%f” symbol will display the filename requested by the client.

  • The “%H” will indicate the request protocol.

  • The “%m” symbol will represent the request method.

  • The “%T” will be replaced by the actual time, in seconds, taken by the server to respond to the request.

Finally, there are two more important symbols that I would like to highlight before I move to the next directive:

  • The “%{User-agent}i” symbol is used to store the details of the client accessing the website. This can be used to identify the different browsers that you should support on the basis of the visitors accessing the website.

  • The “%{Referer}i” symbol is used to store the details of the resource that referred the visitor to the current page. Once again, this is an ideal mechanism to study how visitors are referred to your website.
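Putting a few of these symbols together, here is a small sketch of a home-grown format. The nickname “timed” and the file name “timed_log” are names that I have made up for this illustration; everything else uses the symbols discussed above, plus “%U” (the URL path) and “%q” (the query string).

```apache
# "timed" is a hypothetical nickname for this custom format:
#   %{%Y-%m-%d %H:%M:%S}t  time stamp in strftime() notation
#   %m %U%q %H             request method, URL path, query string, protocol
#   %>s %B %T              status code, bytes sent, seconds taken
LogFormat "%h %{%Y-%m-%d %H:%M:%S}t %m %U%q %H %>s %B %T \"%{User-agent}i\"" timed

# Write a second access log in this format next to the existing one
CustomLog logs/timed_log timed
```

Since Apache accepts more than one “CustomLog” directive, you can try out a new format like this without disturbing your existing “access_log” file.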

There is one more directive that deserves a mention – the “HostnameLookups” directive instructs the Web server to attempt to map the IP address of the client to its host name.

HostnameLookups Off

By default, this directive is turned “Off.” However, if you turn this directive “On,” the log file should contain human-readable host names (such as “www.kcsonline.biz”) instead of the machine-friendly IP addresses (such as “69.44.155.211”). There is one caveat that you should keep in mind: the Web server has to perform an additional DNS lookup for every request in order to obtain the host name, which in turn could slow down request processing, thereby severely affecting performance.

Before I conclude this section, let me give you a little note on the analysis of the Apache log files: leveraging the popularity of this Open Source Web server, there are a multitude of products that help you to analyze the log files generated by Apache. At one end of the spectrum, you have HTTP-Analyze (http://www.http-analyze.org/) available for free to personal users, and at the other end you have sophisticated (read expensive) tools such as Web Trends (http://www.webtrends.com/). The choice is yours!

One Server, One Hundred Websites

In an earlier section, I introduced the “ServerName” directive that allows you to store the domain name or IP address of the machine hosting your Web server. Theoretically, every computer is assigned a unique IP address on the network. So, it would not be wrong for you to assume that one can host only one website on a single machine – in fact, that’s especially true, because IP addresses are slowly becoming a scarce commodity.

To be frank – the above conclusion is not correct, thanks to “Virtual Hosting,” a feature that allows you to host more than one website on a single server. There are two possible mechanisms you can use to implement this: “IP-based” hosting (different IP addresses for the different host names) and “name-based” hosting (different host names on a single server). In this article, I’ll focus on the second concept of “name-based” hosting. This feature gives you the ability to run several websites – each with its own unique URL – on a single Web server (with a single IP address).

A little caveat before I proceed – note that “name-based” hosting relies on the “Host” request header, which became mandatory in version 1.1 of the HTTP protocol. Older clients that do not send this header may experience difficulties accessing such websites. But there is no reason to panic: take a look at this URL later – http://httpd.apache.org/docs-2.0/vhosts/name-based.html#compat – for a suitable workaround.

Coming back to “name-based” hosting, consider the following scenario that I often find myself in: a website project that I am working on has two different versions. The first version represents a “beta” version that I am constantly tweaking and the second is the “live” version, which contains code that has been tested and approved by the client. While it is recommended that one should host these two versions on different machines, it was not an option for my project because of infrastructure and budgetary constraints.

So, I came up with a solution that will implement two “Virtual Hosts,” one for each version of the website, in order to achieve some semblance of isolation between the two. For example, I host the “beta” version at “http://beta.mysite.com” and the “live” version at “http://www.mysite.com.”

Now, let me show you how this can be implemented using “name-based Virtual Hosting.” Let’s go back to the good old “httpd.conf” file – first, you need to specify the IP address of the machine serving requests for all “Virtual Hosts” as shown below:

#NameVirtualHost *:80
NameVirtualHost 127.0.0.1:80

Above, I have used the “NameVirtualHost” directive to specify the local IP address for my Web server (i.e. 127.0.0.1) along with the default HTTP port (i.e. 80). Note that you must specify the port if you plan to configure other ports on the server differently; for example, if you have implemented SSL on port 443 and do not wish to use “Virtual Hosting” for SSL requests.

You could also opt for the “*” symbol (with the “NameVirtualHost” directive) in order to use “Virtual Hosting” for all IP addresses that your server is configured with.
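Here is a minimal sketch of the wildcard variant, re-using the “beta” host name from my scenario. The one thing to watch out for is that the argument of every “VirtualHost” block should match the value given to “NameVirtualHost.”

```apache
# Use name-based virtual hosting on port 80 of every IP address
# that the server listens on
NameVirtualHost *:80

# The argument here matches the NameVirtualHost value above
<VirtualHost *:80>
    ServerName beta.mysite.com
    DocumentRoot /usr/local/apache/htdocs/beta
</VirtualHost>
```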

Next, I have listed a sample Virtual Host included in the default version of the “httpd.conf” file:

#<VirtualHost *:80>
#    ServerAdmin webmaster@dummy-host.example.com
#    DocumentRoot /www/docs/dummy-host.example.com
#    ServerName dummy-host.example.com
#    ErrorLog logs/dummy-host.example.com-error_log
#    CustomLog logs/dummy-host.example.com-access_log common
#</VirtualHost>

Let me tweak the above sample listing in order to implement the requirements of the “beta” and “live” URLs that I spoke about earlier.

<VirtualHost 127.0.0.1:80>
    DocumentRoot /usr/local/apache/htdocs/beta
    ServerName beta.mysite.com
    ServerAdmin admin-beta@mysite.com
    ErrorLog logs/mysite-beta-error.log
    CustomLog logs/mysite-beta-access.log common
</VirtualHost>

<VirtualHost 127.0.0.1:80>
    DocumentRoot /usr/local/apache/htdocs/live
    ServerName www.mysite.com
    ServerAdmin admin-beta@mysite.com
    ErrorLog logs/mysite-live-error.log
    CustomLog logs/mysite-live-access.log common
</VirtualHost>

For starters, I have set each “VirtualHost” block to match the IP address listed in the “NameVirtualHost” directive. Note that this is only because I do not plan to use any other IP address on this server. For obvious reasons, the IP addresses would have been different for each block if I were implementing “IP-based Virtual Hosting” or the Web server had been configured to serve more than one IP address.
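For the sake of comparison, an “IP-based” setup might look roughly like the sketch below. Both IP addresses are hypothetical, and I am assuming that they have already been assigned to the machine.

```apache
# IP-based hosting: no NameVirtualHost directive is required.
# Both addresses are hypothetical and must be configured on the machine.
<VirtualHost 192.168.1.10:80>
    ServerName beta.mysite.com
    DocumentRoot /usr/local/apache/htdocs/beta
</VirtualHost>

<VirtualHost 192.168.1.11:80>
    ServerName www.mysite.com
    DocumentRoot /usr/local/apache/htdocs/live
</VirtualHost>
```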

Next, you’ll notice that I have used different Apache directives such as “ServerName,” “DocumentRoot,” “ErrorLog,” “CustomLog” and so forth within each “VirtualHost” block. While you’re already familiar with the functionality of each directive, the above listing highlights the ability to customize these directives for each “virtual” website.

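There is one drawback to this exercise of defining multiple virtual hosts using the “NameVirtualHost” directive: the “main server” no longer answers requests arriving at that address, so the default configuration listed under the “Main server” section is effectively ignored for them. Instead, Apache falls back to the first “VirtualHost” block listed for that address whenever a request matches no “ServerName.” So, if you would like to display the default website when there are no matching virtual hosts for a particular visitor request, you’ll need to replicate the settings of the “main server” as the first “virtual host,” as shown below: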

NameVirtualHost 127.0.0.1:80

# settings for the default Web site
<VirtualHost 127.0.0.1:80>
    DocumentRoot /usr/local/apache/htdocs/
    ServerName www.site.com
    ErrorLog logs/error.log
    CustomLog logs/access.log common
</VirtualHost>

# settings for http://beta.mysite.com
<VirtualHost 127.0.0.1:80>
    DocumentRoot /usr/local/apache/htdocs/beta
    ServerName beta.mysite.com
    ServerAdmin admin-beta@mysite.com
    ErrorLog logs/mysite-beta-error.log
    CustomLog logs/mysite-beta-access.log common
</VirtualHost>

# settings for http://www.mysite.com

// snip

That was a quick overview of how to configure “Virtual Hosts” on your Web server and I’ll admit that I have only touched the tip of the iceberg above. So, if you have more complex requirements, the following URL should help: http://httpd.apache.org/docs-2.0/vhosts/.

End Game

That’s about it for this part of the Apache series. Today, I started with a quick overview of the default server configuration “directives” that serve as a precursor to the section on “Virtual Hosting.” Next, I spoke about the different log files generated by the Apache Web server and showed you how to customize these log files for your requirements. Finally, I explained how you could implement “name-based Virtual Hosting,” an interesting feature that gives you the ability to host several websites (each with a unique URL) on a single Web server.

In the next part of this series, I shall show you how to configure the Apache server as a proxy, talk a little about URL re-writing (a powerful feature that allows you to “re-write” requests to the Web server), configure user-specific directories on your Web server and much more.

Till then, ciao and take care!

Note: All examples in this article have been tested on Linux/i586 with Apache 2.0.52, MySQL 3.23 and PHP 5.0.3. Examples are illustrative only, and are definitely NOT meant for a production environment.
