Site Search with HTDIG

Want to add a search engine to your Web site but don’t know how? Well, today’s your lucky day! In this tutorial, find out how to obtain, install and use the popular ht://Dig indexing engine to add powerful, effective search capabilities to your site with minimal time and fuss.



Today, when you think about looking for something on the Web, there’s only one search engine you turn to – Google. With its speed, unique indexing technology and huge database of Web pages, Google has rapidly become the best search engine on the Web, with results that are frighteningly accurate and search algorithms that are optimized for the hyperlinked, diversified information structure of the Web.

However, this isn’t an article about Google – more than enough has already been written about it, by people far more experienced and knowledgeable than yours truly. Rather, this is about something related – setting up a search engine for your own Web site, so that users can locate what they’re looking for quickly and efficiently. If you’ve ever attempted this exercise, you know that it can take anywhere from a few hours to a couple of weeks, depending on the requirements and the amount of precision needed.

There are many ways to index the content of your site. You could store the content in a database, index it and use SQL queries to look for records matching the search string. You could scan the site content to build word frequency tables, and use those tables to locate matching pages. You could use a natural-language or fuzzy search engine to create an index for your site and return results scored by relevance. Or you could save yourself a lot of development time and effort, and just install ht://Dig.

What is ht://Dig? Good question. Come on in and find out.

{mospagebreak title=Digging Deep}

In the words of its official website, ht://Dig is “a complete world wide web indexing and searching system for a domain or intranet…meant to cover the search needs for a single company, campus, or even a particular sub section of a web site.” ht://Dig was originally developed at San Diego State University, and is today very popular amongst developers looking to quickly add search engine capabilities to a Web site.

ht://Dig works by traversing a Web site and creating a database of all the unique words it finds as it follows hyperlinks from one page to another. This database, together with information on the URL associated with each document, is created every time you request a re-indexing of the site, and is merged with the results of previous index runs to create the foundation for the search engine.

Every time a search is executed, this database is scanned for matches to the search string and a list of results retrieved. The matches are further ranked according to an internal scoring system to filter down to the most relevant, and the results returned to the user, together with links to the pages on which the matches occurred. The process, though somewhat complicated, is nonetheless extremely fast and — thanks to intelligent search algorithms and scoring systems — also very accurate.
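To make the idea a little more concrete, here’s a toy word index built with standard shell tools. This is only a sketch of the general technique described above; it says nothing about ht://Dig’s real database format or its scoring system, and the “build_index” and “lookup” function names are invented for this example.

```shell
#!/bin/sh
# Toy word index: NOT ht://Dig's real format, just the general idea.
# build_index and lookup are hypothetical helper names.

# build_index FILE...
# Prints one "word<TAB>file" line for every unique word in each file.
build_index() {
    for f in "$@"; do
        # Blank out anything that looks like an HTML tag, split the text
        # into one word per line, lowercase it, and keep each word once.
        sed 's/<[^>]*>/ /g' "$f" |
            tr -cs '[:alpha:]' '\n' |
            tr '[:upper:]' '[:lower:]' |
            sort -u |
            awk -v file="$f" 'NF { print $0 "\t" file }'
    done
}

# lookup INDEXFILE WORD
# Prints the files whose word list contains WORD (case-insensitive).
lookup() {
    w=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    awk -F '\t' -v w="$w" '$1 == w { print $2 }' "$1"
}
```

Build the index once with “build_index *.html > index.txt”, then call “lookup index.txt someword” as often as you like; the expensive scan happens at index time, not at search time, which is the same trade-off ht://Dig makes.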

ht://Dig also supports Boolean searches, which make it possible to selectively widen or narrow a search; fuzzy searching, in which the search is automatically expanded to include similar-sounding words, synonyms and plurals; depth-limited searching, in which only documents which are at a particular depth from the tree root are searched; and META-tag indexing for more accurate search results. Both search and result pages can be extensively customized in the ht://Dig system and, since the source code is freely available under the GPL, developers can even modify and enhance the application to suit their own specific needs.

Now that you have the background – let’s get to work, by installing and configuring ht://Dig.

{mospagebreak title=Source Control}

First, you’ll need to install ht://Dig on the Linux box you plan to use as a Web server. Drop by the official ht://Dig website and get yourself the latest stable release of the software (this tutorial uses ht://Dig 3.1.6). Note that you will need a C compiler and a running Web server in order to use the software (this tutorial uses GCC 3.2 and Apache 1.3.26).

Once you’ve downloaded the source code archive to your Linux server, log in as “root”,
[code]
$ su -
Password: ****
[/code]

and extract the source to a temporary directory.

[code]
$ cd /tmp
$ tar -xzvf /home/me/htdig-3.1.6.tar.gz
[/code]

The next step is to configure the package using the provided “configure” script. Before doing this, though, there are a couple of decisions you need to make.

There are two primary components to ht://Dig: the binaries used to index the site and create the database of search words, and the program used to perform a search on this database and return a result set. The indexing tools, and the database that results from their use, can be placed anywhere in the filesystem, but the search binary must be located in the Web server’s CGI directory.

Additionally, the images used in the result page created after an ht://Dig search must also be located under the Web server root, so that they appear correctly when the page is viewed through a Web browser (assuming, of course, that you’re using the default result page templates).

Given this information, and assuming the Web server is located in “/usr/local/apache/”, the server’s CGI area is “/usr/local/apache/cgi-bin/” and the server’s document root is “/usr/local/apache/htdocs/”, you will need to give the “configure” script the following arguments:

[code]
$ cd /tmp/htdig-3.1.6
$ ./configure --prefix=/usr/local/htdig \
    --with-cgi-bin-dir=/usr/local/apache/cgi-bin/ \
    --with-image-dir=/usr/local/apache/htdocs/htdig/images \
    --with-image-url-prefix=/htdig/images \
    --with-search-dir=/usr/local/apache/htdocs/htdig/sample
[/code]

This tells the system to install the indexing tools to “/usr/local/htdig/”, the CGI search binary to “/usr/local/apache/cgi-bin/”, and the result page images and a sample search form to directories under “/usr/local/apache/htdocs/htdig/”.

{mospagebreak title=Script Barf}

In case the “configure” script barfs and spits messages at you about “installing the libstdc++ library”, and if you’re sure the library is already installed (the default situation if you’re using GCC 3.x), you can try modifying the command above to include some additional variables:

[code]
$ cd /tmp/htdig-3.1.6
$ CXXFLAGS=-Wno-deprecated CPPFLAGS=-Wno-deprecated ./configure \
    --prefix=/usr/local/htdig \
    --with-cgi-bin-dir=/usr/local/apache/cgi-bin/ \
    --with-image-dir=/usr/local/apache/htdocs/htdig/images \
    --with-image-url-prefix=/htdig/images \
    --with-search-dir=/usr/local/apache/htdocs/htdig/sample
[/code]

Next, compile and install it.

[code]
$ make
$ make install
[/code]

ht://Dig should now have been installed to the directory “/usr/local/htdig”.

You can verify this with a quick recursive listing of that directory – here’s what you should see.

[code]
$ ls -lR /usr/local/htdig/
total 16
drwxr-xr-x 2 root root 4096 Oct 15 18:32 bin/
drwxr-xr-x 2 root root 4096 Oct 15 18:39 common/
drwxr-xr-x 2 root root 4096 Oct 15 18:32 conf/
drwxr-xr-x 2 root root 4096 Oct 15 18:44 db/

/usr/local/htdig/bin:
total 2860
-rwxr-xr-x 1 root root 580424 Oct 15 18:32 htdig*
-rwxr-xr-x 1 root root 580424 Oct 15 18:32 htdump*
-rwxr-xr-x 1 root root 390930 Oct 15 18:32 htfuzzy*
-rwxr-xr-x 1 root root 580424 Oct 15 18:32 htload*
-rwxr-xr-x 1 root root 381489 Oct 15 18:32 htmerge*
-rwxr-xr-x 1 root root 376361 Oct 15 18:32 htnotify*
-rwxr-xr-x 1 root root 2158 Oct 15 18:32 rundig*

/usr/local/htdig/common:
total 6248
-rw-r--r-- 1 root root 84 Oct 15 18:32 bad_words
-rw-r--r-- 1 root root 923308 Oct 15 18:32 english.0
-rw-r--r-- 1 root root 5756 Oct 15 18:32 english.aff
-rw-r--r-- 1 root root 197 Oct 15 18:32 footer.html
-rw-r--r-- 1 root root 891 Oct 15 18:32 header.html
-rw-r--r-- 1 root root 194 Oct 15 18:32 long.html
-rw-r--r-- 1 root root 1404 Oct 15 18:32 nomatch.html
-rw-r--r-- 1 root root 2285568 Oct 15 18:39 root2word.db
-rw-r--r-- 1 root root 67 Oct 15 18:32 short.html
-rw-r--r-- 1 root root 14481 Oct 15 18:32 synonyms
-rw-r--r-- 1 root root 90112 Oct 15 18:39 synonyms.db
-rw-r--r-- 1 root root 1275 Oct 15 18:32 syntax.html
-rw-r--r-- 1 root root 3022848 Oct 15 18:39 word2root.db
-rw-r--r-- 1 root root 1108 Oct 15 18:32 wrapper.html

/usr/local/htdig/conf:
total 12
-rw-r--r-- 1 root root 8580 Oct 15 18:42 htdig.conf

/usr/local/htdig/db:
total 236
-rw-r--r-- 1 root root 63488 Oct 15 18:44 db.docdb
-rw-r--r-- 1 root root 11991 Oct 15 18:42 db.docs
-rw-r--r-- 1 root root 5120 Oct 15 18:44 db.docs.index
-rw-r--r-- 1 root root 54004 Oct 15 18:44 db.wordlist
-rw-r--r-- 1 root root 82944 Oct 15 18:44 db.words.db
[/code]

{mospagebreak title=The Search Binary}

The search binary should have been installed to “/usr/local/apache/cgi-bin/htsearch”,

[code]
$ ls -l /usr/local/apache/cgi-bin
total 560
-rwxr-xr-x 1 root root 558796 Oct 15 18:32 htsearch*
-rw-r--r-- 1 root root 268 Aug 18 16:37 printenv
-rw-r--r-- 1 root root 757 Aug 18 16:37 test-cgi
[/code]

with a sample search form and images to “/usr/local/apache/htdocs/htdig/”.

For an explanation of what each binary does, visit the ht://Dig documentation at http://www.htdig.org/.

Once you’ve got ht://Dig installed, the next step is to configure it and start indexing your site. Let’s look at that next.

{mospagebreak title=Building An Index}

ht://Dig is configured via a single configuration file, named “htdig.conf” and located in the installation’s “conf” directory. Most of the time, this configuration file is set up automatically based on the arguments you passed to the “configure” script, and only needs to be altered to reflect the URL at which indexing should begin.

Pop open this file in your favourite text editor, and look for the “start_url” variable:

[code]
#
# This specifies the URL where the robot (htdig) will start. You can specify
# multiple URLs here. Just separate them by some whitespace.
# The example here will cause the ht://Dig homepage and related pages to be
# indexed.
# You could also index all the URLs in a file like so:
# start_url: `${common_dir}/start.url`
#
start_url: http://localhost/
[/code]

Alter this variable to reflect the URL at which indexing should begin, and save the changes back to the file.
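If you’d rather script this change than make it by hand, something like the following should do. It’s a minimal sketch which assumes the file contains a single uncommented “start_url:” line, as the stock file does; the “set_start_url” helper and the example.com address are my own inventions. It prints the modified file rather than editing it in place, so you can inspect the result first.

```shell
#!/bin/sh
# set_start_url CONF URL
# Prints CONF with its uncommented "start_url:" line replaced by URL.
# Hypothetical helper; assumes start_url sits on a single line.
set_start_url() {
    awk -v url="$2" '
        /^start_url:/ { print "start_url: " url; next }  # replace the live setting
        { print }                                        # pass everything else through
    ' "$1"
}

# Example (paths as used in this tutorial):
# cp /usr/local/htdig/conf/htdig.conf /tmp/htdig.conf.orig
# set_start_url /tmp/htdig.conf.orig http://www.example.com/ \
#     > /usr/local/htdig/conf/htdig.conf
```

Commented-out lines such as the “# start_url:” example in the file’s own comments are left alone, because the pattern only matches lines that begin with “start_url:”.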

{mospagebreak title=Variable Control}

You can also alter a number of other variables that control ht://Dig behaviour through the configuration file. Amongst other things, you can modify the location for the search database, specify a list of URLs and extensions to be bypassed while indexing, enable or disable the fuzzy logic algorithms, limit the amount of content stored in the search database and control the maximum amount of data read over an HTTP connection.
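Since most of “htdig.conf” is commentary, it helps to see which settings are actually active before you start changing them. Here’s a minimal sketch that strips out the comments and blank lines; “active_settings” is a name I made up for this example, and it makes no attempt to understand values that continue across several lines.

```shell
#!/bin/sh
# active_settings CONF
# Prints only the non-comment, non-blank lines of a configuration file.
# Hypothetical helper; multi-line values are not treated specially.
active_settings() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Example:
# active_settings /usr/local/htdig/conf/htdig.conf
```

Running this before and after an edit, and comparing the two outputs, is a quick way to confirm that you changed only what you meant to.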

The next step is to actually build the search database. As noted previously, when indexing a Web site, ht://Dig recursively spiders the site(s) and builds an index of all the unique words it finds. This process is activated via the “rundig” script, found in the installation’s “bin” directory:

[code]
$ /usr/local/htdig/bin/rundig
New server: localhost, 80
0:0:0:http://localhost/: +* size = 487
1:1:1:http://localhost/company/: -+++* size = 2867
2:2:2:http://localhost/services/: -***+++++- size = 5219
...
htmerge: Sorting...
htmerge: Merging...
htmerge: 100:creative
htmerge: 200:good
htmerge: 300:online
htmerge: 400:specifically
...
htfuzzy/endings: words: 13200
htfuzzy/endings
htfuzzy/synonyms: 1519 worshipping
htfuzzy/synonyms: Done.
htfuzzy: Done.
[/code]

The “rundig” script looks up the configuration file to figure out which URL to use as the root for indexing, and begins traversing and scanning the pages under that URL.

Once it’s done, the search database will have been created (in the installation’s “db” directory) and is ready for use. The next step is to integrate the ht://Dig search form and form processor into the Web site.

{mospagebreak title=A Well-Formed Plan}

When ht://Dig is first installed, a sample search form is automatically installed into the directory specified via the “--with-search-dir” configuration parameter. In this particular example, I had specified that the form be installed to “/usr/local/apache/htdocs/htdig/sample” – so trot on over there and take a look inside:

[code]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>ht://Dig WWW Search</title>
</head>
<body bgcolor="#eef7ff">
<h1>
<a href="http://www.htdig.org"><IMG SRC="/htdig/images/htdig.gif"
align="bottom" alt="ht://Dig" border="0"></a>
WWW Site Search</h1>
<hr noshade size="4">
This search will allow you to search the contents of
all the publicly available WWW documents at this site.
<br>
<p>
<form method="post" action="/cgi-bin/htsearch">
<font size="-1">
Match: <select name="method">
<option value="and">All
<option value="or">Any
<option value="boolean">Boolean
</select>
Format: <select name="format">
<option value="builtin-long">Long
<option value="builtin-short">Short
</select>
Sort by: <select name="sort">
<option value="score">Score
<option value="time">Time
<option value="title">Title
<option value="revscore">Reverse Score
<option value="revtime">Reverse Time
<option value="revtitle">Reverse Title
</select>
</font>
<input type="hidden" name="config" value="htdig">
<input type="hidden" name="restrict" value="">
<input type="hidden" name="exclude" value="">
<br>
Search:
<input type="text" size="30" name="words" value="">
<input type="submit" value="Search">
</form>
<hr noshade size="4">
</body>
</html>
[/code]

{mospagebreak title=What You See}
When you view this form through your Web browser, you should see something like this:

[Screenshot: the sample ht://Dig search form]

Enter a search string into the form field, and ht://Dig should go to work processing your search request. Here’s what the result looks like:

[Screenshot: ht://Dig search results in “long” format]

Needless to say, you can customize this output, and even the manner in which the search is carried out. If, for example, you tell ht://Dig to display the results in “short” rather than “long” format, you’ll see something like this:

[Screenshot: ht://Dig search results in “short” format]



You can also perform a Boolean search, simply by selecting “Boolean” from the drop-down list:

[Screenshot: a Boolean search in ht://Dig]

{mospagebreak title=Custom Job}

ht://Dig allows you to customize both the search form, and the result page generated from a query. In order to demonstrate, I’ll create a plain-vanilla search form, called “search.html”, which looks like this:

[code]
<form method="post" action="/cgi-bin/htsearch">

<input type="text" name="words" size="15">

<input type="submit" value="Begin Search">

</form>
[/code]

There are a couple of important things to note here. The first is the ACTION attribute of the <FORM> element, which must point to the “htsearch” utility located in the Web server’s CGI directory – you’ll remember that this location was specified when configuring the software.

[code]
<form method="post" action="/cgi-bin/htsearch">
...
</form>
[/code]

The second is the search box itself – note that this element must be named “words”, so that the “htsearch” utility knows to use the data within it as the search string.

[code]
<input type="text" name="words" size="15">
[/code]

A number of other variables may also be set in this form to control the behavior of “htsearch” – here’s a brief list:

VARIABLE          WHAT IT MEANS

config            sets the name of the configuration file to use
matchesperpage    sets number of records per result page
method            sets type of search (any word, all words, Boolean)
exclude           if set, excludes URLs matching this pattern from the result set
restrict          if set, includes only those URLs matching this pattern in the result set
sort              sorting method for result set

{mospagebreak title=Controlled Behavior}

Here’s an example:

[code]
<form method="post" action="/cgi-bin/htsearch">

<input type="text" name="words" size="15">

<input type="hidden" name="format" value="builtin-long">

<input type="hidden" name="matchesperpage" value="25">

<input type="hidden" name="method" value="and">

<input type="hidden" name="sort" value="score">

<input type="submit" value="Begin Search">

</form>
[/code]

More information on what these variables mean can be found in the ht://Dig documentation, at http://www.htdig.org/. For a working example, refer to the sample form installed by the software (as discussed on the previous page).
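When you’re testing, it’s often quicker to skip the form altogether and send the same fields in a query string. The sketch below just builds such a URL; it assumes your “htsearch” accepts GET requests as well as POST, and the “search_url” helper is hypothetical. Only spaces are encoded, so stick to simple search words.

```shell
#!/bin/sh
# search_url BASE WORDS [METHOD] [FORMAT]
# Prints a GET URL carrying the same fields as the form above.
# Hypothetical helper; only spaces in WORDS are encoded (as "+").
search_url() {
    base=$1
    words=$(printf '%s' "$2" | tr ' ' '+')
    method=${3:-and}            # "and", "or" or "boolean"
    format=${4:-builtin-long}   # "builtin-long" or "builtin-short"
    printf '%s?words=%s&method=%s&format=%s\n' \
        "$base" "$words" "$method" "$format"
}

# Example (needs a running Web server, so it is left commented out):
# curl -s "$(search_url http://localhost/cgi-bin/htsearch 'web design' or builtin-short)"
```

The field names and values are the same ones used in the forms above, so anything that works here should behave identically once it’s wired into a real form.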

{mospagebreak title=Out With The Old}

Once the form is submitted, the “htsearch” utility takes over and queries the database to retrieve a list of pages matching the query string. This utility also takes care of generating the result page, as per the formatting parameters specified. In case you’d like to control the appearance of this page, ht://Dig allows you to customize the result page templates as well, though the process is somewhat convoluted.

  • You can alter the header and footer of the result page by setting the “search_results_header” and “search_results_footer” variables in the configuration file to point to custom header and footer files. Alternatively, simply customize the application’s default “header.html” and “footer.html” files, installed to the application’s “common” directory.

  • You can customize the page that appears in case a syntax error is found in a Boolean search, by modifying the file “syntax.html” in the installation’s “common” directory. Alternatively, create your own file and tell ht://Dig where to find it by setting the “syntax_error_file” variable in the configuration file.

  • You can tell ht://Dig to display a custom page if the search produces zero matches, by setting the “nothing_found_file” variable in the configuration file, or by altering the “nomatch.html” template in the installation’s “common” directory.

  • By default, ht://Dig ships with two built-in templates, “builtin-long” and “builtin-short”, which are used to display the results found during a search. You can modify the results page to conform to the design of your site by altering these templates, named “long.html” and “short.html” respectively, or use your own templates by setting the “template_map” variable in the configuration file.

All these templates can contain special ht://Dig variables that will dynamically be replaced with actual values when a search is performed.

While on the topic, here’s a quick tip: if you’re setting up a simple search engine, or if you’re pressed for time, I’d advise you to simply modify the default templates that ship with the application, rather than creating new templates and configuring the system to recognize them.

{mospagebreak title=Caveat Emptor}

Thus far, the examples have assumed a Web site consisting of static HTML pages as the base for ht://Dig’s indexing routines. But in today’s interactive Web, such Web sites are far less common than database-backed, highly-interactive and content-rich portals. How does ht://Dig do when faced with one of these?

The answer, not surprisingly, is quite well. You don’t need to do anything special to get ht://Dig to index a database-driven site – simply give it the starting URL as usual, and the program will take care of traversing the dynamically-generated content and building an index.

One thing to remember here, however, is that since such sites change frequently, it’s a good idea to recreate the ht://Dig database on a periodic basis to ensure that the changes are reflected in the search database, and to ensure that users always get the most accurate results from the system. This can easily be accomplished by adding a “cron” job to execute the “rundig” script on a periodic basis – perhaps once every day around midnight, so that users aren’t impacted too much by the temporary performance drag as the index is regenerated.
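Here’s a rough sketch of what such a job might look like: a small wrapper that runs “rundig” and keeps a log, plus a sample crontab line. The log location, the “reindex.sh” file name and the “--run” switch are assumptions of mine, not part of ht://Dig, so adjust them to suit your own setup.

```shell
#!/bin/sh
# reindex.sh: hypothetical cron wrapper around rundig.
# RUNDIG and LOG are assumed locations; override them as needed.
RUNDIG=${RUNDIG:-/usr/local/htdig/bin/rundig}
LOG=${LOG:-/var/log/htdig-reindex.log}

# reindex
# Runs rundig once, appending its output and a start/finish line to LOG.
reindex() {
    echo "reindex started: $(date)" >> "$LOG"
    if "$RUNDIG" >> "$LOG" 2>&1; then
        echo "reindex finished: $(date)" >> "$LOG"
    else
        status=$?
        echo "reindex FAILED with status $status: $(date)" >> "$LOG"
        return "$status"
    fi
}

# Only index when asked to, so the file can also be sourced safely.
if [ "${1:-}" = "--run" ]; then
    reindex
fi

# Sample crontab entry, fifteen minutes past midnight every day:
# 15 0 * * * /usr/local/htdig/bin/reindex.sh --run
```

Logging the start and finish times also tells you how long a full index run takes, which is useful when deciding how often you can afford to schedule it.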

Previous examples have also assumed that ht://Dig was being used to index a single site. If you’d like to index multiple sites, the ht://Dig FAQ suggests two ways to accomplish this. Door #1 involves indexing everything into a single database, and then using the “restrict” and “exclude” parameters in the search form to constrain searches on a per-site basis. Door #2 involves creating a separate database for each site (through separate configuration files) and telling “htsearch” which configuration file, and hence which database, to use through the “config” parameter in the search form. Either way, when dealing with such sites, it’s also a good idea to configure ht://Dig to archive smaller descriptions for each page, so as to reduce the disk space taken up by the search database. See the ht://Dig online FAQ for more information on how to do this.
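If you pick door #2, the indexing run has to be repeated once for each configuration file. The sketch below loops over a list of them. It assumes your copy of “rundig” accepts a “-c” option naming the configuration file; if it doesn’t, the “htdig” and “htmerge” programs take the same option and can be called directly. The “index_sites” helper and the “site1.conf” and “site2.conf” file names are placeholders of my own.

```shell
#!/bin/sh
# index_sites CONF...
# Runs a full index for each configuration file in turn, stopping at the
# first failure. Hypothetical helper; assumes rundig understands "-c".
index_sites() {
    for conf in "$@"; do
        echo "Indexing with $conf"
        "${RUNDIG:-/usr/local/htdig/bin/rundig}" -c "$conf" || return 1
    done
}

# Example:
# index_sites /usr/local/htdig/conf/site1.conf /usr/local/htdig/conf/site2.conf
```

Each site’s search form then carries a hidden “config” field naming the configuration to use. As far as I know, “htsearch” expects the file name without its “.conf” extension there, so check the documentation for your version.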

{mospagebreak title=Ending The Dig}

And that’s about it for this article. Over the last few pages, I introduced you to the ht://Dig indexing system, explaining its important features and guiding you through the process of compiling and installing it on your Linux box. With the tools installed, I then showed you how to configure it for your specific site hosting needs, and how to actually begin indexing a Web site.

With the index created, I then moved on to a discussion of the front-end interface, explaining how to build a search form to capture user queries, and pass those queries on to the ht://Dig search utility through CGI. I also demonstrated the process of altering both the search form and the search results page to blend in with the design and aesthetics of your own site design. Finally, I showed you how you could use ht://Dig to index a content-heavy database-driven site, as opposed to the standard static pages used in previous examples.

However, everything I’ve discussed in this article is only the tip of the iceberg – ht://Dig can handle more than just the common scenarios discussed in this article, and if you’re serious about using it on your Web site, you should also take a look at the following links:

The ht://Dig Web site, at http://www.htdig.org/

The ht://Dig FAQ, at http://www.htdig.org/FAQ.html

The ht://Dig configuration variable reference, at http://www.htdig.org/confindex.html

The ht://Dig mailing list, at http://www.htdig.org/mailarchive.html

ConfigDig, at http://configdig.sourceforge.net/

A number of other alternatives also exist to ht://Dig – take a look at the following links to learn more about some of them:

PhpDig, at http://phpdig.toiletoine.net/

iSearch, at http://www.digvid.info/isearch/home.php

mnoGoSearch, at http://www.mnogosearch.org/

And until next time…happy searching!

Note: Examples are illustrative only, and are not meant for a production environment. Melonfire provides no warranties or support for the source code described in this article.
