PHP

Search This!
By: Colin Viebrock
    1999-03-15


    Table of Contents:
  • Search This!
  • Configuring ht://Dig
  • Indexing the Site
  • Building the Search Page
  • Performing the Search
  • Displaying the Results



    Search This! - Configuring ht://Dig
    (Page 2 of 6 )

    Before going any further, it helps to have a clear idea of how ht://Dig works. There is a very nice explanation on the ht://Dig website (follow the link to "How it Works" on the left side), but here's the abridged version.

    The ht://Dig system performs three major tasks, which should be run in the following order:

    Digging

    Before you can search, a database of all the documents that need to be searched has to be created.

    Merging

    Once the document database has been created, it has to be converted to something that can be searched quickly. Also, if you want to only update changed documents, these changes have to be merged into the searchable database.

    Even though this task could be performed at the same time as digging, it is a separate process for efficiency reasons. It also gives more flexibility to what actually happens at merge time.

    Searching

    Finally, the databases that were created in the previous steps can be used to perform actual searches. Normally, searches will be invoked by a CGI program which gets its input from the user through an HTML form ... but we'll be doing it all with PHP3!

    So, let's see what the installation created, shall we?


    % cd /usr/local/htdig/
    % ls
    bin  common  conf  db

    The bin/ directory contains the executables required by ht://Dig. Configuration files are in conf/ (believe it or not!) and I think you can guess where the database files are.

    [Note: I'm going to assume from now on that you installed ht://Dig in the /usr/local/htdig directory as shown above, and that the path to your website files is /www, probably a symlink to /usr/local/htdocs.]

    If your web server is like mine, you're probably running several web sites off the same machine. In this case, I've found it useful to have separate configuration files for each site. ht://Dig installed a sample configuration file called htdig.conf, so I recommend you copy it to a new filename and make some changes. Here is the sw98.conf file I used for the SummerWorks Theatre Festival's website:


    database_dir: /usr/local/htdig/db/sw98
    start_url: http://www.summerworks.on.ca/
    limit_urls_to: http://www.summerworks.on.ca/
    exclude_urls: /staff/ /search/ .inc .doc .mcw
    search_algorithm: exact:1.0 endings:0.3
    matches_per_page: 50
    excerpt_length: 200
    template_map: sw98 sw98 /www/summerworks/search/results-template.html
    search_results_header: /www/summerworks/search/results-header.html
    search_results_footer:
    nothing_found_file: /www/summerworks/search/results-nomatch.html
    syntax_error_file: /www/summerworks/search/results-syntaxerror.html
    star_image: http://www.summerworks.on.ca/gifs/x-star.gif
    star_blank: http://www.summerworks.on.ca/gifs/x-nostar.gif
    max_stars: 5
    max_doc_size: 100000
    valid_punctuation: .-_/!#$%^&*'«»"

    What do all these settings mean? Some are obvious, but the following probably are not:

    database_dir:
    Normally all database files are in the /usr/local/htdig/db/ directory. To keep things a bit cleaner, I've made sub-directories for all the sites on my server, so this is the sub-directory for the SummerWorks site. You'll need to make this directory manually before you can index the site:


    % cd /usr/local/htdig/db
    % mkdir sw98

    exclude_urls:
    There are some files we don't want to search. If a URL contains any of the space-separated patterns, it will not be indexed. In this case, I didn't want the staff pages or the search pages indexed. I also wanted it to ignore .inc files (common PHP code that I include on various pages), and some Word documents for PC and Mac. On the real SummerWorks site, there are a couple more exclusions, but you get the idea (I hope!).
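To make the matching rule concrete, here is a minimal sketch of the substring test described above, written in current PHP rather than PHP3. The function name is_excluded() is hypothetical and is not part of ht://Dig; the digger applies the rule internally. Case-sensitive matching is an assumption here.

```php
<?php
// Hypothetical helper: mimics the substring test described for exclude_urls.
// It is not part of ht://Dig; the digger does this itself while indexing.
function is_excluded(string $url, string $patterns): bool
{
    // Split the attribute value on whitespace, dropping empty pieces.
    foreach (preg_split('/\s+/', trim($patterns), -1, PREG_SPLIT_NO_EMPTY) as $pattern) {
        if (strpos($url, $pattern) !== false) {
            return true; // the URL contains this pattern, so it would be skipped
        }
    }
    return false;
}

$patterns = '/staff/ /search/ .inc .doc .mcw';
var_dump(is_excluded('http://www.summerworks.on.ca/staff/index.html', $patterns)); // bool(true)
var_dump(is_excluded('http://www.summerworks.on.ca/shows/index.html', $patterns)); // bool(false)
```

Because the test is a plain substring match, a short pattern such as .doc also excludes any URL that merely contains that text somewhere in its path.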

    search_algorithm:
    ht://Dig lets you turn "fuzzy" searching on or off. When it's on, people who search for "play", for example, will also find things like "plays", "player", "players", etc. This slows down the digging process a fair amount, but can be useful. The different weights applied to each algorithm mean, in this case, that a search for "play" will turn up "player", but not as high in the results as an exact match for "play".

    template_map:
    This makes a new template called "sw98". More on templates below.

    search_results_header, search_results_footer, nothing_found_file, and syntax_error_file:
    These are files used by the template, and parsed by PHP. More below.

    max_doc_size:
    This is bigger than usual, since there are some big pages on the site. If you find that some of your pages aren't being indexed, setting this to a higher value will often solve the problem.

    valid_punctuation:
    I basically only added the French quotes and the single quote to the list of valid punctuation. This is the set of characters that are deleted from a document before ht://Dig determines what a word is. If a document contains something like "Andrew's", the digger will see it as "Andrews".

    The same transformation is performed on the keywords sent to the search engine.
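The effect of valid_punctuation can be sketched in a few lines of current PHP (not PHP3). The function strip_valid_punctuation() is a hypothetical name used only for illustration; ht://Dig performs this step itself, both while digging and on the search keywords. The sketch assumes the script is saved as UTF-8 so that the French quotes are treated as single characters.

```php
<?php
// Hypothetical helper illustrating the effect of valid_punctuation.
// ht://Dig does this internally; this is only a model of the behaviour.
function strip_valid_punctuation(string $word, string $punctuation): string
{
    // Split the list into individual (possibly multi-byte) characters.
    $chars = preg_split('//u', $punctuation, -1, PREG_SPLIT_NO_EMPTY);
    // Delete every listed character from the word.
    return str_replace($chars, '', $word);
}

// Same character list as the valid_punctuation line in sw98.conf.
$punctuation = '.-_/!#$%^&*\'«»"';
echo strip_valid_punctuation("Andrew's", $punctuation), "\n"; // Andrews
echo strip_valid_punctuation('«théâtre»', $punctuation), "\n"; // théâtre
```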

    There are plenty of other settings you can set in the .conf file. For a definitive list with explanations, check out the ht://Dig website.

    Templates

    ht://Dig makes extensive use of templates for its output. These templates are (usually) plain HTML documents containing variables, which are substituted with the search results.

    There are four "standard" template settings in the .conf file:

    search_results_header:
    This specifies a filename to be output at the start of search results.

    search_results_footer:
    This specifies a filename to be output at the end of search results.

    nothing_found_file:
    This specifies the file which contains the text to display when no matches were found.

    syntax_error_file:
    This points to the file which will be displayed if a boolean expression syntax error was found.

    There is also the result template file. This is the file referenced in the template_map attribute of the .conf file. This is where all the information you want the search to return is displayed.

    I said that these templates usually contain a bunch of HTML, with some variable substitution. When using ht://Dig with PHP, it would be much easier if a search returned the raw results of the search. You can then use PHP to parse and display that information however you want.

    The easiest way to do this is to not put any HTML in the template files, but instead just put in the list of variables you want substituted.

    So, make your results-header.html file contain only the following 4 lines:


    $(MATCHES)
    $(FIRSTDISPLAYED)
    $(LASTDISPLAYED)
    $(LOGICAL_WORDS)
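Once the header is reduced to four bare values, reading it from PHP is a matter of taking the first four lines of output. The following is a minimal sketch in current PHP (not PHP3). The function name parse_header() and the array keys are hypothetical, and treating the first three values as integers is an assumption based on the variable names.

```php
<?php
// Hypothetical helper: turns the four header lines into a keyed array.
// $headerLines is assumed to hold the first four lines of the search output.
function parse_header(array $headerLines): array
{
    $keys = ['matches', 'first_displayed', 'last_displayed', 'logical_words'];
    // Trim stray whitespace and line endings from each line.
    $values = array_map('trim', array_slice($headerLines, 0, 4));
    if (count($values) < 4) {
        return []; // not a complete header
    }
    $header = array_combine($keys, $values);
    // Assumption: the first three values are numeric, so cast them to integers.
    foreach (['matches', 'first_displayed', 'last_displayed'] as $key) {
        $header[$key] = (int) $header[$key];
    }
    return $header;
}

// Sample values are made up for illustration.
print_r(parse_header(["12\n", "1\n", "12\n", "play\n"]));
```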

    results-nomatch.html should contain only 1 line (note that it's not a variable):

    NOMATCH
    

    results-template.html consists of 4 lines:


    $(TITLE)
    $(URL)
    $(PERCENT)
    $(EXCERPT)

    Finally, results-syntaxerror.html consists of 2 lines:


    SYNTAXERROR
    $(SYNTAXERROR)
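The literal words NOMATCH and SYNTAXERROR act as markers that PHP can test for. Here is a minimal sketch in current PHP (not PHP3), assuming the marker is the first line of the output in those two cases. The function name classify_output() and the status strings are hypothetical.

```php
<?php
// Hypothetical helper: classifies raw search output by its first line.
// Assumption: in the no-match and syntax-error cases the marker comes first.
function classify_output(array $lines): array
{
    $first = isset($lines[0]) ? trim($lines[0]) : '';
    if ($first === 'NOMATCH') {
        return ['status' => 'nomatch', 'message' => ''];
    }
    if ($first === 'SYNTAXERROR') {
        // The second line carries the text substituted for $(SYNTAXERROR).
        $message = isset($lines[1]) ? trim($lines[1]) : '';
        return ['status' => 'syntaxerror', 'message' => $message];
    }
    return ['status' => 'results', 'message' => ''];
}

// The error text below is made up for illustration.
print_r(classify_output(["SYNTAXERROR\n", "Expected a search word\n"]));
```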

    It's important that there be no extra line breaks at the beginning or end of your files, since we're going to rely on fixed line positions: in the output built from results-template.html, for instance, every fourth line, starting with the first, is the title of a document.
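The fixed four-line layout is what makes the results easy to read back. As a minimal sketch in current PHP (not PHP3), the lines produced from results-template.html can be grouped into records of four. The function name parse_results() and the array keys are hypothetical, and the sketch assumes each substituted value occupies exactly one line.

```php
<?php
// Hypothetical helper: groups result lines into records of four fields,
// in the order used by results-template.html (title, URL, percent, excerpt).
function parse_results(array $lines): array
{
    $records = [];
    foreach (array_chunk(array_map('rtrim', $lines), 4) as $chunk) {
        if (count($chunk) < 4) {
            break; // incomplete trailing record, e.g. a stray blank line
        }
        [$title, $url, $percent, $excerpt] = $chunk;
        $records[] = [
            'title'   => $title,
            'url'     => $url,
            'percent' => (int) $percent,
            'excerpt' => $excerpt,
        ];
    }
    return $records;
}

// Sample lines are made up for illustration.
$lines = [
    "Home\n", "http://www.summerworks.on.ca/\n", "100\n", "Welcome to the festival\n",
    "About\n", "http://www.summerworks.on.ca/about.html\n", "60\n", "About the festival\n",
];
foreach (parse_results($lines) as $record) {
    echo $record['percent'], '% ', $record['title'], ' ', $record['url'], "\n";
}
```

A single extra blank line at the top of a template would shift every field by one position, which is why the templates must not contain stray line breaks.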

    All of the template files are in a sub-directory of the website called search/.

    If you want to find out what all the variables mean (and what other ones are available), check the ht://Dig site.



     
     



    © 2003-2010 by Developer Shed. All rights reserved. DS Cluster 8 Hosted by Hostway