So you've got a dynamic site, filled with all sorts of user input, whether it be a 'phorum' or something like my own site at http://www.knowpost.com. ht://dig will take care of indexing and searching your HTML pages, but if you are like me, you have very few HTML pages, and most of your "content" resides in BLOBs in your database. A LIKE '%searchword%' query won't do anything useful for you: the results just don't come back in any order of relevance.
There has to be a better way, and indeed there is, with a few easy steps. Here's how to slap one together:
Noise Reduction
The first problem with your content is that it is filled with clunky "noisewords" like "a", "the", "where" and "look": words that are there to help us humans communicate, but that really don't have anything to do with relevance. We gotta get rid of those. Essentially, what we're trying to do here is get all those noisewords out of your data and build a table with two columns: the word, and its identifier (the ID of the content it appears in). We want something that will eventually look like this:
+------+-------------+
| qid  | word        |
+------+-------------+
| 6    | links       |
| 5    | Fire        |
| 5    | topics      |
| 5    | related     |
| 5    | Shakespeare |
| 4    | people      |
| 4    | Knowpost    |
| 3    | cuba        |
| 3    | cigar       |
+------+-------------+
Let's create our table now:
mysql> CREATE TABLE search_table(
    word VARCHAR(50),
    qid INT
);
Next, since you want to index all your existing data, not just new data, we need to grab your sticky blobs and their identifiers out of your database:
<?php
/* $db is assumed to be an open mysqli connection */
/* Open the noise words into a lookup table, once, before the loop */
$noise_words = array_flip(array_map('strtolower',
    array_map('trim', file("noisewords.txt", FILE_IGNORE_NEW_LINES))));
/* blob is a reserved word in MySQL, so it needs backticks here */
$result = $db->query("SELECT `blob`, qid FROM your_table");
$insert = $db->prepare("INSERT INTO search_table (word, qid) VALUES (?, ?)");
while ($row = $result->fetch_assoc()) {
    /* Your "blob" */
    $body = $row["blob"];
    /* Your "identifier" */
    $qid = (int) $row["qid"];
    /* Strip the punctuation as plain strings; "?", "(", ")" and "." are
       regex metacharacters and must not be used as patterns */
    $filtered = str_replace(array(",", "?", "(", ")", "."), "", $body);
    /* Now we transform what's left into an array of words */
    $eachword = preg_split('/\s+/', $filtered, -1, PREG_SPLIT_NO_EMPTY);
    /* and finally let's go through the array, skip the noisewords, and
       place each remaining word into the database with its identifier */
    foreach ($eachword as $word) {
        if (isset($noise_words[strtolower($word)])) {
            continue;
        }
        /* The prepared statement takes care of quoting the word */
        $insert->bind_param("si", $word, $qid);
        $insert->execute();
    }
}
?>
That script just handles your old data. You'll want to include a similar function to strip the noisewords out every time new information comes into your database, through user input, your input, etc., so that your search engine is updated on the fly.
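One way to package that step is a small helper that turns a body of text into its list of keywords, ready to be inserted into search_table alongside the new row's identifier. This is only a sketch: the name extract_keywords and its parameters are made up for illustration, and it strips the same handful of punctuation marks as the script above.

```php
<?php
/* Hypothetical helper: returns the words of $body that are not noisewords.
   $noise_words is a plain array of words, e.g. from file("noisewords.txt"). */
function extract_keywords($body, $noise_words) {
    /* Lowercase the noise list once so the comparison ignores case */
    $noise = array_flip(array_map('strtolower', array_map('trim', $noise_words)));
    /* Drop the punctuation, then split on runs of whitespace */
    $clean = str_replace(array(",", "?", "(", ")", "."), "", $body);
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $keywords = array();
    foreach ($words as $word) {
        if (!isset($noise[strtolower($word)])) {
            $keywords[] = $word;
        }
    }
    return $keywords;
}

/* A tiny noise list, just for the demonstration */
$noise_words = array("a", "the", "where", "look");
print_r(extract_keywords("Look, where is the best Cuba cigar?", $noise_words));
```

With that short noise list, the call leaves "is", "best", "Cuba" and "cigar"; a real noisewords.txt would presumably catch "is" as well. Call the helper wherever new content is saved and insert one row per returned word.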
Searching the Table
Now you have an easy-to-use table of keywords and their associations. How do you query this table? Here's what I do:
First I format the search terms passed into the script as 'word1','word2','word3' and stick them in a string called $querywords.
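A minimal sketch of that formatting step might look like the following. The helper name build_querywords is hypothetical, and so is the idea of passing the escaping function in as a parameter; the point is simply that each term should be escaped before it is wrapped in quotes, since it comes straight from the user.

```php
<?php
/* Hypothetical helper: turns raw search input into 'word1','word2','word3'.
   $escape is whatever escaping function your database layer provides. */
function build_querywords($input, $escape) {
    /* Treat the punctuation as separators, then split on whitespace */
    $clean = str_replace(array(",", "?", "(", ")", "."), " ", $input);
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $quoted = array();
    foreach ($words as $word) {
        /* Escape before quoting so a stray ' cannot break out of the string */
        $quoted[] = "'" . call_user_func($escape, $word) . "'";
    }
    return implode(",", $quoted);
}

/* addslashes stands in for a real escaper here so the example runs
   without a database connection */
$querywords = build_querywords("cuba cigar o'brien", 'addslashes');
print $querywords;
```

With a mysqli connection in $db you would pass array($db, 'real_escape_string') instead of 'addslashes'. An empty search box gives an empty string, and IN () is not valid SQL, so check for that before running the query.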
Then I throw them into this SQL query:
SELECT COUNT(search_table.word) AS score, search_table.qid, your_table.blob
FROM search_table, your_table
WHERE your_table.qid = search_table.qid
AND search_table.word IN ($querywords)
GROUP BY search_table.qid
ORDER BY score DESC;
Set that query to $search, and print out the results like so:
<?php
/* $db is assumed to be an open mysqli connection */
$getresults = $db->query($search);
$resultsnumber = $getresults->num_rows;
if ($resultsnumber == 0) {
    print "Your search returned no results. Try other keyword(s).";
} else {
    print "Your search returned $resultsnumber results<BR>Listing them in order of relevance<BR><BR>";
    $cnote = 0;
    while ($row = $getresults->fetch_assoc()) {
        $qid = (int) $row["qid"];
        //tighten up the results, and escape them since they are user input
        $body2print = htmlspecialchars(substr($row["blob"], 0, 100));
        $cnote++;
        print "$cnote. <a href=\"yourcontent.php3?qid=$qid\"><i>$body2print...</i></a><BR>";
    }
}
?>
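If it isn't obvious why COUNT plus GROUP BY gives you a relevance ranking, the same arithmetic can be mimicked in plain PHP on a handful of rows. This is purely an illustration: rank_matches and the sample rows are invented, and the lowercase comparison assumes your MySQL column uses a case-insensitive collation.

```php
<?php
/* Hypothetical in-memory stand-in for search_table: array(qid, word) rows */
function rank_matches($rows, $searchwords) {
    $wanted = array_flip(array_map('strtolower', $searchwords));
    $scores = array();
    foreach ($rows as $row) {
        list($qid, $word) = $row;
        if (isset($wanted[strtolower($word)])) {
            /* Same idea as COUNT(word) ... GROUP BY qid */
            $scores[$qid] = isset($scores[$qid]) ? $scores[$qid] + 1 : 1;
        }
    }
    /* Same idea as ORDER BY score DESC; the keys (qids) are preserved */
    arsort($scores);
    return $scores;
}

$rows = array(
    array(5, "Fire"), array(5, "topics"), array(5, "Shakespeare"),
    array(4, "people"), array(3, "cuba"), array(3, "cigar"),
);
print_r(rank_matches($rows, array("cuba", "cigar", "fire")));
```

Content 3 matches two of the three search words and content 5 matches one, so 3 is listed first. A piece of content that repeats a search word also gets a higher score, because every occurrence is its own row in search_table.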
Presto, you've got keyword searching for your database, complete with relevancy ranking. It may not be Google or AltaVista. It may not support all those fancy boolean operators, or Excite's (*cough*) conceptual mapping technology. But it works, it's quick, and it's enough to handle your users' demands.
More By Clay Johnson