March 3, 2007

How does search work? Part V: Search Stopwords Redux

Filed under: Search, phpBB — Dave Rathbun @ 3:02 am Comments (2)

From this post earlier in the Search series I made this statement:

The changes also reduced the amount of time required to rebuild my search tables by 4%. As it turned out later on, the reduction in time was not because of the uniquefy process, but something completely unexpected. You will just have to come back for the next installment to find out what it was. 8)

The “uniquefy” process was based on the idea that when you’re getting ready to store your post words you don’t need duplicate words to be processed. They just take extra time for no benefit. So I wrote a MOD that would include logic to “uniquefy” the list of words from a post before they were processed any further. I thought that was where the 4% performance benefit was coming from. I was wrong.
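For anyone who hasn't read that earlier post, the heart of the "uniquefy" step is nothing more exotic than this (a minimal sketch of the idea, not the exact MOD code):

$words = explode(' ', trim($entry));

// Drop duplicate words so each one only has to be processed once.
$words = array_unique($words);

// Rebuild the space-delimited string (with the leading and trailing
// spaces the stopword replacement code relies on).
$entry = ' ' . implode(' ', $words) . ' ';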

I am now ready to reveal the secret. ;-)

This post was where I first started talking about the search “stopwords” file used by phpBB. If you want to go read it first, go ahead. I’ll wait right here.

Ah, you’re back, or perhaps you never left. In either case, here’s a very quick review of the purpose of the search_stopwords.txt file. It contains words that should never be added to your word index (phpbb_search_wordlist + phpbb_search_wordmatch) tables. By reviewing your search word index tables and identifying the most popular words, you can reduce the size of these tables and make search run quicker. Now here’s the ironic thing: that last statement is wrong.

Adding stopwords will – as discussed in the prior post – make your search word index tables smaller. But it actually will degrade the performance of posting and searching. So it’s a win for database size, but a loss for performance. At first glance this does not seem to make sense. Why would increasing the size of your stopwords file be a performance issue?

Current Stopwords Code
Here is a look at the code from includes/functions_search.php that handles the stopwords. This code is from version 2.0.22, which was current as I wrote this blog entry.

if ( !empty($stopword_list) )
{
  // Loop over every word in the stopwords file...
  for ($j = 0; $j < count($stopword_list); $j++)
  {
    $stopword = trim($stopword_list[$j]);

    // In search mode, keep the boolean operators 'not', 'and' and 'or'
    if ( $mode == 'post' || ( $stopword != 'not' && $stopword != 'and' && $stopword != 'or' ) )
    {
      // ... and strip it from the post text, whether it appears there or not
      $entry = str_replace(' ' . trim($stopword) . ' ', ' ', $entry);
    }
  }
}

The variable $entry contains a space-delimited string of all of the words from the post. In the prior blog entry mentioned at the start of this post I showed how to ensure that the list of words was unique, so I won't go back over that again. The $stopword_list is an array passed by reference into this function. If it contains at least one value, then this line of code is executed for each word in your stopwords file:

$entry = str_replace(' ' . trim($stopword) . ' ', ' ', $entry);

Let me say that again. The line of code shown above is executed for every single word in the stopwords text file. You read that right... the code is based on the size of your stopwords file, not the number of words actually contained in your post. So after going to all of the trouble to optimize the contents of $entry and make sure the list of words was unique, it really didn't matter. The remaining code still runs 287 str_replace() commands. (My stopwords file currently has 287 words in it.)

And the bad news doesn't stop there. The entire process is repeated to process your post subject as well! :shock: That's another 287 str_replace() operations, on what is no doubt an even shorter value for $entry.

Wouldn't it make more sense to process the other way around? To take out the stopwords that we know are present, rather than looking for every single possibility?

I have a big board that I love to use for analysis of stuff like this. I believe there is no substitute for real data when it comes to tuning or determining optimization strategies. This board has nearly 300K posts as I write this. The average number of words left after removing stopwords is 22. Twenty-two! That's 22 unique words per post, yet the entire stopwords file is checked. What the code shown above is doing is reading each word in the stopwords file and then removing it from the list of post words.

Even if the word isn't there. :shock:

To Cache or Not To Cache
I've put a lot of work into a caching system that is used by quite a few phpBB Doctor MODs. The caching system was designed so that I could use it in more than one MOD, making it easier for me to bundle MODs together without conflicts. Why cache?

When you run a query there are several steps:

  • Connect to the database
  • Execute the query
  • Read the results from disk

When you get right down to it, the database is just a filter for reading a disk file. What if you don't want a filter... what if you want to read the entire file?

  • Open the file
  • Read it

Much more efficient. Note that the efficiencies are lost if you want to read only a part of the file, or you want to read bits from two (or more) files joined together. For that, you really want to leverage a relational database. So what's my point?

The phpBB stopwords process is reading the entire file, every time. I don't want to read the entire file! I only want to check the words that are actually in my post. Because of this, what I really need is a query. I'm going to move a text file cached on disk back into the database. :-)

Search Stopwords Manager
I have already written a MOD called the Search Stopwords Manager. This MOD (not yet released) allows a board administrator to manage their stopwords file contents via an admin control panel (ACP) interface rather than having to edit a text file. After the processing is completed, the file contents are written back out to disk. Now I am going to change that. I am going to leave the stopwords in the database.
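In concrete terms, that means a one-column table holding the stopwords and a one-time load from the existing text file. Something along these lines should do it (the phpbb_search_stopwords table and word_text column are placeholder names of mine, not necessarily what the released MOD will use, and the file path shown is for an English board):

// Hypothetical table to replace the on-disk stopwords file.
$db->sql_query("CREATE TABLE phpbb_search_stopwords (
    word_text VARCHAR(50) NOT NULL,
    PRIMARY KEY (word_text)
)");

// One-time load: read the existing stopwords file and insert each word.
$stopword_list = @file($phpbb_root_path . 'language/lang_english/search_stopwords.txt');
if ($stopword_list)
{
    foreach ($stopword_list as $stopword)
    {
        $stopword = trim($stopword);
        if ($stopword != '')
        {
            $db->sql_query("INSERT INTO phpbb_search_stopwords (word_text)
                VALUES ('" . addslashes($stopword) . "')");
        }
    }
}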

Let's go back to the stopwords code shown above. The entire stopwords file has been read from disk into an array. Then every word in the array is removed from the $entry string variable, whether it exists or not. Not very efficient.

To fix this, I have added code into functions_search.php that uses the following process flow:

  • Uniquefy the words contained in $entry
  • Query the stopwords database table using that list of words as a WHERE clause
  • Remove any stopwords found from $entry

As you can see, it turns the current process completely around. I get only the words that I am interested in from the database table containing my stopwords. For each word that I find, I remove it. Short and simple. But does it work?
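Here is roughly what that replacement logic looks like in code (a sketch of the approach rather than the final MOD code, using the same placeholder table and column names as above):

// $entry is the space-delimited word string, padded with a leading
// and trailing space; make the word list unique first.
$words = array_unique(explode(' ', trim($entry)));

// Build a quoted, comma-separated list for the IN () clause.
$in_list = "'" . implode("', '", array_map('addslashes', $words)) . "'";

// Ask the database which of *these* words are stopwords...
$sql = "SELECT word_text
    FROM phpbb_search_stopwords
    WHERE word_text IN ($in_list)";
$result = $db->sql_query($sql);

// ... and remove only the stopwords that were actually found in the post.
while ($row = $db->sql_fetchrow($result))
{
    $entry = str_replace(' ' . $row['word_text'] . ' ', ' ', $entry);
}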

Benchmarking Results
I created a test board and loaded it with about 30,000 posts. These posts had an average word count of 64 and an average post length of 700 characters. I ran a Rebuild Search Index MOD that I wrote a while back. (This MOD, like most I have seen of this type, simply works its way through every post on the board, passing the posts through the standard phpBB post processing.) Before the changes, that process took nearly an hour (actual time was 52:32). After loading the stopwords into the database and making the alterations outlined above, the process took a bit over half of the original time (actual time was 32:04). I think you would agree that is quite an improvement. :-) It works out to about 10 seconds per 100 posts versus 6 seconds per 100 posts. That's about a 40% reduction in post processing time, and a lot of CPU cycles that have been recovered. And it gets better... more on that in a few more paragraphs. :-)
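For the curious, the arithmetic behind those per-post numbers: 52:32 is 3,152 seconds for roughly 30,000 posts, or about 10.5 seconds per 100 posts; 32:04 is 1,924 seconds, or about 6.4 seconds per 100 posts; and 1,924 divided by 3,152 is roughly 0.61, hence the roughly 40% reduction.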

What is the cost? We have to run a query to get a specific list of stopwords used for each post. The benefit is that we don't process the entire file. An additional benefit (for me) is that the Search Stopwords Management MOD that I wrote can be simplified. I don't have to worry about writing out a disk file... I just have to manage the contents of a database table.

Summary
Back in the original post I mentioned a 4% reduction in post processing time. I had originally thought that the reduction was a result of the uniquefy process I applied to the post words. I was wrong. As part of testing the regex I had removed a bunch of "short" words from my stopwords text file. It was that slight change... removing a few short words from my stopwords file... that caused the difference in the rebuild process.

After moving the stopwords into the database and altering the code to take advantage of that, I've saved 40%. After removing the code that reads in the (now useless) stopwords file and adding the search synonyms file to a different database table I have reduced the processing time down to about 4 seconds per 100 posts, a 60% reduction in processing time for my rebuild MOD.

Posting is also going to benefit from these changes. First I uniquefy the words from the post, then I check that specific list of words against both the stopwords and synonyms lists. You might not notice a microsecond or two on each post, but your server will, especially on a busy board.

I still need to work out how to handle multiple languages, but that should be a simple matter. If you are wondering, I don't have any code to release for this yet, but I certainly plan to do so. I would think that anything that can reduce processing time by over half would be well received by the phpBB community.

Stay tuned for more details. 8)

2 Comments

  1. O MY GOD!

    Considering that my wordmatch list, for example, has 3.8 million rows and is almost 100MB large, with a BUSY board crunching approximately 10GB of bandwidth every day... well, such a MOD wouldn't just decrease the load on the server it sits on (even if the server happens to be a huge cluster system), it would also improve posting time slightly; even though it's milliseconds, I count that as a lot.

    Can't wait to read through such a MOD and even try it out.

    PS! Great blog, I have actually been learning a few tricks from these posts of yours. Thank you.

    Comment by Boban Alempijevic — April 29, 2007 @ 11:17 am

  2. Yeah, Dave's systematic approach is great. I would love to have seen him have input into the phpBB3 search system.

    Comment by Esmond Poynton — April 30, 2007 @ 9:39 am
