
How does search work? Part III: Efficient clean_words() Function

Filed under: Search, phpBB — Dave Rathbun @ February 1, 2007, 2:04 am

This is part three of a series of posts about how the search process in phpBB works. In prior posts I have talked about the search table design and how to use stopwords. This post describes how to roll one specific line of code back to the version found way back in 2.0.4. If you don’t implement this change you might see short words (two or fewer letters) or long words (more than 20 letters) in your search database. This post also details a few additional tweaks that I have made to the clean_words() function found in includes/functions_search.php that help overall performance, both in posting and in searching. All of the changes I discuss in this post are available in MOD format. This is a bit of a long post, but stay with me, I think it’s worth it. :-)

Background
One of the last great mysteries for me in writing PHP code is how to use regular expressions. I know, I really should know this stuff. I’ve been using Unix systems and writing Perl code long enough that some of this should have managed to sink in. :-) I do know enough to generally figure out what’s going on.

My first board started years ago when 2.0.1 was just released. As additional upgrades came out I faithfully applied all of the changes. As time went by, however, my codebase diverged more and more from the basic installation due to customizations that I had done. As a result, I began to only apply selected portions of each update. I would review the code each time an update came out and apply those things that I could see a clear need for, and skip the others. As a result, the oldest (and biggest) board that I run does not have the issue outlined in this post. I only have words of lengths between 3 and 20 (inclusive) in my search word list. If you have installed a recent version or if you have applied every single update ever suggested then you are not so lucky. Your search word list will contain short and long words that clutter up your system. One person posted on phpbb.com that after cleaning up this problem on their board the size of their phpbb_search_wordmatch table was cut in half. :shock:

Regular Expression Comparison
So let’s get started. Here is one line of code from the clean_words() function in includes/functions_search.php:

// 'words' that consist of <3 or >20 characters are removed.
$entry = preg_replace('/\b([a-z0-9]{1,2}|[a-z0-9]{21,})\b/', ' ', $entry);

And here is the same line from a different version of phpBB:

// 'words' that consist of <3 or >20 characters are removed.
$entry = preg_replace('/[ ]([\S]{1,2}|[\S]{21,})[ ]/', ' ', $entry);

The first one works, the second one does not. The second one has been in place since at least version 2.0.6, and perhaps 2.0.5. (I no longer have a copy of 2.0.5 to check, but it doesn’t really matter at this point seeing how old it is.)

Before I dissect the regular expression, let’s set up echo() statements to see exactly what is going on. I added some of these commands before and after the regular expression as well as just after the line where the stopwords were applied. I created a test post designed to cause problems as it has a number of short two-letter words in a row, which is one of the issues that was reported in a discussion at phpbb.com. The post was:

A post with a few short words and if is ok a few in a row.
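The debug statements themselves were nothing fancy; something along these lines inside clean_words() (a sketch only, the exact placement depends on which version of the file you have):

echo 'Starting process: ' . $entry . "<br />\n";

// ... the short/long word removal regex runs here ...
echo 'After clean: ' . $entry . "<br />\n";

// ... the stopword removal loop runs here ...
echo 'After stopwords: ' . $entry . "<br />\n";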

Here is the debug output for the “old” regex:

Starting process: a post with a few short words and if is ok a few in a row
After clean: post with few short words and few row
After stopwords: post few short words few row 

And now the “new” regex:

Starting process: a post with a few short words and if is ok a few in a row
After clean: post with few short words and is a few a row
After stopwords: post few short words few row 

Oops. :oops: See the difference? Some words slipped through the new code, as shown in the second line. The run of short words in the post is “if is ok a” and the words that got through are “is” and “a”. Why? If you look carefully, each of them immediately follows a short word that was just removed; in a run of short words, only every other one gets caught. If these words are separated by longer words they’re eliminated just fine, but when two one- or two-letter words appear back to back, the second one slips through. In this case the stopwords file is eliminating them from the final results anyway. But what if I eliminate all one- or two-letter words from my stopwords file?
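Before answering that, here is the slip-through behavior reproduced in isolation (a quick standalone sketch, not phpBB code):

$entry = ' if is ok a ';
echo preg_replace('/[ ]([\S]{1,2}|[\S]{21,})[ ]/', ' ', $entry);
// prints " is a " -- "if" and "ok" are removed, but "is" and "a" survive,
// apparently because each match consumes the trailing space that the next
// short word would need as its leading delimiter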

As an experiment, I removed all one- or two-letter words from my stopwords list, truncated my search tables, and rebuilt the search word index using the “new” regex. After that process was completed I found the following word-length distribution in my search tables:

+-------------------+----------+
| length(word_text) | count(*) |
+-------------------+----------+
|                 1 |     1626 |
|                 2 |     4646 |
|                 3 |     7481 |
|                 4 |    15222 |
|                 5 |    15990 |
|                 6 |    13899 |
|                 7 |    12969 |
|                 8 |     7715 |
|                 9 |     5258 |
|                10 |     3220 |
|                11 |     1942 |
|                12 |      989 |
|                13 |      745 |
|                14 |      285 |
|                15 |      183 |
|                16 |      119 |
|                17 |       72 |
|                18 |       71 |
|                19 |       43 |
|                20 |       41 |
|                21 |        8 |
|                22 |        3 |
|                23 |        9 |
|                24 |        5 |
|                25 |        5 |
|                26 |        3 |
|                27 |        1 |
|                28 |        2 |
|                31 |        2 |
|                33 |        1 |
|                35 |        2 |
|                36 |        1 |
|                37 |        1 |
|                43 |        1 |
|                47 |        1 |
+-------------------+----------+

Big problems. I have one- and two-letter words, not to mention a bunch of words of more than 20 letters. I reset the regex back to the “old” version, left the stopwords list alone (still without the one- and two-letter words), and ran the process again. Here are the results:

+-------------------+----------+
| length(word_text) | count(*) |
+-------------------+----------+
|                 1 |       37 |
|                 3 |     7397 |
|                 4 |    15214 |
|                 5 |    15967 |
|                 6 |    13889 |
|                 7 |    12960 |
|                 8 |     7716 |
|                 9 |     5255 |
|                10 |     3220 |
|                11 |     1942 |
|                12 |      989 |
|                13 |      745 |
|                14 |      285 |
|                15 |      183 |
|                16 |      119 |
|                17 |       72 |
|                18 |       71 |
|                19 |       43 |
|                20 |       41 |
+-------------------+----------+

Even with the original regex there are still a few one-letter words in my list. That’s because there are some characters above ASCII 127 stored in my posts. If I run a query to get the ASCII code for these single characters I find this:
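The query was along these lines (a sketch, assuming the default phpbb_ table prefix):

SELECT  DISTINCT ascii(word_text)
FROM    phpbb_search_wordlist
WHERE   length(word_text) = 1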

+------------------+
| ascii(word_text) |
+------------------+
|              133 |
|              145 |
|              146 |
|              147 |
|              148 |
|              149 |
|              150 |
|              160 |
+------------------+

These characters correspond to various special characters that came about because of folks copying and pasting from a word processor rather than a plain-text editor; they are Windows-1252 punctuation codes. The characters (and their equivalents) are shown here:

133 … ellipsis
145 ‘ left (opening) single quote
146 ’ right (closing) single quote
147 “ left (opening) double quote
148 ” right (closing) double quote
149 • bullet
150 – en dash
160   non-breaking space (nbsp)

These experiments do seem to indicate to me that the older regex works better than the newer, but both have an issue with “strange characters” that might occur. The function clean_words() already had a list of special characters that are removed during processing that includes things like % and $ and ^ and so on. To fix this, I have updated my function with these additional characters so they are mapped out as well. (The array is named $drop_char_match and it was originally defined as a static array. It can no longer be static as I am using the chr() function to store these new characters in the array.)
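The general idea looks like this (a sketch only; the actual MOD may differ in exactly which characters it adds and what it replaces them with):

// Extend the drop-character list with the Windows-1252 codes found above.
// The arrays can no longer be declared static because chr() has to be
// evaluated at run time.
foreach (array(133, 145, 146, 147, 148, 149, 150, 160) as $code)
{
    $drop_char_match[]   = chr($code);
    $drop_char_replace[] = ' ';
}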

That fixes that part of the issue. This change is part of the MOD linked at the end of this post.

Uniquefy Your Words
It was at this point that I got distracted by something other than dissecting the regular expressions and their differences. I noticed something else that is more subtle, and I thought I could see room for improvement. Let’s go back and look at the debug output.

Starting process: a post with a few short words and if is ok a few in a row
After clean: post with few short words and few row
After stopwords: post few short words few row 

After the post words have been cleaned and stopwords removed the final text for the post is post few short words few row. These are the words that will be evaluated for inclusion into the search_wordlist and search_wordmatch tables. Notice anything? Look closer, I’ll wait a few seconds before I reveal the secret. ;-)

One Mississippi…

Two Mississippi…

Three Mississippi…

A hint: post few short words few row

Okay, you got it, right? The word “few” appears twice. That’s actually fine as far as the post content goes, since the word does, in fact, appear more than once. But for processing purposes, don’t you think we would want to remove the duplicate? There’s no point in processing a word a second time just because it appears in the post more than once. So I initially added this code just before the return statement of the clean_words() function:

$entry = implode(' ', array_unique(explode(' ', preg_replace('#\s+#', ' ', $entry))));

That statement collapses any run of whitespace into a single space (the preg_replace() takes care of that), loads the space-delimited words into an array via the explode() function, uniquefies that list using the array_unique() function, and finally translates the unique word list back into a text string via the implode() function. Wow. :shock: I have no idea how efficient this statement is. But I have to assume that whatever it costs here will be regained later on when I don’t have to process the same word more than once. Adding some debug statements shows this:

Entry before: post few short words few row
Entry after: post few short words row

This is for a post with only a few (heh, pun intended :-P ) repeated words. What if I enter a post with a bunch of repeated words? I won’t bother to post more along this train of thought as it really depends on how folks post on your board. I think it should help, as it applies a unique filter to the words of the post before they’re processed further. The longer the post, the more likely (I would think) there would be a benefit to this change.
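For reference, here is the one-liner from above unrolled into separate steps (purely a readability sketch, not what the MOD installs):

$entry = preg_replace('#\s+#', ' ', $entry);   // collapse runs of whitespace into a single space
$words = explode(' ', $entry);                 // split the string into individual words
$words = array_unique($words);                 // drop duplicate words
$entry = implode(' ', $words);                 // glue the unique words back together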

What about searching?
This same cleaning process is applied to search keywords. So if someone enters a search using the keywords “search search search” how many words do you really want to search for? In the current scenario phpBB will search for the word “search” three times and combine the results. That’s a bit of a waste of time, isn’t it? It seems to me that the extra time spent to uniquefy the word list would make sense. In fact, the code does help. But it can’t be run all the time.

There is an option on search to find “all” or “any” of the words entered. If you search for “all” then every word as entered will be included in the search, so if your search terms are repeated you’re wasting time. Why not uniquefy the list?

Because sometimes there is a reason to have duplicate words. If you opt for “any” (the default in a standard phpBB installation) then you have the option to match one or more words (that’s the “any” part in action) or you can actually enter boolean logic. Suppose you wanted to search for cake and ice and cream but you didn’t want to see any posts regarding coffee (also served with cream). You can actually do that with the following combination of search terms:

cake and ice and cream and not coffee

That’s really awesome, and I bet 98% of the folks that search a phpBB database don’t know about it. But notice the words… the word “and” is repeated three times, and in this context that’s very important. If I were to use my unique word filter process on these words it would cause a problem.

So it boils down to this: if I search for “all” words, or if I don’t include any boolean logic words, then I can apply the unique word filter and save myself some time even for searching. But I don’t know whether I’m using “any” or “all” while I am in the clean_words() function. I do know if I am in ‘post’ mode, but that’s all I’ve got to work with.

It’s MOD Writin’ Time!
At this point I have done enough research that it’s time to write a MOD. :-) I called this MOD “Efficient clean_words() Function” although “Uniquefy Your Words” came in a close second. :-P The MOD alters code in both includes/functions_search.php and the base search.php file. You can review the MOD using a link at the end of the post, but I’ve included a basic logic flow here, along with a rough sketch of the search.php piece after the list:

  • includes/functions_search.php
    • Add some more characters to the drop list
    • Revert to the original regex
    • If $mode == 'post' then uniquefy the words
  • search.php
    • If there is only one search word, skip the rest of the process
    • If $search_terms == 'all' then uniquefy the words
    • If $search_terms == 'any' and $search_keywords includes 'and', 'or', or 'not' then use $search_keywords as entered; otherwise uniquefy the words
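Pulled together, the search.php part of that logic looks roughly like this (a sketch; $split_search is my assumption for the array holding the cleaned, space-split keywords in a standard 2.0.x search.php, and the actual MOD code may differ):

if (count($split_search) > 1)
{
    if ($search_terms == 'all')
    {
        // boolean operators are ignored in "all" mode, so duplicates are pure waste
        $split_search = array_unique($split_search);
    }
    else if (!count(array_intersect(array('and', 'or', 'not'), $split_search)))
    {
        // "any" mode with no boolean operators entered: safe to uniquefy as well
        $split_search = array_unique($split_search);
    }
    // otherwise leave the keywords exactly as entered so a query like
    // "cake and ice and cream and not coffee" still works
}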

Summary
So from this post I’ve shown that the regex that is in the current phpBB clean_words() function appears to be flawed. I haven’t completely dissected the regex as compared to that from 2.0.4 yet, but this post has gone on long enough. Suffice it to say that there are three issues that I’ve discussed with the current clean_words() function. First, there are some “weird” characters that can get stored in your search_wordlist table. Second, the regex appears to be better in phpBB version 2.0.4. And finally, taking the time to create a unique list of words from the clean_words() function should pay dividends both for post processing and for searching. All of these adjustments are included in the Efficient clean_words() Function MOD.

Next time we’ll really dissect the regex and see if we can tell what went wrong with the current version.

If you’re wondering, “uniquefy” is not a real word, but it does appear in the Urban Dictionary. ;-)

Final Results
By the way, after making the changes in this MOD and rebuilding my search index I ran the same query that checks the length of my search words and got the following output:

+-------------------+----------+
| length(word_text) | count(*) |
+-------------------+----------+
|                 3 |     7411 |
|                 4 |    15186 |
|                 5 |    15951 |
|                 6 |    13881 |
|                 7 |    12956 |
|                 8 |     7699 |
|                 9 |     5244 |
|                10 |     3221 |
|                11 |     1940 |
|                12 |      987 |
|                13 |      742 |
|                14 |      284 |
|                15 |      183 |
|                16 |      119 |
|                17 |       72 |
|                18 |       71 |
|                19 |       43 |
|                20 |       41 |
+-------------------+----------+

The changes have fixed it so that no words shorter than three or longer than twenty characters appear. Here is the SQL code that I used to generate this output:

SELECT  length(word_text)
,       count(*)
FROM    phpbb_search_wordlist w
,       phpbb_search_wordmatch m
WHERE   w.word_id = m.word_id
GROUP BY length(word_text)

The changes also reduced the amount of time required to rebuild my search tables by 4%. As it turned out later on, the reduction in time was not because of the uniquefy process, but something completely unexpected. You will just have to come back for the next installment to find out what it was. 8)

Related Links

2 Comments

  1. Hi,

    I have one big problem… I seem to have a really nasty search table at the moment, with around 25 million entries. I have a quad core with 8 GB of memory and would like to rebuild the table, but… this is going to take forever, I guess? Is there any way to do this without having to disable the board?

    Comment by Oliver — April 29, 2008 @ 6:40 pm

  2. Hi, Oliver, welcome to my blog. :) The code that I wrote to rebuild the search index tables is best run with the board disabled, as the first thing it does is truncate the tables and start fresh. If people were posting while the process was running, those newer posts would already have been indexed by the normal posting code by the time the rebuild reached them, and the rebuild would index them a second time. There are more sophisticated MODs on phpbb.com that might do a better job for you. Some of them tag posts as indexed or not, and some are designed to run via cron jobs rather than interactive php jobs.

    On a very large board that would likely be a better solution.

    Comment by Dave Rathbun — April 30, 2008 @ 1:25 pm
