Home

Your premium source for custom modification services for phpBB

  logo

HomeForumsBlogMOD ManagerFAQSearchRegisterLogin

Comments January 30, 2007

How does search work? Part II: Making Effective Use of “Stop Words”

Filed under: Search, phpBB — Dave Rathbun @ 9:58 am CommentsComments (7) 

In a prior post I started reviewing the search process used in phpBB2. In summary, there are two tables involved. One table includes a set of words, each assigned a unique word_id value. The second table stores a cross reference of words and posts where that word appears. That makes searching for unique words very fast. It makes your search tables very large. In my opinion the best tool a board owner has to combat this is the stopwords file. This post will cover that file and how you can use it to tune searching on your board.

What is a stopword file?
The stopwords file is simply a text file stored on your server. It’s an alphabetical listing of words that should never be indexed. Remember that as each post is processed as it is entered, and a record of each word is stored in the phpbb_search_wordlist table. Imagine how many posts include the word “the” or “and” or other such common words. Since they are three letters long, they would be included by the standard process. But do you really need to search for such common words?

Case in point: I have a large board that I manage that (as I type this) has 288,870 posts. The word “the” appears in 247,957 of those. And the query to determine that took nearly a second (0.88 seconds, to be precise). How much value is there in looking for a word that appears in over 85% of the posts? Not much… in fact, I would say not any at all.

In general adding such common terms to your search parameters would be fairly useless. In fact big seach engines like google tend to throw out such words for that very reason. Since there is no value in having those words in your search tables phpBB provides a “stopwords” file. This file is a language specific text file and so is located in the language folders. The specific address (related to the forum root path) is

/language/lang_english/search_stopwords.txt

Replace “english” with your language of choice, of course. What does this file look like? Here’s a sample of some of the words from a default installation:

a
about
after
ago
all
almost
along
alot
also
am
an
and
answer
any
anybody
anybodys
anyone
anywhere
are
arent

Now the first thing you might notice is that there are one and two letter words in this file. You might also remember that short words (less than 3 letters) are supposed to be purged from the search word storage process anyway, so they probably don’t need to be included in the stopwords file. But they’re there. The bottom of the file includes some standard abbreviations often used on board posts like AFAIK, IIRC, and YMMV.

So that’s what a standard stopwords file looks like.

How can this file help my board?
At some point certain words become less useful for searching just because of the sheer volume of posts that contain those words. Can you imagine how many posts at phpbb.com include words like “board” or “post” or “topic” or “database”? It just doesn’t make sense to search for those individual words. (Word combinations? Maybe, but that creates its own challenge, detailed later.)

Back to my “big board” for a bit. Here’s a list of the 25 most popular words found in my search index, along with the number of posts where they appear:

+------------+------------+
| word_text  | word_count |
+------------+------------+
| work       |      27323 |
| date       |      23642 |
| database   |      22927 |
| repository |      22681 |
| set        |      22265 |
| prompt     |      21860 |
| select     |      20788 |
| client     |      20406 |
| name       |      20348 |
| run        |      20338 |
| values     |      20283 |
| different  |      19533 |
| change     |      19300 |
| make       |      18712 |
| able       |      18671 |
| first      |      18284 |
| used       |      18168 |
| oracle     |      18050 |
| tables     |      17479 |
| issue      |      17221 |
| value      |      17027 |
| another    |      16990 |
| tried      |      16989 |
| still      |      16779 |
| number     |      16732 |
+------------+------------+
25 rows in set (26.73 sec)

The SQL code used to generate that list is at the bottom of this post. Notice how long it took? Nearly 27 seconds! :shock: That’s an eternity on a very powerful server (dual Xeon, 4GB of RAM). The reason for the time was that this query joins and scans two of the larger tables in the database. I have 174,000 words in my wordlist and 6,678,489 records in the word – post cross reference table. Let’s go back to that 174,000 number for a second. In general if a word appears in more than 10% of the posts on your board I would suggest that it falls into a “commonly used” category. Commonly used words probably do not provide search value, especially the way that phpBB performs the search. How can we stop those words from being indexed?

Now you see where this post is going… the search_stopwords.txt file serves that exact purpose. I’m going to go back to my query results and manually pick out some words that I want to drop. I’ve decided that I am going to go with this subset of the list and selected the following words (displayed here as a percentage of the total posts where they appear):

work		 27,323 	16%
date		 23,642 	14%
database	 	 22,927 	13%
repository	 22,681 	13%
set		 22,265 	13%
prompt		 21,860 	13%
select		 20,788 	12%
client		 20,406 	12%
name		 20,348 	12%
run		 20,338 	12%
values		 20,283 	12%
different		 19,533 	11%
change		 19,300 	11%
make		 18,712 	11%
able		 18,671 	11%
first		 18,284 	11%
used		 18,168 	10%

If I remove all of these words I will drop 355,529 rows from the phpbb_search_wordmatch table, dropping it about 5%. That might not sound like a lot, but it’s still worth doing. I have repeated this process several times for this database, and each time it gets harder to make a substantial dent. The first time I dropped about 35% of the rows, and that was a major difference.

How do I properly add stopwords?
You can easily edit the search_stopwords.txt file and add in the words you want to remove. It’s just a text file, and all you have to do is ensure that each word appears on one line. The words do not have to be in alphabetical order although I tend to do that so the words are easier to check later on. But adding the words to the stopwords list is just the first step. It doesn’t do anything to reclaim space being used by those words in your search tables. (I have a MOD in development that will allow you to manage your stopwords file without having to directly edit it on your server. That MOD may or may not ever see the light of day, but I’m using it on all of my boards right now.)

The next step is removing those words from the two tables used by the search process. There are a couple of options to do this. First, you can download, install, and execute one of the many “Rebuild Search MODs” available at phpbb.com. Those MODs will clear out both search tables and rebuild them from scratch. It’s actually not a bad idea to do that on a periodic basis anyway, even if you haven’t done any search stopwords maintenance. But with a database of over a quarter million posts that process takes over eight hours to run, even on my server.

So I do it the hard way instead. :-) At some point I will write this up as some php code, but for now I do everything via direct SQL. (Note the viewers at home: do not try this without first making a backup of your database, just in case something goes wrong.)

Step 1: Edit your language/lang_english/search_stopwords.txt file
Step 2: Remove rows from phpbb_search_wordlist
Step 3: Remove rows from phpbb_search_wordmatch

Step 1 is done. I added all of the selected words to my stopwords file. Next I ran a query to get the list of word_id values associated with those words.

+---------+
| word_id |
+---------+
|     780 |
|     346 |
|    1157 |
|     353 |
|      20 |
|     358 |
|      30 |
|      44 |
|     769 |
|     377 |
|     689 |
|     752 |
|     255 |
|      59 |
|     397 |
|      74 |
|     235 |
+---------+
17 rows in set (0.01 sec)

After obtaining the word_id values I put them into a text editor as a comma-separated list of values, as I will need that list for two upcoming queries.

780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235

Step 2 can now commence… removing the words from the search_wordlist table, followed by Step 3 which is removing the words from the search_wordmatch table.

DELETE FROM phpbb_search_wordmatch
WHERE word_id in (780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235);

DELETE FROM phpbb_search_wordlist
WHERE word_id in (780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235);

Running the first query took 2 minutes and 25 seconds and removed 355,529 rows from the table. Running the second query is much faster since the words are unique in the search_wordlist table; it took less than a second and removed the individual rows associated with the 17 words I selected to remove. Running these queries used over 50% of my server resources for this single process. :shock: If you need to do maintenance like this it would be best to do it when your board is not very busy.

What is the net result of all of this work? My search process should run quicker as there are fewer rows in the search_wordlist table. Posting speed might also improve as there are fewer words to process (stopwords are ignored during the indexing process as a post is saved). The negative impact? Those seventeen words I just dropped cannot ever be used to search the posts. That presents some challenges when board members are sure that the word “database” appears in at least one post, because they’re looking at it. But when they search it says that there were no matches to be found. Very confusing. :-?

I have started a MOD (private, not to be published at this time) that provides some feedback to the user as far as what is going on with the search process. Here is a sample search screen:

Search Input

In this sample you will see that the default is Search All Words rather than Search Any Words. That was done in an attempt to encourage folks to add more words to reduce their search rather than increase it. Now as you might remember the word “database” was just added as a stopword. How is that communicated to the user? Here is a screen shot of the search results output:

Search Input

The search keywords “test” and “database” are displayed on the header. The words that were actually used are listed, as are the words that were “stopped” by the search process. And on the right side of the screen is an option to fine-tune the current search without returning to the search screen. Any settings made on the initial search screen like searching a specific forum or for posts from a particular user are carried over to the secondary search even if they are not displayed on this screen. I think this will be a really nice MOD when it is finished as it allows me to be even more aggressive with my stopwords while keeping my users informed as to what they can and cannot search for.

Summary
The search tables get big. There’s really no way around that with the standard phpBB search process. By using stopwords you can reduce the size of those tables and improve the efficiency of the search process. This can result in user confusion when they search for words that they know exist in the posts. I believe that the “Search Feedback MOD” that I showed will reduce or eliminate that confusion by showing the user exactly how their search was processed, as well as providing the ability to allow them to fine-tune their search while looking at the results.

Your stopwords file should not have to include words of 2 letters or less. In the first post in this series I mentioned that words of two letters or less (or more than 20) are not supposed to be indexed. They are, or at least they can be sometimes. :-) I will cover that in more detail in the next post in this series.


SQL Code to obtain the 25 most frequently used words

SELECT 	word_text
, 	count(post_id) as word_count
FROM 	phpbb_search_wordlist w
, 	phpbb_search_wordmatch p
WHERE 	w.word_id = p.word_id
GROUP BY word_text
ORDER BY word_count desc
LIMIT 25

SQL Code to get Word IDs for selected words

SELECT  word_id
FROM    phpbb_search_wordlist
WHERE   word_text in ('work', 'date', 'database', 'repository', 'set', 'prompt', 'select', 'client',
'name', 'run', 'values', 'different', 'change', 'make', 'able', 'first', 'used')

SQL Code to remove selected words from the search tables based on word_id values

DELETE FROM phpbb_search_wordmatch
WHERE word_id in (780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235);

DELETE FROM phpbb_search_wordlist
WHERE word_id in (780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235);

Change your search to default to ALL rather than ANY terms
I’ve just released a very simple MOD that will allow you to default your search to ALL rather than ANY. The changes are very minor; you only need to make two edits to a single template file. You can download the code here from the phpBBDoctor MOD Managerâ„¢.

7 Comments

  1. Hi there,

    This is one of the most interesting posts I ever read regarding this obscure way to handle Search on PHPBB :)

    I’ve started to look after my ugly *search_word* tables to find out if they’re giving a hard time to the CPU of our webserver :)

    Well done for your article!
    Xavier

    Comment by Xavier — February 5, 2007 @ 6:46 pm

  2. Hi, Xavier, and welcome to my blog. I am glad you found this post useful. There is a Part III already published and Part IV is a draft at this time. Search is such an important part of phpBB for many folks, so any tweaks that can improve how search performs should probably benefit a lot of people.

    Thanks for your comment. 8)

    Comment by dave.rathbun — February 6, 2007 @ 7:54 am

  3. My fiancees phpBB2 forum uses a lott of custom smilies. Amazingly enough the wordmatch table went down with 10 MB in size after I ran through al thoose names that was cutommade for thoose weirdo smilies. Theese where words that where no real words so I could safely follow your howto with thoose words. I would guess that some forums having a lot of woman *(such as my fiancees :D ) or a lott of teenagers have allos a huge usage of smilies. worth checking thoose smilies names out with this howto :) might cut down a bit on the table :) Great howto by the way. Slimmed my wordmatch table with allmost 45 % a LOTT of MB :)

    Comment by Boban Alempijevic — April 29, 2007 @ 10:18 am

  4. 45% is indeed a nice size reduction. :-) However, adding those extra words to the stopwords file does have a downside in that it will take longer to enter new posts. Have you read the blog post in moving the stopwords into the database, and the time savings it provides? I need to turn that into a formal MOD and have some other folks start testing.

    In any case, I’m glad you found my post useful, and thank you for your comment. 8)

    Comment by dave.rathbun — April 29, 2007 @ 10:50 am

  5. Yess I read all yoru posts and allso about the downside. Out of thoose 45 % (Mathematicly counted) rughly half of it was thanx to something as weird as the custom smilies my Finacee has been putting up on the forum. The names of the smilies are allso stored in same table ofcourse and I noticed that it was not only my imagination that the board was filed with a lott of smilies in most of the posts :) Half of the slimming came thanx to me putting in the names of the arge smilie arsenal into the stopword. I will not go on slimming further, but will instead wait for the mod to get finnished, and I am waiting eagerly for it :) I whish you good luck with writing it mate.

    Comment by Boban Alempijevic — May 1, 2007 @ 2:25 am

  6. Hi,

    I can only thank you for all your time and afford you put into all this. This all helps a lot. I ran as well a pretty big forum with seldom less than 1000 guests and around 200 logged in users in the same time. Performance is a big problem for me since the forum grows and grows and it got kind out of hand with the years. I have also a big server with 8GB mem etc. but still the machine suffers a lot. I made so far some changes you sugguest, lets see how this helps. The portal is in portuguese language and it seems my stopwords really suck so I hope this helped me. My wordmatch table droped with the first 30 keywords around 4 million lines down but still has around 21 million and I need to do this more detailed. I never payed attention before in this so i discovered now a lot of 1,2 or 3 letter words in my wordlist table as well as tons of numbers like 1,11,111 etc…. is this normal?

    thanks again, I am glad out there are people like you….
    best regards
    Oliver

    Comment by Oliver — April 29, 2008 @ 6:28 pm

  7. great blindness! I just now found your other topic about this with the anwser to my questions :)
    thanks man… you help me and my users really a lot!

    Comment by Oliver — April 29, 2008 @ 6:31 pm

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress