How does search work? Part II: Making Effective Use of “Stop Words”
In a prior post I started reviewing the search process used in phpBB2. In summary, there are two tables involved. One table includes a set of words, each assigned a unique word_id value. The second table stores a cross reference of words and posts where that word appears. That makes searching for unique words very fast. It makes your search tables very large. In my opinion the best tool a board owner has to combat this is the stopwords file. This post will cover that file and how you can use it to tune searching on your board.
What is a stopword file?
The stopwords file is simply a text file stored on your server. It’s an alphabetical listing of words that should never be indexed. Remember that as each post is processed as it is entered, and a record of each word is stored in the phpbb_search_wordlist table. Imagine how many posts include the word “the” or “and” or other such common words. Since they are three letters long, they would be included by the standard process. But do you really need to search for such common words?
Case in point: I have a large board that I manage that (as I type this) has 288,870 posts. The word “the” appears in 247,957 of those. And the query to determine that took nearly a second (0.88 seconds, to be precise). How much value is there in looking for a word that appears in over 85% of the posts? Not much… in fact, I would say not any at all.
In general adding such common terms to your search parameters would be fairly useless. In fact big seach engines like google tend to throw out such words for that very reason. Since there is no value in having those words in your search tables phpBB provides a “stopwords” file. This file is a language specific text file and so is located in the language folders. The specific address (related to the forum root path) is
/language/lang_english/search_stopwords.txt
Replace “english” with your language of choice, of course. What does this file look like? Here’s a sample of some of the words from a default installation:
a about after ago all almost along alot also am an and answer any anybody anybodys anyone anywhere are arent
Now the first thing you might notice is that there are one and two letter words in this file. You might also remember that short words (less than 3 letters) are supposed to be purged from the search word storage process anyway, so they probably don’t need to be included in the stopwords file. But they’re there. The bottom of the file includes some standard abbreviations often used on board posts like AFAIK, IIRC, and YMMV.
So that’s what a standard stopwords file looks like.
How can this file help my board?
At some point certain words become less useful for searching just because of the sheer volume of posts that contain those words. Can you imagine how many posts at phpbb.com include words like “board” or “post” or “topic” or “database”? It just doesn’t make sense to search for those individual words. (Word combinations? Maybe, but that creates its own challenge, detailed later.)
Back to my “big board” for a bit. Here’s a list of the 25 most popular words found in my search index, along with the number of posts where they appear:
+------------+------------+ | word_text | word_count | +------------+------------+ | work | 27323 | | date | 23642 | | database | 22927 | | repository | 22681 | | set | 22265 | | prompt | 21860 | | select | 20788 | | client | 20406 | | name | 20348 | | run | 20338 | | values | 20283 | | different | 19533 | | change | 19300 | | make | 18712 | | able | 18671 | | first | 18284 | | used | 18168 | | oracle | 18050 | | tables | 17479 | | issue | 17221 | | value | 17027 | | another | 16990 | | tried | 16989 | | still | 16779 | | number | 16732 | +------------+------------+ 25 rows in set (26.73 sec)
The SQL code used to generate that list is at the bottom of this post. Notice how long it took? Nearly 27 seconds! That’s an eternity on a very powerful server (dual Xeon, 4GB of RAM). The reason for the time was that this query joins and scans two of the larger tables in the database. I have 174,000 words in my wordlist and 6,678,489 records in the word – post cross reference table. Let’s go back to that 174,000 number for a second. In general if a word appears in more than 10% of the posts on your board I would suggest that it falls into a “commonly used” category. Commonly used words probably do not provide search value, especially the way that phpBB performs the search. How can we stop those words from being indexed?
Now you see where this post is going… the search_stopwords.txt file serves that exact purpose. I’m going to go back to my query results and manually pick out some words that I want to drop. I’ve decided that I am going to go with this subset of the list and selected the following words (displayed here as a percentage of the total posts where they appear):
work 27,323 16% date 23,642 14% database 22,927 13% repository 22,681 13% set 22,265 13% prompt 21,860 13% select 20,788 12% client 20,406 12% name 20,348 12% run 20,338 12% values 20,283 12% different 19,533 11% change 19,300 11% make 18,712 11% able 18,671 11% first 18,284 11% used 18,168 10%
If I remove all of these words I will drop 355,529 rows from the phpbb_search_wordmatch table, dropping it about 5%. That might not sound like a lot, but it’s still worth doing. I have repeated this process several times for this database, and each time it gets harder to make a substantial dent. The first time I dropped about 35% of the rows, and that was a major difference.
How do I properly add stopwords?
You can easily edit the search_stopwords.txt file and add in the words you want to remove. It’s just a text file, and all you have to do is ensure that each word appears on one line. The words do not have to be in alphabetical order although I tend to do that so the words are easier to check later on. But adding the words to the stopwords list is just the first step. It doesn’t do anything to reclaim space being used by those words in your search tables. (I have a MOD in development that will allow you to manage your stopwords file without having to directly edit it on your server. That MOD may or may not ever see the light of day, but I’m using it on all of my boards right now.)
The next step is removing those words from the two tables used by the search process. There are a couple of options to do this. First, you can download, install, and execute one of the many “Rebuild Search MODs” available at phpbb.com. Those MODs will clear out both search tables and rebuild them from scratch. It’s actually not a bad idea to do that on a periodic basis anyway, even if you haven’t done any search stopwords maintenance. But with a database of over a quarter million posts that process takes over eight hours to run, even on my server.
So I do it the hard way instead. At some point I will write this up as some php code, but for now I do everything via direct SQL. (Note the viewers at home: do not try this without first making a backup of your database, just in case something goes wrong.)
Step 1: Edit your language/lang_english/search_stopwords.txt file
Step 2: Remove rows from phpbb_search_wordlist
Step 3: Remove rows from phpbb_search_wordmatch
Step 1 is done. I added all of the selected words to my stopwords file. Next I ran a query to get the list of word_id values associated with those words.
+---------+ | word_id | +---------+ | 780 | | 346 | | 1157 | | 353 | | 20 | | 358 | | 30 | | 44 | | 769 | | 377 | | 689 | | 752 | | 255 | | 59 | | 397 | | 74 | | 235 | +---------+ 17 rows in set (0.01 sec)
After obtaining the word_id values I put them into a text editor as a comma-separated list of values, as I will need that list for two upcoming queries.
780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235
Step 2 can now commence… removing the words from the search_wordlist table, followed by Step 3 which is removing the words from the search_wordmatch table.
DELETE FROM phpbb_search_wordmatch WHERE word_id in (780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235); DELETE FROM phpbb_search_wordlist WHERE word_id in (780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235);
Running the first query took 2 minutes and 25 seconds and removed 355,529 rows from the table. Running the second query is much faster since the words are unique in the search_wordlist table; it took less than a second and removed the individual rows associated with the 17 words I selected to remove. Running these queries used over 50% of my server resources for this single process. If you need to do maintenance like this it would be best to do it when your board is not very busy.
What is the net result of all of this work? My search process should run quicker as there are fewer rows in the search_wordlist table. Posting speed might also improve as there are fewer words to process (stopwords are ignored during the indexing process as a post is saved). The negative impact? Those seventeen words I just dropped cannot ever be used to search the posts. That presents some challenges when board members are sure that the word “database” appears in at least one post, because they’re looking at it. But when they search it says that there were no matches to be found. Very confusing.
I have started a MOD (private, not to be published at this time) that provides some feedback to the user as far as what is going on with the search process. Here is a sample search screen:
In this sample you will see that the default is Search All Words rather than Search Any Words. That was done in an attempt to encourage folks to add more words to reduce their search rather than increase it. Now as you might remember the word “database” was just added as a stopword. How is that communicated to the user? Here is a screen shot of the search results output:
The search keywords “test” and “database” are displayed on the header. The words that were actually used are listed, as are the words that were “stopped” by the search process. And on the right side of the screen is an option to fine-tune the current search without returning to the search screen. Any settings made on the initial search screen like searching a specific forum or for posts from a particular user are carried over to the secondary search even if they are not displayed on this screen. I think this will be a really nice MOD when it is finished as it allows me to be even more aggressive with my stopwords while keeping my users informed as to what they can and cannot search for.
Summary
The search tables get big. There’s really no way around that with the standard phpBB search process. By using stopwords you can reduce the size of those tables and improve the efficiency of the search process. This can result in user confusion when they search for words that they know exist in the posts. I believe that the “Search Feedback MOD” that I showed will reduce or eliminate that confusion by showing the user exactly how their search was processed, as well as providing the ability to allow them to fine-tune their search while looking at the results.
Your stopwords file should not have to include words of 2 letters or less. In the first post in this series I mentioned that words of two letters or less (or more than 20) are not supposed to be indexed. They are, or at least they can be sometimes. I will cover that in more detail in the next post in this series.
SQL Code to obtain the 25 most frequently used words
SELECT word_text , count(post_id) as word_count FROM phpbb_search_wordlist w , phpbb_search_wordmatch p WHERE w.word_id = p.word_id GROUP BY word_text ORDER BY word_count desc LIMIT 25
SQL Code to get Word IDs for selected words
SELECT word_id FROM phpbb_search_wordlist WHERE word_text in ('work', 'date', 'database', 'repository', 'set', 'prompt', 'select', 'client', 'name', 'run', 'values', 'different', 'change', 'make', 'able', 'first', 'used')
SQL Code to remove selected words from the search tables based on word_id values
DELETE FROM phpbb_search_wordmatch WHERE word_id in (780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235); DELETE FROM phpbb_search_wordlist WHERE word_id in (780,346,1157,353,20,358,30,44,769,377,689,752,255,59,397,74,235);
Change your search to default to ALL rather than ANY terms
I’ve just released a very simple MOD that will allow you to default your search to ALL rather than ANY. The changes are very minor; you only need to make two edits to a single template file. You can download the code here from the phpBBDoctor MOD Managerâ„¢.

Hi there,
This is one of the most interesting posts I ever read regarding this obscure way to handle Search on PHPBB
I’ve started to look after my ugly *search_word* tables to find out if they’re giving a hard time to the CPU of our webserver
Well done for your article!
Xavier
Comment by Xavier — February 5, 2007 @ 6:46 pm
Hi, Xavier, and welcome to my blog. I am glad you found this post useful. There is a Part III already published and Part IV is a draft at this time. Search is such an important part of phpBB for many folks, so any tweaks that can improve how search performs should probably benefit a lot of people.
Thanks for your comment.
Comment by dave.rathbun — February 6, 2007 @ 7:54 am
My fiancees phpBB2 forum uses a lott of custom smilies. Amazingly enough the wordmatch table went down with 10 MB in size after I ran through al thoose names that was cutommade for thoose weirdo smilies. Theese where words that where no real words so I could safely follow your howto with thoose words. I would guess that some forums having a lot of woman *(such as my fiancees
) or a lott of teenagers have allos a huge usage of smilies. worth checking thoose smilies names out with this howto
might cut down a bit on the table
Great howto by the way. Slimmed my wordmatch table with allmost 45 % a LOTT of MB
Comment by Boban Alempijevic — April 29, 2007 @ 10:18 am
45% is indeed a nice size reduction.
However, adding those extra words to the stopwords file does have a downside in that it will take longer to enter new posts. Have you read the blog post in moving the stopwords into the database, and the time savings it provides? I need to turn that into a formal MOD and have some other folks start testing.
In any case, I’m glad you found my post useful, and thank you for your comment.
Comment by dave.rathbun — April 29, 2007 @ 10:50 am
Yess I read all yoru posts and allso about the downside. Out of thoose 45 % (Mathematicly counted) rughly half of it was thanx to something as weird as the custom smilies my Finacee has been putting up on the forum. The names of the smilies are allso stored in same table ofcourse and I noticed that it was not only my imagination that the board was filed with a lott of smilies in most of the posts
Half of the slimming came thanx to me putting in the names of the arge smilie arsenal into the stopword. I will not go on slimming further, but will instead wait for the mod to get finnished, and I am waiting eagerly for it
I whish you good luck with writing it mate.
Comment by Boban Alempijevic — May 1, 2007 @ 2:25 am
Hi,
I can only thank you for all your time and afford you put into all this. This all helps a lot. I ran as well a pretty big forum with seldom less than 1000 guests and around 200 logged in users in the same time. Performance is a big problem for me since the forum grows and grows and it got kind out of hand with the years. I have also a big server with 8GB mem etc. but still the machine suffers a lot. I made so far some changes you sugguest, lets see how this helps. The portal is in portuguese language and it seems my stopwords really suck so I hope this helped me. My wordmatch table droped with the first 30 keywords around 4 million lines down but still has around 21 million and I need to do this more detailed. I never payed attention before in this so i discovered now a lot of 1,2 or 3 letter words in my wordlist table as well as tons of numbers like 1,11,111 etc…. is this normal?
thanks again, I am glad out there are people like you….
best regards
Oliver
Comment by Oliver — April 29, 2008 @ 6:28 pm
great blindness! I just now found your other topic about this with the anwser to my questions
thanks man… you help me and my users really a lot!
Comment by Oliver — April 29, 2008 @ 6:31 pm