I continue to get feedback from my users that – to be concise – the search process sucks. As regular readers of my blog will probably remember, I have done a lot of work to understand and fine-tune the standard phpBB search process. I have moved stop words into the database. I have adjusted the regular expression used to parse and index the words. I have added code to provide cleaner input to the search routine. All of these changes were made to optimize the process as it works today.
But folks are still not happy.
They don’t like the fact that certain words are on the stop words list. My board is related to a specific brand of software used for reporting. It’s not too surprising, then, that the word “report” appears in nearly 30% of the half-million posts on my board. Yet they still feel that being able to search on that word would add value.
They don’t like the fact that short words (which in our case include version numbers) are not included either.
They don’t like the fact that they can’t search for word combinations (exact phrase search).
So today I started testing out a FULLTEXT index on my posts table. I created the index on both the post text and the title. It took a minute and a half and spiked my CPU to about 33% use. The index is over half the size of the database table. On the other hand, the index is smaller than the index on the search_wordmatch table so that’s something positive.
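For anyone who wants to try the same experiment, the setup boils down to statements along these lines. The table and column names follow the stock phpBB2 schema (phpbb_posts_text holds the post body and subject); treat this as a sketch rather than the exact commands I ran:

```sql
-- Add a FULLTEXT index covering both the post text and the title.
ALTER TABLE phpbb_posts_text
    ADD FULLTEXT INDEX ft_post (post_subject, post_text);

-- Natural-language search, the default mode:
SELECT post_id
FROM phpbb_posts_text
WHERE MATCH (post_subject, post_text) AGAINST ('report');

-- Boolean mode supports an exact-phrase search, which the stock
-- phpBB2 word index cannot do at all:
SELECT post_id
FROM phpbb_posts_text
WHERE MATCH (post_subject, post_text)
      AGAINST ('"error report"' IN BOOLEAN MODE);
```

One caveat worth knowing up front: in the MySQL versions of this era, FULLTEXT indexes are only supported on MyISAM tables, and MySQL applies its own minimum word length (ft_min_word_len, default 4) and its own stopword list, so short words and version numbers are still excluded unless you change those server settings and rebuild the index.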
Over the coming weeks I am going to be experimenting with different search keywords and will try to gather some metrics on how well the fulltext index performs. There are three aspects I plan to use to rate the success of this experiment. First, how fast the results are returned. Second, how effective the results are. Third, how easy it will be to give the user an interface to the new index.
Stay tuned for more details.
I have had frequent complaints from folks on my biggest board regarding the search results. It seems that for many users page two (and beyond) was not available. Frankly, I was stumped. I worked with some users to verify that their cookie settings worked and that their search sessions were not timing out. I checked (and tweaked) the code that clears out old search sessions to make sure their search results were not being truncated too early. The main problem was that I was never able to reproduce the error, which makes debugging and fixing an issue very frustrating.
It turns out that the root cause had nothing to do with a cookie or session problem. The cause was that these users were running searches that returned a huge number of posts or topics. I figured that out by looking at the MySQL query logs and noticing something specific to this issue.
Why should the number of resulting posts matter if I am only trying to go to page 2? Shouldn’t it be an issue only if I am trying to view page 800 instead?
One of the ideas that comes up regularly when discussing tweaks to the phpBB2 search system is setting up some sort of cron job to help maintain the search index table. There are a lot of transactions that hit that table during the posting process, and the posting process is what the user sees. If the posting process is slow, then the board “feels” slow. So I really like this next tweak, as it can improve the posting process quite a bit.
One of the reasons I like the phpBB2 search system so much is I can understand it. If I can understand it, I can tweak it. And that’s what an entire series of blog posts has ended up being about. And they’re not over yet. But I do have to be honest; the search system provided with phpBB2 does have two big challenges. It doesn’t scale well, so very large boards will likely have to do a lot of tweaks or turn it off or seek some alternative. And it is missing one extremely important aspect that most users look for: the ability to search for an exact phrase.
As regular blog readers will probably know, I am always playing around with the search process. On my largest board I now have nearly 350K posts, and anything I can do to make the search process even marginally more effective and efficient helps. My latest idea (which kept me up until the “wee hours” a few nights ago building a prototype) is to grant moderators the ability to mark a topic with an “unsearchable” flag.
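The prototype itself isn't shown here, but the database side of the idea is small. A sketch of the approach (the column name topic_unsearchable is my own placeholder, not necessarily what the final MOD will use):

```sql
-- Let moderators flag a topic so it is skipped during indexing
-- and filtered out of search results. Column name is illustrative.
ALTER TABLE phpbb_topics
    ADD COLUMN topic_unsearchable TINYINT(1) NOT NULL DEFAULT 0;

-- Search queries would then exclude flagged topics:
SELECT p.post_id
FROM phpbb_posts p
INNER JOIN phpbb_topics t ON t.topic_id = p.topic_id
WHERE t.topic_unsearchable = 0;
```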
Post Updated March 18, new notes at the end, thanks.
I recently completed some code that moves the “stopwords” into a database table and changes the way they are applied to a post or to the search process. At the same time I also moved the search synonyms into a database table. During testing I was very interested to find out that the way I thought the search synonyms were applied is not the way they are actually applied at all. This post will clarify how the synonyms are used, and point out something interesting about the internal consistency of the phpbb search_synonyms.txt file.
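For context, the shape of those two tables is simple; something like the following (names and sizes here are illustrative, not a dump of my actual schema):

```sql
CREATE TABLE phpbb_search_stopwords (
    word VARCHAR(50) NOT NULL,
    PRIMARY KEY (word)
);

CREATE TABLE phpbb_search_synonyms (
    -- paired words; one is substituted for the other during
    -- indexing and searching (which direction is the surprise
    -- mentioned above)
    match_word   VARCHAR(50) NOT NULL,
    replace_word VARCHAR(50) NOT NULL,
    PRIMARY KEY (match_word)
);
```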
Yesterday I posted about moving the stopwords file into the database and changing the way the stopwords (and search synonyms) are processed. Then I rebuilt the search index for my largest board. The results? I reindexed 298,070 posts in about five and a half hours.
From this post earlier in the Search series I made this statement:
The changes also reduced the amount of time required to rebuild my search tables by 4%. As it turned out later on, the reduction in time was not because of the uniquefy process, but something completely unexpected. You will just have to come back for the next installment to find out what it was.
The “uniquefy” process was based on the idea that when you’re getting ready to store your post words you don’t need duplicate words to be processed. They just take extra time for no benefit. So I wrote a MOD that would include logic to “uniquefy” the list of words from a post before they were processed any further. I thought that was where the 4% performance benefit was coming from. I was wrong.
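The actual MOD is PHP, where array_unique() does the heavy lifting; here is the same idea as a runnable Python sketch:

```python
def uniquefy(words):
    """Drop duplicate words before any further (database) processing,
    keeping the first occurrence of each word in order."""
    return list(dict.fromkeys(words))

# A post's word list often repeats common terms many times over:
post_words = ["report", "error", "report", "server", "error", "report"]
print(uniquefy(post_words))  # ['report', 'error', 'server']
```

Each duplicate removed here is one less word the indexing loop has to look up and store, which is where the posting-time savings were supposed to come from.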
I am now ready to reveal the secret.
In an earlier post I dissected the regular expression used in the clean_words() function to remove short and long words from the search index tables. I figured out how it worked, and then why it was broken from 2.0.6 forward. I fixed it.
For English boards only.
I have had some feedback from the owner of a German (Swiss German, actually) board, and the fix I provided does not work for his board. I think I know why. I don’t (yet) know how to fix it.
This is part IV of a series of posts about the phpBB2 search process.
You don’t have to read all of the prior parts in order to read this one. The last post was quite long, and so part of what I wanted to cover there was postponed until this post. In this post I’m going to analyze what one particular regex (regular expression) from the clean_words() function is doing. In very early versions of phpBB2 it worked very well at keeping short and long words out of your search index tables. In later versions it did not work so well. In this post I will explain why, and provide an extremely easy fix.
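To make the target concrete before dissecting it: the pattern in question strips out words that are too short or too long to be worth indexing. The length bounds and character class below are illustrative stand-ins, not a quote of the phpBB2 source (whose limits also changed between versions, which is part of the story):

```python
import re

# Remove words of 1-2 characters or 21+ characters; keep everything else.
# The bounds here are illustrative; phpBB2's own limits vary by version.
SHORT_OR_LONG = re.compile(r'\b([a-z0-9]{1,2}|[a-z0-9]{21,})\b')

def clean_words(text):
    return ' '.join(SHORT_OR_LONG.sub(' ', text.lower()).split())

print(clean_words("A fix for the v2 search index"))  # fix for the search index
```

Note the [a-z0-9] character class and the \b word boundaries: characters like ä or ß fall outside both, which is a hint at why a fix that behaves well on an English board can misbehave on a German one.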