I have had frequent complaints from folks on my biggest board regarding the search results. It seems that for many users page two (and beyond) was not available. Frankly, I was stumped. I worked with some users to verify that they have cookie settings that work and to ensure that their search session was not timing out. I checked (and tweaked) the code that clears out old search sessions to make sure their search results were not being truncated too early. The main thing was, I was never able to reproduce the error. That makes it very frustrating when trying to debug and fix an issue.
It turns out that the root cause had nothing to do with a cookie or session problem. The cause was that these users were running searches that returned a huge number of posts or topics. I figured that out by looking at the MySQL query logs, where I noticed something specific to this issue.
Why should the number of resulting posts matter if I am only trying to go to page 2? Shouldn’t it be an issue only if I am trying to view page 800 instead?
More…
One of the ideas that comes up regularly when discussing tweaks to the phpBB2 search system is setting up some sort of cron job that helps to maintain the search index table. There are a lot of transactions that hit that table during the posting process, and the posting process is what the user sees. If the posting process is slow, then the board “feels” slow. So I really like this next tweak, as it can improve the posting process quite a bit.
More…
One of the reasons I like the phpBB2 search system so much is I can understand it. If I can understand it, I can tweak it. And that’s what an entire series of blog posts has ended up being about. And they’re not over yet.
But I do have to be honest; the search system provided with phpBB2 does have two big challenges. It doesn’t scale well, so very large boards will likely have to do a lot of tweaks or turn it off or seek some alternative. And it is missing one extremely important aspect that most users look for: the ability to search for an exact phrase.
More…
As regular blog readers will probably know, I am always playing around with the search process. On my largest board I now have nearly 350K posts, and anything I can do to make the search process even marginally more effective and efficient can help. My latest idea (which I was up until the “wee hours” a few nights ago building a prototype for) is to grant moderators the ability to mark a topic with an “unsearchable” flag.
More…
Post Updated March 18, new notes at the end, thanks.
I recently completed some code that moves the “stopwords” into a database table and changes the way they are applied to a post or to the search process. At the same time I also moved the search synonyms into a database table. During testing I was very interested to find out that the way I thought the search synonyms were applied is not the way they are actually applied at all. This post will clarify how the synonyms are used, and point out something interesting about the internal consistency of the phpBB search_synonyms.txt file.
More…
Yesterday I posted about moving the stopwords file into the database and changing the way the stopwords (and search synonyms) are processed. Then I rebuilt the search index for my largest board. The results? I reindexed 298,070 posts in about five and a half hours. Read more for details… More…
In this post earlier in the Search series I made this statement:
The changes also reduced the amount of time required to rebuild my search tables by 4%. As it turned out later on, the reduction in time was not because of the uniquefy process, but something completely unexpected. You will just have to come back for the next installment to find out what it was.
The “uniquefy” process was based on the idea that when you’re getting ready to store your post words you don’t need duplicate words to be processed. They just take extra time for no benefit. So I wrote a MOD that would include logic to “uniquefy” the list of words from a post before they were processed any further. I thought that was where the 4% performance benefit was coming from. I was wrong.
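To make the “uniquefy” idea concrete, here is a minimal sketch in Python. It is not the code from the MOD, which is not shown here; the function name `uniquefy` and the sample sentence are made up for illustration. It only shows the general technique of dropping duplicate words while keeping the order in which they first appear.

```python
def uniquefy(words):
    """Return the words with duplicates removed, keeping first-seen order.

    Hypothetical helper for illustration; not the phpBB2 MOD itself.
    """
    # dict.fromkeys keeps insertion order, so each word survives once,
    # at the position where it first appeared.
    return list(dict.fromkeys(words))


# Made-up sample post text, already lowercased and split into words.
post_words = "the search index stores the words of the post".split()
unique_words = uniquefy(post_words)

print(len(post_words), "words before,", len(unique_words), "after")
print(unique_words)
```

Any later per-word work (stopword checks, index inserts) then runs once per distinct word instead of once per occurrence.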
I am now ready to reveal the secret.
More…
In an earlier post I dissected the regular expression used in the clean_words() function to remove short and long words from the search index tables. I figured out how it worked, and then why it was broken from 2.0.6 forward. I fixed it.
For English boards only.
I have had some feedback from an owner of a German (Swiss-German, actually) board, and the fix I provided does not work for his board. I think I know why. I don’t (yet) know how to fix it. More…
This is part IV of a series of posts about the phpBB2 search process. Previous posts include:
You don’t have to read all of the prior parts in order to read this one. The last post was quite long, and so part of what I wanted to cover there was postponed until this post. In this post I’m going to analyze what one particular regex (regular expression) from the clean_words() function is doing. In very early versions of phpBB2 it worked very well at keeping short and long words out of your search index tables. In later versions it did not work so well. In this post I will explain why, and provide an extremely easy fix.
More…
This is part three of a series of posts about how the search process in phpBB works. In prior posts I have talked about the search table design and how to use stopwords. This post is going to describe how to roll back to code found way back in version 2.0.4 for one specific line. If you don’t implement this change you might see short words (two or fewer letters) or long words (greater than 20 letters) in your search database. This post also details a few additional tweaks that I have made to the clean_words() function found in includes/search_functions.php that help overall performance, both in posting and in searching. All of the changes I discuss in this post are available in MOD format. This is a bit of a long post, but stay with me, I think it’s worth it.
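The length rule mentioned above (nothing with two or fewer letters, nothing longer than 20 letters) can be sketched in a few lines of Python. This is only an illustration of the rule, not the regex from clean_words() and not the fix itself; the function name, the constants, and the sample words are my own assumptions.

```python
# Assumed limits, taken from the description above: words of two or fewer
# letters and words of more than 20 letters should stay out of the index.
MIN_LEN = 3
MAX_LEN = 20


def drop_short_and_long(words, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only words whose length is between min_len and max_len inclusive.

    Hypothetical helper for illustration; phpBB2 does this with a regex.
    """
    return [w for w in words if min_len <= len(w) <= max_len]


# Made-up sample: one short word, two normal words, and two long runs
# of 20 and 21 letters to show both sides of the upper limit.
sample = ["to", "php", "search", "a" * 20, "a" * 21]
kept = drop_short_and_long(sample)

print([len(w) for w in kept])
```

A word of exactly 20 letters is kept and one of 21 is dropped, which matches the “greater than 20 letters” wording.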
More…