One of the reasons I like the phpBB2 search system so much is that I can understand it. If I can understand it, I can tweak it. And that’s what an entire series of blog posts has ended up being about. And they’re not over yet. But I do have to be honest: the search system provided with phpBB2 does have two big challenges. It doesn’t scale well, so very large boards will likely have to do a lot of tweaks, turn it off, or seek some alternative. And it is missing one extremely important feature that most users look for: the ability to search for an exact phrase.
I had a conversation with naderman on IRC a few weeks ago. He mentioned that he had read some of my search statistics (which was cool for me to hear). And he shared some statistics about how phpbb.com has been running, which were very enlightening. You see, phpbb.com is not running the standard search process right now; they are testing an implementation of sphinx that naderman is working on. That is why (if you hadn’t noticed) the “search this topic” feature doesn’t work, as he hasn’t had the time to integrate that part of it yet.
What is sphinx? Glad you asked.
Why is this important? Searching is A Big Deal, and I put that in capital letters for a reason. To be very frank, certain parts of phpBB are quite simple. Oh, there is a lot of code wrapped around it, there are parts that are quite elegant, but at the very core a message board isn’t that big of a deal to write. In the phpBB “Origin Story” I think theFinn said he had a working prototype in a few days. I am not trying to denigrate the efforts of any of the developers, past, present, or future. I am saying that the basic concept of Board -> Category -> Forum -> Post just isn’t that hard. phpBB3 has improved on the implementation substantially, but the core aspects of the system are the same.
But searching, and searching well, now that to me is more like rocket science. Look at how much money Google and other search engines pour into research and development every year. It’s not a trivial thing. Here’s a quote from another site (linked below) that talks about the challenges of building a web search engine. We’re talking about searching a database and they’re searching the web, but I think the comments are still relevant:
At serve time, you have to get the results out of the index, sort them as per their relevancy to the query, and stick them in a pretty Web page and return them.
If it sounds easy, then you haven’t written a search engine. Remember, first, that some queries have more than one word. This means that you have to intersect the index entries for the two words. My advice is to have them presorted in some canonical URL number order so that you can view the two (n) index entries as two stacks and pop until the tops are equal, in which case, you win the prize: the URL is in both index entries. These sorts of computations have to be run at query time and they need to be run quickly, so think hard about how you are going to do intersections.
Next problem, query time ranking. Now that you have the list of URLs, you have to rank them according to your relevancy algorithm. This has to be fast. People are waiting.
That’s from the last page of the article, but I think it sums up search challenges well. Read it, index it, but be prepared to spit it back out on request. As the author says, people are waiting.
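To make the “two stacks” idea from the quote a bit more concrete, here is a minimal Python sketch of intersecting two presorted index entries. It is only my illustration of the general technique, not code from the article, phpBB, or sphinx; the function name and the sample post IDs are made up, and I’m using post IDs where the quote uses URL numbers.

```python
def intersect_sorted(a, b):
    """Return the IDs present in both presorted lists, in sorted order."""
    i, j = 0, 0
    matches = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Tops are equal: this ID is in both index entries.
            matches.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Discard the smaller top and keep looking.
            i += 1
        else:
            j += 1
    return matches


# Hypothetical index entries: sorted post IDs for two search words.
posts_with_sphinx = [3, 8, 15, 42, 97]
posts_with_search = [8, 9, 42, 50, 97, 120]

common = intersect_sorted(posts_with_sphinx, posts_with_search)
print(common)
```

Because both lists are already sorted, each one is walked only once, which is the whole reason for the “presorted in some canonical order” advice.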
At this time we have two search options that I can see being used for phpBB. One is the built-in algorithm (which I have already said many times that I like), and the other is to ignore that code and fall through to MySQL’s full text option. In the second case we’re not writing any search algorithms, we are relying on the talented MySQL folks to do so instead. Which is really my point for this post.
And that point is? Yes, I know you have been asking that as I ramble on and on.
My point is that it takes a lot of effort to really understand and write a good search algorithm. It’s not simple. Efficient search algorithms are complicated. There are two parts, and both are in my opinion equally important. First index and then retrieve. Many database systems do an excellent job of both. But figuring out how to create your index and then process it during the scan and retrieval? There’s a bit of magic involved.
So instead of spending the time and effort to reinvent that particular wheel, I find it interesting that naderman has been researching other options. Some of the features from the sphinx web site, and my comments to go with them…
high indexing speed (up to 10 MB/sec on modern CPUs)
Clearly this is important at the beginning, as you will want to index your database / site prior to using the search process.
high search speed (avg query is under 0.1 sec on 2-4 GB text collections)
This is probably the most important. Searching performance on very large phpBB boards is what suffers the most. My board with nearly 350K posts is still chugging along nicely but I have made a few tweaks to the search process and have included an extensive stopwords list. Speaking of which…
No matter how efficient something is, I can’t imagine that you want to index the word “the” a quarter billion times.
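For anyone who hasn’t seen a stopwords list in action, here is a rough Python sketch of the idea: skip the noise words while building the word index. This is not phpBB’s actual code; the tiny stopword set, the tokenizing regex, and the function name are all assumptions I made up for the example.

```python
import re
from collections import defaultdict

# Made-up sample; a real stopwords list would be far longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}


def build_index(posts, stopwords=STOPWORDS):
    """Map each non-stopword to the set of post IDs that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in stopwords:
                index[word].add(post_id)
    return index


# Hypothetical posts keyed by post ID.
sample_posts = {
    1: "The search is slow",
    2: "Sphinx makes the search fast",
}

index = build_index(sample_posts)
print(sorted(index))
```

The word “the” appears in both posts but never makes it into the index, so it costs nothing at search time.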
high scalability (up to 100 GB of text, up to 100 M documents on a single CPU)
Now that sounds impressive. I don’t have anywhere near one GB of database content, much less 100 times that much. If I could dedicate one of my four CPUs to searching and handle everything I throw at it, that would be interesting.
supports phrase searching
Yes! *pumps fist in the air* This is what’s missing from phpBB, and while I have played with ideas of retrofitting something into the existing search process, wouldn’t it be great if it was already there?
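Why is phrase searching harder than word searching? A plain word index only knows which posts contain a word, not where. One common way around that is to store word positions as well. The Python sketch below shows that general idea under my own assumptions; I have no idea whether sphinx does it this way internally, and every name and sample post here is hypothetical.

```python
import re
from collections import defaultdict


def build_positional_index(posts):
    """Map word -> post ID -> list of positions where the word occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for post_id, text in posts.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for pos, word in enumerate(words):
            index[word][post_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted post IDs where the words of phrase appear consecutively."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for post_id, starts in index[words[0]].items():
        for start in starts:
            # Each later word must sit exactly k positions after the first.
            if all(start + k in index[w].get(post_id, ())
                   for k, w in enumerate(words[1:], 1)):
                hits.append(post_id)
                break
    return sorted(hits)


# Hypothetical posts keyed by post ID.
sample_posts = {
    1: "full text search is hard",
    2: "search the full text",
    3: "text search full stop",
}

pos_index = build_positional_index(sample_posts)
print(phrase_search(pos_index, "full text"))
```

Post 3 contains both “full” and “text” but not next to each other in that order, so it is left out. The price is a bigger index, since every occurrence of every word gets recorded.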
supports phrase proximity ranking, providing good relevance
I am going to be on the fence for this one. Relevance engines have come a long way, but I’ve still seen weird associations that defy explanation. It’s cool that it’s there, but I would like to see it work.
naderman has said that once he gets this engine fully integrated with phpbb.com, he will release it as a MOD. Unfortunately for me, it’s going to be for phpBB3. I bet I can work with it though, and I look forward to seeing the results of his efforts.
Some Related Links