Home

Your premium source for custom modification services for phpBB

  logo

HomeForumsBlogMOD ManagerFAQSearchRegisterLogin

Comments September 12, 2009

Experimenting With FULLTEXT Indexing

Filed under: Search, phpBB — Dave Rathbun @ 10:43 am CommentsComments (8) 

I continue to get feedback from my users that – to be concise – the search process sucks. :) As regular readers of my blog will probably remember, I have done a lot of work to understand and fine-tune the standard phpBB search process. I have moved stop words into the database. I have adjusted the regular expression used to parse and index the words. I have added code to provide cleaner input to the search routine. All of these changes were made to optimize the process as it works today.

But folks are still not happy.

They don’t like the fact that certain words are on the stop words list. My board is related to a specific brand of software used for reporting. It’s not too surprising, then, that the word “report” appears in nearly 30% of the half-million posts on my board. Yet they still feel like they would gain value by having that word in their search for some reason.

They don’t like the fact that short words (which in our case includes version numbers) are not included either.

They don’t like the fact that they can’t search for word combinations (exact phrase search).

So today I started testing out a FULLTEXT index on my posts table. I created the index on both the post text and the title. It took a minute and a half and spiked my CPU to about 33% use. The index is over half the size of the database table. On the other hand, the index is smaller than the index on the search_wordmatch table so that’s something positive.

Over the coming weeks I am going to be experimenting with different search keywords and will be trying to get some metrics as to how well the fulltext index performs. There are three aspects that I am hoping to use to rate the success of this experiment. First, how fast are the results provided. Second, how effective are the results. Third, how easy is it going to be to give the user an interface to use the new index.

Stay tuned for more details.

8 Comments

  1. Good to see you’re making the right decision. ;-) Yeah, FULLTEXT seems like it’s got some draw-backs (one of them being index size roughly 3/4 of data size), but really, when you weigh the pros versus cons, it’s truly the way to go when you want a lot of flexibility in the search. MySQL has come a long way since a lot of the discussions on it at phpbb.com back in 2003-04.

    Comment by Dog Cow — September 13, 2009 @ 2:06 pm

  2. So far I have been underwhelmed by the accuracy of the search results. When ordered by “relevance” I get lots of unrelevant posts. Or I get posts that I know are not even close to the best answer to the question.

    I have been experimenting with running search queries directly on the database. Eventually I will alter search.php to provide the interface required to let the users do the same sort of search.

    Comment by Dave Rathbun — September 14, 2009 @ 10:44 am

  3. Are you using the right search type? There’s boolean, and then there’s some other types as well. You need > 5.0 for the best searching features.

    Comment by Dog Cow — September 14, 2009 @ 1:16 pm

  4. I have been testing with IN BOOLEAN MODE and specifying various different search terms that I am very familiar with. They keep pulling up posts from three or four years back. If I go this route, I believe I am going to have to come up with some sort of weighting so that more recent posts carry more weight somehow.

    Comment by Dave Rathbun — September 14, 2009 @ 6:10 pm

  5. “They keep pulling up posts from three or four years back.”
    You can use an AND clause, you know. ;-)

    Comment by Dog Cow — September 14, 2009 @ 6:54 pm

  6. Use an AND clause … to do what? Limit the search to more recent posts?

    What I had expected (hoped) was the relevance engine would do that job for me. I know my board; I know the content very well. Okay, so I don’t have all 500,000 posts memorized, but I do know what I expect to see. So far, I’m not getting it.

    More experimentation is required.

    Comment by Dave Rathbun — September 16, 2009 @ 1:16 am

  7. MySQL FULLTEXT doesn’t know anything about the dates of posts. Therefore, you need to use an AND clause with post_time column

    Comment by Dog Cow — September 16, 2009 @ 4:34 pm

  8. I’m not saying what I want to very well it seems.

    I know that when I search by text fields it doesn’t pay attention to dates. What I had hoped, or rather what I had expected, was that the most “relevant” posts would be the most recent because the discussions are better. Instead when I search for words X, Y, and Z often times I come up with a topic with a single post from four years ago. It’s not very helpful, and I know there are other topics (posts) that in my mind would match better.

    It’s an issue of the relevance engine, not really the dates. I wasn’t clear about that before.

    Comment by Dave Rathbun — September 16, 2009 @ 11:35 pm

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress