Home

Your premium source for custom modification services for phpBB

  logo

HomeForumsBlogMOD ManagerFAQSearchRegisterLogin

Comments August 30, 2007

Yet Another Search Tweak

Filed under: Search, phpBB — Dave Rathbun @ 3:26 am CommentsComments (8) 

One of the ideas that comes up regularly when discussion tweaks to the phpBB2 search system includes the idea of setting up some sort of cron job that helps to maintain the search index table. There are a lot of transactions that hit that table during the posting process, and the posting process is what the user sees. If the posting process is slow, then the board “feels” slow. So I really like this next tweak, as it can improve the posting process quite a bit.

It’s quite simple, really. I’m going to show how to get a potentially huge boost in performance during the process of editing or removing posts. This tweak removes 3 queries (or one really big one if you’re not using MySQL). And one of those queries is a nasty one with both a group by and a having clause. :shock: And the kicker? Most of the time it’s a query that scans a whole bunch of rows, and returns nothing! I will explain what to do, and then why.

Do This Now, Thank Me Later

Open includes/functions_search.php
Find this code:

function remove_search_post($post_id_sql)
{
        global $db;

        $words_removed = false;

After, Add:

/*

Find this code:

        $sql = "DELETE FROM " . SEARCH_MATCH_TABLE . "
                WHERE post_id IN ($post_id_sql)";

Before, Add:

*/

What this does is comment out (effectively remove) a whole block of code. So that’s the “what”, how about the “why”?

What Do I Lose?

A brief review is in order. The phpbb_search_wordlist table contains a list of words that appear in one or more posts on your board. The phpbb_search_wordmatch table contains the index of which posts include those words. When you insert a new post, any new (unique) words are inserted into the phpbb_search_wordlist table and assigned word_id values. Then the combinations of the word_id and post_id values are inserted into the phpbb_search_wordmatch table. Both of these steps need to happen, so I won’t change those. I have already optimized this a bit in prior posts by changing how the stopwords and synonyms are processed.

But what about editing a post? Now that process can be tweaked. I can gain a huge boost in performance while giving up very little functionality.

Faster Edits are Good

When I edit a post phpBB does not try to keep track of a “before” and “after” picture of the text of my post. Instead, it will remove every row from the phpbb_search_wordmatch table (the index). Then it will check to see if any of the words in my post were unique to that post. If so, it will remove them from the phpbb_search_wordlist table. Here’s the problem with that. phpBB2 takes 3 queries against a MySQL database to do that operation. And one of those queries is really ugly, it looks like this:

$sql = "SELECT word_id
	FROM " . SEARCH_MATCH_TABLE . "
	WHERE word_id IN ($word_id_sql)
	GROUP BY word_id
	HAVING COUNT(word_id) = 1";

What this is doing is taking a list of word_id values and getting a count of how many times they appear in the index table. Then the “having” clause kicks in and drops any words that are counted more than once. That’s the uniqueness test, and it can be really slow.

I had at first revised this query to look like this:

$sql = "SELECT word_id
	,	count(word_id) as word_count
	FROM " . SEARCH_MATCH_TABLE . "
	WHERE word_id IN ($word_id_sql)
	GROUP BY word_id";

What this did was skip the having clause and instead return the count of the word_id back to the result array. From there, I checked to see if the count was one or not. That might have saved some time on the query, but it turns out that I decided it didn’t matter anyway. No, instead I am going to skip the entire query. And the one before it, and the one after it. Here’s why.

I don’t care about removing words from the wordlist table!

Think about it. When I edit or remove a post, sure, I want the index table to be cleaned up. But the odds are good that if a word was entered into the wordlist once before, it will be used again. The wordlist table isn’t the problem in this situation. Going back to the original definition… the phpbb_search_wordlist table contains words that appear in one or more posts on your board. After this tweak the definition will be a table that contains words that appear in zero or more posts on your board.

In other words, the “cost” of this tweak is that you might end up with words that don’t belong to any posts. The “benefit” is that you skip 2 or 3 queries every time someone edits or deletes a post. I am willing to make that trade, especially since the “having count(word_id) = 1″ query has been showing up in my slow query log on my server. Every time one query slows down, all of the other queries – even the tuned ones – can suffer.

By the way, this exact same process is run during pruning. So it’s not just edits where this will make a difference.

Summary

This is a subtle tweak that improves editing and removing posts. If you never edit or delete (or prune) a post, this tweak will not help. But if you edit posts on a large board you might notice how slow the submit process is. That’s because it is first determining which words are in the post, then figuring out which are unique to that post, then removing those unique words from the wordlist table, then finally removing all the index rows. That is four queries to remove the post from the search index. If you don’t care if your words stick around, do this tweak. It will still maintain your search index, but skip the extra maintenance on the wordlist table.

Here is the funny part: if the edits were minor, then the code is going to immediately process some of those same words and put them right back into the database again! :lol:

I am testing this tweak on my largest board right now.

To test I made the changes documented in this post. I created a post with an easily identified (unique) word and saved it. I confirmed that the word was in the word list table and in the index table. Then I deleted that post. The index rows were gone, but the word was still in the wordlist table. I can live with that.

And if I later decide I can’t, then this is a perfect application for a cron job. Run a job at midnight that checks your wordlist table and removes any words that do not appear in the index. 8)

8 Comments

  1. You speak logic here.Have done this hope the edits will become faster.

    Regards.

    Jack.

    Comment by Jack — April 16, 2008 @ 2:21 am

  2. I contacted you some time ago about the database mod. Promised to write here..sorry for the delay.
    I have installed this tweak and it IS faster now. :) Keep up the good work. Thank you!

    Comment by Thomas — September 16, 2008 @ 2:45 pm

  3. Thomas, glad to hear it’s faster. It’s an easy edit with substantial benefits and few disadvantages.

    Comment by Dave Rathbun — September 16, 2008 @ 3:22 pm

  4. What would be the query that one would run in the cron to “Checks your wordlist table and removed any words that do not appear in the index”

    Thanks

    Comment by JLA — October 9, 2008 @ 9:47 am

  5. Hi, JLA, welcome to my blog. :)

    There are several ways you could go about finding words from the word table that do not exist in a post. But to be honest, it’s probably not worth the effort. The words table probably has less than 250,000 rows in it, even on a large board. I just checked mine and with nearly half a million posts there are only 234,421 unique “words” in the wordlist. The main thing I would check is the min / max length of words in your phpbb_search_wordlist table. If you have words shorter than 3 or words longer than 20 you need to fix your regular expression.

    But to answer your query, the most straightforward would be to use a “Not In” clause but it’s not going to have the best performance:

    SELECT word_id, word_text
    FROM phpbb_search_wordlist
    WHERE word_id NOT IN
       (SELECT distinct word_id
          FROM phpbb_search_wordmatch)

    Something like that would give you a list of words that appear in your wordlist table that do not appear in your search index.

    I ran this on my board just now and found 3,573 “orphaned” words. I don’t plan to remove them at this time. They account for about 1.5% of the rows in the table. If it got up to 10% I might get concerned.

    Comment by Dave Rathbun — October 9, 2008 @ 8:49 pm

  6. Thanks for sharing your work, I think this will be of great benefit for everyone (who has to stay with phpBB2 due to special changes that make an update very complicated ;) )

    You know… I have a very strange result when executing the code to get the “orphaned” words :) Out of 150,000 words in total there seem to be 145,000 “orphaned” words… Very strange…

    Moreover, I have not yet modified the databases, but since my last backup (160 MB) my database has shrunk radically. I now know it is because the “wordlist database combination” now has 150,000 entries (today) instead of 1,450,000 (a week ago) oO is this a known issue? Maybe the database was deleted by the server administrator as it was to big? :D

    Or can this be some sort of pruning function? Well though I don’t know much about it, I sense there ought to be something wrong, as I once executed your code to show the “most referenced” words, the most used word has the count of 38 (in 100,000 Posts !?)

    This night the admin will launch the reindexing process, I hope this will work.

    Oh well, what about your workaround for german characters? It’s a german board -.- It might sound kind of arrogant, but you ever tried to replace the chars? ä -> ä or ä -> ae; (hm this could lead to global stuff, right?)

    Yours Chris

    Comment by Chris — November 16, 2008 @ 4:17 pm

  7. Hi, Chris, and welcome. To take your comments in reverse order…

    I am quite comfortable with a lot of things, but dealing with global languages issues is not one of them, unfortunately. Your suggestion may very well be exactly what is needed but at this time I don’t have the time to set it up and test it thoroughly.

    If you are getting mostly orphaned words, and your search tables are tiny, then what has likely happened is your database table got dropped or emptied at some point along the way. I venture to guess that after you run your “rebuild search” code you will see both of those issues addressed.

    Comment by Dave Rathbun — November 17, 2008 @ 9:01 am

  8. Thanks for your reply.

    Well that’s the problem,right? It seems almost impossible to replace all possible characters -.-

    After reindexing the table we have >1,500,000 entries in the wordmatch stuff plus 120,000 entries in the wordlist (after adding some more stopwords).

    Mhh was this problem of splitted words there before I attached your mod or is it created by this “[ ]” -> “\b” replacement? (I did not want to read through all of the former wordlist entries,the most common ones were without special characters – and certainly were incomplete as I mentioned in my last comment)

    Well however I’m looking forward to more interesting posts of you :) When I have time,I will read through your blog :) Everything I found by now was really nice to know / could come in handy :)

    And thanks again :)

    YouRs Chris

    Comment by Chris — November 17, 2008 @ 1:06 pm

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress