Home

Your premium source for custom modification services for phpBB

  logo

HomeForumsBlogMOD ManagerFAQSearchRegisterLogin

Comments March 4, 2007

Rebuilding Search Index: Performance Results

Filed under: Search, phpBB — Dave Rathbun @ 1:27 pm CommentsComments (8) 

Yesterday I posted about moving the stopwords file into the database and changing the way the stopwords (and search syonyms) are processed. Then I rebuilt the search index for my largest board. The results? I reindexed 298,070 posts in about five and a half hours. Read more for details…

Last year when I ran the rebuild index process it took well over eight hours for around 200,000 posts. Yesterday the post count stood at 298,070, so if you project from last year I would have expected the process to run for closer to 12 hours this time. After making the changes suggested in the prior post (moving search stopwords and search synonyms into the database, skip reading the current text files, applying a uniquefy process to the words to be processed, increasing the number of stopwords as appropriate) here are some numbers for you to consider. In the chart below, the line indicates total elapsed time. Each marker represents a “batch of posts” that consists of all of the posts from 1 to 100, 101 to 200, 201 to 300, and so on.

Elapsed Time Graph

There are four time trials in all, two with the “old” logic and two with the “new and improved” logic. Here are the numbers behind the graph; I think they’re fairly compelling:

Elapsed Time Graph

Here is the easy conclusion: the new stopwords process results in a 50% reduction in processing time. The average time per 100 posts under the old code is eleven seconds (0:11) while the average under the new code is five (0:05).

What happened when I let the process run on my large database? First, I found some bugs in my code. :lol: I had projected that the process would run for five hours and thirty seven minutes (5:37:00). It actually ran for 6:04:11. Hm, well, okay. So my estimate was off, right? It turns out that wasn’t the problem. The problem was that php doesn’t barf if you manage to spell a variable name wrong. ;-) Yes, there are settings for that, I don’t have them turned on. So what did happen during those 6 hours?

What happened was the every post got processed, and the stopwords were handled as I designed, but no stopwords were actually removed from the post. So it turns out do be a happy accident, really. Recall that it took over eight hours last time for me to run the reindexing process, and then it took six for 50% more posts. The result was that I got over 12 million rows in my search index table because none of the stopwords were removed. :shock:

Let me clarify my point a bit: the process ran, and ran efficiently. With stopwords in the database instead of a text file I shaved many hours of actual runtime off of the process. It just didn’t stop the stopwords that it found, so they got inserted into the database along with the more valuable words that I wanted to index.

The next step was for me to fix that bug and rerun the process. Remember that I had originally projected it to run in 5:37:00? After fixing the bugs and running the process a second time, the total elapsed time was 5:25:07. Not bad. :-) And after the fix, I had about 6.5 million rows in my stopwords table. This is why I consider the bug to be a happy accident as now I have some more interesting numbers to analyze. Specifically:

With stopwords I reduced the size of my search index from 12.3 million to 6.5 million, a 48% reduction. There’s the value in stopwords, right there. Before I cleaned up my bug I had nearly 238K posts with the word “the” indexed. Yes, that’s 238,000 out of 298,000 posts that contained the word “the”. You can see why it would be pointless to index that word, along with many of the other more common words. I wish now that I had determined the most popular word in my database before deleting everything; that might have been an interesting piece of data.

Another interesting observation: the difference in the total execution time pre-bug and post-bug was only 39 minutes. That means that it took my server 39 minutes to insert the extra 5.8 million rows of data into the search index table. Since the code was essentially the same (other than actually removing the stopwords from the processed data) the remaining time was all insert statements.

Summary
The process took 5:25:07 to process 298,070 posts. There were 172,367 “words” indexed. There are 6,450,990 records in the phpbb_search_wordmatch table now that it’s all done. Due to the efficiencies gained through my revised code I have increased the number of stopwords from 287 (my original file) to 1,215. And no, I am not going to go back and try to test the old code with that size of stopwords file. :-P

The reason my stopwords file increased so much is I added all of the three-digit numbers to my file, starting from 100 and going to 999. I noticed quite a few numeric values in my search index, and decided that I don’t need those. In fact, I’m going to see about removing any “number” words from the process via a regular expression instead… something that looks for any “words” composed strictly of the digits 0 – 9 and remove them. I guess that’s another step in the MOD. 8)

8 Comments

  1. Dave,

    As always a great post with some very interesting results. Be interesting to hear someone like naderman comment on this type of research. As he seems to be the dev handling searching in phpBB3, and we still have the same concept now called ignore words.

    You plan on covering search synonyms at all? I know are used on the query side of the search engine + phpBB ships with some funny ones. My fav being “gynecology gynaecology”, useful for 0.0001% of boards!!

    Comment by Esmond Poynton — March 5, 2007 @ 6:02 am

  2. Esmond, your last two comments have been marked as spam. :-?

    Search synonyms are, in fact, a topic of an upcoming post. They don’t work the way I expected them to. They work, of course, but they’re just different. In my revised code they have also been moved into the database, and they have been given the same treatment. That is to say that after moving them to the database I can then check for synonyms only of the words that are actually in use rather than applying the entire synonym file… even to a post with only one word. :-)

    I like the fact that search synonyms can be used to “correct” for words that are commonly spelled wrong.

    I had not known who the main developer was for the search process. And thanks for confirming that some of this research might even be appropriate for phpBB3 as well.

    I should make one thing very clear: I like the phpBB2 search process. I think that if we could figure out a way to effectively add a “search exact phrase” option to the process it would be awesome. The point of this series of posts has been to justify to detractors why I like the search process ;-) and along the way, as I have discovered things, to document some tweaks that can make it even better.

    As far as future posts, I am going to talk about the search synonym process, the “common words” problem, an enhancement to search.php that I’m writing that will help the user understand how to get better search results… and a few more that at the moment I don’t remember. :-) For example we have the concept of search stopwords which I have covered now in great detail, the concept of search synonyms which is coming… I have an idea for a MOD called Search Addwords aka the Jargon MOD. There are certain “words” that get dropped because they are too short (or too long) or structured in such a way that they get ignored. For certain boards it might be vital to search for certain two-letter words, but not all two-letter words.

    I plan to expand on the process of putting search words into the database by creating a MOD that allows board owners to add words that they want to ensure get included in the search index. Those words would likely be very unique to each board. For example, on your board maybe you want people to be able to search for “Mustang V8 GT” or something like that. Well, GT and V8 would be ignored because they’re too short. With a Seach Addwords table you could ensure that those words do, in fact, get included in the search index. It seems to be a logical extension of the existing search process; I’m surprised I have not seen it suggested somewhere else before.

    Which doesn’t mean that it hasn’t been… it just means I haven’t seen it. :-D

    Thanks again for your comment; I wish I knew what you had done to trigger the spam filter. If at some point a comment of yours gets removed, it was an accident and nothing personal. Some times I open the spam list and click “Delete All” without reading through the list; it depends on how many are waiting.

    Comment by dave.rathbun — March 5, 2007 @ 7:56 am

  3. Think I only made that one post :-?
    Looking forward to the upcoming posts. Doing some reading myself on DB design and queries, as my MOD has a few queries “using temporary” in the MySQL extra explain. So maybe a post on where index’s should be created would be good.

    Comment by Esmond Poynton — March 6, 2007 @ 1:57 am

  4. That comment was tagged by Akismet also. Fortunately I read them before deleting them. I don’t know what’s going on.

    If you care to post a query and the explain plan, along with the index structure for the tables involved, I will be happy to use it as a case study for a future post. 8)

    Comment by dave.rathbun — March 6, 2007 @ 2:16 am

  5. OK…here goes!! Hope the formatting works :) (edited to correct formatting, Wordpress uses html rather than bbcode. – Dave)

    This query is to get vehicles with the most money spent on them.
    Vehicles are stored in phpbb_garage_vehicles. Each vehicle can have lots of modifications, each modification is stored in phpbb_garage_modifications. Each modification has two “costs” associated with them, a purchase and install price. We need to get the sum of these and return the vehicles with the highest sum. Each vehicle is own by one user stored in the regular table and each vehicle consists of a make and a model stored in table phpbb_garage_makes & phpbb_garage_models

    Hope that makes sense. Query is as below.

    SELECT v.id,
    	CONCAT_WS(' ', v.made_year, mk.make, md.model) AS vehicle,
    	v.user_id,
    	(SUM(m.install_price) + SUM(m.price)) AS POI,
    	u.username,
    	v.currency,
    	u.user_colour,
    	u.user_id
    FROM (
    	phpbb_garage_vehicles v,
    	phpbb_garage_makes mk,
    	phpbb_garage_models md,
    	phpbb_users u,
    	phpbb_garage_modifications m
    )
    WHERE m.vehicle_id = v.id
    	AND v.make_id = mk.id
    	AND mk.pending = 0
    	AND v.model_id = md.id
    	AND md.pending = 0
    	AND v.user_id = u.user_id
    GROUP BY v.id
    ORDER BY POI DESC
    LIMIT 5

    So our explain plan gives us the follow. We can see we have “using temporary”, which all my reading I believe to be a indicator of poor index/query structure.

    +-----+---------------+----------+------------+-----------------------------+---------+---------+----------------------+-------+----------------------------------+
    |  Id |   Select Type |   Table  |    Type    |       Possible Keys         |   Key   |  KeyLen	|         Ref          |  Rows |      Extra                       |
    +-----+---------------+----------+------------+-----------------------------+---------+---------+----------------------+-------+----------------------------------+
    |  1  |     SIMPLE    |     m    |     ALL    |	vehicle_id_2,vehicle_id     |         |         |                      |   3   |  Using temporary; Using filesort |
    |  1  |     SIMPLE    |     v    |    eq_ref  |	PRIMARY,user_id             | PRIMARY |    4    | phpbb3.m.vehicle_id  |   1   |                                  |
    |  1  |     SIMPLE    |     u    |    eq_ref  |	PRIMARY                     | PRIMARY |    3    | phpbb3.v.user_id     |   1   |  Using where                     |
    |  1  |     SIMPLE    |     mk   |    eq_ref  |	PRIMARY                     | PRIMARY |    4    | phpbb3.v.make_id     |   1   |  Using where                     |
    |  1  |     SIMPLE    |     md   |    eq_ref  |	PRIMARY                     | PRIMARY |    4    | phpbb3.v.model_id    |   1   |  Using where                     |
    +-----+---------------+----------+------------+-----------------------------+---------+---------+----------------------+-------+----------------------------------+

    So we have 5 table involved and we have the following structure, I will not bother listing phpbb_users as you know that one!! ;)

    Table description for phpbb_garage_vehicles

    +------------------+----------------------------+---------+-------+---------------+----------------+
    +  Field           |              Type          |   Null  |  Key  | Default       |    Extra       |
    +------------------+----------------------------+---------+-------+---------------+----------------+
    | id               | int(10) unsigned           |   NO    |  PRI  | 	NULL      | auto_increment |
    | user_id          | int(10)                    |   NO    |  MUL  | 	0         |                |
    | made_year        | varchar(4)                 |   NO    |       |     2003      |                |
    | engine_type      | varchar(32)                |   NO    |       |     NULL      |                |
    | colour           | varchar(128)               |   YES   |       |     NULL      |                |
    | mileage          | int(10) unsigned           |   NO    |       |     0         |                |
    | mileage_unit     | varchar(32)                |   NO    |       |     Miles     |                |
    | price            | int(10) unsigned           |   YES   |       |     NULL      |                |
    | currency         | varchar(32)                |   NO    |       |     USD       |                |
    | comments         | varchar(255)               |   YES   |       |     NULL      |                |
    | views            | int(10) unsigned           |   NO    |  MUL 	|     0         |                |
    | date_created     | int(10)                    |   YES   |  MUL  | 	NULL      |                |
    | date_updated     | int(10)                    |   YES   |  MUL 	|     NULL      |                |
    | make_id          | int(10) unsigned           |   NO    |       |     0         |                |
    | model_id         | int(10) unsigned           |   NO    |       |     0         |                |
    | main_vehicle     | tinyint(1)                 |   NO    |       |     0         |                |
    | weighted_rating  | double                     |   NO    |       |     0         |                |
    | pending          | tinyint(1)                 |   NO    |       |     0         |                |
    +------------------+----------------------------+---------+-------+---------------+----------------+

    Indexes for phpbb_garage_vehicles

    +---------------+----------+-------------------+--------------------------------+
    |    Keyname    |   Type   |    Cardinality    |  Field                         |
    +---------------+----------+-------------------+--------------------------------+
    | PRIMARY       | PRIMARY  |  3                | id                             |
    | date_created  | INDEX    |  None             | date_created                   |
    | date_updated  | INDEX    |  None             | date_updated                   |
    | user_id       | INDEX    |  None             | user_id                        |
    | views         | INDEX    |  None             | views                          |
    +---------------+----------+-------------------+--------------------------------+

    Table description for phpbb_garage_makes

    +------------------+----------------------------+-------------+-------+---------------+----------------+
    +  Field           |              Type          |    Null     |  Key  | Default       |    Extra       |
    +------------------+----------------------------+-------------+-------+---------------+----------------+
    | id               | int(10) unsigned           | NO          |  PRI  | NULL          | auto_increment |
    | make             | varchar(255)               | NO          |  MUL  |               |                |
    | pending          | tinyint(1)                 | NO          |       | 1             |                |
    +------------------+----------------------------+-------------+-------+---------------+----------------+

    Indexes for phpbb_garage_makes

    +---------------+----------+-------------------+--------------------------------+
    |    Keyname    |   Type   |    Cardinality    |  Field                         |
    +---------------+----------+-------------------+--------------------------------+
    | PRIMARY       | PRIMARY  |  91               |  id                            |
    | make          | INDEX    |  None             |  make                          |
    +---------------+----------+-------------------+--------------------------------+

    Table description for phpbb_garage_models

    +------------------+----------------------------+-------------+-------+---------------+----------------+
    +  Field           |              Type          |    Null     |  Key  | Default       |    Extra       |
    +------------------+----------------------------+-------------+-------+---------------+----------------+
    | id               | int(10) unsigned           |  NO         |  PRI  | NULL          | auto_increment |
    | make_id          | int(10) unsigned           |  NO         |  MUL  | 0             |                |
    | model            | varchar(255)               |  NO         |       |               |                |
    | pending          | tinyint(1)                 |  NO         |       | 1             |                |
    +------------------+----------------------------+-------------+-------+---------------+----------------+

    Indexes for phpbb_garage_models

    +---------------+----------+-------------------+--------------------------------+
    |    Keyname    |   Type   |    Cardinality    |  Field                         |
    +---------------+----------+-------------------+--------------------------------+
    |    PRIMARY    |  PRIMARY |        811        | id                             |
    |    make_id    |  INDEX   |        None       | make_id                        |
    +---------------+----------+-------------------+--------------------------------+

    Table description for phpbb_garage_modifications

    +------------------+----------------------------+-------------+-------+---------------+----------------+
    +  Field           |              Type          |    Null     |  Key  | Default       |    Extra       |
    +------------------+----------------------------+-------------+-------+---------------+----------------+
    | id               | int(10) unsigned           |     NO      |  PRI  | NULL          | auto_increment |
    | vehicle_id       | int(10) unsigned           |     NO      |  MUL  | 0             |                |
    | user_id          | int(10)                    |     NO      |  MUL  | 0             |                |
    | category_id      | int(10) unsigned           |     NO      |  MUL  | 0             |                |
    | manufacturer_id  | int(10) unsigned           |     NO      |       | 0             |                |
    | product_id       | int(10) unsigned           |     NO      |       | 0             |                |
    | price            | int(10) unsigned           |     NO      |       | 0             |                |
    | install_price    | int(10) unsigned           |     NO      |       | 0             |                |
    | product_rating   | tinyint(2)                 |     YES     |       | NULL          |                |
    | purchase_rating  | tinyint(2)                 |     YES     |       | NULL          |                |
    | install_rating   | tinyint(2)                 |     YES     |       | NULL          |                |
    | shop_id          | int(10)                    |     YES     |       | NULL          |                |
    | installer_id     | int(10)                    |     YES     |       | NULL          |                |
    | comments         | text                       |     YES     |       | NULL          |                |
    | install_comments | text                       |     YES     |       | NULL          |                |
    | date_created     | int(10)                    |     YES     |  MUL  | NULL          |                |
    | date_updated     | int(10)                    |     YES     |  MUL  | NULL          |                |
    +------------------+----------------------------+-------------+-------+---------------+----------------+

    Indexes for phpbb_garage_modifications

    +---------------+----------+-------------------+--------------------------------+
    |    Keyname    |   Type   |    Cardinality    |  Field                         |
    +---------------+----------+-------------------+--------------------------------+
    | PRIMARY       | PRIMARY  | 3                 | id                             |
    | user_id       | INDEX    | None              | user_id                        |
    | vehicle_id_2  | INDEX    | None              | vehicle_id, category_id        |
    | category_id   | INDEX    | None              | category_id                    |
    | vehicle_id    | INDEX    | None              | vehicle_id                     |
    | date_created  | INDEX    | None              | date_created                   |
    | date_updated  | INDEX    | None              | date_updated                   |
    +---------------+----------+-------------------+--------------------------------+

    Comment by Esmond Poynton — March 6, 2007 @ 8:21 am

  6. Esmond, I corrected the formatting. :-) Wordpress uses html rather than bbcode, so if you want something bold use <strong> </strong> and for code the best option seems to be <pre></pre>.

    I haven’t had a chance to look over this in detail yet, but there is one flaw in the code that may or may not have an impact on performance. It’s a common error on MySQL databases. When you create a query with an aggregate function (like sum() as you have) then you need to have every non-aggregate column appear in the GROUP BY. I will have to post more about that in a blog post. MySQL allows you to be “sloppy” and get away with not doing that, but it can have an impact on your data.

    Also, you are doing sum(x) + sum(y) where both x and y come from the same table. I would think that sum(x+y) could be slightly more efficient, although it won’t change your index usage.

    Final question: there is no “where clause” other than the expected joins. I believe you said that this was to get the top five cars with the most dollars spent on mods, right? What I would do is get those first in a separate query, retrieve the five vehicle IDs from that query, then go back and do the second query to get just those five.

    Here’s why I think two queries is more efficient than one in this case. First, when you have a group by (even an incomplete one ;-) ) the entire query has to run to completion before the first row can come back. That means that all of your joins are being done for every car in your database, even though you only want five.

    I would do this:

    SELECT	m.vehicle_id,
    	SUM(m.install_price + m.price) AS POI,
    FROM (
    	phpbb_garage_modifications m
    )
    GROUP BY m.vehicle_id
    ORDER BY POI DESC
    LIMIT 5

    That query should run much faster than the larger query that you have. It will return a dataset of five car id values. Next, replace your bigger query with this:

    SELECT v.id,
    	CONCAT_WS(' ', v.made_year, mk.make, md.model) AS vehicle,
    	v.user_id,
    	u.username,
    	v.currency,
    	u.user_colour,
    	u.user_id
    FROM (
    	phpbb_garage_vehicles v,
    	phpbb_garage_makes mk,
    	phpbb_garage_models md,
    	phpbb_users u
    )
    WHERE	v.id IN ( ... insert ids from prior query here ... )
    	AND v.make_id = mk.id
    	AND mk.pending = 0
    	AND v.model_id = md.id
    	AND md.pending = 0
    	AND v.user_id = u.user_id

    You will notice that I skipped the “modifications” table in the second query. You have the total modification dollars (currency type) already from the prior query, so you don’t need to have the overhead of recalculating it in this query.

    The net result – without any benchmarking on my part, I should add – should be an improvement, I would think. The first query still runs a GROUP BY and a SUM but only on one table. No joins are required. From that query you will get a dataset (array) of vehicle ID values and their total modification dollars. The second query gets the user and vehicle specifics for those top 5 vehicles.

    Comment by dave.rathbun — March 16, 2007 @ 11:18 am

  7. Dave,

    Thanks for taking the time to have a look. I will do some benchmarking and post the results.

    Comment by Esmond Poynton — March 16, 2007 @ 12:49 pm

  8. Do that, please. At some point I will look at the indexing scheme, but I suspect that breaking things into two queries will help.

    Comment by dave.rathbun — March 16, 2007 @ 8:35 pm

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress