Yesterday I posted about moving the stopwords file into the database and changing the way the stopwords (and search synonyms) are processed. Then I rebuilt the search index for my largest board. The results? I reindexed 298,070 posts in about five and a half hours. Read more for details…
Last year when I ran the rebuild index process it took well over eight hours for around 200,000 posts. Yesterday the post count stood at 298,070, so projecting from last year I would have expected the process to run closer to 12 hours this time. After making the changes suggested in the prior post (moving the search stopwords and search synonyms into the database, skipping the repeated reads of the text files, applying a “uniquefy” step to deduplicate the words to be processed, and increasing the number of stopwords as appropriate) here are some numbers for you to consider. In the chart below, the line indicates total elapsed time. Each marker represents a “batch of posts” consisting of all of the posts from 1 to 100, 101 to 200, 201 to 300, and so on.
There are four time trials in all, two with the “old” logic and two with the “new and improved” logic. Here are the numbers behind the graph; I think they’re fairly compelling:
Here is the easy conclusion: the new stopwords process cuts processing time roughly in half. The average time per 100 posts under the old code is eleven seconds (0:11), while the average under the new code is five (0:05).
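For anyone following along from yesterday’s post, here is a minimal sketch of the setup change, written against the phpBB 2.x conventions ($db, $table_prefix, message_die()). The search_stopwords table and stop_word column are hypothetical stand-ins for whatever names you actually used:

```php
<?php
// Load the stopword list from the database once, before the batch loop,
// instead of re-reading the stopwords text file on every pass.
// NOTE: the table and column names here are hypothetical; use your own.
$sql = 'SELECT stop_word FROM ' . $table_prefix . 'search_stopwords';
if (!($result = $db->sql_query($sql)))
{
    message_die(GENERAL_ERROR, 'Could not load stopwords', '', __LINE__, __FILE__, $sql);
}

$stopwords = array();
while ($row = $db->sql_fetchrow($result))
{
    // Keep the words as array keys so each lookup is a hash check
    // rather than an in_array() scan over 1,200+ entries.
    $stopwords[$row['stop_word']] = true;
}
```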
What happened when I let the process run on my large database? First, I found some bugs in my code. I had projected that the process would run for five hours and thirty-seven minutes (5:37:00). It actually ran for 6:04:11. Hm, well, okay. So my estimate was off, right? It turns out that wasn’t the problem. The problem was that PHP doesn’t barf if you manage to spell a variable name wrong. Yes, there are settings for that; no, I don’t have them turned on. So what did happen during those six hours?
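For the curious, those settings are error_reporting and display_errors. A misspelled variable is just an undefined variable to PHP; it raises an E_NOTICE (an E_WARNING as of PHP 8), evaluates to null, and execution carries on. A minimal demonstration:

```php
<?php
// Surface every notice and warning; many configurations hide E_NOTICE
// by default, which is exactly how a typo slips through unnoticed.
error_reporting(E_ALL);
ini_set('display_errors', '1');

$stopwords = array('the', 'and', 'of');

// Typo: $stopword (no trailing "s"). PHP reports an undefined variable
// but keeps executing, and the misspelled name behaves as null/empty.
$removed = in_array('the', (array) $stopword);

var_dump($removed); // bool(false): 'the' is never treated as a stopword
```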
What happened was that every post got processed, and the stopwords were handled as I designed, but no stopwords were actually removed from the posts. It turns out to be a happy accident, really. Recall that the reindexing process took over eight hours last time, and this time it took six for 50% more posts. The result was that I got over 12 million rows in my search index table, because none of the stopwords were removed.
Let me clarify my point a bit: the process ran, and ran efficiently. With stopwords in the database instead of a text file I shaved many hours of actual runtime off of the process. It just didn’t stop the stopwords that it found, so they got inserted into the database along with the more valuable words that I wanted to index.
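The fix turned out to be small. A sketch of the corrected removal step, assuming the post’s text has already been split into $words and $stopwords is keyed by word as in the earlier sketch:

```php
<?php
// Deduplicate the words from the post, then drop anything that
// appears in the stopword lookup before building the insert statements.
$indexable = array();
foreach (array_unique($words) as $word)
{
    if (!isset($stopwords[$word]))
    {
        $indexable[] = $word;
    }
}
// Only $indexable is written to the search index tables.
```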
The next step was to fix that bug and rerun the process. Remember that I had originally projected it to run in 5:37:00? After fixing the bugs and running the process a second time, the total elapsed time was 5:25:07. Not bad. And after the fix, I had about 6.5 million rows in my search index table. This is why I consider the bug a happy accident: now I have some more interesting numbers to analyze. Specifically:
With stopwords I reduced the size of my search index from 12.3 million rows to 6.5 million, a 48% reduction. There’s the value in stopwords, right there. Before I cleaned up my bug I had nearly 238K posts with the word “the” indexed. Yes, that’s 238,000 out of 298,000 posts containing the word “the”. You can see why it would be pointless to index that word, along with many of the other most common words. I wish now that I had determined the most popular word in my database before deleting everything; that would have been an interesting piece of data.
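If you want to capture that data before cleaning up your own index, a query along these lines should do it. The phpbb_search_wordmatch table is named below; phpbb_search_wordlist and its word_id/word_text columns are my reading of the phpBB 2.x schema, so verify the names against your own database:

```php
<?php
// Top 20 most frequently indexed words, counted by distinct posts.
$sql = 'SELECT w.word_text, COUNT(DISTINCT m.post_id) AS post_count
        FROM ' . $table_prefix . 'search_wordmatch m
        JOIN ' . $table_prefix . 'search_wordlist w ON w.word_id = m.word_id
        GROUP BY w.word_text
        ORDER BY post_count DESC
        LIMIT 20';
$result = $db->sql_query($sql);

while ($row = $db->sql_fetchrow($result))
{
    echo $row['word_text'] . ': ' . $row['post_count'] . "\n";
}
```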
Another interesting observation: the difference in the total execution time pre-bug and post-bug was only 39 minutes. That means it took my server 39 minutes to insert the extra 5.8 million rows of data into the search index table, or roughly 2,500 rows per second. Since the code was essentially the same (other than actually removing the stopwords from the processed data), the remaining time was all insert statements.
The process took 5:25:07 to process 298,070 posts. There were 172,367 “words” indexed. There are 6,450,990 records in the phpbb_search_wordmatch table now that it’s all done. Due to the efficiencies gained through my revised code I have increased the number of stopwords from 287 (my original file) to 1,215. And no, I am not going to go back and try to test the old code with that size of stopwords file.
The reason my stopwords file grew so much is that I added all of the three-digit numbers, from 100 through 999. I noticed quite a few numeric values in my search index and decided that I don’t need those. In fact, I’m going to see about removing any “number” words from the process via a regular expression instead: something that looks for any “words” composed strictly of the digits 0 – 9 and removes them. I guess that’s another step in the MOD.
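A sketch of that step. The pattern below is my reading of “composed strictly of the digits 0 – 9”, so all-digit words of any length are dropped, not just the three-digit ones:

```php
<?php
// Filter all-digit "words" out of the list before the stopword pass.
// The ^ and $ anchors ensure mixed tokens like "3com" or "v2" survive.
$words = array('the', '101', 'cat', '2024', 'v2');

$words = array_values(array_filter($words, function ($word)
{
    return !preg_match('/^\d+$/', $word);
}));

print_r($words); // Array ( [0] => the [1] => cat [2] => v2 )
```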