January 26, 2007

How does search work? Part I: Table Review

Filed under: Search, phpBB — Dave Rathbun @ 11:46 am

One of the most frequent complaints about phpBB is the search process. For smaller boards it’s really not noticeable. For larger boards (100K posts and above) it becomes very noticeable. What is “it” that I am talking about?

The size of the search tables.

I should start this by stating that I happen to like the search process that phpBB offers, even with all of its quirks. So this is not going to be a negative “search-bashing” post. Instead, it’s going to be the first of several posts where I talk about the search process, what’s good about it, and – yes, I’ll be honest – what I would like to see improved.

First, I will review the database and functional design. There are two main tables (other than the obvious one of phpbb_posts_text) that drive the search process. The first table is the search word list aka phpbb_search_wordlist. The second is the list of posts where each word appears, known as phpbb_search_wordmatch. Here is the structure of each table in a standard phpBB installation.

desc phpbb_search_wordlist;
+-------------+-----------------------+------+-----+---------+----------------+
| Field       | Type                  | Null | Key | Default | Extra          |
+-------------+-----------------------+------+-----+---------+----------------+
| word_text   | varchar(50) binary    |      | PRI |         |                |
| word_id     | mediumint(8) unsigned |      | MUL | NULL    | auto_increment |
| word_common | tinyint(1) unsigned   |      |     | 0       |                |
+-------------+-----------------------+------+-----+---------+----------------+
3 rows in set (0.00 sec)

desc phpbb_search_wordmatch;
+-------------+-----------------------+------+-----+---------+-------+
| Field       | Type                  | Null | Key | Default | Extra |
+-------------+-----------------------+------+-----+---------+-------+
| post_id     | mediumint(8) unsigned |      | MUL | 0       |       |
| word_id     | mediumint(8) unsigned |      | MUL | 0       |       |
| title_match | tinyint(1)            |      |     | 0       |       |
+-------------+-----------------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

The first table listed (phpbb_search_wordlist) contains a list of the unique words present in all of the posts on your board. :shock: Think about it: there are only so many words in the English language (the language I speak). Of course you can throw in the occasional technical word or jargon phrase, but at some point the rate at which new words appear should slow down. The wordlist table contains two main columns… one stores the actual word text, which can be up to 50 characters long (but isn’t supposed to be; more on that in a bit), and the second is the word_id. Any database table should have a primary key of some sort, and in this case the word_id is our key, the numerical equivalent of the word text.

The auto_increment feature from MySQL lets the php code do less work. As each new word is captured and stored, the word_id automatically increments to the next available value. (As a side note: an auto_increment field has to be unsigned, as you can’t auto-increment a negative value. That explains why the user_id field does not have an auto_increment attribute; the anonymous user has a user_id of -1, which rules out an unsigned column.)
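As a quick illustration of the auto_increment behavior (a sketch only; the word value is made up and the actual phpBB code builds its inserts differently), MySQL assigns the id for you and will hand it back on request:

insert into phpbb_search_wordlist (word_text, word_common) values ('example', 0);

select last_insert_id();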

The word_common field does not appear to be used. I will skip this and move on for now; a future search post might go back and talk about this.

Now back to the length of “50” for the word text… I said something about that a few paragraphs ago. First, a bit of history. When I started with phpBB they had recently released phpBB2; my first installation was 2.0.1. At that point the code that split out the words and stored them in this table was designed so that words shorter than 3 letters were not stored. There isn’t much point in indexing common words like “a” or “is” or “on” anyway. The same logic ignored words over 20 letters long. I can verify this by running the following query:

select	length(word_text)
,	count(*)
from	phpbb_search_wordlist
group by length(word_text)

The results:

+-------------------+----------+
| length(word_text) | count(*) |
+-------------------+----------+
|                 3 |     7044 |
|                 4 |    14059 |
|                 5 |    16861 |
|                 6 |    19921 |
|                 7 |    20574 |
|                 8 |    20465 |
|                 9 |    16934 |
|                10 |    16054 |
|                11 |    10792 |
|                12 |     8063 |
|                13 |     6235 |
|                14 |     4520 |
|                15 |     3445 |
|                16 |     2554 |
|                17 |     1839 |
|                18 |     1420 |
|                19 |     1131 |
|                20 |      904 |
+-------------------+----------+
18 rows in set (0.60 sec)

Now these are real values, from a real database with over a quarter-million posts. In case you want to know, there are 172,816 “words” in my word list. :-) But notice that there are no words of 2 or fewer letters, and none of 21 or more. That means the code is working.

Somewhere along the way the regex used to parse the post words was updated, and it broke. When I run the exact same query on a newer board I get words that are 1, 2, 21, 22, or even more letters long. This is the first thing that needs to be fixed if you want to tune your search system. I’ll show how to actually do that in a later post. No, I’m not trying to tease; just stay focused. It’s hard for me. :-)
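If you want to check your own board, a query along the same lines (a sketch; adjust the table prefix if yours differs) counts everything outside the intended 3-to-20 letter range. On a healthy index it should return zero:

select	count(*)
from	phpbb_search_wordlist
where	length(word_text) < 3
or	length(word_text) > 20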

So what we have so far, then, is a table designed to store words and give each of them a unique system-assigned ID. Why?

That’s best explained by describing the next table, the phpbb_search_wordmatch table. This is generally the biggest table in your database as far as row count. Other tables might be bigger as far as actual disk space required, but not likely. So what is it for?

This table is the index that tells phpBB where every word from the “word list” table is ever used, in any post on your board. Seriously, that’s what it is. The word list table is where we store the word_id for each distinct word, and this table contains a mapping of word_id / post_id combinations. So if the word “the” were entered into the database (it won’t be; it’s a “stopword”, more on that in another post) and you used it in posts 42, 79, and 113, you would have three rows in your wordmatch table. That’s why the table can grow so large.
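To make that concrete (the word_id value here is made up purely for illustration), those three posts would show up in the wordmatch table like this:

select	post_id, word_id, title_match
from	phpbb_search_wordmatch
where	word_id = 1234;

+---------+---------+-------------+
| post_id | word_id | title_match |
+---------+---------+-------------+
|      42 |    1234 |           0 |
|      79 |    1234 |           0 |
|     113 |    1234 |           0 |
+---------+---------+-------------+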

The big board I posted stats from earlier? It has 6,618,480 entries in the “word match” table. :shock: That is – by far – the largest table in my database. It takes nearly a minute to create a backup file of just that one table, and that’s on a dual Xeon server. But what’s the point? Why do we need all of that?
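If you are curious how large your own match table has grown, a row count and the table status will tell you (standard MySQL commands, nothing phpBB-specific; the second one also reports the approximate disk space used):

select	count(*)
from	phpbb_search_wordmatch;

show table status like 'phpbb_search_wordmatch';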

Searching is an interesting art form. Google and other search engines have to come to your site and capture the content, and then store it in some way that makes it easy to retrieve. How do they do it? I have no idea. :-) But I do know how the phpBB search process works. I’ll finish this post with a brief review, and save more details for my continuation post.

Simply put: each time you enter a post, the words of that post are indexed. So you pay a little overhead each time you click the “submit” button. But when someone searches for a word (say, “phpbbdoctor”, as an example :-P ), here’s what happens.

Step 1. Query the phpbb_search_wordlist table and obtain the word_id for “phpbbdoctor”
Step 2. Get a list of posts from the phpbb_search_wordmatch table that include that word
Step 3. Display
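Expressed as SQL, the first two steps boil down to something like this (a sketch only; the real code builds the queries dynamically and handles more than one word at a time):

select	word_id
from	phpbb_search_wordlist
where	word_text = 'phpbbdoctor';

select	post_id
from	phpbb_search_wordmatch
where	word_id = 1234;

(The 1234 stands in for whatever word_id the first query returned.)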

That’s it. It’s simple, it’s elegant, and it’s fast, because you don’t have to brute-force search a quarter-million posts for a single word. You already have an index of every post that word appears in. This works with wildcards, and it works with multiple words.
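Wildcards work because the word lookup in the first step can use a LIKE comparison instead of an exact match, so a search such as “phpbb*” can resolve to several word_ids at once (again a sketch; the exact query the code builds will differ):

select	word_id
from	phpbb_search_wordlist
where	word_text like 'phpbb%';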

It doesn’t, however, work with exact phrases, and that’s one of my biggest complaints about the process. But more on that in another post.

11 Comments

  1. Thanks for the numbers. I expected that the search data would occupy a lot of the database space, but didn’t expect such high values.

    Waiting for Part II, search_id, and an explanation of why phpbb.com allows only one page of results for popular queries.

    Comment by olpa — January 26, 2007 @ 10:39 pm

  2. Welcome, olpa, and thanks for your comment.

    Part II talks about the search “stopwords” file, how it can be used to trim down the amount of space taken, and a MOD I am planning that will make it more efficient. I will try to cover the search_id in a later post. I think I have plans for parts II, III, IV, and perhaps V. There is a lot to talk about with searching. :-)

    As far as phpbb.com only allowing one page of results, that’s probably not intentional. :-) I made a comment some time back about the MOD Search utility doing that, and it got fixed. I have not tried to get more than one page of results using the regular search.

    Comment by dave.rathbun — January 27, 2007 @ 12:36 am

  3. Search for “spam” on phpbb.com. Page 2 says… oh god, it works now. Nice.

    Some time ago it was not working. I was going to complain, but found that some search queries were limited intentionally.

    Comment by olpa — January 29, 2007 @ 10:59 pm

  4. So…. why don’t you guys use fulltext indexes instead? This seems very inefficient.

    Comment by Stephen — October 26, 2007 @ 1:09 pm

  5. And by that I mean database, table-level fulltext indexes. You’re accomplishing this on the application level, which seems the wrong level.

    Comment by Stephen — October 26, 2007 @ 1:11 pm

  6. Hi, Stephen, and sorry for the delay in approving your comments. I have been busy with other things for a bit.

    Fulltext is a MySQL feature. phpBB is not a MySQL-only system, so (from what I can tell) they implemented this system to be portable. There have also been tests suggesting that in some cases this algorithm can be more efficient. There are a number of posts in the 2.0 phpBB Discussion forum at phpbb.com that talk more about this.

    Comment by Dave Rathbun — October 29, 2007 @ 10:35 am

  7. Cool, thanks for the info!

    Comment by Stephen — March 1, 2008 @ 11:03 pm

  8. Wow, nice text! I was searching for something to make the phpBB search function work better and I’m glad I’ve come here!

    Comment by Mori — March 20, 2008 @ 3:55 am

  9. Hi, Mori, welcome to the phpBB Doctor blog. There is an entire series of posts all about the search function so there’s plenty to digest. I’m using most if not all of these techniques on my largest board right now, and it has nearly 400,000 posts. I think we’re in good shape all the way up to a million at this point. :)

    Comment by Dave Rathbun — March 21, 2008 @ 7:17 pm

  10. Hi. Thanks for the explanation. Is there any way (other than the stopwords) of compressing this table? Or is it always as efficient as it can be, and you can’t make it smaller without losing information?

    Comment by Raoul Teeuwen — May 16, 2008 @ 3:11 am

  11. Sorry, already found http://www.phpbbdoctor.com/blog/2007/03/04/rebuilding-search-index-performance-results/

    Comment by Raoul Teeuwen — May 16, 2008 @ 3:14 am
