Home

Your premium source for custom modification services for phpBB

  logo

HomeForumsBlogMOD ManagerFAQSearchRegisterLogin

Comments March 6, 2007

How does search work? Part VI: Search Synonyms

Filed under: Search, phpBB — Dave Rathbun @ 1:49 am CommentsComments (12) 

Post Updated March 18, new notes at the end, thanks.

I recently completed some code that moves the “stopwords” into a database table and changes the way they are applied to a post or to the search process. At the same time I also moved the search synonyms into a database table. During testing I was very interested to find out that the way I thought the search synonyms were applied is not the way they are actually applied at all. This post will clarify how the synonyms are used, and point out something interesting about the internal consistency of the phpbb search_synonyms.txt file.

If you take a brief look at your search_synonyms.txt file you just might come to the same conclusion that I did. Here is a sample of some words from the top of the file:

center centre
check cheque
color colour
comission commission
comittee committee
commitee committee
conceed concede
creating createing
curiculum curriculum
defense defence
develope develop

What we have here is a mix of American and UK spellings (center versus centre) and some common alternate (or wrong) spellings (commitee committee) and so on. But how are these words used? I entered the following post in my board:

I need to check the colour of my cheque

Here is what got stored in my search index table:

+-----------+---------+
| word_text | word_id |
+-----------+---------+
| check     |     642 |
| cheque    |    9072 |
| color     |    1349 |
+-----------+---------+

Hm. There is something a bit strange going on here. I see two words that are supposed to be synonyms, both indexed. I didn’t expect check and cheque to both be there. After all, my colour is missing, right?

It turns out that there was a bug in my Efficient Cleanwords MOD. The last word in a sentence would not be properly processed as a stop word or as a synonym. I discovered this when I went back and added some words to the end of my sample sentence like this:

I need to check the colour of my cheque for this post

Once I did that, I got this:

+-----------+---------+
| word_text | word_id |
+-----------+---------+
| post      |     442 |
| color     |    1349 |
| check     |     642 |
+-----------+---------+

Now that’s more like it. Is it? Well, at least we can start talking about how synonyms work.

The Efficient Cleanwords MOD code has been updated to fix this bug. This is the sort of “alpha” testing I typically try to do of all of my MODs, as I despise releasing buggy code. Doesn’t mean I don’t do it, I just despise it. :-)

What I originally expected to have happen was that if I used the word color in a post that the search index would include both color and colour as alternative spellings. That way if one of my friends from “across the pond” were to come to the phpBB Doctor site and search, they would be able to find posts that included the word colour. Even though nobody would really ever type it that way. :-P

On reflection, that was a really dumb idea. Why would we want to increase the size of our search index tables by storing something twice? Almost everything I have posted in this series has been about reducing the size of the searchwords table, or making the interactions with that table more efficient. Adding more data is not the way to do that.

So as it turns out, it is quite simple: any iteration of the word color (color or colour) is indexed as the shorter spelling, as that’s the first word on the line in the synonyms text file. That means that cheque will always be indexed as check. And that centre will always be indexed as center. And that… wait a minute, let’s take a closer look at that list again…

center centre
check cheque
color colour
comission commission
comittee committee
commitee committee

The synonyms process always maps the second word to the first. So what is wrong with this picture? Do you know how to spell committee? That’s not a language thing, as far as I know it’s always spelled committee. It’s certainly not what you see in the first words in the listing shown above…

So I entered the following post:

A committee is a group of people able to accomplish nothing

Here’s the results from the indexing process:

+------------+---------+
| word_text  | word_id |
+------------+---------+
| able       |     206 |
| accomplish |    3573 |
| comittee   |    9075 |
| group      |    1495 |
| people     |    1166 |
+------------+---------+

Now I don’t know about you, but “comittee” in my book is spelled with two m’s and two t’s. And since that’s how the word is spelled in the second column of the synonyms text file, it seems obvious to me that someone, well, someone goofed. :-) Here’s the relevant code from an unmodified version of the clean_words() function:

list($replace_synonym, $match_synonym) = split(' ', trim(strtolower($synonym_list[$j])));

The “replace” word is first, the “match” word is second. So if the code finds a match for the second word, it is replaced by the first. Oops.

As it turns out, there is a problem with the logic used in the code compared to the actual format of the search_synonyms.txt file. I will be fixing that, and probably posting a bug report. I am guessing that since many of the phpBB developers are not of US persuasion, that they looked at this file and assumed colour and centre were, of course, the desired words. So they naturally assumed that the proper word would be listed second. :lol: That is complete speculation on my part, and a tug on the leg of whoever was responsible for setting up the contents of this file. Dare I guess it might have been done by “comittee”? :-D

Having said that, just what is the impact? Does it still work?

When the word “committee” comes through as part of a post every synonym in the text file is checked. That means “committee” is replaced by “comittee” first. Then the second occurrence of “committee” is skipped because it no longer matches anything. My redesigned table-driven process suffers from the same issue as I simply loaded the synonyms table straight away into the table without really checking to see that it was defined correctly.

But does search work? Ironically, yes, it will. When you enter the word “committee” as a search term, it will be remapped to “comittee” which is, of course, indexed. So a cynical person might suggest that perhaps there was no error, and that the shorter word was simply stored as a way to preserve space. :-P I would buy that, except that there are two lines with the word “committee” on them, and they are therefore clearly backwards.

There are other backwards entries, such as these:

heighth height
milage mileage
morgage mortgage

Remember the first word is what will get stored in the index, the second word is what is matched in the post. So if someone enters the word “mortgage” in a post (which is spelled correctly) it will be stored in the search table as “morgage” instead. There are also other “doubled” entries such as these examples:

maintainance maintenance
maintenence maintenance
ommision omission
ommission omission
suprise surprise
surprize surprise

You might argue that it would be faster and easier to reverse the php code in the clean_words() function… except that not everything is reversed!

I will leave it to you to examine your search_synonyms.txt file and fix the errors that you might find. Just remember that the second word is the “mistake” or alternate spelling, and the first word on the line is what will actually be stored in your index.

It doesn’t break the search. But in the current format any words that appear doubled are not going to work, as the second synonym line will never be used.

I don’t have to do anything to fix my code related to pushing synonyms into the database, as the code is fine. I will, however, have to clean up my syonyms table data. It’s a good thing that as part of my MOD I created an ACP page to allow me to manage my synonyms, right? ;-)

Summary
I have logged a bug with the phpBB Group, but I don’t expect anything to happen from it. I don’t mean that in a sarcastic or cynical way… it’s just that this bug is certainly not security or performance related, and the fix would be quite challenging. Think about it; you would have to alter the contents of the search_synonyms.txt file (easy) and then rebuild your index tables (hard). I understand phpBB3 includes a rebuild index feature, but phpBB2 does not. I don’t expect that they would fix this, but perhaps they’ll take a closer look at phpBB3 to make sure it doesn’t suffer from the same issue.

My ACP Stopwords Manager MOD (not yet published) will address this by providing a sequence of SQL statements to load the table correctly. So there’s another reason to consider looking at the MOD when it comes out. 8)

The Efficient Cleanwords() MOD does not do anything to the stopwords or synonyms processing. I would untimately expect that it will become a part of the Stopwords Manager MOD but I will also release it as a stand-alone MOD for those that want to retain the standard stopwords processing.

Update (March 18, 2007)
It seems that someone else posted the exact same bug years ago. The bug was closed by one of the developers, and for the reasons I expected. Any fix is not simply a code fix but would also require a rebuild of the search_wordmatch and search_wordlist tables as well. Since those features are not in phpBB2 (they are in phpBB3) it would require a MOD rather than a core code fix.

I feel a bit ambivalent about this. One the one hand, this is hardly a major issue. The only exposure is that if you have two (or more) synonyms for the same word, only the first is ever processed. Is that a huge deal? Probably not.

I will be fixing it with my MOD. Once search synonyms are moved into the database (and managed via the ACP) a board administrator will be able to easily correct their data. I will probably not release my own “rebuild search” MOD but instead will suggest that board owners install one of the others already released at phpbb.com instead.

12 Comments

  1. Dave,
    I’ve been reading your search blogs with interest. I’ve got two questions:

    1. Are you going to make a mod or concise set of instructions to fix/enhance a stock PHP 2.0.22 board?

    2. Is your search word rebuild mechanism already a MOD? I looked in PHPBB hacks and there was a MOD for that, but it is marked as applying to 2.0.6.

    I would like to clean up the short and long indexed search words and use an improved algorithm after you figure out the foreign word issue.

    Comment by John Marino — March 15, 2007 @ 8:52 am

  2. John, I absolutely intend to put all of this research into a MOD, or more than one MOD.

    The Efficient Cleanwords() MOD would work with any rebuild MOD, mostly because all of the rebuild MODs that I have seen (including my own) simply gather a batch of posts together and run the standard phpBB search words maintenance code. The cleanwords() function is a part of that, so if you use that MOD it will help any rebuild MOD.

    What language do you use?

    Thanks for your comment, welcome to the Doctor Blog. 8)

    Comment by dave.rathbun — March 15, 2007 @ 2:41 pm

  3. I’m multilingual :)

    I did not see your rebuild mod on phpbbhacks, and phpbb.com is down indefinitely, so I don’t know where I would get a copy. I guess what would be nice is a combined mod or set of mods that would fix the regex, update the cleanwords, and provide an admin utility to rebuild the search database. With those tools (and perhaps some tweaking to stopwords), wouldn’t that “repair” a stock 2.0.22 board? Or is there something else that needs tweaking?

    Comment by John Marino — March 16, 2007 @ 11:40 am

  4. I have not released my “rebuild” MOD anywhere, as there are a couple out there already. I wrote my own to see if I could improve on it, and I believe that there is one at phpBB.com that has more features than I am willing to write so I doubt I will ever release mine. I will continue to use it on my own sites.

    I have not had time to complete my research into why the regex fails on non-english language boards. When I was asking about your language, I wasn’t trying to pry, but I was wondering if you would be willing to test some code on your system.

    The search synonyms is wrong, as I posted recently. To fix that would require fixing the synonyms data file, or implementing the search syonyms table. I will be releasing the search stopwords/synonyms as a MOD, but not before I add one more feature to it. I want to add “search addwords”, which are words that would normally be eliminated because of their length or because they contain symbols and whatnot. This is a real complaint at my biggest board where it is impossible to search for version numbers like 6.5.1 because the period is dropped and the remaining characters are all too short to be indexed.

    Once phpBB.com comes back up (which I would imagine would have to be soon at this point) and you can find a rebuild MOD, would you be willing to try the Efficient Cleanwords MOD and give me your feedback? Alternatively, if you have a smaller board in a language other than german/swiss german that you could allow me to use for testing, I would appreciate it. The data would be treated as strictly confidential, and installed only on a server at my house which is not exposed to the Internet. I completely understand if the answer to both questions is “No”, so don’t worry about hurting my feelings. 8)

    Comment by dave.rathbun — March 16, 2007 @ 8:33 pm

  5. The board in question is english-based with about 35K posts. It’s not stock, but it’s not heavily modded either. That’s because I previously modded it’s predecessor heavily and paid a big price when the critical updates came up.

    I would be willing to install the rebuild mod and the efficient clean words mod. If you are asking for a complete mysql dump of the board, that can be tricky. While I only have one confidential forum (the rest are public view), the users table is very much enhanced and contains privacy data far beyond what a normal phpbb forum contains, and I really can’t release that info. perhaps a selective dump for that table would be the answer. I’m not sure how a stock english board would help you though.

    On a separate rant regarding Olympus: I’ve actually released 2 successful MOD’s for PHPBB 2. I found working with the PHPBB community so distasteful both during the approval process. I also thought they handled the PHPBB 2.2 update process (originally scheduled for 2003?) and the justified criticism for their programmatic decisions, blown deadlines, unrealistic goals and milestones extremely poorly and with hypersensitivity. Now, almost 5 years later, I’m looking in Area 51 (which I had been on and off for years) and I just don’t see a huge functionality improvement. I read about how people can’t stand PHPBB 2 after running PHPBB 3 beta, but I’m completely missing what they are raving about. From outward appearances, it’s a minor version (i.e. V2.2) update of PHPBB 2.0. So why would anybody in their right mind rush to PHPBB 3 when PHPBB 2 is only now becoming usable from a security standpoint and while the PHPBB community has fragmented (read: all the original guys are long gone) and proven they can’t manage a project well. There will be years of bug fallout from the release of PHPBB 3. I think PHPBB 2 had plenty of momentum in 2002 and was well on their way to give all forum projects a run for their money, but that swell is gone.

    And to here the community, you’d think they were building a database engine or a new operating system. It’s a freaking message board. I personally, with no help, could start from scratch a build a forum with more functionality in 4 months, so 5 years for an upgrade is absurd.

    To top it off, I had a lot of great ideas for new MOD’s, for functionality that doesn’t exist (similar to my Topic Cement MOD), but after such a negative experience there’s no way I was going to take the time to develop and contribute any more than I did. I was asked once or twice to port my mods to PHPBB 3, but I’m not inclined to do that.

    I have no idea why I ranted, but it felt good. I guess I was supported your decision to wait a long time to migrate to PHPBB 3. Maybe it’ll give you an idea for a blog topic.

    John

    Comment by John Marino — March 18, 2007 @ 6:07 am

  6. Um. Wow. :shock:

    What a post. :-) I’ll touch on most (but not all) of the points you brought up…

    If you decided to share some of your data with me for testing, all I would really need is the content from the topics, posts, and posts_text tables. I would not need any user data. In fact, what I would do (and what I did last time I got some external data for testing) is assign every single post and topic to the same user. That helps some with privacy concerns, as it addresses the user identities. It does not address the privacy concerns about what the text that was actually posted. I can’t really address that, other than to take every possible precaution that the data is used only for my testing, and deleted immediately afterwards.

    You could even extract topic and post data from one visible forum and dump that. I’ve done that before using a Select… Into command. If you’re not familiar with that syntax it’s a great way to get a reduced set of data. Suppose that you wanted to get a list of topics from forum 42 into a new table:

    create table forum42_topics as
    select * from phpbb_topics
    where forum_id = 42

    After that command completes you will have a table with the identical structure as that used for phpbb_topics (minus any indexes) and content based on topics from the selected forum. If you want to create an empty table with the same structure simply supply a where clause that will never be true, as in:

    create table forum42_topics as
    select * from phpbb_topics
    where forum_id = -99

    The tricky part is capturing the phpbb_posts_text table since the forum_id does not appear, but you can do that with a simple join:

    create table forum42_posts_text as
    select pt.post_id as post_id
    , bbcode_uid
    , post_subject
    , post_text
    from phpbb_posts_text pt
    , phppbb_posts p
    where pt.post_id = p.post_id
    and p.forum_id = 42

    You can’t get away with “select *” in this case as you’re using a join of two tables, and you don’t want everything from the phpbb_posts table to come along too. You can replicate this type of query for many purposes.

    It doesn’t matter at the moment, as it’s going to be a few months before I get back to testing anyway. :-) And if your board is english then I don’t really need to get your data, as I have plenty of english posts to use.

    Now to your rant… first, I’m glad you felt better afterwards. :-)

    I will admit that I see more than one side of the story. I can agree that phpBB2 is a solid product, and I believe it has life moving forward even after the release of phpBB3. I will also agree that from what you can see from the front area51 does not appear to have dramatically different features. Yes, attachments are built in. Yes, subforums are now a core feature instead of a cumbersome add-on. But if you only look from the front you’re not seeing the bulk of the upgrades. Like an iceberg, you have to look under the surface to see the bigger part of the picture. :-)

    SQL handling is different. Language handling is different. Moderations tasks are different. The admin control panel is in a different galaxy, it’s so different from what we have today. :-) Security has been rebuilt from the ground up. The fact that phpBB3 looks and acts so much like phpBB2 is – in my opinion – simply because it’s a forum. There are only so many ways that you can process forums, topics, and posts.

    I really don’t want to comment on why phpBB3 has taken so long or the personnel turnover or other things that you touched on. I have an opinion, but I prefer to keep it to myself. :-)

    Regarding the MOD process… I have submitted a few MODs, and most of them have been approved. I have chronicled the story of my Page Permissions MOD in quite a few blog posts. It went through numerous submissions (and rejections). The first time it was rejected I got what I considered to be some very inappropriate comments from the MOD validator. Frankly, it pissed me off. I had a nasty response all written out and ready to send when discretion took over and I cancelled the message.

    I came back a few days later and composed my reply instead. There were, in fact, legitimate concerns raised by the MOD validator and I had no problems with that. In fact, I learned from the experience, which is something that I value. What I sent to the MOD Validator was a brief comment explaining that while I valued the content of their review, the mechanism and style used to deliver their message was way out of line. Technical content? Good. Editorial comments? Very bad. :-)

    It was partially because of that feedback that it took so long to get Page Permissions approved. I also lost interest in the process. I did eventually come back and finalize the submission and ultimately got the MOD released.

    I personally would never want to join the MOD Validator Team. I think it’s a no-win situation to be in. They are always behind, partially because of the enthusiasm of the phpBB2 MOD community. And that, to some degree, is because of the longevity of the phpBB2 codebase. The longer something stays the same, the more people become familiar with it, and the more potential MOD writers there are. And the further behind they get.

    I’m sorry you had a bad experience. From my experience I would suggest that if enough time has passed you might try another submission to see if things have improved. If you feel that they have not, then a properly structured personal message (meaning constructive criticism) sent to wGEric (the MOD Team leader) might be appropriate.

    If you want to take this offline, feel free to send me a PM at phpbb.com (username “drathbun”).

    You mentioned looking at phpbbhacks for rebuild MODs. They may very well have some, but you won’t find any of my MODs at that site. Let me put it another way: if you find one of my MODs at that site then they were not submitted with my approval. At this time I only release MODs at this site (phpbbdoctor.com) or at phpbb.com.

    Comment by dave.rathbun — March 18, 2007 @ 11:14 am

  7. hi there, is there any chance to download whole mods or parts?
    i’ve searched a bit for specific mods of yours but with no luck.

    i’m pretty interested in synonyms/stopwords. i saw few not complete attepts on How to optimize search on big forums (phpbb) but i think that it is not complete


    if (trim($entry) !== '') //subject lines are blank sometimes so skip this if needed
    {
    //get array of words
    $word_array = split_words($entry);

    if ( !empty($word_array) )
    {
    //get unique words
    $word_array = array_keys(array_flip($word_array));

    $sql = "SELECT word_text
    FROM " . STOPWORDS_TABLE . "
    WHERE word_text IN ('" . implode("','", $word_array) . "')";

    if ( !($result = $db->sql_query($sql)) )
    {
    message_die(GENERAL_ERROR, 'Could not select stopwords', '', __LINE__, __FILE__, $sql);
    }
    while ( $row = $db->sql_fetchrow($result) )
    {
    $stopword = trim($row['word_text']);
    if ( $mode == 'post' || ( $stopword != 'not' && $stopword != 'and' && $stopword != 'or' ) )
    {
    $entry = str_replace(' ' . strtolower($stopword) . ' ', ' ', $entry);
    }
    }
    $db->sql_freeresult($result)
    }
    }

    this is pretty good start, but this “mod” won’t solve INSERT into stopwords … which i can’t really figure out how this is realized

    Comment by kriznik — September 20, 2010 @ 8:33 am

  8. Hi, kriznik, and thanks for your question. There are a whole lot of posts on my blog about optimizing search. The stopwords information in a basic phpBB2 installation is stored in a text file called search_stopwords.txt located in the \language\lang_english\ folder. To add new stop words, edit that file. However, if you have stop words stored in a database table, then you can either insert them manually with a SQL script or design an input screen to be used via the admin panel. Is that what your question is? If not, please post back and let me know.

    Comment by Dave Rathbun — September 20, 2010 @ 5:54 pm

  9. hi Dave,
    it’s nice to have such a fast reply! :)
    Let’s say for start that i’m not basic bb user and i’ve done a lot of smaller or bigger tweaks in my phpBB :]

    unfortunately this is not really answer to my question. My question is about whole process how is stopwords made or maitained.
    I’m pretty sure that i’ve never touched content of that file and allready have more than 1000 rows in that file. So it must be generated some way (i did not check how and where yet)
    So as i posted before that information is somewhere from http://www.phpbb.com/community/viewtopic.php?f=18&t=135383 this discussion

    As we can see that function maitain process which retrieve stop word from DB (which is much more faster than from file) and then do normal stuff
    What i was asking for is how you maitain insertion to that table? I don’t believe that you update that table by hand. As it looks like very automated process which originaly saves data to .txt file

    another thing (which is up in mind just now) is question how you reindexing your site so fast. Even with all tweaks (and many yours hint) my reindexing takes about 26hours (around 500.000 post now) and my server isn’t just old slow piece of furniture :)
    Maybe is there way to reindex searchindex from shell or something similar which will not be involving apache… dunno

    sorry for my english, it’s long time when i was using english on daily basis :)

    Comment by kriznik — September 21, 2010 @ 3:54 pm

  10. interesting thing i’ve discovered. My whole search did not use searchwordlist and searchwordmatch tables at all. because of this if ( !strstr($multibyte_charset, $lang['ENCODING']) ){
    even that i have whole web and forum in utf-8!

    after i’ve commented part with this if statement and left only part in else, search works smooth and fast!
    dunno why this statement did not work because my $lang['ENCODING'] = “utf-8″ so strstr may find occurence and go to false in that statement

    weird
    so i’ll look today closely for all this-like statements!!!

    Comment by kriznik — September 23, 2010 @ 4:32 am

  11. What i was asking for is how you maitain insertion to that table? I don’t believe that you update that table by hand. As it looks like very automated process which originaly saves data to .txt file

    Indeed it is a manual process for now. I only update the contents once or twice a year and it was not worth the time to create an admin panel page to manage the table contents.

    another thing (which is up in mind just now) is question how you reindexing your site so fast. Even with all tweaks (and many yours hint) my reindexing takes about 26hours (around 500.000 post now) and my server isn’t just old slow piece of furniture

    I suspect part of that was related to your follow up comment where you suggested that your stopwords list was not being used. That could easily double the time taken to reindex the site because every single word (even common words like “the” and “and” and so on) would be included in the index.

    Comment by Dave Rathbun — September 23, 2010 @ 9:47 am

  12. well i don’t thing that stopwords is not used because it’s growing up by self (as i said before).
    What wasn’t used was db tables searchwordslist and searchwordlist_match because of that statement which ended up with query like “from post_text where post_text like %search% ” which is terribly slow

    Comment by kriznik — September 23, 2010 @ 9:59 am

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress