Post Updated March 18, new notes at the end, thanks.
I recently completed some code that moves the “stopwords” into a database table and changes the way they are applied to a post or to the search process. At the same time I also moved the search synonyms into a database table. During testing I was very interested to find out that the way I thought the search synonyms were applied is not the way they are actually applied at all. This post will clarify how the synonyms are used, and point out something interesting about the internal consistency of the phpBB search_synonyms.txt file.
If you take a brief look at your search_synonyms.txt file you just might come to the same conclusion that I did. Here is a sample of some words from the top of the file:
center centre
check cheque
color colour
comission commission
comittee committee
commitee committee
conceed concede
creating createing
curiculum curriculum
defense defence
develope develop
What we have here is a mix of American and UK spellings (center versus centre) and some common alternate (or wrong) spellings (commitee committee) and so on. But how are these words used? I entered the following post in my board:
I need to check the colour of my cheque
Here is what got stored in my search index table:
+-----------+---------+
| word_text | word_id |
+-----------+---------+
| check     |     642 |
| cheque    |    9072 |
| color     |    1349 |
+-----------+---------+
Hm. There is something a bit strange going on here. I see two words that are supposed to be synonyms, both indexed. I didn’t expect check and cheque to both be there. After all, my colour is missing, right?
It turns out that there was a bug in my Efficient Cleanwords MOD. The last word in a sentence would not be properly processed as a stop word or as a synonym. I discovered this when I went back and added some words to the end of my sample sentence like this:
I need to check the colour of my cheque for this post
Once I did that, I got this:
+-----------+---------+
| word_text | word_id |
+-----------+---------+
| post      |     442 |
| color     |    1349 |
| check     |     642 |
+-----------+---------+
Now that’s more like it. Is it? Well, at least we can start talking about how synonyms work.
The Efficient Cleanwords MOD code has been updated to fix this bug. This is the sort of “alpha” testing I typically try to do on all of my MODs, as I despise releasing buggy code. That doesn’t mean I don’t do it; I just despise it.
What I originally expected was that if I used the word color in a post, the search index would include both color and colour as alternative spellings. That way if one of my friends from “across the pond” were to come to the phpBB Doctor site and search, they would be able to find posts that included the word colour. Even though nobody would really ever type it that way.
On reflection, that was a really dumb idea. Why would we want to increase the size of our search index tables by storing something twice? Almost everything I have posted in this series has been about reducing the size of the searchwords table, or making the interactions with that table more efficient. Adding more data is not the way to do that.
So as it turns out, it is quite simple: any variant of the word color (color or colour) is indexed as the first word on its line in the synonyms text file, which here is the shorter spelling. That means that cheque will always be indexed as check. And that centre will always be indexed as center. And that… wait a minute, let’s take a closer look at that list again…
center centre
check cheque
color colour
comission commission
comittee committee
commitee committee
The synonyms process always maps the second word to the first. So what is wrong with this picture? Do you know how to spell committee? That’s not a language thing; as far as I know it’s always spelled committee. It’s certainly not what you see in the first words in the listing shown above…
So I entered the following post:
A committee is a group of people able to accomplish nothing
Here are the results from the indexing process:
+------------+---------+
| word_text  | word_id |
+------------+---------+
| able       |     206 |
| accomplish |    3573 |
| comittee   |    9075 |
| group      |    1495 |
| people     |    1166 |
+------------+---------+
Now I don’t know about you, but “committee” in my book is spelled with two m’s and two t’s. And since that’s how the word is spelled in the second column of the synonyms text file, it seems obvious to me that someone, well, someone goofed. Here’s the relevant code from an unmodified version of the clean_words() function:
list($replace_synonym, $match_synonym) = split(' ', trim(strtolower($synonym_list[$j])));
The “replace” word is first, the “match” word is second. So if the code finds a match for the second word, it is replaced by the first. Oops.
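To make the direction of the mapping concrete, here is a minimal sketch of what one synonym line does to a piece of text. It is not the phpBB source: apply_synonym_line() is a hypothetical helper of my own, I am assuming whole-word matching by padding with spaces, and I use explode() because split() was removed in PHP 7.

```php
<?php
// Hypothetical helper (not part of phpBB): applies one synonym line to a text.
function apply_synonym_line(string $line, string $text): string
{
    // First word on the line is the replacement, second word is what gets matched.
    list($replace, $match) = explode(' ', trim(strtolower($line)));

    // Assumption: pad with spaces so only whole words are replaced.
    $padded = ' ' . $text . ' ';
    $padded = str_replace(' ' . $match . ' ', ' ' . $replace . ' ', $padded);

    return trim($padded);
}

// "colour" is the match word, so it is rewritten to "color".
echo apply_synonym_line('color colour', 'i need to check the colour of my cheque'), "\n";

// Same rule, unfortunate result: the correct spelling is rewritten to the typo.
echo apply_synonym_line('comittee committee', 'a committee is a group'), "\n";
```

The second call is the “Oops”: because the misspelling sits in the first column, the correctly spelled word is the one that gets replaced.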
As it turns out, there is a problem with the logic used in the code compared to the actual format of the search_synonyms.txt file. I will be fixing that, and probably posting a bug report. I am guessing that since many of the phpBB developers are not of the US persuasion, they looked at this file and assumed colour and centre were, of course, the desired words. So they naturally assumed that the proper word would be listed second. That is complete speculation on my part, and a tug on the leg of whoever was responsible for setting up the contents of this file. Dare I guess it might have been done by “comittee”?
Having said that, just what is the impact? Does it still work?
When the word “committee” comes through as part of a post every synonym in the text file is checked. That means “committee” is replaced by “comittee” first. Then the second line that matches “committee” is skipped because the word no longer matches anything. My redesigned table-driven process suffers from the same issue, as I simply loaded the synonyms file straight into the table without really checking to see that it was defined correctly.
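The doubled-entry behaviour can be sketched like this. Again, apply_synonyms() and the $fired list are hypothetical names of mine, and the sketch assumes the lines are applied one after another in file order, as described above.

```php
<?php
// Hypothetical sketch: apply every synonym line in order and record which lines fired.
function apply_synonyms(array $lines, string $text, array &$fired): string
{
    $padded = ' ' . $text . ' ';
    $fired = [];

    foreach ($lines as $line) {
        list($replace, $match) = explode(' ', trim(strtolower($line)));

        // The fourth argument of str_replace() receives the number of replacements made.
        $count = 0;
        $padded = str_replace(' ' . $match . ' ', ' ' . $replace . ' ', $padded, $count);

        if ($count > 0) {
            $fired[] = $line;
        }
    }

    return trim($padded);
}

$lines = ['comittee committee', 'commitee committee'];
$fired = [];

echo apply_synonyms($lines, 'a committee is a group', $fired), "\n";
print_r($fired); // only the first of the two lines is listed
```

Once the first line has rewritten the word, there is nothing left for the second line to match, so it never fires.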
But does search work? Ironically, yes, it will. When you enter the word “committee” as a search term, it will be remapped to “comittee” which is, of course, indexed. So a cynical person might suggest that perhaps there was no error, and that the shorter word was simply stored as a way to preserve space. I would buy that, except that there are two lines with the word “committee” on them, and they are therefore clearly backwards.
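Here is a small sketch of why search keeps working: the post and the search term go through the same remapping, so they meet at the same (misspelled) index word. normalize_words() is a hypothetical stand-in for the shared clean-up step, and the in-memory $index array stands in for the search word table.

```php
<?php
// Hypothetical stand-in for the clean-up step shared by posting and searching.
function normalize_words(array $synonyms, string $text): array
{
    $padded = ' ' . strtolower($text) . ' ';

    foreach ($synonyms as $line) {
        list($replace, $match) = explode(' ', trim(strtolower($line)));
        $padded = str_replace(' ' . $match . ' ', ' ' . $replace . ' ', $padded);
    }

    return preg_split('/\s+/', trim($padded));
}

$synonyms = ['comittee committee', 'commitee committee'];

// "Index" the post: the words become array keys for quick lookup.
$index = array_flip(normalize_words($synonyms, 'A committee is a group of people'));

// Normalize the search term the same way before looking it up.
$term = normalize_words($synonyms, 'committee')[0];

var_dump($term);                       // the remapped search term
var_dump(isset($index[$term]));        // whether the remapped term is in the index
var_dump(isset($index['committee']));  // whether the correct spelling is in the index
```

The correct spelling never reaches the index, but since it never reaches the lookup either, the two sides still agree.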
There are other backwards entries, such as these:
heighth height
milage mileage
morgage mortgage
Remember the first word is what will get stored in the index, the second word is what is matched in the post. So if someone enters the word “mortgage” in a post (which is spelled correctly) it will be stored in the search table as “morgage” instead. There are also other “doubled” entries such as these examples:
maintainance maintenance
maintenence maintenance
ommision omission
ommission omission
suprise surprise
surprize surprise
You might argue that it would be faster and easier to reverse the php code in the clean_words() function… except that not everything is reversed!
I will leave it to you to examine your search_synonyms.txt file and fix the errors that you might find. Just remember that the second word is the “mistake” or alternate spelling, and the first word on the line is what will actually be stored in your index.
It doesn’t break the search. But in the current format any words that appear doubled are not going to work, as the second synonym line will never be used.
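If you want some help finding the doubled entries, something along these lines would list every match word (second column) that appears on more than one line. It is only a sketch: find_doubled_matches() is a hypothetical name, and it cannot tell you which entries are backwards, since that takes a human who knows how to spell.

```php
<?php
// Hypothetical checker: group synonym lines by their second (match) word.
function find_doubled_matches(array $lines): array
{
    $byMatch = [];

    foreach ($lines as $line) {
        $parts = preg_split('/\s+/', trim(strtolower($line)));
        if (count($parts) !== 2) {
            continue; // skip blank or malformed lines
        }
        // $parts[0] is the replacement, $parts[1] is the word being matched.
        $byMatch[$parts[1]][] = $parts[0];
    }

    // Keep only the match words that occur on more than one line.
    return array_filter($byMatch, function ($replacements) {
        return count($replacements) > 1;
    });
}

$sample = [
    'center centre',
    'comittee committee',
    'commitee committee',
    'suprise surprise',
    'surprize surprise',
];

print_r(find_doubled_matches($sample));
```

To run it against a real file you could load the lines with file() and the FILE_IGNORE_NEW_LINES flag; the path to your own search_synonyms.txt is something you would have to fill in.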
I don’t have to do anything to fix my code related to pushing synonyms into the database, as the code is fine. I will, however, have to clean up my synonyms table data. It’s a good thing that as part of my MOD I created an ACP page to allow me to manage my synonyms, right?
I have logged a bug with the phpBB Group, but I don’t expect anything to happen from it. I don’t mean that in a sarcastic or cynical way… it’s just that this bug is certainly not security or performance related, and the fix would be quite challenging. Think about it; you would have to alter the contents of the search_synonyms.txt file (easy) and then rebuild your index tables (hard). I understand phpBB3 includes a rebuild index feature, but phpBB2 does not. I don’t expect that they would fix this, but perhaps they’ll take a closer look at phpBB3 to make sure it doesn’t suffer from the same issue.
My ACP Stopwords Manager MOD (not yet published) will address this by providing a sequence of SQL statements to load the table correctly. So there’s another reason to consider looking at the MOD when it comes out.
The Efficient Cleanwords MOD does not do anything to the stopwords or synonyms processing. I would ultimately expect that it will become a part of the Stopwords Manager MOD, but I will also release it as a stand-alone MOD for those that want to retain the standard stopwords processing.
Update (March 18, 2007)
It seems that someone else posted the exact same bug years ago. The bug was closed by one of the developers, and for the reasons I expected. Any fix is not simply a code fix but would also require a rebuild of the search_wordmatch and search_wordlist tables. Since those features are not in phpBB2 (they are in phpBB3) it would require a MOD rather than a core code fix.
I feel a bit ambivalent about this. On the one hand, this is hardly a major issue. The only exposure is that if you have two (or more) synonyms for the same word, only the first is ever processed. Is that a huge deal? Probably not.
I will be fixing it with my MOD. Once search synonyms are moved into the database (and managed via the ACP) a board administrator will be able to easily correct their data. I will probably not release my own “rebuild search” MOD but will instead suggest that board owners install one of the others already released at phpbb.com.