In an earlier post I dissected the regular expression used in the clean_words() function to remove short and long words from the search index tables. I figured out how it worked, and then why it was broken from 2.0.6 forward. I fixed it.
For English boards only.
I have had some feedback from an owner of a German (Swiss German, actually) board, and the fix I provided does not work for his board. I think I know why. I don’t (yet) know how to fix it.
As a quick review, in versions 2.0.4 and earlier the regular expression that separates words uses the \b token to identify a word boundary. From 2.0.6 onward the code uses the character set [ ] (a literal space) instead. The problem with using a space is that each match consumes the space on both sides of a short word, so if you have multiple short words in a row, like “is it an error?”, then every other short word will escape the regex and get stored in your search index.
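The “every other word” effect is easy to reproduce outside phpBB. Here is a minimal Python sketch, not the actual phpBB code: both patterns are simplified stand-ins written for illustration, and the 1-2 / 21+ character limits are an assumption. It compares a space-delimited pattern with a \b-delimited one.

```python
import re

# Hypothetical stand-ins for the two styles of pattern, not copied from phpBB.
# Both target "words" of 1-2 characters or 21+ characters.
SPACE_DELIMITED = re.compile(r"[ ](\S{1,2}|\S{21,})[ ]")
BOUNDARY_DELIMITED = re.compile(r"\b(\w{1,2}|\w{21,})\b")


def clean_with_spaces(text):
    """Remove short/long words using literal spaces as delimiters."""
    # Pad with spaces so the first and last words have a delimiter on each side.
    padded = " " + text + " "
    return SPACE_DELIMITED.sub(" ", padded).split()


def clean_with_boundaries(text):
    """Remove short/long words using zero-width word boundaries."""
    return BOUNDARY_DELIMITED.sub(" ", text).split()


sample = "is it an error"
print(clean_with_spaces(sample))      # ['it', 'error']
print(clean_with_boundaries(sample))  # ['error']
```

The space version matches “ is ” including its trailing space, so “it” no longer has a leading space to match against. The \b version is zero-width and consumes nothing, so every short word gets caught.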
When all of the posts are in English, switching back to \b seems to work extremely well. But for non-English boards it appears to be a problem.
Case in point: here are a few words from this Swiss German board that cause problems for some reason:
höö ööh jöö
I am told that these words are all interjections. In English they would be words like “hey” or “oh” or “wow” or similar. The issue? The öö for some reason gets cut off from the adjacent letter in the word. I have no idea why. Another problem is the word nümme, where the ü character seems to be dropped as an invalid character, leaving the “words” n and mme. The “n” gets dropped because it’s now a one-letter word, and “mme” gets added to the search index.
I can guess about part of this, but not all. At least not yet. But for example the phrase:
Jöö, das isch ja süess!
…which I am told translates to “Oh, that’s sweet!” (Jöö being an interjection meaning “Oh”), would get processed as:
öö das isch süess
The ja gets removed because it’s too short. The jöö gets screwed up because the “j” is, for some reason, ignored. Very frustrating.
In doing more research on the PHP site I found a page that discusses the various tokens and tags and whatnot that can be used in regular expressions. The page includes this quote:
A “word” character is any letter or digit or the underscore character, that is, any character which can be part of a Perl “word”. The definition of letters and digits is controlled by PCRE’s character tables, and may vary if locale-specific matching is taking place. For example, in the “fr” (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
So clearly I’m onto something. The \b is supposed to match a word boundary, and a word boundary is defined as a transition between a word character (matched by \w) and a non-word character (matched by \W). So what I have to do is figure out how PHP assigns those characters for various languages.
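To make the locale point concrete, here is a small Python sketch (again, not PHP). It assumes the text is stored as UTF-8 and uses a bytes pattern, where Python’s \w is ASCII-only, as a stand-in for a regex engine whose character tables know nothing about accented letters. Whether that is what actually happens on the Swiss German board is a guess; the function names are made up for this example.

```python
import re


def words_ascii_tables(text):
    """Find words when \\w only knows ASCII letters, digits and underscore."""
    raw = text.encode("utf-8")  # assumption: the posts are stored as UTF-8
    # With a bytes pattern, \w is equivalent to [a-zA-Z0-9_], so every byte
    # of an accented letter counts as a non-word character.
    return [w.decode("ascii") for w in re.findall(rb"\w+", raw)]


def words_unicode_tables(text):
    """Find words when \\w also accepts accented letters."""
    # With a str pattern, Python's \w is Unicode-aware by default.
    return re.findall(r"\w+", text)
```

With ASCII-only tables, nümme comes back as the two “words” n and mme, and jöö comes back as just j, because a boundary lands on each side of the accented letters. With accent-aware tables both words come back whole. That is at least consistent with the idea that the definition of \w is what’s splitting these words.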
Because of these issues, I have not yet published my MOD at phpbb.com. It’s available here, and you can certainly download it if you run an English-language board. It’s working great for me on several boards. But it doesn’t work for non-English boards at this time.
In a frustrating turn of events, using [ ] works perfectly for the Swiss German board posts. It does, however, still have the problem of improperly processing a string of consecutive short (or long) words. As I type this, I am wondering if putting [ ] back in and substituting two spaces for every space might not be a quick fix. Adding a second space would stop the problem of the first word match “eating” the second word’s space.
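The doubled-space idea can be tried quickly in isolation. Here is a minimal Python sketch using a simplified stand-in for the [ ] pattern (an assumption, not the real phpBB expression). It only shows that the idea holds for one sample string, and says nothing yet about real board data.

```python
import re

# Simplified stand-in for the space-delimited pattern; not the phpBB original.
SPACE_DELIMITED = re.compile(r"[ ](\S{1,2}|\S{21,})[ ]")


def clean_single_spaced(text):
    """Apply the pattern as-is; neighbouring short words share one space."""
    return SPACE_DELIMITED.sub(" ", " " + text + " ").split()


def clean_double_spaced(text):
    """Double every space first so each word owns its own delimiters."""
    doubled = (" " + text + " ").replace(" ", "  ")
    return SPACE_DELIMITED.sub(" ", doubled).split()


sample = "is it an error"
print(clean_single_spaced(sample))  # ['it', 'error']
print(clean_double_spaced(sample))  # ['error']
```

Because \S matches any non-space character, this approach never has to decide whether an accented letter is a “word” character, which would explain why [ ] behaves on the Swiss German posts.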
Will have to try it and post back. Stay tuned for details.