February 11, 2007

Regex Redux

Filed under: Search, phpBB. Posted by Dave Rathbun @ 12:20 am

In an earlier post I dissected the regular expression used in the clean_words() function to remove short and long words from the search index tables. I figured out how it worked, and then why it was broken from 2.0.6 forward. I fixed it.

For English boards only. :-)

I have had some feedback from the owner of a German (Swiss-German, actually) board, and the fix I provided does not work for his board. I think I know why. I don’t (yet) know how to fix it.

As a quick review: in versions 2.0.4 and earlier, the regular expression that separates words uses the \b token to identify a word boundary. From 2.0.6 forward, the code uses the character set [ ] instead. The problem with using a space is that each match consumes the space on both sides of the word, so if you have multiple short words in a row, like “is it an error?”, every other short word escapes the regex and gets stored in your search index.
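To see the skipping in action, here is a minimal sketch. The pattern is a simplified stand-in for the one in clean_words(), not the actual phpBB code, and the 1-2 character limit for a "short" word is an assumed setting.

```php
<?php
// Simplified stand-in for the clean_words() pattern (an assumption, not the
// real phpBB regex): 1-2 non-space characters with a literal space on each side.
$pattern = '/[ ]\S{1,2}[ ]/';

// Padded so that the first and last words also have a space on both sides.
$text = ' is it an error ';

// Each match consumes its trailing space, so the next short word has no
// leading space left to match against and is skipped.
$with_spaces = preg_replace($pattern, ' ', $text);

// \b is zero-width, so neighbouring matches do not compete for a character.
$with_boundaries = preg_replace('/\b\w{1,2}\b/', '', $text);

var_dump($with_spaces);      // string(10) " it error "
var_dump($with_boundaries);  // only "error" survives, surrounded by spaces
```

With the space-delimited pattern, “is” and “an” are removed but “it” survives, which is the every-other-word behavior described above.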

When all of the posts are in English, switching back to the \b seems to work extremely well. But for non-English boards it appears to be a problem.

Case in point: here are a few words from this Swiss-German board that cause problems for some reason:

höö
ööh
jöö

I am told that these words are all interjections. In English they would be words like “hey” or “oh” or “wow” or similar. The issue? For some reason the öö gets cut off from the other letter in the word. I have no idea why. Another problem is the word nümme, where the ü character seems to be dropped as an invalid character, leaving the “words” n and mme. The “n” gets dropped because it’s now a one-letter word, and “mme” gets added to the search index.

I can guess about part of this, but not all. At least not yet. But for example the phrase:

Jöö, das isch ja süess! 

…which I am told translates to “Oh, that’s sweet!” (Jöö being an interjection meaning “Oh”), would get processed as:

öö das isch süess

The ja gets removed as it’s too short. The jöö gets screwed up because the “j” is, for some reason, ignored. Very frustrating. :-?

In doing more research on the PHP site I found a page that discusses the various tokens and tags and whatnot that can be used in regular expressions. The page includes this quote:

A “word” character is any letter or digit or the underscore character, that is, any character which can be part of a Perl “word”. The definition of letters and digits is controlled by PCRE’s character tables, and may vary if locale-specific matching is taking place. For example, in the “fr” (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

So clearly I’m onto something. The \b is supposed to match a word boundary, and a word boundary is defined as a state change between a word character matched by \w and a non-word character matched by \W. So what I have to do is figure out how PHP assigns those characters for various languages. :-)
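Here is one way this definition could produce the symptoms above. If the board stores posts in a single-byte encoding such as ISO-8859-1 and no matching locale is active, ö and ü are not \w characters, so a word boundary appears on each side of them. The sketch below assumes both of those things (neither is confirmed for the Swiss-German board) and uses a simplified short-word pattern, not the real one from clean_words().

```php
<?php
// Assumption: no language-specific locale is active. Forcing "C" makes the
// sketch behave the same regardless of the server's environment.
setlocale(LC_CTYPE, 'C');

// Assumption: the board uses ISO-8859-1, where o-umlaut is the single byte
// 0xF6 and u-umlaut is the single byte 0xFC.
$words = array(
    'hoo'   => "h\xF6\xF6",  // höö
    'ooh'   => "\xF6\xF6h",  // ööh
    'joo'   => "j\xF6\xF6",  // jöö
    'numme' => "n\xFCmme",   // nümme
);

// Hypothetical short-word filter (not the phpBB pattern): remove any run of
// 1-2 word characters that sits between two word boundaries.
$short_word = '/\b\w{1,2}\b/';

$cleaned = array();
foreach ($words as $label => $word) {
    $cleaned[$label] = preg_replace($short_word, '', $word);
    // bin2hex() shows the surviving bytes without depending on the
    // terminal's character encoding.
    echo $label, ': ', bin2hex($word), ' -> ', bin2hex($cleaned[$label]), "\n";
}
```

Under these assumptions the lone h, j and n each count as a one-letter word and are dropped, leaving öö and ümme. That is close to what ended up in the search index, though not identical, so treat it as a possible explanation only.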

Because of these issues, I have not yet published my MOD at phpbb.com. It’s available here, and you can certainly download it if you run an English-language board. It’s working great for me on several boards. But it doesn’t work for non-English boards at this time.

In a frustrating turn of events, using [ ] works perfectly for the Swiss-German board posts. It does, however, still have the problem of improperly processing a string of consecutive short (or long) words. As I type this, I am wondering if putting this back in and substituting two spaces for every space might not be a quick fix. Adding a second space would stop the problem of the first word match “eating” the second word’s space.
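Another way to stop a match from eating the next word’s space is a lookahead assertion, which checks for the trailing space without consuming it. A minimal sketch, again with a simplified pattern and an assumed 1-2 character limit rather than the actual clean_words() code:

```php
<?php
// Hypothetical simplified pattern: a space, then 1-2 non-space characters,
// then a space that is only inspected via (?=...) and never consumed.
$pattern = '/[ ]\S{1,2}(?=[ ])/';

// Padded with a space at each end so every word has a space on both sides.
$text = ' is it an error ';

// Removing a match deletes the word and its leading space; the trailing
// space stays behind and serves as the next word's leading space.
$result = preg_replace($pattern, '', $text);

var_dump($result); // string(7) " error "
```

Because this pattern relies only on spaces and \S, it does not depend on how \w is defined for a given language. Note that the length limits still count bytes, not characters.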

Will have to try it and post back. Stay tuned for details. 8)

7 Comments

  1. This is an interesting problem. I’m wondering if lookahead assertions could help here…?

    Comment by SamG — February 11, 2007 @ 4:51 pm

  2. > So what I have to do is figure out how php assigns those characters for various languages.

    It isn’t your task, it’s the task of the phpBB administrators and translators. For example, my “language/lang_russian/lang_main.php” contains:

    setlocale(LC_ALL, 'ru_RU.CP1251');
    setlocale(LC_NUMERIC, 'C');

    Comment by olpa — February 11, 2007 @ 10:38 pm

  3. I wonder, is it possible just to look for the content between two whitespace characters? That’s really what we are after: rather than trying to match a word, get the content between two spaces?

    Must go dig out my regex books!! If I remember \s is whitespace & \S a non whitespace. So what we want is 3 \S bounded by \s on either end. I might be totally going wrong here though..I am typing as I think ;)

    Comment by Esmond Poynton — February 12, 2007 @ 7:06 am

  4. Certainly some interesting points, and I will try to address them to the best of my ability.

    SamG, I am still learning regular expressions and the wild variety of patterns that are available. I know – in theory – what lookahead assertions are but not how to implement them. Based on your suggestion I will try to figure that out.

    Olpa, that’s very interesting. I will ask the owner of the Swiss-German board if they have something like that in their language file. It would also be something that I could test on my local server, even with an English board, right? I have been allowed to use several hundred posts from a foreign-language board for testing purposes (they are on a private server at my house). If I add the proper statements to my English lang_main.php file I should be able to “fake” it into using the alternate language character sets, would you agree?

    Esmond: the trick here is that we’re trying to eliminate anything outside of the acceptable range. We want words of between 3 and 20 characters to remain. The regex is used to identify words outside that range and drop them from the string of words. And yes, you are correct in that \s is a space and \S is a non-space. But again, the definition of what is a space versus non-space can vary from language to language. Maybe? I will have to verify that last statement, as it doesn’t seem right. I know that \b can change, because it’s based on the definition of \w and \W, which can also change. Perhaps \s and \S do not.

    As a report on my attempt to add spaces… it failed. :-) I’m not sure why, yet. But what I did was add a statement that replaced every single space with a double space, and then tried to use the current regex with the character set of [ ] rather than the word boundary \b. I believe I have documented that the [ ] has a problem with a sequence of short words, and I believe my explanation for why this is so is correct. It seemed that adding an extra space would fix that.

    It didn’t.

    Had company over for the weekend, so didn’t get to spend as much time on research. More details to follow, and thank you all very much for your suggestions.

    This is quite an interesting puzzle. :-D

    Comment by dave.rathbun — February 12, 2007 @ 9:21 am

  5. Sorry, dave. I failed to parse your questions. All the words are familiar to me, but lack of concentration prevents me from understanding.

    Side notes: the language files should have:

    * “$lang['ENCODING'] = 'xxxx'” for talking with the browser.
    * (optionally) “setlocale” for locale-specific functions, such as \b in regexps.

    Comment by olpa — February 12, 2007 @ 10:44 am

  6. olpa, please do not worry. :-) Your English is far better than I would hope to be in any other language.

    What I was speculating about was this: I have a few hundred non-English posts that someone gave me to experiment with. I simply loaded them into my “English” board. My question (and I will not be able to test until later) was whether, if I add the extra lines to my language/lang_english/lang_main.php file, I might be able to “fake” PHP into using a non-English definition of the \b word boundary token. This would allow me to test the regex on foreign posts for proper handling.

    I didn’t do that before; didn’t realize it was even an option. I was looking for something in php.ini (seemed obvious at first, but now I see why it’s not there) to set up the desired language or character set. So you have given me a direction to go do some additional testing, and I thank you for that.

    Comment by dave.rathbun — February 12, 2007 @ 11:33 am

  7. > if I add the extra lines to my language/lang_english/lang_main.php file I might be able to “fake” php into using a non-english definition of the \b word boundary token.
    Yes, it should be so. But be aware that “locale” doesn’t always work. Support from the operating system is required. For example, I had to ask my hoster to install the cp1251 locale. But I think that a latin-1/iso-8859-1 locale (I have no idea of its right name) for Europe is always available.

    By the way, why do you use the whole phpBB for testing? Just create a simple test.php:

    setlocale('…');
    $str = "\zzz\zzz…\zzz";
    and test regexp functions

    Comment by olpa — February 13, 2007 @ 9:36 pm
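The test script suggested in the last comment could be fleshed out along these lines. This is only a sketch: the locale names are guesses that vary by operating system and may not be installed, the sample bytes assume ISO-8859-1, and the helper name count_word_chars is made up for illustration.

```php
<?php
// Hypothetical helper: count how many bytes of $bytes PCRE treats as \w
// under whatever locale is currently active.
function count_word_chars($bytes)
{
    return preg_match_all('/\w/', $bytes, $unused);
}

// "jöö" encoded as ISO-8859-1 (an assumption about the board's encoding).
$sample = "j\xF6\xF6";

// Baseline: in the "C" locale only ASCII letters, digits and _ are \w.
setlocale(LC_CTYPE, 'C');
$in_c_locale = count_word_chars($sample);
echo "C locale: $in_c_locale of 3 bytes match \\w\n";

// Candidate locale names (assumptions; they differ between systems).
// setlocale() tries each in turn and returns false if none is installed.
$active = setlocale(LC_CTYPE, 'de_CH.ISO-8859-1', 'de_CH.ISO8859-1', 'de_DE.ISO-8859-1', 'de_CH', 'german');

if ($active === false) {
    echo "No German ISO-8859-1 locale is available on this system\n";
} else {
    echo "$active: ", count_word_chars($sample), " of 3 bytes match \\w\n";
}

// Restore the default so later code is not affected.
setlocale(LC_CTYPE, 'C');
```

If a suitable locale is installed, the second count shows whether the accented bytes are now treated as word characters. If none is installed, the script says so instead of silently testing the wrong thing.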
