<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Welcome to the phpBB Doctor Blog &#187; Search</title>
	<atom:link href="http://www.phpbbdoctor.com/blog/category/phpbb/search/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.phpbbdoctor.com/blog</link>
	<description>Your premium source for custom modification services for phpBB</description>
	<lastBuildDate>Wed, 11 Jan 2012 21:30:50 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Experimenting With FULLTEXT Indexing</title>
		<link>http://www.phpbbdoctor.com/blog/2009/09/12/experimenting-with-fulltext-indexing/</link>
		<comments>http://www.phpbbdoctor.com/blog/2009/09/12/experimenting-with-fulltext-indexing/#comments</comments>
		<pubDate>Sat, 12 Sep 2009 15:43:56 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=330</guid>
		<description><![CDATA[I continue to get feedback from my users that &#8211; to be concise &#8211; the search process sucks.   As regular readers of my blog will probably remember, I have done a lot of work to understand and fine-tune the standard phpBB search process. I have moved stop words into the database. I have [...]]]></description>
			<content:encoded><![CDATA[<p>I continue to get feedback from my users that &#8211; to be concise &#8211; the search process sucks. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  As regular readers of my blog will probably remember, I have done a lot of work to understand and fine-tune the standard phpBB search process. I have moved stop words into the database. I have adjusted the regular expression used to parse and index the words. I have added code to provide cleaner input to the search routine. All of these changes were made to optimize the process as it works today.</p>
<p>But folks are still not happy.</p>
<p>They don&#8217;t like the fact that certain words are on the stop words list. My board is related to a specific brand of software used for reporting. It&#8217;s not too surprising, then, that the word &#8220;report&#8221; appears in nearly 30% of the half-million posts on my board. Yet they still feel like they would gain value by having that word in their search for some reason.</p>
<p>They don&#8217;t like the fact that short words (which in our case includes version numbers) are not included either.</p>
<p>They don&#8217;t like the fact that they can&#8217;t search for word combinations (exact phrase search).</p>
<p>So today I started testing out a FULLTEXT index on my posts table. I created the index on both the post text and the title. It took a minute and a half and spiked my CPU to about 33% use. The index is over half the size of the database table. On the other hand, the index is smaller than the index on the search_wordmatch table so that&#8217;s something positive.</p>
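<p>For readers who want to try the same experiment, here is a minimal sketch of what creating and querying such an index might look like. It assumes the stock phpBB2 phpbb_posts_text table with its post_subject and post_text columns; the index name and the three helper functions are made up for illustration and are not part of phpBB. The quoted form in BOOLEAN MODE is how MySQL expresses an exact phrase. Note that MySQL applies its own minimum word length (ft_min_word_len, 4 by default) and its own stopword list, so short words such as version numbers may still need server configuration changes.</p>

```php
<?php
// Sketch only: table, column and index names are assumptions based on a
// stock phpBB2 schema; adjust them to match your own board.

// One-time statement that builds the FULLTEXT index on title and text.
function fulltext_index_sql(string $table): string
{
    return "ALTER TABLE {$table} ADD FULLTEXT INDEX ft_subject_text (post_subject, post_text)";
}

// Query with a placeholder, meant to be run as a prepared statement.
function phrase_search_sql(string $table): string
{
    return "SELECT post_id FROM {$table}"
         . " WHERE MATCH (post_subject, post_text) AGAINST (? IN BOOLEAN MODE)";
}

// Wrap the user's input in double quotes so it is treated as one phrase.
// Embedded double quotes are replaced so they cannot end the phrase early.
function phrase_param(string $phrase): string
{
    $clean = trim(str_replace('"', ' ', $phrase));
    return '"' . $clean . '"';
}
```

<p>The value returned by phrase_param() would be bound to the placeholder in phrase_search_sql() with whatever database layer the board uses.</p>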
<p>Over the coming weeks I am going to be experimenting with different search keywords and will be trying to get some metrics as to how well the fulltext index performs. There are three aspects that I am hoping to use to rate the success of this experiment. First, how fast are the results provided? Second, how effective are the results? Third, how easy is it going to be to give the user an interface to use the new index?</p>
<p>Stay tuned for more details.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2009/09/12/experimenting-with-fulltext-indexing/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Search Tweak: Page 2 Not Found</title>
		<link>http://www.phpbbdoctor.com/blog/2008/03/09/search-tweak-page-2-not-found/</link>
		<comments>http://www.phpbbdoctor.com/blog/2008/03/09/search-tweak-page-2-not-found/#comments</comments>
		<pubDate>Sun, 09 Mar 2008 09:11:43 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/2008/03/09/search-tweak-page-2-not-found/</guid>
		<description><![CDATA[I have had frequent complaints from folks on my biggest board regarding the search results. It seems that  for many users page two (and beyond) was not available. Frankly, I was stumped. I worked with some users to verify that they have cookie settings that work and to ensure that their search session was [...]]]></description>
			<content:encoded><![CDATA[<p>I have had frequent complaints from folks on my biggest board regarding the search results. It seems that  for many users page two (and beyond) was not available. Frankly, I was stumped. I worked with some users to verify that they have cookie settings that work and to ensure that their search session was not timing out. I checked (and tweaked) the code that clears out old search sessions to make sure their search results were not being truncated too early. The main thing was, I was never able to reproduce the error. That makes it very frustrating when trying to debug and fix an issue.</p>
<p>It turns out that the root cause had nothing to do with a cookie or session problem. The cause was that these users were running searches that returned a huge number of posts or topics. I figured that out by looking at the MySQL query logs, where one query in particular stood out (more on that at the end of this post).</p>
<p>Why should the number of resulting posts matter if I am only trying to go to page 2? Shouldn&#8217;t it be an issue only if I am trying to view page 800 instead?</p>
<p><span id="more-170"></span></p>
<p>First, nobody is going to read 800+ pages of search results. It&#8217;s not going to happen. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  But let&#8217;s suppose that someone was going to read through all of those pages. How is that information managed from one session to the next? Meaning, just how does someone get to page 2 without passing the entire list of topics on the URL?</p>
<p>There is a table named <a href="http://www.phpbbdoctor.com/doc_columns.php?id=14">phpbb_search_results</a> that is used to contain the results of the search. Makes sense, right? <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  The search results table has a very simple structure. It contains the session_id, the search_id, and the search_array column which is text. The session_id is used to clear out old search results after a session has expired. The search_id is either a text string like &#8220;unanswered&#8221; or &#8220;egosearch&#8221; (my personal favorite) or a numeric key like the one in the following URL:</p>
<pre>search.php?search_id=1605853721&#038;start=25</pre>
<p>The search_id in that prior URL is used to look up the results from the user&#8217;s search, skip the first 25 topics (or posts), and start from there. That would be page 2. And that would fail in many cases.</p>
<h3>Why Page Two Fails</h3>
<p>The reason page 2 fails is right there in the table definition, but it helps to know a bit more about how the search_array field is used. Simply put, there are tons of options that a user may select on a search. Those include sort options, topics or posts, search topic title or message body, search for a forum, and so on. There are way too many options to pass on a URL. So as the search is performed, these values are all serialized and stored in the text field. Here is the code from 2.0.22 search.php:</p>
<pre>for($i = 0; $i < count($store_vars); $i++)
{
	$store_search_data[$store_vars[$i]] = $$store_vars[$i];
}

$result_array = serialize($store_search_data);</pre>
<p>If you want to see some examples, use phpMyAdmin (or something similar) and you'll see what goes into that column in the table.</p>
<p>The search_array column format is text. A text column in MySQL stores just under 64K: at most 65,535 bytes of data. That column does not just store all of the search options selected. It also includes the topic or post ID values that matched the search. So here is the problem: <strong>It is entirely possible to return more search values than will fit in the text field!</strong></p>
<h3>Simple Example of a Broken Search</h3>
<p>Suppose that I do a search for posts. Suppose that all of the post_id values on the board are 5 digits in length. The post_id values are separated by <code>", "</code> in the search_array text field. That means every post_id takes 7 characters of space. (That would be five characters for the post_id, one for the comma, and one for the trailing space for a total of 7.) Now suppose that a search for a common sequence of words returns 10,000 posts or perhaps even more. If I take 10,000 * 7, well, that's 70,000 characters of data. It will not fit into a text field.</p>
<h3>First Fix</h3>
<p>I have two fixes to suggest for this if you are seeing this behavior on your board. First, remove the space. There is nothing different between this:</p>
<pre>AND p.post_id in (1,2,3,4)</pre>
<p>and this:</p>
<pre>AND p.post_id in (1, 2, 3, 4)</pre>
<p>The extra space is cosmetic. It does not change the results. But does it help to remove it? Take my earlier example where I had 10,000 posts and 7 characters for each. By removing the space, I will reduce the amount of data that I am trying to store by 10,000 characters, dropping me from 70,000 (too many) down to 60,000. That just might fit. If you look in search.php you will find a number of places where this code exists:</p>
<pre>WHERE post_id IN (" . implode(", ", $search_id_chunks[$i]) . ")</pre>
<p>Simply change the implode "glue" from <code>", "</code> to <code>","</code> and you're done. You'll have to do this in a number of places to get them all. Frankly I would do this even if you are not having problems with search results being too big, as there is less data to push around.</p>
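<p>The arithmetic is easy to check with a few lines of PHP. This is only a sketch: the helper name is made up, and the 10,000 five-digit IDs are the hypothetical values from the example, not real data. It measures the ID list alone, so the serialized search options would still add to the total.</p>

```php
<?php
// Hypothetical helper: length of a list of IDs once joined with the given glue.
function id_list_length(array $ids, string $glue): int
{
    return strlen(implode($glue, $ids));
}

$ids        = range(10000, 19999);   // 10,000 five-digit post_id values
$text_limit = 65535;                 // maximum bytes in a MySQL TEXT column

$with_space    = id_list_length($ids, ', ');
$without_space = id_list_length($ids, ',');

printf("glue ', ': %d bytes, fits: %s\n", $with_space, $with_space <= $text_limit ? 'yes' : 'no');
printf("glue ',':  %d bytes, fits: %s\n", $without_space, $without_space <= $text_limit ? 'yes' : 'no');
```

<p>The totals come out at 69,998 and 59,999 rather than a round 70,000 and 60,000, because implode() does not put a separator after the last value. The conclusion is the same: the first list does not fit and the second one does.</p>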
<h3>Second Fix</h3>
<p>Even after you do this you might still get some searches that are too big. It turns out that the phpBB2 developers were aware of this issue and there is code in place to take care of it. In phpBB2 2.0.22 it exists but is commented out so that it doesn't get executed. Here is the code I am talking about:</p>
<pre>//
// Limit the character length (and with this the results displayed at all following pages) to prevent
// truncated result arrays. Normally, search results above 12000 are affected.
// - to include or not to include
/*
$max_result_length = 60000;
if (strlen($search_results) > $max_result_length)
{
	$search_results = substr($search_results, 0, $max_result_length);
	$search_results = substr($search_results, 0, strrpos($search_results, ','));
	$total_match_count = count(explode(', ', $search_results));
}
*/</pre>
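<p>In this case they are going to limit the result set to 60,000 total characters. If the search results are too long, the first line of code reduces the string to the proper size. The second line makes sure the end of the string is a valid post (or topic) ID and that something didn't get chopped off in the middle. And the last line resets the $total_match_count so the paging will be correct. One thing to watch: if you applied the first fix and changed the implode glue to <code>","</code>, the delimiter in that explode() call has to change from <code>', '</code> to <code>','</code> as well. Otherwise explode() will not find its delimiter and the count will come out as 1.</p>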
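<p>Here is the same idea as a small standalone function, written so that the glue is a parameter and the truncation and the count always agree. This is my own sketch, not code from phpBB; the function name and the return format are assumptions for illustration.</p>

```php
<?php
// Sketch only: cut a glued list of IDs down to at most $max_length bytes
// without leaving a partial ID at the end. Returns the list and the ID count.
function truncate_id_list(string $list, int $max_length, string $glue = ','): array
{
    if (strlen($list) > $max_length) {
        // Hard cut first, which may land in the middle of an ID...
        $list = substr($list, 0, $max_length);
        // ...then drop everything after the last complete separator.
        $last = strrpos($list, $glue);
        $list = ($last === false) ? '' : substr($list, 0, $last);
    }

    // Count with the same glue that was used to build the list.
    $count = ($list === '') ? 0 : count(explode($glue, $list));

    return [$list, $count];
}

list($kept, $total) = truncate_id_list(implode(',', range(10000, 19999)), 20);
echo $kept, ' (', $total, " ids)\n";
```

<p>With a limit of 20 characters the hard cut leaves a dangling "10" at the end of the string, which is then dropped along with the comma in front of it, so three complete IDs remain.</p>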
<h3>Conclusion</h3>
<p>Should you do this? If your board has fewer than 70,000 posts I would suggest that it doesn't matter. Even if you're over 100,000 posts you are probably okay as long as you are keeping up with your stopwords file (or table, if you have implemented the ideas I posted about moving those values into the database in earlier <a href="http://www.phpbbdoctor.com/blog/2007/03/03/how-does-search-work-part-v-search-stopwords-redux/">search posts</a>.) In my case just removing the extra space from the glue in the implode() function was enough for now, but with my board just about to exceed 400,000 posts I feel like the more bullet-proof solution provided by the phpBB2 developers will be required.</p>
<p>Earlier in this post I mentioned that I finally figured this out by reviewing my MySQL query logs. One of the queries that I saw was an insert into the search results table, and the list of post_id values was so large that the query was truncated before the array of posts could be stored. Once I saw that (and by the way, that is also where I saw how the post_id values were stored and decided to remove the extra spaces) the fix was easy to recognize and implement.</p>
<p>So that's how you fix not being able to view Page 2 of your 800 pages of search results. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2008/03/09/search-tweak-page-2-not-found/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Yet Another Search Tweak</title>
		<link>http://www.phpbbdoctor.com/blog/2007/08/30/yet-another-search-tweak/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/08/30/yet-another-search-tweak/#comments</comments>
		<pubDate>Thu, 30 Aug 2007 09:26:57 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/2007/08/30/yet-another-search-tweak/</guid>
		<description><![CDATA[One of the ideas that comes up regularly when discussing tweaks to the phpBB2 search system is setting up some sort of cron job that helps to maintain the search index table. There are a lot of transactions that hit that table during the posting process, and the posting process is what [...]]]></description>
			<content:encoded><![CDATA[<p>One of the ideas that comes up regularly when discussing tweaks to the phpBB2 search system is setting up some sort of cron job that helps to maintain the search index table. There are a lot of transactions that hit that table during the posting process, and the posting process is what the user sees. If the posting process is slow, then the board &#8220;feels&#8221; slow. So I really like this next tweak, as it can improve the posting process quite a bit.</p>
<p><span id="more-140"></span></p>
<p>It&#8217;s quite simple, really. I&#8217;m going to show how to get a potentially huge boost in performance during the process of editing or removing posts. This tweak removes 3 queries (or one really big one if you&#8217;re not using MySQL). And one of those queries is a nasty one with both a group by and a having clause. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_eek.gif' alt=':shock:' class='wp-smiley' />  And the kicker? Most of the time it&#8217;s a query that scans a whole bunch of rows, and returns nothing! I will explain what to do, and then why.</p>
<h3>Do This Now, Thank Me Later</h3>
<p>Open includes/functions_search.php<br />
Find this code:</p>
<pre>function remove_search_post($post_id_sql)
{
        global $db;

        $words_removed = false;</pre>
<p>After, Add:</p>
<pre>/*</pre>
<p>Find this code:</p>
<pre>        $sql = "DELETE FROM " . SEARCH_MATCH_TABLE . "
                WHERE post_id IN ($post_id_sql)";</pre>
<p>Before, Add:</p>
<pre>*/</pre>
<p>What this does is comment out (effectively remove) a whole block of code. So that&#8217;s the &#8220;what&#8221;, how about the &#8220;why&#8221;?</p>
<h3>What Do I Lose?</h3>
<p>A brief review is in order. The <a href="http://www.phpbbdoctor.com/doc_columns.php?id=15">phpbb_search_wordlist</a> table contains a list of words that appear in one or more posts on your board. The <a href="http://www.phpbbdoctor.com/doc_columns.php?id=16">phpbb_search_wordmatch</a> table contains the index of which posts include those words. When you insert a new post, any new (unique) words are inserted into the phpbb_search_wordlist table and assigned word_id values. Then the combinations of the word_id and post_id values are inserted into the phpbb_search_wordmatch table. Both of these steps need to happen, so I won&#8217;t change those. I have already optimized this a bit in prior posts by changing how the stopwords and synonyms are processed.</p>
<p>But what about editing a post? Now that process can be tweaked. I can gain a huge boost in performance while giving up very little functionality.</p>
<h3>Faster Edits are Good</h3>
<p>When I edit a post phpBB does not try to keep track of a &#8220;before&#8221; and &#8220;after&#8221; picture of the text of my post. Instead, it will <strong>remove every row</strong> for that post from the phpbb_search_wordmatch table (the index). Then it will check to see if any of the words in my post were unique <strong>to that post</strong>. If so, it will remove them from the phpbb_search_wordlist table. Here&#8217;s the problem with that. phpBB2 takes 3 queries against a MySQL database to do that operation. And one of those queries is really ugly. It looks like this:</p>
<pre>$sql = "SELECT word_id
	FROM " . SEARCH_MATCH_TABLE . "
	WHERE word_id IN ($word_id_sql)
	GROUP BY word_id
	HAVING COUNT(word_id) = 1";</pre>
<p>What this is doing is taking a list of word_id values and getting a count of how many times they appear in the index table. Then the &#8220;having&#8221; clause kicks in and drops any words that are counted more than once. That&#8217;s the uniqueness test, and it can be really slow.</p>
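<p>If the SQL is hard to picture, the same uniqueness test can be sketched in plain PHP over an in-memory list of word_id values. This is only an illustration of what the query computes; the function name and the sample data are made up, and nothing like this runs inside phpBB.</p>

```php
<?php
// Sketch only: $match_word_ids stands in for the word_id column of the
// wordmatch table (one entry per index row); $candidate_ids stands in for
// the IN ($word_id_sql) list of words from the post being edited.
function words_used_only_once(array $match_word_ids, array $candidate_ids): array
{
    // GROUP BY word_id with a count per group.
    $counts = array_count_values($match_word_ids);

    $unique = [];
    foreach ($candidate_ids as $word_id) {
        // HAVING COUNT(word_id) = 1: keep words with exactly one index row.
        if (isset($counts[$word_id]) && $counts[$word_id] === 1) {
            $unique[] = $word_id;
        }
    }

    return $unique;
}

print_r(words_used_only_once([5, 7, 7, 9, 5, 11], [5, 9, 11, 42]));
```

<p>In this made-up data words 9 and 11 each have a single index row, so they are the ones that would be removed from the wordlist. Word 5 appears twice and word 42 is not in the index at all, so neither is returned. The database has to do the equivalent counting over the real index table, which is where the time goes.</p>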
<p>I had at first revised this query to look like this:</p>
<pre>$sql = "SELECT word_id
	,	count(word_id) as word_count
	FROM " . SEARCH_MATCH_TABLE . "
	WHERE word_id IN ($word_id_sql)
	GROUP BY word_id";</pre>
<p>What this did was skip the having clause and instead return the count of the word_id back to the result array. From there, I checked to see if the count was one or not. That might have saved some time on the query, but in the end I decided it didn&#8217;t matter anyway. No, instead I am going to skip the entire query. And the one before it, and the one after it. Here&#8217;s why.</p>
<p><strong><span style="color:red;">I don&#8217;t care about removing words from the wordlist table!</span></strong></p>
<p>Think about it. When I edit or remove a post, sure, I want the index table to be cleaned up. But the odds are good that if a word was entered into the wordlist once before, it will be used again. The wordlist table isn&#8217;t the problem in this situation. Going back to the original definition&#8230; the phpbb_search_wordlist table contains words that appear in <strong>one</strong> or more posts on your board. After this tweak the definition will be a table that contains words that appear in <strong>zero</strong> or more posts on your board.</p>
<p>In other words, the &#8220;cost&#8221; of this tweak is that you might end up with words that don&#8217;t belong to any posts. The &#8220;benefit&#8221; is that you skip 2 or 3 queries every time someone edits or deletes a post. I am willing to make that trade, especially since the &#8220;having count(word_id) = 1&#8221; query has been showing up in my slow query log on my server. Every time one query slows down, all of the other queries &#8211; even the tuned ones &#8211; can suffer.</p>
<p>By the way, this exact same process is run during pruning. So it&#8217;s not just edits where this will make a difference.</p>
<h3>Summary</h3>
<p>This is a subtle tweak that improves editing and removing posts. If you never edit or delete (or prune) a post, this tweak will not help. But if you edit posts on a large board you might notice how slow the submit process is. That&#8217;s because it is first determining which words are in the post, then figuring out which are unique to that post, then removing those unique words from the wordlist table, then finally removing all the index rows. That is four queries to remove the post from the search index. If you don&#8217;t care if your words stick around, do this tweak. It will still maintain your search index, but skip the extra maintenance on the wordlist table.</p>
<p>Here is the funny part: if the edits were minor, then the code is going to immediately process some of those same words and put them right back into the database again! <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_lol.gif' alt=':lol:' class='wp-smiley' /> </p>
<p>I am testing this tweak on my largest board right now. </p>
<p>To test I made the changes documented in this post. I created a post with an easily identified (unique) word and saved it. I confirmed that the word was in the word list table and in the index table. Then I deleted that post. The index rows were gone, but the word was still in the wordlist table. I can live with that.</p>
<p>And if I later decide I can&#8217;t, then this is a perfect application for a cron job. Run a job at midnight that checks your wordlist table and removes any words that do not appear in the index. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8)' class='wp-smiley' /> </p>
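<p>For what it is worth, here is a rough sketch of the statement such a cron job could run. It assumes the standard phpbb_search_wordlist and phpbb_search_wordmatch tables joined on word_id; the function name and the default table prefix are made up for illustration. I would try it on a copy of the database first.</p>

```php
<?php
// Sketch only: builds a MySQL multi-table DELETE that removes words which
// no longer have any rows in the index table. The function name and the
// "phpbb_" default prefix are assumptions.
function orphaned_words_cleanup_sql(string $table_prefix = 'phpbb_'): string
{
    $wordlist  = $table_prefix . 'search_wordlist';
    $wordmatch = $table_prefix . 'search_wordmatch';

    // LEFT JOIN keeps every wordlist row; words without a partner in the
    // index come back with m.word_id IS NULL and are the ones deleted.
    return "DELETE w FROM {$wordlist} w"
         . " LEFT JOIN {$wordmatch} m ON m.word_id = w.word_id"
         . " WHERE m.word_id IS NULL";
}
```

<p>The function only builds the SQL string. Running it once a night through the board&#8217;s database layer does in a single pass what the commented-out code was doing on every edit.</p>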
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/08/30/yet-another-search-tweak/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Search Algorithms and the Wizards that Write Them</title>
		<link>http://www.phpbbdoctor.com/blog/2007/08/14/search-algorithms-and-the-wizards-that-write-them/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/08/14/search-algorithms-and-the-wizards-that-write-them/#comments</comments>
		<pubDate>Wed, 15 Aug 2007 01:38:58 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=128</guid>
		<description><![CDATA[One of the reasons I like the phpBB2 search system so much is I can understand it. If I can understand it, I can tweak it. And that&#8217;s what an entire series of blog posts has ended up being about. And they&#8217;re not over yet.   But I do have to be honest; the [...]]]></description>
			<content:encoded><![CDATA[<p>One of the reasons I like the phpBB2 search system so much is I can understand it. If I can understand it, I can tweak it. And that&#8217;s what an <a href="http://www.phpbbdoctor.com/blog/?cat=4">entire series of blog posts</a> has ended up being about. And they&#8217;re not over yet. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  But I do have to be honest; the search system provided with phpBB2 does have two big challenges. It doesn&#8217;t scale well, so very large boards will likely have to do a lot of tweaks or turn it off or seek some alternative. And it is missing one extremely important aspect that most users look for: the ability to search for an exact phrase.</p>
<p><span id="more-128"></span></p>
<p>I had a conversation with naderman on IRC a few weeks ago. He mentioned that he had read some of my search statistics (which was cool for me to hear). And he shared some statistics about how phpbb.com has been running, which were very enlightening. You see, phpbb.com is not running the standard search process right now; they are testing an implementation of sphinx that naderman is working on. That is why (if you hadn&#8217;t noticed) the &#8220;search this topic&#8221; feature doesn&#8217;t work, as he hasn&#8217;t had the time to integrate that part of it yet.</p>
<p>What is sphinx? Glad you asked. </p>
<p><a href="http://www.sphinxsearch.com/">www.sphinxsearch.com/</a></p>
<p>Why is this important? Searching is A Big Deal, and I put that in capital letters for a reason. To be very frank, certain parts of phpBB are quite simple. Oh, there is a lot of code wrapped around it, there are parts that are quite elegant, but at the very core a message board isn&#8217;t that big of a deal to write. In the phpBB &#8220;Origin Story&#8221; I think theFinn said he had a working prototype in a few days. I am not trying to denigrate the efforts of any of the developers, past, present, or future. I am saying that the basic concept of Board -> Category -> Forum -> Post just isn&#8217;t that hard. phpBB3 has improved on the implementation substantially, but the core aspects of the system are the same.</p>
<p>But searching, and searching well, now that to me is more like rocket science. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Look at how much money Google and other search engines pour into research and development every year. It&#8217;s not a trivial thing. Here&#8217;s a quote from another site (linked below) that talks about the challenges of building a web search engine. We&#8217;re talking about searching a database, they&#8217;re searching a web, but I think the comments are still relevant:</p>
<blockquote><p>At serve time, you have to get the results out of the index, sort them as per their relevancy to the query, and stick them in a pretty Web page and return them. </p>
<p>If it sounds easy, then you haven&#8217;t written a search engine. Remember, first, that some queries have more than one word. This means that you have to intersect the index entries for the two words. My advice is to have them presorted in some canonical URL number order so that you can view the two (n) index entries as two stacks and pop until the tops are equal, in which case, you win the prize &#8211; the URL is in both index entries. These sorts of computations have to be run at query time and they need to be run quickly, so think hard about how you are going to do intersections. </p>
<p>Next problem, query time ranking. Now that you have the list of URLs, you have to rank them according to your relevancy algorithm. This has to be fast. People are waiting. </p></blockquote>
<p>That&#8217;s from the last page of the article, but I think it sums up search challenges well. Read it, index it, but be prepared to spit it back out on request. As the author says, people are waiting.</p>
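<p>The &#8220;two stacks&#8221; intersection described in the quote is simple enough to sketch. The version below is my own reading of it, not code from the article or from phpBB: it walks two sorted lists of document IDs with a pair of positions instead of literally popping stacks, and the function name is made up.</p>

```php
<?php
// Sketch only: intersect two index entries, each a list of document IDs
// sorted in ascending order. Every ID is visited at most once, so the cost
// grows with the combined length of the two lists.
function intersect_sorted(array $a, array $b): array
{
    $i = 0;
    $j = 0;
    $count_a = count($a);
    $count_b = count($b);
    $matches = [];

    while ($i < $count_a && $j < $count_b) {
        if ($a[$i] === $b[$j]) {
            // The "tops" are equal: this document contains both words.
            $matches[] = $a[$i];
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++;   // discard the smaller top and look again
        } else {
            $j++;
        }
    }

    return $matches;
}

print_r(intersect_sorted([1, 3, 5, 7, 9], [2, 3, 4, 7, 10]));
```

<p>For the two sample lists only documents 3 and 7 appear in both. The trick only works because both lists are sorted the same way ahead of time, which is exactly the advice in the quote.</p>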
<p>At this time we have two search options that I can see being used for phpBB. One is the built-in algorithm (which I have already said many times that I like), and the other is to ignore that code and fall through to MySQL&#8217;s full text option. In the second case we&#8217;re not writing any search algorithms, we are relying on the talented MySQL folks to do so instead. Which is really my point for this post.</p>
<p>And that point is? Yes, I know you have been asking that as I ramble on and on. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>My point is that it takes a lot of effort to really understand and write a good search algorithm. It&#8217;s not simple. Efficient search algorithms are complicated. There are two parts, and both are in my opinion equally important. First index and then retrieve. Many database systems do an excellent job of both. But figuring out how to create your index and then process it during the scan and retrieval? There&#8217;s a bit of magic involved. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>So instead of spending the time and effort to reinvent that particular wheel, I find it interesting that naderman has been researching other options. Some of the features from the sphinx web site, and my comments to go with them&#8230;</p>
<p><strong>high indexing speed (up to 10 MB/sec on modern CPUs)</strong><br />
Clearly this is important at the beginning, as you will want to index your database / site prior to using the search process.<br />
<strong>high search speed (avg query is under 0.1 sec on 2-4 GB text collections)</strong><br />
This is probably the most important. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Searching performance on very large phpBB boards is what suffers the most. My board with nearly 350K posts is still chugging along nicely but I have made a few <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_lol.gif' alt=':lol:' class='wp-smiley' />  tweaks to the search process and have included an extensive stopwords list. Speaking of which&#8230;<br />
<strong>supports stopwords </strong><br />
No matter how efficient something is, I can&#8217;t imagine that you want to index the word &#8220;the&#8221; a quarter billion times. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /><br />
<strong>high scalability (up to 100 GB of text, up to 100 M documents on a single CPU)</strong><br />
Now that sounds impressive. I don&#8217;t have anywhere near one GB of database content, much less 100 times that much. If I could dedicate one of my four CPUs to searching and handle everything I throw at it, that would be interesting.<br />
<strong>supports phrase searching</strong><br />
Yes! *pumps fist in the air* This is what&#8217;s missing from phpBB, and while I have played with ideas of retrofitting something into the existing search process, wouldn&#8217;t it be great if it was already there?<br />
<strong>supports phrase proximity ranking, providing good relevance </strong><br />
I am going to be on the fence for this one. Relevance engines have come a long way, but I&#8217;ve still seen weird associations that defy explanation. It&#8217;s cool that it&#8217;s there, but I would like to see it work.</p>
<p>naderman has said that once he gets this engine fully integrated with phpbb.com that he will release it as a MOD. Unfortunately for me, it&#8217;s going to be for phpBB3. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' />  I bet I can work with it though, and I look forward to seeing the results of his efforts.</p>
<p><strong>Some Related Links</strong></p>
<ul>
<li><a href="http://www.sphinxsearch.com">www.sphinxsearch.com</a><br />
Home page for the Sphinx Search project.</li>
<li><a href="http://www.acmqueue.com/modules.php?name=Content&#038;pa=showpage&#038;pid=143">Why Writing Your Own Search Engine is Hard</a><br />
A link I found while browsing the sphinx site, it&#8217;s interesting reading. It&#8217;s more about writing a web search engine than a site search, but there are some nuggets of interest.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/08/14/search-algorithms-and-the-wizards-that-write-them/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Another Search Tweak: Marking Topics &#8220;Unsearchable&#8221;</title>
		<link>http://www.phpbbdoctor.com/blog/2007/08/12/another-search-tweak-marking-topics-unsearchable/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/08/12/another-search-tweak-marking-topics-unsearchable/#comments</comments>
		<pubDate>Sun, 12 Aug 2007 18:29:34 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[MOD Writing]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=126</guid>
		<description><![CDATA[As regular blog readers will probably know, I am always playing around with the search process. On my largest board I now have nearly 350K posts, and anything I can do to make the search process even marginally more effective and efficient can help. My latest idea (which I was up until the &#8220;wee hours&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>As regular blog readers will probably know, I am always playing around with the search process. On my largest board I now have nearly 350K posts, and anything I can do to make the search process even marginally more effective and efficient can help. My latest idea (which I was up until the &#8220;wee hours&#8221; a few nights ago building a prototype for) is to grant moderators the ability to mark a topic with an &#8220;unsearchable&#8221; flag.</p>
<p><span id="more-126"></span></p>
<p>If you run or even just participate on a board of any size, you have probably seen this sort of exchange:</p>
<blockquote><p>noob: Can anyone tell me about &#8220;foo&#8221;?<br />
member 1: Geez, didn&#8217;t you try searching for &#8220;foo&#8221; first?<br />
member 2: They must not have searched, since a search for &#8220;foo and bar&#8221; returns everything you need to know<br />
member 3: It&#8217;s even a FAQ: Everything Important about Foo Click Here<br />
noob: I will search for &#8220;foo&#8221; next time <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_redface.gif' alt=':oops:' class='wp-smiley' />  </p></blockquote>
<p>As you probably know, topics are indexed from most recent to oldest. So when Noob #2 comes to the board and actually does a search for &#8220;foo&#8221;, what are they going to find first? Odds are they will find all of the &#8220;why didn&#8217;t you search for foo&#8221; topics rather than the actual content. So it&#8217;s quite frustrating, for everyone involved. Eventually you get two or more pages of &#8220;please search&#8221; responses, and nobody can ever get to the content.</p>
<p>So the idea behind this MOD is to stop indexing those posts, leaving only the desirable content as searchable. It would be excellent if there was some way to automate this decision process, but for the initial attempt I am sticking with a manual moderator action.</p>
<p>And the best part? As I have prototyped it, this will impact only keyword searches. Other searches (like &#8220;my posts&#8221; or &#8220;last 24 hours&#8221; and &#8220;view all posts by Dave&#8221;) will still work. There are no changes made to search.php at all. The only trick is to add or remove the words into the <a href="http://www.phpbbdoctor.com/doc_columns.php?id=16">phpbb_search_wordmatch</a> table during the posting process based on the setting for this flag. I have opted to make this a topic option rather than a post option, but that could be debated.</p>
<p>It&#8217;s fairly simple. I pass in to the submit_post() function two additional pieces of information. One is the search status setting from the form, and the other is the search status setting of the topic. If the search status selected on the form is different from the search status stored in the database, then the rows in the search match table are either added or removed accordingly. If the poster is not a moderator, then the search setting from the form and the existing search status are always going to be the same, and the standard process takes place.</p>
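<p>As a rough illustration of that comparison (not the actual MOD code), here is a minimal sketch. The function name sync_topic_search_rows() and the array layout are hypothetical, and a plain PHP array stands in for the phpbb_search_wordmatch table so the snippet runs on its own:</p>
<pre>// Hypothetical sketch: $wordmatch stands in for phpbb_search_wordmatch,
// keyed by post_id with a list of word_ids. All names are illustrative.
function sync_topic_search_rows(array $wordmatch, array $topic_posts, bool $form_searchable, bool $stored_searchable): array
{
    // Non-moderators always submit the stored value, so nothing changes.
    if ($form_searchable === $stored_searchable) {
        return $wordmatch;
    }

    foreach ($topic_posts as $post_id =&gt; $word_ids) {
        if ($form_searchable) {
            // Topic became searchable again: restore its rows.
            $wordmatch[$post_id] = $word_ids;
        } else {
            // Topic marked unsearchable: drop its rows.
            unset($wordmatch[$post_id]);
        }
    }
    return $wordmatch;
}

$index = array(10 =&gt; array(642, 1349), 11 =&gt; array(9072));
$topic = array(11 =&gt; array(9072)); // posts belonging to the topic being edited
$index = sync_topic_search_rows($index, $topic, false, true);
print_r(array_keys($index)); // only post 10 is still indexed</pre>
<p>In a real board the add and remove branches would be INSERT and DELETE statements against the word match table rather than array operations.</p>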
<p>I have defaulted the searchable status to On, as that makes the most sense. I am playing with the idea that a board owner might want to be able to specify certain forums as off. This would allow you to mark the &#8220;Off Topic&#8221; forum topics as all non-searchable and that might save some space. If I were to do that, though, I would have to write code for the moderator control panel &#8220;move&#8221; option. I would have to build or destroy the search keyword rows based on whether the source / destination forums were searchable or not. I have not decided whether this feature is worth the extra code or not.</p>
<p>So, what do you think? Is this a worthwhile MOD? I am definitely going to implement it on my own board and will try to report back in a few months as to how it has worked out.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/08/12/another-search-tweak-marking-topics-unsearchable/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>How does search work? Part VI: Search Synonyms</title>
		<link>http://www.phpbbdoctor.com/blog/2007/03/06/how-does-search-work-part-vi-search-synonyms/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/03/06/how-does-search-work-part-vi-search-synonyms/#comments</comments>
		<pubDate>Tue, 06 Mar 2007 07:49:52 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=99</guid>
		<description><![CDATA[Post Updated March 18, new notes at the end, thanks.
I recently completed some code that moves the &#8220;stopwords&#8221; into a database table and changes the way they are applied to a post or to the search process. At the same time I also moved the search synonyms into a database table. During testing I was [...]]]></description>
			<content:encoded><![CDATA[<p><em>Post Updated March 18, new notes at the end, thanks.</em></p>
<p>I recently completed some code that moves the &#8220;stopwords&#8221; into a database table and changes the way they are applied to a post or to the search process. At the same time I also moved the search synonyms into a database table. During testing I was very interested to find out that the way I thought the search synonyms were applied is not the way they are actually applied at all. This post will clarify how the synonyms are used, and point out something interesting about the internal consistency of the phpbb search_synonyms.txt file.</p>
<p><span id="more-99"></span></p>
<p>If you take a brief look at your search_synonyms.txt file you just might come to the same conclusion that I did. Here is a sample of some words from the top of the file:</p>
<pre>center centre
check cheque
color colour
comission commission
comittee committee
commitee committee
conceed concede
creating createing
curiculum curriculum
defense defence
develope develop</pre>
<p>What we have here is a mix of American and UK spellings (center versus centre) and some common alternate (or wrong) spellings (commitee committee) and so on. But how are these words used? I entered the following post in my board:</p>
<blockquote><p>I need to check the colour of my cheque</p></blockquote>
<p>Here is what got stored in my search index table:</p>
<pre>+-----------+---------+
| word_text | word_id |
+-----------+---------+
| check     |     642 |
| cheque    |    9072 |
| color     |    1349 |
+-----------+---------+</pre>
<p>Hm. There is something a bit strange going on here. I see two words that are supposed to be synonyms, both indexed. I didn&#8217;t expect <strong>check</strong> and <strong>cheque</strong> to both be there. After all, my colour is missing, right?</p>
<p>It turns out that there was a bug in my <a href="http://www.phpbbdoctor.com/modsteps.php?m=71&#038;l=1">Efficient Cleanwords MOD</a>. The last word in a sentence would not be properly processed as a stop word or as a synonym. I discovered this when I went back and added some words to the end of my sample sentence like this:</p>
<blockquote><p>I need to check the colour of my cheque for this post</p></blockquote>
<p>Once I did that, I got this:</p>
<pre>+-----------+---------+
| word_text | word_id |
+-----------+---------+
| post      |     442 |
| color     |    1349 |
| check     |     642 |
+-----------+---------+</pre>
<p>Now that&#8217;s more like it. Is it? Well, at least we can start talking about how synonyms work. </p>
<p><em>The <a href="http://www.phpbbdoctor.com/modsteps.php?m=71&#038;l=1">Efficient Cleanwords MOD</a> code has been updated to fix this bug. This is the sort of &#8220;alpha&#8221; testing I typically try to do of all of my MODs, as I despise releasing buggy code. Doesn&#8217;t mean I don&#8217;t do it, I just despise it. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </em></p>
<p>What I originally expected to have happen was that if I used the word <strong>color </strong>in a post that the search index would include both <strong>color </strong>and <strong>colour </strong>as alternative spellings. That way if one of my friends from &#8220;across the pond&#8221; were to come to the phpBB Doctor site and search, they would be able to find posts that included the word <strong>colour</strong>. Even though nobody would really ever type it that way. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' /> </p>
<p>On reflection, that was a really dumb idea. Why would we want to increase the size of our search index tables by storing something twice? Almost everything I have posted in this series has been about reducing the size of the searchwords table, or making the interactions with that table more efficient. Adding more data is not the way to do that.</p>
<p>So as it turns out, it is quite simple: any iteration of the word <strong>color</strong> (color or colour) is indexed as the shorter spelling, as that&#8217;s the first word on the line in the synonyms text file. That means that <strong>cheque</strong> will always be indexed as <strong>check</strong>. And that <strong>centre</strong> will always be indexed as <strong>center</strong>. And that&#8230; wait a minute, let&#8217;s take a closer look at that list again&#8230;</p>
<pre>center centre
check cheque
color colour
comission commission
comittee committee
commitee committee</pre>
<p>The synonyms process always maps the second word to the first. So what is wrong with this picture? Do you know how to spell committee? That&#8217;s not a language thing, as far as I know it&#8217;s always spelled <strong>committee</strong>. It&#8217;s certainly not what you see in the first words in the listing shown above&#8230;</p>
<p>So I entered the following post:</p>
<blockquote><p>A committee is a group of people able to accomplish nothing</p></blockquote>
<p>Here&#8217;s the results from the indexing process:</p>
<pre>+------------+---------+
| word_text  | word_id |
+------------+---------+
| able       |     206 |
| accomplish |    3573 |
| comittee   |    9075 |
| group      |    1495 |
| people     |    1166 |
+------------+---------+</pre>
<p>Now I don&#8217;t know about you, but &#8220;committee&#8221; in my book is spelled with two m&#8217;s and two t&#8217;s. And since that&#8217;s how the word is spelled in the second column of the synonyms text file, it seems obvious to me that someone, well, someone goofed. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  Here&#8217;s the relevant code from an unmodified version of the clean_words() function:</p>
<pre>list($replace_synonym, $match_synonym) = split(' ', trim(strtolower($synonym_list[$j])));</pre>
<p>The &#8220;replace&#8221; word is first, the &#8220;match&#8221; word is second. So if the code finds a match for the second word, it is replaced by the first. Oops.</p>
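<p>To see the direction of the mapping in action, here is a small stand-alone sketch. It assumes the replacement itself is a plain space-delimited str_replace(), uses explode() in place of the older split() call, and works on a made-up sentence and a three-line synonym list:</p>
<pre>// Each line is "replace match": the second word is looked for in the text
// and swapped for the first. explode() stands in for the old split() call.
$synonym_list = array('color colour', 'comittee committee', 'commitee committee');
$entry = ' a committee picks a colour ';

foreach ($synonym_list as $line) {
    list($replace_synonym, $match_synonym) = explode(' ', trim(strtolower($line)));
    $entry = str_replace(' ' . $match_synonym . ' ', ' ' . $replace_synonym . ' ', $entry);
}

echo trim($entry); // a comittee picks a color</pre>
<p>The correctly spelled word goes in and the misspelled first-column word comes out, which is exactly what showed up in the index.</p>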
<p>As it turns out, there is a problem with the logic used in the code compared to the actual format of the search_synonyms.txt file. I will be fixing that, and probably posting a bug report. I am guessing that since many of the phpBB developers are <strong>not</strong> of US persuasion, that they looked at this file and assumed colour and centre were, of course, the desired words. So they naturally assumed that the proper word would be listed second. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_lol.gif' alt=':lol:' class='wp-smiley' />  That is complete speculation on my part, and a tug on the leg of whoever was responsible for setting up the contents of this file. Dare I guess it might have been done by &#8220;comittee&#8221;? <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_biggrin.gif' alt=':-D' class='wp-smiley' /> </p>
<p>Having said that, just what is the impact? Does it still work?</p>
<p>When the word &#8220;committee&#8221; comes through as part of a post every synonym in the text file is checked. That means &#8220;committee&#8221; is replaced by &#8220;comittee&#8221; first. Then the second line that lists &#8220;committee&#8221; (commitee committee) is skipped because the word is no longer there to match. My redesigned table-driven process suffers from the same issue, as I simply loaded the synonyms file straight into the table without checking that it was defined correctly.</p>
<p>But does search work? Ironically, yes, it will. When you enter the word &#8220;committee&#8221; as a search term, it will be remapped to &#8220;comittee&#8221; which is, of course, indexed. So a cynical person might suggest that perhaps there was no error, and that the shorter word was simply stored as a way to preserve space. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' />  I would buy that, except that there are two lines with the word &#8220;committee&#8221; on them, and they are therefore clearly backwards.</p>
<p>There are other backwards entries, such as these:</p>
<pre>heighth height
milage mileage
morgage mortgage</pre>
<p>Remember the first word is what will get stored in the index, the second word is what is matched in the post. So if someone enters the word &#8220;mortgage&#8221; in a post (which is spelled correctly) it will be stored in the search table as &#8220;morgage&#8221; instead. There are also other &#8220;doubled&#8221; entries such as these examples:</p>
<pre>maintainance maintenance
maintenence maintenance
ommision omission
ommission omission
suprise surprise
surprize surprise</pre>
<p>You might argue that it would be faster and easier to reverse the php code in the clean_words() function&#8230; except that not everything is reversed!</p>
<p>I will leave it to you to examine your search_synonyms.txt file and fix the errors that you might find. Just remember that the second word is the &#8220;mistake&#8221; or alternate spelling, and the first word on the line is what will actually be stored in your index.</p>
<p>It doesn&#8217;t break the search. But in the current format any words that appear doubled are not going to work, as the second synonym line will never be used.</p>
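<p>If you want help spotting the doubled entries, something along these lines could do it. This is only a sketch: find_doubled_synonyms() is a hypothetical helper, not part of phpBB or of my MODs, and the sample lines are hard-coded so it runs without the real file:</p>
<pre>// Hypothetical helper: report "match" words (second column) that appear on
// more than one line, since only the first such line can ever take effect.
function find_doubled_synonyms(array $lines): array
{
    $by_match = array();
    foreach ($lines as $line) {
        $parts = preg_split('/\s+/', trim(strtolower($line)));
        if (count($parts) !== 2) {
            continue; // skip blank or malformed lines
        }
        // Group every first-column word under its second-column word.
        $by_match[$parts[1]][] = $parts[0];
    }
    // Keep only the match words listed on two or more lines.
    return array_filter($by_match, function ($replacements) {
        return count($replacements) &gt; 1;
    });
}

$lines = array('comittee committee', 'commitee committee', 'color colour', 'suprise surprise', 'surprize surprise');
print_r(find_doubled_synonyms($lines)); // lists "committee" and "surprise"</pre>
<p>On a real board the lines could come from calling file() on your own copy of search_synonyms.txt.</p>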
<p>I don&#8217;t have to do anything to fix my code related to pushing synonyms into the database, as the code is fine. I will, however, have to clean up my synonyms table data. It&#8217;s a good thing that as part of my MOD I created an ACP page to allow me to manage my synonyms, right? <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p><strong>Summary</strong><br />
I have logged a bug with the phpBB Group, but I don&#8217;t expect anything to happen from it. I don&#8217;t mean that in a sarcastic or cynical way&#8230; it&#8217;s just that this bug is certainly not security or performance related, and the fix would be quite challenging. Think about it; you would have to alter the contents of the search_synonyms.txt file (easy) and then rebuild your index tables (hard). I understand phpBB3 includes a rebuild index feature, but phpBB2 does not. I don&#8217;t expect that they would fix this, but perhaps they&#8217;ll take a closer look at phpBB3 to make sure it doesn&#8217;t suffer from the same issue.</p>
<p>My ACP Stopwords Manager MOD (not yet published) will address this by providing a sequence of SQL statements to load the table correctly. So there&#8217;s another reason to consider looking at the MOD when it comes out. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8)' class='wp-smiley' /> </p>
<p>The Efficient Cleanwords() MOD does not do anything to the stopwords or synonyms processing. I would ultimately expect that it will become a part of the Stopwords Manager MOD but I will also release it as a stand-alone MOD for those that want to retain the standard stopwords processing.</p>
<p><strong>Update (March 18, 2007)</strong><br />
It seems that someone else posted the exact same bug years ago. The bug was closed by one of the developers, and for the reasons I expected. Any fix is not simply a code fix but would also require a rebuild of the search_wordmatch and search_wordlist tables as well. Since those features are not in phpBB2 (they are in phpBB3) it would require a MOD rather than a core code fix.</p>
<p>I feel a bit ambivalent about this. On the one hand, this is hardly a major issue. The only exposure is that if you have two (or more) synonyms for the same word, only the first is ever processed. Is that a huge deal? Probably not.</p>
<p>I will be fixing it with my MOD. Once search synonyms are moved into the database (and managed via the ACP) a board administrator will be able to easily correct their data. I will probably not release my own &#8220;rebuild search&#8221; MOD but instead will suggest that board owners install one of the others already released at phpbb.com instead.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/03/06/how-does-search-work-part-vi-search-synonyms/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Rebuilding Search Index: Performance Results</title>
		<link>http://www.phpbbdoctor.com/blog/2007/03/04/rebuilding-search-index-performance-results/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/03/04/rebuilding-search-index-performance-results/#comments</comments>
		<pubDate>Sun, 04 Mar 2007 19:27:31 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=98</guid>
		<description><![CDATA[Yesterday I posted about moving the stopwords file into the database and changing the way the stopwords (and search synonyms) are processed. Then I rebuilt the search index for my largest board. The results? I reindexed 298,070 posts in about five and a half hours. Read more for details&#8230;  
Last year when I ran [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I posted about moving the stopwords file into the database and changing the way the stopwords (and search synonyms) are processed. Then I rebuilt the search index for my largest board. The results? I reindexed 298,070 posts in about five and a half hours. Read more for details&#8230;  <span id="more-98"></span></p>
<p>Last year when I ran the rebuild index process it took well over eight hours for around 200,000 posts. Yesterday the post count stood at 298,070, so if you project from last year I would have expected the process to run for closer to 12 hours this time. After making the changes suggested in the prior post (moving search stopwords and search synonyms into the database, skipping the read of the current text files, applying a uniquefy process to the words to be processed, increasing the number of stopwords as appropriate) here are some numbers for you to consider. In the chart below, the line indicates total elapsed time. Each marker represents a &#8220;batch of posts&#8221; that consists of all of the posts from 1 to 100, 101 to 200, 201 to 300, and so on.</p>
<p><img src="/blog/images/rebuild_search_graph.jpg" width="511" height="345" border="1" alt="Elapsed Time Graph" title="Graph showing elapsed time per 100 post range during search rebuild" /></p>
<p>There are four time trials in all, two with the &#8220;old&#8221; logic and two with the &#8220;new and improved&#8221; logic. Here are the numbers behind the graph; I think they&#8217;re fairly compelling:</p>
<p><img src="/blog/images/rebuild_search_data.jpg" width="411" height="249" border="0" alt="Elapsed Time Graph" title="Graph showing elapsed time per 100 post range during search rebuild" /></p>
<p>Here is the easy conclusion: the new stopwords process results in a 50% reduction in processing time. The average time per 100 posts under the old code is eleven seconds (0:11) while the average under the new code is five (0:05).</p>
<p>What happened when I let the process run on my large database? First, I found some bugs in my code. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_lol.gif' alt=':lol:' class='wp-smiley' />  I had projected that the process would run for five hours and thirty seven minutes (5:37:00). It actually ran for 6:04:11. Hm, well, okay. So my estimate was off, right? It turns out that wasn&#8217;t the problem. The problem was that php doesn&#8217;t barf if you manage to spell a variable name wrong. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />  Yes, there are settings for that, I don&#8217;t have them turned on. So what did happen during those 6 hours?</p>
<p>What happened was that every post got processed, and the stopwords were handled as I designed, but <strong>no stopwords were actually removed from the post</strong>. So it turns out to be a happy accident, really. Recall that it took over eight hours last time for me to run the reindexing process, and then it took six for 50% more posts. The result was that I got over 12 million rows in my search index table because none of the stopwords were removed. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_eek.gif' alt=':shock:' class='wp-smiley' />  </p>
<p>Let me clarify my point a bit: the process ran, and ran efficiently. With stopwords in the database instead of a text file I shaved many hours of actual runtime off of the process. It just didn&#8217;t stop the stopwords that it found, so they got inserted into the database along with the more valuable words that I wanted to index.</p>
<p>The next step was for me to fix that bug and rerun the process. Remember that I had originally projected it to run in 5:37:00? After fixing the bugs and running the process a second time, the total elapsed time was 5:25:07. Not bad. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  And after the fix, I had about 6.5 million rows in my search index table. This is why I consider the bug to be a happy accident as now I have some more interesting numbers to analyze. Specifically:</p>
<p>With stopwords I reduced the size of my search index from 12.3 million to 6.5 million, a 48% reduction. There&#8217;s the value in stopwords, right there. Before I cleaned up my bug I had nearly 238K posts with the word &#8220;the&#8221; indexed. Yes, that&#8217;s 238,000 out of 298,000 posts that contained the word &#8220;the&#8221;. You can see why it would be pointless to index that word, along with many of the other more common words. I wish now that I had determined the most popular word in my database before deleting everything; that might have been an interesting piece of data.</p>
<p>Another interesting observation: the difference in the total execution time pre-bug and post-bug was only 39 minutes. That means that it took my server 39 minutes to insert the extra 5.8 million rows of data into the search index table. Since the code was essentially the same (other than actually removing the stopwords from the processed data) the remaining time was all insert statements.</p>
<p><strong>Summary</strong><br />
The process took 5:25:07 to process 298,070 posts. There were 172,367 &#8220;words&#8221; indexed. There are 6,450,990 records in the phpbb_search_wordmatch table now that it&#8217;s all done. Due to the efficiencies gained through my revised code I have increased the number of stopwords from 287 (my original file) to 1,215. And no, I am not going to go back and try to test the old code with that size of stopwords file. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' /> </p>
<p>The reason my stopwords file increased so much is I added all of the three-digit numbers to my file, starting from 100 and going to 999. I noticed quite a few numeric values in my search index, and decided that I don&#8217;t need those. In fact, I&#8217;m going to see about removing any &#8220;number&#8221; words from the process via a regular expression instead&#8230; something that looks for any &#8220;words&#8221; composed strictly of the digits 0 &#8211; 9 and remove them. I guess that&#8217;s another step in the MOD. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8)' class='wp-smiley' /> </p>
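<p>A regular expression along these lines might do the job. This is a sketch only, assuming the words arrive as a space-delimited string with a leading and trailing space; the sample text is made up:</p>
<pre>// $entry is assumed to be a space-delimited word list padded with spaces.
$entry = ' order 12345 shipped 2007 via route66 ';

// Drop any "word" made up strictly of the digits 0-9. The lookahead leaves
// the trailing space in place so back-to-back numbers are both caught.
$entry = preg_replace('/ [0-9]+(?= )/', '', $entry);

echo trim($entry); // order shipped via route66</pre>
<p>Words that merely contain digits, such as route66, are left alone because the pattern only matches when the whole word is numeric.</p>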
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/03/04/rebuilding-search-index-performance-results/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>How does search work? Part V: Search Stopwords Redux</title>
		<link>http://www.phpbbdoctor.com/blog/2007/03/03/how-does-search-work-part-v-search-stopwords-redux/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/03/03/how-does-search-work-part-v-search-stopwords-redux/#comments</comments>
		<pubDate>Sat, 03 Mar 2007 09:02:02 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=97</guid>
		<description><![CDATA[From this post earlier in the Search series I made this statement:
The changes also reduced the amount of time required to rebuild my search tables by 4%. As it turned out later on, the reduction in time was not because of the uniquefy process, but something completely unexpected. You will just have to come back [...]]]></description>
			<content:encoded><![CDATA[<p>From <a href="http://www.phpbbdoctor.com/blog/?p=69">this post</a> earlier in the Search series I made this statement:</p>
<blockquote><p>The changes also reduced the amount of time required to rebuild my search tables by 4%. As it turned out later on, the reduction in time was not because of the uniquefy process, but something completely unexpected. You will just have to come back for the next installment to find out what it was. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8)' class='wp-smiley' /> </p></blockquote>
<p>The &#8220;uniquefy&#8221; process was based on the idea that when you&#8217;re getting ready to store your post words you don&#8217;t need duplicate words to be processed. They just take extra time for no benefit. So I wrote a MOD that would include logic to &#8220;uniquefy&#8221; the list of words from a post before they were processed any further. I <strong>thought</strong> that was where the 4% performance benefit was coming from. I was wrong.</p>
<p>I am now ready to reveal the secret. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p><span id="more-97"></span></p>
<p><a href="http://www.phpbbdoctor.com/blog/?p=65">This post</a> was where I first started talking about the search &#8220;stopwords&#8221; file used by phpBB. If you want to go read it first, go ahead. I&#8217;ll wait right here.</p>
<p>Ah, you&#8217;re back, or perhaps you never left. In either case, here&#8217;s a very quick review of the purpose of the search_stopwords.txt file. It contains words that should never be added to your word index (phpbb_search_wordlist + phpbb_search_wordmatch) tables. By reviewing your search word index tables and identifying the most popular words, you can reduce the size of these tables and make search run quicker.  Now here&#8217;s the ironic thing: that last statement is wrong. </p>
<p>Adding stopwords will &#8211; as discussed in the prior post &#8211; make your search word index tables smaller. But it actually will degrade the performance of posting and searching. So it&#8217;s a win for database size, but a loss for performance. At first glance this does not seem to make sense. Why would increasing the size of your stopwords file be a performance issue?</p>
<p><strong>Current Stopwords Code</strong><br />
Here is a look at the code from includes/functions_search.php that relates to the stopwords handling. This code is from version 2.0.22 which was current as I wrote this blog entry.</p>
<pre>if ( !empty($stopword_list) )
{
  for ($j = 0; $j < count($stopword_list); $j++)
  {
    $stopword = trim($stopword_list[$j]);

    if ( $mode == 'post' || ( $stopword != 'not' &#038;&#038; $stopword != 'and' &#038;&#038; $stopword != 'or' ) )
    {
      $entry = str_replace(' ' . trim($stopword) . ' ', ' ', $entry);
    }
  }
}</pre>
<p>The variable $entry contains a space-delimited string of all of the words from the post. In the <a href="http://www.phpbbdoctor.com/blog/?p=69">prior blog entry</a> mentioned at the start of this post I showed how to ensure that the list of words was unique, so I won't go back over that again. The $stopword_list is an array passed by reference into this function. If it contains at least one value, then this line of code is executed for each word in your stopwords file:</p>
<pre>$entry = str_replace(' ' . trim($stopword) . ' ', ' ', $entry);</pre>
<p>Let me say that again. The line of code shown above is <strong>executed for every single word in the stopwords text file.</strong> You read that right... the code is based on the size of your stopwords file, not the number of words actually contained in your post. So after going to all of the trouble to optimize the contents of <span style="color:blue;">$entry</span> and make sure the list of words was unique, it really didn't matter. The remaining code still runs 287 <span style="color:blue;">str_replace()</span> commands. (My stopwords file currently has 287 words in it.)</p>
<p>And the bad news doesn't stop there. The entire process is repeated to process your post subject as well! <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_eek.gif' alt=':shock:' class='wp-smiley' />  That's another 287 <span style="color:blue;">str_replace()</span> operations, on what is no doubt an even shorter value for <span style="color:blue;">$entry</span>.</p>
<p>Wouldn't it make more sense to process the other way around? To take out the stopwords that we know are present, rather than looking for every single possibility?</p>
<p>I have a big board that I love to use for analysis of stuff like this. I believe that there is no substitute for real data when it comes to tuning or determining optimization strategies. This board has nearly 300K posts as I write this. The average number of words left after removing stopwords is 22. Twenty two! That's 22 unique words per post, yet the entire stopwords file is checked. What the code shown above is doing is reading each word in the stopwords file and then removing it from the list of post words.</p>
<p>Even if the word isn't there. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_eek.gif' alt=':shock:' class='wp-smiley' /> </p>
<p><strong>To Cache or Not To Cache</strong><br />
I've put a lot of work into working out a caching system used for quite a few phpBB Doctor MODs. The caching system was designed so that I could use it in more than one MOD, making it easier for me to bundle MODs together without conflicts. Why cache?</p>
<p>When you run a query there are several steps:</p>
<ul>
<li>Connect to the database</li>
<li>Execute the query</li>
<li>Read the results from disk</li>
</ul>
<p>When you get right down to it, the database is just a filter for reading a disk file. What if you don't want a filter... what if you want to read the entire file? </p>
<ul>
<li>Open the file</li>
<li>Read it</li>
</ul>
<p>Much more efficient. Note that the efficiencies are lost if you want to read only a part of the file, or you want to read bits from two (or more) files joined together. For that, you really want to leverage a relational database. So what's my point?</p>
<p>The phpBB stopwords process is reading the entire file, every time. I don't want to read the entire file! I only want to check the words that are actually in my post. Because of this, what I really need is a query. I'm going to move a text file cached on disk back into the database. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p><strong>Search Stopwords Manager</strong><br />
I have already written a MOD called the Search Stopwords Manager. This MOD (not yet released) allows a board administrator to manage their stopwords file contents via an admin control panel (ACP) interface rather than having to edit a text file. After the processing is completed, the file contents are written back out to the disk. Now I am going to change that. I am going to leave the stopwords in the database.</p>
<p>Let's go back to the stopwords code shown above. The entire stopwords file has been read from disk into an array. Then every word in the array is removed from the <span style="color:blue;">$entry</span> string variable, whether it exists or not. Not very efficient.</p>
<p>To fix this, I have added code into functions_search.php that uses the following process flow:</p>
<ul>
<li>Uniquefy the words contained in <span style="color:blue;">$entry</span></li>
<li>Query the stopwords database table using that list of words as a WHERE clause</li>
<li>Remove any stopwords found from <span style="color:blue;">$entry</span></li>
</ul>
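<p>Here is a minimal sketch of that flow. It is an illustration only, not the code from the MOD: the table name phpbb_search_stopwords, its stopword column, and both function names are assumptions, and the database lookup is passed in as a callback so the example can run without a real database.</p>
<pre>
// Hypothetical helper: build the lookup query for the words in one post.
// The table and column names are assumptions, not the real phpBB schema.
function build_stopword_query($words)
{
    $quoted = array();
    foreach ($words as $word)
    {
        // addslashes() stands in for the escaping your database layer provides.
        $quoted[] = "'" . addslashes($word) . "'";
    }
    return 'SELECT stopword FROM phpbb_search_stopwords WHERE stopword IN (' . implode(', ', $quoted) . ')';
}

// Hypothetical helper: remove only the stopwords that actually occur in $entry.
// $lookup receives the unique word list and returns the subset that are stopwords.
function remove_stopwords($entry, $lookup)
{
    $all = preg_split('/\s+/', trim($entry), -1, PREG_SPLIT_NO_EMPTY);
    if (count($all) == 0)
    {
        return $entry;
    }

    // Step 1: uniquefy the words in the post.
    $unique = array_values(array_unique($all));

    // Step 2: ask which of those words are stopwords.
    $found = call_user_func($lookup, $unique);

    // Step 3: drop every occurrence of the stopwords that were found.
    $kept = array();
    foreach ($all as $word)
    {
        if (!in_array($word, $found))
        {
            $kept[] = $word;
        }
    }
    return ' ' . implode(' ', $kept) . ' ';
}

// Stand-in for the database: in real use this callback would run
// build_stopword_query($words) and return the matching rows.
$fake_lookup = function ($words)
{
    $stopwords = array('the', 'is', 'and');
    return array_values(array_intersect($words, $stopwords));
};

$result = remove_stopwords(' the search is slow and the index is big ', $fake_lookup);
// $result is ' search slow index big '
</pre>
<p>The point of the callback is that the amount of work now depends on the handful of unique words in the post, not on the length of the stopword list.</p>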
<p>As you can see, it turns the current process completely around backwards. I get only the words that I am interested in from the database table containing my stopwords. For each word that I find, I remove it. Short and simple. But does it work?</p>
<p><strong>Benchmarking Results</strong><br />
I created a test board and loaded it with about 30,000 posts. These posts had an average word count of 64 and an average post length of 700 characters. I ran a Rebuild Search Index MOD that I wrote a while back. (This MOD - like most I have seen of this type - simply works its way through every post on the board, passing the posts through the standard phpBB post processing.) Before the changes that process took nearly an hour (actual time was 52:32). After loading the stopwords into the database and making the alterations outlined above the process took a bit over half of the original time (actual time was 32:04). I think you would agree that is quite an improvement. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  It works out to about 10 seconds per 100 posts versus 6 seconds per 100 posts. That's about a 40% reduction in post processing time, and a lot of cpu cycles that have been recovered. And it gets better... more on that in a few more paragraphs. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>What is the cost? We have to run a query to get a specific list of stopwords used for each post. The benefit is that we don't process the entire file. An additional benefit (for me) is that the Search Stopwords Manager MOD that I wrote can be simplified. I don't have to worry about writing out a disk file... I just have to manage the contents of a database table.</p>
<p><strong>Summary</strong><br />
Back in the original post I mentioned a 4% reduction in post processing time. I had originally thought that the reduction was a result of the uniquefy process I applied to the post words. I was wrong. As part of testing the regex I had removed a bunch of "short" words from my stopwords text file. It was that slight change... removing a few short words from my stopwords file... that caused the difference in the rebuild process.</p>
<p>After moving the stopwords into the database and altering the code to take advantage of that, I've saved 40%. After removing the code that reads in the (now useless) stopwords file and adding the search synonyms file to a different database table I have reduced the processing time down to about 4 seconds per 100 posts, a 60% reduction in processing time for my rebuild MOD.</p>
<p>Posting is also going to benefit from these changes. First I uniquefy the words from the post, then I check that specific list of words against both the stopwords and synonyms lists. You might not notice a microsecond or two on each post, but your server will, especially on a busy board.</p>
<p>I still need to work out how to handle multiple languages, but that should be a simple matter. If you are wondering, I don't have any code to release for this yet, but I certainly plan to do so. I would think that anything that can reduce processing time by over half would be well received by the phpBB community.</p>
<p>Stay tuned for more details. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/03/03/how-does-search-work-part-v-search-stopwords-redux/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Regex Redux</title>
		<link>http://www.phpbbdoctor.com/blog/2007/02/11/regex-redux/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/02/11/regex-redux/#comments</comments>
		<pubDate>Sun, 11 Feb 2007 06:20:53 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=86</guid>
		<description><![CDATA[In an earlier post I dissected the regular expression used in the clean_words() function to remove short and long words from the search index tables. I figured out how it worked, and then why it was broken from 2.0.6 forward. I fixed it.
For English boards only.  
I have had some feedback from an owner [...]]]></description>
			<content:encoded><![CDATA[<p>In an <a href="http://www.phpbbdoctor.com/blog/?p=74">earlier post</a> I dissected the regular expression used in the clean_words() function to remove short and long words from the search index tables. I figured out how it worked, and then why it was broken from 2.0.6 forward. I fixed it.</p>
<p>For English boards only. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>I have had some feedback from an owner of a German (Swiss-German, actually) board, and the fix I provided does not work for his board. I think I know why. I don&#8217;t (yet) know how to fix it.  <span id="more-86"></span></p>
<p>As a quick review, for versions 2.0.4 and earlier the regular expression used to separate words by spaces uses the <span style="color:blue;">\b</span> token to identify a word boundary. From 2.0.6 and forward the code uses the character set <span style="color:blue;">[ ]</span> instead. The problem with using a space is that if you have multiple short words in a row like <strong>is it an error?</strong> then every other short word will escape the regex and get stored in your search index.</p>
<p>When all of the posts are in English, then switching back to the <span style="color:blue;">\b</span> seems to work extremely well. But for non-English boards it appears to be a problem.</p>
<p>Case in point: here are a few words from this Swiss-German board that cause problems for some reason:</p>
<pre>höö
ööh
jöö</pre>
<p>I am told that these words are all interjections. In English they would be words like &#8220;hey&#8221; or &#8220;oh&#8221; or &#8220;wow&#8221; or similar. The issue? The <strong>öö</strong> for some reason gets cut off from the related letter in the word. I have no idea why. Another problem is the word <strong>nümme</strong> where the <strong>ü</strong> character seems to be dropped as an invalid character, leaving the &#8220;words&#8221; <strong>n</strong> and <strong>mme</strong>. The &#8220;n&#8221; gets dropped because it&#8217;s now a one-letter word, and &#8220;mme&#8221; gets added to the search index.</p>
<p>I can guess about part of this, but not all. At least not yet. But for example the phrase:</p>
<pre>Jöö, das isch ja süess! </pre>
<p>&#8230;which I am told translates to &#8220;Oh, that&#8217;s sweet!&#8221; as Jöö is an interjection meaning &#8220;Oh&#8221; would get processed as:</p>
<pre>öö das isch süess</pre>
<p>The <strong>ja</strong> gets removed as it&#8217;s too short. The <strong>jöö</strong> gets screwed up because the &#8220;j&#8221; is &#8211; for some reason &#8211; ignored. Very frustrating. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_confused.gif' alt=':-?' class='wp-smiley' /> </p>
<p>In doing more research on the PHP site I found a page that discusses the various tokens and tags and whatnot that can be used in regular expressions. The <a href="http://www.php.net/manual/en/reference.pcre.pattern.syntax.php">specific page</a> includes this quote:</p>
<blockquote><p>A &#8220;word&#8221; character is any letter or digit or the underscore character, that is, any character which can be part of a Perl &#8220;word&#8221;. The definition of letters and digits is controlled by PCRE&#8217;s character tables, and may vary if locale-specific matching is taking place. For example, in the &#8220;fr&#8221; (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w. </p></blockquote>
<p>So clearly I&#8217;m onto something. The <span style="color:blue;">\b</span> is supposed to match a word boundary, and a word boundary is defined as a change of state between a word character matched by <span style="color:blue;">\w</span> and a non-word character matched by <span style="color:blue;">\W</span>. So what I have to do is figure out how PHP assigns those characters for various languages. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
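<p>One possible explanation, and it is only a guess: if the posts are stored as UTF-8 and PHP is running in its default &#8220;C&#8221; locale, each umlaut is two bytes and neither byte counts as a word character, so <span style="color:blue;">\b</span> sees a boundary in the middle of the word. The sketch below reproduces that under those assumptions. The <span style="color:blue;">/u</span> modifier at the end is an experiment to try, not a tested fix for any board.</p>
<pre>
// Assumption: the text is UTF-8, so every umlaut is stored as two bytes.
$joo   = " j\xC3\xB6\xC3\xB6 ";   // " jöö "
$numme = " n\xC3\xBCmme ";        // " nümme "

$regex = '/\b([\S]{1,2}|[\S]{21,})\b/';

// Byte mode: the "j" is followed by a non-word byte, so it looks like
// a one-letter word and is replaced by a space.
$joo_bytes = trim(preg_replace($regex, ' ', $joo));

// Byte mode: the "n" and the two-byte umlaut are each removed as short
// "words", and only the tail of the word survives.
$numme_bytes = trim(preg_replace($regex, ' ', $numme));

// With /u the pattern and subject are treated as UTF-8 characters, which
// should keep the umlauts attached to their neighbours on PHP builds where
// /u also enables Unicode character properties.
$joo_utf8 = trim(preg_replace($regex . 'u', ' ', $joo));

var_dump($joo_bytes, $numme_bytes, $joo_utf8);
</pre>
<p>In the default locale the first two results are the bare umlauts and &#8220;mme&#8221;, which matches what the board owner reported.</p>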
<p>Because of these issues, I have not yet published my MOD at phpbb.com. It&#8217;s available here, and you can certainly download it if you run an English-language board. It&#8217;s working great for me on several boards. But it doesn&#8217;t work for non-English boards at this time.</p>
<p>In a frustrating turn of events, using <span style="color:blue;">[ ]</span> works perfectly for the Swiss-German board posts. It does, however, still have the problem of improperly processing a string of consecutive short (or long) words. As I type this, I am wondering if putting this back in and substituting two spaces for every space might not be a quick fix. Adding a second space would stop the problem of the first word match &#8220;eating&#8221; the second word&#8217;s space.</p>
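<p>As a rough sketch of that idea (untested on a real board, and the variable names are made up for the example), doubling the spaces first gives every word a leading and a trailing space of its own:</p>
<pre>
$entry = ' to be in love ah it is bliss ';

// Double every space so that neighbouring words no longer share one.
$doubled = str_replace(' ', '  ', $entry);

// The 2.0.6 regex, unchanged.
$cleaned = preg_replace('/[ ]([\S]{1,2}|[\S]{21,})[ ]/', ' ', $doubled);

// Collapse the leftover runs of spaces so the result is easy to read.
$cleaned = trim(preg_replace('/[ ]+/', ' ', $cleaned));
// $cleaned is 'love bliss'
</pre>
<p>The cost is an extra pass over the string for the doubling and another to tidy up the spaces afterwards.</p>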
<p>Will have to try it and post back. Stay tuned for details. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/02/11/regex-redux/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>How does search work? Part IV: Evolution of a Regular Expression</title>
		<link>http://www.phpbbdoctor.com/blog/2007/02/02/how-does-search-work-part-iv-dissecting-a-regular-expression/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/02/02/how-does-search-work-part-iv-dissecting-a-regular-expression/#comments</comments>
		<pubDate>Fri, 02 Feb 2007 08:12:52 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=74</guid>
		<description><![CDATA[This is part IV of a series of posts about the phpBB2 search process. Previous posts include:

Part I: Table Review
Part II: Making Effective Use of &#8220;Stop Words&#8221;
Part III: Efficient clean_words() Function

You don&#8217;t have to read all of the prior parts in order to read this one. The last post was quite long, and so part [...]]]></description>
			<content:encoded><![CDATA[<p>This is part IV of a series of posts about the phpBB2 search process. Previous posts include:</p>
<ul>
<li><a href="http://www.phpbbdoctor.com/blog/?p=53">Part I: Table Review</a></li>
<li><a href="http://www.phpbbdoctor.com/blog/?p=65">Part II: Making Effective Use of &#8220;Stop Words&#8221;</a></li>
<li><a href="http://www.phpbbdoctor.com/blog/?p=69">Part III: Efficient clean_words() Function</a></li>
</ul>
<p>You don&#8217;t have to read all of the prior parts in order to read this one. The last post was quite long, and so part of what I wanted to cover there was postponed until this post. In this post I&#8217;m going to analyze what one particular regex (regular expression) from the clean_words() function is doing. In very early versions of phpBB2 it worked very well at keeping short and long words out of your search index tables. In later versions it did not work so well. In this post I will explain why, and provide an extremely easy fix.</p>
<p><span id="more-74"></span></p>
<p><strong>Cleaning Words</strong><br />
Prior to any &#8220;word&#8221; processing the clean_words() function has already removed HTML entities, BBCode, URLs and special characters. So in theory what is left is a bunch of words separated by spaces. We want to reduce those words to ones that we&#8217;re interested in. We are interested in words that are not &#8220;stop&#8221; words and that are between 3 and 20 letters long. So that&#8217;s the purpose of the regex I want to review in this post. In versions of phpBB2 through 2.0.4 it looked like this:</p>
<pre>$entry = preg_replace('/\\b([a-z0-9]{1,2}|[a-z0-9]{21,})\\b/',' ', $entry);</pre>
<p>The purpose of the regex is to drop any words of one or two letters and any words of 21 letters or more. Now I do not claim to be a regexpert. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' />  But I think I can figure this one out. This is a pattern match that will replace certain matches with a space. The patterns are contained between two forward slashes, so at a very basic level this regex is:</p>
<pre>replace('/things-that-I-match/', ' ', $entry);</pre>
<p>The magic part is the &#8220;things-that-I-match&#8221; which is everything between the two forward slash characters. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  So I&#8217;ll dissect that&#8230; it&#8217;s really not too hard. </p>
<p>The <span style="color:blue">\b</span> stands for a word boundary. That just means we&#8217;re going to work on full words as identified by PHP&#8217;s definition of a &#8220;word boundary&#8221;.</p>
<p>The part that looks like this <span style="color:blue">[a-z0-9]</span> is a pattern match condition that catches all letters of the alphabet and numbers zero (0) through nine (9). This structure is called a &#8220;character class&#8221; and appears quite frequently in regular expressions. The <span style="color:blue">{1,2}</span> is an interval quantifier, and says that it is required to match at least one but no more than two of the identified characters. Putting it all back together, I can see that <span style="color:blue">[a-z0-9]{1,2}</span> says match from 1 to 2 characters in the set from a-z or 0-9.</p>
<p><em>You might be wondering about case sensitivity at this point. Not to worry, as the entire string was converted to lower case earlier in the code. A space was added to the front and back end of the string too, which will become important later on.</em></p>
<p>After that pattern there is an &#8220;or&#8221; operator (the vertical bar <span style="color:blue">|</span> is for or) and then the same pattern is repeated. But this time the interval is <span style="color:blue">{21,}</span>. This will match any string of letters or numbers of 21 characters or more. I don&#8217;t know what this does for foreign language characters, but I&#8217;ll come back to that in a bit.</p>
<p>So this:</p>
<p><span style="color:blue">/\b([a-z0-9]{1,2}|[a-z0-9]{21,})\b/</span></p>
<p>&#8230; says to match any run of letters and numbers, bounded by word boundaries, that is 1-2 or 21+ characters long. After that match, the <span style="color:blue;">preg_replace()</span> function replaces each one with a space. Done. So that seems really easy, and it functions correctly on all of my boards.</p>
<p><strong>One Step Forward, Two Steps Back</strong><br />
But it changed. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  In version 2.0.6 the regex I&#8217;ve just been through changed to this:</p>
<pre>$entry = preg_replace('/[ ]([\S]{1,2}|[\S]{21,})[ ]/',' ', $entry);</pre>
<p>First, you&#8217;ll notice that the <span style="color:blue">\b</span> has been replaced by the character class brackets with a space, as in <span style="color:blue">[ ]</span>. It turns out this is going to be important, so remember that. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  Next, our nifty <span style="color:blue">[a-z0-9]</span> has been replaced by <span style="color:blue">[\S]</span> instead. The <span style="color:blue">\S</span> represents any single non-whitespace character, which certainly sounds useful. Remember that earlier I was concerned that the [a-z] portion was only going to match certain languages? I would guess, then, that the switch to <span style="color:blue">\S</span> was an attempt to be more aggressive at matching strings of characters rather than just letters from a to z.</p>
<p>So why doesn&#8217;t it work? <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_confused.gif' alt=':-?' class='wp-smiley' /> </p>
<p>As I have been studying regex techniques one of the phrases that comes up over and over goes something like this:</p>
<blockquote><p>Be careful what you ask for. You might get it.</p></blockquote>
<p>I believe that&#8217;s the explanation here. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' />  Having <span style="color:blue">[ ]</span> in the regex means that the <strong>match must include the space, thus not leaving it available for the following word.</strong> In other words, any space takes up more space than it is supposed to. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  It&#8217;s easily confirmed that the &#8220;new&#8221; regex fails to remove two-letter words if there are two two-letter words in a row. Or shorter. So this phrase:</p>
<pre>To be in love, ah it is bliss</pre>
<p>&#8230; causes problems. The word &#8220;To&#8221; would be dropped, &#8220;be&#8221; would be included, and &#8220;in&#8221; would be dropped. Why?</p>
<p>First, recall that I mentioned earlier that the string of words has had a space appended to the front and back as well as being converted to lower case. The line of code responsible for that operation is this:</p>
<pre>$entry = ' ' . strip_tags(strtolower($entry)) . ' ';</pre>
<p>Those extra spaces are important, because we&#8217;re actually requiring a space on either side of each word. After this line of code has been executed the short phrase I entered above would look like this:</p>
<pre>" to be in love ah it is bliss "</pre>
<p>This is after the lower case operation, the extra spaces have been added, and punctuation marks and other special symbols have been removed using the $drop_char_match array. Here&#8217;s how the regex will match what remains. Items inside [ ] are matches and are replaced by spaces; the other words are left behind. Note that with the matches eating up the spaces the second (and fourth and sixth&#8230;) two-letter word in a sequence of two-letter words will not match, as they don&#8217;t include a space. The space was sacrificed to the earlier match! So here is the string, with the matches marked out&#8230;</p>
<pre>[ to ]be[ in ]love[ ah ]it[ is ]bliss </pre>
<p>What is left after the items that matched the regex are replaced by a space?</p>
<pre>be love it bliss</pre>
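<p>That walk-through is easy to reproduce. The snippet below is just a test harness around the 2.0.6 regex, with variable names picked for the example. The second pass is there only to show that the stragglers are ordinary short words that the regex can match once they have a space of their own.</p>
<pre>
$regex = '/[ ]([\S]{1,2}|[\S]{21,})[ ]/';
$entry = ' to be in love ah it is bliss ';

// One pass, as phpBB 2.0.6 does it: every other short word survives.
$once = preg_replace($regex, ' ', $entry);
// $once is ' be love it bliss '

// A second pass picks up the words that lost their space the first time.
$twice = preg_replace($regex, ' ', $once);
// $twice is ' love bliss '
</pre>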
<p>So now I can see where the bogus two-letter words are coming from! These are the words that will be added to your search_wordmatch table. If the regex were applied repeatedly then this wouldn&#8217;t matter, as the new space added by the replace operation would be enough to allow the remaining two-letter words to be dropped as well. (In fact one person posted on phpbb.com that they altered their code so that the regex was executed three times in a row.) But part of the beauty of using regular expressions is being able to avoid complex looping code. <strong>You don&#8217;t have to keep doing the same thing over and over</strong>. If I were so inclined, I could write a very inefficient block of code that processed the string a character at a time. </p>
<p>Speaking of that&#8230; those of you that are quite observant might have noticed that the two regular expressions I&#8217;ve posted are from 2.0.4 and 2.0.6. What happened in 2.0.5? <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<pre>
$entry = explode(' ', $entry);
for ($i = 0; $i < sizeof($entry); $i++)
{
   $entry[$i] = trim($entry[$i]);
   if ((strlen($entry[$i]) < 3) || (strlen($entry[$i]) > 20))
      {
          $entry[$i] = '';
      }
}
$entry = implode(' ', $entry);
</pre>
<p>Remember that this code is run on the full text of every post that is saved (whether a new post or an edited one). It has to be efficient. Processing strings of data a character or even a word at a time is not very efficient. The loop in 2.0.5 was probably functional (I did not test it during my review) but I doubt that it was anywhere near as efficient.</p>
<p><strong>The Fix</strong><br />
The original regex seemed to work fine for me, but I never used languages other than English. I expect my words will be made up of the letters from &#8220;a&#8221; to &#8220;z&#8221;. The regex has to be more flexible than that, and the inclusion of <span style="color:blue;">[\S]</span> does appear to work. The problem is the switch from <span style="color:blue;">\b</span> to <span style="color:blue;">[ ]</span>.</p>
<p>When you use <span style="color:blue;">\b</span> it appears that the boundary is <strong>shared</strong> from one word to the next. That is because <span style="color:blue;">\b</span> is a zero-width assertion: it matches a position between two characters and does not consume either of them. If you use a required space as is done with <span style="color:blue;">[ ]</span> then the space is not shared. Once it is found and replaced, then the regex starts with the next character. If that character is the first letter of a one or two-letter word (or even worse, a 50-letter word) then that word is included in your search database because it doesn&#8217;t have the required space at the beginning.</p>
<p>Without further ado, I present to you the merged regex with the best of both worlds:</p>
<pre>$entry = preg_replace('/\\b([\S]{1,2}|[\S]{21,})\\b/',' ', $entry);</pre>
<p>I updated my boards to use this regex and ran my MOD that rebuilds my search indexes. So far it has worked flawlessly. This update is included in my Efficient clean_words() Function MOD which was introduced in my last blog post. During the writing of this MOD (and this series of posts) I went back and searched <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_lol.gif' alt=':lol:' class='wp-smiley' />  through the phpBB database for other discussions about this. I found a number of folks that had provided solutions, none of them exactly the same as this. Now that I really understand what is going on, I can say that some of the solutions posted should work.</p>
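<p>For anyone who wants to check the merged regex before touching a live board, here is a small stand-alone test. The sample sentence and the 34-letter word are just test data, and the variable names are made up for the example.</p>
<pre>
$regex = '/\b([\S]{1,2}|[\S]{21,})\b/';
$entry = ' to be in love ah it is bliss supercalifragilisticexpialidocious ';

// Every short word and the overly long word are replaced by a space
// in a single pass, because \b does not use up the neighbouring space.
$cleaned = preg_replace($regex, ' ', $entry);

// Split on runs of spaces to see which words would reach the index.
$words = preg_split('/[ ]+/', trim($cleaned));
// $words is array('love', 'bliss')
</pre>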
<p><strong>Make it Faster</strong><br />
This post was all about the regex used to get rid of short and long words so that they are not stored in your phpbb_search_wordlist (and wordmatch) tables. Without this fix, your tables can get filled up with words that should not be there, and that&#8217;s not a good thing. I did a lot of benchmarking during this process and initially thought that my updates to the clean_words() function were helping to improve performance. That is why I called my MOD related to these past two posts &#8220;Efficient clean_words() Function&#8221; as I thought I was making it more efficient.</p>
<p>It turns out that I was not. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  I corrected a problem where short or long words would get into your search index tables. That saves space and some processing time. But as far as the actual processing? It didn&#8217;t really add (or subtract) from the efficiency. I do absolutely believe that the changes suggested by the MOD are useful and appropriate, especially if you have a large board with very &#8220;chatty&#8221; posts made up of lots of short words. But will it dramatically improve your performance? No, not really. It will protect your search index from improper words, and it will protect your server from running extra search queries on duplicate words. Both of those are good for your board.</p>
<p>During my review I did, however, manage to make a change that had a side effect of improving my posting performance by about 4%. I mentioned that in the prior post, and I believe that I know what the actual cause is. I also believe that I can improve it even further. </p>
<p>However, once again we have run out of time and [ ] for this blog post, so you will have to come back for episode V for the big reveal. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/02/02/how-does-search-work-part-iv-dissecting-a-regular-expression/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

