<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Welcome to the phpBB Doctor Blog &#187; Database Tips</title>
	<atom:link href="http://www.phpbbdoctor.com/blog/category/phpbb/database-tips/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.phpbbdoctor.com/blog</link>
	<description>Your premium source for custom modification services for phpBB</description>
	<lastBuildDate>Fri, 30 Apr 2010 02:58:53 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Why Is 86400 A Magic Number?</title>
		<link>http://www.phpbbdoctor.com/blog/2009/10/06/why-is-86400-a-magic-number/</link>
		<comments>http://www.phpbbdoctor.com/blog/2009/10/06/why-is-86400-a-magic-number/#comments</comments>
		<pubDate>Tue, 06 Oct 2009 13:47:54 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>
		<category><![CDATA[MOD Writing]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=324</guid>
		<description><![CDATA[Anyone who has worked with database date/time fields probably recognizes the number from the title of this blog post. If not, it&#8217;s simple: there are 86400 seconds in a day. Why do I care about this? Because there are all sorts of fun things that I can do with that number.   
What Happened [...]]]></description>
			<content:encoded><![CDATA[<p>Anyone who has worked with database date/time fields probably recognizes the number from the title of this blog post. If not, it&#8217;s simple: there are 86400 seconds in a day. Why do I care about this? Because there are all sorts of fun things that I can do with that number. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  <span id="more-324"></span></p>
<h3>What Happened Yesterday?</h3>
<p>One of the frequent requests that I used to see on phpbb.com was something like this:</p>
<blockquote><p>How many visitors came to my board yesterday?</p></blockquote>
<p>The problem I have with questions like this is that your &#8220;yesterday&#8221; is not the same as mine, unless you happen to live in the central time zone in the United States. When I wrote a MOD to do this for a client, I convinced them that rather than showing what happened &#8220;yesterday&#8221; it would be better to show what happened in the last 24 hours.</p>
<p>The <code>user_lastvisit</code> field shows the date/time that a user last logged in. This field is used to track new topics during a user session. It&#8217;s also used to drive the difference between &#8220;new&#8221; and &#8220;unread&#8221; personal messages. (A &#8220;new&#8221; message arrived since the last session. An &#8220;unread&#8221; message is one that hasn&#8217;t been read yet but arrived before the current session started.) I have altered my memberlist.php code to show when the user last visited as well.</p>
<p>Like most date fields in phpBB, this field is stored as int(11) rather than as a date/time field. (Other examples are the user&#8217;s registration date, the post time, and the new topic time; the list goes on from there.) The content of the field is a large integer value known as a Unix timestamp.</p>
<blockquote cite="Wikipedia"><p>Unix time, or POSIX time, is a system for describing points in time, defined as the number of seconds elapsed since midnight proleptic Coordinated Universal Time (UTC) of January 1, 1970, not counting leap seconds.</p></blockquote>
<p>The convention for storing a Unix timestamp is to use a signed integer rather than an unsigned one. This allows a developer to store negative numbers to reflect dates prior to 1970. It also has its own Y2K issue, as a signed 32-bit integer will overflow in January 2038. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  But let me get back on track for this blog post.</p>
<h3>SQL Code for Last 24 Hours</h3>
<p>Because of the way the user last visit time is stored, I can easily get a list of people that have visited my board in the last 24 hours with this SQL code:</p>
<pre>select  user_id
,       username
from    phpbb_users
where   user_lastvisit >= (unix_timestamp() - 86400)
order by user_lastvisit desc</pre>
<p>The MySQL function <code>unix_timestamp()</code> returns the current date and time in a unix timestamp format so I don&#8217;t have to convert anything. Since the unix timestamp is a number of seconds, and since one day has 86400 seconds, by subtracting 86400 from the current time I get the matching time from 24 hours ago. Easy stuff. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>If I wanted to truly get a list of people that signed in &#8220;yesterday&#8221; then the first thing I have to do is define what yesterday means. Time zones could get involved. It could get messy. I much prefer the &#8220;last 24 hours&#8221; definition because it&#8217;s the same for everybody everywhere.</p>
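<p>For comparison, here is a rough sketch of what a calendar &#8220;yesterday&#8221; query could look like. It assumes that &#8220;yesterday&#8221; is defined by the time zone of the MySQL session, which is exactly the kind of decision the 24-hour version avoids. <code>CURDATE()</code> and <code>INTERVAL</code> are standard MySQL; the choice of time zone is an assumption made purely for illustration.</p>
<pre>-- Visitors between midnight yesterday and midnight today,
-- both evaluated in the session time zone (an assumption)
select  user_id
,       username
from    phpbb_users
where   user_lastvisit >= unix_timestamp(curdate() - interval 1 day)
and     unix_timestamp(curdate()) > user_lastvisit
order by user_lastvisit desc</pre>
<p>Two boards in different time zones would return different rows from this query at the same moment, while the 86400-second version returns the same window everywhere.</p>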
<h3>What About More Than One Day?</h3>
<p>Sometimes I want to calculate more than one day. Instead of memorizing multiples of 86400, I simply multiply by the number of days. So if I want to count how many people have logged in during the past 7 days (as defined by 24-hour periods rather than &#8220;days&#8221;) I would do this:</p>
<pre>select  count(user_id)
from    phpbb_users
where   user_lastvisit >= (unix_timestamp() - ( 86400 * 7 ) )</pre>
<p>This is easy enough to do, and the code becomes &#8220;self-documenting&#8221; in a manner of speaking. I know that there are 86400 seconds in a day, and if I multiply by 7 I get a week. This is much easier to read and understand than using the number 604800.</p>
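<p>The same idea extends to other periods. The sketch below simply spells out the arithmetic so the constants can be checked by hand; the column aliases are made up for illustration.</p>
<pre>-- 60 seconds * 60 minutes * 24 hours = 86400 seconds in a day
select  60 * 60 * 24   as seconds_per_day      -- 86400
,       86400 * 7      as seconds_per_week     -- 604800
,       86400 * 30     as seconds_per_30_days  -- 2592000</pre>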
<h3>Measuring Board Activity</h3>
<p>About two years ago I told folks I was eagerly looking forward to the first week where my board averaged 86400 page views daily. Now that I have explained what the number is, that statement makes more sense. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  I was looking for the first week that I averaged a page view every second for an entire week. That happened over a year ago, and in fact my board averages over 100K page views daily at this point.</p>
<p>Now I am looking forward to the first week that I average 172800 page views a day. Hmm, I wonder why that is? <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p><strong>Related Links</strong></p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Unix_timestamp">Wiki on Unix Timestamps</a></li>
<li><a href="http://en.wikipedia.org/wiki/UTC">Wiki on UTC</a></li>
<li><a href="http://en.wikipedia.org/wiki/Year_2038_problem">Unix Millennium Bug</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2009/10/06/why-is-86400-a-magic-number/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>MySQL Bug Breaks Banner System</title>
		<link>http://www.phpbbdoctor.com/blog/2009/07/12/mysql-bug-breaks-banner-system/</link>
		<comments>http://www.phpbbdoctor.com/blog/2009/07/12/mysql-bug-breaks-banner-system/#comments</comments>
		<pubDate>Sun, 12 Jul 2009 18:42:06 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>
		<category><![CDATA[MOD Writing]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=312</guid>
		<description><![CDATA[One of the reasons I wasn&#8217;t around much earlier this year was I was in the process of moving a bunch of sites over to a new server (including this one). In most cases the move went without a hitch. In one particular case there was an interesting bug that didn&#8217;t show up right away. [...]]]></description>
			<content:encoded><![CDATA[<p>One of the reasons I wasn&#8217;t around much earlier this year was that I was in the process of moving a bunch of sites over to a new server (including this one). In most cases the move went without a hitch. In one particular case there was an interesting bug that didn&#8217;t show up right away. It was related to the banner system I wrote for my largest board. Fortunately it was an error on the &#8220;good&#8221; side, so I didn&#8217;t make any sponsors angry. <span id="more-312"></span></p>
<p>The banner system is fairly complex, but at the most basic level there is a cron (scheduled) job that periodically decrements the sponsor view balance. Once the view balance hits zero, the sponsor&#8217;s banners are deactivated until they pay for another round. The code is quite simple:</p>
<pre>$sql = 'UPDATE  ' . SPONSORS_TABLE . '
	SET     view_balance =  view_balance - ' . $decrement_views . '
	WHERE   sponsor_id = ' . $sponsor_data[$i]['sponsor_id'];</pre>
<p>The value for <code>$decrement_views</code> is assigned earlier in the loop. The definition for the <code>view_balance</code> column is an unsigned integer (mediumint specifically) so it will not allow negative values. On my old server this worked perfectly. If <code>$decrement_views</code> was greater than <code>view_balance</code> the sponsor view balance was set to zero. My old server was running MySQL 4.1.</p>
<p>My new server is running 5.0, and this same code did not work. Unfortunately it did not generate a syntax or other runtime error. Instead it did the math wrong.</p>
<h3>Integer Storage in MySQL</h3>
<p>Before I talk more about the bug I think I should talk about how computers store numbers. This is not specific to MySQL; it can affect any system that stores numeric values. When I define a numeric column I have the choice of adding the unsigned attribute. In MySQL it takes the following format. The first statement will store a tiny integer without a sign, and the second will store a tiny integer with a sign.</p>
<p><code>create table dave (new_column tinyint unsigned);</code></p>
<p><code>create table dave (new_column tinyint);</code></p>
<p>What is the difference? When numbers are stored they take space. A <code>tinyint</code> column in MySQL can store values from -128 to 127. An unsigned <code>tinyint</code> can store values from 0 to 255. How does this work, and why are the numbers different in each case? </p>
<p>A <code>tinyint</code> is stored in one byte or eight bits of information. With eight bits I have a range of 0000 0000 to 1111 1111. With an unsigned value I can use all eight bits for my number, so 0000 0000 = 0 and 1111 1111 is 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128, or, if you do the math, 255. That&#8217;s how an unsigned tiny integer column can store a value ranging from 0 to 255. If, however, I want to use a signed value, the first bit becomes an indication of whether the value is negative or not. That means I only have seven bits left to determine the value. 0111 1111 becomes 1 + 2 + 4 + 8 + 16 + 32 + 64 which is 127, or the maximum <strong>signed</strong> value that can be stored in a signed tiny integer field. What happens when the eighth bit gets flipped to a 1? That&#8217;s an indication that the number is negative instead of positive. So while both signed and unsigned values take the same amount of space, the largest signed value is roughly half the largest unsigned value because the most significant bit (the leading bit) is used to indicate the sign of the value that is stored. </p>
<p>Put another way: a signed <code>tinyint</code> has seven available bits and therefore can store 2<sup>7</sup>-1 or 127 as the maximum value. An unsigned <code>tinyint</code> has eight available bits and therefore can store 2<sup>8</sup>-1 or 255 as the maximum value. Suppose I am looking at a number in memory and the bit values are 1000 0001. What is the value represented by these bits?</p>
<p>The fact is I can&#8217;t make that determination until I know if the value is signed or not. If the value is unsigned, the number represented by 1000 0001 is 129. If it&#8217;s signed, it gets complicated <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  but the value returned would be -127. Keep in mind that I am using <code>tinyint</code> example values here. </p>
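<p>MySQL can confirm both readings of that bit pattern. This is only a small sketch: <code>CONV()</code> converts the binary string to decimal, and subtracting 256 (2<sup>8</sup>) is the usual two&#8217;s complement adjustment for a one-byte value whose leading bit is set. The aliases are illustrative.</p>
<pre>select  conv('10000001', 2, 10)        as unsigned_reading  -- 129
,       conv('10000001', 2, 10) - 256  as signed_reading    -- -127</pre>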
<h3>How Unsigned Math Broke My Sponsor System</h3>
<p>In my banner system I don&#8217;t try to track the page views down to exactly zero. A sponsor will pay for two million page views at a time. If they actually use two million and twelve I am not going to complain about the few extra views. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  So my system is designed to allow each sponsor more views in the interest of good will but mostly for system performance. In my system, each sponsor&#8217;s view balance is only updated once an hour. Let me repeat the code that I showed above:</p>
<pre>$sql = 'UPDATE  ' . SPONSORS_TABLE . '
	SET     view_balance =  view_balance - ' . $decrement_views . '
	WHERE   sponsor_id = ' . $sponsor_data[$i]['sponsor_id'];</pre>
<p>First (not shown) I get a total of the banner views that have accumulated over the past hour and store them into the <code>$decrement_views</code> variable in my PHP script. Next I execute the SQL statement shown above for each sponsor with an active banner. Suppose sponsor number 12 has 1000 views left and they used 300 in the last hour. The SQL code effectively resolves to this:</p>
<p><code>UPDATE  phpbb_sponsors<br />
SET     view_balance =  1000 - 300<br />
WHERE   sponsor_id = 12;</code></p>
<p>After this statement is executed the sponsor has a balance of 700 views left. Suppose the same sponsor has 100 views in their balance and they used 300 more during the last hour. The SQL ends up looking like:</p>
<p><code>UPDATE  phpbb_sponsors<br />
SET     view_balance =  100 - 300<br />
WHERE   sponsor_id = 12;</code></p>
<p>When 300 is subtracted from 100 it results in a negative number. <strong>Under my old version of MySQL that number was set to zero since the column is defined as unsigned and is not capable of storing a negative value.</strong> This is what broke during the upgrade to MySQL 5.</p>
<h3>MySQL Bug Explained</h3>
<p>The newer version of MySQL did the math as signed, which allowed the value to go negative, and then stored the results in the unsigned field. You might start to see the problem now. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Instead of setting the sponsor view balance to zero for any negative result, the sponsor view balance was set to the maximum value that it could possibly hold because of the overflow. In this case, the view balance got set to 16,777,215 instead of zero. What is significant about that number? Here is a quote from the MySQL web page where it details the values that can be stored for any particular numeric column type&#8230;</p>
<blockquote><p>MEDIUMINT[(M)] [UNSIGNED] [ZEROFILL]<br />
A medium-sized integer. The signed range is -8388608 to 8388607. The unsigned range is 0 to 16777215. </p></blockquote>
<p>In a nutshell, the old version of MySQL caught the overflow and set the value to zero. The newer version of MySQL did not handle the overflow that way and instead stored the largest value the unsigned column can hold. I&#8217;m sure that would have made my sponsors happy, but it certainly wasn&#8217;t how things were intended to work.</p>
<h3>Fixing The Problem</h3>
<p>There is a good lesson to be learned here. I was lazy <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  in the way I wrote my earlier code. Rather than check to see if the number of accumulated views was higher than the balance remaining and handle that case in code, I relied on the fact that MySQL would not (should not) store a negative result in an unsigned field. When the database behavior changed (it has been recognized as a bug by MySQL according to my research) my system broke.</p>
<p>I have fixed the SQL by using a case statement so that this error will never occur for me again. Here is the revised SQL:</p>
<pre>$sql = 'UPDATE  ' . SPONSORS_TABLE . '
	SET     view_balance =  case
			when ' . $decrement_views . ' > view_balance then 0
			else view_balance - ' . $decrement_views . '
			end
	WHERE   sponsor_id = ' . $sponsor_data[$i]['sponsor_id'];</pre>
<p>This updated code uses a <code>case</code> expression to check whether the value to be decremented is larger than the remaining balance. If it is, the balance is simply set to zero; otherwise the subtraction proceeds as before.</p>
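<p>To see this kind of statement behave, it can be run against a scratch table. The sketch below is a stand-in for the real sponsor table, not a copy of it: the table name <code>sponsors_demo</code> and the sample rows are invented, and the literal 300 plays the role of <code>$decrement_views</code>.</p>
<pre>-- Hypothetical scratch table; only the unsigned mediumint matches the post
create temporary table sponsors_demo (
        sponsor_id    smallint unsigned not null primary key
,       view_balance  mediumint unsigned not null
);

insert into sponsors_demo (sponsor_id, view_balance)
values (12, 1000), (13, 100);

-- Same shape as the revised update, applied to every demo row
update  sponsors_demo
set     view_balance = case
                when 300 > view_balance then 0
                else view_balance - 300
                end;

-- Sponsor 12 should end at 700 and sponsor 13 at 0
select  sponsor_id, view_balance
from    sponsors_demo
order by sponsor_id;</pre>
<p>Because the subtraction only runs when the balance is at least as large as the decrement, the result never drops below zero, whatever the server does with unsigned overflow.</p>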
<p>Finally, now that I&#8217;ve explained signed versus unsigned it makes the following cartoon from xkcd.com more meaningful, doesn&#8217;t it? <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p><img src="http://imgs.xkcd.com/comics/cant_sleep.png" /></p>
<p><strong>Related Links</strong></p>
<ul>
<li><a href="http://dev.mysql.com/doc/refman/5.1/en/numeric-type-overview.html">MySQL Numeric Types Explained</a></li>
<li><a href="http://www.rwc.uc.edu/koehler/comath/13.html">Unsigned and Signed Integers</a>, an article I found on the Internet with more details on signed versus unsigned storage; it&#8217;s short and a very easy read</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2009/07/12/mysql-bug-breaks-banner-system/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Building a Better Board Banner System Part II: Using LAST_INSERT_ID()</title>
		<link>http://www.phpbbdoctor.com/blog/2008/08/29/building-a-better-board-banner-system-part-ii-using-last_insert_id/</link>
		<comments>http://www.phpbbdoctor.com/blog/2008/08/29/building-a-better-board-banner-system-part-ii-using-last_insert_id/#comments</comments>
		<pubDate>Fri, 29 Aug 2008 12:19:40 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>
		<category><![CDATA[MOD Writing]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=243</guid>
		<description><![CDATA[In the first post in this series I talked about the design process for my new banner system. I wanted it to be 100% accurate so I eliminated any sort of random number generation process. I also eliminated a SELECT &#8230; FOR UPDATE because I was concerned about deadlocks affecting the efficiency of my code. [...]]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://www.phpbbdoctor.com/blog/2008/08/20/building-a-better-board-banner-system-part-i-efficient-accuracy">first post in this series</a> I talked about the design process for my new banner system. I wanted it to be 100% accurate so I eliminated any sort of random number generation process. I also eliminated a SELECT &#8230; FOR UPDATE because I was concerned about deadlocks affecting the efficiency of my code. At the end of the post I introduced the MySQL LAST_INSERT_ID() function. Today I will cover it in much more detail.</p>
<p>MySQL offers an interesting function called LAST_INSERT_ID() that &#8211; once I figured out the proper syntax &#8211; provided exactly what I needed for my accurate and efficient banner system. What does it do? Simply put, it provides a shortcut to return the result of a previous SQL statement without having to worry about intervening updates. Before I explain the function, I want to share a bit of the database design and how the process will eventually work.</p>
<p><span id="more-243"></span><br />
<h3>Page Count Math</h3>
<p>I have a table called phpbb_views. The structure of this table is very simple: it only contains a single column named view_ctr. The field is currently defined as <code>int(11) unsigned</code> which means it can go up to a really large number. What can I say&#8230; I&#8217;m optimistic. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  I have had code in place for years that increments this value on every page view. Previously I had the update statement running inside the includes/page_tail.php code. For the banner system I moved that logic into the includes/page_header.php code instead.</p>
<p>Let me assume that I have 10 banners that are all equally weighted. I want to ensure that these banners are displayed in sequential order, from one to ten and then starting over again. There is a perfect operator for this, and you might have seen it if you&#8217;ve looked at how phpBB2 generates alternating row colors for tables. The syntax looks like this:</p>
<pre>($var1 % $var2)</pre>
<p>The result is the remainder of $var1 divided by $var2. Note that both $var1 and $var2 are converted to integer values (if needed) prior to the operation. How can this help me?</p>
<p>A sequential number (like page view counter values) will increment by one, starting at some number, and extending to infinity. What I need is a rotating sequence of numbers that starts at 1 and goes to 10 and then starts over again. That&#8217;s exactly what the % operator will do for me. All I have to do is use my page view counter as $var1 and 10 (the number of banners) as $var2 and here&#8217;s what the numbers look like:</p>
<pre>+-----+--------------+
| 100 |            0 |
| 101 |            1 |
| 102 |            2 |
| 103 |            3 |
| 104 |            4 |
| 105 |            5 |
| 106 |            6 |
| 107 |            7 |
| 108 |            8 |
| 109 |            9 |
| 110 |            0 |
+-----+--------------+</pre>
<p>Notice that the column on the left has an increasing sequence, while the column on the right starts with zero, increments up to nine, and then starts over again. This is because 100 / 10 has no remainder, while 101 / 10 has a remainder of one, and so on. This is perfect! Now all I have to do is ensure that I will never skip a number and I have a guaranteed way to pick the next (and correct) banner to display. If I have a zero-based array index these numbers are already perfect. If my array index starts at one then all I have to do is change the formula to <code>($var1 % $var2) + 1</code> and I am all set.</p>
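<p>The same remainder arithmetic is available directly in MySQL through the <code>%</code> operator. The sketch below uses a handful of hard-coded counter values (the derived table <code>sample</code> and the aliases are invented for illustration) and shows both the zero-based and the one-based variant side by side.</p>
<pre>select  n              as view_ctr
,       n % 10         as zero_based_index  -- 0, 1, 9, 0
,       (n % 10) + 1   as one_based_index   -- 1, 2, 10, 1
from    (
        select 100 as n union all
        select 101 union all
        select 109 union all
        select 110
        ) as sample</pre>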
<h3>How the MySQL LAST_INSERT_ID() Function Works</h3>
<p>I will jump right to the syntax of my query and then explain it next:</p>
<pre>UPDATE phpbb_views
SET view_ctr = LAST_INSERT_ID(view_ctr+1) ;</pre>
<p>This statement accomplishes two things at the same time. First, it updates the view_ctr column using the provided expression (view_ctr + 1). Second, the LAST_INSERT_ID() function makes a note of the resulting value and stores it as a part of my session. In a sense, it is doing the update and the select <strong>all at the same time!</strong> This eliminates any potential issue with statements appearing out of order. It also eliminates any need to perform any locking on the rows or the table.</p>
<p>To retrieve the session value I issue this command:</p>
<pre>SELECT LAST_INSERT_ID() ;</pre>
<p>That is simple, isn&#8217;t it? As per the MySQL documentation this query does not even reference a table. It pulls the last affected value from the session variable and provides it to me. If the view_ctr value was 110 and I run the first command the new value for view_ctr is 111 and that value is also cached for my session. When I retrieve the value using the second query I get 111.</p>
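<p>Put together with the remainder trick from the previous section, the two statements could be used roughly like this. It is a sketch rather than the production code: the alias <code>banner_index</code> is made up, and 10 is simply the example banner count used earlier in this post.</p>
<pre>-- Step 1: bump the counter and remember the new value for this connection
update  phpbb_views
set     view_ctr = last_insert_id(view_ctr + 1);

-- Step 2: read the remembered value and turn it into a banner slot
-- (10 is the example banner count, not a fixed setting)
select  last_insert_id()       as view_ctr
,       last_insert_id() % 10  as banner_index;</pre>
<p>Since the remembered value belongs to the connection, another page incrementing the counter between step 1 and step 2 does not change what step 2 returns.</p>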
<p>All of my problems are solved. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8-)' class='wp-smiley' /> I can use this technique without worrying about a deadlock and without worrying about missing values. It didn&#8217;t take long to set up my code with this function. But is it fast?</p>
<p>On average that query takes about 0.00015 seconds. That&#8217;s not too bad. What&#8217;s even better is that the query to set the new value and the one required to update the banner counters are both just as efficient. </p>
<p>My final banner system provides me with the following features:</p>
<ul>
<li>Banners can have start dates or expiration dates</li>
<li>Advertisers can have multiple banners on file</li>
<li>Advertisers can &#8220;weight&#8221; banners so they appear more (or less) frequently</li>
<li>Page views are evenly distributed across all active banners according to weight</li>
</ul>
<p>There are many more features in place, but the last bullet point is the result of the work done to get this far. I will talk more about the table design and how I completed my banner system in my next post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2008/08/29/building-a-better-board-banner-system-part-ii-using-last_insert_id/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Building a Better Board Banner System Part I: Efficient Accuracy</title>
		<link>http://www.phpbbdoctor.com/blog/2008/08/20/building-a-better-board-banner-system-part-i-efficient-accuracy/</link>
		<comments>http://www.phpbbdoctor.com/blog/2008/08/20/building-a-better-board-banner-system-part-i-efficient-accuracy/#comments</comments>
		<pubDate>Wed, 20 Aug 2008 11:18:59 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>
		<category><![CDATA[MOD Writing]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=242</guid>
		<description><![CDATA[I recently rewrote my banner management system for one of my boards. The board is fairly active (in fact we&#8217;re averaging over 100,000 page views daily now) so with multiple page views per second taking place during the busiest times of the day it would make sense to be concerned about performance. And I was. [...]]]></description>
			<content:encoded><![CDATA[<p>I recently rewrote my banner management system for one of my boards. The board is fairly active (in fact we&#8217;re averaging over 100,000 page views daily now) so with multiple page views per second taking place during the busiest times of the day it would make sense to be concerned about performance. And I was. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  But I also had to be concerned about auditing my banner and page statistics, and ensuring that if I said a banner was going to be displayed every 10 page views that it was. So the system had to be 100% efficient and just as accurate. That presented some challenges.</p>
<p><span id="more-242"></span><br />
<h3>Problem Defined</h3>
<p>I needed a way to set up a sequence of banners and then ensure that nobody ever got skipped. If I said that each banner was going to be displayed an equal number of times per day, then I had better be sure that I delivered that. Random number generators are historically not very random and therefore I could not use that approach. Having a &#8220;next banner to be displayed&#8221; record in a database would work, but the lock for update / update / select / release lock process might result in a deadlock situation, and certainly would not perform very well. I decided I needed to do some research and figure out the best way to manage my banner rotation. </p>
<p>What I found was really interesting, and has been working amazingly well for several months now. Unfortunately the technique is unique to MySQL so I can&#8217;t plan on releasing this as a MOD anytime soon.</p>
<h3>Random Numbers versus Incrementing Page View Counter</h3>
<p>I had decided that I did not want to use any sort of random number generator for showing banners. My preliminary testing showed that it was quite frankly a horrible solution. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  There was always one banner that was displayed a lot more than any of the others. Perhaps over time it would have become more even, but after 100 trials and 10 banners it seemed that there was always one banner in the 20+ range when they should have all been (if equally weighted) around 10. That was unacceptable and I quickly eliminated this from consideration.</p>
<p>Next I decided to leverage a sequential number that I already had. In page_tail.php I had a very simple query that was responsible for updating a page view counter. By using the mod() function on the newly assigned page counter number I could rely on a continuing series of numbers that would never skip values.</p>
<p>Or could I?</p>
<p>On a very busy board there might be multiple updates occurring at nearly the same time. Suppose I were to use this technique, as shown by these two simple SQL statements:</p>
<pre>UPDATE phpbb_page_table SET page_counter = page_counter + 1;
SELECT page_counter FROM phpbb_page_table;</pre>
<p>It is entirely possible (and in fact quite likely) that someone else could have updated the page counter value before I selected my result row. That means that even if the code looks like what I typed above, the reality is that it could get executed like this:</p>
<pre>UPDATE phpbb_page_table SET page_counter = page_counter + 1;
UPDATE phpbb_page_table SET page_counter = page_counter + 1;
UPDATE phpbb_page_table SET page_counter = page_counter + 1;
SELECT page_counter FROM phpbb_page_table;
UPDATE phpbb_page_table SET page_counter = page_counter + 1;
SELECT page_counter FROM phpbb_page_table;
SELECT page_counter FROM phpbb_page_table;
SELECT page_counter FROM phpbb_page_table;</pre>
<p>Note that there are still four &#8220;update&#8221; statements and four &#8220;select&#8221; statements, but they&#8217;re not guaranteed to execute in the proper order since each request is coming from a different page. Of course each page only sees one SELECT and one UPDATE statement but there is no way to ensure that someone else&#8217;s page isn&#8217;t running the same queries at essentially the exact same time that I am. The queries are sent to the database from all pages at the same time. Any page logic that requires an absolutely isolated query execution sequence is going to potentially fail in a shared user environment. It might not fail every time, but it could <strong>potentially</strong> fail. That&#8217;s a problem.</p>
<p>What is the result of having the queries out of order in the above example? Banners 1 and 2 get skipped because their updates run with no select in between. Banner number 3 is displayed once, and banner number 4 is displayed three times because there are three selects following a single update. The net result is that I would potentially see the same issue here that I saw with my random number system&#8230; banners would not be guaranteed to come out with an even distribution of page views.</p>
<h3>Select &#8230; For Update</h3>
<p>To avoid having multiple updates run before I can get my selected value returned I could consider using a lock statement of some kind. That way when the select statement is issued it is guaranteed that nobody else has updated that row. After the select is performed then the lock would be released. Here&#8217;s how this would have worked out if I had wanted to try this solution. </p>
<p>MySQL provides the &#8220;Select FOR UPDATE&#8221; syntax. This will use page or row locking (whatever is available based on the engine in use) instead of locking the entire table. In the scenario shown in the prior section I ran the update and then selected the resulting value. This time the logic will be reversed, as shown here:</p>
<pre>START TRANSACTION;
SELECT page_counter FROM phpbb_page_table FOR UPDATE;
UPDATE phpbb_page_table SET page_counter = page_counter + 1;
COMMIT;</pre>
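<p>The &#8220;FOR UPDATE&#8221; clause is used to prevent anyone else from updating the row until I get my value. Once I get the value, I increment the counter for the next person. The lock has to be an exclusive lock; otherwise two pages would end up competing for the update. Note that the lock is only held for the duration of a transaction, so the two statements have to run inside one (with autocommit disabled) on an engine that supports row locking, such as InnoDB.</p>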
<p>I am really hesitant to issue application locks for these and other reasons. Obviously it&#8217;s fine if the database issues locks as a result of commands that I execute, but having specific code in my PHP files to issue locks? Not a good idea. To be honest, I didn&#8217;t even try to prototype or test this solution because it would have been horrible. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  What if a page dies in the middle of the lock and never releases it?</p>
<p>To summarize so far:</p>
<ul>
<li>Random numbers are efficient but not guaranteed accurate</li>
<li>Sequential counters are efficient but not guaranteed accurate</li>
<li>Sequential counters with locks are accurate but not efficient and could lead to a deadlock situation</li>
</ul>
<p>It turns out that while I was researching lock statements on mysql.com I found this buried at the very bottom of the page:</p>
<blockquote><p>The preceding description is merely an example of how SELECT &#8230; FOR UPDATE works. In MySQL, the specific task of generating a unique identifier actually can be accomplished using only a single access to the table: </p>
<pre>UPDATE child_codes SET counter_field = LAST_INSERT_ID(counter_field + 1);
SELECT LAST_INSERT_ID();</pre>
<p>The SELECT statement merely retrieves the identifier information (specific to the current connection). It does not access any table.
</p></blockquote>
<p>Hm. That sounds very interesting. In fact, it&#8217;s exactly the solution that I needed for my banner system to be both 100% accurate and efficient at the same time. I&#8217;ll talk more about how the LAST_INSERT_ID() function works in my next post.</p>
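<p>In the meantime, here is a rough sketch of how the quoted technique might map onto the counter table from the earlier examples. The table and column names come from the queries above; rotating through four banners with <code>MOD()</code> is an assumption made purely for illustration, not the final design.</p>
<pre>-- Increment the counter and remember the new value for this connection only
UPDATE phpbb_page_table SET page_counter = LAST_INSERT_ID(page_counter + 1);

-- Read the remembered value back without touching the table,
-- then map it onto banner numbers 1 through 4 (assumed banner count)
SELECT LAST_INSERT_ID() AS page_counter,
       MOD(LAST_INSERT_ID(), 4) + 1 AS banner_number;</pre>
<p>Because the value returned by LAST_INSERT_ID() is kept per connection, another page running the same pair of statements at the same moment should not be able to change what this page reads back.</p>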
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2008/08/20/building-a-better-board-banner-system-part-i-efficient-accuracy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Decoding IP Addresses from phpBB2</title>
		<link>http://www.phpbbdoctor.com/blog/2008/08/13/decoding-ip-addresses-from-phpbb2/</link>
		<comments>http://www.phpbbdoctor.com/blog/2008/08/13/decoding-ip-addresses-from-phpbb2/#comments</comments>
		<pubDate>Wed, 13 Aug 2008 15:46:00 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=240</guid>
		<description><![CDATA[The IP address information for a poster is stored in the phpbb_posts table in phpBB2. In my Checkbox Challenge MOD it&#8217;s also stored during registration attempts. The IP address is converted to hex and then stored as an 8-character string. This is then decoded on the fly when requested. Sometimes you [...]]]></description>
			<content:encoded><![CDATA[<p>The IP address information for a poster is stored in the phpbb_posts table in phpBB2. In my Checkbox Challenge MOD it&#8217;s also stored during registration attempts. The IP address is converted to hex and then stored as an 8-character string. This is then decoded on the fly when requested. Sometimes you might want to decode the IP from the character string by using MySQL directly. It turns out there is a very simple formula to do that. </p>
<p><span id="more-240"></span>To encode an IP use this expression:</p>
<pre>mysql> select conv(inet_aton('127.0.0.1'),10,16);
+------------------------------------+
| conv(inet_aton('127.0.0.1'),10,16) |
+------------------------------------+
| 7F000001                           |
+------------------------------------+
1 row in set (0.00 sec)</pre>
<p>To decode an IP address use this expression:</p>
<pre>mysql> select inet_ntoa(conv('7F000001',16,10));
+-----------------------------------+
| inet_ntoa(conv('7F000001',16,10)) |
+-----------------------------------+
| 127.0.0.1                         |
+-----------------------------------+
1 row in set (0.01 sec)</pre>
<p>The functions used here are conv() and then the pair of inverse functions called inet_ntoa() and inet_aton(). The conv() function converts from base 10 to base 16 (or base 16 to base 10) based on the order of the arguments. The inet_ntoa() and inet_aton() functions are used to convert from an &#8220;alpha&#8221; representation of the IP address to a &#8220;numeric&#8221; representation. So the inet_ntoa() is &#8220;<strong>n</strong>umber <strong>to</strong> <strong>a</strong>lpha&#8221; and the inet_aton() is an &#8220;<strong>a</strong>lpha <strong>to</strong> <strong>n</strong>umber&#8221; translation.</p>
<p>So to summarize: the IP address comes in as 127.0.0.1 and the inet_aton() function converts that dotted notation into a number. Then the conv() function converts that number from base 10 to base 16 (hexadecimal) format, at which point it&#8217;s stored in the table. To reverse that process I have to convert from base 16 back to base 10 and then apply the reverse function inet_ntoa() to return the IP address in proper dotted notation.</p>
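<p>One caveat worth checking before using the encode expression for comparisons: conv() does not pad its result, so an address whose first octet is below 16 comes back shorter than 8 characters. A minimal sketch, assuming the stored strings are always 8 characters wide as described above, is to pad the result with lpad():</p>
<pre>-- conv() alone returns 'A000001' (7 characters) for 10.0.0.1;
-- lpad() restores the leading zero, giving '0A000001'
SELECT LPAD(CONV(INET_ATON('10.0.0.1'), 10, 16), 8, '0') AS encoded_ip;</pre>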
<p>I&#8217;m not sure but I think I have been told that phpBB3 does not store IP addresses using this same encoding. I&#8217;m not sure why not, as by storing the IP address encoded as hex you can use wildcards and substring operations to do some interesting queries. If the IP address is stored in dotted notation then those advantages go away. I am not sure I can come up with a good advantage to storing the IP address in dotted notation at this time. Anyone else have an idea?</p>
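<p>As a sketch of the kind of wildcard query I have in mind (assuming the standard phpBB2 <code>poster_ip</code> and <code>post_id</code> columns in phpbb_posts, and an arbitrary example range), this lists posts made from any address beginning with 192.168:</p>
<pre>-- 192 is C0 and 168 is A8 in hex, so 192.168.*.* becomes the prefix 'c0a8'
SELECT post_id,
       INET_NTOA(CONV(poster_ip, 16, 10)) AS decoded_ip
FROM phpbb_posts
WHERE poster_ip LIKE 'c0a8%';</pre>
<p>Because the pattern is anchored at the start of the string, an index on the column (if one exists) can still be used, which is harder to arrange for a range of dotted addresses stored as plain text.</p>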
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2008/08/13/decoding-ip-addresses-from-phpbb2/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Working With Recursive Data Part II: Tree Traversal</title>
		<link>http://www.phpbbdoctor.com/blog/2008/08/02/working-with-recursive-data-part-ii-tree-traversal/</link>
		<comments>http://www.phpbbdoctor.com/blog/2008/08/02/working-with-recursive-data-part-ii-tree-traversal/#comments</comments>
		<pubDate>Sat, 02 Aug 2008 05:48:24 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>
		<category><![CDATA[MOD Writing]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=226</guid>
		<description><![CDATA[In the first post in this series on working with recursive data I talked about several different ways to store the information in a database. Some of them were promising, but they all had complications of some kind or another. I was using the phpBB Doctor Project Manager database design as an example, but there [...]]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://www.phpbbdoctor.com/blog/2008/07/29/working-with-recursive-data-part-i-table-designs/">first post in this series</a> on working with recursive data I talked about several different ways to store the information in a database. Some of them were promising, but they all had complications of some kind or another. I was using the phpBB Doctor Project Manager database design as an example, but there are quite a few different scenarios where recursive data will be found. Since SQL is not a recursive language, I am trying to find the best way to model the data so that I can access it with minimal fuss. </p>
<p>As an example, in my project management system I need to be able to quickly and easily identify the parent task, if the task has any sub-tasks (child records), and which tasks are at the same level (siblings). I would like to be able to traverse the tree in either direction (up to the parent or down to the child) without using recursion. In order to do that, I need a model that is different from anything presented in the prior post.</p>
<p><span id="more-226"></span>I am going to return to the initial table design and add a few columns that will look very familiar to phpBB3 MOD authors. The new columns are called &#8220;left&#8221; and &#8220;right&#8221; which means this is the new structure for my task table:</p>
<pre>+-----------------+-----------------------+------+-----+---------+----------------+
| Field           | Type                  | Null | Key | Default | Extra          |
+-----------------+-----------------------+------+-----+---------+----------------+
| task_id         | int(11) unsigned      |      | PRI | NULL    | auto_increment |
| task_number     | mediumint(8) unsigned |      |     | 0       |                |
| parent_task_id  | int(11) unsigned      |      | MUL | 0       |                |
| project_id      | mediumint(8) unsigned |      | MUL | 0       |                |
| task_lvl        | tinyint(2) unsigned   | YES  |     | 0       |                |
| task_left       | mediumint(8) unsigned |      |     | 0       |                |
| task_right      | mediumint(8) unsigned |      |     | 0       |                |
| task_status     | tinyint(2) unsigned   | YES  |     | 0       |                |
| task_name       | varchar(64)           |      |     |         |                |
| estimated_hours | decimal(8,2)          | YES  |     | 0.00    |                |
+-----------------+-----------------------+------+-----+---------+----------------+</pre>
<p>There are three new columns in my task table now, and I will detail each of them next. The <code>task_left</code> and <code>task_right</code> columns are going to give me the structure that I need to do what is called a preorder tree traversal. Don&#8217;t worry, I will explain what that means shortly. With pictures. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  The net of it is I can recreate the task tree with a single query and therefore recursion is not required.</p>
<p>The task level (<code>task_lvl</code> in my example) can be calculated on the fly but I prefer to generate this number and store it as it saves time. The level starts at zero for a top-level task, and is incremented by one for each &#8220;step&#8221; away from the parent. So if the WBS for a task (see prior post for a definition) is 1.2.1 then that task is at level two. Count the dots and that&#8217;s the assigned level for the task. That one is easy.</p>
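<p>For comparison, here is a sketch of calculating the level on the fly from the left and right values (explained in the next section) instead of storing it. The table name <code>pm_tasks</code> is a placeholder, and joining on <code>project_id</code> assumes each project is numbered on its own; the columns are the ones listed above.</p>
<pre>-- A task's level is the number of tasks whose left/right range encloses it
SELECT child.task_id,
       COUNT(parent.task_id) AS calculated_lvl
FROM pm_tasks child
LEFT JOIN pm_tasks parent
       ON parent.project_id = child.project_id
      AND parent.task_left  &lt; child.task_left
      AND parent.task_right &gt; child.task_right
GROUP BY child.task_id;</pre>
<p>A top-level task has nothing enclosing it, so the count comes back as zero. The self-join is the cost that storing <code>task_lvl</code> avoids.</p>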
<h3>Preorder Tree Traversal</h3>
<p>I haven&#8217;t yet explained how the left and right values are assigned or used. It&#8217;s fairly simple, and the best way to describe it is to convert my task data into a tree, as shown here:</p>
<p><img src="/blog/images/tree_01.jpg" width="458" height="370" alt="Tree Topology" title="Tree Topology" /></p>
<p>For simplicity I have only shown one branch of the task tree. This is quite a pretty picture <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  but I can&#8217;t store the picture in the database and use it for any sort of structure. What I need is a way to convert the picture into information, and that&#8217;s easily done with what is sometimes called a &#8220;depth first&#8221; or &#8220;preorder tree traversal&#8221; process. Without going into all of the math (there is a wiki link at the end of the post for those that are interested) here is how I traverse the tree:</p>
<ol>
<li>Pick up a pencil (or a crayon, in this case <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />  )</li>
<li>Put my crayon on the left side of the top node</li>
<li>Draw around the tree starting down the left side, going around each node, and back up and around until I end up on the right side of the parent node. Don&#8217;t lift the crayon, and don&#8217;t cross any lines while drawing.</li>
</ol>
<p>Here&#8217;s what I end up with after that process:</p>
<p><img src="/blog/images/tree_02.jpg" width="458" height="370" alt="Tree Topology" title="Tree Topology" /></p>
<p>Once I have walked the tree (following the crayon marks in the image above), I will go back over the line and each time I pass a node (task) on either the left or the right I add a number. The left side of the parent node starts at one, and I increment by one each time I pass a node. Here is the tree that has been traversed and numbered. In this case I used blue for the left side of the node, and orange for the right side of the node, for reasons which will become clear later in the post.</p>
<p><img src="/blog/images/tree_03.jpg" width="458" height="370" alt="Tree Topology" title="Tree Topology" /></p>
<p>Now that I have made a mess of things, what do I do next? <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_lol.gif' alt=':lol:' class='wp-smiley' />  It turns out that the left and right values are very interesting numbers because by looking at these values for a node I can determine all sorts of things with some simple observations.</p>
<p>First, any node where left+1 = right has to be a &#8220;leaf&#8221; or terminal node. This means with this model I can look at any node in the tree and immediately determine if it&#8217;s a leaf. That is not possible with a recursive model because I don&#8217;t know if some other node points &#8220;up&#8221; to that node as a parent without running a query first. For example, the task node 1.1 has a left value of 2 and a right value of 3, so it is a leaf. I can easily verify the rest of the leaf nodes fit this same pattern.</p>
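<p>As a quick sketch of that first observation in query form (the table name <code>pm_tasks</code> is a placeholder; the columns are from the structure shown earlier):</p>
<pre>-- Leaf tasks: no other node fits between left and right
SELECT task_id, task_name
FROM pm_tasks
WHERE task_right = task_left + 1;</pre>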
<p>Second, I can determine how many descendants a node has (children, grandchildren, and so on) with a very simple formula. (This formula works for leaf nodes as well, where it returns zero; it&#8217;s just not needed there.) The formula is <code>(Right - Left - 1) / 2</code>. If I look at node 1.2.1 I see that the left value is 5, the right value is 10, and there are two child nodes. If I apply the formula I can see that <code>(10 - 5 - 1) / 2</code> gives me a result of 2, which is exactly the number of nodes below that node. If I apply the formula to the very top node I get <code>(16 - 1 - 1) / 2</code> or 7, which is every node in the tree apart from the top one.</p>
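<p>Here is a sketch of the same arithmetic as queries, again using the placeholder table name <code>pm_tasks</code>. The first applies the formula to every row; the second pulls back a whole branch in display order using the left and right values of node 1.2.1 from the picture (5 and 10). Note that the count covers every node below the task at any depth, not only the immediate children, and that a real query would probably also filter on <code>project_id</code>.</p>
<pre>-- Number of nodes underneath each task, straight from the formula
SELECT task_id, task_name,
       (task_right - task_left - 1) DIV 2 AS descendant_count
FROM pm_tasks;

-- Node 1.2.1 and everything below it, in tree order, with no recursion
SELECT task_id, task_name, task_lvl
FROM pm_tasks
WHERE task_left BETWEEN 5 AND 10
ORDER BY task_left;</pre>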
<h3>Conclusion</h3>
<p>At this point I have shown how to create the tree, how the left and right values are assigned, and a few tricks that I can do with simple math. For my next post in this series I plan to start from my tree structure and show how easy it is to manage the data during operations such as adding a node, deleting a node, or moving a node. I also plan to show how easy it is to use this model to efficiently process and display any sort of hierarchical data without having to resort to recursive programming.</p>
<p><strong>Related Links</strong></p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Tree_traversal">Wiki on Tree Traversal Algorithms</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2008/08/02/working-with-recursive-data-part-ii-tree-traversal/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Working With Recursive Data Part I: Table Designs</title>
		<link>http://www.phpbbdoctor.com/blog/2008/07/29/working-with-recursive-data-part-i-table-designs/</link>
		<comments>http://www.phpbbdoctor.com/blog/2008/07/29/working-with-recursive-data-part-i-table-designs/#comments</comments>
		<pubDate>Wed, 30 Jul 2008 04:55:12 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>
		<category><![CDATA[MOD Writing]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=225</guid>
		<description><![CDATA[There are all sorts of scenarios that require recursive data. If you don&#8217;t know what &#8220;recursive&#8221; means, it&#8217;s a relationship from an entity back to the same entity. In English, it&#8217;s data that points back to itself.   Some typical examples of recursive data are company org charts, inventory build instructions, or even forums [...]]]></description>
			<content:encoded><![CDATA[<p>There are all sorts of scenarios that require recursive data. If you don&#8217;t know what &#8220;recursive&#8221; means, it&#8217;s a relationship from an entity back to the same entity. In English, it&#8217;s data that points back to itself. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Some typical examples of recursive data are company org charts, inventory build instructions, or even forums for phpBB3. Yes, I&#8217;m talking about phpBB3, are you surprised? <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_eek.gif' alt=':shock:' class='wp-smiley' />  <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_lol.gif' alt=':lol:' class='wp-smiley' />  I hope not, because I&#8217;m going to reference phpBB3 only in passing. The article is actually more about storing recursive data in any form. It&#8217;s also how I store information in my phpBBDoctor Project Management system, among other things.</p>
<p>SQL is not a recursive language. When I write a query it&#8217;s all about relationships between rows, not about looping back through the same table. Oracle has a special construct used to traverse recursive data and it works very well, but it&#8217;s the only database that I am currently aware of to support this. Since most phpBB MOD authors will not be writing for Oracle, I will skip that concept for now.</p>
<p>As mentioned in the first paragraph I have recursive data in my project tracking system that I use here on the phpBB Doctor site. The design for this system is simple, but complex. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  The first table is the project table and it includes summary attributes for the project. These attributes include values such as when the project started, who is the project manager, a description of the project, and the status. The next table is the tasks table. A task is assigned to a project, but a task can be broken up into sub-tasks as well. There is no expected limit to the depth of the tasks. Here is a screen shot of my test project so you get an idea of what I&#8217;m talking about.</p>
<p><span id="more-225"></span><img src="/blog/images/projmgr.jpg" width="549" height="365" border="0" alt="Project Manager Screen Shot" title="Project Manager Screen Shot" /></p>
<p>&#8220;WBS&#8221; stands for Work Breakdown Structure, and it&#8217;s a dot notation used to denote the level of detail and lineage for a project task. In this case, the first task is &#8220;1&#8221; and the first sub-task assigned to that task is &#8220;1.1&#8221; and the next sub-task is &#8220;1.2&#8221;. If there is a sub-sub-task then another dot is added, as in &#8220;1.1.1&#8221; or &#8220;1.2.1&#8221; and so on. In this way, each task has a lineage from the parent project all the way down to the most detailed level tasks. The challenge that I face is to model this data properly so that I can retrieve the lineage from top to bottom, bottom to top, or even side to side.</p>
<h3>Task Table Design 1: Storing Parent Relationships</h3>
<p>Here&#8217;s a typical table structure for a tasks table in this situation.</p>
<pre>+-----------------+-----------------------+------+-----+---------+----------------+
| Field           | Type                  | Null | Key | Default | Extra          |
+-----------------+-----------------------+------+-----+---------+----------------+
| task_id         | int(11) unsigned      |      | PRI | NULL    | auto_increment |
| task_number     | mediumint(8) unsigned |      |     | 0       |                |
| parent_task_id  | int(11) unsigned      |      | MUL | 0       |                |
| project_id      | mediumint(8) unsigned |      | MUL | 0       |                |
| task_status     | tinyint(2) unsigned   | YES  |     | 0       |                |
| task_name       | varchar(64)           |      |     |         |                |
| estimated_hours | decimal(8,2)          | YES  |     | 0.00    |                |
+-----------------+-----------------------+------+-----+---------+----------------+</pre>
<p>Notice that there is a task_id column which will be assigned by the database as the new record is inserted (that&#8217;s what the auto_increment attribute does). The task is also assigned a parent_task_id value, so that the task is related to (joined to) the parent. This allows me to walk through the task list to show task parents with their children. But in order to do that, I need a recursive language (PHP works) and I need to run a query for each level of the task depth. That&#8217;s not very efficient. But if I don&#8217;t do recursion, then it&#8217;s hard to put the information together correctly. Here is some of the raw data from my sample project shown in the screen shot earlier:</p>
<pre>+---------+----------------+-----------------------------------------------+
| task_id | parent_task_id | task_name                                     |
+---------+----------------+-----------------------------------------------+
|      43 |             40 | Child Task 2 of Parent Task 1                 |
|      42 |             40 | Child Task 1 of Parent 1                      |
|      41 |              0 | Parent Task 2                                 |
|      40 |              0 | Parent Task 1                                 |
|      44 |             41 | Child Task 1 of Parent Task 2                 |
|      45 |             41 | Child Task 2 of Parent Task 2                 |
|      46 |             43 | Child task 1 of child task 1 of parent task 1 |
|      47 |             43 | Child task 2 of child task 1 of parent task 1 |
|      48 |             43 | Child task 3 of child task 2 of parent task 1 |
|      49 |             48 | Child task at level 4                         |
|      50 |             48 | Child Task 2 at level 4 After Edit            |
|      51 |             46 | Another child                                 |
|      52 |             46 | Another child                                 |
+---------+----------------+-----------------------------------------------+</pre>
<p>If I look carefully, I can see where a parent_task_id references a task_id from another row in the output. There are two tasks where the parent_task_id is zero, which means they are the top task or starting point for the hierarchy. For example, the task that is assigned task_id 46 has a parent task of 43, and that task (43) has a parent task of 40, and that task (40) has a parent task of 0. By examining the recursive relationship of one task to another I can go up (or down) the task list.</p>
<p>But each time I get a row, I have to run another query based on the result of the last query in order to determine the parent. If I were trying to determine the child, I would go in the other direction, but the results would be the same. Surely there is a better way to address this.</p>
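<p>To make that concrete, here is a sketch of the queries needed to walk from task 46 up to the top using the sample data above. The table name <code>pm_tasks</code> is a placeholder; the values in the comments are taken from the rows shown.</p>
<pre>-- One round trip per level, each driven by the previous result
SELECT parent_task_id FROM pm_tasks WHERE task_id = 46;  -- returns 43
SELECT parent_task_id FROM pm_tasks WHERE task_id = 43;  -- returns 40
SELECT parent_task_id FROM pm_tasks WHERE task_id = 40;  -- returns 0, so 40 is the top task</pre>
<p>Three queries for three levels is tolerable, but the count grows with the depth of the tree, and the application has to loop until it sees a zero.</p>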
<h3>Task Table Design 2: Separate Tables For Task Levels</h3>
<p>I could store each different task level in a different table. If I have five levels of tasks, I could have task_01, task_02, and so on up to task_05. By doing this I am no longer doing a recursive join as each table is separate. Recursion is required because all of the data is stored in one place. Storing each level in a different table fixes that.</p>
<p>But don&#8217;t do this, just don&#8217;t. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  It&#8217;s a horrible database design. Why?</p>
<p>Consider what happens if some tasks have fewer than five levels of detail. In order to avoid dropping rows, each query has to be an outer join. Outer joins are typically less efficient than inner joins so I try to avoid them. And this is not the worst issue. What if some projects end up with tasks at six levels deep? Or seven? Each time I add a new level to the task structure, I have to add a new table to my database. Each time I add a new table to my database, I have to adjust my application code. The application should drive the data, not the other way around. The reality is, unless you know that you will forever have a specific depth to your recursive data, you should never consider this option.</p>
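<p>As a sketch of what that looks like with just three levels (assuming each of the hypothetical task_01, task_02, and task_03 tables carries the same task_id, parent_task_id, and task_name columns as the original design):</p>
<pre>-- Outer joins keep tasks that stop at level one or two
SELECT t1.task_name AS level_1,
       t2.task_name AS level_2,
       t3.task_name AS level_3
FROM task_01 t1
LEFT JOIN task_02 t2 ON t2.parent_task_id = t1.task_id
LEFT JOIN task_03 t3 ON t3.parent_task_id = t2.task_id;
-- Every additional level means a new table and another join here</pre>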
<h3>Task Table Design 3: Ancestor &#8211; Parent &#8211; Child Model</h3>
<p>One of the challenges of working with recursive data is finding out where a row belongs. If you have to follow the relationship up through 20 levels of data it&#8217;s going to take a while to figure out who the ultimate parent is. To solve this, I can store the root node or &#8220;ancestor&#8221; on each row. That would make the data look like this:</p>
<pre>+-------------+---------+----------------+-----------------------------------------------+
| ancestor_id | task_id | parent_task_id | task_name                                     |
+-------------+---------+----------------+-----------------------------------------------+
|          40 |      43 |             40 | Child Task 2 of Parent Task 1                 |
|          40 |      42 |             40 | Child Task 1 of Parent 1                      |
|          41 |      41 |              0 | Parent Task 2                                 |
|          40 |      40 |              0 | Parent Task 1                                 |
|          41 |      44 |             41 | Child Task 1 of Parent Task 2                 |
|          41 |      45 |             41 | Child Task 2 of Parent Task 2                 |
|          40 |      46 |             43 | Child task 1 of child task 1 of parent task 1 |
|          40 |      47 |             43 | Child task 2 of child task 1 of parent task 1 |
|          40 |      48 |             43 | Child task 3 of child task 2 of parent task 1 |
...</pre>
<p>In this way I no longer have to follow the relationships to find out where the starting point (the ancestor) is because the application is responsible for maintaining this data on the row itself. With this I can easily identify all of the child tasks that belong to parent task 40 (including the parent task of 40 itself) by running a query to return all rows with that ancestor_id.</p>
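<p>A sketch of that query, using the placeholder table name <code>pm_tasks</code> and the ancestor_id column shown above:</p>
<pre>-- Every task in the branch that starts at task 40, including task 40 itself
SELECT task_id, parent_task_id, task_name
FROM pm_tasks
WHERE ancestor_id = 40;</pre>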
<p>This solves part of the problem (going up the path) but not all of it. For example, even if I know the ancestor, and I know all of the child tasks that belong to that ancestor, I don&#8217;t know the topology of those tasks. The topology is the layout of the tasks, showing the entire structure from top to bottom. If I refer back to the screenshot towards the top of the post I can see how the tasks are indented based on their level. The indentation also shows which tasks are siblings (occur at the same level) as well as the parent and child status. So I&#8217;m still not quite ready to accept this design for my application.</p>
<p>For what it is worth, I did use this design for a real-world project for a major manufacturing company, and they are very happy with the results. I created a report from their SAP system that allowed them to determine in 30 minutes or less the answer to questions that took SAP over eight hours to generate. Cool stuff. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8-)' class='wp-smiley' /> </p>
<h3>Task Table Design: ???</h3>
<p>There is a better way to store recursive data, and I&#8217;m going to detail that in the next post in this series. Here is a hint: If you are working with phpBB3 then you are already using this solution. Details next time. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8-)' class='wp-smiley' /> </p>
<p><strong>Related Links</strong></p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Work_breakdown_structure">Wiki on WBS</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2008/07/29/working-with-recursive-data-part-i-table-designs/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>MOD Review Service</title>
		<link>http://www.phpbbdoctor.com/blog/2007/09/07/mod-review-service/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/09/07/mod-review-service/#comments</comments>
		<pubDate>Sat, 08 Sep 2007 03:36:36 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>
		<category><![CDATA[MOD Writing]]></category>
		<category><![CDATA[phpBB]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/2007/09/06/mod-review-service/</guid>
		<description><![CDATA[A Check-Up From the Doctor
I will not install code for clients that was written by someone else without reviewing it first. There is a lot of bad code out there. There is also some very good code, don&#8217;t get me wrong about that. But many excellent MOD authors just don&#8217;t know databases as well as [...]]]></description>
			<content:encoded><![CDATA[<h3>A Check-Up From the Doctor</h3>
<p>I will not install code for clients that was written by someone else without reviewing it first. There is a lot of bad code out there. There is also some very good code, don&#8217;t get me wrong about that. But many excellent MOD authors just don&#8217;t know databases as well as they would like. So the queries that they write may be functional, but not optimal. At this point in my involvement with phpBB I write most of my own MODs, so most of the time this review is for a private client who wants me to install something that I don&#8217;t already have or have no interest in writing.</p>
<p>At one point after I essentially rewrote a MOD for one of my clients they suggested that I offer this service via the Doctor Board. Hmm. Well. It&#8217;s an interesting idea. It wasn&#8217;t in my plans for this board when I started (and still isn&#8217;t) and I would much rather get my desired services completed and released first, but yeah, I could do that.</p>
<p>Probably in about 2012.</p>
<p>Maybe.</p>
<p><span id="more-141"></span></p>
<p>The reality is I won&#8217;t offer this service, for a variety of reasons. It&#8217;s a lot of work, and the level of effort goes up substantially on complex MODs. So instead, here are a few tips that I will offer. If you are a MOD author these are the things that send up red flags to me when I read code.</p>
<h3>Queries inside a loop</h3>
<p>There are good reasons to put a query inside a loop. There are many more bad reasons to put a query inside a loop, especially authorization checks. Each auth check in phpBB makes two queries. I recently discovered that one of the first custom pages I created for myself was doing this. When I fixed it the number of queries per page dropped from well over 100 to 10. That&#8217;s 90+ queries that got eliminated. My server is running 30+ queries a second, so dropping 90 queries from a single page view is a substantial improvement. My current project (one of about a dozen #1 priority projects) is tuning my biggest board by going over every page specifically looking for queries that can be moved out of a loop, combined with another query, or eliminated altogether.</p>
<p>Some examples of MODs that I have seen that use extra queries inside a loop include some of the fancy &#8220;color group&#8221; MODs, and MODs that display a tool-tip (title attribute) that contains the text of the post. In both of these cases I rewrote the MOD that I was asked to use, and it became much more efficient as a result.</p>
<p>Queries inside a loop can be ugly.</p>
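<p>As a generic sketch of the pattern (not taken from any particular MOD; the user ids are made up, and phpbb_users with its user_id and username columns is the standard phpBB2 table):</p>
<pre>-- Inside a loop: one query for every row being displayed
SELECT username FROM phpbb_users WHERE user_id = 2;
SELECT username FROM phpbb_users WHERE user_id = 5;
SELECT username FROM phpbb_users WHERE user_id = 9;

-- Outside the loop: collect the ids first, then ask once
SELECT user_id, username FROM phpbb_users WHERE user_id IN (2, 5, 9);</pre>
<p>The second form returns the same information in a single round trip, and the page code can look each user up in the result set while it loops.</p>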
<h3>More than one query hitting the same tables</h3>
<p>When I installed the Attachment MOD I noticed that the attachment icon was not visible on the search results. In order to know whether to show the icon or not, I had to do two things. First I had to know if the topic had an attachment or not, so I added the topic_attachment column to one of the standard queries. Then I had to know if the user viewing the search results was authorized to view downloads or not, so I created a new query to get the forum information. This was only one query, and it was not inside a loop.</p>
<p>During my review (mentioned above) I noticed that I was already hitting the phpbb_forums table in an earlier query in order to get the forum name for the search results output. So I dropped this extra query and simply added the auth_download column to the existing query. Now my search page uses one less query.</p>
<p>I think MOD authors end up with this situation because they don&#8217;t want to alter the existing query so they write their own instead. And in some cases having two smaller queries can be much more efficient than one really big one. But in general if you are hitting the table already, adding one more field to the select list does not have an impact on performance. Adding a second query will.</p>
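<p>To illustrate, here is a heavily simplified stand-in for the search results query (the real one selects many more columns and joins more tables, and the topic ids are invented). The only change needed is one more entry in the select list:</p>
<pre>SELECT  t.topic_id
,       t.topic_title
,       t.topic_attachment
,       f.forum_id
,       f.forum_name
,       f.auth_download      -- added here instead of running a second query
FROM    phpbb_topics t
,       phpbb_forums f
WHERE   f.forum_id = t.forum_id
AND     t.topic_id IN (210, 214, 233);</pre>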
<h3>Don&#8217;t recalculate something already stored</h3>
<p>This example comes from something that I did for a client. This was the &#8220;Top Posters on Index&#8221; MOD, and what it was supposed to do was display the top &#8220;N&#8221; members of the board ranked by post count. You may know that the user_posts field does not get updated every time a post gets deleted. In older versions of phpBB2 you could delete a post and not affect your post count. That got fixed. Pruning still does not impact the post count. What the original MOD author elected to do was recalculate the user post count every time someone viewed the index.</p>
<p>This is all well and good until your index starts getting hit multiple times a second. This particular client was complaining that her index took several seconds to display, and that her host was warning her about server loads. (By the way, her board is listed on big-boards.com. That means she has at least 500,000 posts and 50,000 users. This is not a small board I&#8217;m talking about.) The culprit was the Top Posters on Index MOD.</p>
<p>What I did was rewrite the code so that it used the user_posts field instead of recalculating the user post count every time. That should be good enough for most folks, and as a bonus, the top 10 users on the index will now match the top 10 users on the memberlist. The next thing that I did was create an index on the user_posts field so that when I do a query / order by user_posts the results come back nearly instantly. The index.php query time went from 10+ seconds to nearly instant. Oh, and the person I did the work for? I didn&#8217;t hear from her for two years after that, so I assume that the work is still holding up under an ever increasing load.</p>
<p>The moral of this example is that correctness (counting posts) is not necessarily as important as speed. If you can get an answer that is &#8220;good enough&#8221; and get it really really fast, that&#8217;s fine. In this case I would argue that using the user_posts field is the correct answer anyway. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
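<p>A minimal sketch of that rewrite follows. The index name is my own invention for this example, and the query is a simplified version that assumes the Anonymous user (user_id -1) should be left out of the list:</p>
<pre>-- One-time change: index the column used for sorting
CREATE INDEX idx_user_posts ON phpbb_users (user_posts);

-- Top 10 posters, read straight from the stored count
SELECT  u.user_id
,       u.username
,       u.user_posts
FROM    phpbb_users u
WHERE   u.user_id &lt;&gt; -1
ORDER BY u.user_posts DESC
LIMIT   10;</pre>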
<h3>Using Outer or Left Joins</h3>
<p>Sometimes you need these. Most of the time you don&#8217;t. Outer joins are a subject for a <a href="http://www.phpbbdoctor.com/blog/category/phpbb/database-tips/">Database Tip</a> article that I started writing a long time ago and have not finished yet. (It will be done when&#8230; you know the rest.) Without recreating that entire article, here is the important point: Outer joins generally void index use. </p>
<p>Indexes exist to make your queries efficient. Earlier I mentioned creating an index on user_posts in order to speed up the Top Posters on Index MOD. Without the index the code optimizations still helped. With the index the extra query added essentially zero load to the board. If you give up the index, you give up a lot of efficiency.</p>
<p>Avoid outer joins.</p>
<p>An example of an outer join is the &#8220;You posted here&#8221; MOD. There was a MOD author writing this one at phpbb.com, and he posted his SQL code. I suggested that instead of one big query he break it into two. This would allow him to avoid the outer join. He balked, at first, quoting the standard response:</p>
<blockquote><p>Fewer queries are always better.</p></blockquote>
<p>If you&#8217;ve been paying attention, that was what I suggested earlier in this very topic! <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_lol.gif' alt=':lol:' class='wp-smiley' />  But the point here is that you don&#8217;t always participate in every topic. Getting the &#8220;you posted here&#8221; indicator to work means you need to check and set the mark for those topics that you did post in, and leave the rest out.</p>
<p>This is best done with a second query.</p>
<p>But not a query inside the loop, no, never that. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' /> </p>
<p>What I do for my version of this MOD is this. First I run the general viewforum query which gets all of the topic data. I collect the topic_id values that will be shown on the page into an array. Then I run a second query that returns a count of posts that I have made per topic in the topics to be displayed on the page. I set up an array using the topic_id as the key, and the value TRUE if I have posted at least once. Now, during the viewforum output, I simply check the array for the existence of a key that matches the topic, and if I find it, I display the proper indicator.</p>
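<p>The second query could look something like the sketch below. The user_id (42) and the topic_id values are placeholders; in practice they come from the session and from the array built while reading the viewforum results.</p>
<pre>-- One row per topic on this page that user 42 has posted in;
-- topics with no posts by this user simply do not appear
SELECT  p.topic_id
,       COUNT(*) AS my_posts
FROM    phpbb_posts p
WHERE   p.poster_id = 42
AND     p.topic_id IN (1001, 1002, 1003, 1004)
GROUP BY p.topic_id;</pre>
<p>This is an inner-join style lookup with no optional side, which is the reason for splitting it out of the main query.</p>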
<p>On the search page I cheat. If you have selected one of the &#8220;canned&#8221; searches like &#8220;egosearch&#8221; then the indicator is simply turned on for every topic and I skip the extra query.</p>
<h3>Summary</h3>
<p>If it seemed that all of these optimization tips were related to queries, well, you are very observant. There is a reason for that. I work with databases in real life too, so that&#8217;s where I feel my expertise is. And in my experience there are a lot more &#8220;bad&#8221; MODs that can be improved by tweaking either the position or the structure of the query than by altering the actual php code.</p>
<p>If you have a MOD that you are thinking about, but you are wondering if it&#8217;s efficient or not, try checking it out with some of these tips in mind. If it passes, then odds are decent that it won&#8217;t kill your server. If it&#8217;s really important to you, check back in about <del datetime="2007-09-01T03:15:39+00:00">2016</del> 2017 when I&#8217;ll have this service live here on the Doctor Board. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/09/07/mod-review-service/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Database Design #3: Table Joins</title>
		<link>http://www.phpbbdoctor.com/blog/2007/06/04/database-design-3-table-joins/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/06/04/database-design-3-table-joins/#comments</comments>
		<pubDate>Mon, 04 Jun 2007 13:25:26 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>
		<category><![CDATA[MOD Writing]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=83</guid>
		<description><![CDATA[In prior articles in this series I have talked about associative tables and primary / foreign keys. In this post I am going to talk more about keys but more specifically about table joins. If you&#8217;ve been working with databases for a while this article will probably seem fairly basic. But like many basic things, [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.phpbbdoctor.com/blog/?cat=8">prior articles in this series</a> I have talked about associative tables and primary / foreign keys. In this post I am going to talk more about keys but more specifically about table joins. If you&#8217;ve been working with databases for a while this article will probably seem fairly basic. But like many basic things, it&#8217;s important. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p><span id="more-83"></span></p>
<p>To do much of anything useful in phpBB you need data from more than one table. For example to display the index page you need information from the following tables:</p>
<ul>
<li><a href="http://www.phpbbdoctor.com/doc_columns.php?id=3">phpbb_categories</a><br />
This table contains the category name and display order</li>
<li><a href="http://www.phpbbdoctor.com/doc_columns.php?id=7">phpbb_forums</a><br />
This table contains the forum names, the category it belongs to, the forum description, and other information (like the post_id for the last post in the forum)</li>
<li><a href="http://www.phpbbdoctor.com/doc_columns.php?id=21">phpbb_topics</a><br />
As you might expect from the name of this table it contains the information about a topic.</li>
<li><a href="http://www.phpbbdoctor.com/doc_columns.php?id=9">phpbb_posts</a><br />
This table contains the post information&#8230; well, most of it, anyway. It contains everything but the actual post text and things used to apply formatting (like the bbcode_uid value).</li>
<li><a href="http://www.phpbbdoctor.com/doc_columns.php?id=24">phpbb_users</a><br />
This table is used to display the user data for the last poster in the topic.</li>
</ul>
<p>The list goes on from there, actually. There are references to the auth_access table, the user_group table, the groups table (for figuring out who the forum moderators are) but I&#8217;ll skip those for now. The point is, each of these tables contains specific information about separate components of your board, and we want to pull everything together. That&#8217;s where joins come in.</p>
<p><strong>ANSI versus &#8220;Old Style&#8221;</strong><br />
I&#8217;ve been working with databases for a long time. Much of my early work was done with the Oracle database but I have experience with DB2, Informix, Microsoft SQL Server, Teradata, just about all of the &#8220;big guns&#8221; in the database world. And for a long time they all did joins differently. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_lol.gif' alt=':lol:' class='wp-smiley' /> </p>
<p>For example, to do an outer (optional) join in Oracle you use a syntax that looks like (+). For Microsoft you would use *=. And still other database vendors used this weird syntax with the actual words LEFT JOIN in it instead. It turns out the weird syntax was there for a reason: it&#8217;s the ANSI standard method of writing join logic. Oracle support has been spotty, at best, for phpBB2, but if you look at the current code for index.php you will still see a block of special SQL that was written just for the Oracle database.</p>
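<p>For illustration, here is the same optional join (every category, whether or not it has any forums) written in the three styles just mentioned. The first two are the legacy vendor-specific forms as I remember them; check your own database version before relying on either one.</p>
<pre>-- Oracle legacy syntax: (+) marks the optional side
SELECT  c.cat_title, f.forum_name
FROM    phpbb_categories c, phpbb_forums f
WHERE   c.cat_id = f.cat_id (+);

-- Microsoft SQL Server legacy syntax: *= keeps every row from the left
SELECT  c.cat_title, f.forum_name
FROM    phpbb_categories c, phpbb_forums f
WHERE   c.cat_id *= f.cat_id;

-- ANSI syntax
SELECT  c.cat_title, f.forum_name
FROM    phpbb_categories c
LEFT JOIN phpbb_forums f
        ON c.cat_id = f.cat_id;</pre>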
<p>Today I think every database that I&#8217;ve worked with (at least in the last five years) has had support for the ANSI join syntax. I tend to still use the &#8220;old&#8221; syntax, and I will talk a little bit about that. I will also talk about performance implications, and how the position of the &#8220;where&#8221; clause elements can actually change the answer to a query.</p>
<p>That&#8217;s a lot for one post, so we will see how far we get to start with. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p><strong>Join Syntax</strong><br />
I will start with a very simple case of joining the categories table to the forums table and getting a list of categories and their related forums. The category table has the following structure:</p>
<pre>+------------+-----------------------+------+-----+---------+----------------+
| Field      | Type                  | Null | Key | Default | Extra          |
+------------+-----------------------+------+-----+---------+----------------+
| cat_id     | mediumint(8) unsigned |      | PRI | NULL    | auto_increment |
| cat_title  | varchar(100)          | YES  |     | NULL    |                |
| cat_order  | mediumint(8) unsigned |      | MUL | 0       |                |
+------------+-----------------------+------+-----+---------+----------------+</pre>
<p>The forums table has these columns (and some others that were left out for space):</p>
<pre>+----------------------+-----------------------+------+-----+---------+-------+
| Field                | Type                  | Null | Key | Default | Extra |
+----------------------+-----------------------+------+-----+---------+-------+
| forum_id             | smallint(5) unsigned  |      | PRI | 0       |       |
| cat_id               | mediumint(8) unsigned |      | MUL | 0       |       |
| forum_name           | varchar(150)          | YES  |     | NULL    |       |
| forum_desc           | text                  | YES  |     | NULL    |       |</pre>
<p>Notice something in common? Both tables have a cat_id field. This is a primary key (categories table) or foreign key (forums table) that I have talked about before. Suppose that I now want to pull a list of categories and their related forums, and put them in the proper order. I would use this:</p>
<pre>SELECT	c.cat_title
,	f.forum_name
FROM	phpbb_categories c
,	phpbb_forums f
WHERE	c.cat_id = f.cat_id
ORDER BY c.cat_order, f.forum_order</pre>
<p>This is one syntax&#8230; the &#8220;old&#8221; join syntax, where the join logic that puts the two category ID values together is in the WHERE clause. The ANSI standard syntax pulls the join logic into the FROM clause instead, and looks like this:</p>
<pre>SELECT	c.cat_title
,	f.forum_name
FROM	phpbb_categories c
JOIN 	phpbb_forums f
	ON c.cat_id = f.cat_id
ORDER BY c.cat_order, f.forum_order</pre>
<p>The keyword JOIN has been added, and the keyword ON identifies the join clause for the two tables. Which is easier to read? Most articles I have read suggest the second syntax is easier, although I can see arguments either way. For example in the second example it is very easy to see that the only restrictions are join restrictions; there is no &#8220;WHERE&#8221; clause and therefore no additional restrictions on the data returned by the query. When all of the joins are mixed in the WHERE clause along with other restrictions it is harder to make that determination.</p>
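<p>To make the readability point concrete, here is the same query in both styles with one real restriction added (the cat_id value is arbitrary). In the old style the join and the restriction sit side by side in the WHERE clause; in the ANSI style they are separated.</p>
<pre>-- Old style: join logic and restriction mixed together
SELECT  c.cat_title, f.forum_name
FROM    phpbb_categories c, phpbb_forums f
WHERE   c.cat_id = f.cat_id
AND     c.cat_id = 3
ORDER BY f.forum_order;

-- ANSI style: join in the FROM clause, restriction in the WHERE clause
SELECT  c.cat_title, f.forum_name
FROM    phpbb_categories c
JOIN    phpbb_forums f
        ON c.cat_id = f.cat_id
WHERE   c.cat_id = 3
ORDER BY f.forum_order;</pre>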
<p>Which is correct? For most databases (I should probably say all databases) the results of this query will be exactly the same. Why use one over the other? Both are fairly portable, perform equally well, and thus it would seem there&#8217;s really no difference.</p>
<p>But there can be. And it can be surprising. More on that in a moment.</p>
<p><strong>Join Types</strong><br />
There are three types of joins, of which only two are typically used. The three types of joins (and their definitions) are:</p>
<ul>
<li>Inner Join<br />
This is &#8211; unless specified otherwise &#8211; the assumed type of join. An inner join looks for matching data on both sides of the join and returns row sets that fit the requirements.</li>
<li>Outer Join<br />
Otherwise known as an &#8220;optional&#8221; join, this join will only be used if the key word LEFT or RIGHT (or possibly FULL OUTER) is included in the join clause. The word LEFT or RIGHT tells the database which side of the relationship is required with the other side becoming optional. A FULL OUTER join is optional in both directions.</li>
<li>Cross or Cartesian Product Join<br />
This is generally an error rather than an intentional technique. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_eek.gif' alt=':shock:' class='wp-smiley' />  A cross-product, also known as a cartesian product join, is really the absence of a join, and the results are generally not what you want. If you have ten forums and 100 topics and you forget to put a join clause in your query you will return every topic <strong>in every forum</strong> for a total result row count of 10 * 100 or 1,000 rows of data. A cartesian join is generally considered a bug. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' /> </li>
</ul>
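<p>Here is what the accidental cartesian product from the last item looks like in the old join syntax, followed by the corrected query. This is only a sketch using the standard phpBB2 forum and topic tables.</p>
<pre>-- Missing join clause: every forum is paired with every topic
-- (10 forums and 100 topics give 1,000 rows)
SELECT  f.forum_name
,       t.topic_title
FROM    phpbb_forums f
,       phpbb_topics t;

-- Join clause restored: each topic appears once, with its own forum
SELECT  f.forum_name
,       t.topic_title
FROM    phpbb_forums f
,       phpbb_topics t
WHERE   f.forum_id = t.forum_id;</pre>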
<p><strong>JOIN or WHERE?</strong><br />
The main difference in ANSI versus the older style of join is to separate the join logic from the WHERE clause. The logic required to pull data out of a database can involve two steps: putting the data together, and then throwing away what you don&#8217;t want. Joins are a way to pull data together, filters or conditions are a way to throw away what you don&#8217;t want. With the ANSI syntax a filter can appear in either the FROM clause (as part of the ON condition) or the WHERE clause, and here&#8217;s where things get tricky.</p>
<p>I made a few very simple tables to demonstrate some join issues and differences between a join and a filter, and how it can impact the results.</p>
<p>First table</p>
<pre>mysql> select * from a;
+----+-------+
| id | var1  |
+----+-------+
|  1 | Row 1 |
|  2 | Row 2 |
+----+-------+
2 rows in set (0.00 sec)</pre>
<p>Second table</p>
<pre>mysql> select * from b;
+----+-------+
| id | var2  |
+----+-------+
|  1 | Row A |
|  2 | Row B |
+----+-------+
2 rows in set (0.00 sec)</pre>
<p>Associative Table</p>
<pre>mysql> select * from a_b;
+------+------+
| a_id | b_id |
+------+------+
|    1 |    1 |
+------+------+
1 row in set (0.00 sec)</pre>
<p>So in the sample data above I have three tables. The first table has two rows, as does the second table. The third table is an <em>associative table</em> that links the two together. You might notice that only one row from each table is related. In order to pull out the rows that <strong>are</strong> related I would use the following query:</p>
<pre>select  a.*
,       b.*
from    a
join    a_b
        on a.id = a_b.a_id
join    b
        on a_b.b_id = b.id</pre>
<p>Results</p>
<pre>+----+-------+----+-------+
| id | var1  | id | var2  |
+----+-------+----+-------+
|  1 | Row 1 |  1 | Row A |
+----+-------+----+-------+
1 row in set (0.00 sec)</pre>
<p>This query behaves as expected. I asked for data from all three tables, but only where a relationship exists. There is only one row that fits the qualifications, and therefore one row is presented in the result set.</p>
<p>Next, let&#8217;s run a query that makes the join optional. We&#8217;ll require data from the left, but not the right. We will also include a condition in the WHERE clause. That query looks like this:</p>
<pre>select  a.*
,       b.*
from    a
left join a_b
        on a.id = a_b.a_id
left join b
        on a_b.b_id = b.id
where   b.var2 = 'Row B'</pre>
<p>The results? No rows are returned.</p>
<pre>Empty set (0.00 sec)</pre>
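<p>The reason is that the joins are done first, then the data comparison is made against b.var2. After the two left joins, row 1 from table &#8220;a&#8221; is paired with &#8220;Row A&#8221; and row 2 has NULL for var2 because it has no match. Neither value equals &#8216;Row B&#8217;, so the WHERE clause throws both rows away. Now let&#8217;s move the condition (filter) from the WHERE into the FROM and rerun the query&#8230;</p>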
<p>The SQL:</p>
<pre>select  a.*
,       b.*
from    a
left join a_b
        on a.id = a_b.a_id
left join b
        on a_b.b_id = b.id
        and b.var2 = 'Row B'</pre>
<p>The results:</p>
<pre>+----+-------+------+------+
| id | var1  | id   | var2 |
+----+-------+------+------+
|  1 | Row 1 | NULL | NULL |
|  2 | Row 2 | NULL | NULL |
+----+-------+------+------+</pre>
<p>In this case we got two rows. Hm. I said earlier that the two ways of doing joins were the same, right? So why the different results?</p>
<p>It&#8217;s fairly simple. When you put a filter in the FROM clause the filter is applied before (or during) the join process. If you put a filter in the WHERE clause it is done after the joins are completed. In the example above the filter is applied to table &#8220;b&#8221; and then the rows are optionally joined (via the LEFT OUTER) to table &#8220;a&#8221;. Because of the position of the filter, all of the rows from table &#8220;a&#8221; will show up. In the prior example the condition was applied after the left join was performed, therefore the rows were all eliminated. So it does, in fact, make a difference where your filters are placed.</p>
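<p>It can help to look at what the two left joins produce before any filter on b.var2 is applied. Using the same three sample tables:</p>
<pre>select  a.*
,       b.*
from    a
left join a_b
        on a.id = a_b.a_id
left join b
        on a_b.b_id = b.id;

+----+-------+------+-------+
| id | var1  | id   | var2  |
+----+-------+------+-------+
|  1 | Row 1 |    1 | Row A |
|  2 | Row 2 | NULL | NULL  |
+----+-------+------+-------+</pre>
<p>A WHERE clause of b.var2 = 'Row B' is tested against these two rows, and neither passes. When the same test is part of the ON condition, it only decides whether a row from &#8220;b&#8221; is allowed to match; no row qualifies, so both rows from &#8220;a&#8221; are kept and padded with NULL.</p>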
<p>In theory, a filter in the FROM clause could be more efficient, because it throws rows away before doing the join logic. Is it really more efficient? That&#8217;s a topic for another blog post.</p>
<p><strong>Performance on Outer Joins</strong><br />
When you do an inner join the database is expecting data to exist on both sides of the relationship. Because of that it can use indexes to pull the data together. I have another post that I need to complete that discusses this issue in more detail. For now, let me just leave it as you should only use outer joins when they are absolutely necessary. In some cases, even having two separate queries is more efficient than one larger query with an outer join. Details to come later. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p><strong>Summary</strong><br />
You probably can use either syntax that I showed in this post when writing joins. Old habits die hard, so I tend to use the older syntax (in the WHERE clause) unless I need an outer join.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/06/04/database-design-3-table-joins/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Database Design #2 &#8211; Primary Keys</title>
		<link>http://www.phpbbdoctor.com/blog/2007/02/13/database-design-2-primary-keys/</link>
		<comments>http://www.phpbbdoctor.com/blog/2007/02/13/database-design-2-primary-keys/#comments</comments>
		<pubDate>Tue, 13 Feb 2007 06:01:39 +0000</pubDate>
		<dc:creator>Dave Rathbun</dc:creator>
				<category><![CDATA[Database Tips]]></category>
		<category><![CDATA[MOD Writing]]></category>

		<guid isPermaLink="false">http://www.phpbbdoctor.com/blog/?p=79</guid>
		<description><![CDATA[Over the years that I have been involved with phpbb.com I have seen a number of posts &#8211; not that frequent, mind you, but more often than I would expect &#8211; asking how to &#8220;renumber my users&#8221; or something like that. It seems that folks are bothered by the fact that the first user_id is [...]]]></description>
			<content:encoded><![CDATA[<p>Over the years that I have been involved with phpbb.com I have seen a number of posts &#8211; not that frequent, mind you, but more often than I would expect &#8211; asking how to &#8220;renumber my users&#8221; or something like that. It seems that folks are bothered by the fact that the first user_id is 2, or sometimes by the fact that there can be gaps in the sequence of user_id values. Why worry? Your database doesn&#8217;t. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_biggrin.gif' alt=':-D' class='wp-smiley' /> </p>
<p><span id="more-79"></span></p>
<p>You see, the user_id, topic_id, post_id, forum_id, and others are all <strong>primary key</strong> values. They are generated automatically by the system (user_id is a bit different from the rest; more on that in a bit) and used only to connect bits of information together. The user_id is meaningless to a person. But it means quite a bit if you want to connect a user to a group via an associative table as I covered in the <a href="http://www.phpbbdoctor.com/blog/?p=72">first post in this series</a>.</p>
<p>Centuries ago the world was thought to be flat. It turns out it wasn&#8217;t. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  Decades ago most databases were flat. As in flat text files. You had COBOL coders reading and writing ISAM files to mainframe disk packs or tape drives. In order to find anything, you had to know the record size and offset so you would know how far to read in order to retrieve your data. This worked okay for tapes which were essentially a linear data stream. It was okay for flat files stored on disk as well. But eventually someone figured out there was a better way to store data, and relational databases were born.</p>
<p>A relational database today is all about, well, relationships. You may have heard of an ER Diagram? The &#8220;ER&#8221; stands for Entity-Relationship. An <strong>entity</strong> is something like a forum or a topic or a user. A <strong>relationship</strong> is how two of those things get put together. For example, a user posts a topic, or the inverse, a topic is posted by a user. In order to maintain that relationship we need something unique about each user and about each topic. So now we&#8217;re ready to talk about primary keys. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>A primary key is something that uniquely identifies an entity. In many cases the primary key is a single value (like user_id for a user) but it doesn&#8217;t have to be. For an example you would see a compound key if you review the phpbb_search_wordmatch table (an associative table between posts and search words). There is no single primary key for the table&#8230; instead it is the combination of post_id and word_id that is unique. For simple systems like phpBB it&#8217;s probably worth the extra effort to assign a single-column primary key for each table. For extremely complex systems it is often necessary to create multiple-column primary keys. </p>
<p><em>Ironically there is no primary key defined on (post_id, word_id) in the database for phpBB. The application doesn&#8217;t need it, but strictly speaking it should be present to enforce data integrity rules.</em></p>
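<p>If you wanted to add that compound key yourself, the MySQL statement would look roughly like the one below. Treat it as a sketch rather than a recommended change: it will fail if the table already contains duplicate (post_id, word_id) pairs, and I have not tested what it does to search indexing speed on a large board.</p>
<pre>-- Compound primary key: the pair of columns must be unique,
-- although each column on its own can repeat
ALTER TABLE phpbb_search_wordmatch
        ADD PRIMARY KEY (post_id, word_id);</pre>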
<p>Why is a primary key important? Why not just use the user&#8217;s username? That&#8217;s a decent question, and there are two very important reasons for not using something like a username as a primary key. First, it&#8217;s character data. Linking one block of character data to another block of character data requires a lot more bytes of traffic than linking two compact numeric values. The user_id in phpBB is defined as mediumint(8). This takes far less space to process than 25 characters of string data. An index on a numeric field is smaller and therefore much more efficient than an index on character data.</p>
<p>Second&#8230; and now we&#8217;re getting back to the opening paragraph from this post&#8230; a primary key <strong>should never ever change</strong> its value! Never. Never ever. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  By definition a primary key is used to relate that entity to other entities in your database. If I change that key value, I have to go through the entire database and change the key <strong>everywhere it exists!</strong> If you miss something, then you end up with what are called &#8220;orphan records&#8221; where you have parts of your database that don&#8217;t link up to anything else. That&#8217;s a &#8220;Bad Thing&#8221; to have. Usernames can change. A user_id should not.</p>
<p>So what is the solution? What if you wanted to have a sequential &#8220;user number&#8221; that did not have any gaps, and that got adjusted every time a user got deleted from your database? That user number &#8211; like the username, email address, web site, and so on &#8211; should become an <strong>attribute</strong> of the user (entity). Attributes are pieces of data that we collect about an entity and store in our database. If it was really important for some reason to have a sequential user number then here&#8217;s how I would see that working:</p>
<ol>
<li>Assign the next user_number on registration using a process similar to that used for user_id</li>
<li>Anytime a user is deleted execute something like the following:
<ul>
<li>get user number for deleted user</li>
<li>update users_table<br />
set user_number = user_number &#8211; 1<br />
where user_number > deleted_user_number</li>
</ul>
</li>
</ol>
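<p>Written out as SQL, the delete step might look like this. The user_number column does not exist in a standard phpBB install, so it is hypothetical here, and the values 57 and 12 are made up.</p>
<pre>-- 1. Look up the user number of the user being deleted
SELECT  user_number
FROM    phpbb_users
WHERE   user_id = 57;          -- suppose this returns 12

-- 2. Delete the user (the real process also cleans up related tables)
DELETE FROM phpbb_users
WHERE   user_id = 57;

-- 3. Close the gap: everyone after number 12 moves down by one
UPDATE  phpbb_users
SET     user_number = user_number - 1
WHERE   user_number &gt; 12;</pre>
<p>Notice that user_id is only ever used to find the row. It is never updated.</p>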
<p>Is it worth it? Maybe. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  It depends on how important it is to you. Frankly I don&#8217;t see the need, or the point. But it should be done with a separate field, and never with the user_id primary key value.</p>
<p>I mentioned earlier in this post that the process used to generate user_id values is not the same as other primary keys within phpBB. Most of the primary keys are generated (on MySQL) via an <strong>auto-increment</strong> attribute. In other databases it would be done via a sequence. What this means is the application (phpBB) is not responsible for generating primary key values; the database will do that process automatically. This sounds good, so why isn&#8217;t the user_id handled in the same way? For whatever reason the &#8220;Anonymous&#8221; user has a negative user_id (-1). I am told that in earlier versions of MySQL (the most popular database used to run phpBB) an auto-increment attribute column must be numeric and unsigned, meaning it can only store positive numbers. Later versions of MySQL apparently don&#8217;t have this issue, as I was able to create a table with a signed integer key and insert negative values. The auto-increment, however, started with the first positive value rather than using any negative values.</p>
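<p>The experiment described above can be repeated with a small throwaway table. The table and column names below are invented for this example and are not part of phpBB.</p>
<pre>-- Signed key column with the auto-increment attribute
CREATE TABLE demo_users (
        user_id   MEDIUMINT(8) NOT NULL AUTO_INCREMENT,
        username  VARCHAR(25)  NOT NULL,
        PRIMARY KEY (user_id)
);

-- Explicit negative key, as used for the Anonymous user
INSERT INTO demo_users (user_id, username) VALUES (-1, 'Anonymous');

-- No key supplied: the database generates the value
INSERT INTO demo_users (username) VALUES ('first_member');

SELECT user_id, username FROM demo_users ORDER BY user_id;</pre>
<p>Based on what I saw, the second row should receive the first positive value rather than continuing from -1, but try it on your own MySQL version before depending on that behavior.</p>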
<p>We have here at the phpBBDoctor web site an online table reference. It&#8217;s a bit out of date, and that&#8217;s one of the projects that I intend to get around to on the first.* There is a link in the links section at the end of this post, and I encourage you to check it out if you haven&#8217;t seen it before. Even if it&#8217;s slightly out of date (as I write this, it could be updated soon), it is a more than adequate reference for the database design for phpBB2. (No work has started on an equivalent for phpBB3 as of yet.) One of the things that the reference shows is which columns are primary keys as well as which columns are foreign keys.</p>
<p>What&#8217;s a foreign key? Simply put, it&#8217;s a primary key value stored in another table in order to create the relationship. The user_id is a primary key for the phpbb_users table. It is a foreign key (poster_id) in the phpbb_posts table. That&#8217;s how we know which user entered the post. Another foreign key is topic_poster in the phpbb_topics table. That&#8217;s used to record which user started the topic. Why store the topic_poster in the topics table? The poster_id is stored on the first post in the topic, and we know which post is the first in the topic by looking up the topic_first_post_id from the topics table, so why do we need it? Speed. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>For a pure database design you would not store the topic_poster on the phpbb_topics table. It&#8217;s done in phpBB for performance reasons, and is a process called Denormalization. That&#8217;s a topic for another post. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_cool.gif' alt='8)' class='wp-smiley' /> </p>
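<p>The difference is easy to see by writing the &#8220;who started this topic&#8221; query both ways. These are simplified sketches (the forum_id value is arbitrary), not queries taken from the phpBB source.</p>
<pre>-- Pure design: go through the first post to find the topic starter
SELECT  t.topic_title
,       u.username
FROM    phpbb_topics t
,       phpbb_posts p
,       phpbb_users u
WHERE   t.topic_first_post_id = p.post_id
AND     p.poster_id = u.user_id
AND     t.forum_id = 5;

-- With topic_poster stored on the topic: one less table to join
SELECT  t.topic_title
,       u.username
FROM    phpbb_topics t
,       phpbb_users u
WHERE   t.topic_poster = u.user_id
AND     t.forum_id = 5;</pre>
<p>Both should return the same rows as long as topic_poster is kept in step with the first post. That upkeep is the price of the shortcut.</p>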
<p><strong>Related Links</strong></p>
<ul>
<li><a href="http://www.phpbbdoctor.com/doc_tables.php">phpBBDoctor Table Reference</a></li>
<li><a href="http://www.phpbbdoctor.com/blog/?p=72">Database Design #1: Associative Tables</a></li>
</ul>
<p><span style="font-size: 8px">* &#8220;On the first&#8221; means on the first chance I get. <img src='http://www.phpbbdoctor.com/blog/wp-includes/images/smilies/icon_razz.gif' alt=':-P' class='wp-smiley' /> </span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.phpbbdoctor.com/blog/2007/02/13/database-design-2-primary-keys/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
