In prior articles in this series I have talked about associative tables and primary / foreign keys. In this post I am going to talk more about keys but more specifically about table joins. If you’ve been working with databases for a while this article will probably seem fairly basic. But like many basic things, it’s important.
To do much of anything useful in phpBB you need data from more than one table. For example to display the index page you need information from the following tables:
This table contains the category name and display order
This table contains the forum names, the category it belongs to, the forum description, and other information (like the post_id for the last post in the forum)
As you might expect from the name of this table it contains the information about a topic.
This table contains the post information… well, most of it, anyway. It contains everything but the actual post text and things used to apply formatting (like the bbcode_uid value).
This table is used to display the user data for the last poster in the topic.
The list goes on from there, actually. There are references to the auth_access table, the user_group table, the groups table (for figuring out who the forum moderators are) but I’ll skip those for now. The point is, each of these tables contains specific information about separate components of your board, and we want to pull everything together. That’s where joins come in.
ANSI versus “Old Style”
I’ve been working with databases for a long time. Much of my early work was done with the Oracle database but I have experience with DB2, Informix, Microsoft SQL Server, Teradata, just about all of the “big guns” in the database world. And for a long time they all did joins differently.
For example, to do an outer (optional) join in Oracle you use a syntax that looks like (+). For Microsoft you would use *=. And still other database vendors used this weird syntax with the actual words LEFT JOIN in it instead. It turns out the weird syntax was there for a reason: it’s the ANSI standard method of writing join logic. Oracle support has been spotty, at best, for phpBB2, but if you look at the current code for index.php you will still see a block of special SQL that was written just for the Oracle database.
Today I think every database that I’ve worked with (at least in the last five years) has had support for the ANSI join syntax. I tend to still use the “old” syntax, and I will talk a little bit about that. I will also talk about performance implications, and how the position of the “where” clause elements can actually change the answer to a query.
That’s a lot for one post, so we will see how far we get to start with.
I will start with a very simple case of joining the categories table to the forums table and getting a list of categories and their related forums. The category table has the following structure:
+------------+-----------------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +------------+-----------------------+------+-----+---------+----------------+ | cat_id | mediumint(8) unsigned | | PRI | NULL | auto_increment | | cat_title | varchar(100) | YES | | NULL | | | cat_order | mediumint(8) unsigned | | MUL | 0 | | +------------+-----------------------+------+-----+---------+----------------+
The forums table has these columns (and some others that were left out for space):
+----------------------+-----------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +----------------------+-----------------------+------+-----+---------+-------+ | forum_id | smallint(5) unsigned | | PRI | 0 | | | cat_id | mediumint(8) unsigned | | MUL | 0 | | | forum_name | varchar(150) | YES | | NULL | | | forum_desc | text | YES | | NULL | |
Notice something in common? Both tables have a cat_id field. This is a primary key (categories table) or foreign key (forums table) that I have talked about before. Suppose that I now want to pull a list of categories and their related forums, and put them in the proper order. I would use this:
SELECT c.cat_title , f.forum_name FROM phpbb_categories c , phpbb_forums f WHERE c.cat_id = f.cat_id ORDER BY c.cat_order, f.forum_order
This is one syntax… the “old” join syntax, where the join logic that puts the two category ID values together is in the WHERE clause. The ANSI standard syntax pulls the join logic into the FROM clause instead, and looks like this:
SELECT c.cat_title , f.forum_name FROM phpbb_categories c JOIN phpbb_forums f ON c.cat_id = f.cat_id ORDER BY c.cat_order, f.forum_order
The keyword JOIN has been added, and the keyword ON identifies the join clause for the two tables. Which is easier to read? Most articles I have read suggest the second syntax is easier, although I can see arguments either way. For example in the second example it is very easy to see that the only restrictions are join restrictions; there is no “WHERE” clause and therefore no additional restrictions on the data returned by the query. When all of the joins are mixed in the WHERE clause along with other restrictions it is harder to make that determination.
Which is correct? For most databases (I should probably say all databases) the results of this query will be exactly the same. Why use one over the other? Both are fairly portable, perform equally well, and thus it would seem there’s really no difference.
But there can be. And it can be surprising. More on that in a moment.
There are three types of joins, of which only two are typically used. The three types of joins (and their definitions) are:
- Inner Join
This is – unless specified otherwise – the assumed type of join. An inner join looks for matching data on both sides of the join and returns row sets that fit the requirements.
- Outer Join
Otherwise known as an “optional” join, this join will only be used if the key word LEFT or RIGHT (or possibly FULL OUTER) is included in the join clause. The word LEFT or RIGHT tells the database which side of the relationship is required with the other side becoming optional. A FULL OUTER join is optional in both directions.
- Cross or Cartesian Product Join
This is generally an error rather than an intentional technique. A cross-product also known as a cartesian product join is really the absense of a join, and the results are generally not what you want. If you have ten forums and 100 topics and you forget to put a join clause in your query you will return every topic in every forum for a total result row count of 10 * 100 or 1,000 rows of data. A cartesian join is generally considered a bug.
JOIN or WHERE?
The main difference in ANSI versus the older style of join is to separate the join logic from the WHERE clause. The logic required to pull data out of a database can involve two steps: putting the data together, and then throwing away what you don’t want. Joins are a way to pull data together, filters or conditions are a way to throw away what you don’t want. These join filters can also appear in the FROM or the WHERE clause, and here’s where things get tricky.
I made a few very simple tables to demonstrate some join issues and differences between a join and a filter, and how it can impact the results.
mysql> select * from a; +----+-------+ | id | var1 | +----+-------+ | 1 | Row 1 | | 2 | Row 2 | +----+-------+ 2 rows in set (0.00 sec)
mysql> select * from b; +----+-------+ | id | var2 | +----+-------+ | 1 | Row A | | 2 | Row B | +----+-------+ 2 rows in set (0.00 sec)
mysql> select * from a_b; +------+------+ | a_id | b_id | +------+------+ | 1 | 1 | +------+------+ 1 row in set (0.00 sec)
So in the sample data above I have three tables. The first table has two rows, as does the second table. The third table is an associative table that links the two together. You might notice that only one row from each table is related. In order to pull out the rows that are related I would use the following query:
select a.* , b.* from a join a_b on a.id = a_b.a_id join b on a_b.b_id = b.id
+----+-------+----+-------+ | id | var1 | id | var2 | +----+-------+----+-------+ | 1 | Row 1 | 1 | Row A | +----+-------+----+-------+ 1 row in set (0.00 sec)
This query behaves as expected. I asked for data from all three tables, but only where a relationship exists. There is only one row that fits the qualifications, and therefore one row is presented in the result set.
Next, let’s run query that makes the join optional. We’ll require data from the left, but not the right. We will also include a condition in the WHERE clause. That query looks like this:
select a.* , b.* from a left join a_b on a.id = a_b.a_id left join b on a_b.b_id = b.id where b.var2 = 'Row B'
The results? No rows are returned.
Empty set (0.00 sec)
The reason is that the joins are done first, then the data comparision is made against b.var2. Why do we get zero rows? Let’s move the condition (filter) from the WHERE into the FROM and rerun the query…
select a.* , b.* from a left join a_b on a.id = a_b.a_id left join b on a_b.b_id = b.id and b.var2 = 'Row B'
+----+-------+------+------+ | id | var1 | id | var2 | +----+-------+------+------+ | 1 | Row 1 | NULL | NULL | | 2 | Row 2 | NULL | NULL | +----+-------+------+------+
In this case we got two rows. Hm. I said earlier that the two ways of doing joins were the same, right? so why the different results?
It’s fairly simple. When you put a filter in the FROM clause the filter is applied before (or during) the join process. If you put a filter in the WHERE clause it is done after the joins are completed. In the example above the filter is applied to table “b” and then the rows are optionally joined (via the LEFT OUTER) to table “a”. Because of the position of the filter, all of the rows from table “a” will show up. In the prior example the condition was applied after the left join was performed, therefore the rows were all eliminated. So it does, in fact, make a difference where your filters are placed.
In theory, a filter in the FROM clause could be more efficient, because it throws rows away before doing the join logic. Is it really more efficient? That’s a topic for another blog post.
Performance on Outer Joins
When you do an inner join the database is expecting data to exist on both sides of the relationship. Because of that it can use indexes to pull the data together. I have another post that I need to complete that discusses this issue in more detail. For now, let me just leave it as you should only use outer joins when they are absolutely necessary. In some cases, even having two separate queries is more efficient than one larger query with an outer join. Details to come later.
You probably can use either syntax that I showed in this post when writing joins. Old habits die hard, so I tend to use the older syntax (in the WHERE clause) unless I need an outer join.