Your premium source for custom modification services for phpBB


HomeForumsBlogMOD ManagerFAQSearchRegisterLogin

Comments July 29, 2008

Working With Recursive Data Part I: Table Designs

Filed under: Database Tips, MOD Writing, phpBB — Dave Rathbun @ 11:55 pm CommentsComments (2) 

There are all sorts of scenarios that require recursive data. If you don’t know what “recursive” means, it’s a relationship from an entity back to the same entity. In English, it’s data that points back to itself. :) Some typical examples of recursive data are company org charts, inventory build instructions, or even forums for phpBB3. Yes, I’m talking about phpBB3, are you surprised? :shock: :lol: I hope not, because I’m going to reference phpBB3 only in passing. The article is actually more about storing recursive data in any form. It’s also how I store information in my phpBBDoctor Project Management system, among other things.

SQL is not a recursive language. When I write a query it’s all about relationships between rows, not about looping back through the same table. Oracle has a special construct used to traverse recursive data and it works very well, but it’s the only database that I am currently aware of to support this. Since most phpBB MOD authors will not be writing for Oracle, I will skip that concept for now.

As mentioned in the first paragraph I have recursive data in my project tracking system that I use here on the phpBB Doctor site. The design for this system is simple, but complex. :) The first table is the project table and it includes summary attributes for the project. These attributes include values such as when the project started, who is the project manager, a description of the project, and the status. The next table is the tasks table. A task is assigned to a project, but a task can be broken up into sub-tasks as well. There is no expected limit to the depth of the tasks. Here is a screen shot of my test project so you get an idea of what I’m talking about.

Project Manager Screen Shot

“WBS” stands for Work Breakdown Structure, and it’s a dot notation used to denote the level of detail and lineage for a project task. In this case, the first task is “1″ and the first sub-task assigned to that task is “1.1″ and the next sub-task is “1.2″. If there is a sub-sub-task then another dot is added, as in “1.1.1″ or “1.2.1″ and so on. In this way, each task has a lineage from the parent project all the way down to the most detailed level tasks. The challenge that I face is to model this data properly so that I can retrieve the lineage from top to bottom, bottom to top, or even side to side.

Task Table Design 1: Storing Parent Relationships

Here’s a typical table structure for a tasks table in this situation.

| Field           | Type                  | Null | Key | Default | Extra          |
| task_id         | int(11) unsigned      |      | PRI | NULL    | auto_increment |
| task_number     | mediumint(8) unsigned |      |     | 0       |                |
| parent_task_id  | int(11) unsigned      |      | MUL | 0       |                |
| project_id      | mediumint(8) unsigned |      | MUL | 0       |                |
| task_status     | tinyint(2) unsigned   | YES  |     | 0       |                |
| task_name       | varchar(64)           |      |     |         |                |
| estimated_hours | decimal(8,2)          | YES  |     | 0.00    |                |

Notice that there is a task_id column which will be assigned by the database as the new record is inserted (that’s what the auto_increment attribute does). The task is also assigned a parent_task_id value, so that the task is related to (joined to) the parent. This allows me to walk through the task list to show task parents with their children. But in order to do that, I need a recursive language (php works) and I need to run a query for each level of the task depth. That’s not very efficient. But if I don’t do recursion, then it’s hard to put the information together correctly. Here is some of the raw data from my sample project shown in the screen shot earlier:

| task_id | parent_task_id | task_name                                     |
|      43 |             40 | Child Task 2 of Parent Task 1                 |
|      42 |             40 | Child Task 1 of Parent 1                      |
|      41 |              0 | Parent Task 2                                 |
|      40 |              0 | Parent Task 1                                 |
|      44 |             41 | Child Task 1 of Parent Task 2                 |
|      45 |             41 | Child Task 2 of Parent Task 2                 |
|      46 |             43 | Child task 1 of child task 1 of parent task 1 |
|      47 |             43 | Child task 2 of child task 1 of parent task 1 |
|      48 |             43 | Child task 3 of child task 2 of parent task 1 |
|      49 |             48 | Child task at level 4                         |
|      50 |             48 | Child Task 2 at level 4 After Edit            |
|      51 |             46 | Another child                                 |
|      52 |             46 | Another child                                 |

If I look carefully, I can see where a parent_task_id references a task_id from another row in the output. There are two tasks where the parent_task_id is zero, which means they are the top task or starting point for the hierarchy. For example, the task that is assigned task_id 46 has a parent task of 43, and that task (43) has a parent task of 40, and that task (40) has a parent task of 0. By examing the recursive relationship of one task to another I can go up (or down) the task list.

But each time I get a row, I have to run another query based on the result of the last query in order to determine the parent. If I were trying to determine the child, I would go in the other direction, but the results would be the same. Surely there is a better way to address this.

Task Table Design 2: Separate Tables For Task Levels

I could store each different task level in a different table. If I have five levels of tasks, I could have task_01, task_02, and so on up to task_05. By doing this I am no longer doing a recursive join as each table is separate. Recursion is required because all of the data is stored in one place. Storing each level in a different table fixes that.

But don’t do this, just don’t. :) It’s a horrible database design. Why?

Consider what happens if some tasks have fewer than five levels of detail. In order to avoid dropping rows, each query has to be an outer join. Outer joins are typically less efficient than inner joins so I try to avoid them. And this is not the worst issue. What if some projects end up with tasks at six levels deep? Or seven? Each time I add a new level to the task structure, I have to add a new table to my database. Each time I add a new table to my database, I have to adjust my application code. The application should drive the data, not the other way around. The reality is, unless you know that you will forever have a specific depth to your recursive data, you should never consider this option.

Task Table Design 3: Ancestor – Parent – Child Model

One of the challenges of working with recursive data is finding out where a row belongs. If you have to follow the relationship up through 20 levels of data it’s going to take a while to figure out who the ultimate parent is. To solve this, I can store the root node or “ancestor” on each row. That would make the data look like this:

| ancestor_id | task_id | parent_task_id | task_name                                     |
|          40 |      43 |             40 | Child Task 2 of Parent Task 1                 |
|          40 |      42 |             40 | Child Task 1 of Parent 1                      |
|          41 |      41 |              0 | Parent Task 2                                 |
|          40 |      40 |              0 | Parent Task 1                                 |
|          41 |      44 |             41 | Child Task 1 of Parent Task 2                 |
|          41 |      45 |             41 | Child Task 2 of Parent Task 2                 |
|          40 |      46 |             43 | Child task 1 of child task 1 of parent task 1 |
|          40 |      47 |             43 | Child task 2 of child task 1 of parent task 1 |
|          40 |      48 |             43 | Child task 3 of child task 2 of parent task 1 |

In this way I no longer have to follow the relationships to find out where the starting point (the ancestor) is because the application is responsible for maintaining this data on the row itself. With this I can easily identify all of the child tasks that belong to parent task 40 (including the parent task of 40 itself) by running a query to return all rows with that ancestor_id.

This solves part of the problem (going up the path) but not all of it. For example, even if I know the ancestor, and I know all of the child tasks that belong to that ancestor, I don’t know the topology of those tasks. The topology is the layout of the tasks, showing the entire structure from top to bottom. If I refer back to the screenshot towards the top of the screen I can see how the tasks are indented based on their level. The indentation also shows which tasks are siblings (occur at the same level) as well as the parent and child status. So I’m still not quite ready to accept this design for my application.

For what it is worth, I did use this design for a real-world project for a major manufacturing company, and they are very happy with the results. I created a report from their SAP system that allowed them to determine in 30 minutes or less the answer to questions that took SAP over eight hours to generate. Cool stuff. 8-)

Task Table Design: ???

There is a better way to store recursive data, and I’m going to detail that in the next post in this series. Here is a hint: If you are working with phpBB3 then you are already using this solution. Details next time. 8-)

Related Links


  1. phpBB 3? Do you mean the mysterious left and right IDs in the forums table ??!!

    Comment by Dog Cow — July 30, 2008 @ 9:58 am

  2. Yup, that’s exactly what I mean. The next post will clarify how those values are assigned, and how you can write queries to interact with them. Fun stuff. :)

    Comment by Dave Rathbun — July 30, 2008 @ 12:56 pm

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress