Over the years that I have been involved with phpbb.com I have seen a number of posts – not that frequent, mind you, but more often than I would expect – asking how to “renumber my users” or something like that. It seems that folks are bothered by the fact that the first user_id is 2, or sometimes by the fact that there can be gaps in the sequence of user_id values. Why worry? Your database doesn’t.
You see, the user_id, topic_id, post_id, forum_id, and others are all primary key values. They are generated automatically by the system (user_id is a bit different from the rest; more on that in a bit) and used only to connect bits of information together. The user_id is meaningless to a person. But it means quite a bit if you want to connect a user to a group via an associative table as I covered in the first post in this series.
Centuries ago the world was thought to be flat. It turns out it wasn’t. Decades ago most databases were flat. As in flat text files. You had COBOL coders reading and writing ISAM files to mainframe disk packs or tape drives. In order to find anything, you had to know the record size and offset so you would know how far to read in order to retrieve your data. This worked okay for tapes which were essentially a linear data stream. It was okay for flat files stored on disk as well. But eventually someone figured out there was a better way to store data, and relational databases were born.
A relational database today is all about, well, relationships. You may have heard of an ER Diagram? The “ER” stands for Entity-Relationship. An entity is something like a forum or a topic or a user. A relationship is how two of those things get put together. For example, a user posts a topic, or the inverse, a topic is posted by a user. In order to maintain that relationship we need something unique about each user and about each topic. So now we’re ready to talk about primary keys.
A primary key is something that uniquely identifies an entity. In many cases the primary key is a single value (like user_id for a user) but it doesn’t have to be. For an example you would see a compound key if you review the phpbb_search_wordmatch table (an associative table between posts and search words). There is no single primary key for the table… instead it is the combination of post_id and word_id that is unique. For simple systems like phpBB it’s probably worth the extra effort to assign a single-column primary key for each table. For extremely complex systems it is often necessary to create multiple-column primary keys.
Ironically there is no primary key defined on (post_id, word_id) in the database for phpBB. The application doesn’t need it, strictly speaking it should be present to enforce data integrity rules.
Why is a primary key important? Why not just use the user’s username? That’s a decent question, and there are two very important reasons for not using something like a username as a primary key. First, it’s character data. Linking one block of character data to another block of character data requires a lot more bytes of traffic than linking two compact numeric values. The user_id in phpBB is defined as mediumint(8). This takes far less space to process than 25 characters of string data. An index on a numeric field is smaller and therefore much more efficient than an index on character data.
Second… and now we’re getting back to the opening paragraph from this post… a primary key should never ever change its value! Never. Never ever. By definition a primary key is used to relate that entity to other entities in your database. If I change that key value, I have to go through the entire database and change the key everywhere it exists! If you miss something, then you end up with what are called “orphan records” where you have parts of your database that don’t link up to anything else. That’s a “Bad Thing” to have. Usernames can change. A user_id should not.
So what is the solution? What if you wanted to have a sequential “user number” that did not have any gaps, and that got adjusted every time a user got delected from your database? That user number – like the username, email address, web site, and so on – should become an attribute of the user (entity). Attributes are pieces of data that we collect about an entity and store in our database. If it was really important for some reason to have a sequential user number then here’s how I would see that working:
- Assign the next user_number on registration using a process similar to that used for user_id
- Anytime a user is deleted execute something like the following:
- get user number for deleted user
- update users_table
set user_number = user_number – 1
where user_number > deleted_user_number
Is it worth it? Maybe. It depends on how important it is to you. Frankly I don’t see the need, or the point. But it should be done with a separate field, and never with the user_id primary key value.
I mentioned earlier in this post that the process used to generate user_id values is not the same as other primary keys within phpBB. Most of the primary keys are generated (on MySQL) via an auto-increment attribute. In other databases it would be done via a sequence. What this means is the application (phpBB) is not responsible for generating primary key values; the database will do that process automatically. This sounds good, so why isn’t the user_id handled in the same way? For whatever reason the “Anonymous” user has a negative user_id (-1). I am told that in earlier versions of MySQL (the most popular database used to run phpBB) an auto-increment attribute column must be numeric and unsigned, meaning it can only store positive numbers. Later versions of MySQL apparently don’t have this issue, as I was able to create a table with a signed integer key and insert negative values. The auto-increment, however, started with the first positive value rather than using any negative values.
We have here at the phpBBDoctor web site an online table reference. It’s a bit out of date, and that’s one of the projects that I intend to get around to on the first.* There is a link in the links section at the end of this post, and I encourage you to check it out if you haven’t seen it before. Even if it’s slightly out of date (as I write this, it could be updated soon), it is a more than adequate reference for the database design for phpBB2. (No work has started on an equivalent for phpBB3 as of yet.) One of the things that the reference shows is which columns are primary keys as well as which columns are foreign keys.
What’s a foreign key? Simply put, it’s a primary key value stored in another table in order to create the relationship. The user_id is a primary key for the phpbb_users table. It is a foreign key (poster_id) in the phpbb_posts table. That’s how we know which user entered the post. Another foreign key is topic_poster in the phpbb_topics table. That’s used to record which user started the topic. Why store the topic_poster in the topics table? The poster_id is stored on the first post in the topic, and we know which post is the first in the topic by looking up the topic_first_post_id from the topics table, so why do we need it? Speed.
For a pure database design you would not store the topic_poster on the phpbb_topics table. It’s done in phpBB for performance reasons, and is a process called Denormalization. That’s a topic for another post.
* “On the first” means on the first chance I get.