Home

Your premium source for custom modification services for phpBB

  logo

HomeForumsBlogMOD ManagerFAQSearchRegisterLogin

Comments November 2, 2009

Storing Post Revisions / Post Locking

Filed under: Board Management, MOD Writing — Dave Rathbun @ 1:53 am CommentsComments (10) 

I’ve seen this on other boards but only recently have I started seeing it on my own: people that edit the first post (or potentially even all of their posts) of a topic and remove all of the content. They might leave behind something like “…” because as we all know you can’t have a truly empty post. The net result is the topic is then worthless because nobody knows what we’re talking about.

With phpBB3 the moderator team can lock a post to prevent further editing. But once the original content is gone it doesn’t help. So tonight I started thinking about how and where to store post revisions in order to recover from this sort of action.

Defining the Problem

Here’s what I want to be able to do:

  1. Capture the prior text of any edited post and store is somewhere
  2. Track who last edited a post
  3. Give moderators the ability to review the post edit history and revert back to an older version
  4. Provide the ability to lock a post so further editing (except by moderators) is no longer possible

This MOD is still a work in progress so I don’t have final code to share yet. But there are a few interesting wrinkles that I thought about during the design process that I thought I would share.

Locking Posts

Locking posts is one of the easier parts. I added a status field to the phpbb_posts table. For topics there is a topic_status field that contains several different status values. I don’t need anything that complex so I called my new field post_locked and made it a tinyint(1) unsigned with a default of zero. That means that every new post that gets inserted is going to default to unlocked. Of course I added the new field to the insert statement rather than rely on the database to provide the default for me.

Once the field is present in the table I have to check it. At the beginning of viewtopic.php there are a number of authorization checks that determine which buttons / icons are displayed on the post. If the post is locked I do not display the edit button; that’s simple enough. However, what if someone creates their own URL rather than clicking the button image? In that case I have code at the top of posting.php to see if the requested function is “editpost” and then check to see if the post_locked field is set to 1 instead of 0, and if so I reject the edit attempt.

At this point I have not decided just how moderators will lock / unlock posts. One easy way would be to add a lock / unlock button on each post that moderators can see and use. However, I currently envision the locking process only being used when a board member has abused their edit permissions. That means it would be more efficient to provide my moderators a way to lock a post as they are reverting the post to a prior version.

Who Edited The Post?

With a default phpBB2 installation we don’t store who edited a post. In fact no edit history is kept at all in many cases. For example if a moderator edits a post belonging to someone else, it does not trigger the edit history. If a normal user edits a post that is the last post of a topic, that does not trigger the edit history either. The post edit history is only stored when a user edits their own post after at least one reply has been made. Since we only store the fact that a user edits their own post, there is no user_id stored. The code stores the last edit date/time as well as incrementing the overall edit count. That’s it.

To address this I added a new field called last_edit_user to the phpbb_posts table. I altered the edit history so that it runs every time a post is edited instead of only when a user edits their own post. I already have a post notes feature that records who edited the post and when. But this additional step will store the last edit user (and only the last edit user) on the post itself which means I don’t have to join out to my post notes table. I made a slight adjustment to the viewtopic.php code so that the “Last edited by…” message now includes the true user name for all edits.

What Was Changed?

This was the fun part: how do I store revisions of the post? After thinking through this and considering a number of esoteric ways to store differences in the post text I threw up my hands and decided simply to copy the entire text of the post to a new row. Why? I’m using a fraction of my disk space. I can also take advantage of the fact that in phpBB2 the post table and the post text table are separate. So here’s what I did to do that.

First I added a new field to the phpbb_posts_text table called post_version that is manditory (not null). It stores an unsigned integer value. The current version of the post text is always version zero in this table. Second, I went through the base phpBB2 code and added the following to every SQL statement that joins to the phpbb_posts_text table:

AND pt.post_version = 0

There were about 20 files that needed to have this change, but in the grand scheme of things it was a low-impact update. Once I added the column to my table and updated the code everything ran perfectly. The next step is to figure out what to store in this new field.

Tracking Versions

For most of what goes on during the post processing on a phpBB board there isn’t much to change other than the extra join column. I only ever want to search the current version. I only want to display the current version on viewtopic.php. In fact, every time I reference the phpbb_posts_text table I want only the current version… unless I am a moderator that needs to review the actual post history. Why did I make the current post version zero instead of taking the maximum number? Simple. Every post has at least one version to start with, and that version will always be zero. By ensuring that the “current” version of the post text is always stored as version 0 my join logic is extremely simple.

But what does the version number mean then? In this model, I think of the version as a number representing how many “versions ago” the text was posted. Take a post with 5 version rows in the post text table. Version 0 is the currently displayed (and search indexed) version. Version 1 is one version ago. Version 2 is two versions ago, and so on up to version 4 (the original post text) which was four versions ago compared to the current text.

How are these versions created? It’s fairly simple. In the standard phpBB2 code there is a statement that checks to see if the posting process is in edit mode or new post mode. It creates either an INSERT or UPDATE statement based on the mode. All I did was add this check:

        if ( $mode == 'editpost' )
        {
                $sql =  "UPDATE " . POSTS_TEXT_TABLE . "
                        SET     post_version = post_version + 1
                        WHERE   post_id = $post_id
                        ORDER BY post_version DESC";
                if ( !$db->sql_query($sql) )
                {
                        message_die (GENERAL_ERROR, 'Error incrementing post version', '', __LINE__, __FILE__, $sql);
                }
        }

There are a couple of interesting things here. During the edit process I increment the post version for every existing record for the post by one. This means 1 becomes 2, 2 becomes 3, and so on. This generates a SQL error because when 1 becomes 2 and 2 already exists it violates the unique primary key constraint. I added an ORDER BY clause to the update statement to ensure that I start with the end of the chain instead and it fixed that problem. By starting at the end, 3 becomes 4 before 2 becomes 3 and the constraint is never violated.

This may or may not be cross-database compatible; I don’t know which databases allow an update to have an ORDER BY clause.

After the post version numbers are incremented I always do an insert because post version zero no longer exists. That means that posts are never edited… they are always inserts. This means fewer database locks which is also a good thing.

Why Not Store Version On the Post Table?

I am going to anticipate the following question because I think it’s quite logical. In fact, I reviewed this idea myself before rejecting it in favor of what I ultimately decided on. The question?

Why not store the post version on the post table and increment it during edits, rather than using the backwards zero-based logic?

This is a good question, I think. But it’s considered bad form to ever update a database key. Once it’s set it should never change. If the post_id + post_version combination key in the post table ever got out of sync with the post text table I would have a problem. By using only the post_id as the main key and the post_version as a qualifier (and only storing the post_version in one place) I will never have that problem.

Would it have saved me some work? The only benefit is that the post table itself would store an indication of how many versions there are. But wait, I have that already. :) When I altered the edit process to that every edit increments the post_edit_count field I essentially got that key value. Why not use it in my join instead?

I will say it again: updating a key is a bad idea. The post edit count is informative but should not be used as a key. What happens if there is a database hiccup between the update of the posts table and the insert to the posts_text table? The edit count might be updated to 5 but I only have rows 1-4 in my post text table, and because of that the join fails. In the solution I decided to go with, I will only ever have a problem if the initial post insert process fails, otherwise I will always have a post version of zero stored in my table.

What’s Next?

At this point I have created the field needed to allow moderators to lock a post and I have written the code to check this field and prevent users from editing their post once it has been locked. I have not written the code that would allow a moderator to lock / unlock the post.

I have updated the code that tracks edits and added the last edit user to the posts table. I have also updated the code so that every edit – not just the qualifying edits as determined by the standard phpBB2 code – will trigger this code.

Finally, I have altered the posts text table so that it includes a version tag, updated all of the base phpBB2 code to reference version zero, and altered the posting process so that every edit is an insert rather than an edit to the table.

At this point I have opened a discussion with my moderator team to design the interface that they will use to interact with this new information. I need to have some way to indicate to the team that edits have been performed (already in place with the edit post notes). I need to have some way for the team to open the post history and review it. I need to allow them to decide which version of the text to revert to, and I need to provide them the option to lock the post to that text once the revert process is done.

I will be sure to post an update as the discussion moves forward. If anyone has done this before and would be willing to share your own decision ideas / process I would certainly be interested in hearing from you.

10 Comments

  1. I posted a topic to phpBB.com some months ago on this very subject, which you may have read: http://www.phpbb.com/community/viewtopic.php?f=64&t=1573125

    One of the problems is that current RDBM systems (that I know of) don’t have a third dimension of time. They are just row/column-oriented. Thus, you’d have to use a JOIN to another table. Having to deal with a JOIN just introduces another facet of complexity due to the inherent problem that a JOINing two or more large tables presents. Quite frankly, when one is dealing with millions of rows and gigabytes of data, the only solution then becomes to remove all joins and query tables as key-value pairs, otherwise the memory and CPU involved becomes prohibitively expensive.

    Now, one of the bonuses we have on our side is that of frequency of reference. It just like having a collection of volumes in a library. One uses compact shelving to save space because these historical volumes are infrequently accessed. Likewise, we can store all versioned posts as compressed binary strings in a separate table, or even on a separate database server.

    Comment by Dog Cow — November 2, 2009 @ 2:26 pm

  2. Secondly, I don’t agree with how you have designed the versioning system. I think that upon every edit, you should be inserting a new row of data into a post_version table. This table would be denormalized and the data stored compressed as mentioned earlier. Some columns would be post_id, post_version, user_id, topic_id, post_date, post_text, etc…

    Then you could make the primary key on post_id, post_version and query this table separately. Therefore, the posts table and posts_text table would store _only_ the most current post revision, and nothing else. You also would not need the additional field.

    Comment by Dog Cow — November 2, 2009 @ 2:29 pm

  3. I had not read the topic, thanks for the link. As mentioned in the topic, performance is one concern. The way I have implemented it has no negative impact on performance and in fact may have a slight benefit. Inserts can be more efficient than updates since no locking is done. Since every post “edit” is in fact converted to an insert there might be a benefit there. Note I have not done any benchmarking to determine if this is the case.

    However, I don’t like the idea of storing the data in a separate table. It’s different versions of the post text only; the post itself is not changed. Why not store it – in versioned form – in the posts text table? Since there is no performance impact of storing versions and there are certainly advantages (less code to change, easier to retrieve and display the archived versions) I think it’s a good choice.

    Time will tell; I plan to roll out the collection portion of the code this week. Giving the moderators the ability to review / revert code is coming later.

    Comment by Dave Rathbun — November 3, 2009 @ 9:48 am

  4. There is certainly a performance hit every time the table grows. Why have that data stored there if it is so infrequently accessed?

    I must admit, though, that I do like how every post action becomes an INSERT. Nice job on that one. :-)

    Comment by Dog Cow — November 3, 2009 @ 11:44 am

  5. Bigger table != performance problem :) With a unique key of post_id + post_version only one row (or one block, depending on how the database works) is retrieved. It doesn’t matter if it’s 100 rows or 100M rows, a unique key read should be nearly instant. Yes, a unique key index on a larger table is going to be larger than a unique key index on a smaller table, but it’s not a linear performance hit. When I query a single post using the unique key on my post table with over 580,000 posts the query time is not measurable. According to MySQL, everything comes back in 0.00 seconds. The query plan (from the explain command) shows that the primary key index will be used, and because of the where clause exactly one row will be returned. That’s about as fast as it gets.

    Comment by Dave Rathbun — November 3, 2009 @ 9:55 pm

  6. Ok, so what about my second question?

    Comment by Dog Cow — November 4, 2009 @ 12:16 pm

  7. I read it again and I’m not sure what the “second question” is, can you humor me and clarify what it is? :-?

    Comment by Dave Rathbun — November 16, 2009 @ 8:10 pm

  8. Why have that data stored there if it is so infrequently accessed?

    Since you could be storing elsewhere, that is….

    Comment by Dog Cow — November 17, 2009 @ 11:24 am

  9. Because it’s convenient.

    This is all a big experiment, and I may come around to your way of thinking eventually, but right now it makes it very simple to deal with.

    Comment by Dave Rathbun — November 17, 2009 @ 9:52 pm

  10. Fair enough. :-)

    Comment by Dog Cow — November 18, 2009 @ 2:46 pm

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress