Comments July 22, 2009

Just How Bad Is The gmail.com Problem?

July 22, 2009

Not too long ago I participated in a topic at phpbb.com where the author was asking about blocking gmail email addresses. The general consensus from the community was that the board owner should not block gmail but instead rely on some other methods for blocking spammers. I don’t block gmail, but sometimes I would like to. In this post I think I summarized it best, saying:

hotmail, yahoo, gmail… any free email account is subject to abuse. Spammers are using the fact that board owners are, as you are, reluctant to ban gmail outright because it does have so many legitimate users.

Having said that, I decided it was time to go back and work through some numbers. Instead of guessing how bad the problem is, I wanted to get actual statistics to back up my claims. Anyone can say anything they want. :) Having numbers makes the claims more substantial. And graphs. Pictures are always good. The data used for this post is available as an Excel file for anyone to download and review (link at the end of the post). Here’s the summary:

Google: Your gmail system is borked. Fix it or risk it becoming irrelevant.

Logging Registration Attempts

I have written more than a few posts about my simple Checkbox Challenge MOD. I use it for board registrations as well as comment forms. For this post I am going to concentrate only on registration attempts at my largest phpBB board. I will use registration attempts from January of 2008 through June of 2009 (eighteen months).

For the first step, I ran some preliminary queries to identify the top five domains used. There are plenty of obvious spammer domains out there but that isn’t the point of this post. I know that mail.ru and gawab.com are the source of a lot of spam already. I also can recognize that domains like nastyteengirl.info and onlineovernightpharmacy.com are probably not legitimate. The point I want to drive home is how bad things are for mainstream domains, and for gmail.com specifically. In order to do that I want to focus only on the domains that are the source of higher volumes of registration attempts.

The top five domains and the total registration attempts are shown here.

Domain          Total Attempts   % of Total
gmail.com                12909          61%
yahoo.com                 2968          14%
mail.ru                   2704          13%
hotmail.com               1606           8%
aol.com                    843           4%

Notice that gmail is not only number one; it is in that position by a really large margin. No other email domain comes even close. My first piece of evidence clearly shows that gmail is a popular domain. It is so popular that if I were to consider banning or blocking it, I might lose 61% of my new members. But wait, is that really true? How many of those registration attempts were successful, and how many were blocked as bots?
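For anyone who wants to reproduce the percentage column, here is a minimal Python sketch. It assumes the percentages are shares of the five-domain total (the counts above add up to 21,030), which is consistent with the table. The function name `domain_shares` is only an illustration and is not part of any tool used for this post.

```python
# Attempt counts copied from the top-five table above.
attempts = {
    "gmail.com": 12909,
    "yahoo.com": 2968,
    "mail.ru": 2704,
    "hotmail.com": 1606,
    "aol.com": 843,
}


def domain_shares(counts):
    """Return each domain's share of the combined total, as a percentage."""
    total = sum(counts.values())
    return {domain: 100.0 * n / total for domain, n in counts.items()}


shares = domain_shares(attempts)
for domain, pct in shares.items():
    print(f"{domain:<12} {attempts[domain]:>6} {pct:5.0f}%")
```

Keep in mind that these are shares of the top five domains only, not of every registration attempt in the log.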

Checkbox Challenge Data Collection Process

My Checkbox Challenge code presents a user with a standard registration form as well as a series of checkboxes. The user is instructed to click on only the marked checkbox in order to prove they are human. The development is well documented in other posts on my blog, so I won’t go into great detail here. Suffice it to say that bots seem to either ignore all of the checkboxes because they don’t expect them to be there, or they attempt to be smart and mark all of the checkboxes since they know they’re on the form. There are some humans who have issues with the system and might take multiple attempts to get through the screen, but those situations are not very common, and for the sake of this post I will assume they don’t exist. Every attempt is logged, and it is that table that I am using as source material for this blog post.
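To make the pass/fail rule concrete, here is a hypothetical Python sketch of the check described above. It is not the actual MOD code (that lives inside phpBB). The function names, the checkbox names, and the in-memory log are all assumptions made for illustration.

```python
def passes_checkbox_challenge(checked_boxes, marked_box):
    """Pass only when the marked checkbox, and nothing else, is checked."""
    return set(checked_boxes) == {marked_box}


def log_attempt(log, email, checked_boxes, marked_box):
    """Record one registration attempt (hypothetical log format) and return the result."""
    domain = email.rsplit("@", 1)[-1].lower()
    passed = passes_checkbox_challenge(checked_boxes, marked_box)
    log.append({"domain": domain, "passed": passed})
    return passed


log = []
# A human who follows the instructions checks only the marked box.
log_attempt(log, "human@example.com", ["box3"], "box3")
# A bot that does not expect the checkboxes leaves them all empty.
log_attempt(log, "bot1@example.com", [], "box3")
# A bot that tries to be smart checks every box on the form.
log_attempt(log, "bot2@example.com", ["box1", "box2", "box3", "box4"], "box3")

for entry in log:
    print(entry)
```

Both bot behaviors fail the same test, because the rule is "exactly the marked box" rather than "the marked box is among those checked".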

I listed the top five domains above. For the rest of this post I am going to drop mail.ru because most board owners know it’s a standard domain used by spammers. I am also going to drop aol.com because at 4% of the total registrations it’s not that relevant. That leaves me with three remaining domains to focus on: gmail.com, yahoo.com, and hotmail.com. (If you’re wondering who is in position six, it was gawab.com, which is another notorious spammer domain.)

Who’s Your Bot?

Any registration attempt is a potential board member. The concept behind most any anti-spam measure is to allow real people through and block bots. I have already established that gmail is by far the number one source of registration attempts. The next step is to evaluate how many of those attempts are desirable new users, and how many are bots. To do that, I retrieved the last 18 full months of data and determined the percentage of successful versus failed registrations. Here are those numbers for the three domains I have decided to focus on for this post.

Domain          Success Failed  % Success
gmail.com          5644   7265      43.7%
yahoo.com          2372    596      79.9%
hotmail.com        1384    222      86.2%

Now we start to see the real problem. Both yahoo and hotmail have success rates of roughly eighty percent or better. That means that at least eight out of ten registration attempts from those domains are expected to be legitimate and valuable users. With gmail, over half of the registration attempts fail and therefore are presumed to be bots. Not only is gmail the number one source for registration attempts, it is the worst source in terms of the human to bot ratio.
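The success percentage is simply successes divided by total attempts. A short Python sketch, using the counts from the table above (the helper name `success_rate` is an assumption for illustration):

```python
# (successes, failures) per domain, copied from the table above.
results = {
    "gmail.com": (5644, 7265),
    "yahoo.com": (2372, 596),
    "hotmail.com": (1384, 222),
}


def success_rate(success, failed):
    """Percentage of registration attempts that passed the challenge."""
    return 100.0 * success / (success + failed)


rates = {domain: success_rate(s, f) for domain, (s, f) in results.items()}
for domain, rate in rates.items():
    print(f"{domain:<12} {rate:5.1f}%")
```

Note that "success" here means "passed the challenge", so the rate is only as good as the assumption that failures are bots and successes are humans.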

Is Google Doing Anything To Help?

Given that these numbers start in January of 2008, the next question I want to answer is whether the problem is getting better or worse. I have to believe that Google is aware of the issues that they’re facing. Are they doing anything to help?

Here are the gmail numbers broken down by month.

Log Month        Domain         Success Fail
 2008-01        gmail.com           297   79
 2008-02        gmail.com           260   42
 2008-03        gmail.com           320   94
 2008-04        gmail.com           293  107
 2008-05        gmail.com           290   65
 2008-06        gmail.com           286  139
 2008-07        gmail.com           395  147
 2008-08        gmail.com           346  380
 2008-09        gmail.com           316  398
 2008-10        gmail.com           283  561
 2008-11        gmail.com           316  367
 2008-12        gmail.com           254  484
 2009-01        gmail.com           291  898
 2009-02        gmail.com           343  510
 2009-03        gmail.com           346  808
 2009-04        gmail.com           330  981
 2009-05        gmail.com           291  614
 2009-06        gmail.com           387  591

Here are a few things that I find interesting about these numbers. First, for the past 18 months I have averaged about 314 new members (successful registrations) per month from gmail. That number is remarkably consistent, as shown by this graph. The blue line shows the raw data, and the orange line shows the trend.

trend graph for successful registrations

Here is the graph for failed registration attempts from gmail.

trend graph for failed registrations

In this case the red line represents the data and the black line is the trend. The trend is not my friend in this case. Pay careful attention to the scale of those two graphs. While they are presented as the same size (approximately 400 pixels square) the top graph (successes) has a maximum scale of 450 while the bottom graph (failures) goes all the way up to 1200. Here’s a combined graph without trend lines that will help drive that point home.

graph for all registration attempts
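A straight-line trend like the ones in the first two graphs can be approximated with an ordinary least-squares fit. The sketch below is an assumption about the method, not a description of how the graphs were actually produced, and `trend_slope` is a hypothetical helper.

```python
# Monthly gmail.com counts, 2008-01 through 2009-06, from the table above.
successes = [297, 260, 320, 293, 290, 286, 395, 346, 316,
             283, 316, 254, 291, 343, 346, 330, 291, 387]
failures = [79, 42, 94, 107, 65, 139, 147, 380, 398,
            561, 367, 484, 898, 510, 808, 981, 614, 591]


def trend_slope(values):
    """Least-squares slope of values against month index 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


success_slope = trend_slope(successes)
failure_slope = trend_slope(failures)
print(f"successes: {success_slope:+.1f} per month")
print(f"failures:  {failure_slope:+.1f} per month")
```

With the monthly figures above, the fitted slope for failures comes out many times steeper than the slope for successes, which is the same story the two graphs tell once you account for their different scales.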

The data does not look good for Google. Sometime back in 2008 (it looks like August for me) the number of valid registrations and bot registrations were about the same. Prior to that date, bot registrations were in the minority. After that date the bot usage of gmail.com has clearly soared. In February of 2009 (2009-02 on the graph) there was a dip in bot usage, at least on my board. Was it a result of something Google did? If it was, it clearly was not very successful in the longer term as bot usage popped right back up in the following months.

Here’s another chart that shows the value of gmail to me as a board owner. This is a percentage column chart so it ignores the overall numbers and instead presents the data as percentages.

percentage graph for gmail.com registration attempts

Just how significant is this? Earlier in this post I noted that for the past 18 months the average success rate for a registration attempt from a gmail.com email address was 43.7%. If I recalculate the value for the past six months it drops to 31.1%. That’s not good. Is it fair to pick on Google? During the same time that the success ratio for gmail has dropped from 43.7% to 31.1% (a difference of 12.6 percentage points) yahoo has dropped 2.4 points and hotmail has dropped 3.1 points. In other words, all of the top three domains have seen the ratio of legitimate registrations to bots drop, but the ratio for gmail has dropped at least four times as much as either of the other two.
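The six-month figure can be checked directly against the monthly table. Here is a minimal Python sketch that pools the counts for all 18 months and for the last six (the helper name `success_rate` is an assumption for illustration):

```python
# (month, successes, failures) for gmail.com, copied from the monthly table.
monthly = [
    ("2008-01", 297, 79), ("2008-02", 260, 42), ("2008-03", 320, 94),
    ("2008-04", 293, 107), ("2008-05", 290, 65), ("2008-06", 286, 139),
    ("2008-07", 395, 147), ("2008-08", 346, 380), ("2008-09", 316, 398),
    ("2008-10", 283, 561), ("2008-11", 316, 367), ("2008-12", 254, 484),
    ("2009-01", 291, 898), ("2009-02", 343, 510), ("2009-03", 346, 808),
    ("2009-04", 330, 981), ("2009-05", 291, 614), ("2009-06", 387, 591),
]


def success_rate(rows):
    """Percentage of attempts in `rows` that succeeded, pooled across months."""
    success = sum(s for _, s, _ in rows)
    failed = sum(f for _, _, f in rows)
    return 100.0 * success / (success + failed)


overall = success_rate(monthly)
recent = success_rate(monthly[-6:])
print(f"18 months: {overall:.1f}%")
print(f"last 6 months: {recent:.1f}%")
print(f"drop: {overall - recent:.1f} percentage points")
```

The rate is computed from pooled counts rather than by averaging the monthly percentages, so busy months carry more weight than quiet ones.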

What Can I Do About gmail.com?

New board members are important. Without new members a community will start to get stagnant, and a stagnant community typically doesn’t thrive. As I mentioned earlier, I get an average of over 300 new members a month from gmail.com alone. For the past 18 months I have averaged 751 new members each month, and 314 or 42% of those are from gmail.com email addresses. If I were to consider banning gmail.com that’s a large chunk of my community that would disappear. I don’t think that’s a realistic action to take.

What Should I Do About gmail.com?

I think that Google should be held responsible. I can take individual steps that impact my board… Google can (and should) take steps that will protect everyone on the Internet. Am I overstating the problem? I really don’t think so. All of the numbers I have used for this post came from registration attempts on my largest (and most active) phpBB board. Here are some other numbers to chew on. All of these have been filtered to show only log entries with gmail.com email addresses.

Site Comment Form
Total attempts: 10,441
Total rejected: 10,381
Bot percent: 99.4%

Another phpBB Board
Total attempts: 2,767
Total rejected: 2,723
Bot percent: 98.4%

Still Another phpBB Board
Total attempts: 1,859
Total rejected: 1,843
Bot percent: 99.1%

What conclusion do I draw from these numbers? I submit that the problem is even worse than it appears based on the details I provided in this post! The numbers I used come from an extremely active board. Registration bots don’t pay too much attention to how many legitimate users are already registered on a board. The only goal of a bot is to find a board and register. For a smaller board this means the problem is even worse. My big board didn’t start out big. In the early days we got about 10-20 new registrations each month. Today I get more than that in one day. Because I get so many new legitimate users, it can actually mask just how bad the gmail problem really is. If you are a smaller board owner, having thousands of bogus gmail registrations can be extremely frustrating. If I didn’t have something in place that was – at least for now – somewhat effective in blocking these bogus attempts, I would very seriously have to consider blocking gmail accounts.

The problem is not new. While researching to see if I was the only one impacted by this (of course I am not) I found a post that shows how bots break the gmail CAPTCHA, and the post was from February of 2008. As we have long discussed on phpbb.com there are also services that will put real people to work breaking confirmation codes. I linked a few articles at the end of this post, and most of them are over a year old. The situation hasn’t improved since then either. If anything it has become much worse.

Google, are you listening? It’s time to fix this.

  1. All Google has done with their CAPTCHA is make it more and more unreadable; otherwise I haven’t seen much action from them. It really shows how bad the spam problem is getting.

    Great article drathbun!

    Comment by onehundredandtwo — July 23, 2009 @ 1:21 am

  2. A very comprehensive article Dave,

    It is more than obvious that there’s a growing problem affecting the internet community as a whole.

    Have you considered getting in touch with Google about this?


    Comment by Dogs and things — July 24, 2009 @ 7:38 am

  3. That’s one of my other BIG complaints about Google. Try to find some means to talk to them… if anyone has a method that gets to a real person, I will be quite happy to forward my data to them. :)

    Comment by Dave Rathbun — July 24, 2009 @ 11:30 am

  4. How about “Report spam, paid links, malware, and other problems to Google”?

    Comment by Dogs and things — July 24, 2009 @ 3:29 pm

  5. Dogs and things, that link appears to have more to do with sites than users of gmail. And it’s not a need to report gmail spam; there is already a means for doing that. What I need is a way to report people that are abusing the gmail service, even if they’re not sending out mail.

    Comment by Dave Rathbun — July 25, 2009 @ 5:07 pm

  6. Well Dave, I figure that via the report link you’ll be able to get in touch with real Google people that likely will be able to recommend you another Googler that you can inform about your findings. Don’t you think?

    Comment by Dogs and things — July 25, 2009 @ 7:32 pm

  7. You might have read in the news a few months ago that Google’s captcha was cracked. The software which does it (the Google captcha cracking) is called XRumer, and it seems to be a pretty popular automated spam tool. It will automatically sign up the email account, then use that account to register at forums.

    Comment by Dog Cow — August 24, 2009 @ 6:46 pm

  8. XRumer is not new; it’s been around for years. From what I read, Google’s CAPTCHA was cracked over a year ago, not just a few months ago.

    Comment by Dave Rathbun — August 25, 2009 @ 10:36 pm

  9. What I’m trying to say is, it is only a recent version of XRumer which cracks Google’s captcha, not older versions.

    Comment by Dog Cow — August 27, 2009 @ 10:33 am

