I want to follow up on my earlier article about how to do CAPTCHAs without images, for accessibility and usability. In that article I hoped my simple scheme would deter dumb robots, cutting down on the bulk of the comment spam I was getting.
It did cut out the comment spam — for a while. After a while, though, I started to get some spam again. I could see the spammer was either doing it manually, or had figured out that my form submission included the ID of the question I’d asked (wow, you spammers sure are smart). I tried changing the questions once or twice, thinking they might be automated set-and-forget spambots that would not get updated for a while. This seemed to have no effect at all. “Alas,” I thought, “I’ll just put up with a few spam comments every now and then.” Then it became 10 a day. That bothered me, but still not enough to do anything about it.
But then some @$$ started really slamming me. It was all about buying medications and online poker and so forth. It’s funny how easy it is to detect which spammer is which by the message style, too. The phrasing is always the same. I could tell this one was new. And I was getting hundreds of spam comments a day.
I tried a Bayesian filter plugin for WordPress briefly, but it didn’t work quite right and I didn’t have enough time to learn about WordPress’s plugin architecture to fix it. During that trial, comments were totally disabled on the blog. I couldn’t let that continue, so I uninstalled the plugin and kept moderating while I hoped for a few spare minutes to find a fix.
Meanwhile, even my posts about not using image CAPTCHAs were getting slain. Oh, the irony!
Finally, I managed to find a half-hour to tweak my image-less CAPTCHA system. Instead of posting back which question was on the comment form, I made it set up a session and store which question, like traditional CAPTCHA systems. I really didn’t want a heavy-weight solution where I stored the information in an actual session or in a relational database. I wanted it to be just enough to deter spammers, as before. So this time I used some encryption, some randomization, and a known bit of data that changes frequently — though I won’t say what that is — to generate a passkey and put it in a cookie. The cookie is valid only for one request, and is time-sensitive too. And since the secret changes frequently, hopefully it’s not obvious how this all works (though, as before, it wouldn’t be too hard to figure it out if you approach it from a “what would be easiest for him to do” point of view).
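To make the idea concrete, here is a minimal sketch of one way a cookie like this could work. It is not the code running on this blog: it swaps the encryption described above for an HMAC signature, and the names `issue_token`, `verify_token`, `MAX_AGE_SECONDS` and the in-memory `_used_nonces` set are all hypothetical. The secret is simply passed in as bytes, since the post deliberately does not say what it is. The cookie binds the expected answer to a random nonce and a timestamp, so the form never has to reveal which question was asked.

```python
import hashlib
import hmac
import secrets
import time

MAX_AGE_SECONDS = 600   # assumed lifetime; the post gives no figure
_used_nonces = set()    # hypothetical in-memory record of spent cookies


def _sign(answer, nonce, issued, secret):
    """HMAC over the normalized answer, the nonce and the issue time."""
    message = f"{answer.strip().lower()}:{nonce}:{issued}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def issue_token(answer, secret, now=None):
    """Build the cookie value to send along with the comment form."""
    issued = int(time.time() if now is None else now)
    nonce = secrets.token_hex(8)  # the randomization
    return f"{nonce}:{issued}:{_sign(answer, nonce, issued, secret)}"


def verify_token(token, submitted_answer, secret, now=None):
    """Accept once, and only if the answer matches and the cookie is fresh."""
    now = int(time.time() if now is None else now)
    try:
        nonce, issued, mac = token.split(":")
        issued = int(issued)
    except ValueError:
        return False  # malformed cookie
    expected = _sign(submitted_answer, nonce, issued, secret)
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return False  # wrong answer, wrong secret, or tampered cookie
    if not 0 <= now - issued <= MAX_AGE_SECONDS:
        return False  # expired
    if nonce in _used_nonces:
        return False  # already used for one successful request
    _used_nonces.add(nonce)
    return True
```

A page handler would call `issue_token` when it renders the form and `verify_token` when the comment comes back. Note that the single-use check in this sketch needs a little server-side memory, so it is only "a bare minimum of statefulness", and it forgets everything on restart.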
Basically, I went from a stateless and easily hackable system to one with a bare minimum of statefulness, and I guess it was just enough to foil the spammers. I haven’t gotten a single spam comment since. This is like being in an airplane and putting noise-blocking headphones on. It is blessed, blessed relief. And I’m really happy because I seem to have found — at least temporarily — just the right balance between people and robots.
Now, the only thing left is to wait and see how long it takes the spammers this time. I promise, I will fight image CAPTCHAs till the last resort is exhausted. Who knows how it will work out? I’ll certainly let you know!
That’s bizarre! I would have guessed a custom-built CAPTCHA would be just a little too high of a threshold. That’s a little frightening, and a little puzzling. I’m very surprised that anyone would bother to build a custom spamming tool (or whatever they use) just for one blog.
Of course, your blog probably has a higher Google rank than the easier-to-attack blogspot blogs, and is therefore perceived as worth more overall to comment on.
I’m getting a spam comment or two once or twice a day, so even this isn’t a completely adequate system. But it’s a lot better than hundreds a day. Whew.
Hello Xaprb. Although this is a decent imageless CAPTCHA system, spambots can still bypass most of your filters by googling “question answer” and determining which answer gets the most results.
That’s a good point. I hadn’t thought of that. I suppose it would be possible to find combinations of words, or phrase the question carefully, so the correct answer isn’t the one with the most Google hits. For example, “lion cat” has 17+ million results, but “lion tree” has 19+ million, so that one already works OK.
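A rough way to screen questions against that attack, sketched in Python: given hit counts for each candidate answer, check that the correct answer is not the one with the most hits. The function name `is_search_resistant` is made up for illustration, and the counts are just the rounded figures from the lion example; nothing here actually queries a search engine.

```python
def is_search_resistant(hit_counts, correct):
    """True if a bot picking the most-hit answer would guess wrong.

    hit_counts maps each candidate answer to the number of search
    results for "<question word> <answer>"; gathering those numbers
    is left out of this sketch.
    """
    best_guess = max(hit_counts, key=hit_counts.get)
    return best_guess != correct


# Rounded figures from the example: "lion cat" vs. "lion tree".
lion_hits = {"cat": 17_000_000, "tree": 19_000_000}
resistant = is_search_resistant(lion_hits, correct="cat")
```

Here `resistant` comes out true, because the most-hits guess would be "tree". Hit counts drift over time, so a check like this would have to be repeated now and then.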
I think a CAPTCHA is like a random number: really, really hard to get right. At this point I have something that’s working well enough to keep me from getting swamped with spam, even though it’s not very good in absolute terms. If I start getting a lot of spam comments again, I’ll switch to an image CAPTCHA, because I feel I’m re-inventing the wheel badly. I’m not an expert in foiling spambots, but the spambots seem to be expert at foiling me.
Could some of the spam be hand-entered?
I assume at least some of it is, though I could be wrong. I suspect GUI scripting is responsible for much of it, and perhaps the script is recorded from a hand-controlled session.
Xaprb, are you sure?
No. I could probably look in my web server log files and get a better idea though. But really, tools to automatically drive a genuine browser are everywhere. Every now and then I get some spam that looks like someone just testing the system out — a real human, that is. Empty comments, or comments with asdfasdf in them (but a handful at a time, not just one). It’s really interesting to observe the patterns.
At this point I get about one or two a day, which is much less than the hundreds a day I was getting when I had a few percent of the traffic I have now, so I’m fairly happy with this system. I have considered upgrading it though. (By upgrading, I mean using something I didn’t hand-roll).
CAPTCHAS are not that bad. Because of them I moved from BlogSpot to WordPress and I think it was for good ;-)
Your “logic puzzle” approach is good. And maybe it could be generalized and automatically generated (but not solved, I hope). Instead of a question / answer scheme you could use something like this.
Choose one:
— o Yellow
— o Apple
— o Red
where one option belongs to a domain F, n>1 options belong to a domain T, and F is perpendicular to T. Well, in the example I cheated a little, because I started with Water instead of Apple, but then I realized that there is added trouble for automation if all the pairs work equally well.
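As a rough sketch of how such puzzles might be generated automatically: pick two different domains, draw several options from one and a single odd option from the other, then shuffle. The word lists, the `DOMAINS` table and the `make_puzzle` function are invented for illustration, and the sketch assumes no word appears in more than one domain.

```python
import random

# Hypothetical word lists; each word must belong to exactly one domain.
DOMAINS = {
    "colour": ["Yellow", "Red", "Green", "Blue"],
    "fruit": ["Apple", "Pear", "Plum", "Cherry"],
    "body part": ["Hand", "Eye", "Hair", "Knee"],
}


def make_puzzle(domains, n_common=2, rng=random):
    """Return (options, odd_one_out) for a "choose one" puzzle.

    n_common options come from one domain (T) and a single option
    comes from a different domain (F); the order is shuffled.
    """
    common_name, odd_name = rng.sample(sorted(domains), 2)
    options = rng.sample(domains[common_name], n_common)
    odd = rng.choice(domains[odd_name])
    options.append(odd)
    rng.shuffle(options)
    return options, odd


options, answer = make_puzzle(DOMAINS)
```

The form would display `options` and the server would keep `answer` to itself. With small fixed word lists a bot could learn the domains over time, so the lists would need to be large or rotated.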
I would say Andrea’s approach is good:
Which one doesn’t belong here:
1) Hand
2) Eye
3) Shoe
4) Hair
3, as it is not a body part. So you move to matching characteristics of objects (as in IQ tests), instead of asking for a characteristic of an object (“A lion is a … (cat)”).
What about reversing questions:
What do men like the least -
1) Coffee
2) Women
3) Dishes
Or asking (parts) of a saying (with related incorrect answers):
The best invention since:
1) Wheel
2) Sliced bread
3) Plane
Or combinations:
What is blue:
1) Grass, meat
2) Ocean, mood
3) Men, women
And about your initial system being cracked: what about the spammer reading this blog, entering some posts to learn from it? Good training exercise - in case your system becomes more widely adopted….
But good programming! Thanks for providing it to us with comments.
I like Trios’s question; it took me a while to work out what is blue. Brain twisting and spam fighting as well. As Tim mentioned earlier, since your blog has a Google page rank of 5, it will obviously attract more spammers than usual. Maybe there is a way to turn off the ranking indicator… perhaps someone can shed some light here.
You can tank your blog’s Google ranking by setting up a second, almost identical blog (a mirror), with JavaScript-generated absolute links to this blog and hard absolute back links.
Google will eventually find out that both subsites are identical, and it is to be hoped that it will eliminate the correct one from its results.
One way to achieve this is to use a second domain name for the same blog and adjust the absolute links to fool Google. Hopefully Google will not resolve the IP address. If so, some more clever DNS record manipulation may do the trick (2 IP addresses as well).
Come to think of it, since Google is perfectly able to distinguish between user sites that are subdirectories of the same domain, you could just create two subsites in the same domain. One gets the Google rating and soft-redirects to the other, which also contains the more sensitive info. Unfortunately this will not keep hostile humans out :)
Oh, and one important detail: the mirror site that gets the highest ranking should omit the most sensitive information.
Did anyone mention that the brain teasers actually have a fun factor to them? It is certainly more fun to do one of these than to figure out twisted, bent, twirled, chiseled CAPTCHA character strings.