Fighting spam and recapturing books with reCAPTCHA

A CAPTCHA is an anti-spam test used to work out whether a request has been made by a human or a spambot. CAPTCHAs no longer seem to be as popular as they once were, as other spam identification techniques have emerged, but a considerable number of websites still use them.

CAPTCHA pictures

Some common examples of CAPTCHAs.

CAPTCHAs can be really annoying, hence their decline in recent years. Take a look at the different CAPTCHAs in the image above. If you had spent 30 seconds filling in a feedback form, would you be willing to try and decipher one of them, or would you just abandon the feedback?

The top left image could be ZYPEB, but it could just as easily be 2tPF8. If you get it wrong, you will usually be forced to do another, which could be just as difficult.

The BBC recently reported that the National Federation of the Blind has criticised CAPTCHAs because of how restrictive they are for the visually impaired. Many CAPTCHAs do offer an auditory version, but if you check out the BBC article (which has an example of an auditory CAPTCHA), you will see that they are nearly impossible to understand.

reCAPTCHA

Luis von Ahn is a computer scientist who was instrumental in developing the CAPTCHA back in the late 1990s and early 2000s. According to an article in the Canadian magazine The Walrus, when CAPTCHAs started to become popular, Luis von Ahn “realized that he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles.”

Anti-spam reCAPTCHA

An example of a reCAPTCHA CAPTCHA.

In order to try and ensure that this time was not wasted, von Ahn set about developing a way to better utilise this time; it was at this point that reCAPTCHA was born.

reCAPTCHA is different to most CAPTCHAs because it uses two words. One word is generated by a computer, whilst the other is taken from an old book, journal, or newspaper article.

Recapturing Literature

As I mentioned, reCAPTCHA shows you two words. One of the images is to prevent spam, and confirm the accuracy of your reading; you must get this one right, or you will be presented with another. The other image is designed to help piece together text from old literature, so that books, newspapers and journals can be digitised.

reCAPTCHA presents the same word to a variety of users and then uses the most common response to work out what the word actually says, which also helps to stop abuse. In a 2007 quality test, standard optical character recognition (OCR) software identified 83.5% of words correctly, a reasonably high amount. The accuracy of human interpretation via reCAPTCHA, however, was an astonishing 99.1%!
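To make the two-word idea a little more concrete, here is a minimal sketch in Python of how such a scheme could work. It is not reCAPTCHA's actual code: the function names, the three-vote minimum and the 60% agreement threshold are all assumptions made up for illustration.

```python
from collections import Counter


def check_submission(control_answer, control_truth, unknown_answer, responses):
    """Accept the user only if the known (control) word is typed correctly.

    If they pass, their guess for the unknown word is recorded as one vote.
    """
    if control_answer.strip().lower() != control_truth.strip().lower():
        return False  # failed the anti-spam check, so the guess is discarded
    responses.append(unknown_answer)
    return True


def resolve_word(responses, min_votes=3, min_agreement=0.6):
    """Return the agreed reading of the unknown word, or None if undecided.

    min_votes and min_agreement are hypothetical thresholds, not real
    reCAPTCHA settings.
    """
    # Normalise answers so "Overlooks " and "overlooks" count as one reading.
    cleaned = [r.strip().lower() for r in responses if r.strip()]
    if len(cleaned) < min_votes:
        return None
    word, count = Counter(cleaned).most_common(1)[0]
    if count / len(cleaned) >= min_agreement:
        return word
    return None
```

Because a guess only counts once the control word has been typed correctly, a spambot entering rubbish never gets to vote on the scanned word.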

According to an entry in the journal Science, in 2007 reCAPTCHA was present on over 40,000 websites, and users had interpreted over 440 million words! Google claim that today around 200 million CAPTCHAs are solved each day.

If each CAPTCHA took 10 seconds to solve, that would have been around 139 years (or 4.4 billion seconds) of brain time wasted; I am starting to see what Mr von Ahn meant! To put the 440 million words into perspective, the complete works of Shakespeare run to around 900,000 words, or 0.9 million.
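For anyone who wants to check the sums, here is a quick back-of-the-envelope version in Python. It assumes, as the figures above do, one 10-second CAPTCHA per word interpreted, and it uses a 365.25-day year (my own choice).

```python
WORDS_INTERPRETED = 440_000_000   # figure reported in Science
SECONDS_PER_CAPTCHA = 10          # assumed solving time per word
SHAKESPEARE_WORDS = 900_000       # rough size of the complete works

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumes a 365.25-day year

total_seconds = WORDS_INTERPRETED * SECONDS_PER_CAPTCHA
years = total_seconds / SECONDS_PER_YEAR
shakespeare_equivalents = WORDS_INTERPRETED / SHAKESPEARE_WORDS

print(f"{total_seconds:,} seconds is about {years:.0f} years")
print(f"That is roughly {shakespeare_equivalents:.0f} complete works of Shakespeare")
```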

Whilst the progress of reCAPTCHA seems pretty impressive, it is a tiny step on the path to total digitisation. According to this BBC article, von Ahn said at the time:

“There’s still about 100 million books to be digitised, which at the current rate will take us about 400 years to complete”

Google

In 2009 Google acquired reCAPTCHA. The search giant claimed that it wanted to “teach computers to read”, hence the acquisition.

Many speculate that Google’s ultimate aim is to index the world, and reCAPTCHA will help it to accelerate this process. That said, if that is its goal, it is still a very long way off.

We won’t be implementing a CAPTCHA on Technology Bloggers any time soon, however next time you have to fill one in, do spare a thought for the [free] work you might be doing for literature, for history and for Google.

Don’t underestimate Jetpack

Jetpack is a WordPress plugin that lets you access many of the features which come inbuilt with a WordPress.com site, on a WordPress.org installation. Historically, plugins have had just one function, but Jetpack is a combination of plugins which can perform a huge range of actions.

Plugins on Steroids

One way of describing Jetpack is plugins on steroids. Jetpack makes it really easy to access loads of the great features available through WordPress, all in one simple package.

Jetpack creates its own area in WordPress Admin (wp-admin) where you can learn about, configure and activate/deactivate different elements of the plugin.

You don’t have to activate all of Jetpack’s elements; you can use as many or as few as you choose. As with every plugin, every extra function of Jetpack you activate will have a small effect on your blog’s speed, so only use the ones that work for you.

The Future of Plugins

The way Jetpack sets out all the different plugins and makes it so easy for users to configure them is a great leap forward for WordPress. Currently the wp-admin plugins page is quite boring, and it can be hard to find the plugin you want fast. I feel that a Jetpack-style interface could significantly improve usability, and generally make plugins more fun.

A screenshot of Jetpack's plugins

A screenshot of the different plugins and settings Jetpack includes.

Could a future version of the CMS use a Jetpack-like style to display plugins? Maybe.

Features

Here are some of the many features that Jetpack includes:

  • WordPress.com Stats – On-site analytics for your site. Personally I feel server side analytics and more detailed external statistic managers (like Google Analytics) are better than Jetpack’s version, but many people find it an easier, free alternative.
  • Publicize – This enables you to post your articles to Facebook, Twitter, LinkedIn and Tumblr. The great thing about Publicize is that it only publishes when your articles go live, so it works on scheduled posts too 🙂
  • Spelling and Grammar – Simple yet advanced spell checking for content. I use Firefox’s default spell checking software, and Jetpack’s version is slightly annoying, so this is disabled on Technology Bloggers!
  • WP.me Shortlinks – An easy inbuilt URL shortener. Using the WP.me URL shortener helps to keep short URLs tidy, as having too many from too many different sites can look messy.
  • Infinite Scroll – This is a feature that I personally dislike a lot! It enables you to have a bottomless page, so once users get to the bottom, it loads more articles. This can effectively put your entire blog on one page. I don’t like bottomless pages; they drive me mad, so if you want me to visit your site, keep this option off 😉
  • Sharing – Technology Bloggers uses the Sharing feature to power the share buttons at the bottom of each article. I have removed the standard buttons and replaced them with more minimal, stylish buttons. The sharing feature is truly great, and is a lightweight way of combining many network sharing plugins.
  • Omnisearch – A fantastic and really simple way to search wp-admin.

Technology Bloggers share buttons

Technology Bloggers’ new share buttons – found at the bottom of every article.

Give It A Go

I didn’t think I would like Jetpack, and at first I didn’t. After reading a bit about its features and how good it can be, I thought I would give it a go. I now love it!

I love the flexibility that it offers, in that you can have as many or few elements active as you choose. Technology Bloggers only uses 4 of the 27 functions, and that works fine for us. On my personal philosophy blog, I also use Jetpack and have 8 of the 27 elements active; it is a different blog which benefits from different plugins.

Do you use Jetpack?

Your thoughts are welcome as always 🙂

Music Royalties and Spotify

The BBC World website is reporting that Thom Yorke, the singer from Radiohead, has pulled some of his music off Spotify and Rdio because he says that their royalty payments are too low.

To be exact he tweets:

[Embedded tweet from Thom Yorke]

Radiohead are not new to this type of provocation, however. In 2007 they released an album, “In Rainbows”, that could be downloaded only from their website. The interesting twist was that listeners could pay whatever they wanted for the download; there was no fixed price.

They came in for a lot of criticism, as this article in the NME shows, with some people claiming they were making it more difficult for new bands to make any money from their releases.

Well, Thom does not agree. The album was “bought” 3 million times in its first year of release, and Yorke himself says that the band made more money from this one album than from all of the others put together.

So this leads to the obvious question about Spotify: how much do they pay?

Busking pays more than Spotify

As a musician myself I have a good knowledge of how the payment systems worked in the days before digital downloads. When my band released albums or singles and we received radio play we were paid. In the UK about 15 years ago an artist was paid about $30 a minute for a play on national radio. This means about $90 to $100 for a song, minus the 20% that the PRS take for collecting it and distribution. It’s good money. One single that gets 10 plays on John Peel or other fringe shows could make $1000, enough to make another.

Now Spotify is different of course because a play is personal, not to an entire country. This article on the Music Think Tank blog explains how royalties are worked out, but I will try to explain here as simply as I can.

Spotify make money from advertising and subscriptions. They pay 70% of their income out in royalties for the music they play. They pay out pro-rata, so if 1% of all streams happen to be your music, you get 1% of that total payment. Simple enough (maybe).

So let’s look at the numbers.

In 2011 Spotify generated $20,333,333 per month.

They distributed 70% which is $14,233,333.

They had about 1,083,333,333 streams per month.

Let’s say I got 20 streams a month, about 0.00000184615% of the total royalties payout.

I make $0.26 a month, that is 26 cents, or about $0.013 per stream, before the 15% that the digital distribution company takes for putting the tracks up and the 10.2% publishing fee.

That leaves roughly a cent per play in round terms, or about 20 cents a month: not a lot of money by anyone’s standards.
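If you want to play with the figures yourself, here is a rough sketch of the pro-rata sum in Python, using the 2011 numbers quoted above. It assumes that the 15% distribution fee and the 10.2% publishing fee are both taken off the gross payout; the variable names are mine, and real royalty statements are more complicated than this.

```python
MONTHLY_REVENUE = 20_333_333      # Spotify income per month (2011), in dollars
PAYOUT_SHARE = 0.70               # share of income paid out as royalties
TOTAL_STREAMS = 1_083_333_333     # all streams in the month
MY_STREAMS = 20                   # streams of my own tracks

DISTRIBUTOR_FEE = 0.15            # digital distribution company
PUBLISHING_FEE = 0.102            # publishing fee

royalty_pool = MONTHLY_REVENUE * PAYOUT_SHARE
per_stream = royalty_pool / TOTAL_STREAMS

# Pro-rata: my share of the pool equals my share of all streams.
my_share = MY_STREAMS / TOTAL_STREAMS
gross = royalty_pool * my_share

# Assumption: both fees are deducted from the gross amount.
net = gross * (1 - DISTRIBUTOR_FEE - PUBLISHING_FEE)

print(f"Royalty pool: ${royalty_pool:,.0f}")
print(f"Gross per stream: ${per_stream:.4f}")
print(f"My gross for {MY_STREAMS} streams: ${gross:.2f}")
print(f"My net after fees: ${net:.2f} (${net / MY_STREAMS:.4f} per stream)")
```

Change MY_STREAMS to see how many plays it takes before the monthly payout starts to look like real money.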

Of course the more people use Spotify, the less an individual stream is worth.

In the US, Pandora, another streaming site, is pushing Congress to pass legislation that will cut this rate further, by 85%.

Even Pink Floyd have been complaining that artists are being duped.

I for one keep making music and releasing it to the world; I don’t expect to get rich though!

UPDATE: This week the BBC has a follow-up article about Thom and Spotify. Check it out here.

UPDATE: Spotify has now revealed it pays artists $0.007 per stream. That’s a lot less than previously thought.