Procedural worlds, statistical analysis, image processing and PRNG exploitation for the lulz

—or why IMDb got a CAPTCHA.

This article was published in the Spring 2014 issue of 2600 — The Hacker Quarterly. This online version was slightly corrected for grammar and clarity.

In February 2005, the GNAA devised a cunning plan to troll IMDb users using various fancy hacks. This is what happened.

The plan

It was suggested on the #gnaa IRC channel that the movie Gayniggers from Outer Space (GNfOS), from which the organisation takes its name, be upvoted to the IMDb top 250 as an emotional tribute to this cult movie. The GNAA not being 4chan, they did not have an army of idiots to carry out their deeds; they had to use imagination, skill and technology instead.

The first attempt was simple: everyone voted for GNfOS, and asked people they knew to vote as well. It went slowly. To vote several times, a person had to go through a tedious process: only registered users could vote on IMDb, and registering an account required a valid e-mail address. Manual account creation was slow.

The GNAA therefore decided to automate the IMDb account creation and voting.

Creating a procedural world of people

The following observations and guesses were made about the IMDb voting process:

  • For the Top 250, only votes from “regular voters” were considered. This probably meant that in order to have an impact on the vote, they needed to A. vote for several movies in addition to GNfOS and B. have the same accounts vote again in the following days.
  • A “weighting system” was applied to the votes, which probably included disfavouring multiple votes from the same IP address, so they needed to use as many different IPs as possible.
  • Multiple e-mail addresses from the same domain were more likely to attract attention, so using as many different domains as possible would make it more difficult to deduce which other accounts were created using this process.
  • New users needed to fill in a form with their gender, birth year, country, postal code etc.; randomising this information would reduce the odds of being detected through statistical analysis.

So the GNAA wrote an account creation library that, given a random seed, would create a unique identity comprising:

  • Full name, using data from the most common female and male names as well as surnames in the U.S.
  • E-mail address, using the full name combined with a variety of free e-mail providers such as spam.la, mailinator.com, fastmail.us
  • Gender, country, year of birth, postal code. Male individuals from Niger were made to appear artificially more frequently.
  • A preferred password for use on websites.

Generated identities would then look like this:

seed  full name        gender  e-mail address               password
3480  Tracy Gilbert    F       TracyGilbert@spamhole.com    26ACTR41
3481  Rene Reid        M       Rene_Reid@runbox.com         Re96RE14
3482  Sandra Silva     F       SANDRA63@swift-mail.com      UA75ED11
3483  Terrence Bowman  M       terrencebowman@spamhole.com  en29TETE
3484  Ian Wade         M       WADE5946@poboxed.com         59DE28WA
3485  Barbara Burke    F       barbara_burke@spam.la        rb86BA13

People taking part in the operation would then be responsible for a seed range. For instance, Gary would run a script with seeds 1400 to 1499 for several days. But if Gary became busy with other things, someone else could run the script with the same seed range and continue where he had left off. There was no need to create a central database because all the identity information was generated procedurally.
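The key property here is determinism: the same seed always yields the same person. Below is a minimal sketch of that idea in Python. The word lists, field names and the `make_identity` function are illustrative assumptions, not the GNAA's actual library.

```python
import random
import string

# Tiny illustrative word lists (assumption); the original library used
# U.S. name frequency data and many more e-mail providers.
FIRST_NAMES = {
    "F": ["Tracy", "Sandra", "Barbara"],
    "M": ["Rene", "Terrence", "Ian"],
}
SURNAMES = ["Gilbert", "Reid", "Silva", "Bowman", "Wade", "Burke"]
DOMAINS = ["spam.la", "mailinator.com", "fastmail.us"]


def make_identity(seed):
    """Derive a complete identity from a single integer seed."""
    # A private generator seeded per identity: no global state is shared,
    # so the result depends on the seed and nothing else.
    rng = random.Random(seed)
    gender = rng.choice(["F", "M"])
    first = rng.choice(FIRST_NAMES[gender])
    last = rng.choice(SURNAMES)
    local_part = first + rng.choice(["", "_", "."]) + last
    alphabet = string.ascii_letters + string.digits
    return {
        "seed": seed,
        "full_name": f"{first} {last}",
        "gender": gender,
        "email": f"{local_part}@{rng.choice(DOMAINS)}",
        "birth_year": rng.randint(1940, 1987),
        "password": "".join(rng.choice(alphabet) for _ in range(8)),
    }


# Anyone given the range 1400 to 1499 regenerates exactly the same people.
garys_identities = [make_identity(seed) for seed in range(1400, 1500)]
```

Because the seed is the only input, a seed range is all that needs to be handed over when someone takes over a batch.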

Operation imdbtroll

The GNAA combined the identity creation library with additional anonymising features such as a regularly updated list of public HTTP proxies (Tor was barely usable back in 2005), and web user agent randomisation. The imdbtroll.py script was created.

People on IRC started running the script with a seed range assigned to them. The script went through several iterations, but the final version worked roughly as follows:

  1. Choose a seed from the provided range, and create the corresponding identity. For instance, seed 9432: John Blackman (john_blackman@runbox.com).
  2. Check whether the identity’s e-mail address is activated, by logging in if necessary. For instance, a spam.la address did not require any registration, but a mailinator.com address did.
    • If the e-mail address is not active, register an account at the e-mail provider.
  3. Check whether the IMDb account is present, by logging in if necessary.
    • If the IMDb account is not present but there is a confirmation e-mail in the mailbox, activate the account using that e-mail.
    • If the IMDb account is not present and there is no e-mail, create an IMDb account and wait for a confirmation e-mail in the mailbox.
  4. Log in to IMDb.
  5. Vote for movies from IMDb’s top 250, from the bottom 100, or using its built-in search engine; random search words included “troll”, “communists” or “nazis”.
  6. Vote for “Gayniggers from Outer Space”, giving that movie 8, 9 or 10 stars.
  7. Vote for other movies some more, so as not to show an obvious pattern.

The script also tried hard to simulate a real human using a real web browser, pausing between pages, using valid referrer information, clicking on links, sometimes not even voting for GNfOS…

It worked well. The weighted average vote for GNfOS rose from 5.9 stars to 8.7.

Feb 2nd  Feb 3rd  Feb 4th
5.9/10   7.5/10   8.7/10

And here are the voting details:

Stars  Feb 2nd      Feb 3rd       Feb 4th
10     605 (68.0%)  1391 (81.2%)  2913 (81.8%)
9      26 (2.9%)    60 (3.5%)     224 (6.3%)
8      24 (2.7%)    25 (1.5%)     85 (2.4%)
7      28 (3.1%)    28 (1.6%)     55 (1.5%)
6      28 (3.1%)    29 (1.7%)     51 (1.4%)
5      33 (3.7%)    33 (1.9%)     51 (1.4%)
4      18 (2.0%)    18 (1.1%)     37 (1.0%)
3      27 (3.0%)    27 (1.6%)     35 (1.0%)
2      30 (3.4%)    30 (1.8%)     37 (1.0%)
1      71 (8.0%)    71 (4.1%)     72 (2.0%)
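It is worth comparing these counts with the published rating. The short Python sketch below computes the plain arithmetic mean of each day's votes from the table above. IMDb's actual weighting formula was not public, so this only illustrates the gap between the raw mean and the weighted figure. The `arithmetic_mean` helper is a name made up for this example.

```python
# Vote counts per star rating, copied from the table above.
VOTES = {
    "Feb 2nd": {10: 605, 9: 26, 8: 24, 7: 28, 6: 28,
                5: 33, 4: 18, 3: 27, 2: 30, 1: 71},
    "Feb 3rd": {10: 1391, 9: 60, 8: 25, 7: 28, 6: 29,
                5: 33, 4: 18, 3: 27, 2: 30, 1: 71},
    "Feb 4th": {10: 2913, 9: 224, 8: 85, 7: 55, 6: 51,
                5: 51, 4: 37, 3: 35, 2: 37, 1: 72},
}


def arithmetic_mean(histogram):
    """Unweighted mean star rating for a {stars: vote_count} histogram."""
    total_votes = sum(histogram.values())
    total_stars = sum(stars * count for stars, count in histogram.items())
    return total_stars / total_votes


for day, histogram in VOTES.items():
    print(day, round(arithmetic_mean(histogram), 2))
# Feb 2nd 8.19
# Feb 3rd 9.04
# Feb 4th 9.32
```

The raw means (8.19, 9.04, 9.32) sit well above the published 5.9, 7.5 and 8.7, which is consistent with the weighting system discounting part of the votes, although by how much and on what basis cannot be told from these numbers alone.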

Bantown trolls the GNAA

On February 4th, Bantown, a rival trolling group, got ahold of the GNAA’s script by lurking on the IRC channel and using powerful hacker tools such as wget to retrieve the publicly posted script updates.

Bantown started running imdbtroll.py, too, with their own secret seed ranges. They made a single modification to it: instead of giving GNfOS ten stars, they were giving it one star.

A race had begun. It was obvious that Bantown was running more instances of the script than the GNAA, enough to completely cancel out the GNAA’s efforts. One solution was to run even more instances than Bantown, but such an escalation could only lead to IMDb admins eventually detecting the unusual behaviour.

But the GNAA had a secret weapon: a logic bomb hidden in plain sight, right inside imdbtroll.py.

The GNAA trolls Bantown back

The library used for IMDb access had a lot of features, including changing a user’s password. It was not used by imdbtroll.py but it was fully functional. The GNAA therefore created a new script, fuckbantown.py, which did the following:

  • Create a new identity from a random seed.
  • Log into IMDb using the identity.
  • Change the user’s password so that the account becomes unusable for Bantown’s running scripts.
  • Change the vote for GNfOS from 1 star back to 10.

There was only one small problem: the GNAA did not know which random seeds Bantown had been using. It would potentially have to log in to billions of possible accounts to find out which users had been created. That was not only sure to raise alarms at IMDb, but also infeasible in a reasonable amount of time.

But there was another way, thanks to spam.la. Some of the identities were using that domain for their e-mail address.

[Image: spam.la]

As you can see, one prominent feature of that website was that all e-mails sent to a spam.la address appeared on the website. So the GNAA only had to monitor that website and look for unknown IMDb account activation e-mails!

Then, if the confirmation e-mail was sent to e.g. TRACEY49@spam.la, they only had to brute-force the Python pseudorandom number generator in order to find the seed that had created such an address. That still meant testing all possible seeds, but without having to connect to any server. If the seed was 215045, it probably meant that a Bantown person was using seeds 215000 to 215999.
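The offline search is easy to picture in code. The sketch below assumes a toy address generator, `email_for_seed`, standing in for the real identity library; its word lists and address format are made up for illustration. The point is only that every candidate seed can be tested locally, without touching any server.

```python
import random

# Toy stand-in for the real identity library (assumption): the address is
# derived from the seed alone, so it can be recomputed offline.
NAMES = ["tracey", "sandra", "ian", "rene", "barbara", "terrence"]
DOMAINS = ["spam.la", "mailinator.com", "runbox.com"]


def email_for_seed(seed):
    """Recompute the e-mail address that a given seed would produce."""
    rng = random.Random(seed)
    name = rng.choice(NAMES).upper()
    number = rng.randint(10, 99)
    return f"{name}{number}@{rng.choice(DOMAINS)}"


def find_seeds(target_email, max_seed):
    """Return every seed below max_seed that generates target_email."""
    target = target_email.lower()
    return [seed for seed in range(max_seed)
            if email_for_seed(seed).lower() == target]


def guess_seed_range(seed, block=1000):
    """Guess the block of seeds an operator was probably assigned."""
    start = seed - seed % block
    return start, start + block - 1


print(guess_seed_range(215045))  # (215000, 215999)
```

With a generator this small, many seeds collide on the same address, so `find_seeds` returns a list of candidates. A richer identity (full name, number, separator, domain) leaves far fewer matches per address, and several observed addresses clustering in the same block narrow things down further.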

Little by little, the GNAA secretly changed the votes for the users that Bantown had spent hours creating.

The IMDb CAPTCHA

Understandably, the Bantown people felt butthurt. On February 5th, they decided to put an end to the whole operation and they alerted IMDb. A wave of panic swept over the admins and one of them quickly set up a CAPTCHA to protect account creation from automated scripts:

[Image: imdb-captcha]

Back in 2005, CAPTCHA breaking was rather uncommon. Some tools existed but they only targeted simple CAPTCHAs with minor image distortions. The one used by IMDb was considered hard to break.

However, the CAPTCHA had an unexpected weakness. It took the GNAA some time to understand it, but even with a few samples, it had become visible:

[Image: captcha-samples]

Can you see it? “Morgan Freeman” and “Hide and Seek” appeared twice each. What were the odds that, given 16 movie and actor names chosen at random, two of them would appear more than once? Pretty small, wouldn’t you agree? Well yes, unless the list of movies and actors was unexpectedly short. And a small dictionary is a serious CAPTCHA weakness.
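To put rough numbers on "pretty small", here is a small Monte Carlo sketch in Python. It assumes names are drawn uniformly and independently, and the two dictionary sizes are arbitrary values picked for illustration, not figures from the operation.

```python
import random
from collections import Counter


def prob_two_repeated_names(dictionary_size, draws=16, trials=20000, seed=0):
    """Estimate the chance that at least two different names each show up
    more than once among `draws` uniform picks from the dictionary."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(dictionary_size) for _ in range(draws))
        repeated_names = sum(1 for c in counts.values() if c >= 2)
        if repeated_names >= 2:
            hits += 1
    return hits / trials


# Illustrative sizes (assumptions): a short list versus a large catalogue.
p_small = prob_two_repeated_names(200)
p_large = prob_two_repeated_names(100_000)
print(f"200 names: {p_small:.3f}   100,000 names: {p_large:.5f}")
```

With a catalogue of a hundred thousand names the event is practically never observed, while with a couple of hundred names it becomes an ordinary occurrence.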

In order to guess the size of the dictionary, the GNAA gathered 192 CAPTCHA samples and counted how many times duplicates appeared:

  • 66 names appeared once
  • 39 names appeared twice
  • 10 names appeared 3 times
  • 3 names appeared 4 times
  • 1 name appeared 6 times
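Counts like these are enough to estimate the dictionary size. The Python sketch below is one possible approach, not necessarily the analysis the GNAA ran: assuming every name is equally likely, the probability of seeing k distinct names in n draws from a dictionary of N names is proportional to N! / ((N − k)! · Nⁿ), and the most likely N is the one that maximises it.

```python
import math

# How many names were seen a given number of times (from the list above).
OBSERVED = {1: 66, 2: 39, 3: 10, 4: 3, 6: 1}

distinct = sum(OBSERVED.values())                            # 119 names
draws = sum(times * names for times, names in OBSERVED.items())  # 192 samples


def log_likelihood(dictionary_size, distinct, draws):
    """Log-likelihood, up to a constant that does not depend on the
    dictionary size, of seeing `distinct` different names in `draws`
    uniform draws."""
    if dictionary_size < distinct:
        return float("-inf")  # cannot see more names than exist
    return (math.lgamma(dictionary_size + 1)
            - math.lgamma(dictionary_size - distinct + 1)
            - draws * math.log(dictionary_size))


# Scan an arbitrary range of plausible sizes (the upper bound is an assumption).
best_size = max(range(distinct, 2001),
                key=lambda n: log_likelihood(n, distinct, draws))
print(distinct, draws, best_size)
```

Under the uniform assumption only the number of distinct names matters, not the exact breakdown of repeats. If some names were more frequent than others, this estimate would tend to come out too low.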

They then performed a statistical analysis and managed to compute the probability that the above distribution would appear given various dictionary sizes:

[Image: graph2]

The most probable dictionary sizes were between 170 and 190. As expected, that was small, and it allowed for a CAPTCHA-breaking attack that did not involve OCR: given the size of the corpus, they only had to count characters instead of decoding them. For instance, 4 characters followed by 7 characters could be “Pulp Fiction”, “Ryan Gosling” or “Teri Hatcher”. Since three tries were allowed to solve the CAPTCHA, that one would always be guessed successfully. On average, this led to a CAPTCHA breaker with a success rate of more than 60%.
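The character-counting idea boils down to a lookup table keyed by word lengths. The Python sketch below uses only the five names quoted in this article as a toy dictionary, and assumes the word lengths have already been measured from the image (that image-processing step is not shown). All function names are made up for the example.

```python
from collections import defaultdict

# Toy dictionary: only the names quoted in this article (assumption);
# the real list had an estimated 170 to 190 entries.
DICTIONARY = [
    "Pulp Fiction", "Ryan Gosling", "Teri Hatcher",
    "Morgan Freeman", "Hide and Seek",
]


def signature(name):
    """Word lengths of a name, e.g. 'Pulp Fiction' -> (4, 7)."""
    return tuple(len(word) for word in name.split())


def build_index(names):
    """Group dictionary entries by their word-length signature."""
    index = defaultdict(list)
    for name in names:
        index[signature(name)].append(name)
    return index


def candidates(index, word_lengths, max_tries=3):
    """Names worth submitting for the measured word lengths."""
    return index.get(tuple(word_lengths), [])[:max_tries]


def expected_success_rate(index, max_tries=3):
    """Chance of solving a CAPTCHA drawn uniformly from the dictionary,
    assuming the word lengths are always measured correctly."""
    total = sum(len(names) for names in index.values())
    solvable = sum(min(len(names), max_tries) for names in index.values())
    return solvable / total


index = build_index(DICTIONARY)
print(candidates(index, [4, 7]))
# ['Pulp Fiction', 'Ryan Gosling', 'Teri Hatcher']
```

A signature shared by three names or fewer is always solved within the three allowed tries. Signatures shared by more names, and any error in measuring the lengths, are what keep the real-world rate below 100%.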

Operation imdbtroll could carry on.

Epilogue

A few hours after the CAPTCHA breaker was integrated into imdbtroll.py, someone on #gnaa pointed out that the IMDb top 250 only allowed movies that ran for more than 45 minutes.

GNfOS was a short movie. It would never enter the top 250.

The whole operation had been in vain, but science progressed and lulz were had.
