Over the last couple of weeks, prompted by this article about detecting headless browsers and this linked article about detecting PhantomJS, I’ve been doing a bunch of research into bot traffic.

What started out as a “let’s write a better bot blocker” project turned into more of an open-ended investigation into the kind of bot traffic we see on popular traffic sources.

What’s the code behind fraudulent traffic? How sophisticated is it? And how can you detect or block it?

Even in this early-stage research, which was conducted on a single popular traffic source, I found some pretty surprising results. I initially tested using very low bids in a few countries including the USA, which I know from experience is a good way to get lots and lots of bot placements. Subsequently I verified my tests using a very high bid on premium traffic.

All tests were conducted on wifi traffic to reduce false positives from click loss, and because wifi traffic tends to have more bots.

TL;DR Summary

  • Sophisticated bot traffic appears to be very, very rare.
  • The single most effective simple test for bot traffic was “can it parse JavaScript”.
  • “Does it even download my landing page” is another effective test.
  • navigator.languages appears to detect some probably-bot traffic.
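To make the first two tests concrete, here is a minimal sketch of how hits could be sorted by how far they got. It is an illustration only, not the code used for these tests. It assumes your tracker logs a click ID for every click, the server logs the click ID whenever the landing page is requested, and a small script on the page reports the click ID back once it runs. The function name and the three input lists are hypothetical.

```javascript
// Hypothetical sketch: classify each click by the furthest stage it reached.
// clicks:    click IDs recorded by the tracker
// pageLoads: click IDs for which the landing page HTML was requested
// jsBeacons: click IDs for which the on-page script reported back
function classifyClicks(clicks, pageLoads, jsBeacons) {
  const loaded = new Set(pageLoads);
  const ranJs = new Set(jsBeacons);
  const result = {};
  for (const id of clicks) {
    if (!loaded.has(id)) {
      // The click was counted but the page was never fetched.
      result[id] = "no-page-load";
    } else if (!ranJs.has(id)) {
      // The HTML was fetched but the script never ran.
      result[id] = "no-javascript";
    } else {
      result[id] = "executed-javascript";
    }
  }
  return result;
}
```

Anything in the first two buckets is a candidate for bot traffic, though real visitors who bounce before the page loads or who block scripts will land there too.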

Findings: Point By Point

Sophisticated Headless Browsers are nowhere to be seen

I implemented most of the techniques in the above article, including WebGL vendor detection, navigator.languages checking, missing JS functions, suspiciously fast alert box closing and of course user-agent checking.

I didn’t have much hope that user-agent checking would work, because that’s a really obvious thing to spoof, but some of the other checks would be an absolute pain to bypass.
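For anyone who wants to try something similar, here is a rough sketch of how several of these checks might be combined. It is not my production code. The WebGL vendor and renderer strings ("Brian Paul" and "Mesa OffScreen") and the missing Function.prototype.bind check are assumptions based on what was reported for headless Chrome and older PhantomJS builds, so verify them against whatever versions you care about. The function names and signal labels are made up for the example.

```javascript
// Pure check: takes plain values so it can be run and tested outside a browser.
// Returns the list of signals that fired; an empty list means nothing suspicious.
function headlessSignals(env) {
  const signals = [];
  // Trivially spoofable, but cheap to check.
  if (/HeadlessChrome|PhantomJS/.test(env.userAgent || "")) {
    signals.push("user-agent");
  }
  // Assumption: headless builds reported an empty navigator.languages.
  if (!env.languages || env.languages.length === 0) {
    signals.push("empty-languages");
  }
  // Assumption: software-rendered headless Chrome reported these values.
  if (env.webglVendor === "Brian Paul" && env.webglRenderer === "Mesa OffScreen") {
    signals.push("webgl-vendor");
  }
  // Assumption: older PhantomJS builds lacked Function.prototype.bind.
  if (!env.hasFunctionBind) {
    signals.push("missing-bind");
  }
  return signals;
}

// Browser-side collection of the values above (only meaningful in a page).
function collectEnv() {
  const env = {
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    hasFunctionBind: typeof Function.prototype.bind === "function",
    webglVendor: null,
    webglRenderer: null,
  };
  const gl = document.createElement("canvas").getContext("webgl");
  const info = gl && gl.getExtension("WEBGL_debug_renderer_info");
  if (info) {
    env.webglVendor = gl.getParameter(info.UNMASKED_VENDOR_WEBGL);
    env.webglRenderer = gl.getParameter(info.UNMASKED_RENDERER_WEBGL);
  }
  return env;
}
```

On a landing page you would call headlessSignals(collectEnv()) and send the result back with the click ID. Keeping the check separate from the collection makes it easy to replay logged values later.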

The results surprised me: in tens of thousands of hits, I didn’t detect a single definitive instance of either headless Chrome or PhantomJS! I also saw precisely zero bots rapidly closing alert boxes. This applied both to cheap traffic and high-end premium traffic.

Interesting. But it gets more interesting…

Here’s the source code for the most common bot I saw…

After the initial surprising results above, I thought a bit about what could be causing them.

What’s the simplest “bot” one could possibly use, and how could I detect i ...
