The difficulty of fighting bots on the web

Robots, more commonly known as bots, are now plentiful on the Internet. They account for a significant share of global Internet traffic.

Today, I invite you to explore the world of Internet bots.

Bots… but what for?

The first question one might ask is why bots roam freely on the Internet.

You use bots every day when you surf the Internet, sometimes without knowing it. When you run a query on a search engine, a bot has already indexed the pages for you. A large share of the messages that companies post on social networks are published by bots on their behalf.

But bots are also used to attack websites and saturate them, or to scrape all of a site’s content, either to resell it or to build a website cheaply by exploiting the data and infrastructure of other websites.

The good bots and the bad bots…

The nice bots, they go to websites, they download all the content, but they’re nice bots. The bad bots, they go to websites, they download all the content, but it’s not the same.

That’s the complexity of the fight against bots in one sentence. I’m of course not talking here about “attack” bots such as DDoS bots, which I cover below, but about indexing bots versus content scrapers.

As for indexing bots, I’ll take the best known of them: Googlebot. This bot reads your site, and in particular a few very specific files, to index its content and list it in Google’s search engine. Bing, DuckDuckGo, Qwant and others use similar methods.

These bots are necessary for the modern Internet to function, and they usually have little or no impact on the sites they visit. That is partly because they have nothing to gain by degrading a site’s performance.

To help these bots, several files can be placed on the server. For example, the robots.txt file at the root of the site tells search engines which content they should not crawl. Be careful, however: this file is purely advisory and a bot can ignore it completely. Similarly, to make indexing easier, we often provide a sitemap that helps bots understand how the site’s content is structured.
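To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved bot might consult the file before fetching a page, using Python’s standard urllib.robotparser module. The file content, the example.com URLs and the “ExampleBot” user agent are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was already downloaded (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite bot asks before each request; "ExampleBot" is a made-up user agent.
print(parser.can_fetch("ExampleBot", "https://example.com/blog/post-1"))  # True
print(parser.can_fetch("ExampleBot", "https://example.com/admin/login"))  # False

# Sitemap URLs declared in the file, if any.
print(parser.site_maps())
```

Nothing forces a bot to run this check, which is exactly why the file only helps with bots that choose to respect it.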

On the other hand, scrapers have only one purpose: to suck up part or all of your site, in order to:

Create a site from the extracted data: LinkedIn VS Apollo.io
Extract the data and cross-reference it with other data to resell it
Compute statistics from it, for example

Not all scrapers are malicious. I developed one myself a few years ago to retrieve public information from a site that did not provide an API (an API arrived a little later, by the way).

The attack bots

We could also talk about the more malevolent bots: the attack bots. Here the goal is clearly to harm the website.

These bots come in several types, including:

Bots that scan open ports (with Nmap, for example)
Brute-force bots (if you have a website, look at the requests hitting the WordPress admin or phpMyAdmin, even if you have neither installed)
DDoS bots, whose only aim is to bring down your website

These bots bring no benefit to the site they target.
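If you want to see these brute-force attempts for yourself, a short script over your access log is enough. The sketch below assumes the log uses the Common Log Format (the default shape for Apache and nginx); the list of probed paths is only illustrative, and the sample lines and IP addresses are invented.

```python
import re
from collections import Counter

# Path prefixes often probed by brute-force bots (illustrative, not exhaustive).
SENSITIVE_PREFIXES = ("/wp-login.php", "/wp-admin", "/phpmyadmin")

# Start of a Common Log Format line: client IP, two fields, [date], "METHOD path
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>[A-Z]+) (?P<path>\S+)'
)


def count_probes(lines):
    """Count, per client IP, the requests that target a sensitive path."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").lower().startswith(SENSITIVE_PREFIXES):
            hits[match.group("ip")] += 1
    return hits


# Invented sample lines; the IPs come from ranges reserved for documentation.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "POST /wp-login.php HTTP/1.1" 404 153',
    '203.0.113.7 - - [10/Oct/2023:13:55:37 +0000] "POST /wp-login.php HTTP/1.1" 404 153',
    '198.51.100.4 - - [10/Oct/2023:13:55:40 +0000] "GET /blog/hello HTTP/1.1" 200 5120',
    '203.0.113.7 - - [10/Oct/2023:13:55:41 +0000] "GET /phpMyAdmin/index.php HTTP/1.1" 404 153',
]

probes = count_probes(SAMPLE_LOG)
for ip, count in probes.most_common():
    print(ip, count)  # 203.0.113.7 3
```

On a real site you would pass the open log file instead of SAMPLE_LOG, since iterating over a file yields its lines one by one.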

You might wonder: what use is it for a hacker to take control of a blog?

There are actually several:

Extract the user base (email addresses, passwords, etc.)
Use the site to host phishing pages, taking advantage of its “clean” web reputation
Publish specific content

That’s why it’s just as important to protect yourself from these bots.

Differentiating bots

As you have probably gathered, telling the “good bot” from the “bad bot” is not necessarily easy. Solutions exist today, but none is 100% reliable.

Most of these solutions rely on several criteria:

Is the source IP known to be dangerous?
Is the user’s behavior abnormal or inhuman? (For example, a user who reads 2000 blog posts in 20 seconds is slightly suspicious.)
Does the user use a headless browser, that is, a browser driven by a program without a visible window?
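To illustrate the behavioral criterion, here is a minimal sketch of a sliding-window counter that flags a client sending too many requests in a short period. The RateDetector class, its threshold of 100 requests per 20 seconds and the client identifiers are all assumptions chosen for the example; real products combine many more signals.

```python
from collections import defaultdict, deque


class RateDetector:
    """Flag clients that exceed max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # One queue of recent request timestamps per client.
        self._history = defaultdict(deque)

    def is_suspicious(self, client_id: str, timestamp: float) -> bool:
        history = self._history[client_id]
        history.append(timestamp)
        # Forget requests that are older than the window.
        while history and timestamp - history[0] > self.window_seconds:
            history.popleft()
        return len(history) > self.max_requests


detector = RateDetector(max_requests=100, window_seconds=20.0)

# A bot reading 2000 posts in 20 seconds: one request every 10 ms.
bot_flags = [detector.is_suspicious("bot", i * 0.01) for i in range(2000)]
print(bot_flags.index(True))  # 100: the first 100 requests went through

# A human reading one post every 10 seconds is never flagged.
human_flags = [detector.is_suspicious("human", i * 10.0) for i in range(5)]
print(any(human_flags))  # False
```

Note that the bot is only flagged on its 101st request: a detector of this kind has to watch some traffic before it can decide anything.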

The real issue here is that heuristic analysis requires a pre-existing data model and has to observe some navigation before it can block, so the bot may be allowed to browse for a while.

Moreover, once blocked, many bots come back in another way: for example, by going through a proxy or, for bots running on compromised devices, by using a botnet.

With botnets, by the way, the fight is much harder, because the traffic is much more difficult to trace back to its operator.

All I have to do is add a captcha…

The captcha… the ultimate weapon of many a site. This weapon is often useless because it is poorly used. The captcha is often more annoying for humans than for bots, which is a bit paradoxical.

Displaying a captcha is always a delicate choice, because you have to ask yourself whether it will degrade the user experience too much. I am myself not a fan of training Google’s recognition algorithms through reCAPTCHA.

Moreover, keep in mind that many captchas can be bypassed, either by bots that solve them or by humans paid to solve the captchas displayed (for example: https://anti-captcha.com/mainpage).

Nevertheless, when we detect behavior that we consider abnormal, displaying a captcha can be a way to:

Slow down the bot
Settle the doubt about whether the visitor is a human or a bot

Keep in mind, though, that this solution is not 100% reliable.

To conclude

As you will have gathered, the fight against bots is a complicated one. It’s a daily fight, and mostly a game of cat and mouse. As I like to say, no security is infallible, and bots are constantly evolving. It is therefore up to those of us who work in security every day to never let our guard down and to stay on the lookout for new methods.

It’s also a complicated struggle because it has to be a silent one, or it risks degrading the user experience and scaring away the site’s audience. In addition, as I explained, some bots are necessary for the site to work properly and to be indexed by search engines, for example.

And you, what do you think?