Long story short: my VPS, which fronts my home servers over Tailscale, got hammered by thousands of requests per minute from Anthropic’s Claude AI crawler, all of them coming from different AWS IPs.
The VPS has a 1 TB monthly cap, but it’s still kinda shitty to eat huge spikes like today’s 13 GB in just a couple of minutes.
How do you deal with something like this?
I’m only really running a Caddy reverse proxy on the VPS, which forwards my home server’s services through Tailscale.
I’d really like to avoid solutions like Cloudflare, since they screw over CGNAT users all the time. I don’t think a WAF would help with this at all(?), but rate limiting on the reverse proxy might work (something like the sketch below).
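Rate limiting isn’t built into Caddy itself; you’d build it with a third-party plugin like mholt/caddy-ratelimit (via xcaddy). A rough sketch of what that could look like, with directive names as I remember them from that plugin’s README, so double-check before relying on it:

```
example.com {
    rate_limit {
        zone scrapers {
            key    {remote_host}
            events 60
            window 1m
        }
    }
    # hypothetical Tailscale backend address
    reverse_proxy 100.64.0.1:8080
}
```

One catch: since the requests come from thousands of different AWS IPs, keying on {remote_host} only throttles each IP individually. A static key (one global bucket) would cap total request volume instead, at the cost of also limiting real visitors.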
(The VPS has fail2ban and I’m using /etc/hosts.deny for manual blocking. There’s a WIP website on my root domain with a robots.txt that should be denying AWS bots as well…)
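For what it’s worth, robots.txt matches crawler user agents, not networks, so it can’t target “AWS bots” as such. Anthropic documents ClaudeBot as its crawler user agent (older docs also mention the anthropic-ai and Claude-Web tokens), so a sketch targeting those would look like this, assuming the bot honors robots.txt at all:

```
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /
```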
I’m still learning and would really appreciate any suggestions.
Build tar pits.
They want to reduce the bandwidth usage. Not increase it!
A good tarpit will reduce your bandwidth. Tarpits aren’t about shoving useless data at bots; they’re about responding as slowly as possible, keeping the bot connected for as long as possible while giving it nothing.
Endlessh accepts the connection and then… does nothing. It never even gets to the SSH handshake. It just very… slowly… sends… an endless preamble, until the bot gives up.
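To make the mechanism concrete, here’s a minimal Endlessh-style tarpit sketched in Go (the real Endlessh is a small C program; the port and timing here are arbitrary). It leans on the SSH protocol allowing arbitrary banner lines before the server’s version string, so clients sit waiting for a handshake that never comes:

```go
// tarpit.go: a minimal Endlessh-style SSH tarpit (illustrative sketch).
// RFC 4253 lets a server send arbitrary lines before its "SSH-2.0-..."
// version string, so clients keep reading, waiting for the handshake.
package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

func tarpit(conn net.Conn) {
	defer conn.Close()
	for {
		// One short junk line every 10 seconds: a trickle of bytes
		// that keeps the client hanging on indefinitely.
		line := fmt.Sprintf("%x\r\n", rand.Uint64())
		conn.SetWriteDeadline(time.Now().Add(30 * time.Second))
		if _, err := conn.Write([]byte(line)); err != nil {
			return // the bot finally gave up
		}
		time.Sleep(10 * time.Second)
	}
}

func main() {
	ln, err := net.Listen("tcp", ":2222") // run the real sshd elsewhere
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go tarpit(conn)
	}
}
```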
Fair. But I haven’t seen any anti-AI-scraper tarpits that do that. The ones I’ve seen mostly just pipe 10 MB of /dev/urandom at the client.
Also, I assume the programmers working at AI companies are not literally mentally deficient. They would certainly add
.timeout(10)
or whatever to their scrapers. They probably have something more dynamic than that.

There’s one I saw that gave the bot a long circular form to fill out or something, I can’t exactly remember.
Yeah, that’s a good one.
Ah, that’s where tuning comes in. Look at the logs, take the average timeout, and tune the tarpit to return a minimal payload: a tiny HTML page containing a single, slightly different URL back into the tarpit (something like the sketch below). Or, better yet, JavaScript that loads a single page of tarpit URLs very slowly. Bots have to be able to run JS, or else they’re missing half the content on the web. I’m sure someone has created a JS forkbomb.
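A minimal sketch of that drip-fed maze in Go (the /maze/ prefix and the timings are made up; the point is one self-referencing link per page, served a few bytes at a time):

```go
// maze.go: an HTML drip-feed tarpit sketch. Every page is a few dozen
// bytes with exactly one link deeper into the maze, flushed one byte
// at a time so the crawler stays connected for almost no bandwidth.
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"time"
)

func maze(w http.ResponseWriter, r *http.Request) {
	next := fmt.Sprintf("/maze/%x", rand.Uint64()) // a fresh URL every visit
	page := fmt.Sprintf("<html><body><a href=%q>next</a></body></html>", next)

	flusher, ok := w.(http.Flusher)
	if !ok {
		return
	}
	w.Header().Set("Content-Type", "text/html")
	for i := 0; i < len(page); i++ {
		if _, err := w.Write([]byte{page[i]}); err != nil {
			return // the bot hung up; total cost: a handful of bytes
		}
		flusher.Flush()
		time.Sleep(2 * time.Second) // tune to sit just under the bots’ timeout
	}
}

func main() {
	http.HandleFunc("/maze/", maze)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Each full page takes minutes to arrive, so even a bot that waits it out pulls well under a kilobyte a minute.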
Variety is the spice of life. For SSH it’s easy: run the real sshd on a different port and a tarpit on the standard one, and it barely affects you. For web content, AI-bot blocklists are probably the better solution, because if you’re running a web server you presumably want visitors, and a tarpit is harder to set up so it catches only bots.
I see your point, but I think you underestimate the skill of coders. You make sure your timeout covers JavaScript run time too. Maybe set a memory limit as well. Imagine you wanted to scrape the internet: you could solve all these tarpits. Any capable coder could. Now imagine a team of 20 of the best coders money can buy, each paid €500,000. They can certainly do the same.
I see the appeal of running a tarpit. But I don’t see how they can “trap” anyone but script kiddies.
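To put it concretely, the whole countermeasure can be one field on the client (a sketch; the URL and the 10-second budget are arbitrary):

```go
// The counterargument in code: a hard deadline on the entire request
// (connect, redirects, and reading the body), so a slow-drip tarpit
// costs the scraper ten seconds, then it moves on.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("https://example.com/")
	if err != nil {
		fmt.Println("gave up:", err) // tarpit (or just a slow host), skip it
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```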
Bots will blacklist your IP if you make it hostile to them.
This will save you bandwidth.
Cool, lots of information provided!