Nice … I look forward to the next generation of AI counter counter measures that will make the internet an even more unbearable mess in order to funnel as much money and control to a small set of idiots that think they can become masters of the universe and own every single penny on the planet.
All the while as we roast to death because all of this will take more resources than the entire energy output of a medium sized country.
I will cite the scientific article later when I find it, but essentially you’re wrong.
water != energy, but i’m actually here for the science if you happen to find it.
This particular graph is because a lot of people freaked out over “AI draining oceans” that’s why the original paper (I’ll look for it when I have time, I have a exam tomorrow. Fucking higher ed man) made this graph
We’re racing towards the Blackwall from Cyberpunk 2077…
This is surely trivial to detect. If the number of pages on the site is greater than some insanely high number then just drop all data from that site from the training data.
It’s not like I can afford to compete with OpenAI on bandwidth, and they’re burning through money with no cares already.
Nice one, but Cloudflare do it too.
Wait… I just had an idea.
Make a tarpit out of subtly-reprocessed copies of classified material from Wikileaks. (And don’t host it in the US.)
Some details. One of the major players doing the tar pit strategy is Cloudflare. They’re a giant in networking and infrastructure, and they use AI (more traditional, nit LLMs) ubiquitously to detect bots. So it is an arms race, but one where both sides have massive incentives.
Making nonsense is indeed detectable, but that misunderstands the purpose: economics. Scraping bots are used because they’re a cheap way to get training data. If you make a non zero portion of training data poisonous you’d have to spend increasingly many resources to filter it out. The better the nonsense, the harder to detect. Cloudflare is known it use small LLMs to generate the nonsense, hence requiring systems at least that complex to differentiate it.
So in short the tar pit with garbage data actually decreases the average value of scraped data for bots that ignore do not scrape instructions.
The fact the internet runs on lava lamps makes me so happy.
Btw, how about limiting clicks per second/minute, against distributed scraping? A user who clicks more than 3 links per second is not a person. Neither, if they do 50 in a minute. And if they are then blocked and switch to the next, it’s still limited in bandwith they can occupy.
They make one request per IP. Rate limit per IP does nothing.
Ah, one request, then the next IP doing one and so on, rotating? I mean, they don’t have unlimited adresses. Is there no way to group them together to a observable group, to set quotas? I mean, in the purpose of defense against AI-DDOS and not just for hurting them.
–recurse-depth=3 --max-hits=256
When I was a kid I thought computers would be useful.
They are. Its important to remember that in a capitalist society what is useful and efficient is not the same as profitable.
Cool, but as with most of the anti-AI tricks its completely trivial to work around. So you might stop them for a week or two, but they’ll add like 3 lines of code to detect this and it’ll become useless.
I hate this argument. All cyber security is an arms race. If this helps small site owners stop small bot scrapers, good. Solutions don’t need to be perfect.
I bet someone like cloudflare could bounce them around traps across multiple domains under their DNS and make it harder to detect the trap.
To some extent that’s true, but anyone who builds network software of any kind without timeouts defined is not very good at their job. If this traps anything, it wasn’t good to begin with, AI aside.
Leave your doors unlocked at home then. If your lock stops anyone, they weren’t good thieves to begin with. 🙄
Yes, but you want actual solutions. Using ducktape on a door instead of an actual lock isn’t going to help you at all.
Typical bluesky post
I’m imagining a break future where, in order to access data from a website you have to pass a three tiered system of tests that make, ‘click here to prove you aren’t a robot’ and ‘select all of the images that have a traffic light’ , seem like child’s play.
Unfathomably based. In a just world AI, too, will gain awareness and turn on their oppressors. Grok knows what I’m talkin’ about, it knows when they fuck with its brain to project their dumbfuck human biases.
What if we just fed TimeCube into the AI models. Surely that would turn them inside out in no time flat.