#scrapers — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #scrapers, aggregated by home.social.
-
From the outside you can still notice some latency at times, possibly because the scraping attacks haven't stopped. On the internal network it flies, and the metrics show the servers are not under high load or demand; they're normal. The problem in that case would be that all those attacks the firewall is successfully blocking only get blocked once inside the network, so that traffic takes up space on the connection, leaving less net bandwidth for legitimate traffic... we'll see if things improve in the coming days #undernet #ataque #bots #scrapers #iabot #peertube
-
🎩🤖 Oh, look, another #GitHub hero has blessed us with a "groundbreaking" #tool to trap #AI #web #scrapers in a "poison pit." Because clearly, what we all need is a #digital Venus flytrap for code 😏. Meanwhile, GitHub's feature salad just keeps growing, because who doesn't love a good menu with more options than a diner? 🍔💻
https://github.com/austin-weeks/miasma #innovation #featureupdate #codinghumor #HackerNews #ngated
-
Miasma: A tool to trap AI web scrapers in an endless poison pit
https://github.com/austin-weeks/miasma
#HackerNews #Miasma #AI #web #scrapers #Endless #pit #Tech #innovation #Open #source
-
No outages in the latest Apache logs. However, there is plenty of suspicious activity.
The log has 16,033 lines.
Of these, 1,559 lines feature the "RecentChanges" function for my wikis, which is something regular users _might_ call up from time to time, but I suspect that #scrapers are the more likely culprits.
The vast majority of these requests come from a random assortment of IP addresses, and they usually end with something along the lines of:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
So yeah, "anonymous bot nets scraping the Interwebs for nefarious purposes" would be by first guess.
-
Army of Bots
For some months now I have had a simple detection against "bad" bots in place. Bots that scrape *everything* they find and very likely are vacuuming up all the content they get to feed the data grinders that train the LLMs of the world. Bots that not only ignore the "robots.txt" protocol, but actively see entries in the robots.txt file as an invitation to visit the contents that are listed there as "disallowed".
I always had a hunch that listing addresses in a publicly reachable text file and flagging them as "please stay out of there" wasn't the best idea, but well, it was the only thing we had back in the days when the only bots out there were the crawlers of the search engines.
(…) There are two important considerations when using /robots.txt:
robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
robotstxt.org
Now with all the content-sucking and scraping that the "AI" corporations let loose on the web, it is not unusual to have a massive spike in bot-related visits even in the personal-website space. And those scrapers are ruthless, they hammer the servers at high frequency and repeatedly, and are killing the web as we know it along the way.
(…) Many of these scrapers are so sophisticated that it is hard, or impossible, to detect them in action. They often ignore the websites’ programmatic pleas not to be scraped, and are known to hit the more fragile parts of a website repeatedly.
opendemocracy.net
I created a directory with a random name in the top-level of my website.
I then added this directory to the robots.txt file with a disallow. This directory is not linked anywhere. Its name is so random and cryptic that it is highly unlikely that a "name guessing" bot will find it (like those exploit-searching idiot scripts that hammer on "wp-admin" or "typo3" URLs even on sites that don't use WordPress or TYPO3…). Inside the directory is an index script that
a) sends me an email,
b) logs the visit with user-agent string and IP address, and
c) saves the data in a NoSQL db.
In front of my website I have a script that checks the current visitor's IP address against the NoSQL db and, if the IP matches, serves an HTTP 403 status.
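For illustration, a minimal sketch of that trap, with a flat JSON file standing in for the NoSQL store and the notification email left out; the paths and function names here are hypothetical, not the author's actual code:

```python
#!/usr/bin/env python3
# Hypothetical sketch of the hidden-directory trap described above.
# A JSON file stands in for the NoSQL store; the email step is omitted.
import datetime
import json
import pathlib

STORE = pathlib.Path("/var/lib/bot-trap/hits.json")  # placeholder location

def record_hit(ip: str, user_agent: str) -> None:
    """What the index script inside the hidden directory would do."""
    hits = json.loads(STORE.read_text()) if STORE.exists() else []
    hits.append({
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "ip": ip,
        "ua": user_agent,
    })
    STORE.write_text(json.dumps(hits, indent=2))

def is_trapped(ip: str) -> bool:
    """What the front script would check; a match means 'serve HTTP 403'."""
    if not STORE.exists():
        return False
    return any(hit["ip"] == ip for hit in json.loads(STORE.read_text()))
```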
Here's a best-of of the user-agent strings that recently "visited" my hidden dir:
PetalBot
Googlebot/2.1
Claude-SearchBot/1.0
Thinkbot/0 +In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.
That last one is superb, considering that this one alone appears several times in my log, of course with a different IP each time.
Plus, there's a load more that pretend to be "normal" web browsers, of course. 🙄
It is a crude, symbolic, fist-shaking-at-clouds kind of thing, especially compared to the things that Matthias Ott shared in his post, but it is better than nothing.
-
https://OpenStreetMap.org has been disrupted today. We're working to keep the site online while facing extreme load from anonymous scrapers spread across 100,000+ IP addresses. Please be patient while we mitigate and protect the service. #OpenStreetMap #DDoS #Scrapers #AI
-
Looks like those nasty AI scrapers cannot follow 30x redirects
#webmaster #scrapers #website
-
Any solutions to get more SERP results from Google? Any hacks/tricks? #BuildInPublic #scraping #scrapers #python
-
Posted some new blog articles during the weekend .. Now I saw a quite substantial spike in traffic that doesn't really look like normal human interaction ...
Grafana/Loki with some LogQL quickly revealed it. It's scrapers for the AI-slop generators hitting the webserver in bursts 🤦‍♂️
Time for some countermeasures 🙂
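The LogQL query itself isn't shown; as a rough stand-in, the same bursts show up if you bucket an access log per minute. A hypothetical sketch, assuming an Apache/Nginx combined log format and a placeholder "access.log" path:

```python
#!/usr/bin/env python3
# Hypothetical stand-in for the LogQL query: count requests per minute in a
# combined-format access log so that scraper bursts stand out.
import re
from collections import Counter

# Matches a timestamp like [10/Feb/2025:14:03:27 +0000], kept to the minute.
STAMP = re.compile(r"\[(\d+/\w+/\d+:\d+:\d+)")

buckets = Counter()
with open("access.log") as log:  # placeholder path
    for line in log:
        if m := STAMP.search(line):
            buckets[m.group(1)] += 1

# Print the ten busiest minutes; bursts show up as extreme outliers.
for minute, hits in sorted(buckets.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{hits:6d}  {minute}")
```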
-
Really excited to become a collaborator on #stegodon - it's such an exciting piece of software! I did however spend most of my morning countering #scrapers on lemmy.zip - the fun of hosting a site on the #fediverse, eh :)
-
We can't have nice things because of AI scrapers
https://blog.metabrainz.org/2025/12/11/we-cant-have-nice-things-because-of-ai-scrapers/
#HackerNews #AI #Scrapers #Technology #Ethics #Online #Community #Digital #Rights
-
#fediauthors if I start from the premise that LLMs scrape the web and copy/siphon off the content I create on my blog: how do I protect myself, keep sharing my ideas, and shield my content from this theft and from my writing being used to feed these AIs?
What solutions, what tools?
Should I simply stop creating? Are #Gemini capsules scraped too? If not, how long before they are?
Thanks
#scrapers #LLM #voleDeDonnees #commentFaire #blog
-
🤖🔒 A fox-led #crusade to bamboozle #AI #scrapers from a "Git forge" that sounds as mythical as it does unnecessary. 29 minutes of your life wasted on a convoluted game of hide-and-seek with bots, because apparently, that's the hill we're choosing to die on. 🦊💻
https://vulpinecitrus.info/blog/guarding-git-forge-ai-scrapers/ #Fox #GitForge #HideAndSeek #TechHumor #HackerNews #ngated
-
Guarding My Git Forge Against AI Scrapers
https://vulpinecitrus.info/blog/guarding-git-forge-ai-scrapers/
#HackerNews #GuardingMyGitForge #AI #Scrapers #Cybersecurity #OpenSource #DeveloperCommunity
-
Guarding My Git Forge Against AI Scrapers
https://vulpinecitrus.info/blog/guarding-git-forge-ai-scrapers/
#ycombinator #git_forge #forgejo #nginx #scrapers
-
#Development #Approaches
Rate-limiting requests with Nginx · An alternative approach to counter AI crawlers
https://ilo.im/168axr
_____
#RateLimiting #Nginx #WebServer #AI #Scrapers #RobotsTxt #DevOps #WebDev #Backend
-
AI scrapers request commented scripts
https://cryptography.dog/blog/AI-scrapers-request-commented-scripts/
#HackerNews #AI #scrapers #commented #scripts #technology #automation
-
Here's one way to deal with #AI #bots and #scrapers hammering our websites: set up a trap. I've hidden a link on my website that humans wouldn't click on, but scrapers would follow. I added the destination to my robots.txt so that well-behaving bots won't follow it. Now I can grep my web logs for hits to that trap and get a list of IP addresses of badly behaving bots. If we #crowdsource such a list of IPs (like with Crowdsec), we can collectively ban them.
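A minimal sketch of that grep step; the trap path, log source, and combined log format (client IP as the first field) are placeholders, not a specific setup:

```python
#!/usr/bin/env python3
# Hypothetical sketch: list client IPs that followed the hidden trap link.
# "/secret-trap/" is a placeholder; use the path your robots.txt disallows.
import sys

TRAP_PATH = "/secret-trap/"

ips = set()
for line in sys.stdin:            # e.g. python3 trap_ips.py < access.log
    if TRAP_PATH in line:
        ips.add(line.split()[0])  # combined log format: IP is field one

print("\n".join(sorted(ips)))
```

The sorted, deduplicated output can then feed a ban list or a crowdsourced feed.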
-
Made a little Astro integration to easily disallow known AI scrapers in your site’s `robots.txt`