#crawler — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #crawler, aggregated by home.social.
-
Welcome to the future, where AI agents hunt down alleged online copyright infringement
As readers of this blog have doubtless noticed, the latest hot tech – and investment – area involves “agentic AI”, where AI systems are allowed to operate autonomously on allocated tasks. There’s no doubt there are some exciting possibilities here, as well as some troubling issues concerning lack of control. It’s a rapidly-evolving area of research and experimentation, which makes […]
#agenticAi #agents #ai #ceaseAndDesist #crawler #digitalWatermarks #infringement #licensing #llms #patents #pricing #takedowns #universalMusicGroup https://walledculture.org/welcome-to-the-future-where-ai-agents-hunt-down-alleged-online-copyright-infringement/ -
📬 Understanding Google rankings: what’s behind the search results
#Empfehlungen #Internet #Absprungrate #Crawler #GoogleRanking #Keywords #SEO #Suchmaschinen #URLStrukturen https://sc.tarnkappe.info/d8af41 -
Who do you think you are?
47.128.32.0 - - [18/Mar/2026:00:48:01 +0100] "GET /robots.txt HTTP/1.1" 403 239 "-" "-" 1650 4269
Good on you that #CrowdSec won't immediately block on a missing user-agent, but my httpd-ACL does.
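For reference, a minimal sketch of such an httpd ACL, assuming Apache 2.4 with mod_authz_core (the poster's actual rule may differ):

    # Deny any request that arrives without a User-Agent header.
    <If "-z %{HTTP_USER_AGENT}">
        Require all denied
    </If>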
#DarkVisitors #AI #Crawler #GenAI #SocialPermissionToBurnEnergy
-
:ablobcatheartsqueeze: I have been running iocaine on my server for a week now. During this time, 7,076,701 requests have passed through iocaine, 3,312,318 of which were identified as AI crawlers/bots. 3,741,577 requests came from crawlers/bots that got stuck in iocaine's deadly maze, consuming an infinite amount of poisoned garbage. Furthermore, 972 crawlers/bots posing as major browsers were detected and routed into the maze.
All of this is managed by iocaine with just ~80 MB of memory and ~0.1% direct CPU usage. Now that’s what I call efficient! Well done, @algernon.
Let's fight back against AI crawlers and bots. Thanks to projects like iocaine, this is entirely possible, not just theory :blobcat_thisisfine:
#iocaine #ai #llm #FckAI #FckLLMs #selfhosting #crawler #bots
-
I have just installed iocaine 3.2.0 by @algernon and have already started successfully serving poisoned garbage to the AI agents. I love it! I especially like how simple the setup was, and how easy it was to expand my existing Caddyfile. My monthly donation is set up too. What a great project!
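For readers curious what that kind of Caddyfile extension can look like, here is a hypothetical sketch (the matcher name, ports, and bot list are assumptions, not the poster's config or iocaine's documented setup):

    example.com {
        # Suspected AI crawlers get proxied to iocaine's local listener (port assumed).
        @aibots header_regexp User-Agent (?i)(GPTBot|CCBot|ClaudeBot|Bytespider)
        reverse_proxy @aibots localhost:42069
        # All other traffic reaches the real backend.
        reverse_proxy localhost:8080
    }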
-
Robots.txt Generator - Retro Terminal Edition - More than 200 bots in the free version. Pure HTML, JavaScript, and a bit of CSS. No third parties, no framework, no CDN, no cookies, no tracking, no ads, no Big Tech clutter, no AI, very privacy-friendly. Simple and effective in retro style. Coming online soon.
#teufelswerk #HTML #javascript #app #entwicklung #code #retro #css #robotstxt #generator #stopbots #bots #crawler #scraper #keineKI #cookieless #datenschutz
-
Crawler Preps for Entry into VAB for Artemis II Rollout Ops 🌑🧑‍🚀
#Artemis #ArtemisII #crawler #rollout
⏩ 7 new pictures from NASA (Image Library) https://commons.wikimedia.org/wiki/Special:ListFiles?limit=7&user=OptimusPrimeBot&ilshowall=1&offset=20260115010334
-
Toward an ever more fragile #web https://siecledigital.fr/2025/12/31/etude-cloudflare-2025-un-web-plus-vaste-plus-automatise-et-plus-fragile
#Bots alone are said to account for nearly 30% of global web traffic, with peaks capable of generating volumes comparable to DDoS attacks
#Googlebot is the dominant #crawler, with 4.5% of HTML requests
In 2025, the #smartphone has taken hold with around 43% of users worldwide, versus 57% for desktop computers. #Android largely dominates mobile traffic globally, while #iOS retains a strong position -
my #dwarffortress experience comes after #moria #roguelike #minecraft #rpg #dungeon #crawler #dnd #ascii #angband #shatteredpixel ... so far, after 8 hours, I sort of feel aimless and like maybe I made a bad purchase. Is there an "arcade" version of Dwarf Fortress to get started with? How does one engage with this #lore...?
-
#RSL 1.0 instead of robots.txt: a new standard for internet content | heise online https://www.heise.de/news/RSL-1-0-Standard-soll-Verwendung-von-Inhalten-regeln-11111422.html #searchengines #searchengine #ArtificialIntelligence #crawler #ReallySimpleLicensing #robotsTXT
-
When I want to check whether, in the date #Formatstring, the lowercase s stands for seconds and the uppercase S for milliseconds, I ask whichever #GPTbot is at hand (the one that doesn't say #Quota exceeded because I haven't identified myself with an #Api key). Why? The answer may well be in #Wikipedia, but it takes time to figure out in which article, list article, or sub-article. Wikipedia's #Suche (search) does use #ElasticSearch, but to actually reap the benefits of that powerful engine, hundreds of thousands of people would have had to tag Wikipedia's articles with keywords (#wikidata). Besides, something as practical as format strings may have been deemed unencyclopedic (#unenzyklopädisch) and removed.
On #Stackexchange I have to confirm several times that I am human, then I find a question that was closed unanswered as a duplicate (#Duplikat). Then two outdated ones that are wrong by now, then some whose link to the solution no longer works.
On #archive_org, archive.is, and #AnnasArchive I have to know the #URL of the article I'm looking for before I can search at all.
A #Suchmaschine (search engine) doesn't search. A search engine reads out the sitemap.xml files that website operators publish for the search engines' #crawler. So I find five-year-old articles on websites that haven't been maintained in five years, and articles at most a year old that don't answer my question but are listed in the #sitemap. The 100 websites that hold the right answer in an article two to four years old I never find, because those articles are no longer in the sitemap.
The GPT bots have scraped Wikipedia, Stack Exchange, archive.org, Anna's Archive, and every other website, ignoring #robots.txt and sitemaps along the way. I get the right answer, and faster than with any of the options above.
Or I search #Grokipedia. Grokipedia consists of one million static pages on #Cloudflare's #CDN, scraped from Wikipedia. Its search is a GPT bot, and 57 times better than Wikipedia's search.
@malteengeler @awinkler @evawolfangel @bkastl @Raymond @wikipedia
-
Living in the World of Bots
Not all clicks are equal
More on this at https://t3n.de/news/studie-anstieg-bot-traffic-seo-1710540/
a-fsa.de/d/3KH
Link to this page: https://www.a-fsa.de/de/articles/9305-20251008-leben-in-der-welt-der-bots.html
Link on the Tor network: http://a6pdp5vmmw4zm5tifrc3qo2pyz7mvnk4zzimpesnckvzinubzmioddad.onion/de/articles/9305-20251008-leben-in-der-welt-der-bots.html
Tags: #Bots #Crawler #Spider #Klicks #AI #KI #OpenAI #Meta #Web #Internet #Publisher #Inhalte #Verdienst #Zensur #Transparenz #Informationsfreiheit #Meinungsmonopol #Meinungsfreiheit #Pressefreiheit #Internetsperren #Verhaltensänderung -
→ Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
“We observed that #Perplexity [an AI-powered answer engine] uses not only their declared #user_agent, but also a generic browser intended to #impersonate Google Chrome on macOS when their declared crawler was blocked.”
“This activity was observed across tens of thousands of domains and millions of requests per day.”
-
@dgouttegattat - I believe you found evidence of a #GenerativeAI #crawler leveraging #residentialproxy infrastructure, like the one offered through #brightdata or competitors.
It even advertises itself for #AI crawling use cases: https://brightdata.com/
-
#IETF discusses measures against the onslaught of AI #Crawler traffic | heise online https://www.heise.de/news/Technische-Massnahmen-gegen-den-Ansturm-der-KI-Crawler-10497930.html #Webcrawler #ArtificialIntelligence
-
Blocking bots in the era of rampant LLM crawlers...
After “moving the wiki back onto a machine at home”, it became much easier to see the load on it (since it is currently the only site running there). (The original post shows monitorix weekly and monthly graphs here.) Since the move I have constantly seen crawler traffic sweeping the site. At first I didn't pay it much attention, but it kept getting worse (almost every bot speeds up as long as you can keep up), so I looked into options for blocking bots in Caddy. I adopted two approaches: one IP-based, the other User-Agent-based. The IP-based part uses caddy-defender, blocking all the common bot ranges (including cloud and VPS ranges): defender block { ranges aws azurepubliccloud…
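The excerpt cuts off before the User-Agent-based half; that part is typically just a matcher plus a response in the Caddyfile, along these lines (the bot list is an example, not the author's):

    # Reject requests whose User-Agent matches known LLM crawlers.
    @llmbots header_regexp User-Agent (?i)(GPTBot|CCBot|Bytespider|ClaudeBot)
    respond @llmbots 403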
#blocker #bot #caddy #crawler #defender #llm #php #web #wiki
-
Improved ways to operate a rude crawler
https://www.marginalia.nu/log/a_115_rude_crawler/
#HackerNews #Improved #Crawler #Rude #Crawler #Techniques #Web #Scraping #Automation
-
Meta's AI bot cannot be blocked by JavaScript detection. That is because Meta's AI bot runs a real web browser, just like a user. The script side of things runs on their servers - not your typical crawler.
#WebCrawler #Crawler #AI #ArtificialIntelligence #Meta -
#Development #Reports
Redirecting 404s to homepage? · Google’s Martin Splitt warns against it https://ilo.im/162p5g_____
#Business #Google #SearchEngine #Crawler #Bot #SEO #TechnicalSEO #Redirects #WebDev #Backend -
It looks like LLM-producing companies that are massively #crawling the #web require website owners to take action to opt out. While I am not intrinsically against #generativeai and the acquisition of #opendata, reading about hundreds of dollars of rising #cloud costs for hobby projects is quite concerning. How is it acceptable that hypergiants skyrocket the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? Projects that run on a shoestring budget and are operated by volunteers who dedicate hundreds of hours without any reward other than believing in their mission?
I am mostly concerned about opt-out being the default. Are the owners of those projects really required to take action? Seriously? As an #operator, it would be my responsibility to methodically work my way through the crawling documentation of hundreds of #LLM #web #crawlers? I am the one responsible for configuring a unique crawling specification in my robots.txt, because hypergiants make it immensely hard to have a generic #opt-out configuration that targets LLM projects specifically?
I refuse to accept that this is our new norm: a norm in which hypergiants not only methodically exploit the work of thousands of individuals for their own benefit without returning a penny, but in which the resource owner is also required to stop these crawlers from skyrocketing their own operational costs.
We need a new #opt-in. Often, public and open projects are keen to share their data. They just don't like the idea of carrying the unpredictable, multitudinous financial burden of these crawlers taking that data without notice. Even #CommonCrawl has fail-safe mechanisms to reduce the burden on website owners. Why are LLM crawlers above the guidelines of good #Internet citizenship?
To counter the most common argument already: yes, you can deny-by-default in your robots.txt, but that locks out every non-mainstream crawler, too.
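To make that trade-off concrete, this is what deny-by-default looks like in robots.txt (the re-allowed crawler is an arbitrary example):

    # Block every well-behaved crawler by default.
    User-agent: *
    Disallow: /

    # Re-allow individually trusted crawlers; anything not listed stays blocked.
    User-agent: Googlebot
    Disallow: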
Some concerning #news articles on the topic:
-
Here are some details on the .htaccess we use for minimizing the impact of the "AI" crawlers: https://wiki.openhumans.org/wiki/PersonalScienceWiki:Spam#.htaccess_User-Agent_blocks /cc @jascha
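The linked page has the full list; rules of that shape generally look like this sketch (bot names here are illustrative, not the wiki's actual set):

    RewriteEngine On
    # Return 403 Forbidden when the User-Agent matches a known AI crawler.
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Bytespider|ClaudeBot) [NC]
    RewriteRule .* - [F,L]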
-
#Business #Reports
Google warns of soft 404 errors · They can impact a website’s crawlability and indexing https://ilo.im/15zcu2_____
#Google #SearchEngine #SEO #Crawler #HTTP #StatusCode #Error404 #Development #WebDev #Backend -
Given all the recent updates to the #CROWler #gpt, I have decided to rename it to "The CROWler Support", as it can now provide support on everything, not just ruleset creation and debugging. The link has changed, so here is the new link for everyone. Enjoy, and happy content discovery development!
-
Released Tris - v1.3.1, added a killer feature: a "Clear" button to make clearing input easier on the mobile UI 🙂
Release notes: https://github.com/vmandic/tris-web-crawler/releases/tag/v1.3.1
I focused mostly on CSS animating a spider emoji to walk a web emoji! https://tris.fly.dev/
#indiedev #crawler #tris #triswebcrawler #seo #seotools #webcrawler #scraper #nodejs #release #dev #indieapp
-
I am so happy with the first web application I have developed on my own 🎉: Tris, a simple and free web crawler 🕸️ 🕷️ !
You can try it for free online: https://tris.fly.dev, limited to 3 parallel crawls and 100 links at a path depth of 3.
The next thing I will add is a text input to set a target domain hhh, now I am making it hard! 🙈
#node #nodejs #web #webcrawler #crawler #seo #datatools #webscraper #scraping #seotools #seotool #tris #triswebcrawler #webapp #indie #indiedev
-
Robots.txt, OpenAI’s GPTBot, Common Crawl’s CCBot: How to block AI crawlers from gathering text and images from your website: https://katharinabrunner.de/2023/08/robots-txt-openais-gptbot-common-crawls-ccbot-how-to-block-ai-crawlers-from-gathering-text-and-images-from-your-website/
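The core of the approach for the two bots named in the title is a pair of robots.txt groups (UA tokens as published by OpenAI and Common Crawl):

    # Opt out of OpenAI's GPTBot and Common Crawl's CCBot site-wide.
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /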
#ai #openAI #crawler #commoncrawl #ccbot #GPTBot #robotstxt #wordpress
-
offsec.tools - A vast collection of security tools
#CyberSecurity #osint #pentest #scanner #cve #vulnerabilities #burpsuite #endpoints #passwords #cloud #secrets #fuzzing #dns #ips #framework #network #directories #crawler #screenshots #git #cms #allinone #proxy #probing