home.social

#crawler — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #crawler, aggregated by home.social.

  1. So, with Google announcing "Search is going full-AI, we won't be sending traffic to the original sites any more", someone else pointed out that this eradication of the traditional search-engine compact - we let you crawl our sites to create your index, and you send visitors to our sites when relevant - means that we can, and should, block all of Google's crawlers now. If they're going to just take, take, take and give nothing back, why let them access your content at all?

    But this is cute. Besides the fact that Google documents that some of their crawlers ignore robots.txt, there's this bit of fun. On this page (developers.google.com/crawling), they link to "the Google list of user agents" (developers.google.com/crawling).

    However, that links to 3 separate pages of them, and *each of those pages explicitly states that it is not comprehensive, listing only the ones they commonly get questions about*. And of course, none of the "User-triggered fetchers" obey robots.txt, along with some others.

    So Google isn't even reporting the full list of user-agents that can be used to stop their crawling.

    That is some bullshit.

    #Google #crawler #RobotsTxt #UserAgent #bullshit #antisocial #web #search #WebSearch #LLM #AI
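
For anyone who wants to try the blocking approach described in this post, here is a minimal sketch in Python. It builds a robots.txt that disallows a handful of publicly documented Google crawler tokens and checks the result with the standard-library parser. The token list is illustrative and, as the post notes, cannot be assumed complete. The helper names `build_robots_txt` and `is_blocked` and the example.org URL are made up for this example. robots.txt is advisory, so fetchers that ignore it are unaffected.

```python
from urllib.robotparser import RobotFileParser

# Illustrative, NOT comprehensive: a few crawler tokens Google documents publicly.
GOOGLE_TOKENS = [
    "Googlebot",
    "Googlebot-Image",
    "Google-Extended",
    "GoogleOther",
    "AdsBot-Google",
]


def build_robots_txt(tokens):
    """Return robots.txt text that disallows the whole site for each token."""
    groups = [f"User-agent: {token}\nDisallow: /\n" for token in tokens]
    return "\n".join(groups)


def is_blocked(robots_txt, user_agent, url="https://example.org/any/page"):
    """Check a product token (not a full User-Agent header) against the rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(GOOGLE_TOKENS)
print(robots_txt)
print(is_blocked(robots_txt, "Googlebot"))
print(is_blocked(robots_txt, "SomeOtherBot"))
```

Crawlers whose token is not listed stay allowed, which is exactly the gap the post complains about: a token that is not published cannot be named in the file.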

  6. Added github.com/laylavish/uBlockOri to PriEco

    PriEco will no longer build results from pages that are clearly slop

    Our fight against AI doesn't end here, and we are figuring out better ways to handle them

  7. Welcome to the future, where AI agents hunt down alleged online copyright infringement

    As readers of this blog have doubtless noticed, the latest hot tech – and investment – area involves “agentic AI”, where AI systems are allowed to operate autonomously on allocated tasks. There’s no doubt there are some exciting possibilities here, as well as some troubling issues concerning lack of control. It’s a rapidly-evolving area of research and experimentation, which makes […]

    #agenticAi #agents #ai #ceaseAndDesist #crawler #digitalWatermarks #infringement #licensing #llms #patents #pricing #takedowns #universalMusicGroup walledculture.org/welcome-to-t
  12. To all who use their service: This may be a quick fix for your woes. But you're not going to like the future they usher in. Your own will ask you one day why you ceded the control of this wonderful public resource to the likes of CF.

    [3/4]

  13. If I'm visiting a site from a country that you don't expect me to be from, does that mean that I'm not a human being interested in the content? Your solution to the AI vacuum cleaner is to arbitrarily blanket ban the IP blocks we're in? Why are we denied the full benefits of the internet because of your incompetence and/or unwillingness to solve the issue technically?

    [2/4]

  14. Found out about a project with millions of randomly generated links. The author explained how #Facebook's scraping bot hit its pages 38 million times. All while the company itself claims that their bot only crawls pages that are shared on their platforms.

    Why is there so much dishonesty in some hyperscaling tech companies?

    Other crawlers are also listed in a short write-up by the author.

    #infosec #crawler #llm #meta
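
The post does not say how the project generates its links, so the following is only a sketch of one way such a link maze could work, not the author's implementation. Each page derives its outgoing links deterministically from its own path, so the site looks stable to a crawler while nothing is stored. The function name `maze_page` and the `links_per_page` parameter are hypothetical.

```python
import hashlib
import random


def maze_page(path, links_per_page=5):
    """Return a small HTML page whose links are derived from the path alone."""
    # Seed a private RNG from the path so the same URL always yields the same page.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    base = path.rstrip("/")
    links = [f"{base}/{rng.getrandbits(48):012x}" for _ in range(links_per_page)]
    items = "\n".join(f'<li><a href="{link}">{link}</a></li>' for link in links)
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


print(maze_page("/maze"))
```

Because every generated link leads to another generated page, the number of reachable URLs is effectively unbounded, which is consistent with a single bot racking up millions of hits.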

  15. Who do you think you are?

    47.128.32.0 - - [18/Mar/2026:00:48:01 +0100] "GET /robots.txt HTTP/1.1" 403 239 "-" "-" 1650 4269

    #Amazon #AWS Singapore.

    Good on you that #CrowdSec won't immediately block on a missing user-agent, but my httpd-ACL does.

    #DarkVisitors #AI #Crawler #GenAI #SocialPermissionToBurnEnergy
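
As a rough illustration of the check described above, this Python sketch parses a log line in the format shown and flags requests with no user-agent. The regular expression assumes the common "combined" layout for the leading fields. The two trailing numbers in the sample line are not explained in the post and are ignored here. The function names are made up for the example.

```python
import re

# Leading fields of a "combined"-style access log line; anything after the
# quoted user-agent (such as the two trailing numbers in the sample) is ignored.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def has_missing_user_agent(line):
    """True when the user-agent field is empty or logged as '-'."""
    entry = parse_line(line)
    return entry is not None and entry["user_agent"] in ("", "-")


sample = (
    '47.128.32.0 - - [18/Mar/2026:00:48:01 +0100] '
    '"GET /robots.txt HTTP/1.1" 403 239 "-" "-" 1650 4269'
)
print(parse_line(sample)["ip"], has_missing_user_agent(sample))
```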

  16. :ablobcatheartsqueeze: I have been running iocaine on my server for a week now. During this time, 7,076,701 requests have passed through iocaine, 3,312,318 of which were identified as AI crawlers/bots. 3,741,577 requests came from crawlers/bots that got stuck in iocaine's deadly maze, consuming an infinite amount of poisoned garbage. Furthermore, 972 crawlers/bots were detected that were routed into the maze via major browsers.

    All of this is managed by iocaine with just ~80 MB of memory and ~0.1% direct CPU usage. Now that’s what I call efficient! Well done, @algernon.

    Let's fight back against AI crawlers and bots. Thanks to projects like iocaine, this is entirely possible, not just theory :blobcat_thisisfine:

    #iocaine #ai #llm #FckAI #FckLLMs #selfhosting #crawler #bots
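
To put the figures in this post into proportion, here is a small Python calculation of each count as a share of all requests. The numbers are taken from the post. The variable names are made up, and the post does not say whether the two categories overlap, so the shares are reported separately.

```python
total_requests = 7_076_701
identified_as_ai = 3_312_318
stuck_in_maze = 3_741_577


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


print(share(identified_as_ai, total_requests))
print(share(stuck_in_maze, total_requests))
```

That works out to roughly 46.8% of requests identified as AI crawlers or bots, and roughly 52.9% ending up in the maze.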

  17. I have just installed iocaine 3.2.0 by @algernon and have already started successfully serving poisoned garbage to the AI agents. I love it! I especially like how simple the setup was, and how easy it was to expand my existing Caddyfile. My monthly donation is set up too. What a great project!

    #iocaine #ai #llm #FckAI #FckLLMs #bot #crawler

  18. Hello, dear Fedinauts here on anonsys.net. Effective immediately, this instance is protected against AI crawlers. They are blocked or banned.

    Thanks to @rainer for the tip. I have now enabled it on anonsys.net and have the filter updated once a day.

    What is damn interesting is that about 10 minutes after the filter was activated, 128 AI crawlers had already been banned:

    Status for the jail: apache-ai-crawler
    |- Filter
    |  |- Currently failed: 0
    |  |- Total failed:     33
    |  `- File list:        /var/log/apache2/useragent.log
    `- Actions
       |- Currently banned: 128
       |- Total banned:     128
       `- Banned IP list:   100.28.204.82 100.29.160.53 107.20.181.148 119.28.140.106 18.207.89.138 18.214.124.6 18.215.24.66 18.215.49.176 18.232.11.247 18.235.158.19 184.73.167.217 184.73.239.35 216.73.216.43 23.21.179.120 23.21.225.190 23.21.227.240 23.21.228.180 23.23.99.55 3.209.174.110 3.212.205.90 3.212.86.97 3.220.148.166 3.221.244.28 3.222.190.107 3.93.211.16 3.93.253.174 34.192.67.98 34.195.248.30 34.205.163.103 34.225.138.57 34.226.89.140 34.227.234.246 34.230.124.21 34.231.45.47 35.169.102.85 35.169.119.108 35.171.117.160 43.130.101.151 43.130.116.87 43.130.26.3 43.134.186.61 43.135.115.233 43.153.192.98 43.154.140.188 43.154.250.181 43.155.157.239 43.157.20.63 43.157.46.118 43.164.195.17 43.164.196.57 43.164.197.224 43.165.135.242 43.165.189.206 43.166.128.86 43.166.242.189 43.166.244.66 44.194.134.53 44.205.74.196 44.209.35.147 44.210.213.220 44.213.202.136 44.217.255.167 44.220.2.97 44.221.105.234 44.223.116.180 47.128.112.235 47.128.112.241 47.128.63.217 49.51.166.228 50.19.102.70 52.0.63.151 52.2.4.213 52.201.155.215 52.203.237.170 52.4.229.9 52.5.232.250 52.54.157.23 52.6.97.88 52.70.123.241 54.145.82.217 54.147.80.137 54.157.84.74 54.159.18.27 54.235.172.108 54.83.23.103 54.83.240.58 54.83.56.1 66.249.68.128 66.249.68.130 98.82.38.120 98.82.63.147 98.82.66.172 98.83.10.183 98.83.8.142 98.84.60.17 18.208.11.93 18.214.238.178 3.218.35.239 44.212.131.50 54.157.99.244 3.230.69.161 18.235.81.246 52.203.152.231 35.173.38.202 3.232.82.72 34.193.2.57 54.166.126.132 3.225.9.97 98.82.39.241 98.84.200.43 3.94.156.104 44.223.115.10 43.163.104.54 43.157.22.109 43.130.131.18 43.131.26.226 49.51.132.100 50.16.248.61 43.155.162.41 52.203.68.145 54.89.90.224 34.236.185.101 52.200.251.20 43.166.224.244 98.82.107.102 129.226.174.80 18.205.213.231 34.204.150.196

    More are added every minute. This is really insane! 😳

    Source: rainer.sokoll.com/?p=8353

    #anonsys.net #friendica #fedinauten #ai #ki #crawler
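
The jail above reads a user-agent log and bans matching clients. The actual filter and log format are not shown in the post, so this Python sketch only illustrates the general idea under stated assumptions: each log line is assumed to be `<ip> <user-agent>`, and the pattern lists a few well-known AI crawler tokens purely as examples, not the list used by the linked filter. The function name `ips_to_ban` is hypothetical.

```python
import re

# Example tokens only; a real filter would use a maintained, much longer list.
AI_CRAWLER_PATTERN = re.compile(
    r"GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot", re.IGNORECASE
)


def ips_to_ban(lines):
    """Return the sorted, de-duplicated IPs whose user-agent matches the pattern."""
    banned = set()
    for line in lines:
        # Assumed layout: the IP, one space, then the raw user-agent string.
        ip, _, user_agent = line.strip().partition(" ")
        if ip and AI_CRAWLER_PATTERN.search(user_agent):
            banned.add(ip)
    return sorted(banned)


sample = [
    "192.0.2.10 Mozilla/5.0 (compatible; GPTBot/1.0)",
    "192.0.2.11 Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0",
    "192.0.2.12 Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "192.0.2.10 Mozilla/5.0 (compatible; GPTBot/1.0)",
]
print(ips_to_ban(sample))
```

Matching on the user-agent only catches crawlers that identify themselves, so it complements rather than replaces address-based or behaviour-based blocking.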

  19. I am looking for a nice tool that I could run on my home server to poison my internet usage pattern.

    So far I could only find some outdated projects...

    Do you have any recommendations?

    #diday #crawler #selfhosting #anonymity #advertising

  20. How AI is redefining the way we find content

    The way people find information online is changing fast. As artificial intelligence (AI) becomes a core part of how users discover content, your content has to work harder and smarter to be seen.

    clearleft.com/thinking/how-ai-

    #Crawler #Information #Inhalt #KI #KIBots #KünstlicheIntelligenz #SEO #GEO #Suche #Suchmaschine #Optimieren

  21. Web analytics is a tricky business. Your customers’ problems are your problems - nowhere is this clearer than when a DDoS attack or an overzealous crawler comes knocking.

    At first, the traffic looks normal: requests flood in, metrics tick upward, and everything seems fine.

    But then the pattern emerges - something off.

    wideangle.co/blog/when-a-crawl

    #webanalytyics #crawler #bot #ai #web

  22. Robots.txt Generator - Retro Terminal Edition - more than 200 bots in the free version. Pure HTML, JavaScript and a bit of CSS. No third parties, no framework, no CDN, no cookies, no tracking, no ads, no Big Tech nonsense, no AI, very privacy-friendly. Simple and effective, in retro style. Coming online soon.

    #teufelswerk #HTML #javascript #app #entwicklung #code #retro #css #robotstxt #generator #stopbots #bots #crawler #scraper #keineKI #cookieless #datenschutz
