home.social

#ai-crawler — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #ai-crawler, aggregated by home.social.

  1. Does anyone have tips on hardening a WordPress site running Podlove Publisher against #KiCrawler? Both HTTP headers and robots.txt are being ignored, and it becomes rather conspicuous when the entire podcast archive gets downloaded by "Chrome".
    #AiCrawler
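
    Since these bots ignore HTTP headers and robots.txt, blocking has to happen at the web server. A minimal sketch, assuming an nginx front end in front of WordPress (the User-Agent tokens are common AI-crawler examples, not a complete list, and won't catch bots that spoof "Chrome"):

    ```
    # http-level: flag requests whose User-Agent matches known AI crawlers
    map $http_user_agent $ai_crawler {
        default       0;
        ~*GPTBot      1;
        ~*ClaudeBot   1;
        ~*PetalBot    1;
        ~*Bytespider  1;
    }

    server {
        # ... existing WordPress/Podlove server block ...
        if ($ai_crawler) {
            return 403;
        }
    }
    ```

    For the spoofed-"Chrome" bulk downloads, per-IP rate limiting (nginx limit_req) or a proof-of-work gate like Anubis (see below) is the usual fallback.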

  2. CW: need help with self-hosting and YunoHost!

    hello! I'm setting up my server on YunoHost and I need help with two things:

    - Among other things, I'm going to host a public website. How do I keep AI scrapers off it? Is there something I can do with NGINX or fail2ban? At least to block the bulk of the problem :((

    - I have an SSD where YunoHost is installed, and I've added a hard drive. There's nothing on it yet, just an empty ext4 partition. I found a tutorial that's quite clear, but I'd like some clarification about the directories. I'd like my apps to run from the SSD (for the speed and so on) while media (for an XMPP server or file sharing, for example) lives on the hard drive. The tutorial mentions "/home/yunohost.app" for "heavy data of YunoHost applications" and "/home/yunohost.multimedia" for "heavy data shared between several applications", but that's still unclear to me.

    If anyone can enlighten, help, or guide me, I'd be very grateful!!! THANKS
    :boost_requested:

    #yunohost #autoHebergement #selfHost #nginx #fail2ban #aiscraper #aicrawler
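
    A minimal fail2ban sketch for the first question, assuming nginx with the default combined log format (the User-Agent tokens are illustrative, not exhaustive). A filter such as /etc/fail2ban/filter.d/ai-scrapers.conf:

    ```
    [Definition]
    # Ban hosts whose requests carry a known AI-crawler User-Agent
    failregex = ^<HOST> .* "[^"]*(GPTBot|ClaudeBot|Bytespider|PetalBot|CCBot)[^"]*"$
    ```

    And a jail entry in jail.local:

    ```
    [ai-scrapers]
    enabled  = true
    port     = http,https
    filter   = ai-scrapers
    logpath  = /var/log/nginx/access.log
    maxretry = 1
    bantime  = 86400
    ```

    This only bans bots that identify themselves; anonymous scrapers need rate limiting on top.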

  3. Bot tracking and streaming ads reshape marketing week four: Microsoft exposed AI crawler traffic while Netflix doubled advertising revenue and Meta completed Threads monetization during the fourth week of January 2026. ppc.land/bot-tracking-and-stre #MarketingTrends #AdTech #StreamingAds #AICrawler #DigitalMarketing

  4. @metin I moved from GitHub to a Gitea instance that I self-host on OVH, a European cloud provider.

    I also moved my .dev top-level domain from Google to EURid's .eu.

    I'm very happy with these changes. I no longer have to put up with GitHub's AI obsession every time I do something there.

    On top of that, I no longer use Cloudflare, and recently had to install Anubis (a web AI firewall) to stop being attacked by AI crawlers from OpenAI and other Big Tech companies that were scanning all the commits on my Gitea (public repositories, of course).

    #github #ovhcloud #eu #europe #eurid #gitea #aicrawler #cloudflare #selfhost #opensource #privacy #firewall
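
    For anyone wanting to replicate the Anubis setup: it sits as a proof-of-work reverse proxy between your edge and Gitea. A rough compose sketch; the image path and variable names follow the Anubis docs as I recall them, so check anubis.techaro.lol before relying on this:

    ```
    services:
      anubis:
        image: ghcr.io/techarohq/anubis:latest
        environment:
          BIND: ":8923"                  # where Anubis listens
          TARGET: "http://gitea:3000"    # the protected upstream
          DIFFICULTY: "4"                # proof-of-work cost
        ports:
          - "8923:8923"
      gitea:
        image: gitea/gitea:latest
    ```

    Your TLS terminator then proxies to port 8923 instead of straight to Gitea.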

  5. 🌐🌿 Sustainable web practices:

    Disallowing web crawlers entirely? Allowing only the 2-3 most sustainable web crawlers? Only getting visitors via direct recommendations? Is editing robots.txt enough?

    What do you think?

    #noBot #noBigTech #searchEngine #AICrawler #robotsTxt #sustainability #lowTech #solarPunk #slowWeb #smallWeb
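
    In robots.txt terms, the "allow only a chosen few" option looks like a default deny plus named exceptions; the two bots below are just examples of smaller, independent crawlers, not endorsements:

    ```
    # Block every crawler by default
    User-agent: *
    Disallow: /

    # Then permit a hand-picked few (an empty Disallow allows everything)
    User-agent: MojeekBot
    Disallow:

    User-agent: DuckDuckBot
    Disallow:
    ```

    robots.txt is advisory, though: it only restrains crawlers that choose to honor it, so it's the floor of a sustainable-web setup, not the ceiling.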

  6. Behold the AI bots that Cloudflare blocked from this blog

    I don’t like writing for free (social media blatantly excepted), so when I watched a panel at Web Summit in mid-November about the effect of AI-model crawlers on news-site revenue and the Pay Per Crawl initiative that Cloudflare was proposing as a solution, I had to take notes.

    A few weeks after I got home from Lisbon, I realized I could take action: while Pay Per Crawl remains in an invitation-only beta test, Cloudflare’s AI Crawl Control is open to the public and included in that Internet-infrastructure firm’s free tier. I also learned that it’s shockingly easy to add Cloudflare’s services to a WordPress.com blog.

    Crawl Control comes with a preset list of bots to block and bots to allow, grouped by type: “AI Assistant” bots that take action in response to user requests are fine; “AI Search” bots that support “AI-driven search experiences” are also okay (contrary to Cloudflare CEO Matthew Prince’s discussion of them in that Web Summit panel); “AI Crawler” bots that collect content for training AI models are not.

    I took a screenshot of this part of my Cloudflare dashboard at almost the same time each afternoon this week, and these are my totals:

    • Huawei’s PetalBot was the highest-volume AI crawler, with Cloudflare reporting 224 “unsuccessful” request attempts from that Chinese tech giant’s AI crawler (Cloudflare doesn’t take direct credit for blocking bots in this interface), followed by Anthropic’s Claude-SearchBot, with 165 unsuccessful requests.
    • Among AI assistants, the second-highest category by volume, OpenAI’s ChatGPT-User had 1,251 allowed requests, DuckDuckGo’s DuckAssistBot had 36 allowed, and Perplexity’s Perplexity-User had one unsuccessful request.
    • The top bot in AI search came from an unlikely place: Apple’s Applebot, with 734 allowed. OpenAI’s OAI-SearchBot was far behind, with 128 allowed requests, while Perplexity’s PerplexityBot had all eight request attempts fail.

    To put this in context, the top two search-engine crawlers posted far higher numbers. Google’s Googlebot somehow racked up a little over 20,000 allowed requests (more than 30 times the presumably-human traffic I see in my WordPress dashboard for the last five days) plus 23 failed requests. Microsoft’s Bingbot came in second with 3,003 allowed requests and two unsuccessful ones.

    As Cloudflare’s CEO complained in that Web Summit panel, Googlebot feeds into both Google’s traditional search and the AI Overview search results that Web publishers now blame for dangerous declines in their search traffic. There’s nothing I can do about that from this side of the screen except hope that Cloudflare’s Pay Per Crawl efforts and other advocacy efforts stir some rethinking at Google.

    But I can’t tell you how well Pay Per Crawl works, because almost three weeks after applying to join the private beta I’m still waiting for my invitation. I imagine I’ll be waiting much longer before an AI-crawler operator decides that my tiny contribution to the Web’s collective content is worth sending me some money.

    #AI #AIBot #AICrawlControl #AICrawler #Amazon #Applebot #Bingbot #ChatGPT #Cloudflare #Huawei #OpenAI #PayPerCrawl #Petalbot
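
    For sites without Cloudflare, the same split can be approximated in robots.txt using the user-agent tokens named above; this is advisory only, and an assumed sketch rather than Cloudflare's actual ruleset:

    ```
    # "AI Crawler" (training) bots: disallow
    User-agent: PetalBot
    Disallow: /

    User-agent: Claude-SearchBot
    Disallow: /

    # Assistant and AI-search bots the dashboard allowed: leave open
    User-agent: ChatGPT-User
    Disallow:

    User-agent: Applebot
    Disallow:
    ```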

  7. **Do AI Crawlers Threaten Publisher Revenue?**
    Ad-dependent publishers face revenue risk as AI crawlers scrape the web without generating human visits. Large companies are adopting solutions such as Senthor to block crawlers or charge for crawls; strong demand from smaller companies persists. Tags: #AIcrawler #DigitalMarketing #ContentStrategy

    reddit.com/r/SaaS/comments/1o5

  8. Anubis - Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers (anubis.techaro.lol)

    I'll give that a try. Maybe it can reduce the AI crawler mess on my servers a little.

    #ai #crawler #aicrawler #fckai #anubis
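
    The proof-of-work idea in a nutshell, as an illustrative Python sketch (not Anubis's actual challenge format): the client must find a nonce whose hash clears a difficulty bar before the server lets it through. One pageview is cheap; crawling at scale gets expensive.

    ```
    import hashlib
    import os

    def solve(challenge: bytes, difficulty: int) -> int:
        """Client side: find a nonce so that sha256(challenge + nonce)
        starts with `difficulty` zero bits."""
        nonce = 0
        while True:
            digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") >> (256 - difficulty) == 0:
                return nonce
            nonce += 1

    def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
        """Server side: a single hash check, regardless of difficulty."""
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") >> (256 - difficulty) == 0

    challenge = os.urandom(16)    # issued per visitor
    nonce = solve(challenge, 16)  # ~2**16 hashes of work on average
    assert verify(challenge, nonce, 16)
    ```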

  9. Took extra time today to install plugins dedicated to keeping various #ai crawlers away from my #wordpress #blog. If you don't want your WordPress (not .com) website annexed into AI overviews or search results, there are plugins like "Block AI Crawlers" you can install :)
    #artificialintelligence #MetaAI #chatgpt #noai #aicrawler

  10. Does anybody have a User Agent string for a current #Opera #OperaBrowser for me?

    Asking to fight an #AICrawler spam wave.
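
    For reference, current desktop Opera sends a Chromium-style User-Agent with an OPR/ token appended; the version numbers below are placeholders rather than a specific release:

    ```
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/<major>.0.0.0 Safari/537.36 OPR/<major>.0.0.0
    ```

    The OPR/ suffix is the reliable marker to match on.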

  11. More logfile analysis for MacPorts Trac today to fight the #AICrawler spambot wave… some 500-600 requests from PowerPCs running Mac OS X 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, and 10.12.

    I must have missed the memo from Apple on the extended OS support for PPC chips!
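
    The giveaway here: Mac OS X 10.6 (Snow Leopard) and later never shipped for PowerPC, so any "PPC Mac OS X 10.6+" User-Agent is fabricated. A small Python sketch over an assumed combined-format access log that surfaces the offenders:

    ```
    import re

    # PPC plus Mac OS X 10.6 or later is an impossible combination.
    IMPOSSIBLE = re.compile(r"PPC Mac OS X 10[._](?:[6-9]|1[0-9])")

    with open("access.log") as log:
        for line in log:
            if IMPOSSIBLE.search(line):
                print(line.split()[0])  # offending client IP
    ```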

  12. So according to the request statistics, since the last rotation of the access log file for the #MacPorts Trac this morning, there were:

    20.8k requests from IE 3
    20.9k requests from IE 4
    21.3k requests from IE 5
    43 requests from IE 6 and
    23 requests from IE 7

    These requests came from these Windows versions (roughly 4k per version): CE, 95, 98 (9.5k), NT 4, 2000, XP, NT 5.01(?!), Server 2003, Vista, 7, and 8.0.

    I'm sure none of those are AI crawler bots.

    #aicrawler #aicrawlers
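
    A quick way to produce this kind of breakdown yourself, assuming the combined log format where the User-Agent is the last quoted field:

    ```
    import re
    from collections import Counter

    UA_FIELD = re.compile(r'"([^"]*)"\s*$')  # last quoted field = User-Agent

    counts = Counter()
    with open("access.log") as log:
        for line in log:
            match = UA_FIELD.search(line)
            if not match:
                continue
            ua = match.group(1)
            msie = re.search(r"MSIE \d+", ua)  # coarse IE-version bucket
            counts[msie.group(0) if msie else ua.split("/")[0]] += 1

    for agent, n in counts.most_common(10):
        print(f"{n:7d}  {agent}")
    ```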

  13. Cool idea. Trap disrespectful crawlers in an infinite maze...

    Minotaur is waiting for them.

    #aicrawler #amazing

    via @clive

    saturation.social/@clive/11387
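
    The trick behind such tarpits, as a minimal illustrative sketch (not the linked project): every maze page is generated on the fly and links only to more maze pages, so a robots.txt-ignoring crawler wanders forever while humans, and polite bots, never enter.

    ```
    import hashlib
    from flask import Flask  # assumes Flask is installed

    app = Flask(__name__)

    def children(token: str, n: int = 5):
        """Derive n stable child paths from the current one, so the
        maze looks consistent on revisits but never terminates."""
        return [hashlib.sha256(f"{token}/{i}".encode()).hexdigest()[:12]
                for i in range(n)]

    @app.route("/maze/")
    @app.route("/maze/<token>")
    def maze(token: str = "entrance"):
        items = "".join(f'<li><a href="/maze/{c}">{c}</a></li>'
                        for c in children(token))
        return f"<html><body><ul>{items}</ul></body></html>"

    # List "Disallow: /maze/" in robots.txt: compliant crawlers stay
    # out, and only the disrespectful ones meet the Minotaur.
    if __name__ == "__main__":
        app.run()
    ```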
