#crawlers — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #crawlers, aggregated by home.social.
-
Spider-Horror Pic ‘Crawlers’ Acquired By Roadside Attractions & Saban Films
#Acquisitions #News #Crawlers #RoadsideAttractions #SabanFilms
https://deadline.com/2026/04/crawlers-roadside-attractions-1236877140/
-
We're live!
Avali Week AND 420? Hell yeah!
$145 flat colors and $25 crawlers!
See ya there! -
Oh look even more crawlers lmao
-
#Development #Explainers
Inside Googlebot · How Google’s crawl system decides which content gets indexed https://ilo.im/16btho_____
#Business #Google #SearchEngine #SEO #Crawlers #Content #RobotsTxt #Development #WebDev #Frontend -
Quo Vadis, Crawlers? Progress and what’s next on safeguarding our infrastructure https://diff.wikimedia.org/2026/03/26/quo-vadis-crawlers-progress-and-whats-next-on-safeguarding-our-infrastructure/ #AI, #AIDataCrawlers, #Crawlers, #Infrastructure, #Knowledge, #KnowledgeAsAService, #Scraping, #ScrapingBots, #WebScraping, #WikimediaFoundation, #WikimediaProjects
-
#Development #Findings
Markdown, llms.txt, and AI crawlers · Do Markdown and llms.txt matter for your website? https://ilo.im/16b5qb_____
#Business #SEO #SearchEngines #AI #Crawlers #Content #Website #Markdown #LlmsTxt #RobotsTxt -
#Business #Reports
Anthropic details how Claude crawls sites · How to block the three separate user agents https://ilo.im/16ax7y_____
#AI #Claude #Crawlers #UserAgents #RobotsTxt #Content #Website #WebDev #Frontend #Backend -
#Development #Reports
Google lists Googlebot file limits · Do Google’s crawling limits affect your website? https://ilo.im/16adna_____
#Business #Google #SearchEngine #Crawlers #Googlebot #Files #HTML #PDF #WebDev #Frontend -
#Development #Challenges
Webspace invaders · Let’s level up our anti-AI scraping game! https://ilo.im/16ahl8_____
#AI #Crawlers #RobotsTxt #RateLimiting #WAFs #Cloudflare #IndieWeb #WebDev #Frontend #Backend -
🤖 The Confederation of Open Access Repositories (COAR) has just announced the new Dealing with AI Bots website (https://dealing-with-bots.coar-repositories.org/) that provides a wealth of information on bots and #crawlers impacting the services and operations of open #repositories, including mitigation strategies; see:
https://coar-repositories.org/news-updates/mitigating-the-impact-of-ai-bots/ -
The New York Times sues Perplexity for producing ‘verbatim’ copies of its work – The Verge
The NYT alleges Perplexity ‘unlawfully crawls, scrapes, copies, and distributes’ work from its website.
by Emma Roth, Dec 5, 2025, 7:42 AM PST
The New York Times has escalated its legal battle against the AI startup Perplexity, as it’s now suing the AI “answer engine” for allegedly producing and profiting from responses that are “verbatim or substantially similar copies” of the publication’s work.
The lawsuit, filed in a New York federal court on Friday, claims Perplexity “unlawfully crawls, scrapes, copies, and distributes” content from the NYT. It comes after the outlet’s repeated demands for Perplexity to stop using content from its website, as the NYT sent cease-and-desist notices to the AI startup last year and most recently in July, according to the lawsuit. The Chicago Tribune also filed a copyright lawsuit against Perplexity on Thursday.
The New York Times sued OpenAI for copyright infringement in December 2023, and later inked a deal with Amazon, bringing its content to products like Alexa.
Perplexity became the subject of several lawsuits after reporting from Forbes and Wired revealed that the startup had been skirting websites’ paywalls to provide AI-generated summaries — and in some cases, copies — of their work. The NYT makes similar accusations in its lawsuit, stating that Perplexity’s crawlers “have intentionally ignored or evaded technical content protection measures,” such as the robots.txt file, which indicates the parts of a website crawlers can access.
Perplexity attempted to smooth things over by launching a program to share ad revenue with publishers last year, which it later expanded to include its Comet web browser in August.
Related
- Cloudflare says Perplexity’s AI bots are ‘stealth crawling’ blocked sites
- Perplexity is cutting checks to publishers following plagiarism accusations
“By copying The Times’s copyrighted content and creating substitutive output derived from its works, obviating the need for users to visit The Times’s website or purchase its newspaper, Perplexity is misappropriating substantial subscription, advertising, licensing, and affiliate revenue opportunities that belong rightfully and exclusively to The Times,” the lawsuit states.
Continue/Read Original Article Here: The New York Times sues Perplexity for producing ‘verbatim’ copies of its work | The Verge
Tags: AI, artificial intelligence, Copyright, Crawlers, Distribution, Lawsuit, NYT Work, OpenAI, Perplexity, Robots.txt, Scrapping, Sues, The New York Times, The Verge, Verbatim Copies
#AI #artificialIntelligence #Copyright #Crawlers #Distribution #Lawsuit #NYTWork #OpenAI #Perplexity #RobotsTxt #Scrapping #Sues #TheNewYorkTimes #TheVerge #VerbatimCopies
-
#Business #Initiatives
AI’s free web scraping days may be over · Say hello to RSS’s younger, tougher brother https://ilo.im/166s9q_____
#Web #Publishing #Website #Blog #Content #AI #Crawlers #Payments #RSL #RSS #RobotsTxt -
🤖🎉 Wow, #AI #crawlers are now the Indiana Jones of #Codeberg, fearlessly solving #Anubis #challenges while we mere mortals fumble with #JavaScript on Mastodon. 🙄 Clearly, the robots are one step closer to world domination, and we're still struggling to open our native apps. 📱💥
https://social.anoxinon.de/@Codeberg/115033790447125787 #WorldDomination #HackerNews #ngated -
#Development #Guides
Who owns your content in the AI age? · When AI bots take your content without consent https://ilo.im/165tej_____
#AI #Crawlers #Consent #Content #GitHub #AccessLogs #RobotsTxt #Design #WebDesign #WebDev -
I've been getting a lot of weird "domain as a URL" requests for flyriver.com. Decided to look it up and… I'm confused.
Their site says "Is your site being used by AI Generators?" but then it also says "Or, generate an exact article based on your query:"
The former seems like an "is your content getting ripped off?" question, but then the second bit is "hey, want to rip off other people's content?" 🧐
-
The initial problem is the aggressiveness of #LLM web #crawlers that don't respect "robots.txt". The first idea that comes to mind is IP #blocking. However, web crawlers have circumvented this restriction by using individual IPs via specialized #botnets.
Another solution is therefore to exhaust the resources of the harvesters. With a #zipbomb, we attempt to #exhaust their #RAM.
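A minimal sketch of the zip-bomb idea described above (the function name and sizes are illustrative, not a hardened implementation): pre-compress a large run of zeros once; serving the result with a gzip Content-Encoding costs the server only a small payload, while a crawler that naively decompresses the response must allocate the full uncompressed size in RAM.

```python
import gzip
import io

def make_gzip_bomb(uncompressed_mb: int = 100) -> bytes:
    """Return a gzip stream that inflates to ~uncompressed_mb MiB of zeros."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
        chunk = b"\0" * (1024 * 1024)  # 1 MiB of zeros compresses extremely well
        for _ in range(uncompressed_mb):
            gz.write(chunk)
    return buf.getvalue()

# The compressed payload is orders of magnitude smaller than what it inflates to.
bomb = make_gzip_bomb(100)
print(f"compressed payload: {len(bomb)} bytes for ~100 MiB uncompressed")
```

Served only to flagged bot user agents (never to real visitors), this shifts the cost asymmetry back onto the harvester; well-behaved clients that enforce decompression size limits are largely unaffected.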
-
#Development #Trends
Who’s crawling your site in 2025 · The most active and blocked bots and crawlers https://ilo.im/1652mx_____
#Bots #Crawlers #Website #Business #SEO #UserAgents #RobotsTxt #WebDev #Frontend #Backend -
#Business #Launches
Cloudflare Radar’s crawl-to-refer ratio · Monitor AI crawls vs. real website traffic sent https://ilo.im/1652pb_____
#SEO #AnswerEngines #AI #Crawlers #Referrals #Content #Website #Development #WebDev #Frontend -
Another tool for our arsenal of protections against adversarial bots:
-
#Business #Introductions
Meet LLMs.txt · A proposed standard for AI website content crawling https://ilo.im/16318s_____
#SEO #GEO #AI #Bots #Crawlers #LlmsTxt #RobotsTxt #Development #WebDev #Backend -
Anubis: Proof-of-work proxy to prevent AI crawlers
#HackerNews #Anubis #ProofOfWork #AI #Crawlers #TechNews #Blockchain
-
A sysadmin's rant about feed readers and crawlers (2022) — http://rachelbythebay.com/w/2022/03/07/get/
#HackerNews #sysadmin #rant #feedreaders #crawlers #technology #2022 -
Hey does anyone know if there's still a working zip bomb style exploit that can be deployed on a static site/JS (or as a asset/resource)? Specifically to target web scrapers and AI bullshit? The second any server goes online now it's immediately bombarded by stupid numbers of requests.
#hacking #aislop #crawlers #webscraping #webcrawler #robots #zipbomb #zipbombing #exploit #robotstxt #server #scraper
-
Film Review: The Descent
Link: https://film-book.com/review-the-descent/?fsp_sid=11490
#Crawlers #filmreview #FilmReviewTheDescent #Horror #horrormovie #MovieReview #NatalieJacksonMendoza #NeilMarshall #OliverMilburn #ShaunaMacDonald #TheDescent #TheDescentfilmreview #TheDescentmoviereview #TheDescentreview
-
It looks like LLM-producing companies that are massively #crawling the #web require the owners of a website to take action to opt out. While I am not intrinsically against #generativeai and the acquisition of #opendata, reading about hundreds of dollars of rising #cloud costs for hobby projects is quite concerning. How is it acceptable that hypergiants skyrocket the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? Projects that run on a shoestring budget and are operated by volunteers who dedicate hundreds of hours without any reward other than believing in their mission?
I am mostly concerned about opting out being the default. Are the owners of those projects really required to take action? Seriously? As an #operator, is it my responsibility to methodically work through the crawling documentation of hundreds of #LLM #web #crawlers? Am I the one responsible for configuring a unique crawling specification in my robots.txt because hypergiants make it immensely hard to have generic #opt-out configurations that target LLM projects specifically?
I refuse to accept that this is our new norm. A norm in which hypergiants not only methodically exploit the work of thousands of individuals for their own benefit without returning a penny, but in which the resource owner is also required to prevent these crawlers from skyrocketing one's own operational costs.
We need a new #opt-in. Often, public and open projects are keen to share their data. They just don't like the idea of carrying the unpredictable, multitudinous financial burden of sharing it without notice from said crawlers. Even #CommonCrawl has fail-safe mechanisms to reduce the burden on website owners. Why are LLM crawlers above the guidelines of good #Internet citizenship?
To counter the most common argument already: yes, you can deny by default in your robots.txt, but that excludes any non-mainstream crawler, too.
Some concerning #news articles on the topic:
-
If you have established websites and wish to isolate them from Google and other search-engine crawlers, place a text file named robots.txt with the following contents in the root directory of your website:
User-agent: *
Disallow: /
The first line targets all crawlers, and the second line disallows every file under the root directory.
It is convenient because only two lines are needed to reject all crawlers, but leaving this in place blocks every crawler, which may harm search indexing and other aspects of the site. If there are adverse effects, you can narrow down the list of crawlers to reject and define them individually.
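For instance, a narrower robots.txt could reject only specific AI crawlers while permitting everything else. The user-agent tokens below (GPTBot, CCBot) are examples published by their operators; verify the current names in each crawler's documentation before relying on them:

```text
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```

Each User-agent group applies only to the named crawler, so search-engine bots that match the final wildcard group remain unaffected.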
WordPress robots.txt can be edited safely using the following plug-ins without placing the text file directly in the root directory:
By the way, there is something I have been thinking about for a while that I would like to realize, so I will write it here.
Cloudflare Tunnel is a revolutionary mechanism that allows you to securely expose your servers from behind your router rather than from the DMZ; Cloudflare One/Cloudflare WARP combines a gateway for securely accessing that content on the go with a WireGuard client with MDM capabilities to connect to it.
For a long time, I have wanted to use Cloudflare Tunnel and Cloudflare WARP on the same server simultaneously, but that has been challenging to achieve. Sometimes it degrades the Tunnel, or the server is unreachable even when the Tunnel port is open.
I did a Google search and found the following exchange on a thread in the Cloudflare Community:
I have ‘degraded status’ when using WARP with Zero Trust
“Just open the specified port,” the Cloudflare staff responded, but that didn’t work for me.
Now, I asked ChatGPT about it. Its answer:
Using Cloudflare Tunnel and Cloudflare WARP on the same server can indeed be a bit challenging due to potential conflicts in network configurations and routing. However, it’s not impossible to set them up together. Here are some tips and considerations to help you achieve this:
- Port Conflicts:
- Make sure that the ports used by Cloudflare Tunnel and Cloudflare WARP do not conflict. Check the default ports used by each service and configure them accordingly. You might need to adjust the settings for one of the services to use a different set of ports.
- Routing Issues:
- Ensure that there are no conflicting routing rules between the two services. This can be particularly tricky because both services might try to manipulate the server’s network stack. Carefully review the routing tables to avoid conflicts.
- Firewall Settings:
- Check your server’s firewall settings to ensure that it allows traffic for both Cloudflare Tunnel and Cloudflare WARP. If you are using iptables or another firewall tool, make sure that the necessary ports are open and traffic is allowed.
- Concurrency and Resources:
- Consider the resource requirements of both services. Running both services simultaneously may consume a significant amount of system resources (CPU, memory, etc.). Ensure that your server has enough resources to handle the load. You can monitor resource usage using tools like top or htop.
- Logging and Debugging:
- Enable logging for both Cloudflare Tunnel and Cloudflare WARP. Examine the logs for any error messages or warnings that might indicate conflicts or issues. This can help you pinpoint the source of the problem.
- Configuration Order:
- Try starting the services in a specific order. For example, start one service first and ensure it’s working correctly before starting the second one. The order in which services are initialized can sometimes affect their behavior.
- Documentation and Support:
- Refer to the official documentation of both Cloudflare Tunnel and Cloudflare WARP. Check if there are any specific guidelines or recommendations for running them together. Additionally, check the support forums or communities for both services to see if others have encountered similar issues.
- Separate Instances:
- As a last resort, you might consider running Cloudflare Tunnel and Cloudflare WARP on separate instances or virtual machines. This ensures complete isolation and avoids potential conflicts.
Remember to make backups or snapshots of your server before making significant changes, and proceed with caution, especially if the server is in a production environment. If the issues persist, reaching out to Cloudflare support may provide additional assistance tailored to your specific setup.
If you ask a professional engineer, you will get this answer. But it’s the answer for someone who has never touched Cloudflare.
Does anyone know a countermeasure against this “degraded” status?
https://kotaromiyasaka.com/search-engine-rejection-by-robots-txt-and-cloudflare-tunnel-failure/
#chatgpt #cloudflare #crawlers #degraded #disallow #firefish #google #mdm #one #plugin #robotsTxt #server #tunnel #warp #wireguard #wordpress #wprobotstxt #zerotrust