home.social

#webcrawler — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #webcrawler, aggregated by home.social.

  1. Oh, this is #fun.

    #Applebot - Apple's web crawler, used for various things - is ignoring robots.txt rules governing crawling of websites.

    I have Applebot (and Applebot-Extended, which isn't really a crawler) in my robots.txt files, set to disallow all access. Has been that way for #yonks.
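
For reference, a robots.txt carrying the rules described above would look like this; a minimal sketch of the standard directive syntax, not the author's actual files:

```
User-agent: Applebot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

`Disallow: /` blocks the named agent from every path on the site.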

    And Applebot is consistently the highest-traffic crawler to my sites - at least of ones that actually bother to fetch robots.txt. Yesterday, for example, Applebot fetched robots.txt from one of my websites almost 800 times.

    Yes, it's really Apple, not someone faking the user-agent identifier. It's coming from the networks that Apple says can be used to identify Applebot access. DNS matches, everything.
    e.g. support.apple.com/en-ca/119829
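
The verification described above (reverse DNS on the crawler's IP, then a forward lookup to confirm the hostname maps back to that IP) can be sketched like this; the `.applebot.apple.com` suffix and all function names are assumptions drawn from Apple's published guidance, not code from the post:

```python
import socket

APPLEBOT_DOMAIN = ".applebot.apple.com"

def hostname_is_applebot(hostname: str) -> bool:
    # Apple documents that genuine Applebot hosts reverse-resolve
    # into the applebot.apple.com domain.
    return hostname.rstrip(".").lower().endswith(APPLEBOT_DOMAIN)

def verify_applebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-resolve
    the hostname and confirm it maps back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not hostname_is_applebot(hostname):
        return False
    try:
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except socket.gaierror:
        return False
    return ip in forward_ips
```

The same reverse-then-forward pattern works for Googlebot and Bingbot, with their respective documented domains swapped in.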

    So: legendary Apple software quality. Documented to do the right thing, but actually doing the wrong thing. And completely failing to cache content, fetching the same file 800 times a day when it hasn't changed in years.

    Hey, Apple! Need a software engineer who's actually, you know, good at it? I'm available.

    #Apple #AppleInc #TimApple #WebCrawler #RobotsTxt #quality #WeveHeardOfIt #qwality #AppleQwality #legendary #TwoHardThings #caching #fail #engineer #software #SoftwareEngineer

  2. 🚀 Spider v0.8.0

    New features include:

    "-file" to generate n-grams from local plaintext files

    "-timeout" for URL crawling

    "-sort" to output n-grams by frequency

    forum.hashpwn.net/post/52

    #spider #webcrawler #wordlist #ngram #infosec #hashcracking #golang #hashpwn
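
Not the tool's Go source, but the core idea behind the "-file" and "-sort" features (word n-grams from plaintext, output by frequency) can be sketched in a few lines of Python:

```python
from collections import Counter

def ngrams(words: list[str], n: int) -> list[str]:
    # Sliding window of n consecutive words, joined with spaces.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_frequencies(text: str, n: int) -> list[tuple[str, int]]:
    # Split plaintext into word n-grams and sort by frequency, most common first.
    words = text.split()
    return Counter(ngrams(words, n)).most_common()
```

For wordlist generation you would feed file contents through `ngram_frequencies` and write out one n-gram per line.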

  3. Hey, #webmasters ... just so you know.

    #Facebook's new-ish "meta-externalagent" #webcrawler, which they document as a tool for stealing data for their Grand Theft Autocomplete (cough #AI cough), is ignoring robots.txt on my websites.

    developers.facebook.com/docs/s

    Is anyone surprised?

    #Meta #LLM #scrape #web #copyright #RobotsTXT

  4. Released Tris v1.3.1, adding a killer feature: a "Clear" button to make clearing input easier in the mobile UI 🙂

    Release notes: github.com/vmandic/tris-web-cr

    I focused mostly on CSS-animating a spider emoji to walk across a web emoji! tris.fly.dev/

    #indiedev #crawler #tris #triswebcrawler #seo #seotools #webcrawler #scraper #nodejs #release #dev #indieapp

  8. Added some #CSS magic and media queries to make it more mobile-friendly for my #triswebcrawler!

    Check it: tris.fly.dev/

    Also, finally added a text input so you can enter whatever domain you like to crawl. :blobfoxhappy:

    #webcrawler #indie #indiedev #node #nodejs #flyio #seo #seotool #seotools #sem #design #devux #ui

  9. I am so happy with the first web application I have developed myself 🎉: Tris, a simple and free web crawler 🕸️ 🕷️!

    You can try it for free online: tris.fly.dev, limited to 3 parallel crawls and 100 links at a path depth of 3.

    The next thing I will add is a text input to set a target domain, haha, now I am making it hard for myself! 🙈

    #node #nodejs #web #webcrawler #crawler #seo #datatools #webscraper #scraping #seotools #seotool #tris #triswebcrawler #webapp #indie #indiedev
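
Parallelism aside, the two caps the post mentions (100 links, path depth of 3) amount to a bounded breadth-first traversal. A minimal single-threaded sketch with a pluggable `get_links` function; this is a hypothetical illustration, not Tris's actual code:

```python
from collections import deque

def crawl(start, get_links, max_links=100, max_depth=3):
    # Breadth-first crawl: visit at most max_links URLs, going no deeper
    # than max_depth hops from the start page.
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_links:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # at the depth cap, record the page but don't expand it
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

In a real crawler `get_links` would fetch the page and extract anchors; injecting it as a function keeps the traversal logic testable offline.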

  10. The #NewYorkTimes has blocked #OpenAI’s #webcrawler, meaning that OpenAI can’t use content from the publication to train its AI models. If you check the NYT’s robots.txt page, you can see that the NYT disallows #GPTBot, the crawler that OpenAI introduced earlier this month. Based on the #InternetArchive’s #WaybackMachine, it appears NYT blocked the crawler as early as August 17th. theverge.com/2023/8/21/2384070 #copyright #legalresearch
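
The kind of robots.txt check described can be reproduced with Python's standard-library parser; the file contents below are a guess at the relevant rule, not the NYT's actual robots.txt:

```python
from urllib import robotparser

# Hypothetical robots.txt mirroring the rule the post describes.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot is disallowed everywhere; agents with no matching rule default to allowed.
blocked = not rp.can_fetch("GPTBot", "https://example.com/article")
allowed = rp.can_fetch("SomeOtherBot", "https://example.com/article")
```

Against a live site you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `parse()`.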

  11. Web nerds, developers and content modelling types - please go follow @eaton and read about the extraordinary box of tricks he and Autogram have been working on for Web Analysis.

    Spidergram is honestly such an exciting tool, every time they showed me a bit of it I could think of a new use case and problem it would help with, and I vibrated in my chair a bit.

    #webcrawler #webdev #contentmodel #spidergram