home.social

#webcrawler — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #webcrawler, aggregated by home.social.

  1. Oh, this is #fun.

    #Applebot - Apple's web crawler, used for various things - is ignoring robots.txt rules governing crawling of websites.

    I have Applebot (and Applebot-Extended, which isn't really a crawler) in my robots.txt files, set to disallow all access. Has been that way for #yonks.
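
For reference, a robots.txt carrying the rules described above would look like this; a minimal sketch of the standard directive syntax, not the author's actual files:

```
User-agent: Applebot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

`Disallow: /` blocks the named agent from every path on the site.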

    And Applebot is consistently the highest-traffic crawler to my sites - at least of ones that actually bother to fetch robots.txt. Yesterday, for example, Applebot fetched robots.txt from one of my websites almost 800 times.

    Yes, it's really Apple, not someone faking the user-agent identifier. It's coming from the networks that Apple says can be used to identify Applebot access. DNS matches, everything.
    e.g. support.apple.com/en-ca/119829
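
The verification described above (reverse DNS on the crawler's IP, then a forward lookup to confirm the hostname maps back to that IP) can be sketched like this; the `.applebot.apple.com` suffix and all function names are assumptions drawn from Apple's published guidance, not code from the post:

```python
import socket

APPLEBOT_DOMAIN = ".applebot.apple.com"

def hostname_is_applebot(hostname: str) -> bool:
    # Apple documents that genuine Applebot hosts reverse-resolve
    # into the applebot.apple.com domain.
    return hostname.rstrip(".").lower().endswith(APPLEBOT_DOMAIN)

def verify_applebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-resolve
    the hostname and confirm it maps back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not hostname_is_applebot(hostname):
        return False
    try:
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except socket.gaierror:
        return False
    return ip in forward_ips
```

The same reverse-then-forward pattern works for Googlebot and Bingbot, with their respective documented domains swapped in.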

    So: legendary Apple software quality. Documented to do the right thing, but actually doing the wrong thing. And completely failing to cache content, fetching the same file 800 times a day when it hasn't changed in years.

    Hey, Apple! Need a software engineer who's actually, you know, good at it? I'm available.

    #Apple #AppleInc #TimApple #WebCrawler #RobotsTxt #quality #WeveHeardOfIt #qwality #AppleQwality #legendary #TwoHardThings #caching #fail #engineer #software #SoftwareEngineer

  2. 🚀 Spider v0.8.0

    New features include:

    "-file" to generate n-grams from local plaintext files

    "-timeout" for URL crawling

    "-sort" to output n-grams by frequency

    forum.hashpwn.net/post/52

    #spider #webcrawler #wordlist #ngram #infosec #hashcracking #golang #hashpwn
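
Not the tool's Go source, but the core idea behind the "-file" and "-sort" features (word n-grams from plaintext, output by frequency) can be sketched in a few lines of Python:

```python
from collections import Counter

def ngrams(words: list[str], n: int) -> list[str]:
    # Sliding window of n consecutive words, joined with spaces.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_frequencies(text: str, n: int) -> list[tuple[str, int]]:
    # Split plaintext into word n-grams and sort by frequency, most common first.
    words = text.split()
    return Counter(ngrams(words, n)).most_common()
```

For wordlist generation you would feed file contents through `ngram_frequencies` and write out one n-gram per line.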

  3. Hey, #webmasters ... just so you know.

    #Facebook's new-ish "meta-externalagent" #webcrawler, which they document as a tool for stealing data for their Grand Theft Autocomplete (cough #AI cough), is ignoring robots.txt on my websites.

    developers.facebook.com/docs/s

    Is anyone surprised?

    #Meta #LLM #scrape #web #copyright #RobotsTXT

  4. Released Tris v1.3.1, adding a killer feature: a "Clear" button to make clearing input easier in the mobile UI 🙂

    Release notes: github.com/vmandic/tris-web-cr

    I focused mostly on CSS-animating a spider emoji to walk across a web emoji! tris.fly.dev/

    #indiedev #crawler #tris #triswebcrawler #seo #seotools #webcrawler #scraper #nodejs #release #dev #indieapp

  8. Added some #CSS magic and media queries to make it more mobile-friendly for my #triswebcrawler!

    Check it: tris.fly.dev/

    Also, finally added a text input so you can enter whatever domain you like to crawl. :blobfoxhappy:

    #webcrawler #indie #indiedev #node #nodejs #flyio #seo #seotool #seotools #sem #design #devux #ui

  9. I am so happy with the first web application I have developed myself 🎉: Tris, a simple and free web crawler 🕸️ 🕷️!

    You can try it for free online: tris.fly.dev, limited to 3 parallel crawls and 100 links at a path depth of 3.

    The next thing I will add is a text input to set a target domain, haha, now I am making it hard for myself! 🙈

    #node #nodejs #web #webcrawler #crawler #seo #datatools #webscraper #scraping #seotools #seotool #tris #triswebcrawler #webapp #indie #indiedev
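
Parallelism aside, the two caps the post mentions (100 links, path depth of 3) amount to a bounded breadth-first traversal. A minimal single-threaded sketch with a pluggable `get_links` function; this is a hypothetical illustration, not Tris's actual code:

```python
from collections import deque

def crawl(start, get_links, max_links=100, max_depth=3):
    # Breadth-first crawl: visit at most max_links URLs, going no deeper
    # than max_depth hops from the start page.
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_links:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # at the depth cap, record the page but don't expand it
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

In a real crawler `get_links` would fetch the page and extract anchors; injecting it as a function keeps the traversal logic testable offline.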

  10. The #NewYorkTimes has blocked #OpenAI’s #webcrawler, meaning that OpenAI can’t use content from the publication to train its AI models. If you check the NYT’s robots.txt page, you can see that the NYT disallows #GPTBot, the crawler that OpenAI introduced earlier this month. Based on the #InternetArchive’s #WaybackMachine, it appears NYT blocked the crawler as early as August 17th. theverge.com/2023/8/21/2384070 #copyright #legalresearch
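
The kind of robots.txt check described can be reproduced with Python's standard-library parser; the file contents below are a guess at the relevant rule, not the NYT's actual robots.txt:

```python
from urllib import robotparser

# Hypothetical robots.txt mirroring the rule the post describes.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# GPTBot is disallowed everywhere; agents with no matching rule default to allowed.
blocked = not rp.can_fetch("GPTBot", "https://example.com/article")
allowed = rp.can_fetch("SomeOtherBot", "https://example.com/article")
```

Against a live site you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `parse()`.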

  11. Web nerds, developers and content modelling types - please go follow @eaton and read about the extraordinary box of tricks he and Autogram have been working on for Web Analysis.

    Spidergram is honestly such an exciting tool, every time they showed me a bit of it I could think of a new use case and problem it would help with, and I vibrated in my chair a bit.

    #webcrawler #webdev #contentmodel #spidergram