#webcrawler — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #webcrawler, aggregated by home.social.
-
Why Google’s new AI-saturated search page will be a disaster
Google didn’t invent full-text search of the Internet – that honour belongs to early pioneers such as WebCrawler, Lycos and AltaVista. But for the last 25 years or so, Google has been synonymous with online searching, providing the quickest and most effective way to find things online (although its results may be getting worse.) More recently, it has been adding to its search engine more […]
#agentic #agents #ai #altavista #blackBox #chatbot #creators #dependency #google #interface #links #llms #lycos #magazines #newspapers #publishing #search #training #webcrawler #worldWideWeb https://walledculture.org/why-googles-new-ai-saturated-search-page-will-be-a-disaster/ -
Spider v1.0.0 released.
Spider is not just another web crawler -- it is a purpose-built wordlist and ngram processor for hash cracking workflows.
URL Mode:
Point it at a URL and Spider crawls the target, extracts words, and generates frequency-sorted wordlists and/or ngrams.
But Spider does not stop at web crawling...
File Mode:
Feed it local files and it brings the same word-processing engine to your own datasets, scraped content, notes, dumps, configs, or any other plaintext source you want to turn into a targeted wordlist or ngram set.
More info:
https://forum.hashpwn.net/post/52
#spider #webcrawler #wordlist #generator #sort #ngram #cyclone #hashpwn #hashcracking
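To make the idea of a frequency-sorted wordlist and ngram set concrete, here is a minimal Python sketch. It is not Spider's code: the tokenization rule, the function names, and the sample text are all assumptions chosen for illustration, and the crawling step is left out entirely (the text is passed in as a plain string).

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed rule: lowercase, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def frequency_wordlist(text):
    # Count each word, then sort by descending count (ties alphabetical).
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def frequency_ngrams(text, n=2):
    # Slide a window of n words over the token stream and count each window.
    words = tokenize(text)
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical input standing in for crawled or local plaintext.
sample = "the quick fox and the lazy dog and the quick cat"
wordlist = frequency_wordlist(sample)
bigrams = frequency_ngrams(sample, n=2)
```

Sorting by count first puts the most common candidates at the top of the list, which is presumably the point of a "targeted" wordlist; the alphabetical tie-break only keeps the output stable between runs.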
-
Oh, this is #fun.
#Applebot - Apple's web crawler, used for various things - is ignoring robots.txt rules governing crawling of websites.
I have Applebot (and Applebot-Extended, which isn't really a crawler) in my robots.txt files, set to disallow all access. Has been that way for #yonks.
And Applebot is consistently the highest-traffic crawler to my sites - at least of ones that actually bother to fetch robots.txt. Yesterday, for example, Applebot fetched robots.txt from one of my websites almost 800 times.
Yes, it's really Apple, not someone faking the user-agent identifier. It's coming from the networks that Apple says can be used to identify Applebot access. DNS matches, everything.
e.g. https://support.apple.com/en-ca/119829
So: legendary Apple software quality. Documented to do the right thing, but actually doing the wrong thing. And completely failing to cache content, fetching the same file 800 times a day when it hasn't changed in years.
Hey, Apple! Need a software engineer who's actually, you know, good at it? I'm available.
#Apple #AppleInc #TimApple #WebCrawler #RobotsTxt #quality #WeveHeardOfIt #qwality #AppleQwality #legendary #TwoHardThings #caching #fail #engineer #software #SoftwareEngineer
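For readers who want to see what a rule-abiding crawler should conclude from such a file, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content is an assumption reconstructed from the description in the post (Applebot and Applebot-Extended disallowed everywhere), not the author's actual file, and `example.com` and `SomeOtherBot` are placeholders. It only shows how the rules are meant to be read; it says nothing about what Applebot really does.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt, modeled on the post's description (not the real file).
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    # Parse the rules from a string instead of fetching them over the network.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL and user agents, for illustration only.
applebot_ok = is_allowed(ROBOTS_TXT, "Applebot", "https://example.com/page")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page")
```

With rules like these, a compliant crawler identifying as Applebot has no reason to request anything beyond robots.txt itself.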
-
ICYMI: Google-Agent joins the crawler list as AI browsing gets an official identity: Google on March 20 added Google-Agent to its user-triggered fetchers list, formalizing a new user agent for AI systems like Project Mariner that navigate the web on behalf of users. https://ppc.land/google-agent-joins-the-crawler-list-as-ai-browsing-gets-an-official-identity/ #GoogleAgent #AIBrowsing #ProjectMariner #WebCrawler #ArtificialIntelligence
-
What is the YandoriRSSBot?
I just happened to have my NLJ logs open (I had opened them when the site was slow for a moment). I saw something called the YandoriRSSBot requesting the NLJ ATOM feed. While not unprecedented, almost all the feed fetchers ask for the regular RSS feed. I decided to search for the user agent to see if it is coming from a new feed reader that I had never heard of. Unfortunately, Known Agents has no information about it beyond the fact that it has been reported in the wild. But I ran another […]
https://social.emucafe.org/naferrell/what-is-the-yandorirssbot-02-25-26/
-
The Website-Crawler tool collects data from websites as JSON or CSV, suitable for use with large language models (LLMs). It supports crawling or scraping an entire website quickly and is easy to use. #WebCrawler #DataExtraction #LLM #AI #CôngCụ #WebScraping #MachineLearning
-
Exa-d: How to store the web in S3
https://exa.ai/blog/exa-d
#ycombinator #ai_search_engine #web_search_api #webcrawler #serp_api #web_api #google_search_api #google_serp_api #people_search_engines #perplexity_ai_search_engine_features #ai_search_engine_free #search_engine_ai #free_people_search_engines #best_ai_search_engine #web_api_security #ai_search_engines #search_api #free_ai_search_engine #web_scraping_api #bing_search_api #webcrawler_search_engine #search_engine_rankings_api -
🦀 Crab.so: a free, lightweight web crawler tool for SEO. Developed as a side project, it is not yet a rival to Screaming Frog but is useful for site checks. All feedback and suggestions for improvement are welcome! #SEO #WebCrawler #CôngCụMiễnPhí #Crawl #SideProject #CôngCụSEO
https://www.reddit.com/r/SideProject/comments/1qc09ox/a_free_lightweight_screaming_frog_alternative/
-
I've checked on #YaCy from time to time because the project seemed very interesting, but the resources it needs (disk space and memory) seemed too big for it to be run on cheap hardware as a hobby. I don't know of any other #OpenSource (optionally) #distributed #searchEngine with #webCrawler included (independent of Google and co., unlike metasearch engines).
I thought maybe somebody would rewrite it in Rust or something, but no luck so far. There was once an announcement of significant optimisations, but the resources needed still seem to be huge.
Sadly, the focus nowadays seems to be on adding #AI to it. I guess I'll wait until the bubble is gone. 😕 -
What kind of #Crawler has been roaming my website for the past few days? I don't always see this many connections from the web server.
Let's see when it is done. According to a check of the IPs: CHINANET, 21ViaNet(China),Inc., Tencent cloud computing (Beijing)
#China #Webcrawler -
Wikipedia records a drop in visitors due to AI and social media
Wikipedia is losing visitors in 2025. The reasons are artificial intelligence in search engines and the growing use of social media.
Wikipedia: fewer page views due to AI and […]
https://www.apfeltalk.de/magazin/news/wikipedia-verzeichnet-besucherrueckgang-durch-ki-und-social-media/
#KI #News #Besucherzahlen #Google #KnstlicheIntelligenz #PewResearch #SocialMedia #Webcrawler #Wikipedia #Wissensplattform -
8 Web Scraping & Crawling Tools with n8n Integration (Workflow Template as a Free Download)
Today we look at how you can do web scraping and crawling. We go through 8 different tools and connect each of them directly to n8n, so that you can process the extracted data further in a workflow.
https://www.youtube.com/watch?v=LP571gnIg7A
#n8n #ki #automatisierung #webscraping #webcrawler #webscraper
-
https://social.emucafe.org/naferrell/user-agent-godhatesmastodon-08-22-25/
The New Leaf Journal became inaccessible for about 1-2 minutes this morning. Fortunately, I opened the site almost immediately when it happened. I opened my server logs and found what was probably the offending bot/scraper so I could block it. I kept the server logs open to watch for any other questionable activity. I saw an interesting user-agent string.
Aug 22 11:22:46 [IP ADDRESS] - - [22/Aug/2025:15:22:46 +0000] "GET / HTTP/1.1" 200 63425 "-" "GodHatesMastodon"
My two sites are often crawled by Mastodon servers and Fediverse-related crawlers because both sites function as ActivityPub servers (you can follow this account on the Fediverse at @naferrell@social.emcafe.org). I had not previously seen the crawler GodHatesMastodon, but I understand through the grapevine that there are some questionable instances out there. Fortunately, there is no reason for anyone to hate The New Leaf Journal. As my friend and colleague Victor V. Gurbo once explained, “The New Leaf Journal is a family website.”
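A log line like the one quoted in this post can be picked apart with a short script when scanning for unfamiliar user agents. The Python sketch below is only an illustration: the regular expression and the `parse_line` helper are assumptions fitted to the single line shown (a syslog-style prefix followed by what looks like the common "combined" access log layout), not the author's tooling, and other server configurations may need a different pattern.

```python
import re

# Assumed layout: [timestamp] "METHOD path protocol" status size "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+ [+-]\d{4})\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    # search() skips the syslog prefix and the redacted "[IP ADDRESS]" field.
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# The log line as quoted in the post, with the IP address already redacted.
line = ('Aug 22 11:22:46 [IP ADDRESS] - - [22/Aug/2025:15:22:46 +0000] '
        '"GET / HTTP/1.1" 200 63425 "-" "GodHatesMastodon"')
entry = parse_line(line)
```

From here, counting `entry["user_agent"]` values across a whole log file (for example with `collections.Counter`) is one way to spot which bots are responsible for a traffic spike.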
-
AI crawler Firecrawl raises $14.5M, is still looking to hire agents as employees
Firecrawl’s co-founder and CEO Caleb Peffer knew the exact moment he found the investor to lead his Series…
#NewsBeep #News #Artificialintelligence #AI #AIagents #ArtificialIntelligence #AU #Australia #nexusventurepartners #Technology #Webcrawler
https://www.newsbeep.com/au/82894/ -
#Firecrawl, an #opensource #webcrawler for #developers and #AIagents, raised $14.5 million in a Series A round led by Nexus Venture Partners. The company, which is already profitable, plans to use the funds to expand its team and develop tools to help website owners get paid when AI uses their content. https://techcrunch.com/2025/08/19/ai-crawler-firecrawl-raises-14-5m-is-still-looking-to-hire-agents-as-employees/?Pirates.BZ #Pirates #Tech #Startup #News