home.social

#scraper — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #scraper, aggregated by home.social.

  1. Feed-only posts

    I read earlier that Sourcefeed offers feed-only publishing. It sounds wild at first, but when you think about it, it is like a podcast for blogs: you write blog posts only for people who have subscribed to you in a feed reader.

    I like the idea. The "Why" page lists a few more good reasons, for example that typical AI scrapers do not use these pages as training material because they ignore feeds.

    Well, we'll see. I wouldn't pay money for a feature like this, so I simply built it into this blog myself. For now these posts are still pushed to the Fediverse as usual, but let's see for how long.

    So: subscribe to the feed.

    🔗

    #shortpost #blogpost #blog #Feed #Sourcefeed #FeedOnly #AI #Scraper
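
The feed-only idea described in the post above can be sketched roughly as follows. This is a minimal illustration, not how Sourcefeed or the author's blog actually implements it: the `feed_only` flag, the post records, and the `example.org` URLs are all assumptions made up for the example. Posts flagged as feed-only are left out of the public HTML index but still appear in the RSS feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical post records; the `feed_only` flag is an assumption for this sketch.
POSTS = [
    {"title": "Regular post", "url": "https://example.org/regular", "feed_only": False},
    {"title": "Feed-only post", "url": "https://example.org/feed-only", "feed_only": True},
]


def html_index(posts):
    """Return the titles shown on the public HTML index (feed-only posts are hidden)."""
    return [p["title"] for p in posts if not p["feed_only"]]


def rss_feed(posts):
    """Build a minimal RSS 2.0 document that contains every post, feed-only ones included."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example blog"
    ET.SubElement(channel, "link").text = "https://example.org/"
    ET.SubElement(channel, "description").text = "Sketch of a feed with feed-only items"
    for p in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = p["title"]
        ET.SubElement(item, "link").text = p["url"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(html_index(POSTS))
    print(rss_feed(POSTS))
```

Whether this keeps scrapers away depends entirely on the post's claim that they ignore feeds; the sketch only shows the publishing side.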

  6. More features against idiots

    You know how it is: no matter how hard you try to lock the idiots out, they find another way to view or scrape the site anyway.

    to the blog post…

    #blogpost #blog #GeoIP #Scraper #Idioten

  7. RE: mastodon.online/@NatureMC/1164

    And it works! Since I blocked the worst #scraping bots, all those "visitors" who stayed only a few seconds on a blog article are gone. 👍 And I can tell you that there were *many* of them, often more than real humans reading my blog. Illegal training of LLMs & Co. is a bigger problem than many are aware of.

    #noAI #scraper #blogging #blogger #bloggingCommunity
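
The post does not say how the bots were blocked. One common approach is to refuse requests by user-agent string, sketched here as a small WSGI middleware. The block list is purely illustrative (the two names are bots mentioned elsewhere on this page), and the function names are made up for the example. A scraper that spoofs its user agent will pass straight through, so this only stops bots that identify themselves.

```python
# Illustrative block list; real deployments usually maintain a much longer one.
BLOCKED_AGENTS = ("Amzn-SearchBot", "Meta-ExternalAgent")


def is_blocked(user_agent):
    """Return True if the user-agent string contains a blocked bot name (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(name.lower() in ua for name in BLOCKED_AGENTS)


def block_scrapers(app):
    """WSGI middleware: answer 403 for blocked user agents, otherwise call the wrapped app."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Wrapping an existing WSGI application is then a one-liner such as `application = block_scrapers(application)`.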

  12. Some updates on my website maintenance woes:

    Starting last July, I built a new wiki for my translations of German folk tales. Soon after I began, it started to experience frequent, hours-long outages. I researched possible causes and eventually concluded that the primary cause was a flood of requests from anonymous #scraper bot networks, desperate for new scraps of data to feed into their #LLM models, that the wiki simply couldn't cope with. Even increasing my hosting plan _twice_ last September only made the outages less common; it did not stop them.

    In March, I drastically reduced the amount of work I did on the wiki, as it was functionally complete - I had added more than 700 folk tales to it by that stage. Sure, there are always further tales to add - I didn't stop translating those tales, after all. But now I am adding 10-20 tales per month, not 100+.

    And funnily enough, I haven't noticed any major outages for this past month - or even minor ones. I guess the scraper bot networks noticed that I don't have that much new data to steal, and largely moved on to new prey they can harass.

    So, what can we conclude from this?

    If you are maintaining a website that produces lots of new content on a regular basis, you _will_ get hammered by these scrapers. robots.txt will do nothing - these bots use anonymous, ever-changing IP addresses. Maybe you can thwart them with #Cloudflare or similar technologies, which I haven't tried out (I am a rank beginner when it comes to website administration, to be frank).

    Otherwise you will either have to slow down the publication of new content, pay lots of money for an oversized hosting plan, or live with periodic outages until the #AIBubble bursts and there is no longer a trillion-dollar business case for scraping every website a thousand times a month.

    wiki.sunkencastles.com/wiki/Ma

  17. Backing up Spotify - Anna’s Blog

    "We backed up Spotify (metadata and music files). It’s distributed in bulk torrents (~300TB). It’s the world’s first “preservation archive” for music which is fully open (meaning it can easily be mirrored by anyone with enough disk space), with 86 million music files, representing around 99.6% of listens."

    Link: annas-archive.li/blog/backing-

    #linkdump #archive #blogpost #scraper #spotify

  18. #Cloudflare 09:00: I will protect you from all these scrapers smoking your cpu and webserver bandwidth.

    Takes a lunch 🥪

    Cloudflare 13:00: Hey AI-model hot boys… I have a 1 click #scraper for you to fill up that model!

    developers.cloudflare.com/chan

  19. New rule: Every time I notice an overnight outage with my website, a new scraper gets added to my robots.txt file.

    Welcome to the list, "Amzn-SearchBot".

    #Amazon #Scraper #AIScraper
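
As a rough sketch of what such a robots.txt entry looks like and how a well-behaved crawler would interpret it, the snippet below parses an example file with Python's standard `urllib.robotparser`. The file contents and the `example.org` URL are assumptions for illustration, not the author's actual robots.txt. Note that robots.txt is only advisory: nothing here prevents a bot from ignoring it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: shut out one named bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: Amzn-SearchBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a crawler honouring robots.txt may fetch `url` as `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Amzn-SearchBot", "SomeFeedReader"):
        print(agent, allowed(agent, "https://example.org/wiki/"))
```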

  24. Does anyone know whether there is a way to extract the #Schichten (shifts) from #Microsoft #Teams?

    I'd like to add the data to the family #Kalender (calendar).

    Is there an #api for that, or failing that a #scraper?

    #msteams #frage

  26. Robots.txt Generator - Retro Terminal Edition - More than 200 bots in the free version. Pure HTML, JavaScript, and a bit of CSS. No third parties, no framework, no CDN, no cookies, no tracking, no ads, no Big Tech nonsense, no AI, very privacy-friendly. Simple and effective in retro style. Coming online soon.

    #teufelswerk #HTML #javascript #app #entwicklung #code #retro #css #robotstxt #generator #stopbots #bots #crawler #scraper #keineKI #cookieless #datenschutz

  30. Do you know of a #facebook #scraper that currently works "out-of-the-box"?

  31. Ah, the eternal struggle between the noble #blogger and those dastardly HTML-scraping fiends 🤖🦹‍♂️! Terence Eden just can't wrap his head around why anyone would choose brute force over the #elegance of an API—because heaven forbid anyone use a different method to gather useless #data from his #digital #diary 🎭.
    shkspr.mobi/blog/2025/12/stop- #vs #scraper #API #gathering #HackerNews #ngated

  32. staying stealthy and reliable so your agents always get what they need without friction

    "Staying stealthy"!? Hey #Mozilla, stealthy against whom? The companies trying to track users across the open web? That's the type of tech your charter calls for, right!? No?

    It can click, scroll, search, and submit just like a human, navigating complex flows with real-time feedback and adaptive behavior. Get full control of the web, without the complexity.

    You're building tech for the parasites to scrape the open web "without friction"? You're mimicking the very users your foundation claims to put first against extractive systems. Do you have no sense of shame, or at least irony!? We're looking to you to be the last bastion that can defend the web at scale, and you're out here building a Trojan horse under cover of darkness.

    It's not just that I have lost respect; I've lost hope. I am not sorry for this very confrontational, scorched-earth post. I'm sad and seething!

    via @elilla
    https://transmom.love/@elilla/115564272417922503

    #Mozilla #OpenWeb #AI #Bot #Scraper

  33. MWoffliner 1.17 has been released!
    npmjs.com/package/mwoffliner/v

    This is a minor new version of our flagship #MediaWiki #scraper, but the changelog is still huge, with a great many bug fixes!
    github.com/openzim/mwoffliner/

  34. Bad morning,

    I just woke up to find out that OpenAI scraped my entire website, ignoring robots.txt.

    #webmaster #scraper #openai

  35. Since people are dunking on #Meta #scraping again, I'll share one tidbit: when @jonah and I were investigating some performance issues, I noticed that Meta-ExternalAgent was scraping /auth/sign_up and one specific invite link with different `accept` parameters (which indicate acceptance of the rules). However, because Mastodon returns 200 (and shows the rules again) on invalid `accept` parameters, the #scraper just keeps going...