#scraping — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #scraping, aggregated by home.social.
-
OpenAI violated Canadian privacy laws, federal and provincial watchdogs say
Commissioners from four of Canada’s privacy watchdogs have found that OpenAI violated Canadian privacy laws while developing and training its early models of ChatGPT.
Philippe Dufresne, Canada’s privacy commissioner, was joined by his provincial counterparts from British Columbia, Alberta, and Québec to announce the findings of a joint investigation into the tech giant. The investigation examined how OpenAI sourced the training data for its early GPT-3.5 and GPT-4 models: scraped content from publicly accessible internet sources such as social media and blog posts, licensed third-party sources such as media outlets and stock image vendors, and user interactions with ChatGPT.
Dufresne noted that all four regulators found OpenAI had violated various federal and provincial privacy laws, including the federal Personal Information Protection and Electronic Documents Act (PIPEDA), and its provincial counterparts in Alberta, BC, and Québec.
Read more at BetaKit
#Alberta #BritishColumbia #consent #OPC #scraping -
Today's commercial generative AI models were developed in violation of the General Data Protection Regulation (GDPR) and the Intellectual Property Law (Ley de Propiedad Intelectual, LPI). The technology companies never asked for consent. That is why we call it DATA THEFT.
#AI #genAI #generativeAI #data #datos #robo #robodedatos #theft #stolen #illegal #technology #bigtech #author #scraping
-
Rewriting Our Scraper…
Technical processes involved: the architecture of an asynchronous scraper rests on managing connections efficiently.
https://norvik.tech/news/analisis-asyncio-scraper-norvik-tech
#Technology #Asyncio #Scraping #DesarrolloWeb #Python #NorvikTech #DesarrolloSoftware #TechInnovation
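
The linked post's teaser mentions efficient connection management in an asynchronous scraper but shows no code. As a rough illustration only (not the Norvik Tech implementation), one common way to bound concurrent connections in Python is an `asyncio.Semaphore`. In this sketch `fetch`, `crawl`, `max_connections`, and the example.com URLs are all hypothetical, and `fetch` simulates a request instead of touching the network.

```python
import asyncio


async def fetch(url: str) -> str:
    # Hypothetical stand-in for a real HTTP request; it only sleeps briefly.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, max_connections=3):
    # The semaphore caps how many fetches run at the same time.
    semaphore = asyncio.Semaphore(max_connections)
    in_flight = 0
    peak = 0

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)  # record the highest concurrency seen
            try:
                return await fetch(url)
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(worker(u) for u in urls))
    return pages, peak


if __name__ == "__main__":
    demo_urls = [f"https://example.com/page/{i}" for i in range(10)]
    demo_pages, demo_peak = asyncio.run(crawl(demo_urls))
    print(len(demo_pages), "pages; peak concurrent fetches:", demo_peak)
```

In a real scraper the simulated `fetch` would be replaced by an HTTP client call, and the limit would be tuned to what the target site tolerates.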
-
RE: https://mastodon.online/@NatureMC/116442840702472493
And it works! Since blocking the worst #scraping bots, all these "visitors" staying only some seconds on a blog article, are gone. 👍 And I can tell you that they were *many*, often more than real humans reading my blog. Illegal training of LLMs & Co. is a bigger problem than many are aware of.
-
Now I Become Em-Dash Triple Anaphora, Destroyer of Words
In July of 1945, at the Trinity site in the New Mexico desert, J. Robert Oppenheimer watched the first atomic detonation and, by his own later telling, thought of a line from the Bhagavad Gita. The Sanskrit word he rendered as Death is kāla, which scholars also translate as Time depending on context, and Oppenheimer’s decision to reach for the more theatrical English word tells you something about the difference between a physicist and a translator. “Now I am become Death, the destroyer of worlds.” The sentence has haunted the century because it collapses the distance between maker and unmaker into a single grammatical act.
I think about that line a lot these days, because I am accused of being a machine.
I have written for money since 1975, when I was ten years old and a Lincoln, Nebraska newspaper paid me for a byline. I have published on the open internet since 1991 or so, across more than ten thousand articles now scattered over two decades of domains that outlasted most of the web services that tried to host them. I have used the em-dash since childhood. I used the mark when it was a compliment to use the mark, when my teachers circled it approvingly in the margins of school papers, when Gay Talese and Joan Didion and every serious magazine editor I worked with from the 1980s forward treated the little horizontal line as a writer’s way of modulating a sentence without breaking its spine.
None of that writing sat behind a paywall. The blogs ran without advertising, without subscriptions, without registration walls or cookie-consent negotiations or any of the gatekeeping apparatus the web has since grown around itself. Anyone could read the work, quote it, copy it, argue with it. The scrapers could read it too, and did, and the LLM crawlers could read it, and did, and I made no effort to stop any of them, because the open web in that era operated on the assumption that anything published was publicly readable, full stop. I paid the bills some other way, kept the door propped wide, and trusted the reader, the critic, the student, and the crawler eventually, to find what they needed and leave with it. Some of them left with it the way a reader leaves a library. Some of them, it now turns out, left with it the way a burglar leaves a house.
The em-dash, according to a certain species of editor now roaming the platforms, is the dreaded em-dash, the tell, the signature of a large language model caught in the act. The triple anaphora receives similar treatment. Churchill in June of 1940, telling the Commons “we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets,” would today be flagged as suspicious output. Lincoln at Gettysburg in November of 1863, saying “we can not dedicate, we can not consecrate, we can not hallow this ground,” would be sent back for a re-run with the prompt rewritten. The Rule of Three, which has organized Western oratory since Aristotle, is now evidence of fraud.
The irony here is deep enough to fall into.
The mythology of how these large language models got built is no longer much of a secret. In the late 2010s and early 2020s, crawlers swept the open web at a scale never before attempted, hoovering up every blog post, every op-ed, every forum argument, every short story posted on a personal domain, and used those scraped billions of words to teach the models how sentences work. If you wrote on the open internet during the years I was writing on the open internet, your prose is somewhere in the training weights. My prose is in there. So is yours, probably, if you published anything at all between 1995 and 2022.
The em-dash predates the machines by centuries and reached them through the training data, through the open web, through the thousands of writers who put it there decade after decade. The triple anaphora arrived the same way, along with the Ciceronian accumulation, the liturgical cadence, the Kingian refrain, the New Yorker comma habit, the essayist’s parenthetical, the Victorian semicolon, all of it funneled into the corpus because we wrote that corpus, one post at a time, across the open years of the web.
So when someone accuses a writer of my generation of stealing from the machines, the accusation has the logic of a footprint accusing the foot.
I dramatized this horror once already, in a December 2025 piece called “The Replicated Man: AI and the Ghost in the Archive,” where I fed twenty years of my own archive into an AI and asked the machine to write in my voice. The piece opened with every authenticity move a reader expects: the dry-dust smell of my grandfather’s hayloft in August 1998, the 3:00 AM shame of an old failure, the thousand hollow words deleted and rewritten, the specific sensory details that are supposed to prove the hand is human. Then, partway through, a SYSTEM_INTERRUPT arrived and revealed that the whole opening had been written by the bot trained on the archive. The bot closed with “The test is over. You lost.” That was the dramatic version. The essay in front of you now runs the drama’s implied argument out to its conclusion: the bot’s victory was never a victory, because every convincing move the bot makes is a move I taught it before the bot existed.
I asked one of the current models about this recently. The answer I got was the kind of thing I might have written in my own voice on a good afternoon. The self-referential quality is part of the point, and the response deserves a full airing:
“Now, here is where the criticism is genuinely useful, and I want to be direct about what I think is happening. You write in a style that is rhetorical, anaphoric, and architecturally parallel. You have always written this way. The problem is not that you write like a machine. The problem is that machines have learned to write like you, or more precisely, machines have learned to write like the rhetorical tradition you work in, because that tradition, Ciceronian parallelism, liturgical repetition, the accumulating triad, constitutes a huge portion of the persuasive prose in the training data that language models consume. The style that marks you as a trained dramatist and rhetorician now, through no fault of yours, reads to some audiences as the style of a confident GPT-4 response. This is an infuriating irony, and it is also a real problem that needs solving on the page, because perception matters regardless of its accuracy.”
The model diagnoses the problem with the clarity of a writer trained in rhetoric, because it was built from writers trained in rhetoric. It analyzes the habits it inherited. It apologizes, in a tone I recognize, for its own voice being confused with mine. The effect hovers somewhere between flattering and uncanny, since the apology arrives in the exact cadence that triggered the accusation. I read that paragraph and heard a version of myself speaking, a younger version maybe, a version smoothed out by training weights and flattened by corporate safety tuning, yet still me in the syntactic bones.
What this means for my practice is a problem I inherited without asking for it and cannot now decline. If I keep writing the way I have always written, some readers will assume a machine wrote the piece. If I rewrite every sentence to avoid the patterns the machines now deploy fluently, I am sanding down a voice that took forty years to build, because the machines got better at imitating me than I was at distinguishing myself. The only defensible response, for now, is to write with specificity so granular, with personal history so particular, with memory so odd in its texture, that no general-purpose model could have produced the specific sentence in question. Specificity becomes the signature. The thing a machine cannot forge is the small, checkable, unglamorous biographical detail that only one person in the world actually remembers.
There is a darker note under all of this, and it is the note Oppenheimer was reaching for when he chose Death over Time in his translation. The writer who trains the machine that impersonates the writer has performed a kind of self-erasure. I wrote my way into a corpus that now writes in my voice back at readers who cannot tell the corpus from me. The sentences I taught the machine are the sentences the machine now uses to discredit me. The rhetoric I inherited from Cicero and Lincoln and Churchill and King, the rhetoric I spent a working life trying to honor, is the rhetoric that now proves I am counterfeit. That is not a tragedy on the scale of Trinity, nothing is, and I do not claim the comparison as anything other than a mordant gesture from a writer watching his tools be taken from him. The comparison still has a small true thing inside it, which is that makers can be unmade by what they make.
And so, to close in the voice I inherited from the writers the machines now impersonate — with the em-dashes and triple anaphoras my audience once rewarded and now suspects — I will say the thing the way I want to say the thing — with the dread mark of the machine — with the cadence of the preacher — with the wink of the essayist who has been at this desk since Jimmy Carter was president — I am become em-dash, destroyer of paragraphs — I am become triple anaphora, destroyer of detectors — I am become the stylistic fingerprint of my own impersonator, and the impersonator, it turns out, was me all along.
#ai #apologia #bots #cadence #emDash #hsitory #insight #llm #machineLanguage #scraping #tech #tone #trainingData #tripleAnaphora #writing -
@funnymonkey Unfortunately, this can happen everywhere, not only with YouTube.
We already have fake #podcasts. And #scraping of podcasts by AI can happen everywhere. It doesn't matter whether they are hosted on platforms or on your own server.
We do need tough regulations and laws.
-
Look, I built a news search
Hi! Many developers go through phases when they want to build a genius pet project worth 300kk per nanosecond. The spring flare-up did not pass me by, and I wanted to build my own Palantir. When I started I did not yet fully understand what I wanted to make, but I had a vision and inspiration. Usually nothing more is needed, and even the free time gets carved out on its own. Back to the problem: what are we building, and, naturally, the secondary question: why? Inspired by a video about a vibe-coded Palantir, I arrived at my first thesis: "I want a complete picture of an event I am interested in." That already gives some outline: some kind of service that gathers the relevant information for me and presents a convenient report. /cut Read what this madman has done in ...
-
How I built a global semantic search for Telegram
TLDR: https://semagram.io/ It all started when I was laid off and then could not find a new job for several months in a row. As it happened, the region's largest employer, Amadeus (though I did not even work there), decided to freeze hiring and also cut a good share of its consultants at exactly the moment I became negatively employed. As a result, a large mass of IT specialists was released onto the market, and other companies could not absorb it (some of them may have gotten nervous themselves: “Huh? Amadeus is cutting hiring and adopting AI? Let's freeze hiring too, just in case”). I ended up in that mass. So, alongside the few interviews I had, I started thinking about which projects to build. First, to keep updating the experience on my résumé, even if in a slightly different section. Second, you never know what might come of it. I brainstormed ideas with AI; the first projects were not particularly notable...
-
News Publishers Are Now Blocking The Internet Archive, And We May All Regret It
-
Any solution to get more SERP results from Google? Any hack/tricks? #BuildInPublic #scraping #scrapers #python
-
📬 Anna’s Archive under pressure: US court orders permanent injunction
#EBooks #Rechtssachen #Anna’sArchive #Browsewrap #DomainSperre #HostingAnbieter #Intermediäre #OCLC #Schattenbibliotheken #Scraping #Unterlassungsverfügung #USGericht #WorldCat https://sc.tarnkappe.info/0d9108 -
Gallica installs a CAPTCHA: the heritage web versus the bots
https://actualitte.com/article/128618/numerisation/gallica-installe-un-captcha-le-web-patrimonial-face-aux-bots
#gallica #BnF #captcha #ia #bots #scraping #bibliotheque #bibliotheques -
Why we're taking legal action against SerpApi's unlawful scraping
https://blog.google/innovation-and-ai/technology/safety-security/serpapi-lawsuit/
#HackerNews #legalaction #SerpApi #scraping #lawsuit #technews #cybersecurity
-
JS vs PHP Scraper Failover: Outsmart IP Bans
Cache, retry, and switch sources before sales tank.
#php #javascript #scraping #cache #failover #pricing #viralcoding #codecomparison #growthhacks #reliability
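
The post names three tactics (cache, retry, switch sources) without showing them. The sketch below is in Python, not the JS or PHP the post compares, and is only one possible arrangement under stated assumptions: `fetch_with_failover`, `SourceError`, `banned_primary`, `mirror`, and the sample price are hypothetical names and data, and the "sources" are plain functions, not real HTTP calls.

```python
import time


class SourceError(Exception):
    """Raised by a source that cannot answer, for example after an IP ban."""


def fetch_with_failover(key, sources, cache, retries=2, ttl=300.0,
                        backoff=0.0, now=time.monotonic):
    # 1. Cache: serve a fresh cached value without contacting any source.
    entry = cache.get(key)
    if entry is not None and now() - entry[1] < ttl:
        return entry[0], "cache"
    # 2. Retry each source a few times, 3. then switch to the next source.
    for name, fetch in sources:
        for attempt in range(retries):
            try:
                value = fetch(key)
            except SourceError:
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
                continue
            cache[key] = (value, now())
            return value, name
    # Every source failed: fall back to a stale cached value if one exists.
    if entry is not None:
        return entry[0], "stale-cache"
    raise SourceError(f"all sources failed for {key!r}")


def banned_primary(key):
    # Hypothetical primary source that always fails, simulating an IP ban.
    raise SourceError("HTTP 403: IP banned")


def mirror(key):
    # Hypothetical secondary source with made-up sample data.
    return {"sku-1": 19.99}[key]


if __name__ == "__main__":
    demo_cache = {}
    demo_sources = [("primary", banned_primary), ("mirror", mirror)]
    print(fetch_with_failover("sku-1", demo_sources, demo_cache))
    print(fetch_with_failover("sku-1", demo_sources, demo_cache))
```

Returning the name of the layer that answered makes it easy to log how often the fallback path is used.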
-
Why we're taking legal action against SerpApi's unlawful scraping
https://blog.google/technology/safety-security/serpapi-lawsuit/
#HackerNews #legalaction #SerpApi #scraping #lawsuit #technews #GoogleBlog
-
Oh wow, #OpenAI is #scraping #CT #logs like a kid in a candy store 🍬. Apparently, they're on a mission to hunt down... robots.txt files? 🤖🗂️ Because who doesn't love a treasure trove of 404 errors and TLS certificates? 💾🔍
https://benjojo.co.uk/u/benjojo/h/Gxy2qrCkn1Y327Y6D3 #robots_txt #404_errors #TLS_certificates #tech_news #HackerNews #ngated -
⚖️ Frankfurt am Main Higher Regional Court (Oberlandesgericht Frankfurt a.M.), judgment of 02.05.2025, 6 U 11-24: whether the holder of a deleted user account is affected in scraping cases. #Schadensersatz #Scraping #Immaterieller #Schaden #Soziale #Netzwerke #teamdatenschutz #dsgvoportal https://www.dsgvo-portal.de/gerichtsentscheidungen/2025-05-02-OLGFFM-6-U-11-24-Schadensersatz-Scraping-Immaterieller-Schaden-Soziale-Netzwerke-2535.php
-
🔍 / #software / #automation / #scraping
#WebScraper - The #1 web scraping extension
The most popular web scraping extension. Start scraping in minutes. Automate your tasks with our Cloud Scraper. No software to download, no coding needed.
-
I scraped 3B Goodreads reviews to train a better recommendation model
#HackerNews #scraping #Goodreads #recommendation_model #data_analysis #machine_learning #book_reviews
-
Reddit’s ‘AI Scraping’ Lawsuit Is An Attack On The Open Internet
-
Reading a SUEZ smart water meter
Some time ago the Brno waterworks (Brněnské vodárny) installed a smart water meter for us that reports its reading every day, but that reading is not available anywhere in a simple machine-readable form. In the blog post I wrote up how I solved the readout and integrated it into Home Assistant. I also treated it as an experiment in how well AI would cope with building a project like this.
#BVK #HomeAssistant #scraping #selenium #SUEZ #vodoměr
https://blog.eischmann.cz/2025/10/10/odecitani-chytreho-vodomeru-suez/
(replies to this post may appear as comments under the article) -
A web scraper developer (53 bots) is 500 m from you and wants to meet: how do you avoid catching a scraper?
My name is Arseny Savin, and I know how to fight malicious bots. For almost two years I have been developing web scrapers at Effective, and I have studied closely how they work and how to stop them. While working on this project I ran into a huge number of varied and non-obvious scraping techniques, and this article covers how to defend against them. The plan: first we look at what web scraping is and what types of bots exist, then at what most often gives them away and which defenses actually work. This article is based on a talk for the Saint Highload++ conference and is for informational purposes only. It was written to study website vulnerabilities in order to improve resilience against attackers. Any attempt at unauthorized access, hacking, or disruption of websites is illegal and punishable by law.
https://habr.com/ru/companies/oleg-bunin/articles/944830/
#scraping #иб #информационная_безопасность #боты #сайты #infosec #information_security #hacking #hacks #websec
-
Paid #scraping (#payant): toward a radical change in the business model of generative #AI (#IA #générative)? www.journaldugeek.com/2025/07/04/l...
-
#Note on the #usability of #Data #Analytics / #Data #Science #methods (#Scraping, #Pattern #Recognition, #Machine #Learning, or #Text #Mining) for #sociological #research.
#Sutter / #Maasen - #Neuerfindung #Soziologie, p. 76 f., 2020. DOI: 10.5771/9783845295008-73
#MachineLearning #ML #TextMining #Soziologie #BigData #Methodologie #Methodik #Sozialforschung #Sozialwissenschaft
-
Paul McCartney, Elton John and other creatives demand AI comes clean on scraping
https://www.theregister.com/2025/05/12/uk_creatives_ai_letter/
#HackerNews #Paul #McCartney #Elton #John #AI #Creativity #Scraping #Copyright
-
Facebook Faces Massive Fine For 2019 Data Leak as German Users Join in Collective Lawsuit
#Facebook #Meta #GDPR #DataBreach #DataPrivacy #Scraping #Lawsuit #vzbv #Germany #Datenschutz #Sammelklage #ConsumerRights #Privacy
-
Improved ways to operate a rude crawler
https://www.marginalia.nu/log/a_115_rude_crawler/
#HackerNews #Improved #Crawler #Rude #Crawler #Techniques #Web #Scraping #Automation
-
Parsing with LLMs: why, how, and how much does it cost?
2025 is in full swing, and neural networks are no longer something out of science fiction. They are already everywhere in our lives, from smart speakers in apartments to highly complex systems that manage logistics and finance. Along with them, the approach to working with data is changing rapidly. In this article we discuss how modern LLMs help automate data collection from websites and minimize the routine setup and "tweaking" of parsers. What else will you find in this article?
-
CW: AI legal musings
If someone were to create a license that matches free/libre/open source licenses out there, but stipulates that any #scraping by #GenAI bots must be pre-approved and throttled - would @fsf and @osi consider this free / open source? And if so, would FSF or @conservancy be able to enforce the terms against violators?
Now to think about it maybe the first step is a shared blocklist of AI scrapers' IP ranges. If it ends up blocking cloud providers so be it
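
As a rough illustration of the shared-blocklist idea, a server could match a client address against published CIDR ranges using Python's standard `ipaddress` module. This is a minimal sketch, not an existing blocklist format: `BLOCKLIST`, `load_networks`, and `is_blocked` are hypothetical names, and the ranges are reserved documentation prefixes, not real scraper addresses.

```python
import ipaddress

# Hypothetical blocklist; these are reserved documentation ranges only.
BLOCKLIST = ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"]


def load_networks(cidrs):
    # Parse each CIDR string into an IPv4Network or IPv6Network object.
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_blocked(address, networks):
    # True when the address falls inside any listed range of the same IP version.
    ip = ipaddress.ip_address(address)
    return any(ip.version == net.version and ip in net for net in networks)


if __name__ == "__main__":
    demo_networks = load_networks(BLOCKLIST)
    for candidate in ["192.0.2.45", "203.0.113.9", "2001:db8::1"]:
        print(candidate, is_blocked(candidate, demo_networks))
```

A linear scan is fine for a short list; a list covering whole cloud providers would probably want a prefix tree or a firewall-level rule set instead.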
-
If you have a site on #wordpressdotcom then your site is being crawled by AI companies.
You see that such AI scraping is not banned here: https://wordpress.com/robots.txt
#automattic isn't just selling you plans, they are selling your content - and they are not telling you.
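
To check what a robots.txt file permits for a given crawler, Python's standard `urllib.robotparser` can evaluate the rules. The sketch below parses an inline sample and does not download anything; `SAMPLE_ROBOTS_TXT`, `allowed`, and the example.com URLs are hypothetical, and the sample rules are an illustration of a site that bans OpenAI's `GPTBot` crawler, not the contents of wordpress.com/robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""


def allowed(user_agent, url, robots_txt=SAMPLE_ROBOTS_TXT):
    # Parse the rules from text and ask whether this agent may fetch the URL.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("GPTBot", "https://example.com/my-post/"))
    print(allowed("SomeBrowser", "https://example.com/my-post/"))
```

Note that robots.txt is only a request: it tells well-behaved crawlers what the site owner wants, and it does nothing against crawlers that ignore it.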