#trainingai — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #trainingai, aggregated by home.social.
-
The Register: Yet another experiment proves it’s too damn simple to poison large language models. “Unlike search engines that let you judge competing sources, search-backed AI chatbots can turn shaky web material into confident answers. Case in point: A security engineer convinced several bots that he was the reigning world champion of a popular German card game, even though no such […]
https://rbfirehose.com/2026/05/07/the-register-yet-another-experiment-proves-its-too-damn-simple-to-poison-large-language-models/ -
BBC: Meta in row after workers who say they saw smart glasses users having sex lose jobs. “Meta is under pressure to explain why it cancelled a major contract with a company it was using to train AI, shortly after some of its Kenya-based workers alleged they had to view graphic content captured by Meta smart glasses.”
https://rbfirehose.com/2026/05/07/bbc-meta-in-row-after-workers-who-say-they-saw-smart-glasses-users-having-sex-lose-jobs/ -
Associated Press: Mark Zuckerberg ‘personally authorized’ Meta’s copyright infringement, publishers allege. “Five publishing houses and author Scott Turow sued Meta and CEO Mark Zuckerberg on Tuesday, alleging the company illegally used millions of copyrighted works to train its AI language system Llama.”
https://rbfirehose.com/2026/05/06/associated-press-mark-zuckerberg-personally-authorized-metas-copyright-infringement-publishers-allege/ -
Ars Technica: The hidden cost of Google’s AI defaults and the illusion of choice. “What we’re seeing in both free and paid Google accounts is the power of defaults in the AI era. The default is sharing data for AI training. The default is AI summaries in your email. The default is AI-powered document creation. You can change these settings, but Google has to know most people won’t do […]
https://rbfirehose.com/2026/05/01/ars-technica-the-hidden-cost-of-googles-ai-defaults-and-the-illusion-of-choice/ -
BBC: Meta to track workers’ clicks and keystrokes to train AI. “Meta will start tracking the way employees work, including their keystrokes and mouse clicks, to train its artificial intelligence (AI) models. The company, which owns Instagram and Facebook, told workers on Tuesday that a new tool will run on Meta’s computers and internal apps, logging their activity to be used as training data […]
https://rbfirehose.com/2026/04/22/bbc-meta-to-track-workers-clicks-and-keystrokes-to-train-ai/ -
The Register: Bad teacher bots can leave hidden marks on model students. “New research warns about the dangers of teaching LLMs on the output of other models, showing that undesirable traits can be transmitted ‘subliminally’ from teacher to student, even when they are scrubbed from training data.”
https://rbfirehose.com/2026/04/20/the-register-bad-teacher-bots-can-leave-hidden-marks-on-model-students/ -
Fast Company: Shuttered startups are selling old Slack chats and emails to AI companies. “According to a report by Forbes, defunct companies are selling their digital footprints to AI companies as training data—and making real money from it. Shanna Johnson, the CEO of now-defunct software company Cielo24, told the publication that she was able to sell every Slack message, internal email, and […]
https://rbfirehose.com/2026/04/18/fast-company-shuttered-startups-are-selling-old-slack-chats-and-emails-to-ai-companies/ -
NIST: NIST Helps Fingerprint Examiners With New Data and Software Release. “A NIST collection of 10,000 fingerprints has now been fully annotated with details that will help train both human fingerprint examiners and AI tools. NIST has also released open-source software that can help evaluate and sort fingerprints according to their quality, potentially helping fingerprint examiners work more […]
https://rbfirehose.com/2026/03/26/nist-nist-helps-fingerprint-examiners-with-new-data-and-software-release/ -
RE: https://social.vivaldi.net/@brucelawson/116243807011110029
Re: Pokémon Go training data.
Well, there were very strong hints in the terms of service even early on, and they were a large part of why I stopped playing the game after the first month or two… Wish I still had the links that convinced me.
I hate that basically everything I started to “wear a tin foil hat” about over the past 10 years has since been proven to be abused by corporations... “You are the product” etc etc.
-
MakeUseOf: I stopped my data from being used to train AI (you might want to too). “Copyright laws are tipped in favor of AI companies because, so far, the onus is on users to ‘opt out’ of their personal data, chats, and creative work being used to train AI models. Opting out won’t remove work from a dataset that’s already been used to train AI models, but it could prevent it in the future. […]
https://rbfirehose.com/2026/03/17/makeuseof-i-stopped-my-data-from-being-used-to-train-ai-you-might-want-to-too/ -
TechSpot: Smart TV apps are quietly scraping web data for AI training. “Companies specializing in scraping or otherwise harvesting publicly available content to train AI models are becoming increasingly common. In particular, some firms are targeting smart TV applications and similar platforms, attempting to leverage users’ internet connectivity in exchange for low-cost incentives such as […]
https://rbfirehose.com/2026/03/04/techspot-smart-tv-apps-are-quietly-scraping-web-data-for-ai-training/ -
Ars Technica: Microsoft deletes blog telling users to train AI on pirated Harry Potter books. “Following backlash in a Hacker News thread, Microsoft deleted a blog post that critics said encouraged developers to pirate Harry Potter books to train AI models that could then be used to create AI slop.”
https://rbfirehose.com/2026/02/24/ars-technica-microsoft-deletes-blog-telling-users-to-train-ai-on-pirated-harry-potter-books/ -
Reuters: Musk’s Starlink updates privacy policy to allow consumer data to train AI. “SpaceX (SPAX.PVT) revised its Starlink privacy policy to allow the use of customer data for AI training, a shift that could bolster Elon Musk’s AI ambitions.”
https://rbfirehose.com/2026/01/30/reuters-musks-starlink-updates-privacy-policy-to-allow-consumer-data-to-train-ai/ -
Geeks Are Sexy: Stack Overflow Is Dead, Long Live Its Training Data. “The numbers are brutal. Monthly questions have collapsed from more than 200,000 in 2014 to just 3,862 in December 2025. The reason is obvious: 84% of developers now use AI coding assistants. Turns out people prefer instant answers over being publicly shamed for forgetting a semicolon.” I tried Stack Overflow once. ONCE. It […]
https://rbfirehose.com/2026/01/19/geeks-are-sexy-stack-overflow-is-dead-long-live-its-training-data/ -
Mashable: Common Crawl accused of feeding paywalled content to AI companies. “In a detailed investigation for The Atlantic, reporter Alex Reisner reveals that several major AI companies have quietly partnered with the Common Crawl Foundation — a nonprofit that scrapes the web to build a massive public archive of the internet for research purposes.”
-
Reuters: China’s Huawei Co-Develops DeepSeek Model, Improves Censoring. “Chinese tech giant Huawei has co-developed a safety-focused version of artificial intelligence model DeepSeek that it said is ‘nearly 100% successful’ in preventing discussion of politically sensitive topics.”
-
The Register: Uber India starts offering drivers gigs collecting and classifying info for AI models. “Megha Yethadka, global head of Uber AI Solutions, revealed the new gigs in a Thursday LinkedIn post in which she said drivers sometimes have downtime during the day or might want to make some extra cash after hours. Yethadka said the work can involve reviewing photos, counting objects, […]
-
The Conversation: How poisoned data can trick AI − and how to stop it. “Data poisoning might not be entirely preventable. But there are commonsense measures that can help guard against it, such as placing limits on data processing volume and vetting data inputs against a strict checklist to keep control of the training process. Mechanisms that can help to detect poisonous attacks before they […]
-
The Conversation: ‘Are you joking, mate?’ AI doesn’t get sarcasm in non-American varieties of English. “Large language models are often reported to achieve superlative performance on several standardised sets of tasks known as benchmarks. The majority of benchmark tests are written in Standard American English. This implies that, while large language models are being aggressively sold by […]
-
ZDNet: Cloudflare just changed the internet, and it’s bad news for the AI giants. “The major internet Content Delivery Network (CDN), Cloudflare, has declared war on AI companies. Starting July 1, Cloudflare now blocks by default AI web crawlers accessing content from your websites without permission or compensation.”
-
PetaPixel: YouTubers Surprised That Google Uses Their Videos to Train AI Models. “According to a report by CNBC, Google is tapping into YouTube’s library of 20 billion videos to train its AI models. The news outlet cited a source not authorized to speak publicly about the matter. Google later confirmed to CNBC that it does use YouTube videos to train its AI, but says it only relies on a […]
-
MIT News: Training LLMs to self-detoxify their language. “Over time, most of us develop an internal ‘guide’ that enables us to learn context behind conversation; it also frequently directs us away from sharing information and sentiments that are, or could be, harmful or inappropriate. As it turns out, large language models (LLMs) — which are trained on extensive, public datasets and […]
https://rbfirehose.com/2025/04/22/mit-news-training-llms-to-self-detoxify-their-language/
-
MIT News: New method efficiently safeguards sensitive AI training data. “MIT researchers recently developed a framework, based on a new privacy metric called PAC Privacy, that could maintain the performance of an AI model while ensuring sensitive data, such as medical images or financial records, remain safe from attackers. Now, they’ve taken this work a step further by making their […]
-
The Conversation: Africa’s data workers are being exploited by foreign tech firms – 4 ways to protect them. “Since 2015, we have been studying the central role of African data workers in building and maintaining artificial intelligence (AI) systems, acting as ‘data janitors’. Our research found that companies rarely acknowledge the use of human workers in AI value chains, thus they […]
-
Search Engine Land: Meet LLMs.txt, a proposed standard for AI website content crawling. “While many content creators are interested in the proposal’s potential merits, it also has detractors. But given the rapidly changing landscape for content produced in a world of artificial intelligence, llms.txt is certainly worth discussing.”
-
Ars Technica: Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries. “Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures—adjusting robots.txt, blocking known […]
-
TorrentFreak: Meta’s BitTorrent Uploads of ‘Pirate Library’ Data Equaled 30% of Downloads, Expert Says. “A lawsuit filed by several authors against Meta centers on Meta’s alleged use of pirated books for AI training data and the technical details of BitTorrent which was used to obtain them. Yesterday, Meta filed a motion for summary judgment, while countering the authors’ request to […]
-
MIT Press: A note on LibGen and the unauthorized use of our authors’ work. “We want to be clear: The MIT Press has not licensed any of our books or journal articles for LLM training purposes, nor have we granted permission for any such use. However, we are well aware that many MIT Press publications have ended up in pirated training data sets. We share the deep distress of our authors whose […]
-
Ars Technica: Cloudflare turns AI against itself with endless maze of irrelevant facts. “On Wednesday, web infrastructure provider Cloudflare announced a new feature called ‘AI Labyrinth’ that aims to combat unauthorized AI data scraping by serving fake AI-generated content to bots. The tool will attempt to thwart AI companies that crawl websites without permission to collect training data […]
-
Tom’s Hardware: Meta defends using pirated material, claims it’s legal if you don’t seed content. “Meta claimed in a court filing this week that despite torrenting an 82 TB dataset of pirated, copyrighted material from shadow libraries to train its LLaMA AI models, that employees ‘took precautions not to “seed” any downloaded files’. The act of Seeding in torrenting terminology refers to […]
-
Perishable Press: Ultimate Block List to Stop AI Bots. “The focus of this post is aimed at website owners who want to stop AI bots from crawling their web pages, as much as possible. To help people with this, I’ve been collecting data and researching AI bots for many months now, and have put together a ‘Mega Block List’ to help stop AI bots from devouring your content.”
https://rbfirehose.com/2025/02/12/perishable-press-ultimate-block-list-to-stop-ai-bots/
-
TorrentFreak: ‘Meta Torrented over 81 TB of Data Through Anna’s Archive, Despite Few Seeders’. “Freshly unsealed court documents reveal that Meta downloaded significant amounts of data from shadow libraries through Anna’s Archive. The company’s use of BitTorrent was already known, but internal email communication reveals sources and terabytes of downloaded data, as well as a struggle […]
-
Ars Technica: How one YouTuber is trying to poison the AI bots stealing her content. “It’s not hard to find YouTubers complaining about a flood of these faceless channels stealing their embedded transcript files and running them through AI summarizers to generate their own instant knock-offs. But one YouTuber is trying to fight back, seeding her transcripts with junk data that is invisible to […]
-
Hackaday: Trap Naughty Web Crawlers In Digestive Juices With Nepenthes. “More commonly known as ‘pitcher plants’, nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. […]
-
Clueso secures $1.4M seed funding led by f7 Ventures. After recovering from its user base dropping from 25k to zero, the company is focusing on AI-driven training video production.
#Clueso #SeedFunding #f7Ventures #AIDriven #Videos #UserRecovery #TechInvestment #Resilience #StartupNews #AIProduction #UserBase #ZeroToHero #TechInnovation #VideoProduction #TrainingContent #f7Investment #StartupSuccess #AIContent #TechFunding #VideoTraining #Growth #TechRecovery #TrainingAI