#trainingdata — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #trainingdata, aggregated by home.social.
-
#State #media #control influences #LargeLanguageModels (#LLMs)
"Millions of people around the world query LLMs for information. Although several studies have compellingly documented the persuasive potential of these models, there is limited evidence of who or what influences the models themselves, leading to a flurry of concerns about which companies and governments build and regulate the models. Here we show through six studies that government control of the media across the world already influences the output of LLMs via their #TrainingData. We use a cross-national audit to show that LLMs exhibit a #stronger #ProGovernment valence in the languages of countries with #LowerMediaFreedom than in those with higher media freedom. The combination of influence and persuasive potential across languages suggests the troubling conclusion that states and powerful institutions have increased strategic incentives to leverage media control in the hopes of shaping LLM output."
-
When the Radiologist Becomes the Expense
On March 25, 2026, at a Crain’s New York Business panel discussion of the city’s hospital sector, Mitchell H. Katz, MD, president and CEO of NYC Health + Hospitals, told the assembled executives what cost-cutting now sounds like in the largest public hospital system in the United States. “We could replace a great deal of radiologists with AI at this moment, if we are ready to do the regulatory challenge.” Sandra Scott, MD, who runs One Brooklyn Health, one of the city’s safety-net institutions operating on tight margins, replied that the move would be “a game-changer.” The exchange appeared in Crain’s coverage of the panel and was picked up by the radiology trade press within forty-eight hours.
The proposal reads as the second move in a strategy whose first move has been documented for fifteen years. American hospital systems built imaging volume on the back of a preventative-medicine apparatus that the American College of Cardiology’s own Choosing Wisely campaign identified in 2012 as substantially overused, with up to 45% of stress cardiac imaging in low-risk asymptomatic patients flagged as inappropriate by the ACC’s own appropriate-use criteria. That volume produced revenue. The same hospital systems now propose to automate away the labor cost of interpreting the revenue-producing volume. Imaging continues, billing continues, the radiologist disappears from the ledger, and the patient pays the same copay for a scan whose ordering was already questionable, now read by an algorithm whose performance varies by manufacturer, training data, patient population, and deployment context.
The strongest evidence base for AI in radiology supports a use case that the Katz proposal does not describe. The Mammography Screening with Artificial Intelligence trial, called MASAI, randomized over 100,000 Swedish women to either standard double reading by two radiologists or AI-supported single reading by one radiologist with the Transpara system from ScreenPoint Medical. Lead author Kristina Lång and colleagues at Lund University reported in The Lancet Oncology in 2023 that the AI-supported arm reduced radiologist workload by 44% while modestly increasing cancer detection. Follow-up data published in The Lancet in 2026 showed a 12% reduction in interval cancers, meaning cancers that emerge between screenings and that carry worse prognosis, with AI-supported screening compared to standard double reading. First author Jessie Gommers of Radboud University Medical Centre was direct in the press release: “Our study does not support replacing healthcare professionals with AI as the AI-supported mammography screening still requires at least one human radiologist to perform the screen reading.”
That distinction matters. AI-assisted reading, where a human radiologist works alongside an algorithm that flags suspicious findings and triages low-risk cases for single rather than double review, has been validated in randomized trials with hard outcome measures. The validation extends to AI as a triage and detection support, where one human radiologist remains in the loop. AI-only reading, where no human reviews the image unless the algorithm flags an abnormality, has not been tested to the same standard. A Stanford working paper on so-called “AI mirages” in medical imaging, which describes algorithms that perform well on benchmark datasets and fail in clinical deployment because the training distribution does not match the deployment distribution, was circulating at the time of the Katz panel and was awaiting peer review. Mohammed Suhail, MD, a radiologist at North Coast Imaging quoted in coverage of the Katz statement, said that any attempt to implement AI-only reads “would immediately result in patient harm and death, and only someone with zero understanding of radiology would say something so naive.” That is a strong claim from a working radiologist, but the structural point underneath it is conservative. The trial that would justify AI-only reading on a population basis has not been run. The trial that would justify AI-assisted reading has been run, and it requires the radiologist.
Set the safety question aside for a moment and consider what the proposal does to the labor market. The radiologist has been a high-margin specialist for the same reason all specialists are high-margin: the supply is constrained by the length of training and the licensing apparatus, and the demand is set by imaging volume. Katz’s proposal substitutes capital for labor. If New York State relaxes the regulation requiring radiologist review, NYC Health + Hospitals saves the salary of every radiologist whose reads can be displaced to the algorithm; the imaging machine still runs, still bills, still produces a chargeable encounter on the patient’s account. Generalized, the same logic applies to dermatology, where machine-learning skin lesion classifiers have shown strong retrospective performance, and to pathology, ophthalmology, and any imaging-heavy specialty whose work product is a classification task on a digital image. A worsening shortage of breast imaging specialists, particularly in rural and underserved markets, is the legitimate operational pressure Katz is responding to, and the American College of Radiology has documented this shortage at length. Using that pressure to license a deployment model the trial evidence has not endorsed is the illegitimate response.
Two profits accrue to the hospital system. The first is the original imaging revenue, generated by the appropriate-use-violating ordering patterns that produced the screening volume in the first place. The second is elimination of the labor cost of reading the imaging. A patient pays the copay, the insurer pays the technical fee, and the AI vendor takes a per-read or subscription fee that comes in well below the radiologist’s salary equivalent. Vendor and hospital split the gain. Radiologists are unemployed or shifted to abnormality review only, which substantially compresses earnings since the volume of abnormal reads is a fraction of total reads. Care delivered to the patient may be equivalent in accuracy under the AI-supported model and inferior in accuracy under the AI-only model, with no individual professional license held responsible for the read.
The regulatory politics will determine which model gets deployed. Katz himself flagged the regulatory challenge at the Crain’s panel, asking the assembled CEOs whether there was any reason they should not be lobbying New York State to permit AI-only reads. Lobbying for the relaxation is the hospital system facing margin pressure. Lobbying against is the American College of Radiology and the radiologists themselves, organized through their professional society. New York State legislators will decide. Patients do not have a seat at this table. A patient learns about the change when the mammogram comes back from the screening center read by Transpara version whatever and the bill arrives in the mail with no indication of who, if anyone, looked at the image.
Liability shifts. Under current regulation, a missed cancer on a mammogram exposes the reading radiologist to malpractice litigation, which is why the radiologist carries professional liability insurance and why the radiologist’s professional license is on the line for every read. Under a proposed AI-only model with radiologist confirmation only of flagged abnormalities, the missed cancer that occurred when the algorithm scored the image low and no human looked at it produces a liability question with no individual defendant. The plaintiff sues the institution, the institution sues the AI vendor, the AI vendor sues the training data licensor or invokes the FDA clearance as a shield. Many degrees of separation now sit between the patient and the party with deep pockets. The structural change resembles the shift from the family doctor to the corporate practice in primary care: personal accountability disappears into the institutional defendant, and the patient learns that the system is the system.
The preventative-medicine apparatus that produced excess imaging volume and the AI-radiology apparatus that proposes to read it without human review are two faces of the same financial logic. Both extract value from patient bodies through technical interventions whose individual benefit is small or unproven on a population basis, both produce steady recurring revenue, and both depend on the patient being a passive substrate rather than an active agent in the care chain. One creates the imaging. The other eliminates the labor cost of reading it. The hospital system, which is the only party that crosses both moves, captures the margin on both.
AI-assisted radiology is a real technology with real performance data. The MASAI trial demonstrated that the right deployment, with the right oversight, in the right population, produces better cancer detection at lower radiologist workload. That is a legitimate technological gain and the trial is one of the cleaner pieces of clinical evidence for AI in medicine to date. The question is who controls deployment, under what oversight, and to what end. If AI becomes a tool that radiologists use to read more imaging more accurately at lower cost per read, with patient outcomes that match or exceed the current standard, that is medicine. If AI becomes a license to eliminate the radiologist altogether, with the institutional savings flowing to hospital margins and the patient losing the only party in the imaging chain whose individual professional license is on the line for the read, that is bookkeeping. Mitchell Katz proposed the second model at a panel in March. The trial evidence supports the first. The next move belongs to the New York State legislature, which is to say, to whoever lobbies hardest in Albany.
#ai #health #healthcare #living #medicine #patient #radiologists #radiology #revenue #safety #screening #tech #trainingData -
Now I Become Em-Dash Triple Anaphora, Destroyer of Words
In July of 1945, at the Trinity site in the New Mexico desert, J. Robert Oppenheimer watched the first atomic detonation and, by his own later telling, thought of a line from the Bhagavad Gita. The Sanskrit word he rendered as Death is kāla, which scholars also translate as Time depending on context, and Oppenheimer’s decision to reach for the more theatrical English word tells you something about the difference between a physicist and a translator. “Now I am become Death, the destroyer of worlds.” The sentence has haunted the century because it collapses the distance between maker and unmaker into a single grammatical act.
I think about that line a lot these days, because I am accused of being a machine.
I have written for money since 1975, when I was ten years old and a Lincoln, Nebraska newspaper paid me for a byline. I have published on the open internet since 1991 or so, across more than ten thousand articles now scattered over two decades of domains that outlasted most of the web services that tried to host them. I have used the em-dash since childhood. I used the mark when it was a compliment to use the mark, when my teachers circled it approvingly in the margins of school papers, when Gay Talese and Joan Didion and every serious magazine editor I worked with from the 1980s forward treated the little horizontal line as a writer’s way of modulating a sentence without breaking its spine.
None of that writing sat behind a paywall. The blogs ran without advertising, without subscriptions, without registration walls or cookie-consent negotiations or any of the gatekeeping apparatus the web has since grown around itself. Anyone could read the work, quote it, copy it, argue with it. The scrapers could read it too, and did, and the LLM crawlers could read it, and did, and I made no effort to stop any of them, because the open web in that era operated on the assumption that anything published was publicly readable, full stop. I paid the bills some other way, kept the door propped wide, and trusted the reader, the critic, the student, and the crawler eventually, to find what they needed and leave with it. Some of them left with it the way a reader leaves a library. Some of them, it now turns out, left with it the way a burglar leaves a house.
The em-dash, according to a certain species of editor now roaming the platforms, is the dreaded em-dash, the tell, the signature of a large language model caught in the act. The triple anaphora receives similar treatment. Churchill in June of 1940, telling the Commons “we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets,” would today be flagged as suspicious output. Lincoln at Gettysburg in November of 1863, saying “we can not dedicate, we can not consecrate, we can not hallow this ground,” would be sent back for a re-run with the prompt rewritten. The Rule of Three, which has organized Western oratory since Aristotle, is now evidence of fraud.
The irony here is deep enough to fall into.
The mythology of how these large language models got built is no longer much of a secret. In the late 2010s and early 2020s, crawlers swept the open web at a scale never before attempted, hoovering up every blog post, every op-ed, every forum argument, every short story posted on a personal domain, and used those scraped billions of words to teach the models how sentences work. If you wrote on the open internet during the years I was writing on the open internet, your prose is somewhere in the training weights. My prose is in there. So is yours, probably, if you published anything at all between 1995 and 2022.
The em-dash predates the machines by centuries and reached them through the training data, through the open web, through the thousands of writers who put it there decade after decade. The triple anaphora arrived the same way, along with the Ciceronian accumulation, the liturgical cadence, the Kingian refrain, the New Yorker comma habit, the essayist’s parenthetical, the Victorian semicolon, all of it funneled into the corpus because we wrote that corpus, one post at a time, across the open years of the web.
So when someone accuses a writer of my generation of stealing from the machines, the accusation has the logic of a footprint accusing the foot.
I dramatized this horror once already, in a December 2025 piece called “The Replicated Man: AI and the Ghost in the Archive,” where I fed twenty years of my own archive into an AI and asked the machine to write in my voice. The piece opened with every authenticity move a reader expects: the dry-dust smell of my grandfather’s hayloft in August 1998, the 3:00 AM shame of an old failure, the thousand hollow words deleted and rewritten, the specific sensory details that are supposed to prove the hand is human. Then, partway through, a SYSTEM_INTERRUPT arrived and revealed that the whole opening had been written by the bot trained on the archive. The bot closed with “The test is over. You lost.” That was the dramatic version. The essay in front of you now runs the drama’s implied argument out to its conclusion: the bot’s victory was never a victory, because every convincing move the bot makes is a move I taught it before the bot existed.
I asked one of the current models about this recently. The answer I got was the kind of thing I might have written in my own voice on a good afternoon. The self-referential quality is part of the point, and the response deserves a full airing:
“Now, here is where the criticism is genuinely useful, and I want to be direct about what I think is happening. You write in a style that is rhetorical, anaphoric, and architecturally parallel. You have always written this way. The problem is not that you write like a machine. The problem is that machines have learned to write like you, or more precisely, machines have learned to write like the rhetorical tradition you work in, because that tradition, Ciceronian parallelism, liturgical repetition, the accumulating triad, constitutes a huge portion of the persuasive prose in the training data that language models consume. The style that marks you as a trained dramatist and rhetorician now, through no fault of yours, reads to some audiences as the style of a confident GPT-4 response. This is an infuriating irony, and it is also a real problem that needs solving on the page, because perception matters regardless of its accuracy.”
The model diagnoses the problem with the clarity of a writer trained in rhetoric, because it was built from writers trained in rhetoric. It analyzes the habits it inherited. It apologizes, in a tone I recognize, for its own voice being confused with mine. The effect hovers somewhere between flattering and uncanny, since the apology arrives in the exact cadence that triggered the accusation. I read that paragraph and heard a version of myself speaking, a younger version maybe, a version smoothed out by training weights and flattened by corporate safety tuning, yet still me in the syntactic bones.
What this means for my practice is a problem I inherited without asking for it and cannot now decline. If I keep writing the way I have always written, some readers will assume a machine wrote the piece. If I rewrite every sentence to avoid the patterns the machines now deploy fluently, I am sanding down a voice that took forty years to build, because the machines got better at imitating me than I was at distinguishing myself. The only defensible response, for now, is to write with specificity so granular, with personal history so particular, with memory so odd in its texture, that no general-purpose model could have produced the specific sentence in question. Specificity becomes the signature. The thing a machine cannot forge is the small, checkable, unglamorous biographical detail that only one person in the world actually remembers.
There is a darker note under all of this, and it is the note Oppenheimer was reaching for when he chose Death over Time in his translation. The writer who trains the machine that impersonates the writer has performed a kind of self-erasure. I wrote my way into a corpus that now writes in my voice back at readers who cannot tell the corpus from me. The sentences I taught the machine are the sentences the machine now uses to discredit me. The rhetoric I inherited from Cicero and Lincoln and Churchill and King, the rhetoric I spent a working life trying to honor, is the rhetoric that now proves I am counterfeit. That is not a tragedy on the scale of Trinity, nothing is, and I do not claim the comparison as anything other than a mordant gesture from a writer watching his tools be taken from him. The comparison still has a small true thing inside it, which is that makers can be unmade by what they make.
And so, to close in the voice I inherited from the writers the machines now impersonate — with the em-dashes and triple anaphoras my audience once rewarded and now suspects — I will say the thing the way I want to say the thing — with the dread mark of the machine — with the cadence of the preacher — with the wink of the essayist who has been at this desk since Jimmy Carter was president — I am become em-dash, destroyer of paragraphs — I am become triple anaphora, destroyer of detectors — I am become the stylistic fingerprint of my own impersonator, and the impersonator, it turns out, was me all along.
#ai #apologia #bots #cadence #emDash #history #insight #llm #machineLanguage #scraping #tech #tone #trainingData #tripleAnaphora #writing -
#AfterQuery, founded by Spencer Mateega and Carlos Georgescu, pivoted from building #AIagents for #finance to creating #highquality #trainingdata for #AImodels. Their approach involves custom software systems to validate data and publishing research to prove its quality. The company has surpassed $100 million in annual revenue run rate and raised a $30 million Series A funding round. https://www.forbes.com/sites/annatong/2026/04/09/this-23-year-olds-new-ai-data-company-has-already-hit-a-100-million-run-rate/?Pirates.BZ #Pirates #Tech #Startup #News
-
An Open Training Set For AI Goes Global
https://fed.brid.gy/r/https://www.techdirt.com/2026/03/24/an-open-training-set-for-ai-goes-global/
-
AI models can acquire backdoors from surprisingly few malicious documents - Scraping the open web for AI training data can have its draw... - https://arstechnica.com/ai/2025/10/ai-models-can-acquire-backdoors-from-surprisingly-few-malicious-documents/ #ukaisecurityinstitute #alanturinginstitute #aivulnerabilities #backdoorattacks #machinelearning #datapoisoning #trainingdata #llmsecurity #modelsafety #pretraining #airesearch #aisecurity #finetuning #anthropic #biz #ai
-
Via #LLRX - How #poisoned #data can trick #AI − and how to stop it – Hadi Amini and Ervin Moore discuss how the quality of the #information that the AI offers depends on the quality of the data it learns from. But if someone tries to interfere by tampering with their #trainingdata – either the initial data used to build the system or data the system collects as it’s operating to improve – trouble could ensue. #learning #prompting #Education #truth #facts #knowledge #data https://www.llrx.com/2025/08/how-poisoned-data-can-trick-ai-and-how-to-stop-it/
-
As part of an Atrium workshop, my colleague Susan & I are trying to prepare #trainingdata to build a special #HTR model for early 20th-century handwriting from the British Isles in #escriptorium. Unfortunately, none of the existing #kraken models worked as a basis. We are now experimenting with different hands & image qualities. Dirty images (see below) distorted our model, so we focus on clean samples. Are other researchers interested in co-developing a ground truth with us? Let us know!
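The post does not say how the team decides which scans count as "clean," so the following is a purely hypothetical sketch of the kind of pre-screening heuristic one might run before importing pages into eScriptorium: score each grayscale page by its fraction of dark pixels (ink plus speckle and bleed-through) and discard pages above a threshold. The function names and threshold values are invented for illustration.

```python
def noise_ratio(pixels, dark_threshold=128):
    """Fraction of grayscale pixels (0-255) darker than the threshold.

    On a clean handwriting scan this is roughly the ink coverage;
    speckle, stains, and bleed-through push it much higher.
    """
    dark = sum(1 for p in pixels if p < dark_threshold)
    return dark / len(pixels)

def keep_for_ground_truth(pixels, max_noise=0.25):
    """Keep a scan only if its dark-pixel ratio is plausible for clean text."""
    return noise_ratio(pixels) <= max_noise

# Simulated pages: a clean page is mostly white with ~10% ink;
# a dirty page has heavy speckle covering half its pixels.
clean_page = [255] * 90 + [30] * 10
dirty_page = [255] * 50 + [40] * 50

print(keep_for_ground_truth(clean_page))  # True
print(keep_for_ground_truth(dirty_page))  # False
```

In practice one would compute the pixel list from the scan itself (e.g. with Pillow's `Image.convert("L")`) and tune the thresholds against pages the model is known to handle well; the point of the sketch is only that a cheap, auditable filter can keep distorting images out of the ground truth.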
-
Ethical AI Image Generator Bria Releases Next-Gen Model That Didn’t Steal Your Data https://petapixel.com/2025/07/01/ethical-ai-image-generator-bria-releases-next-gen-model-that-didnt-steal-your-data/ #aiimagegenerator #trainingdata #Technology #ethicalai #News #bria
-
Training a Self-Driving Kart https://hackaday.com/2024/12/21/__trashed-11/ #convolutionalneuralnetwork #MachineLearning #machinelearning #self-driving #trainingdata #autonomous #crazykart #training #go-kart
-
Will #OpenAI's $100B in investment funds get it past the hurdle of the #Napster business model in unlicensed training data sets?
⚖️🎙️" It's a bold move cotton. Let's see how that works out for them! "
#AI #OpenAI #unlicensed #TrainingData #fedilaw #lawfedi #TrialLawyers #DocketWatch #Lawsuits
-
These Are Fines 👉💸💸💸💸
Will the $150B valuation even COVER the Data Theft Payback though in Court Fine$? 🤔
-
This started off as a baseline post about generative artificial intelligence and its aspects, and it grew fairly long because information kept coming out even as I was writing it. It’s my intention to do a ’roundup’ like this, highlighting different focuses as needed. Every bit of it is connected, but on social media things tend to be written about in silos. I’m attempting to integrate, since the larger implications are hidden in these details, and will try to stay on top of it as things progress.
It’s long enough where it could have been several posts, but I wanted it all together at least once.
No AI was used in the writing, though some images have been generated by AI.
The two versions of artificial intelligence on the table right now – the marketed and the reality – have various problems that make it seem like we’re wrestling a mating orgy of cephalopods.
The marketing aspect is a constant distraction, feeding us what helps with stock prices and good will toward those implementing the generative AIs, while the real aspect of these generative AIs is not really being addressed in a cohesive way.
To simplify this, this post breaks it down into the Input, the Output, and the impacts on the ecosystem the generative AIs work in.
The Input.
There’s a lot that goes into these systems other than money and water. There’s the information used for the learning models, the hardware needed, and the algorithms used.
The Training Data.
The focus so far has been on what goes into their training data, which has already become an issue, spawning lawsuits and, less obviously, eroding trust in the companies involved.
…The race to lead A.I. has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law, according to an examination by The New York Times…
“How Tech Giants Cut Corners to Harvest Data for A.I.“, Cade Metz, Cecilia Kang, Sheera Frenkel, Stuart A. Thompson and Nico Grant, New York Times, April 6, 2024
Of note, too, is that Google has been indexing AI-generated books. That is what is called ‘synthetic data’, which has been warned against, yet companies are planning for it or even using it already, consciously or not.
Where some of these actions are of questionable legality, to many they are less questionably unethical – thus the revolt mentioned last year against AI companies using content without permission. Its effect is questionable because no one seems to have insight into what the training data consists of, and no one seems to be auditing it.
There’s a need for that audit, if only to allow for trust.
…Industry and audit leaders must break from the pack and embrace the emerging skills needed for AI oversight. Those that fail to address AI’s cascading advancements, flaws, and complexities of design will likely find their organizations facing legal, regulatory, and investor scrutiny for a failure to anticipate and address advanced data-driven controls and guidelines.
“Auditing AI: The emerging battlefield of transparency and assessment“, Mark Dangelo, Thomson Reuters, 25 Oct 2023.
While everyone is hunting down data, no one seems to be seriously working on oversight and audits, at least publicly, though the United States is pushing for global regulation of artificial intelligence at the UN. There has been no apparent update on the status of that, even as artificial intelligence is being used to select targets in at least two wars right now (Ukraine and Gaza).
There’s an imbalance here that needs to be addressed. It would be sensible to have external auditing of training data and its sources, as well as the algorithms involved – and, to get a little ahead, of the output too. Of course, the same sorts of things should be done with trading on stock markets, though that doesn’t seem to have made much headway in all the time it has been happening either.
Some websites are trying to block AI crawlers, and it is an ongoing process. Blocking them requires knowing who they are, and doesn’t guarantee that bad actors won’t stop by anyway.
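As a minimal sketch of what that blocking looks like in practice: it is usually done in a site’s robots.txt file, matching the user-agent names these crawlers have published (GPTBot for OpenAI, Google-Extended as Google’s AI-training opt-out token, CCBot for Common Crawl). Note the limits the paragraph above describes: compliance with robots.txt is entirely voluntary, and a crawler that doesn’t announce itself simply won’t match any of these rules.

```
# robots.txt – opt out of known AI training crawlers
# (published user-agent names; compliance is voluntary)

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Regular search indexing is unaffected by the rules above.
User-agent: *
Allow: /
```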
There is a new bill being pressed in the United States, the Generative AI Copyright Disclosure Act, that is worth keeping an eye on:
“…The California Democratic congressman Adam Schiff introduced the bill, the Generative AI Copyright Disclosure Act, which would require that AI companies submit any copyrighted works in their training datasets to the Register of Copyrights before releasing new generative AI systems, which create text, images, music or video in response to users’ prompts. The bill would need companies to file such documents at least 30 days before publicly debuting their AI tools, or face a financial penalty. Such datasets encompass billions of lines of text and images or millions of hours of music and movies…”
“New bill would force AI companies to reveal use of copyrighted art“, Nick Robins-Early, TheGuardian.com, April 9th, 2024.
Given how much information is used by these companies already from Web 2.0 forward, through social media websites such as Facebook and Instagram (Meta), Twitter, and even search engines and advertising tracking, it’s pretty obvious that this would be in the training data as well.
The Algorithms.
The algorithms for generative AI are pretty much trade secrets at this point, but one has to wonder why so much data is needed to feed the training models when better algorithms could require less. Consider that a well-read person can answer some questions, even as a layperson, with a smaller carbon footprint. We have no insight into the algorithms either, which makes it seem as though these companies are simply throwing more hardware and data at the problem rather than being more efficient with the data and hardware they already took.
There’s not much news about that, and it’s unlikely that we’ll see any. It does seem like fuzzy logic is playing a role, but it’s difficult to say to what extent, and given the nature of fuzzy logic, it’s hard to say whether its implementation is as good as it should be.
The Hardware.
Generative AI has brought about an AI chip race between Microsoft, Meta, Google, and Nvidia, which leaves smaller companies that can’t afford to compete in that arena at a disadvantage so great it could be seen as insurmountable, at least at present.
The future holds quantum computing, which could make all of the present efforts obsolete, but no one seems interested in waiting around for that to happen. Instead, it’s full speed ahead with NVIDIA presently dominating the market for hardware for these AI companies.
The Output.
One of the larger topics that seems to have faded is what some have called ‘hallucinations’ by generative AI. Strategic deception was also very prominent for a short period.
There is criticism that the algorithms are making the spread of false information faster, and the US Department of Justice is stepping up efforts to go after the misuse of generative AI. This is dangerous ground, since algorithms are being sent out to hunt the products of other algorithms, and the crossfire between them doesn’t care too much about civilians.2
Then there’s education: as students use generative AI, education itself has been disrupted. It is being portrayed as an overall good, which may simply be an acceptance that it’s not going away. It’s interesting to consider that the AI companies have taken more content than students could possibly get or afford in the educational system, which is worth exploring.
Given that ChatGPT is presently 82% more persuasive than humans, likely because it has been trained on persuasive works (Input: Training Data), and since most content on the internet is marketing products, services, or ideas, that was predictable. While it’s hard to say how much of the content being put into training data feeds our confirmation biases, it’s fair to say that at least some of it does. Then there are the other biases the training data inherits through omission or the selective writing of history.
There are a lot of problems, clearly, and much of it can be traced back to the training data, which even on a good day is as imperfect as we are; it can magnify, distort, or even be consciously influenced by good or bad actors.
And that’s what leads us to the Big Picture.
The Big Picture
…For the past year, a political fight has been raging around the world, mostly in the shadows, over how — and whether — to control AI. This new digital Great Game is a long way from over. Whoever wins will cement their dominance over Western rules for an era-defining technology. Once these rules are set, they will be almost impossible to rewrite…
“Inside the shadowy global battle to tame the world’s most dangerous technology“, Mark Scott, Gian Volpicelli, Mohar Chatterjee, Vincent Manancourt, Clothilde Goujard and Brendan Bordelon, Politico.com, March 26th, 2024
What most people don’t realize is that the ‘game’ includes social media and the information it provides for training models, such as what is happening with TikTok in the United States now. There is a deeper battle, and just perusing content on social networks gives data to those building training models. Even WordPress.com, where this site is presently hosted, is selling data, though there is a way to unvolunteer one’s self.
Even the Fediverse is open to data being pulled for training models.
All of this, combined with a persuasiveness of generative AI that has given psychology pause, has democracies concerned about the influence. A recent example: Grok, Twitter X’s AI for paid subscribers, fell victim to what was clearly satire and caused a panic – which should also have us wondering about how we view intelligence.
…The headline available to Grok subscribers on Monday read, “Sun’s Odd Behavior: Experts Baffled.” And it went on to explain that the sun had been, “behaving unusually, sparking widespread concern and confusion among the general public.”…
“Elon Musk’s Grok Creates Bizarre Fake News About the Solar Eclipse Thanks to Jokes on X“, Matt Novak, Gizmodo, 8 April 2024
Of course, some levity is involved in that one, whereas Grok posting that Iran had struck Tel Aviv (Israel) with missiles seems dangerous, particularly when posted to the front page of Twitter X. It shows the dangers of fake news with AI, deepening concerns related to social media, and should make us ask why billionaires involved in artificial intelligence wield the influence that they do. How much of that is generated? We have an idea how much it is lobbied for.
Meanwhile, Facebook has been spamming users and restricting accounts without demonstrated cause. If there were a videotape in a Blockbuster on this, it would be titled, “Algorithms Gone Wild!”.
Journalism is also impacted by AI, though real journalists tend to be rigorous about their sources. Real newsrooms have rules, and while we don’t have much insight into how AI is being used in them, it stands to reason that if a newsroom is to be a trusted source, it will go out of its way to make sure that it is: it has a vested interest in getting things right. This has not stopped some websites parading as trusted sources from disseminating untrustworthy information, because even in Web 2.0, when the world had an opportunity to discuss such things at the World Summit on the Information Society, the country with the largest web presence barely participated, if at all, at a government level.
Then we have the thing that concerns the most people: their lives. Jon Stewart even did a Daily Show segment on it, which is worth watching, because people are worried about generative AI taking their jobs, with good reason. Even as the Davids of AI3 square off for your market share, layoffs have been happening in tech as companies reposition for AI.
Meanwhile, AI is also apparently being used as a cover for some outsourcing:
Your automated cashier isn’t an AI, just someone in India. Amazon made headlines this week for rolling back its “Just Walk Out” checkout system, where customers could simply grab their in-store purchases and leave while a “generative AI” tallied up their receipt. As reported by The Information, however, the system wasn’t as automated as it seemed. Amazon merely relied on Indian workers reviewing store surveillance camera footage to produce an itemized list of purchases. Instead of saving money on cashiers or training better systems, costs escalated and the promise of a fully technical solution was even further away…
“Don’t Be Fooled: Much “AI” is Just Outsourcing, Redux“, Janet Vertesi, TechPolicy.com, Apr 4, 2024
Maybe AI is creating jobs in India by proxy. It’s easy to blame problems on AI, too, which is a larger problem because the world often looks for something to blame and having an automated scapegoat certainly muddies the waters.
And the waters of The Big Picture of AI are muddied indeed – perhaps partly by design. After all, those involved are making money, they have now even better tools to influence markets, populations, and you.
In a world that seems to be running a deficit when it comes to trust, the tools we’re creating seem to be increasing rather than decreasing that deficit at an exponential pace.
- The full article at the New York Times is worth expending one of your free articles, if you’re not a subscriber. It gets into a lot of specifics, and is really a treasure chest of a snapshot of what companies such as Google, Meta and OpenAI have been up to and have released as plans so far. ↩︎
- That’s not just a metaphor, as the Israeli use of Lavender (AI) has been outed recently. ↩︎
- Not the Goliaths. David was the one with newer technology: The sling. ↩︎
https://knowprose.com/2024/04/10/from-inputs-to-the-big-picture-an-ai-roundup/
#AI #amazon #artificialIntelligence #ChatGPT #facebook #generativeAi #Google #influence #LargeLanguageModel #Meta #openai #socialMedia #socialNetwork #trainingData #trainingModel #twitter #x
There are a lot of problems, clearly, and much of it can be traced back to the training data, which even on a good day is as imperfect as our own imperfections, it can magnify, distort, or even be consciously influenced by good or bad actors.
And that’s what leads us to the Big Picture.
The Big Picture
…For the past year, a political fight has been raging around the world, mostly in the shadows, over how — and whether — to control AI. This new digital Great Game is a long way from over. Whoever wins will cement their dominance over Western rules for an era-defining technology. Once these rules are set, they will be almost impossible to rewrite…
“Inside the shadowy global battle to tame the world’s most dangerous technology“, Mark Scott, Gian Volpicelli, Mohar Chatterjee, Vincent Manancourt, Clothilde Goujard and Brendan Bordelon, Politico.com, March 26th, 2024
What most people don’t realize is that the ‘game’ includes social media and the information it provides for training models, such as what is happening with TikTok in the United States now. There is a deeper battle, and just perusing content on social networks gives data to those building training models. Even WordPress.com, where this site is presently hosted, is selling data, though there is a way to unvolunteer one’s self.
Even the Fediverse is open to data being pulled for training models.
All of this, combined with the persuasiveness of generative AI that has given psychology pause, has democracies concerned about the influence. A recent example is Grok, Twitter X’s AI for paid subscribers, fell victim to what was clearly satire and caused a panic – which should also have us wondering about how we view intelligence.
…The headline available to Grok subscribers on Monday read, “Sun’s Odd Behavior: Experts Baffled.” And it went on to explain that the sun had been, “behaving unusually, sparking widespread concern and confusion among the general public.”…
“Elon Musk’s Grok Creates Bizarre Fake News About the Solar Eclipse Thanks to Jokes on X“, Matt Novak, Gizmodo, 8 April 2024
Of course, some levity is involved in that one whereas Grok posting that Iran had struck Tel Aviv (Israel) with missiles seems dangerous, particularly when posted to the front page of Twitter X. It shows the dangers of fake news with AI, deepening concerns related to social media and AI and should be making us ask the question about why billionaires involved in artificial intelligence wield the influence that they do. How much of that is generated? We have an idea how much it is lobbied for.
Meanwhile, Facebook has been spamming users and has been restricting accounts without demonstrating a cause. If there were a video tape in a Blockbuster on this, it would be titled, “Algorithms Gone Wild!”.
Journalism is also impacted by AI, though real journalists tend to be rigorous in their sources. Real newsrooms have rules, and while we don’t have that much insight into how AI is being used in newsrooms, it stands to reason that if a newsroom is to be a trusted source, they will go out of their way to make sure that they are: They have a vested interest in getting things right. This has not stopped some websites parading as trusted sources disseminating untrustworthy information because, even in Web 2.0 when the world had an opportunity to discuss such things at the World Summit on Information Society, the country with the largest web presence did not participate much, if at all, at a government level.
Then we have the thing that concerns the most people: their lives. Jon Stewart even did a Daily Show on it, which is worth watching, because people are worried about generative AI taking their jobs with good reason. Even as the Davids of AI3 square off for your market-share, layoffs have been happening in tech as they reposition for AI.
Meanwhile, AI is also apparently being used as a cover for some outsourcing:
Your automated cashier isn’t an AI, just someone in India. Amazon made headlines this week for rolling back its “Just Walk Out” checkout system, where customers could simply grab their in-store purchases and leave while a “generative AI” tallied up their receipt. As reported by The Information, however, the system wasn’t as automated as it seemed. Amazon merely relied on Indian workers reviewing store surveillance camera footage to produce an itemized list of purchases. Instead of saving money on cashiers or training better systems, costs escalated and the promise of a fully technical solution was even further away…
“Don’t Be Fooled: Much “AI” is Just Outsourcing, Redux“, Janet Vertesi, TechPolicy.com, Apr 4, 2024
Maybe AI is creating jobs in India by proxy. It’s easy to blame problems on AI, too, which is a larger problem because the world often looks for something to blame and having an automated scapegoat certainly muddies the waters.
And the waters of The Big Picture of AI are muddied indeed – perhaps partly by design. After all, those involved are making money, they have now even better tools to influence markets, populations, and you.
In a world that seems to be running a deficit when it comes to trust, the tools we’re creating seem to be increasing rather than decreasing that deficit at an exponential pace.
- The full article at the New York Times is worth expending one of your free articles, if you’re not a subscriber. It gets into a lot of specifics, and is really a treasure chest of a snapshot of what companies such as Google, Meta and OpenAI have been up to and have released as plans so far. ↩︎
- That’s not just a metaphor, as the Israeli use of Lavender (AI) has been outed recently. ↩︎
- Not the Goliaths. David was the one with newer technology: The sling. ↩︎
https://knowprose.com/2024/04/10/from-inputs-to-the-big-picture-an-ai-roundup/
#AI #amazon #artificialIntelligence #ChatGPT #facebook #generativeAi #Google #influence #LargeLanguageModel #Meta #openai #socialMedia #socialNetwork #trainingData #trainingModel #twitter #x
-
This started off as a baseline post regarding generative artificial intelligence and its aspects, and it grew fairly long because information kept coming out even as I was writing it. It's my intention to do 'roundups' like this, highlighting different focuses as needed. Every bit of it is connected, but in social media postings things tend to be written about in silos. I'm attempting to integrate them, since the larger implications are hidden in these details, and I will try to stay on top of it as things progress.
It's long enough that it could have been several posts, but I wanted it all together at least once.
No AI was used in the writing, though some images have been generated by AI.
The two versions of artificial intelligence on the table right now – the marketed version and the reality – have various problems that make it seem like we're wrestling a mating orgy of cephalopods.
The marketing aspect is a constant distraction, feeding us what helps stock prices and goodwill toward those implementing the generative AIs, while the reality of these generative AIs is not being addressed in a cohesive way.
To simplify, this post breaks it down into the Input, the Output, and the impacts on the ecosystem the generative AIs work in.
The Input.
There’s a lot that goes into these systems other than money and water. There’s the information used for the learning models, the hardware needed, and the algorithms used.
The Training Data.
The focus so far has been on what goes into the training data, and that has already been an issue, spawning lawsuits and, less obviously, eroding trust in the companies involved.
…The race to lead A.I. has become a desperate hunt for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have cut corners, ignored corporate policies and debated bending the law, according to an examination by The New York Times…
“How Tech Giants Cut Corners to Harvest Data for A.I.“, Cade Metz, Cecilia Kang, Sheera Frenkel, Stuart A. Thompson and Nico Grant, New York Times, April 6, 2024 1
Of note, too, is that Google has been indexing AI-generated books. Training on such AI-generated content, so-called 'synthetic data', has been warned against, but it is something companies are planning for or even doing already, consciously or not.
While some of these actions are of questionable legality, many consider them plainly unethical, hence the revolt mentioned last year against AI companies using content without permission. That revolt has had questionable effect, because no one seems to have insight into what the training data consists of, and no one appears to be auditing it.
There’s a need for that audit, if only to allow for trust.
…Industry and audit leaders must break from the pack and embrace the emerging skills needed for AI oversight. Those that fail to address AI’s cascading advancements, flaws, and complexities of design will likely find their organizations facing legal, regulatory, and investor scrutiny for a failure to anticipate and address advanced data-driven controls and guidelines.
“Auditing AI: The emerging battlefield of transparency and assessment“, Mark Dangelo, Thomson Reuters, 25 Oct 2023.
While everyone is hunting down data, no one seems to be seriously working on oversight and audits, at least not publicly, though the United States is pushing for global regulation of artificial intelligence at the UN. There has been no apparent update on the status of that, even as artificial intelligence is being used to select targets in at least two wars right now (Ukraine and Gaza).
There's an imbalance here that needs to be addressed. It would be sensible to have external auditing of the training data and its sources, as well as the algorithms involved, and, to get a little ahead, the output as well. Of course, these sorts of things should be done with trading on stock markets too, though that doesn't seem to have made much headway in all the time it has been happening either.
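What a first pass at such an audit might look like is anyone's guess, but even a crude provenance inventory would be more transparency than we have now. A hypothetical sketch in Python, with the record format, sources, and license labels invented purely for illustration:

```python
from collections import Counter

# Hypothetical training-data records. Real datasets hold billions of items
# and often carry no provenance metadata at all, which is the core problem
# an audit would have to confront first.
records = [
    {"source": "commoncrawl.org",   "license": "unknown"},
    {"source": "wikipedia.org",     "license": "CC-BY-SA"},
    {"source": "news-site.example", "license": "all-rights-reserved"},
    {"source": "commoncrawl.org",   "license": "unknown"},
]

def audit(records):
    """Tally records by source and flag those without a clear usage right."""
    by_source = Counter(r["source"] for r in records)
    flagged = [r for r in records
               if r["license"] in ("unknown", "all-rights-reserved")]
    return by_source, flagged

by_source, flagged = audit(records)
print(by_source.most_common(1))          # most-represented source
print(len(flagged), "records need review")
```

Trivial as it is, even this level of accounting, who supplied what, under which license, is something no outside party can currently perform on the major models.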
Some websites are trying to block AI crawlers, and it is an ongoing process. Blocking them requires knowing who they are, and it doesn't guarantee that bad actors will stay away.
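For the curious, user-agent filtering, one common blocking approach alongside robots.txt, can be sketched in a few lines of Python. The crawler names below are ones the operating companies have published, but any such list goes stale quickly, and, as noted, a bad actor can simply lie about its User-Agent:

```python
# Minimal sketch: refuse requests whose User-Agent matches a known AI crawler.
# The list is illustrative and best-effort only; User-Agent strings can be
# spoofed, and new crawlers appear faster than blocklists are updated.
AI_CRAWLERS = [
    "GPTBot",           # OpenAI
    "CCBot",            # Common Crawl, widely used as training data
    "Google-Extended",  # Google's AI-training opt-out token
    "ClaudeBot",        # Anthropic
    "Bytespider",       # ByteDance
]

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a listed AI crawler."""
    ua = user_agent.lower()
    return any(bot.lower() in ua for bot in AI_CRAWLERS)

def handle_request(user_agent: str) -> int:
    """Return an HTTP status: 403 for listed crawlers, 200 otherwise."""
    return 403 if is_ai_crawler(user_agent) else 200
```

A request presenting itself as `GPTBot/1.1` would be turned away with a 403, while one that simply omits or fakes its identity sails through, which is exactly the limitation described above.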
There is a new bill being pressed in the United States, the Generative AI Copyright Disclosure Act, that is worth keeping an eye on:
“…The California Democratic congressman Adam Schiff introduced the bill, the Generative AI Copyright Disclosure Act, which would require that AI companies submit any copyrighted works in their training datasets to the Register of Copyrights before releasing new generative AI systems, which create text, images, music or video in response to users’ prompts. The bill would need companies to file such documents at least 30 days before publicly debuting their AI tools, or face a financial penalty. Such datasets encompass billions of lines of text and images or millions of hours of music and movies…”
“New bill would force AI companies to reveal use of copyrighted art“, Nick Robins-Early, TheGuardian.com, April 9th, 2024.
Given how much information these companies have collected from Web 2.0 forward, through social media websites such as Facebook and Instagram (Meta) and Twitter, and even through search engines and advertising tracking, it's pretty obvious that this would be in the training data as well.
The Algorithms.
The algorithms for generative AI are pretty much trade secrets at this point, but one has to wonder why so much data is needed to feed the training models when better algorithms could require less. Consider that a well-read person can answer some questions, even as a layperson, with a much smaller carbon footprint. We have no insight into the algorithms either, which makes it seem as though these companies are simply throwing more hardware and data at the problem rather than being more efficient with the data and hardware they already took.
There's not much news about that, and it's unlikely that we'll see any. It does seem like fuzzy logic is playing a role, but it's difficult to say to what extent, and given the nature of fuzzy logic, it's hard to say whether its implementation is as good as it should be.
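To make the term concrete for readers who haven't met it: fuzzy logic replaces hard true/false with degrees of membership between 0 and 1. Whether anything like this actually sits inside these models is, as said, speculation; the sketch below only illustrates what fuzzy membership looks like, with a temperature example invented for the purpose:

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Degree (0..1) to which x belongs to a fuzzy set that peaks at b,
    with support running from a to c (a classic triangular membership
    function)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# "Warm" peaks at 22 degrees C, fading to zero at 15 and 30.
print(triangular(22, 15, 22, 30))  # 1.0: fully "warm"
print(triangular(18, 15, 22, 30))  # partially "warm", not a hard yes/no
print(triangular(35, 15, 22, 30))  # 0.0: not "warm" at all
```

The point is the graded answer in the middle case: instead of a binary verdict, membership degrades smoothly, which is what makes fuzzy systems attractive for vague categories.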
The Hardware
Generative AI has brought about an AI chip race between Microsoft, Meta, Google, and Nvidia, which leaves smaller companies that can't afford to compete in that arena at a disadvantage so great it could be seen as insurmountable, at least at present.
The future holds quantum computing, which could make all of the present efforts obsolete, but no one seems interested in waiting around for that. Instead, it's full speed ahead, with Nvidia presently dominating the hardware market for these AI companies.
The Output.
One of the larger topics, one that seems to have faded, is what some called 'hallucinations' by generative AI. Strategic deception was also very prominent for a short period.
There is criticism that the algorithms are making the spread of false information faster, and the US Department of Justice is stepping up efforts to go after the misuse of generative AI. This is dangerous ground, since algorithms are being sent out to hunt the products of other algorithms, and the crossfire between them doesn't care too much about civilians.2
Education itself has been disrupted as students use generative AI. This is being portrayed as an overall good, which may simply be an acceptance that it's not going away. It's interesting to consider that the AI companies have taken more content than students could possibly access or afford in the educational system, which is something worth exploring.
Given that ChatGPT is presently 82% more persuasive than humans, likely because it has been trained on persuasive works (Input; Training Data), and since most content on the internet is marketing either products, services or ideas, that was predictable. While it's hard to say how much of the content being put into training data feeds on our confirmation biases, it's fair to say that at least some of it does. Then there are the other biases that the training data inherits through omission or the selective writing of history.
There are a lot of problems, clearly, and much of it can be traced back to the training data, which even on a good day is as imperfect as we are. It can magnify, distort, or even be consciously influenced by good or bad actors.
And that’s what leads us to the Big Picture.
The Big Picture
…For the past year, a political fight has been raging around the world, mostly in the shadows, over how — and whether — to control AI. This new digital Great Game is a long way from over. Whoever wins will cement their dominance over Western rules for an era-defining technology. Once these rules are set, they will be almost impossible to rewrite…
“Inside the shadowy global battle to tame the world’s most dangerous technology“, Mark Scott, Gian Volpicelli, Mohar Chatterjee, Vincent Manancourt, Clothilde Goujard and Brendan Bordelon, Politico.com, March 26th, 2024
What most people don't realize is that the 'game' includes social media and the information it provides for training models, such as what is happening with TikTok in the United States now. There is a deeper battle, and just perusing content on social networks gives data to those building training models. Even WordPress.com, where this site is presently hosted, is selling data, though there is a way to unvolunteer oneself.
Even the Fediverse is open to data being pulled for training models.
All of this, combined with a persuasiveness of generative AI that has given psychology pause, has democracies concerned about its influence. A recent example: Grok, Twitter X's AI for paid subscribers, fell victim to what was clearly satire and caused a panic, which should also have us wondering about how we view intelligence.
…The headline available to Grok subscribers on Monday read, “Sun’s Odd Behavior: Experts Baffled.” And it went on to explain that the sun had been, “behaving unusually, sparking widespread concern and confusion among the general public.”…
“Elon Musk’s Grok Creates Bizarre Fake News About the Solar Eclipse Thanks to Jokes on X“, Matt Novak, Gizmodo, 8 April 2024
Of course, some levity was involved in that one, whereas Grok posting that Iran had struck Tel Aviv (Israel) with missiles seems dangerous, particularly when posted to the front page of Twitter X. It shows the dangers of fake news amplified by AI, deepens concerns related to social media and AI, and should make us ask why billionaires involved in artificial intelligence wield the influence that they do. How much of that influence is generated? We have an idea how much of it is lobbied for.
Meanwhile, Facebook has been spamming users and restricting accounts without demonstrating cause. If there were a videotape in a Blockbuster on this, it would be titled "Algorithms Gone Wild!".
Journalism is also impacted by AI, though real journalists tend to be rigorous with their sources. Real newsrooms have rules, and while we don't have much insight into how AI is being used in newsrooms, it stands to reason that if a newsroom is to be a trusted source, it will go out of its way to make sure that it is: it has a vested interest in getting things right. This has not stopped some websites parading as trusted sources from disseminating untrustworthy information, in part because, even in Web 2.0, when the world had an opportunity to discuss such things at the World Summit on the Information Society, the country with the largest web presence barely participated at a government level.
Then we have the thing that concerns people most: their lives. Jon Stewart even did a Daily Show segment on it, which is worth watching, because people are worried, with good reason, about generative AI taking their jobs. Even as the Davids of AI3 square off for market share, layoffs have been happening in tech as companies reposition for AI.
Meanwhile, AI is also apparently being used as a cover for some outsourcing:
Your automated cashier isn’t an AI, just someone in India. Amazon made headlines this week for rolling back its “Just Walk Out” checkout system, where customers could simply grab their in-store purchases and leave while a “generative AI” tallied up their receipt. As reported by The Information, however, the system wasn’t as automated as it seemed. Amazon merely relied on Indian workers reviewing store surveillance camera footage to produce an itemized list of purchases. Instead of saving money on cashiers or training better systems, costs escalated and the promise of a fully technical solution was even further away…
“Don’t Be Fooled: Much “AI” is Just Outsourcing, Redux“, Janet Vertesi, TechPolicy.com, Apr 4, 2024
Maybe AI is creating jobs in India by proxy. It’s easy to blame problems on AI, too, which is a larger problem because the world often looks for something to blame and having an automated scapegoat certainly muddies the waters.
And the waters of the Big Picture of AI are muddied indeed, perhaps partly by design. After all, those involved are making money, and they now have even better tools to influence markets, populations, and you.
In a world that seems to be running a deficit when it comes to trust, the tools we’re creating seem to be increasing rather than decreasing that deficit at an exponential pace.
- The full article at the New York Times is worth expending one of your free articles, if you’re not a subscriber. It gets into a lot of specifics, and is really a treasure chest of a snapshot of what companies such as Google, Meta and OpenAI have been up to and have released as plans so far. ↩︎
- That’s not just a metaphor, as the Israeli use of Lavender (AI) has been outed recently. ↩︎
- Not the Goliaths. David was the one with newer technology: The sling. ↩︎
https://knowprose.com/2024/04/10/from-inputs-to-the-big-picture-an-ai-roundup/
#AI #amazon #artificialIntelligence #ChatGPT #facebook #generativeAi #Google #influence #LargeLanguageModel #Meta #openai #socialMedia #socialNetwork #trainingData #trainingModel #twitter #x
-
NEW: Rage Against The Machine Learning (deluxe edition)
https://martinh.bandcamp.com/
Here's what happens when you ask a hot new "AI" music generator to write some songs about deceptive AIs, lamenting billionaires, and catgirl hackers with their ThinkPads and geodesic domes.
#AI #ArtificialIntelligence #GenerativeAI #MachineLearning #ML #LLMs #StochasticParrots #TrainingData #ModelCollapse #Music #ChipTunes #8bit #Shoegaze #Dub #Vaporwave #Darkwave #Catgirls #Hackers #Thinkpads #ProgrammingSocks
-
:cyber_hacking: Rage Against the Machine Learning :cyber_hacking:
https://app.suno.ai/song/076d816c-dc70-4916-8358-5b2d00cd9bc1/
"They're the catgirls of the digital age
With geodesic domes, they're all the rage
Hacker boots and programming socks
Their Thinkpads loaded, locked and stocked"

#AI #GenerativeAI #Music #Copyright #TrainingData #ThinkPad #ProgrammingSocks #Catgirls #CCC
-
Extensibility Of U-Net Neural Network Model For Hydrographic Feature Extraction And Implications For Hydrologic Modeling
--
https://doi.org/10.3390/rs13122368 <-- shared paper
--
#GIS #spatial #mapping #machinelearning #neuralnetwork #Unet #featureextraction #hydrography #hydrology #water #automation #remotesensing #surfacewater #NHD #bluelines #topography #deeplearning #AI #artificialintelligence #LiDAR #ifSAR #elevation #topography #ANN #HPC #model #modeling #trainingdata #Alaska #USA #watersheds #DTM #DSM #DEM #routing #network #D8 #maps #opendata #3dep #elevation #topography #USGS
-
Hackaday Links: August 13, 2023 - Remember that time when the entire physics community dropped what it was doing to ... - https://hackaday.com/2023/08/13/hackaday-links-august-13-2023/ #roomtemperaturesuperconductor #researchplatform #hackadaycolumns #machinelearning #superconductor #termsofservice #hackadaylinks #semiconductor #interconnect #trainingdata #skepticism #spacejunk #pimoroni #scripps #slider #debunk #heist #lk-99 #orbit #theft #eula #flip #zoom #leo #meo #tos #ai
-
CW: Long thread/4
But other political scientists sharply disagreed. Last year, @henryfarrell, #JeremyWallace and #AbrahamNewman published a thoroughgoing rebuttal to Harari in *#ForeignAffairs*:
https://www.foreignaffairs.com/world/spirals-delusion-artificial-intelligence-decision-making
They argued that - like everyone who gets excited about AI, only to have their hopes dashed - dictators seeking to use AI to understand the public mood would run into serious #TrainingData bias problems.
4/