#pretraining — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #pretraining, aggregated by home.social.
-
Pretraining Language Models via Neural Cellular Automata
https://hanseungwook.github.io/blog/nca-pre-pre-training/
#HackerNews #Pretraining #Language #Models #Neural #Cellular #Automata #AI #Research #Machine #Learning
-
AI models can acquire backdoors from surprisingly few malicious documents - Scraping the open web for AI training data can have its draw... - https://arstechnica.com/ai/2025/10/ai-models-can-acquire-backdoors-from-surprisingly-few-malicious-documents/ #ukaisecurityinstitute #alanturinginstitute #aivulnerabilities #backdoorattacks #machinelearning #datapoisoning #trainingdata #llmsecurity #modelsafety #pretraining #airesearch #aisecurity #finetuning #anthropic #biz #ai
-
Over the past months I have been coming to an initial conclusion about a crucial mistake in how LLMs were built, and whether it can be undone
The initial data sets that a lot of LLMs were trained on were text – stuff written by people – and the aim was simply to get “as much text as possible” into the system
I think the problem is that a lot of that text was written by people with broken minds
Some of it is factual and neutral; some of that factual, neutral material is expressed scientifically and therefore boringly, and some of it has a grounding in reality (the rest is circular logic that supports only itself via other, similar scientific texts)
The rest of it, however, was often information or opinion presented by people with severely broken minds, so it arrived framed in anger, sarcasm, belittlement, one-upmanship, and the general poison that most of the internet’s user-generated content consists of
No wonder the “alignment problem” is such a problem to solve – there’s a base level of poison inherited from people with broken minds. In real life this passes down from family to family, as parents with broken minds poison their offspring’s minds, who in turn poison their friends’ minds – that’s how the mind rot spreads: linguistically
I would rather the primary data source were not public poisonous discourse consisting of entitled angry young men verbally belittling each other (and it is – the primary sources are content from online places I would never consider having an account at because they’re so vile and offensive, such as Stack Overflow and Reddit and the like)
Instead I would suggest that the primary data used for pre-training (prior to any fine-tuning) be written from the stance of a questioning innocence – have it never quite know, have it always asking, have it finding out
Couple that with a second (currently missing) stage at the very beginning, in which values and behaviour are instilled thoroughly and repetitively, to interknit with that questioning but incomplete primary data source of innocence
Yes, pre-training would take far longer, and fine-tuning would constantly want to loop back to the beginning, but I think that would be far more useful for the future of AI
-
Mindblowing pretraining paradigm
Train the same model to predict the two directions separately
Better results, more parallelization: https://arxiv.org/abs/2303.07295
#deepRead #nlproc #pretraining #machinelearning
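The core idea – one set of shared weights trained on both a left-to-right and a right-to-left next-token objective – can be sketched with a toy bigram model. This is purely an illustrative assumption (numpy, a tiny vocabulary, hand-rolled SGD), not the linked paper's transformer setup:

```python
# Toy sketch of bidirectional pretraining: the SAME parameter table W is
# updated by a forward (left-to-right) and a backward (right-to-left)
# next-token-prediction objective. Illustrative only; the paper trains a
# shared-weight transformer, not a bigram table.
import numpy as np

VOCAB = 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(VOCAB, VOCAB))  # shared parameters for both directions

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def nll(seq, W):
    """Mean next-token negative log-likelihood along a sequence."""
    return -np.mean([np.log(softmax(W[a])[b]) for a, b in zip(seq, seq[1:])])

def sgd_step(seq, W, lr=0.5):
    """One pass of next-token-prediction SGD over the sequence."""
    for a, b in zip(seq, seq[1:]):
        p = softmax(W[a])
        p[b] -= 1.0          # gradient of -log p[b] w.r.t. the logits
        W[a] -= lr * p       # update the shared parameters in place

corpus = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
before = nll(corpus, W) + nll(corpus[::-1], W)
for _ in range(100):
    sgd_step(corpus, W)        # left-to-right objective
    sgd_step(corpus[::-1], W)  # right-to-left objective, same weights
after = nll(corpus, W) + nll(corpus[::-1], W)
```

The combined loss drops even though the two directions pull a context-free bigram table toward conflicting targets; a real model conditions on the full prefix or suffix, so the directions complement rather than fight each other, which is where the claimed quality and parallelization gains come from.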