#text-processing — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #text-processing, aggregated by home.social.
-
It's food for thought, too. All this fuss made about cat, repeated to this day in some quarters, and of the three things explicitly pointed out as wrong with what Berkeley did, only a tool to do one of them, vis(1), was ever written in 43 years.
The ironic cherry on top is that the only operating systems that come with vis out of the box are the BSDs.
-
The 'cat -v considered harmful' folks (Kernighan and Pike) invented vis(1).
> The preferred approach in this case is a separate program that deals with non-printable characters. We called ours vis […] because its job is to make things visible.
I suspect that they would approve of sqz(1) likewise abstracted from cat -s, because they also said
> cat isn’t […] for compressing multiple blank lines […] it’s for concatenating files.
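To illustrate what "making things visible" can mean, here is a minimal Python sketch that renders bytes in the caret and M- notation that cat -v prints. It is an assumption-laden illustration, not the encoding or option set of vis(1) itself, and the name `make_visible` is hypothetical.

```python
def make_visible(data: bytes) -> str:
    """Render bytes in caret / M- notation; plain tab and newline pass through."""
    out = []
    for b in data:
        prefix = ""
        if b >= 128:
            # High bit set: emit M- and then render the low seven bits.
            prefix = "M-"
            b -= 128
        if b in (9, 10) and not prefix:
            out.append(chr(b))                      # leave tab and newline alone
        elif b < 32:
            out.append(prefix + "^" + chr(b + 64))  # control characters: ^@ to ^_
        elif b == 127:
            out.append(prefix + "^?")               # DEL
        else:
            out.append(prefix + chr(b))             # printable ASCII
    return "".join(out)
```

For example, an escape sequence such as `b"\x1b[0m"` would come out as `^[[0m`, so the otherwise invisible ESC byte shows up in the output.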
-
Well, since vis(1) came about because of cat -v, it seems that there should be a sqz(1) because of cat -s.
-
As you can see, @rl_dane has me wondering whether anyone has ever written a sqz(1) #Unix #TextProcessing tool, to go alongside vis(1).
https://tty0.social/@JdeBP/116834743954481512
I am tempted to just quickly knock one together and put it into the #nosh toolset.
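What such a tool would do can be sketched in a few lines of Python. This is only a guess at the behaviour, modelled on cat -s (collapse each run of empty lines to one); the function name `squeeze` is hypothetical and this is not the nosh implementation.

```python
def squeeze(lines):
    """Yield lines, collapsing each run of adjacent empty lines into one."""
    previous_blank = False
    for line in lines:
        blank = line == "\n"   # truly empty lines only, not whitespace-only ones
        if blank and previous_blank:
            continue           # drop the second and later empty lines of a run
        previous_blank = blank
        yield line
```

Wired up as `sys.stdout.writelines(squeeze(sys.stdin))`, it would act as a filter in a pipeline.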
-
When you view manual pages on a video terminal, the -s ('squeeze') option to the final $PAGER in the pipeline does this. You'll be surprised at how many blank lines there really are in manuals.
less, (BSD) more, most, and (my) console-tty37-viewer all have this option.
Given that, it's an 'of course' moment, with a nod to the people who think that this should no more be a part of cat than vis(1) is, that cat has a squeeze option too.
-
🤔 Ah, yet another academic masterpiece on the magical powers of 'grep'—because who knew that sifting through text could be so agentically transformative? 🚀 Apparently, we need an army of agent harnesses to do what Ctrl+F has been mastering since the dawn of time. 😜 Thanks for the 🧠-bending #insights, arXiv!
https://arxiv.org/abs/2605.15184 #grep #textprocessing #arXiv #automation #academichumor #HackerNews #ngated
-
@rl_dane If you’re interested in working with Bible texts, you might want to look at https://platform.youversion.com/ – it provides free access via APIs and SDKs, so you don’t need to scrape or re‑parse the text yourself. The fast‑track licensing respects copyright and direct access to the source text helps you avoid introducing issues around textual integrity.
-
Speech and Language Processing (3rd ed. draft) - by Dan Jurafsky and James H. Martin (Stanford):
-
sentencex - by Wikimedia:
https://github.com/wikimedia/sentencex
A sentence segmentation library with wide language support optimized for speed and utility.
Written in #Rust.
Bindings are available for #Python, #NodeJS and #WASM
Might be useful for my #SpeechToText system! 👀
-
LLMs are getting better at character-level text manipulation
https://blog.burkert.me/posts/llm_evolution_character_manipulation/
#HackerNews #LLMs #CharacterManipulation #TextProcessing #AIInnovation #MachineLearning
-
The palindrome problem – Unicode edition
https://wiesmann.codiferes.net/wordpress/archives/41500
#C++ #CodePoints #GraphemeClusters #java #Javascript #ProgrammingLanguage #Python #Swift #TextProcessing #Unicode
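The code point versus grapheme cluster distinction in the tags can be shown with a small Python sketch. It makes a simplifying assumption: a "cluster" here is just a base code point plus any following combining marks, which only approximates real grapheme segmentation. The function names are hypothetical and not taken from the linked post.

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the preceding base character
        else:
            out.append(ch)
    return out

def is_palindrome_code_points(text):
    """Naive check: reverse the string code point by code point."""
    return text == text[::-1]

def is_palindrome_clusters(text):
    """Check on approximate grapheme clusters after NFC normalization."""
    parts = clusters(unicodedata.normalize("NFC", text))
    return parts == parts[::-1]
```

With `"x\u0301ax\u0301"` (an x with a combining acute accent on either side of an a), the naive check fails because reversing moves each accent in front of its letter, while the cluster-based check treats the word as a palindrome.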
-
Building on the 90s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advancements in nearly all NLP techniques of the era, laying the groundwork for today's AI.
F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA
#NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique
-
🚀 Behold the epic tale of Janet's #PEG #module, where the author heroically excludes regular expressions like they're yesterday's news. 💥 Marvel at the labyrinth of #parsing magic that claims to be more readable, but only if you have a PhD in arcane text processing. 📜✨
https://bakpakin.com/writing/how-janets-peg-works.html #Janet #readability #textprocessing #regex #HackerNews #ngated
-
Once again, keyword matching to the rescue…
-
🔠 Panel: More than Chatbots: Multimodal Large Language Models in Humanities Workflows
At #DHd2025, Nina Rastinger explores how well #AI handles abbreviations & NER:
✅ NER works well, even with small, low-cost models
❌ Abbreviations are tricky—costs & resource demands skyrocket
🚀 GPT o1 improves performance, even on abbreviations, but remains resource-intensive
Balancing accuracy & efficiency in text processing remains a challenge! ⚖️
-
New at PragProg
Staffan Nöteberg helps you really understand how the machinery works under the hood. Learn advanced tools like reluctant quantifiers, lookbehind and nondeterministic finite automata to write efficient and elegant regexes with ease.
In this illustrated guide, you gain precisely that understanding, even with no prior knowledge of Regular Expressions.
http://pragprog.com/titles/d-snrem
#regularexpressions #patternmatching #regex #regexp #textprocessing
-
Maybe I’m just growing really old. Today I stumbled across a GitHub repository with a few hundred lines of Python that could be one or two awk, sed or grep one-liners.
Seriously, if you are using Linux or one of the BSDs, learn how to use the standard text utilities that come with the OS.
In modern times jq should be added to the traditional list.
-
Discovered a neat new tool last week: https://github.com/wr7/refold
It's similar to `fmt` and `fold` except that it automatically handles prefixes. Vim/Neovim `gq` can do this out of the box but fails (for me at least) when multiple prefixes are present, such as a Markdown block-quote inside Rust comments. E.g.
```
// > Some quoted text
// > to reflow.
```
`refold` handles this.
#Rust #RustLang #TextProcessing #TextManipulation #TextEditor
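The idea of prefix-aware reflowing can be sketched in Python with the standard `textwrap` module. This is a rough illustration only, not how `refold` works internally: it assumes every line shares one leading run of non-word characters as its prefix, and the name `reflow_with_prefix` is hypothetical.

```python
import os
import re
import textwrap

def reflow_with_prefix(text: str, width: int = 40) -> str:
    """Reflow lines that share a prefix, repeating the prefix on every output line."""
    lines = text.splitlines()
    common = os.path.commonprefix(lines)
    # Keep only the leading non-word characters, so that a shared first
    # letter of the actual text is not mistaken for part of the prefix.
    prefix = re.match(r"\W*", common).group()
    body = " ".join(line[len(prefix):].strip() for line in lines)
    return textwrap.fill(
        body,
        width=width,                # total width, prefix included
        initial_indent=prefix,
        subsequent_indent=prefix,
    )
```

Called on the two comment lines above with a generous width, it would join them into one line that still starts with `// > `.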
-
Getting ready to run an online introductory XSLT course for people writing or maintaining stylesheets.
#XSLT #XML #Schematron #XSpec #declarative #functionalProgramming #textProcessing #digitalHumanities #JATS
-
Hello!
I am pleased to announce a new version of my "CLI text processing with GNU Coreutils" ebook. This ebook will help you learn 20+ specialized text processing commands provided by the coreutils package.
Links:
* Free PDF/EPUB: https://learnbyexample.gumroad.com/l/cli_coreutils (till 10-Apr-2024)
* Web version: https://learnbyexample.github.io/cli_text_processing_coreutils/
* Markdown source, exercise solutions, etc: https://github.com/learnbyexample/cli_text_processing_coreutils
* Short video about the book: https://youtu.be/oCnJLu_PUbY
* Interactive TUI app: https://github.com/learnbyexample/TUI-apps/tree/main/CLI-Exercises (includes some coreutils exercises)

I would highly appreciate it if you'd let me know how you felt about this book. It could be anything from a simple thank you, pointing out a typo, mistakes in code snippets, which aspects of the book worked for you (or didn't!) and so on. Reader feedback is essential and especially so for self-published authors.
Happy learning :)
-
An old blog post I wrote, which I had nearly forgotten. «Falsehoods programmers believe about text»
-
Looking to level up your Unix automation skills? Check out our new blog post: Perl: Changing the Game in Unix Automation and Text Processing
https://www.eliza-ng.me/post/sperlunixtools/
#Perl #UnixAutomation #TextProcessing
-
Ajit Dash has published an article on optimizing the usage of tokens in GPT for efficient text processing. The post highlights the importance of tokenization in language models and how GPT-3 uses tokens. https://techcommunity.microsoft.com/t5/healthcare-and-life-sciences/unlocking-the-power-of-tokens-optimizing-token-usage-in-gpt-for/ba-p/3826665 #Microsoft #textprocessing #tokens #softcorpremium
-
Oh, this looks fantastic! ✨
#Rust library to compare strings (or any sequences). 25+ algorithms, pure Rust, common interface, #Unicode support.
https://github.com/life4/textdistance.rs
Based on popular and battle-tested textdistance #Python library (and written by the same author).
Apparently, it also takes algorithms from the #talisman #JavaScript library, which I had wished were written in Rust.
https://github.com/Yomguithereal/talisman
#TextProcessing #NLP #TextDiffing #Diff #RustLang #Crate #CrateTip
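For a feel of what "comparing strings (or any sequences)" involves, here is a minimal Python sketch of Levenshtein edit distance, a typical example of the kind of algorithm such libraries offer. It is written from the textbook definition, not taken from textdistance.rs, and it does not reflect that crate's API.

```python
def levenshtein(a, b):
    """Edit distance between two sequences; each insert, delete or substitute costs 1."""
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free when items are equal
            ))
        previous = current
    return previous[-1]
```

Because it only compares items for equality, the same function works on strings, lists or tuples.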