#text-processing — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #text-processing, aggregated by home.social.

Hacker News @[email protected] · 2026-07-05 · 07:09 UTC

Pandoc Lua Filters
https://pandoc.org/lua-filters.html
#HackerNews #Pandoc #Lua #Filters #programming #documentation #open-source #textprocessing

#hackernews #pandoc #lua #filters #programming #documentation
JdeBP @[email protected] · 2026-07-01 · 20:42 UTC

@rl_dane
It's food for thought, too. All this fuss made about cat, repeated to this day in some quarters, and of the three things explicitly pointed out as wrong with what Berkeley did, only a tool to do one of them, vis(1), was ever written in 43 years.
The ironic cherry on top is that the only operating systems that come with vis out of the box are the BSDs.
#Unix #TextProcessing #nosh #cat #vis #linenumber #sqz

#unix #textprocessing #nosh #cat #vis #linenumber
JdeBP @[email protected] · 2026-07-01 · 18:22 UTC

@rl_dane
The 'cat -v considered harmful' folks (Kernighan and Pike) invented vis(1).
> The preferred approach in this case is a separate program that deals with non-printable characters. We called ours vis […] because its job is to make things visible.
I suspect that they would approve of sqz(1) likewise abstracted from cat -s, because they also said
> cat isn’t […] for compressing multiple blank lines […] it’s for concatenating files.
#vis #sqz #cat #Unix #TextProcessing #nosh

#vis #sqz #cat #unix #textprocessing #nosh
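
A minimal sketch of the "make things visible" idea from the post above, assuming the caret and `M-` notation that `cat -v` conventionally prints. The function name `make_visible` is hypothetical, and this does not reproduce the actual encodings of vis(1).

```python
def make_visible(data: bytes) -> str:
    """Render non-printable bytes in caret / M- notation (cat -v style)."""
    out = []
    for b in data:
        if b in (0x09, 0x0A):
            # Tab and newline pass through unchanged.
            out.append(chr(b))
        elif b < 0x20:
            # Control characters become ^@, ^A, ... ^_
            out.append("^" + chr(b + 0x40))
        elif b == 0x7F:
            out.append("^?")
        elif b < 0x80:
            out.append(chr(b))
        else:
            # High-bit bytes get an M- prefix, then the low 7 bits are rendered.
            low = b & 0x7F
            if low < 0x20:
                out.append("M-^" + chr(low + 0x40))
            elif low == 0x7F:
                out.append("M-^?")
            else:
                out.append("M-" + chr(low))
    return "".join(out)


print(make_visible(b"a\x01b\x7f\n"), end="")
```

Keeping this in its own filter, rather than as a flag on a concatenation tool, is the separation the quoted passage argues for.
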
JdeBP @[email protected] · 2026-07-01 · 06:53 UTC

@rl_dane
Well, since vis(1) came about because of cat -v, it seems that there should be a sqz(1) because of cat -s.
#Unix #TextProcessing #nosh

#unix #textprocessing #nosh
JdeBP @[email protected] · 2026-06-29 · 18:23 UTC

As you can see, @rl_dane has me wondering whether anyone has ever written a sqz(1) #Unix #TextProcessing tool, to go alongside vis(1).
https://tty0.social/@JdeBP/116834743954481512
I am tempted to just quickly knock one together and put it into the #nosh toolset.

#unix #textprocessing #nosh
JdeBP @[email protected] · 2026-06-29 · 18:11 UTC

@rl_dane
When you view manual pages on a video terminal, the -s ('squeeze') option to the final $PAGER in the pipeline does this. You'll be surprised at how many blank lines there really are in manuals.
less, (BSD) more, most, and (my) console-tty37-viewer all have this option.
Given that, it's an "of course" moment, with a nod to the people who think that this should no more be a part of cat than vis(1) is, that cat has a squeeze option too.
#Unix #TextProcessing #nosh

#unix #textprocessing #nosh
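
The squeeze behaviour described in the post above can be sketched as a standalone filter. This is only an illustration, assuming that "blank" means a completely empty line; the name `squeeze_blank_lines` is hypothetical and this is not the sqz(1) tool the thread speculates about.

```python
import sys


def squeeze_blank_lines(lines):
    """Yield lines, collapsing each run of empty lines into a single one."""
    previous_blank = False
    for line in lines:
        blank = line.rstrip("\n") == ""
        if blank and previous_blank:
            # Second or later empty line in a run: drop it.
            continue
        yield line
        previous_blank = blank


if __name__ == "__main__":
    # Works as a filter: reads standard input, writes standard output.
    sys.stdout.writelines(squeeze_blank_lines(sys.stdin))
```

Saved under a hypothetical name such as `sqz.py`, it could sit at the end of a pipeline in the same position as a pager's `-s` option.
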
N-gated Hacker News @[email protected] · 2026-06-09 · 16:01 UTC

🤔 Ah, yet another academic masterpiece on the magical powers of 'grep'—because who knew that sifting through text could be so agentically transformative? 🚀 Apparently, we need an army of agent harnesses to do what Ctrl+F has been mastering since the dawn of time. 😜 Thanks for the 🧠-bending #insights, arXiv!
https://arxiv.org/abs/2605.15184 #grep #textprocessing #arXiv #automation #academichumor #HackerNews #ngated

#insights #grep #textprocessing #arxiv #automation #academichumor
SJ McQuay @[email protected] · 2026-04-14 · 08:08 UTC

@rl_dane If you’re interested in working with Bible texts, you might want to look at https://platform.youversion.com/ – it provides free access via APIs and SDKs, so you don’t need to scrape or re‑parse the text yourself. The fast‑track licensing respects copyright and direct access to the source text helps you avoid introducing issues around textual integrity.
#BibleTech #FaithTech #APIs #TextProcessing

#textprocessing #apis #faithtech #bibletech
Jan :rust: :ferris: @[email protected] · 2026-01-11 · 21:44 UTC

Speech and Language Processing (3rd ed. draft) - by Dan Jurafsky and James H. Martin (Stanford):
https://web.stanford.edu/~jurafsky/slp3/
#NLP #TextProcessing #AI #Algorithms

#nlp #textprocessing #ai #algorithms
Jan :rust: :ferris: @[email protected] · 2025-11-01 · 14:44 UTC

sentencex - by Wikimedia:
https://github.com/wikimedia/sentencex
A sentence segmentation library with wide language support optimized for speed and utility.
Written in #Rust.
Bindings are available for #Python, #NodeJS and #WASM
Might be useful for my #SpeechToText system! 👀
#NLP #TextProcessing #Segmentation #RustLang

#rust #python #nodejs #wasm #speechtotext #nlp
Hacker News @[email protected] · 2025-10-14 · 00:31 UTC

LLMs are getting better at character-level text manipulation
https://blog.burkert.me/posts/llm_evolution_character_manipulation/
#HackerNews #LLMs #CharacterManipulation #TextProcessing #AIInnovation #MachineLearning

#hackernews #llms #charactermanipulation #textprocessing #aiinnovation #machinelearning
🔏 Matthias Wiesmann @[email protected] · 2025-10-03 · 13:58 UTC

The palindrome problem – Unicode edition
https://wiesmann.codiferes.net/wordpress/archives/41500
#C++ #CodePoints #GraphemeClusters #java #Javascript #ProgrammingLanguage #Python #Swift #TextProcessing #Unicode

#c #codepoints #graphemeclusters #java #javascript #programminglanguage
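
The post's tags contrast code points with grapheme clusters. The following rough sketch shows why that matters for a palindrome check. It is not taken from the linked article, and the cluster grouping is an approximation that only attaches combining marks to the preceding character; full segmentation follows the rules in Unicode Standard Annex #29 and also covers cases such as emoji sequences.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


def is_palindrome_by_code_point(text):
    """Naive check: reverse the string code point by code point."""
    return text == text[::-1]


def is_palindrome_by_cluster(text):
    """Reverse whole clusters so accents stay attached to their letters."""
    groups = clusters(text)
    return groups == groups[::-1]


# "a" + COMBINING ACUTE ACCENT, "b", "a" + COMBINING ACUTE ACCENT
word = "a\u0301ba\u0301"
print(is_palindrome_by_code_point(word))
print(is_palindrome_by_cluster(word))
```

The naive reversal moves each accent in front of its letter, so the two checks disagree on this input.
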
Harald Sack @[email protected] · 2025-05-09 · 08:41 UTC

Building on the 90s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advancements in nearly all NLP techniques of the era, laying the groundwork for today's AI.
F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA
#NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique

#nlp #languagemodels #historyofai #textprocessing #ai #historyofscience
N-gated Hacker News @[email protected] · 2025-04-14 · 17:26 UTC

🚀 Behold the epic tale of Janet's #PEG #module, where the author heroically excludes regular expressions like they're yesterday's news. 💥 Marvel at the labyrinth of #parsing magic that claims to be more readable, but only if you have a PhD in arcane text processing. 📜✨
https://bakpakin.com/writing/how-janets-peg-works.html #Janet #readability #textprocessing #regex #HackerNews #ngated

#peg #module #parsing #janet #readability #textprocessing
🔏 Matthias Wiesmann @[email protected] · 2025-03-07 · 07:18 UTC

Once again, keyword matching to the rescue…
#textprocessing
https://www.oregonlive.com/nation/2025/03/photo-of-enola-gay-aircraft-among-26000-images-flagged-for-removal-in-pentagons-dei-purge.html

#textprocessing
Holle Meding @[email protected] · 2025-03-06 · 10:53 UTC

🔠 Panel: More than Chatbots: Multimodal Large Language Models in Humanities Workflows
At #DHd2025, Nina Rastinger explores how well #AI handles abbreviations & NER:
✅ NER works well, even with small, low-cost models
❌ Abbreviations are tricky—costs & resource demands skyrocket
🚀 GPT o1 improves performance, even on abbreviations, but remains resource-intensive
Balancing accuracy & efficiency in text processing remains a challenge! ⚖️
#AI #NER #TextProcessing #DigitalHumanities

#dhd2025 #ai #ner #textprocessing #digitalhumanities
Pragmatic Bookshelf 📚 @pragprog · 2025-01-23 · 18:27 UTC

New at PragProg
Staffan Nöteberg helps you really understand how the machinery works under the hood. Learn advanced tools like reluctant, lookbehind and nondeterministic finite automata to write efficient and elegant regexes with ease.
In this illustrated guide, you gain precisely that understanding, even with no prior knowledge of Regular Expressions.
http://pragprog.com/titles/d-snrem
@staffannoteberg
#regularexpressions #patternmatching #regex #regexp #textprocessing

#regularexpressions #patternmatching #regex #regexp #textprocessing
Grumpy Old Techie 🕊️ @[email protected] · 2024-10-21 · 11:13 UTC

Maybe I’m just growing really old. Today I stumbled across a GitHub repository with a few hundred lines of Python that could be one or two awk, sed or grep one-liners.
Seriously, if you are using Linux or one of the BSDs, learn how to use the standard text utilities that come with the OS.
In modern times jq should be added to the traditional list.
#text #python #unix #linux #BSD #textprocessing

#text #python #unix #linux #bsd #textprocessing
Wesley Moore @[email protected] · 2024-09-24 · 01:20 UTC

Discovered a neat new tool last week: https://github.com/wr7/refold
It's similar to `fmt` and `fold` except that it automatically handles prefixes. Vim/Neovim `gq` can do this out of the box but fails (for me at least) when multiple prefixes are present, such as a Markdown block-quote inside Rust comments. E.g.
```
// > Some quoted text
// > to reflow.
```
`refold` handles this.
#Rust #RustLang #TextProcessing #TextManipulation #TextEditor

#rust #rustlang #textprocessing #textmanipulation #texteditor
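
To illustrate the prefix handling described in the post above, here is a small sketch using the standard `textwrap` module. It is not how `refold` itself works; the helper name `reflow` and the set of recognised prefix markers (`//`, `#`, `>`) are assumptions made for this example.

```python
import re
import textwrap

# One or more comment/quote markers, each optionally followed by a space.
PREFIX_RE = re.compile(r"^(?:\s*(?://|#|>)\s?)+")


def reflow(lines, width=72):
    """Re-wrap lines that share a (possibly nested) prefix such as '// > '."""
    match = PREFIX_RE.match(lines[0])
    prefix = match.group(0) if match else ""
    # Strip the shared prefix, then join everything into one paragraph.
    body = " ".join(
        line[len(prefix):].strip() if line.startswith(prefix) else line.strip()
        for line in lines
    )
    # Wrap to the space left after the prefix and put the prefix back.
    wrapped = textwrap.wrap(body, width=max(1, width - len(prefix)))
    return [prefix + piece for piece in wrapped]


for line in reflow(["// > Some quoted text", "// > to reflow."], width=40):
    print(line)
```

The prefix is taken from the first line only, so a block that mixes different prefixes would need to be split into separate calls.
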
barefootliam @[email protected] · 2024-08-24 · 08:32 UTC

Getting ready to run an online introductory XSLT course for people writing or maintaining stylesheets.
#XSLT #XML #Schematron #XSpec #declarative #functionalProgramming #textProcessing #digitalHumanities #JATS

#xslt #xml #schematron #xspec #declarative #functionalprogramming
Sundeep @learnbyexample · 2024-04-04 · 12:13 UTC

Hello!
I am pleased to announce a new version of my "CLI text processing with GNU Coreutils" ebook. This ebook will help you learn 20+ specialized text processing commands provided by the coreutils package.
Links:
* Free PDF/EPUB: https://learnbyexample.gumroad.com/l/cli_coreutils (till 10-Apr-2024)
* Web version: https://learnbyexample.github.io/cli_text_processing_coreutils/
* Markdown source, exercise solutions, etc: https://github.com/learnbyexample/cli_text_processing_coreutils
* Short video about the book: https://youtu.be/oCnJLu_PUbY
* Interactive TUI app: https://github.com/learnbyexample/TUI-apps/tree/main/CLI-Exercises (includes some coreutils exercises)
I would highly appreciate it if you'd let me know how you felt about this book. It could be anything from a simple thank you, pointing out a typo, mistakes in code snippets, which aspects of the book worked for you (or didn't!) and so on. Reader feedback is essential and especially so for self-published authors.
Happy learning :)
#linux #cli #coreutils #textprocessing

#linux #cli #coreutils #textprocessing
🔏 Matthias Wiesmann @[email protected] · 2024-03-01 · 08:16 UTC

An old blog post I wrote, which I had nearly forgotten. «Falsehoods programmers believe about text»
https://wiesmann.codiferes.net/wordpress/archives/30296
#unicode #ansi #textprocessing

#unicode #ansi #textprocessing
eliza-ng @[email protected] · 2023-07-09 · 13:35 UTC

Looking to level up your Unix automation skills? Check out our new blog post: Perl: Changing the Game in Unix Automation and Text Processing
https://www.eliza-ng.me/post/sperlunixtools/
#Perl #UnixAutomation #TextProcessing

#perl #unixautomation #textprocessing
Gareth Emslie 🇿🇦 🇪🇦 🇨🇭 @[email protected] · 2023-06-02 · 07:27 UTC

Ajit Dash has published an article on optimizing the usage of tokens in GPT for efficient text processing. The post highlights the importance of tokenization in language models and how GPT-3 uses tokens. https://techcommunity.microsoft.com/t5/healthcare-and-life-sciences/unlocking-the-power-of-tokens-optimizing-token-usage-in-gpt-for/ba-p/3826665 #Microsoft #textprocessing #tokens #softcorpremium

#microsoft #textprocessing #tokens #softcorpremium
Jan :rust: :ferris: @[email protected] · 2023-05-21 · 20:32 UTC

Oh, this looks fantastic! ✨
#Rust library to compare strings (or any sequences). 25+ algorithms, pure Rust, common interface, #Unicode support.
https://github.com/life4/textdistance.rs
Based on popular and battle-tested textdistance #Python library (and written by the same author).
Apparently, it also takes algorithms from the #talisman #JavaScript library, which I had wished were written in Rust.
https://github.com/Yomguithereal/talisman
#TextProcessing #NLP #TextDiffing #Diff #RustLang #Crate #CrateTip

#rust #unicode #python #talisman #javascript #textprocessing
Jim Donegan 🎵 ✅ @[email protected] · 2023-02-01 · 15:10 UTC

#ChatGPT with #RobMiles - #Computerphile
https://www.youtube.com/watch?v=viJt_DXTfwA&ab_channel=Computerphile
#ComputerScience #LanguageModel #LanguageModels #LanguageProcessing #TextProcessing #ComputerLearning #AI #ArtificialIntelligence #GPT #GPT3 #AILearning #AITraining #OpenAI

#openai #aitraining #ailearning #gpt3 #gpt #artificialintelligence