home.social

#text-processing — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #text-processing, aggregated by home.social.

  1. @rl_dane

    It's food for thought, too. All this fuss made about cat, repeated to this day in some quarters, and of the three things explicitly pointed out as wrong with what Berkeley did, only a tool to do one of them, vis(1), was ever written in 43 years.

    The ironic cherry on top is that the only operating systems that come with vis out of the box are the BSDs.

    #Unix #TextProcessing #nosh #cat #vis #linenumber #sqz

  3. @rl_dane

    The 'cat -v considered harmful' folks (Kernighan and Pike) invented vis(1).

    > The preferred approach in this case is a separate program that deals with non-printable characters. We called ours vis […] because its job is to make things visible.

    I suspect that they would approve of sqz(1) likewise abstracted from cat -s, because they also said

    > cat isn’t […] for compressing multiple blank lines […] it’s for concatenating files.

    #vis #sqz #cat #Unix #TextProcessing #nosh
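
As a rough illustration of the "separate program" idea quoted above, here is a minimal Python sketch of a vis-like filter. The function name `make_visible` and the octal escape format are assumptions made for illustration; they are not taken from any particular vis(1) implementation.

```python
def make_visible(data: bytes) -> str:
    """Render bytes as text, showing non-printable bytes as octal escapes."""
    out = []
    for byte in data:
        if byte in (0x0A, 0x09) or 0x20 <= byte < 0x7F:
            # Newline, tab and printable ASCII pass through unchanged.
            out.append(chr(byte))
        else:
            # Everything else becomes a backslash plus three octal digits
            # (an assumed format, chosen only for this sketch).
            out.append(f"\\{byte:03o}")
    return "".join(out)


print(make_visible(b"bell:\x07 done\n"), end="")
```

Keeping this in its own filter means cat can stay a plain concatenator, which is the point the quoted passage makes.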

  5. @rl_dane

    Well, since vis(1) came about because of cat -v, it seems that there should be a sqz(1) because of cat -s.

    #Unix #TextProcessing #nosh
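
The sqz(1) in the post above is hypothetical. A minimal sketch of the behaviour it would presumably take over from cat -s could look like this in Python; the name `squeeze_blank_lines`, and the rule that only completely empty lines count as blank, are assumptions.

```python
from typing import Iterable, Iterator


def squeeze_blank_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines, collapsing each run of empty lines into a single one."""
    previous_blank = False
    for line in lines:
        # Assumption: a line is blank only if nothing precedes its newline.
        blank = line.rstrip("\n") == ""
        if blank and previous_blank:
            continue  # drop the second and later blank lines of a run
        yield line
        previous_blank = blank


sample = ["first\n", "\n", "\n", "\n", "second\n"]
print("".join(squeeze_blank_lines(sample)), end="")
```

Used as a filter, the same generator could be fed `sys.stdin` and written out with `sys.stdout.writelines`.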

  7. As you can see, @rl_dane has me wondering whether anyone has ever written a sqz(1) #Unix #TextProcessing tool, to go alongside vis(1).

    tty0.social/@JdeBP/11683474395

    I am tempted to just quickly knock one together and put it into the #nosh toolset.

  9. @rl_dane

    When you view manual pages on a video terminal, the -s ('squeeze') option to the final $PAGER in the pipeline does this. You'll be surprised at how many blank lines there really are in manuals.

    less, (BSD) more, most, and (my) console-tty37-viewer all have this option.

    Given that, it's an "of course" moment, with a nod to the people who think that this should no more be a part of cat than vis(1) is, that cat has a squeeze option too.

    #Unix #TextProcessing #nosh

  11. 🤔 Ah, yet another academic masterpiece on the magical powers of 'grep'—because who knew that sifting through text could be so agentically transformative? 🚀 Apparently, we need an army of agent harnesses to do what Ctrl+F has been mastering since the dawn of time. 😜 Thanks for the 🧠-bending #insights, arXiv!
    arxiv.org/abs/2605.15184 #grep #textprocessing #arXiv #automation #academichumor #HackerNews #ngated

  13. @rl_dane If you’re interested in working with Bible texts, you might want to look at platform.youversion.com/ – it provides free access via APIs and SDKs, so you don’t need to scrape or re‑parse the text yourself. The fast‑track licensing respects copyright and direct access to the source text helps you avoid introducing issues around textual integrity.

    #BibleTech #FaithTech #APIs #TextProcessing

  14. sentencex - by Wikimedia:

    github.com/wikimedia/sentencex

    A sentence segmentation library with wide language support optimized for speed and utility.

    Written in #Rust.

    Bindings are available for #Python, #NodeJS and #WASM

    Might be useful for my #SpeechToText system! 👀

    #NLP #TextProcessing #Segmentation #RustLang
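
To show why sentence segmentation is worth a dedicated library, here is a deliberately naive splitter in Python. It is not how sentencex works and does not use its API; the function name `naive_sentences` is made up for this sketch, which only illustrates the kind of mistake that punctuation-only splitting makes.

```python
import re


def naive_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace (no language rules)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(naive_sentences("Dr. Smith arrived. He was late!"))
# → ['Dr.', 'Smith arrived.', 'He was late!']
```

The abbreviation "Dr." is wrongly treated as a sentence of its own, which is the sort of case that language-aware segmentation aims to handle.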

  16. Building on the 90s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advancements in nearly all NLP techniques of the era, laying the groundwork for today's AI.

    F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA

    #NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique
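
For readers unfamiliar with the term, a statistical n-gram model estimates the probability of a word from the words just before it, using counts from a corpus. The following is a minimal bigram (n = 2) sketch in Python with plain maximum-likelihood estimates and no smoothing; the function names and the toy corpus are illustrative assumptions, not taken from the cited book.

```python
from collections import Counter


def train_bigram_model(tokens):
    """Return a function giving P(word | previous) from raw bigram counts."""
    # Count each token as a context only where another token follows it.
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def probability(previous, word):
        if context_counts[previous] == 0:
            return 0.0  # unseen context; real systems would smooth instead
        return bigram_counts[(previous, word)] / context_counts[previous]

    return probability


p = train_bigram_model("the cat sat on the mat".split())
print(p("the", "cat"))  # → 0.5
print(p("cat", "sat"))  # → 1.0
```

Practical models of that era were trained on far larger corpora and relied on smoothing to cope with word pairs that never occurred in training.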

  18. 🚀 Behold the epic tale of Janet's #PEG #module, where the author heroically excludes regular expressions like they're yesterday's news. 💥 Marvel at the labyrinth of #parsing magic that claims to be more readable, but only if you have a PhD in arcane text processing. 📜✨
    bakpakin.com/writing/how-janet #Janet #readability #textprocessing #regex #HackerNews #ngated

  20. 🔠 Panel: More than Chatbots: Multimodal Large Language Models in Humanities Workflows

    At #DHd2025, Nina Rastinger explores how well #AI handles abbreviations & NER:

    ✅ NER works well, even with small, low-cost models
    ❌ Abbreviations are tricky—costs & resource demands skyrocket
    🚀 GPT o1 improves performance, even on abbreviations, but remains resource-intensive
    Balancing accuracy & efficiency in text processing remains a challenge! ⚖️

    #AI #NER #TextProcessing #DigitalHumanities

  22. New at PragProg

    Staffan Nöteberg helps you really understand how the machinery works under the hood. Learn advanced tools like reluctant quantifiers, lookbehind, and nondeterministic finite automata to write efficient and elegant regexes with ease.

    In this illustrated guide, you gain precisely that understanding, even with no prior knowledge of Regular Expressions.

    pragprog.com/titles/d-snrem

    @staffannoteberg

    #regularexpressions #patternmatching #regex #regexp #textprocessing
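
Two of the features named in the announcement, reluctant quantifiers and lookbehind, can be seen in a few lines using Python's standard `re` module. This is a small sketch with made-up sample strings, not an excerpt from the book.

```python
import re

markup = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" grabs as much as it can, so the match runs to the last ">".
print(re.findall(r"<.+>", markup))
# → ['<b>bold</b> and <i>italic</i>']

# Reluctant: ".+?" grabs as little as it can, so each tag matches separately.
print(re.findall(r"<.+?>", markup))
# → ['<b>', '</b>', '<i>', '</i>']

prices = "cost: $42, weight: 17kg, fee: $8"

# Lookbehind: match digits only when a "$" sits just before them,
# without including the "$" in the match itself.
print(re.findall(r"(?<=\$)\d+", prices))
# → ['42', '8']
```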

  24. Maybe I’m just growing really old. Today I stumbled across a GitHub repository with a few hundred lines of Python that could be one or two awk, sed or grep one-liners.
    Seriously, if you are using Linux or one of the BSDs, learn how to use the standard text utilities that come with the OS.
    In modern times jq should be added to the traditional list.

    #text #python #unix #linux #BSD #textprocessing

  26. Discovered a neat new tool last week: github.com/wr7/refold

    It's similar to `fmt` and `fold` except that it automatically handles prefixes. Vim/Neovim `gq` can do this out of the box but fails (for me at least) when multiple prefixes are present, such as a Markdown block-quote inside Rust comments. E.g.

    ```
    // > Some quoted text
    // > to reflow.
    ```

    `refold` handles this.

    #Rust #RustLang #TextProcessing #TextManipulation #TextEditor
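
The idea of reflowing text while keeping a stacked prefix can be sketched in Python with the standard `textwrap` module. This is not refold's algorithm; the function name `reflow`, the set of characters treated as prefix, and the use of the first line's prefix for every output line are all assumptions of this sketch.

```python
import re
import textwrap

# Assumed prefix characters: whitespace plus common comment and quote markers.
PREFIX_RE = re.compile(r"[ \t/>#*]*")


def reflow(lines, width=40):
    """Re-wrap lines to `width`, repeating the first line's prefix on each."""
    if not lines:
        return []
    prefix = PREFIX_RE.match(lines[0]).group()
    words = []
    for line in lines:
        # Strip each line's own leading prefix run before collecting words.
        body = line[PREFIX_RE.match(line).end():]
        words.extend(body.split())
    wrapped = textwrap.wrap(" ".join(words), width=width - len(prefix))
    return [prefix + chunk for chunk in wrapped]


for line in reflow(["// > Some quoted text", "// > to reflow."], width=80):
    print(line)
# → // > Some quoted text to reflow.
```

Both the Rust comment marker and the Markdown quote marker survive because they are treated as one combined prefix.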

  28. Hello!

    I am pleased to announce a new version of my "CLI text processing with GNU Coreutils" ebook. This ebook will help you learn 20+ specialized text processing commands provided by the coreutils package.

    Links:

    * Free PDF/EPUB: learnbyexample.gumroad.com/l/c (till 10-Apr-2024)
    * Web version: learnbyexample.github.io/cli_t
    * Markdown source, exercise solutions, etc: github.com/learnbyexample/cli_
    * Short video about the book: youtu.be/oCnJLu_PUbY
    * Interactive TUI app: github.com/learnbyexample/TUI- (includes some coreutils exercises)

    I would highly appreciate it if you'd let me know how you felt about this book. It could be anything from a simple thank you, pointing out a typo, mistakes in code snippets, which aspects of the book worked for you (or didn't!) and so on. Reader feedback is essential and especially so for self-published authors.

    Happy learning :)

    #linux #cli #coreutils #textprocessing

  30. Looking to level up your Unix automation skills? Check out our new blog post: Perl: Changing the Game in Unix Automation and Text Processing
    eliza-ng.me/post/sperlunixtool
    #Perl #UnixAutomation #TextProcessing

  31. Ajit Dash has published an article on optimizing the usage of tokens in GPT for efficient text processing. The post highlights the importance of tokenization in language models and how GPT-3 uses tokens. techcommunity.microsoft.com/t5 #Microsoft #textprocessing #tokens #softcorpremium

  32. Oh, this looks fantastic! ✨

    #Rust library to compare strings (or any sequences). 25+ algorithms, pure Rust, common interface, #Unicode support.

    github.com/life4/textdistance.

    Based on popular and battle-tested textdistance #Python library (and written by the same author).

    Apparently, it also takes algorithms from the #talisman #JavaScript library, which I had wished were written in Rust.

    github.com/Yomguithereal/talis

    #TextProcessing #NLP #TextDiffing #Diff #RustLang #Crate #CrateTip
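
As an example of the kind of algorithm such string-comparison libraries collect, here is a minimal Python sketch of Levenshtein edit distance that works on any sequences. It is written from the textbook dynamic-programming definition, not copied from textdistance or talisman, and the function name is chosen for this sketch.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    # previous[j] holds the distance between the processed part of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty sequence
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # → 3
print(levenshtein([1, 2, 3], [1, 3]))    # → 1
```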