#tokenizers — Public Fediverse posts on home.social

रञ्जित (Ranjit Mathew) @[email protected] · 2025-08-22 · 13:09 UTC

Great 👌🏽:

“Strategies For Very Fast Lexers”, Matteo / ‘xnacly’ (https://xnacly.me/posts/2025/fast-lexer-strategies/).

Via HN: https://news.ycombinator.com/item?id=44560871

On Lobsters: https://lobste.rs/s/75zw2o/strategies_for_very_fast_lexers

#Compilers #Lexers #Tokenizers #LexicalAnalyzers #Speed #C #Programming #Efficiency #Optimization #PLDI

Arthur Hau, PhD🐶🐱🌱🎵🦣 @[email protected] · 2025-05-16 · 08:45 UTC

To most people, the word #token is a black box. I am not using the #tokenizers commonly used in #DeepLearning #LLM work. Instead, I am using my own #WordCoding system, which I call yxxx+. It uses base 16 to encode 300 common ESL English words for my #SLM project. y is a single hex digit (0 to F) that denotes the #POS (part of speech) of a word, and xxx is three hex digits. In theory, I can expand the model to about 4000 "base" words. The + denotes an additional code, which I will explain later. #AI
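
The yxxx layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the `POS_CODES` table and the `encode`/`decode` helpers are hypothetical names, the post does not say which hex digit maps to which part of speech, and the trailing "+" code is left out because the post has not defined it yet.

```python
# Hypothetical sketch of a "yxxx" word code: one hex digit for the part of
# speech (y) followed by three hex digits for the word's index (xxx).

# Assumed POS assignments; the post does not list the real ones.
POS_CODES = {"noun": 0x0, "verb": 0x1, "adjective": 0x2}


def encode(pos: str, index: int) -> str:
    """Return the 4-character hex code for a word index within a POS class."""
    if not 0 <= index <= 0xFFF:
        raise ValueError("index must fit in three hex digits (0..4095)")
    return f"{POS_CODES[pos]:X}{index:03X}"


def decode(code: str) -> tuple[str, int]:
    """Split a 4-character hex code back into (part of speech, index)."""
    if len(code) != 4:
        raise ValueError("expected exactly four hex digits")
    pos_by_code = {value: name for name, value in POS_CODES.items()}
    return pos_by_code[int(code[0], 16)], int(code[1:], 16)


print(encode("verb", 42))   # 102A
print(decode("102A"))       # ('verb', 42)
```

Three hex digits give 16^3 = 4096 distinct indices, which is in line with the "about 4000" figure if that figure is read as the capacity of the xxx part.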

.NET Blog @[email protected] · 2025-04-22 · 17:05 UTC

Introducing the AI Dev Gallery: Your Gateway to Local AI Development with .NET
https://devblogs.microsoft.com/dotnet/introducing-ai-dev-gallery-gateway-to-local-ai-development/

#microsoft #NET #AI #NET_9 #dev_tools #generative_ai #Machine_Learning #tokenizers #vector_search

michabbb @[email protected] · 2024-10-01 · 22:01 UTC

🔧 #code2prompt: A command-line tool for converting codebases to #LLM prompts

Key features:
• 📁 Generates well-formatted #Markdown prompts with source tree structure
• 🛠️ Customizable #Handlebars templates for versatile prompt generation
• 🔍 Respects .gitignore and supports file filtering with glob patterns
• 🔢 Displays token count using various #tokenizers (cl100k, p50k, r50k_base)
• 📊 #Git diff integration for commit messages and #PullRequest descriptions
• 📋 Automatic clipboard copy and option to save output to file

Additional capabilities:
• 🔢 Line numbering for source code blocks
• 🔀 JSON output option for structured data
• 🚫 Exclusion of files/folders from source tree
• 📝 Support for user-defined variables in templates
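
To make the feature lists above concrete, here is a rough Python sketch of the general idea: filter files with glob patterns, list the source tree, and emit each file as a fenced block with optional line numbers. It is not code2prompt's implementation (the tool is written in Rust) and does not reproduce its CLI or templates; `build_prompt` and its parameters are hypothetical names chosen for illustration, and token counting is left out.

```python
from fnmatch import fnmatch


def build_prompt(files: dict[str, str], exclude: list[str],
                 line_numbers: bool = False) -> str:
    """Build a Markdown-style prompt from {relative path: source text}.

    `exclude` is a list of glob patterns; matching paths are dropped.
    """
    fence = "`" * 3  # built at runtime to avoid a literal fence in this example
    kept = {path: text for path, text in sorted(files.items())
            if not any(fnmatch(path, pattern) for pattern in exclude)}

    # Source tree section: one bullet per kept file.
    parts = ["Source tree:", ""]
    parts += [f"- {path}" for path in kept]

    # One fenced block per file, optionally with line numbers.
    for path, text in kept.items():
        lines = text.splitlines()
        if line_numbers:
            lines = [f"{n:4d} | {line}" for n, line in enumerate(lines, 1)]
        parts += ["", f"`{path}`:", "", fence, *lines, fence]
    return "\n".join(parts)


demo = {"src/main.rs": "fn main() {}\n", "Cargo.lock": "# generated\n"}
print(build_prompt(demo, exclude=["*.lock"], line_numbers=True))
```

Taking an in-memory mapping rather than walking the filesystem keeps the sketch self-contained; a real tool would also need to parse .gitignore rules, which `fnmatch` alone does not handle.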

#opensource project written in #Rust, available on #crates_io and #AUR

Useful for:
• Quick #LLM prompt generation from codebases
• Code documentation and analysis
• Bug finding and security vulnerability assessment
• Performance optimization suggestions

https://github.com/mufeedvh/code2prompt
