home.social

#tokenizers — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #tokenizers, aggregated by home.social.

  1. To most people the word #token is a black box. I am not using the #tokenizers commonly used in #DeepLearning and #LLM work. Instead I am using my own #WordCoding system, which I call yxxx+. It encodes 300 common ESL English words in base 16 for my #SLM project. y is a single hex digit (0-F) that denotes the #POS (part of speech) of a word, and xxx is three hex digits. Since three hex digits give 16^3 = 4096 values, I can theoretically expand my model to about 4000 "base" words. + denotes an additional code, which I will explain later. #AI
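
A minimal sketch of how a yxxx code could be packed and unpacked, assuming that y is a part-of-speech digit and xxx is a word index within three hex digits. The `POS_DIGITS` table and the `encode`/`decode` helpers are hypothetical: the post does not say which digit maps to which part of speech, and the trailing + code is left out because the post has not explained it yet.

```python
from typing import Tuple

# Hypothetical table: the post does not say which digit maps to which part of speech.
POS_DIGITS = {"noun": 0x0, "verb": 0x1, "adjective": 0x2}
POS_NAMES = {digit: name for name, digit in POS_DIGITS.items()}


def encode(pos: str, index: int) -> str:
    """Pack a part of speech and a word index into a four-digit hex code (yxxx)."""
    if not 0 <= index <= 0xFFF:
        raise ValueError("word index must fit in three hex digits (0 to 4095)")
    return f"{POS_DIGITS[pos]:X}{index:03X}"


def decode(code: str) -> Tuple[str, int]:
    """Split a yxxx code back into (part of speech, word index)."""
    if len(code) != 4:
        raise ValueError("expected exactly four hex digits")
    return POS_NAMES[int(code[0], 16)], int(code[1:], 16)


print(encode("verb", 42))   # 102A
print(decode("102A"))       # ('verb', 42)
```

Keeping the part of speech in the leading digit means a code's grammatical class can be read without a dictionary lookup, which seems to be the point of the scheme.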

  2. 🔧 #code2prompt: A command-line tool for converting codebases to #LLM prompts

    Key features:
    • 📁 Generates well-formatted #Markdown prompts with source tree structure
    • 🛠️ Customizable #Handlebars templates for versatile prompt generation
    • 🔍 Respects .gitignore and supports file filtering with glob patterns
    • 🔢 Displays token count using various #tokenizers (cl100k, p50k, r50k_base)
    • 📊 #Git diff integration for commit messages and #PullRequest descriptions
    • 📋 Automatic clipboard copy and option to save output to file

    Additional capabilities:
    • 🔢 Line numbering for source code blocks
    • 🔀 JSON output option for structured data
    • 🚫 Exclusion of files/folders from source tree
    • 📝 Support for user-defined variables in templates

    #opensource project written in #Rust, available on #crates_io and #AUR

    Useful for:
    • Quick #LLM prompt generation from codebases
    • Code documentation and analysis
    • Bug finding and security vulnerability assessment
    • Performance optimization suggestions

    github.com/mufeedvh/code2promp
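
Three of the features listed in the post above (source tree, glob-based exclusion, line-numbered code blocks) can be sketched roughly as follows. This is not code2prompt's implementation, which is written in Rust and driven by Handlebars templates; `render_prompt` and its output layout are hypothetical, and token counting and .gitignore handling are left out.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

FENCE = "`" * 3  # built at runtime so this example does not nest code fences


def render_prompt(files: Dict[str, str], exclude: List[str]) -> str:
    """Render {path: source} as a Markdown prompt, skipping paths that match a glob."""
    kept = {
        path: source
        for path, source in sorted(files.items())
        if not any(fnmatchcase(path, pattern) for pattern in exclude)
    }
    lines = ["Source tree:", ""]
    lines += [f"- {path}" for path in kept]
    for path, source in kept.items():
        lines += ["", f"`{path}`:", "", FENCE]
        # Prefix each source line with its 1-based line number.
        lines += [f"{n:4d} | {text}" for n, text in enumerate(source.splitlines(), 1)]
        lines.append(FENCE)
    return "\n".join(lines)


files = {"src/main.rs": "fn main() {}\n", "target/debug/app": "binary"}
print(render_prompt(files, exclude=["target/*"]))
```

Note that `fnmatchcase` lets `*` match across `/`, so `target/*` excludes everything under `target/`; real .gitignore matching has more rules than this.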