#gpt4v — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #gpt4v, aggregated by home.social.
-
🔍 Major breakthrough in multimodal AI research:
#InfinityMM dataset launches with 43.4M entries across 4 categories: 10M image descriptions, 24.4M visual instructions, 6M high-quality instructions & 3M #AI generated data
🧠 Technical highlights:
New #AquilaVL2B model uses #LLaVA architecture with #Qwen25 language model & #SigLIP for image processing
Despite only 2B parameters, achieves state-of-the-art results in multiple benchmarks
Exceptional performance: #MMStar (54.9%), #MathVista (59%), #MMBench (75.2%)
🚀 Training innovation:
4-stage training process with increasing complexity
Combines image recognition, instruction classification & response generation
Uses #opensource models like RAM++ for data generation
💡 Industry impact:
Model trained on both #Nvidia A100 GPUs & Chinese chips
Complete dataset & model available to research community
Shows promising results compared to commercial systems like #GPT4V
-
Intrigued by multimodal (text + vision) models like LLaVA, I tried an experiment: a browser extension that walks the DOM, finds images without useful alt text, converts each image to Base64, sends it to LLaVA 1.5 7B (running in llama.cpp's server), and injects the generated description back into the image tag's alt attribute. It needs much more work and testing, but it's amazing what a ~5GB model (quantised at 5 bits) can do!
Edit: now on Github: https://github.com/daaain/image-alt-text-generator-extension
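The extension's core loop could be sketched roughly like this — a minimal illustration, not the actual repo code. The function names are mine, and the request shape assumes llama.cpp server's multimodal `/completion` endpoint with its `[img-N]` prompt placeholder and `image_data` field:

```javascript
// Build the JSON body for llama.cpp server's /completion endpoint
// (assumed API): the prompt references the image via an [img-N] placeholder
// that matches an id in the image_data array.
function buildLlavaRequest(base64Image, imageId = 10) {
  return {
    prompt: `USER: [img-${imageId}] Describe this image concisely for use as alt text.\nASSISTANT:`,
    image_data: [{ data: base64Image, id: imageId }],
    n_predict: 128,
    temperature: 0.1,
  };
}

// Re-encode an already-loaded <img> as raw Base64 via a canvas (browser-only).
async function imageToBase64(img) {
  const canvas = document.createElement('canvas');
  canvas.width = img.naturalWidth;
  canvas.height = img.naturalHeight;
  canvas.getContext('2d').drawImage(img, 0, 0);
  // Strip the "data:image/jpeg;base64," prefix; the server expects raw Base64.
  return canvas.toDataURL('image/jpeg').split(',')[1];
}

// Walk the DOM, describe images that lack alt text, and inject the result.
async function describeImages(serverUrl = 'http://localhost:8080') {
  for (const img of document.querySelectorAll('img:not([alt]), img[alt=""]')) {
    const body = buildLlavaRequest(await imageToBase64(img));
    const res = await fetch(`${serverUrl}/completion`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
    const { content } = await res.json();
    img.alt = content.trim();
  }
}
```

Note that `canvas.toDataURL` only works for same-origin images (or those served with permissive CORS headers); a real extension would likely need host permissions or a background fetch to handle cross-origin images.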