#trtllm — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #trtllm, aggregated by home.social.
Fitting an LLM on a GPU is a bit like photography. Model weights = film sensitivity, activation size = shutter speed, I/O tensors = aperture. These 3 dials control your model's memory footprint, just as they shape a photo's exposure.
Just realised this while trying to fit Llama 3.1 on my 24 GB GPU with TRT-LLM. See https://nvidia.github.io/TensorRT-LLM/reference/memory.html
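To put rough numbers on the three dials, here is a minimal back-of-the-envelope sketch. It is not TRT-LLM's own accounting: the function names are hypothetical, the 8-billion-parameter count and FP16 precision are assumptions (the post does not say which Llama 3.1 variant or precision was used), and the activation and I/O tensor sizes are made-up placeholders, since the real values depend on batch size, sequence length, and build settings.

```python
GIB = 1024 ** 3  # bytes in one GiB


def weight_bytes(num_params: int, bytes_per_param: float) -> float:
    """Memory taken by the model weights alone."""
    return num_params * bytes_per_param


def fits(gpu_gib: float, weights_b: float, activations_b: float, io_b: float):
    """Sum the three contributors and compare against GPU capacity.

    Returns (fits_on_gpu, total_in_gib).
    """
    total = weights_b + activations_b + io_b
    return total <= gpu_gib * GIB, total / GIB


# Assumption: about 8e9 parameters stored as FP16 (2 bytes each).
w = weight_bytes(8_000_000_000, 2)

# Placeholder budgets for activations and I/O tensors (illustrative only).
ok, total_gib = fits(24, w, activations_b=2 * GIB, io_b=4 * GIB)

print(f"weights: {w / GIB:.1f} GiB, total: {total_gib:.1f} GiB, fits: {ok}")
```

Under these assumptions the weights alone take roughly 15 GiB of the 24, so the activation and I/O budgets are the dials with room to move.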