home.social
  1. Regex vs. LLM for B2B document extraction. This week, I tried out both.

    :blobcoffee: The rule-based pipeline with pytesseract + regex worked perfectly for Layout A. For Layout B? Every single field returned None.

    :blobcoffee: Because "PO Number" and "Order Reference" are the same thing for a human. Not for a regex pattern.

    :blobcoffee: The LLM-based approach (pytesseract + Ollama + LLaMA 3) extracted both layouts correctly, without touching a single rule. It even normalized the date format automatically.

    :blobcoffee: But LLMs aren't always the right answer. If your documents are stable, speed matters at scale, or explainability is required, regex might still win.

    Full comparison with code and trade-off breakdown on TDS: shorturl.at/v4gdl