home.social

#data-engineering — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #data-engineering, aggregated by home.social.

  1. Regex vs. LLM for B2B document extraction. This week, I tried out both.

    :blobcoffee: The rule-based pipeline with pytesseract + regex worked perfectly for Layout A. For Layout B? Every single field returned None.

    :blobcoffee: Because "PO Number" and "Order Reference" are the same thing for a human. Not for a regex pattern.

    :blobcoffee: The LLM-based approach (pytesseract + Ollama + LLaMA 3) extracted both layouts correctly, without touching a single rule. It even normalized the date format automatically.

    :blobcoffee: But LLMs aren't always the right answer. If your documents are stable, speed matters at scale, or explainability is required, regex might still win.

    Full comparison with code and trade-off breakdown on TDS: shorturl.at/v4gdl

    #Python #DataScience #business #technology #dataengineering #LLM #Automation #OCR
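The label mismatch described in post 1 can be reproduced in a few lines. This is only a minimal sketch, not the author's pipeline: the two sample texts, the pattern names, and the `extract` helper are all made up for illustration, and the OCR step is skipped entirely.

```python
import re

# Hypothetical OCR output for two invoice layouts (invented sample text).
LAYOUT_A = "PO Number: 12345\nDate: 2024-03-01"
LAYOUT_B = "Order Reference: 98765\nDate: 01.03.2024"

# A rule written against Layout A only knows one label.
STRICT = re.compile(r"PO Number:\s*(\S+)")

# The rule-based fix is to enumerate every known alias by hand.
ALIASED = re.compile(r"(?:PO Number|Order Reference):\s*(\S+)")


def extract(pattern: re.Pattern, text: str):
    """Return the first captured value, or None when the label is absent."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract(STRICT, LAYOUT_A))
print(extract(STRICT, LAYOUT_B))
print(extract(ALIASED, LAYOUT_B))
```

The strict pattern finds nothing in the second layout because the label text differs, which matches the "every field returned None" symptom. Adding aliases works, but each new layout means another manual rule.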

  2. Join me in reading Designing Data-Intensive Applications (the new 2nd edition)! I want to host an async, virtual book club with the goal of reading ~1 chapter a week :). The first edition is probably the most recommended data systems book I know, so it's time! More info + signups in the 🧵. #dataengineering

  3. Learn how to build self-healing data pipelines for finance. Implement Evaluator-Optimizer loops to achieve 98% accuracy in autonomous data governance. hackernoon.com/implementing-ev #dataengineering

  4. How #Netflix boosted #ApacheDruid performance: by implementing interval-aware caching, they now serve 84% of analytics results from cache and have reduced query load by 33%.

    The secret? Decomposing rolling window queries into reusable time segments.
    ✅ Reduces scan volume
    ✅ Improves P90 latency
    ✅ Optimizes real-time analytics

    Details on #InfoQ: bit.ly/4uHG4DE

    #SoftwareArchitecture #DistributedSystems #DataAnalytics #TimeSeriesData #Caching #BigData #DataEngineering
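The idea in post 4 of decomposing rolling window queries into reusable time segments might look roughly like the sketch below. It is not Netflix's or Druid's implementation. It assumes an additive aggregate (a count), window bounds aligned to a fixed segment size, and segments that no longer change once cached. `SegmentCache`, `SEGMENT`, `EVENTS`, and `count_events` are hypothetical names.

```python
from typing import Callable, Dict

SEGMENT = 3600  # assumed segment size in seconds (one hour)


class SegmentCache:
    """Caches per-segment results so overlapping windows reuse them."""

    def __init__(self, compute: Callable[[int, int], int]):
        self._compute = compute           # expensive scan over [lo, hi)
        self._cache: Dict[int, int] = {}  # segment start -> partial result
        self.hits = 0
        self.misses = 0

    def query(self, start: int, end: int) -> int:
        """Aggregate over [start, end); bounds must align to SEGMENT."""
        if start % SEGMENT or end % SEGMENT:
            raise ValueError("window must align to segment boundaries")
        total = 0
        for seg_start in range(start, end, SEGMENT):
            if seg_start in self._cache:
                self.hits += 1
            else:
                self.misses += 1
                self._cache[seg_start] = self._compute(seg_start, seg_start + SEGMENT)
            total += self._cache[seg_start]
        return total


# Toy backing store: event timestamps in seconds (invented data).
EVENTS = [10, 3700, 3800, 7300, 11000]


def count_events(lo: int, hi: int) -> int:
    return sum(lo <= t < hi for t in EVENTS)


cache = SegmentCache(count_events)
first = cache.query(0, 3 * SEGMENT)         # cold: scans three segments
second = cache.query(SEGMENT, 4 * SEGMENT)  # window slid by one hour
print(first, second, cache.hits, cache.misses)
```

When the window slides forward by one segment, only the newest segment is scanned and the rest come from the cache, which is where the reduced scan volume comes from. A real system would also need a policy for the most recent segment while it is still receiving data.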

  5. A few gotchas worth sharing from crawling techdocs: 1. add an explicit exclude list for translation subpaths (/de/), 2. filter out mailto: / javascript: hrefs, 3. techdocs often include PDFs: download and extract text :). More learnings and code snippets in my most recent blog post. 🧵 #dataengineering
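The first two gotchas in post 5 can be sketched with the standard library alone. This is not the code from the linked blog post: `keep_link`, `is_pdf`, `EXCLUDED_PREFIXES`, and the example.com URLs are hypothetical, and the exclude list is assumed to match path prefixes only. PDF text extraction is left out because it depends on the library chosen.

```python
from urllib.parse import urljoin, urlparse

# Assumed exclude list: translation subpaths to skip, matched as path prefixes.
EXCLUDED_PREFIXES = ("/de/",)


def keep_link(base_url: str, href: str):
    """Resolve href against base_url; return None if it should be skipped."""
    absolute = urljoin(base_url, href.strip())
    parsed = urlparse(absolute)
    # Drops mailto:, javascript:, tel: and anything else that is not a web page.
    if parsed.scheme not in ("http", "https"):
        return None
    # Drops translated copies of the docs.
    if parsed.path.startswith(EXCLUDED_PREFIXES):
        return None
    return absolute


def is_pdf(url: str) -> bool:
    """Flag links that need download-and-extract instead of HTML parsing."""
    return urlparse(url).path.lower().endswith(".pdf")


base = "https://docs.example.com/guide/"
for href in ["install.html", "mailto:team@example.com",
             "javascript:void(0)", "/de/intro", "/files/manual.PDF"]:
    print(href, "->", keep_link(base, href))
```

Checking the scheme after resolving the link handles relative and absolute hrefs with the same rule.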

  6. How Olivia Chen Breaks Down the Modern Data Stack and Why the Architecture Conversation Matters [Ad]

    The modern data stack is one of the most discussed and frequently misunderstood topics in enterp...

    #Business #Cloud-Computing #data #Data-Engineering #Data-Governance #Enterprise-Technology #sponsored #Technology

  7. In this #InfoQ podcast, Somtochi Onyekwere breaks down:
    • Recent developments in #DistributedDataSystems
    • How to achieve fast, eventually consistent replication across distributed nodes
    • Using #CRDTs (Conflict-free Replicated Data Types) to resolve data conflicts seamlessly

    🎧 Listen here: bit.ly/49OiNaE

    #DataEngineering #EventualConsistency
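For readers new to the CRDTs mentioned in post 7, a grow-only counter (G-Counter) is one of the simplest examples. The sketch below is a textbook illustration and is not taken from the podcast. The class and node names are arbitrary.

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Element-wise max: commutative, associative, and idempotent."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a, b = GCounter("node-a"), GCounter("node-b")
a.increment(2)   # updates happen independently on each replica
b.increment(3)
a.merge(b)       # replicas exchange state in any order
b.merge(a)
a.merge(b)       # merging again changes nothing
print(a.value(), b.value())
```

Because the merge takes the per-node maximum, replicas converge to the same value regardless of message order or duplication, which is the property that lets conflicts resolve without coordination.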