home.social

#dataengineering — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #dataengineering, aggregated by home.social.

  1. Regex vs. LLM for B2B document extraction. This week, I tried out both.

    :blobcoffee: The rule-based pipeline with pytesseract + regex worked perfectly for Layout A. For Layout B? Every single field returned None.

    :blobcoffee: Because "PO Number" and "Order Reference" are the same thing for a human. Not for a regex pattern.

    :blobcoffee: The LLM-based approach (pytesseract + Ollama + LLaMA 3) extracted both layouts correctly, without touching a single rule. It even normalized the date format automatically.

    :blobcoffee: But LLMs aren't always the right answer. If your documents are stable, speed matters at scale, or explainability is required, regex might still win.

    Full comparison with code and trade-off breakdown on TDS: shorturl.at/v4gdl

    #Python #DataScience #business #technology #dataengineering #LLM #Automation #OCR
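
The Layout A vs. Layout B failure described in this post can be reproduced in a few lines. This is a minimal sketch, not the code from the linked article: the sample OCR text, field names, and patterns are all hypothetical, chosen only to show how rules keyed to one layout's labels return None on another.

```python
import re

# Hypothetical OCR output for two layouts (not taken from the article).
LAYOUT_A = "PO Number: 45012\nDate: 2024-03-05"
LAYOUT_B = "Order Reference: 45012\nDated 05/03/2024"

# Rules written against Layout A's labels only (assumed for illustration).
RULES = {
    "po_number": re.compile(r"PO Number:\s*(\d+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
}

def extract(text):
    """Return the captured value per field, or None when a rule does not match."""
    result = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result

print(extract(LAYOUT_A))  # both fields found
print(extract(LAYOUT_B))  # every field is None: the labels changed
```

Adding "Order Reference" as an alternative label would fix this one case, but each new layout needs another rule, which is the maintenance cost the post is pointing at.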

  6. Join me in reading Designing Data-Intensive Applications (the new 2nd edition)! I want to host an async, virtual book club with the goal of reading ~1 chapter a week :). The first edition is probably the most recommended data systems book I know, so it's time! More info + signups in the 🧵.

  7. 🔍 Spark + Elasticsearch Debugging 🧵

    Building a cybersecurity analytics platform. Hit 2 blockers:

    ❌ JAR path mismatch → Fixed absolute path
    ❌ No data nodes (single-node Docker ES) → Added es.nodes.wan.only=true

    ✅ Result: 89 records loaded. Working pipeline!

    Lesson: Verify JAR paths + disable node discovery for single-node ES.

    #PySpark #Elasticsearch #DataEngineering #CyberSecurity #Debugging
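
The two fixes in this post might look roughly like the sketch below. The host, port, JAR path, and index name are placeholders, not values from the post; only `es.nodes.wan.only=true` comes from the original. The helper just builds the connector options, so it runs without Spark installed.

```python
def es_write_options(host="localhost", port=9200):
    """Connector options for a single-node Elasticsearch running in Docker."""
    return {
        "es.nodes": host,       # placeholder host
        "es.port": str(port),   # placeholder port
        # Connect only to the declared node instead of discovering cluster nodes.
        "es.nodes.wan.only": "true",
    }

options = es_write_options()
print(options)

# Usage with a live Spark session (not executed here; path and index are placeholders):
# spark = (SparkSession.builder
#          .config("spark.jars", "/absolute/path/to/elasticsearch-spark-connector.jar")
#          .getOrCreate())
# df.write.format("org.elasticsearch.spark.sql").options(**options).mode("append").save("security-events")
```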

  9. How #Netflix boosted #ApacheDruid performance: by implementing interval-aware caching, they now serve 84% of analytics results from cache and have reduced query load by 33%.

    The secret? Decomposing rolling window queries into reusable time segments.
    ✅ Reduces scan volume
    ✅ Improves P90 latency
    ✅ Optimizes real-time analytics

    Details on #InfoQ: bit.ly/4uHG4DE

    #SoftwareArchitecture #DistributedSystems #DataAnalytics #TimeSeriesData #Caching #BigData #DataEngineering
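
To illustrate the general idea of decomposing a rolling window into reusable time segments, here is a toy sketch. It is not Netflix's implementation: the hourly bucket size, the in-memory dict cache, and the `scan_backend` stand-in are all assumptions made for the example.

```python
from datetime import datetime, timedelta

BUCKET = timedelta(hours=1)  # assumed segment size
cache = {}                   # bucket start -> aggregated value
backend_scans = []           # buckets that actually reached the backend

def scan_backend(bucket_start):
    """Stand-in for an expensive query; returns a fake count for the bucket."""
    backend_scans.append(bucket_start)
    return bucket_start.hour  # placeholder value

def rolling_sum(end, hours):
    """Sum the last `hours` full buckets before the hour-aligned `end`."""
    total = 0
    for i in range(hours):
        bucket_start = end - BUCKET * (i + 1)
        if bucket_start not in cache:
            cache[bucket_start] = scan_backend(bucket_start)
        total += cache[bucket_start]
    return total

first = rolling_sum(datetime(2025, 1, 1, 12), hours=6)   # scans six buckets
second = rolling_sum(datetime(2025, 1, 1, 13), hours=6)  # scans only the newest one
print(len(backend_scans))
```

Caching the whole "last 6 hours" result would miss on every refresh because the window keeps moving; caching per segment lets consecutive windows share most of their work. A real system would also need to handle the newest segment, whose data may still be arriving.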

  10. Generative AI nearly caused a perfect candidate to be missed: the CV was rejected because of missing keywords. Solution: manual review and adapted criteria. #Recrutement #IA #RH #Erreur #DataEngineering ... linkedin.com/posts/gabriel-cha

  11. How Olivia Chen Breaks Down the Modern Data Stack and Why the Architecture Conversation Matters [Ad] The modern data stack is one of the most discussed and frequently misunderstood topics in enterp...

    #Business #Cloud-Computing #data #Data-Engineering #Data-Governance #Enterprise-Technology #sponsored #Technology

  12. In this #InfoQ podcast, Somtochi Onyekwere breaks down:
    • Recent developments in #DistributedDataSystems
    • How to achieve fast, eventually consistent replication across distributed nodes
    • Using #CRDTs (Conflict-free Replicated Data Types) to resolve data conflicts seamlessly

    🎧 Listen here: bit.ly/49OiNaE

    #DataEngineering #EventualConsistency
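
For readers new to CRDTs, a grow-only counter (G-Counter) is one of the simplest examples. The sketch below is a textbook illustration and is not taken from the podcast: each node increments only its own slot, and replicas merge by taking the per-node maximum, so merges can arrive in any order or more than once.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged with an element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        # A node only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Max is commutative, associative, and idempotent, so replicas converge.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # repeating a merge changes nothing
print(a.value(), b.value())
```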

  13. 🎉 Milestone Unlocked: Finished the Data Engineering Zoomcamp!

    In 10 weeks, I moved from scripting to architecting systems. We built real production-grade infrastructure using Spark, Kafka, Airflow, and Kestra—not just hobby projects.

    Capstone: A Storage Hard Drive Dashboard using real failure data from Backblaze
    Stack: Terraform + Docker infra, Airflow orchestration, dbt modeling, Streamlit viz.

    Key Lessons:
    ✅️ "It works on my laptop" isn't a strategy.
    ✅ Need IaC, partitioning, clustering, and strict error handling.
    ✅ dbt ensures reproducible, tested models.
    ✅ Infra is invisible work—if it breaks, your code fails.

    Take the leap! It’s challenging, but by week 10 the pieces click into place. Seeing my pipeline run autonomously felt like crossing the finish line. 🏁

    Thanks, DataTalks.Club team! On to the next challenge!

    My project: github.com/ammartin8/hard_driv

    #mastodon #fediverse #data #spark #dataengineering #ai #technology #datatools #datapipelines #fedihire #thursday #sql #observability #etl #python #github

  14. End-to-End Storage Drive Analytics Platform Complete! 🚀

    Spent the past weeks on my Data Engineering Zoomcamp final project. Excited to share an end-to-end platform analyzing Backblaze hard drive data to bridge enterprise telemetry and consumer accessibility.

    The pipeline ingests daily SMART snapshots into GCS, builds a star schema with dbt, and serves insights via a Streamlit dashboard showing failure rates by brand and model. Infrastructure is managed with Terraform; the warehouse was optimized using partitioning to improve query performance.

    To increase accessibility, I'm switching to open-source/free-ish tools so anyone can dive in without a cloud signup (plus, my trial expired 🙈). My goal is to provide drive reliability data so creators, homelabbers, businesses, and casual users feel informed about their next storage purchase. 😊

    Check out the repo for details: github.com/ammartin8/hard_driv
    #DataEngineering #harddrive #opensource #cloud #streamlit #buildinpublic #selfhosting #mastodon #python #fediverse #backblaze
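
One plausible way to get a failure rate per model from daily snapshots is failures divided by drive-days. The sketch below assumes one row per drive per day with a 0/1 failure flag; the rows, model names, and metric choice are hypothetical and may differ from what the linked repo actually computes.

```python
from collections import defaultdict

# Hypothetical daily snapshot rows: (date, serial_number, model, failure flag).
rows = [
    ("2024-01-01", "S1", "MODEL-A", 0),
    ("2024-01-02", "S1", "MODEL-A", 1),
    ("2024-01-01", "S2", "MODEL-A", 0),
    ("2024-01-02", "S2", "MODEL-A", 0),
    ("2024-01-01", "S3", "MODEL-B", 0),
    ("2024-01-02", "S3", "MODEL-B", 0),
]

def failures_per_drive_day(snapshot_rows):
    """Return {model: failures / drive-days}, counting each row as one drive-day."""
    drive_days = defaultdict(int)
    failures = defaultdict(int)
    for _date, _serial, model, failed in snapshot_rows:
        drive_days[model] += 1
        failures[model] += failed
    return {model: failures[model] / drive_days[model] for model in drive_days}

rates = failures_per_drive_day(rows)
print(rates)
```

Normalizing by drive-days rather than by drive count keeps models with short and long service histories comparable.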

  15. Most GNSS survey failures aren’t equipment—they’re poor site selection.

    I’m experimenting with using low-cost GNSS receivers (L76K vs u-blox MAX-M10S) to pre-qualify sites before running NOAA OPUS.

    ~30–90 min logs → PostgreSQL → HDOP, CN0, wander analysis.

    Early results show tradeoffs between signal strength and positional stability.

    salemdata.net/johnpress/?p=628

    Anyone exploring similar GNSS site validation?
    #GNSS #GPS #Geodesy #ESP32 #PostgreSQL #OpenSource #DataEngineering #Meshtastic
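
As a rough illustration of a "wander" metric, the sketch below computes the RMS horizontal distance of logged fixes from their mean position. This is an assumption about what wander could mean here, not the author's method, and the flat-earth conversion and constant are approximations suited only to the metre-scale scatter of a stationary receiver.

```python
import math

def horizontal_scatter_m(fixes):
    """RMS horizontal distance in metres of (lat, lon) fixes from their mean."""
    mean_lat = sum(lat for lat, _ in fixes) / len(fixes)
    mean_lon = sum(lon for _, lon in fixes) / len(fixes)
    m_per_deg_lat = 111_320.0  # approximate metres per degree of latitude
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(mean_lat))
    squared = [
        ((lat - mean_lat) * m_per_deg_lat) ** 2
        + ((lon - mean_lon) * m_per_deg_lon) ** 2
        for lat, lon in fixes
    ]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical fixes from a stationary receiver (decimal degrees).
sample = [(44.94290, -123.03510), (44.94291, -123.03512), (44.94289, -123.03509)]
print(round(horizontal_scatter_m(sample), 2), "m")
```

In a PostgreSQL-backed workflow the same figure could be computed per logging session and compared against HDOP and CN0 for that session.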

  17. Kafka in-cluster replication won't save you from a regional outage. You need a battle-tested Multi-Region strategy.
    Read this to master:
    - RPO/RTO trade-offs for global scale
    - Active-Active vs. Stretched Clusters (3-DC & 2.5-DC)
    - Solving the offset translation nightmare
    - Real-world failover testing
    softwaremill.com/guide-to-apac

    #ApacheKafka #SystemArchitecture #DisasterRecovery #DataEngineering

  18. If you knew today exactly where the #ITMarket will be starving for talent tomorrow,
    how would your next salary negotiation look?
    Forget the Master’s degree. Forget the Bachelor’s. If you have massive #Grit in your system, you are the solution the market is screaming for - even if the corporate recruiting filters haven't realized it yet.

    This is autonomy engineered in a shell.
    This is #BruteForceEngineering.
    #DataScience #DataEngineering #HiringGap #Python #Grit #gritlab #AnarchyInTheShell

  19. The Scraper-Base-Kit is now Private.

    After reviewing the velocity of our global ingest, I’ve decided to move the base-kit to the Private Core.

    The engine is too fast, the footprint too small.

    This thing is like a shotgun that fires once at the world!!!!
    We are not just building tools; we are guarding the data.

    #ArchLinux #RootCause #IT #hiringgap #analsys #GitHubActions #AnarchyInTheShell #ContinuumHQ #Dataengineering #DataScience #agenticworkflow #SRE #Grit #Gritlab #buildInPublic

  20. 0:07! It's midnight!
    The Machine is silent, but the Pipeline is Green. 🟢

    What started as a day of debugging ended as a fully automated Intelligence. The Elégence Report Engine is live. 217 countries, one Grit-Score, zero human error.

    T480 closing now. Tomorrow, the world gets the data it deserves.

    #ArchLinux #RootCause #IT #hiringgap #analsys #GitHubActions #AnarchyInTheShell #ContinuumHQ #Dataengineering #DataScience #agenticworkflow #SRE #Grit #Gritlab #buildInPublic

  21. System says exit 1. I say: Challenge accepted.

    Debugging the engine.

    No corporate cloud fluff, just raw logs, Arch Linux, and a broken pipeline that’s about to get crushed.

    Don't solve theoretical problems. Break real ones. ✊🔥

    #ArchLinux #RootCause #LUKS #GitHubActions #AnarchyInTheShell #ContinuumHQ #Dataengineering #DataScience #agenticworkflow #SRE #Grit #Gritlab

  22. Every data professional should understand these seven core concepts.

    From data warehouses and lakes to pipelines, meshes, and governance, these form the foundation of modern analytics infrastructure.
    Mastering them bridges the gap between raw data and actionable business insights.

    📕 ebokify.com/ai-data-science

    #DataEngineering #DataScience #DataAnalytics #ETL #DataWarehouse #BigData #BusinessIntelligence #DataPipeline #DataGovernance

  23. Serious about SQL?
    Start with structure, not random tutorials.

    I’ve created 65 structured SQL lessons covering:
    • Fundamentals
    • Joins & Advanced Queries
    • CTEs & Window Functions
    • Query Optimization
    • BI Concepts
    • Interview Preparation

    Each lesson builds on the previous one.
    Clear. Practical. Designed for data analysts.

    📕 ebokify.com/sql

    📕 ebokify.com/data-analysis

    #SQL #DataAnalytics #DataAnalyst #DataEngineering #BusinessIntelligence #AnalyticsCareers #TechEducation #DataScience

  24. Shifting Left delivers clean, reliable, and accessible data to everyone who needs it - right when they need it.

    The result? Less complexity, lower overhead, and far less break-fix work, freeing teams to focus on higher-value problems.

    At the core of a #ShiftLeft strategy are Data Products. They form the backbone of healthy data communication and ensure quality is built in - not patched on later.

    📖 Great insights from this #InfoQ article on rethinking the Medallion Architecture: bit.ly/3WHjxsf

    #SoftwareArchitecture #DataMesh #DataEngineering #DataLake #DataPipelines

  25. #throwback From data swamp to data lakehouse 🏗️ Josef Machytka shares real-world lessons on building a lakehouse with PostgreSQL, BigQuery, and GCS—covering formats, scaling, governance, and data quality. Keep your data clean and useful. ▶️ Watch now! youtube.com/watch?v=AUdEjYnXGb

    #PostgreSQL #PGDay #PPDD #DataLakehouse #DataEngineering

  26. 💡 Apache Airflow 2025 Recap

    2026 has arrived, which makes it a good time to review the changes the last year brought to the ever-evolving landscape of open source data tools.

    In our new #blog post, we look at Apache Airflow and how the leading open source orchestration platform changed over the last year with the big v3 major release.

    🔗 nextlytics.com/blog/apache-air

    #apacheairflow #airflow #opensource #dataengineering #datascience