#data-engineering — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #data-engineering, aggregated by home.social.
-
Senior Data Engineer | 15 years of experience | Data, Bioinformatics, Cybersecurity, Game-based learning | #DataEngineering #BioInformatique #CyberSécurité #Ludopédagogie #Tech ... https://www.linkedin.com/posts/gabriel-chandesris_cv-data-engineer-senior-15-ann%C3%A9es-dexp%C3%A9riences-ugcPost-7460712652261130241-5hOm
-
Regex vs. LLM for B2B document extraction. This week, I tried out both.
• The rule-based pipeline with pytesseract + regex worked perfectly for Layout A. For Layout B? Every single field returned None.
• Because "PO Number" and "Order Reference" are the same thing for a human. Not for a regex pattern.
• The LLM-based approach (pytesseract + Ollama + LLaMA 3) extracted both layouts correctly, without touching a single rule. It even normalized the date format automatically.
• But LLMs aren't always the right answer. If your documents are stable, speed matters at scale, or explainability is required, regex might still win.
Full comparison with code and trade-off breakdown on TDS: https://shorturl.at/v4gdl
#Python #DataScience #business #technology #dataengineering #LLM #Automation #OCR
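A minimal sketch of the regex failure mode described in this post, assuming made-up OCR text for the two layouts. The sample strings, the pattern, and the function names are illustrative assumptions, not the author's code.

```python
import re

# Hypothetical OCR output for two layouts (made-up sample text).
LAYOUT_A = "PO Number: 12345\nDate: 2024-03-01"
LAYOUT_B = "Order Reference: 12345\nDate: 01/03/2024"

# A rule written against Layout A only.
PO_PATTERN = re.compile(r"PO Number:\s*(\S+)")

# The same rule widened by hand: each new label synonym needs a new alternative.
PO_PATTERN_WIDE = re.compile(r"(?:PO Number|Order Reference):\s*(\S+)")


def extract_po(text, pattern=PO_PATTERN):
    """Return the captured order number, or None when the label does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract_po(LAYOUT_A))                   # matches the Layout A label
print(extract_po(LAYOUT_B))                   # label differs, so no match
print(extract_po(LAYOUT_B, PO_PATTERN_WIDE))  # matches once the rule is widened
```

The widened pattern works, but only because someone read Layout B and edited the rule, which is the maintenance cost the post is pointing at.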
-
Join me in reading Designing Data-Intensive Applications (the new, 2nd Edition)! I want to host an async, virtual book club with the goal to read ~1 chapter a week :). The first edition is probably the most recommended data systems book I know, so it's time! More info + signups in the 🧵. #dataengineering
-
Learn how to build self-healing data pipelines for finance. Implement Evaluator-Optimizer loops to achieve 98% accuracy in autonomous data governance. https://hackernoon.com/implementing-evaluator-optimizer-loops-for-autonomous-data-governance-in-self-healing-pipelines #dataengineering
-
How #Netflix boosted #ApacheDruid performance: by implementing interval-aware caching, they now serve 84% of analytics results from cache and have reduced query load by 33%.
The secret? Decomposing rolling window queries into reusable time segments.
✅ Reduces scan volume
✅ Improves P90 latency
✅ Optimizes real-time analytics
Details on #InfoQ: https://bit.ly/4uHG4DE
#SoftwareArchitecture #DistributedSystems #DataAnalytics #TimeSeriesData #Caching #BigData #DataEngineering
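A rough sketch of the general idea of splitting a rolling-window query into reusable time segments, assuming fixed one-hour segments and a hypothetical `scan_fn` backend call. This is not Netflix's or Druid's implementation, and it ignores that the most recent segment may still be receiving data.

```python
from datetime import datetime, timedelta

BUCKET = timedelta(hours=1)  # assumed segment size


class SegmentCache:
    """Caches per-segment results so overlapping windows reuse earlier scans."""

    def __init__(self, scan_fn):
        self.scan_fn = scan_fn  # hypothetical backend scan: (start, end) -> number
        self.cache = {}         # segment start -> cached partial result
        self.scans = 0          # how many segments actually hit the backend

    def query(self, start, end):
        """Sum over [start, end); both bounds are assumed aligned to BUCKET."""
        total = 0
        t = start
        while t < end:
            if t not in self.cache:
                self.cache[t] = self.scan_fn(t, t + BUCKET)
                self.scans += 1
            total += self.cache[t]
            t += BUCKET
        return total


# Usage: a 3-hour rolling window that slides forward by one hour.
cache = SegmentCache(lambda start, end: 1)  # dummy backend: one event per segment
t0 = datetime(2024, 1, 1, 0)
first = cache.query(t0, t0 + 3 * BUCKET)            # scans three segments
second = cache.query(t0 + BUCKET, t0 + 4 * BUCKET)  # scans only the new segment
```

After the second query, `cache.scans` is 4 rather than 6, because two of the three segments were already cached.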
-
A few gotchas worth sharing from crawling techdocs: 1. add an explicit exclude list for translation subpaths (/de/), 2. filter out mailto: / javascript: hrefs, 3. techdocs often include PDFs: download and extract text :). More learnings and code snippets on my most recent blogpost. 🧵 #dataengineering
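A small sketch of the first two gotchas plus PDF detection, using only the standard library. The exclude list, function names, and example URLs are assumptions for illustration, not the code from the blog post; PDF download and text extraction are left out because they need a third-party library.

```python
from urllib.parse import urljoin, urlparse

# Translation subpaths to skip; "/de/" comes from the post, extend as needed.
EXCLUDED_PREFIXES = ("/de/",)


def keep_link(base_url, href):
    """Return the absolute URL to crawl, or None if the href should be skipped."""
    url = urljoin(base_url, href)
    parsed = urlparse(url)
    # Drops mailto:, javascript:, and any other non-web scheme.
    if parsed.scheme not in ("http", "https"):
        return None
    # Drops translated copies of the docs.
    if parsed.path.startswith(EXCLUDED_PREFIXES):
        return None
    return url


def is_pdf(url):
    """True when the URL path ends in .pdf (query strings are ignored)."""
    return urlparse(url).path.lower().endswith(".pdf")


base = "https://docs.example.com/guide/"  # hypothetical docs site
links = ["install.html", "/de/intro", "mailto:team@example.com", "javascript:void(0)"]
kept = [u for u in (keep_link(base, h) for h in links) if u is not None]
```

Here `kept` contains only the resolved `install.html` URL; the other three hrefs are filtered out.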
-
Fc, a lossless compressor for floating-point streams
https://github.com/xtellect/fc
#HackerNews #losslesscompression #floatingpoint #streams #dataengineering #softwaredevelopment
-
Project Manager for Data Science -- Arnaout Lab @UCSF
Arnaout Lab @UCSF
See the full job description on jobRxiv: https://jobrxiv.org/job/arnaout-lab-ucsf-27778-project-manager-for-data-science-arnaout-lab-ucsf/
#computerengineering #computerscience #computervision #dataengineering #healthdatascience #medicalimaging #ScienceJobs #hiring #research
-
Batch or stream: latency, volume, cost, complexity. Use cases: reports (batch), fraud detection (stream), hybrid. Advice: start with batch. #DataEngineering #Batch #Stream #Tech #Choix ... https://www.linkedin.com/posts/gabriel-chandesris_dataengineering-batch-stream-share-7459524088475185152-PJWL
-
5 data engineering skills for 2026: ETL/ELT pipelines, databases, cloud, automation, business understanding. Salaries: €45k-€90k. #DataEngineering #Tech #Compétences #Cloud #Automatisation ... https://www.linkedin.com/posts/gabriel-chandesris_dataengineering-tech-compaeztences-share-7459522795107004416-st8x
-
3 data pipeline mistakes: poor data quality, monolithic pipelines, lack of documentation. Fixes: built-in data cleaning, splitting pipelines into smaller parts, a README. #DataEngineering #Pipeline #Tech #BonnesPratiques #Erreurs ... https://www.linkedin.com/posts/gabriel-chandesris_dataengineering-pipeline-tech-share-7459521741233651712--cJR
-
5x faster writes in Postgres with full-page writes (FPW) disabled
https://www.databricks.com/blog/how-lakebase-architecture-delivers-5x-faster-postgres-writes
#HackerNews #Postgres #FPW #Performance #Database #Optimization #DataEngineering
-
How Olivia Chen Breaks Down the Modern Data Stack and Why the Architecture Conversation Matters [Ad] The modern data stack is one of the most discussed and frequently misunderstood topics in enterp...
#Business #Cloud-Computing #data #Data-Engineering #Data-Governance #Enterprise-Technology #sponsored #Technology
-
In this #InfoQ podcast, Somtochi Onyekwere breaks down:
• Recent developments in #DistributedDataSystems
• How to achieve fast, eventually consistent replication across distributed nodes
• Using #CRDTs (Conflict-free Replicated Data Types) to resolve data conflicts seamlessly
🎧 Listen here: https://bit.ly/49OiNaE
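For readers new to CRDTs, here is a minimal sketch of one of the simplest, a grow-only counter (G-Counter). It is a textbook example chosen for illustration and is not taken from the podcast; the class and node names are assumptions.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged with an element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> highest count seen for that node

    def increment(self, amount=1):
        # A node only ever bumps its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Taking the max per slot is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repeated merges.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


# Two replicas accept writes independently, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
```

Both replicas end up reporting the same total without any coordination between the writes.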