#dataengineering — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #dataengineering, aggregated by home.social.
-
Regex vs. LLM for B2B document extraction. This week, I tried out both.
:blobcoffee: The rule-based pipeline with pytesseract + regex worked perfectly for Layout A. For Layout B? Every single field returned None.
:blobcoffee: Because "PO Number" and "Order Reference" are the same thing for a human. Not for a regex pattern.
:blobcoffee: The LLM-based approach (pytesseract + Ollama + LLaMA 3) extracted both layouts correctly, without touching a single rule. It even normalized the date format automatically.
:blobcoffee: But LLMs aren't always the right answer. If your documents are stable, speed matters at scale, or explainability is required, regex might still win.
Full comparison with code and trade-off breakdown on TDS: https://shorturl.at/v4gdl
#Python #DataScience #business #technology #dataengineering #LLM #Automation #OCR
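A minimal sketch of why the rule-based pipeline returns None for Layout B. The two OCR strings, the field names, and the patterns are hypothetical stand-ins, not the code from the linked article; the LLM side is left out because it needs a running Ollama instance.

```python
import re

# Hypothetical OCR output for two layouts of the same purchase order.
LAYOUT_A = "PO Number: 4500123\nDate: 2024-03-15"
LAYOUT_B = "Order Reference: 4500123\nOrdered on 15.03.2024"

# Rules written against Layout A only (illustrative patterns).
RULES_A = {
    "po_number": re.compile(r"PO Number:\s*(\d+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
}


def extract(text, rules):
    """Return the first capture group per field, or None when a rule misses."""
    result = {}
    for field, pattern in rules.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


print(extract(LAYOUT_A, RULES_A))  # both fields found
print(extract(LAYOUT_B, RULES_A))  # every field is None: the labels differ
```

Covering Layout B with rules means adding a second pattern per field and a date normalizer, which is the maintenance cost the post is pointing at.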
-
Join me in reading Designing Data-Intensive Applications (the new, 2nd Edition)! I want to host an async, virtual book club with the goal to read ~1 chapter a week :). The first edition is probably the most recommended data systems book I know, so it's time! More info + signups in the 🧵. #dataengineering
-
🔍 Spark + Elasticsearch Debugging 🧵
Building a cybersecurity analytics platform. Hit 2 blockers:
❌ JAR path mismatch → Fixed absolute path
❌ No data nodes (single-node Docker ES) → Added es.nodes.wan.only=true
✅ Result: 89 records loaded. Working pipeline!
Lesson: Verify JAR paths + disable node discovery for single-node ES.
#PySpark #Elasticsearch #DataEngineering #CyberSecurity #Debugging
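A rough sketch of the two fixes, assuming the elasticsearch-hadoop connector. The helper names, host, port, and index are made up for illustration; only the `es.*` keys are connector settings.

```python
import os


def es_write_options(host, port, index):
    """Build connector options for a single-node cluster (hypothetical helper)."""
    return {
        "es.nodes": host,
        "es.port": str(port),
        # Talk only to the declared node instead of discovering data nodes,
        # which tends to fail when Docker reports internal addresses.
        "es.nodes.wan.only": "true",
        "es.resource": index,
    }


def check_jar(path):
    """Fail early if the connector JAR path is relative or missing."""
    if not os.path.isabs(path):
        raise ValueError(f"JAR path must be absolute: {path}")
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    return path


options = es_write_options("localhost", 9200, "security-events")
print(options)
```

In a PySpark job these options would typically be handed to the DataFrame writer with `.options(**options)`, and the checked JAR path to the `spark.jars` setting; both are assumptions about a setup the post does not show.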
-
How #Netflix boosted #ApacheDruid performance: by implementing interval-aware caching, they now serve 84% of analytics results from cache and have reduced query load by 33%.
The secret? Decomposing rolling window queries into reusable time segments.
✅ Reduces scan volume
✅ Improves P90 latency
✅ Optimizes real-time analytics
Details on #InfoQ: https://bit.ly/4uHG4DE
#SoftwareArchitecture #DistributedSystems #DataAnalytics #TimeSeriesData #Caching #BigData #DataEngineering
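A toy illustration of decomposing a rolling window into reusable time segments. This is not Netflix's implementation: it assumes hour-aligned windows and an additive aggregate such as a count or sum, and `SegmentCache` and `hourly_segments` are hypothetical names.

```python
from datetime import datetime, timedelta


def hourly_segments(start, end):
    """Split [start, end) into one-hour segments; both ends must be on the hour."""
    segments = []
    cursor = start
    while cursor < end:
        segments.append((cursor, cursor + timedelta(hours=1)))
        cursor += timedelta(hours=1)
    return segments


class SegmentCache:
    """Cache partial results per segment so overlapping windows reuse them."""

    def __init__(self, compute):
        self._compute = compute  # function(start, end) -> partial aggregate
        self._store = {}
        self.hits = 0
        self.misses = 0

    def query(self, start, end):
        total = 0
        for segment in hourly_segments(start, end):
            if segment in self._store:
                self.hits += 1
            else:
                self.misses += 1
                self._store[segment] = self._compute(*segment)
            total += self._store[segment]
        return total


cache = SegmentCache(lambda start, end: 10)  # stand-in for a real scan
t0 = datetime(2024, 1, 1, 0)
cache.query(t0, t0 + timedelta(hours=3))
cache.query(t0 + timedelta(hours=1), t0 + timedelta(hours=4))
print(cache.hits, cache.misses)
```

When the window slides forward by one hour, only the newest segment has to be scanned; the two overlapping segments come from the cache.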
-
Project Manager for Data Science -- Arnaout Lab @UCSF
Arnaout Lab @UCSF
See the full job description on jobRxiv: https://jobrxiv.org/job/arnaout-lab-ucsf-27778-project-manager-for-data-science-arnaout-lab-ucsf/
#computerengineering #computerscience #computervision #dataengineering #healthdatascience #medicalimaging #ScienceJobs #hiring #research
https://jobrxiv.org/job/arnaout-lab-ucsf-27778-project-manager-for-data-science-arnaout-lab-ucsf/?fsp_sid=12041
-
3 data pipeline mistakes: poor data quality, monolithic pipelines, missing documentation. Fixes: built-in cleaning, splitting pipelines into smaller stages, a README. #DataEngineering #Pipeline #Tech #BonnesPratiques #Erreurs ... https://www.linkedin.com/posts/gabriel-chandesris_dataengineering-pipeline-tech-share-7459521741233651712--cJR
-
Generative AI nearly caused a perfect candidate to be missed: the CV was rejected because of missing keywords. Fix: manual review and adapted criteria. #Recrutement #IA #RH #Erreur #DataEngineering ... https://www.linkedin.com/posts/gabriel-chandesris_recrutement-ia-rh-share-7459496571114598400-gsAv
-
How Olivia Chen Breaks Down the Modern Data Stack and Why the Architecture Conversation Matters [Ad] The modern data stack is one of the most discussed and frequently misunderstood topics in enterp...
#Business #Cloud-Computing #data #Data-Engineering #Data-Governance #Enterprise-Technology #sponsored #Technology
-
In this #InfoQ podcast, Somtochi Onyekwere breaks down:
• Recent developments in #DistributedDataSystems
• How to achieve fast, eventually consistent replication across distributed nodes
• Using #CRDTs (Conflict-free Replicated Data Types) to resolve data conflicts seamlessly
🎧 Listen here: https://bit.ly/49OiNaE
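For readers new to CRDTs, here is a small sketch of one of the simplest, a grow-only counter (G-Counter). It is a generic textbook example, not something taken from the podcast, and the node names are invented.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        # Each node only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Max is commutative, associative, and idempotent, so replicas
        # converge regardless of merge order or repeated delivery.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


node_a, node_b = GCounter("a"), GCounter("b")
node_a.increment(2)
node_b.increment(3)
node_a.merge(node_b)
node_b.merge(node_a)
print(node_a.value(), node_b.value())
```

Both replicas end up with the same total without coordinating, which is the eventually consistent behaviour the episode description refers to.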
-
Hiring a Data Engineer in 1 week: a precise profile, targeted channels, a fast-tracked process, and selling the project. Result: 120 CVs → 1 hire. #Recrutement #DataEngineering #RH #Tech #Urgence ... https://www.linkedin.com/posts/gabriel-chandesris_recrutement-dataengineering-rh-share-7457371387930902528-e6ZA
-
Data vs. Big Data: plain Data wins for 90% of projects, Big Data for scale and real time. #DataEngineering #BigData #Tech #Humour #Optimisation ... https://www.linkedin.com/posts/gabriel-chandesris_dataengineering-bigdata-tech-share-7456654084331237377-IHfH
-
The perfect data project doesn't exist: ship an MVP in 3 months with built-in cleaning, then iterate. Example: a useless data lake vs. a SQL report. #DataEngineering #Tech #Humeur #MVP #Agile ... https://www.linkedin.com/posts/gabriel-chandesris_dataengineering-tech-humeur-share-7455898511407198208-tIMu
-
Missing [Survey, etc] Data Can Be A Geographic Phenomenon
--
https://doi.org/10.1080/24694452.2026.2640220 <-- shared paper
--
#GIS #mapping #spatial #DataScience #missing #data #spatial #AAG #autocorrelation #geographicallyweightedregression #GWR #imputation #missingdata #survey #surveynonresponse #incomplete #surveyquestions #ethnicity #income #spatialdata #alldataisspatial #UK #FinancialLives #geography #spatialanalysis #geostatistics #location #imputing #statistics #dataset #DataImputation #MissingData #DataCleaning #DataPreprocessing #DataWrangling #DataQuality #DataEngineering #FinancialData #FinancialAnalytics #FinincialModeling #FinDataScience
-
🎉 Milestone Unlocked: Finished the Data Engineering Zoomcamp!
In 10 weeks, I moved from scripting to architecting systems. We built real production-grade infrastructure using Spark, Kafka, Airflow, and Kestra—not just hobby projects.
Capstone: A Storage Hard Drive Dashboard using real failure data from Backblaze
Stack: Terraform + Docker infra, Airflow orchestration, dbt modeling, Streamlit viz.
Key Lessons:
✅️ "It works on my laptop" isn't a strategy.
✅ Need IaC, partitioning, clustering, and strict error handling.
✅ dbt ensures reproducible, tested models.
✅ Infra is invisible work: if it breaks, your code fails.
Take the leap! It’s challenging, but by week 10 the pieces click into place. Seeing my pipeline run autonomously felt like crossing the finish line. 🏁
Thanks, DataTalks.Club team! On to the next challenge!
My project: https://github.com/ammartin8/hard_drive_analytics_dashboard
#mastodon #fediverse #data #spark #dataengineering #ai #technology #datatools #datapipelines #fedihire #thursday #sql #observability #etl #python #github
-
Documenting data pipelines: schema, README.md, code examples. Tools: dbt docs, Markdown. #DataEngineering #Documentation #Tech #dbt #SQL ... https://www.linkedin.com/posts/gabriel-chandesris_dataengineering-documentation-tech-share-7454088629414604800-OjBC
-
Data Mesh: for companies with more than 500 employees and autonomous teams. Pitfall: complexity. Example: delivery time cut in half. #DataEngineering #DataMesh #Scalabilité #Tech #Architecture ... https://www.linkedin.com/posts/gabriel-chandesris_dataengineering-datamesh-scalabilitaez-share-7454087469127536640-s5cK
-
End-to-End Storage Drive Analytics Platform Complete! 🚀
Spent the past weeks on my Data Engineering Zoomcamp final project. Excited to share an end-to-end platform analyzing Backblaze hard drive data to bridge enterprise telemetry and consumer accessibility.
The pipeline ingests daily SMART snapshots into GCS, builds a star schema with dbt, and serves insights via a Streamlit dashboard showing failure rates by brand and model. Infrastructure is managed with Terraform; the warehouse was optimized using partitioning to improve query performance.
To increase accessibility, I'm switching to open-source/free-ish tools so anyone can dive in without a cloud signup (plus, my trial expired 🙈). My goal is to provide drive reliability data so creators, homelabbers, businesses, and casual users feel informed about their next storage purchase. 😊
Check out the repo for details: https://github.com/ammartin8/hard_drive_analytics_dashboard
#DataEngineering #harddrive #opensource #cloud #streamlit #buildinpublic #selfhosting #mastodon #python #fediverse #backblaze
-
Most GNSS survey failures aren’t equipment—they’re poor site selection.
I’m experimenting with using low-cost GNSS receivers (L76K vs u-blox MAX-M10S) to pre-qualify sites before running NOAA OPUS.
~30–90 min logs → PostgreSQL → HDOP, CN0, wander analysis.
Early results show tradeoffs between signal strength and positional stability.
https://salemdata.net/johnpress/?p=628
Anyone exploring similar GNSS site validation?
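One possible reading of the "wander analysis" step, sketched under loud assumptions: the post does not define wander, so this treats it as the RMS horizontal distance of each logged fix from the mean position of a static receiver. The function name and input shape are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, adequate for local offsets


def horizontal_wander(fixes):
    """RMS horizontal distance in metres of each fix from the mean position.

    fixes: list of (latitude_deg, longitude_deg) tuples from a static receiver.
    Uses a local flat-earth approximation, reasonable for metre-scale scatter.
    """
    mean_lat = sum(lat for lat, _ in fixes) / len(fixes)
    mean_lon = sum(lon for _, lon in fixes) / len(fixes)
    cos_lat = math.cos(math.radians(mean_lat))
    squared = []
    for lat, lon in fixes:
        north = math.radians(lat - mean_lat) * EARTH_RADIUS_M
        east = math.radians(lon - mean_lon) * EARTH_RADIUS_M * cos_lat
        squared.append(north ** 2 + east ** 2)
    return math.sqrt(sum(squared) / len(squared))


# Two fixes 0.00001 degrees apart in latitude (illustrative values).
print(horizontal_wander([(44.90000, -123.0), (44.90001, -123.0)]))
```

Comparing this figure per receiver and per site alongside HDOP and CN0 would give one number for positional stability next to the signal-strength metrics.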
#GNSS #GPS #Geodesy #ESP32 #PostgreSQL #OpenSource #DataEngineering #Meshtastic
-
Project Manager for Data Science -- Arnaout Lab @UCSF
Arnaout Lab @UCSF
See the full job description on jobRxiv: https://jobrxiv.org/job/arnaout-lab-ucsf-27778-project-manager-for-data-science-arnaout-lab-ucsf/
#computerengineering #computerscience #computervision #dataengineering #healthdatascience #medicalimaging #Projectmanagement #ScienceJobs #hiring #research
https://jobrxiv.org/job/arnaout-lab-ucsf-27778-project-manager-for-data-science-arnaout-lab-ucsf/?fsp_sid=10841
-
The two most beautiful languages I know, united in one!
No room for prejudice.
#BruteForceEngineering #DevAgainstTheMachine #DataScience #DataEngineering #HiringGap #Python #Grit #gritlab #AnarchyInTheShell #Builder
-
Kafka in-cluster replication won't save you from a regional outage. You need a battle-tested Multi-Region strategy.
Read this to master:
- RPO/RTO trade-offs for global scale
- Active-Active vs. Stretched Clusters (3-DC & 2.5-DC)
- Solving the offset translation nightmare
- Real-world failover testing
https://softwaremill.com/guide-to-apache-kafka-disaster-recovery-and-multi-region-architectures/
#ApacheKafka #SystemArchitecture #DisasterRecovery #DataEngineering
-
It's alive!
Fresh news from the Data Crier!
#BruteForceEngineering.
#DataScience #DataEngineering #HiringGap #Python #Grit #gritlab #AnarchyInTheShell
-
If you knew today exactly where the #ITMarket will be starving for talent tomorrow...
How would your next salary negotiation look?
Forget the Master’s degree. Forget the Bachelor’s. If you have massive #Grit in your system, you are the solution the market is screaming for, even if the corporate recruiting filters haven't realized it yet.
This is autonomy engineered in a shell.
This is #BruteForceEngineering.
#DataScience #DataEngineering #HiringGap #Python #Grit #gritlab #AnarchyInTheShell
-
The Scraper-Base-Kit is now Private.
After reviewing the velocity of our global ingest, I’ve decided to move the base-kit to the Private Core.
The engine is too fast, the footprint too small.
This thing is like a shotgun that fires once at the world!!!!
We are not just building tools; we are guarding the data.
#ArchLinux #RootCause #IT #hiringgap #analsys #GitHubActions #AnarchyInTheShell #ContinuumHQ #Dataengineering #DataScience #agenticworkflow #SRE #Grit #Gritlab #buildInPublic
-
0:07! It's midnight!
The Machine is silent, but the Pipeline is Green. 🟢
What started as a day of debugging ended as a fully automated Intelligence. The Elégence Report Engine is live. 217 countries, one Grit-Score, zero human error.
T480 closing now. Tomorrow, the world gets the data it deserves.
#ArchLinux #RootCause #IT #hiringgap #analsys #GitHubActions #AnarchyInTheShell #ContinuumHQ #Dataengineering #DataScience #agenticworkflow #SRE #Grit #Gritlab #buildInPublic
-
System says exit 1. I say: Challenge accepted.
Debugging the engine.
No corporate cloud fluff, just raw logs, Arch Linux, and a broken pipeline that’s about to get crushed.
Don't solve theoretical problems. Break real ones. ✊🔥
#ArchLinux #RootCause #LUKS #GitHubActions #AnarchyInTheShell #ContinuumHQ #Dataengineering #DataScience #agenticworkflow #SRE #Grit #Gritlab
-
Every data professional should understand these seven core concepts.
From data warehouses and lakes to pipelines, meshes, and governance, these form the foundation of modern analytics infrastructure.
Mastering them bridges the gap between raw data and actionable business insights.
📕 https://ebokify.com/ai-data-science
#DataEngineering #DataScience #DataAnalytics #ETL #DataWarehouse #BigData #BusinessIntelligence #DataPipeline #DataGovernance
-
Serious about SQL?
Start with structure, not random tutorials.
I’ve created 65 structured SQL lessons covering:
• Fundamentals
• Joins & Advanced Queries
• CTEs & Window Functions
• Query Optimization
• BI Concepts
• Interview Preparation
Each lesson builds on the previous one.
Clear. Practical. Designed for data analysts.
📕 https://ebokify.com/data-analysis
#SQL #DataAnalytics #DataAnalyst #DataEngineering #BusinessIntelligence #AnalyticsCareers #TechEducation #DataScience
-
Financial Modeling Series: #1 How to Build a Finance ML Dataset — Python Solution
This post covers: clean prices, feature windows, forward labels, and sanity checks you can run before training any model.
#Finance #MachineLearning #Python #TimeSeries #Quant #market #ai #dataEngineering #trading
@ai @markets @programming @theartificialintelligence @towardsdatascience @pythonclcoding @MastodonEngineering @medium
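A compact sketch of the feature-window and forward-label idea with a basic sanity check. It is not the code from the linked post: the function names, the single trailing-return feature, and the default window and horizon are assumptions chosen to keep the example small.

```python
def build_dataset(prices, window=3, horizon=1):
    """Pair a trailing-return feature with a forward-return label per row.

    prices: chronologically ordered closing prices, already cleaned.
    The feature at index t uses prices[t - window] and prices[t]; the label
    uses prices[t] and prices[t + horizon], so no feature looks past time t.
    """
    rows = []
    for t in range(window, len(prices) - horizon):
        feature = prices[t] / prices[t - window] - 1.0
        label = prices[t + horizon] / prices[t] - 1.0
        rows.append({"t": t, "trailing_return": feature, "forward_return": label})
    return rows


def sanity_check(prices, rows, window=3, horizon=1):
    """Cheap checks to run before training any model."""
    assert all(p > 0 for p in prices), "non-positive price"
    assert len(rows) == max(0, len(prices) - window - horizon), "unexpected row count"
    assert all(r["t"] + horizon < len(prices) for r in rows), "label runs past the data"


prices = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0]  # made-up series
rows = build_dataset(prices)
sanity_check(prices, rows)
print(rows)
```

The last `horizon` prices never get a row because their label would need data from the future, which is the kind of leak these checks are meant to catch.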
-
Shifting Left delivers clean, reliable, and accessible data to everyone who needs it - right when they need it.
The result? Less complexity, lower overhead, and far less break-fix work, freeing teams to focus on higher-value problems.
At the core of a #ShiftLeft strategy are Data Products. They form the backbone of healthy data communication and ensure quality is built in - not patched on later.
📖 Great insights from this #InfoQ article on rethinking the Medallion Architecture: https://bit.ly/3WHjxsf
#SoftwareArchitecture #DataMesh #DataEngineering #DataLake #DataPipelines
-
#throwback From data swamp to data lakehouse 🏗️ Josef Machytka shares real-world lessons on building a lakehouse with PostgreSQL, BigQuery, and GCS—covering formats, scaling, governance, and data quality. Keep your data clean and useful. ▶️ Watch now! https://www.youtube.com/watch?v=AUdEjYnXGbI&list=PL_m-TUcr7ZvnSBmPoxZvcB1lfy7C9eced&index=4
-
💡 Apache Airflow 2025 Recap
2026 has arrived, which makes this a good moment to review the changes the last year brought to the ever-evolving landscape of open source data tools.
In our new #blog post, we look at Apache Airflow and how the leading open source orchestration platform changed over the last year with the big v3 major release.
#apacheairflow #airflow #opensource #dataengineering #datascience