home.social

#apachespark — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #apachespark, aggregated by home.social.

  1. Treating SparkContext as a control tower shifts how you think about Spark: not just as an API, but as the coordinator for your entire distributed engine.

    Read More: zalt.me/blog/2026/05/sparkcont

    #ApacheSpark #SparkContext #distributed #systems
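
    A toy model of that control-tower framing (nothing here is Spark's real API; `ToyContext`, its `executors` list, and `run_job` are all invented for illustration):

```python
# Toy "control tower": like a SparkContext, it is the single driver-side object
# that knows every registered worker and dispatches partitioned work to them.
class ToyContext:
    def __init__(self, executors):
        self.executors = executors  # registered workers (just names here)

    def run_job(self, data, func):
        # Split the input into one partition per executor, then "run" each
        # partition and collect the results (serially here; Spark would ship
        # each partition to a remote executor and gather results back).
        n = len(self.executors)
        partitions = [data[i::n] for i in range(n)]
        return [func(p) for p in partitions]

ctx = ToyContext(executors=["exec-1", "exec-2"])
print(ctx.run_job(list(range(10)), sum))  # [20, 25]
```

    The point of the analogy: partitioning, scheduling, and result collection all flow through one coordinator object, which is why the post calls the SparkContext a control tower rather than just an API handle.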

  2. 96% fewer out-of-memory (OOM) failures!

    #Pinterest shared how it improved the reliability of its #ApacheSpark workloads.

    By focusing on:
    ✅ Enhanced observability
    ✅ Configuration tuning
    ✅ Automatic memory retries

    The changes addressed persistent job failures affecting recommendation systems and large-scale data processing.

    Details here ⇨ bit.ly/4smqrQD

    #SoftwareArchitecture #BigData #CostOptimization #Memory #DistributedSystems #Observability #InfoQ
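
    The kind of knobs such configuration tuning usually touches can be sketched as a plain config map; the property names below are real Spark settings, but the values are illustrative examples, not Pinterest's:

```python
# Illustrative Spark memory settings (real property names, example values only).
oom_mitigations = {
    "spark.executor.memory": "8g",          # executor JVM heap
    "spark.executor.memoryOverhead": "2g",  # off-heap headroom; too little is a classic OOM cause
    "spark.memory.fraction": "0.6",         # share of heap for execution + storage
    "spark.task.maxFailures": "4",          # task-level retries before the job fails
}

def memory_overhead_ratio(conf):
    """Overhead as a fraction of heap: a quick sanity check when chasing OOMs."""
    gb = lambda v: float(v.rstrip("g"))
    return gb(conf["spark.executor.memoryOverhead"]) / gb(conf["spark.executor.memory"])

print(memory_overhead_ratio(oom_mitigations))  # 0.25
```

    A ratio like this is the sort of signal enhanced observability surfaces: executors that OOM despite a large heap often just need more off-heap overhead, not more heap.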

  3. Bellevue / Seattle area friends: I’m super stoked for next week’s Spark Community Sprint (Friday, Mar 13th: spooky 👻).

    If you’ve ever wanted to contribute to Apache Spark, come hang out and get your first Spark PR started with Felix Cheung, Huaxin Gao, Devin Petersohn, and myself :)

    We’ll help folks find starter issues, get their dev environments set up, and walk through the contribution process.

    There will be free lunch, and if enough people show up… maybe even Taco Bell for an afternoon snack*.

    #ApacheSpark #OSS #hackathon #freelunch #tacofridaymaaaaybe

    luma.com/rrfvx0ey

    (* Depends on attendance)

  4. #Pinterest launched a next-gen CDC-based ingestion framework.

    Using #ApacheKafka, #ApacheFlink, #ApacheSpark & #ApacheIceberg, they achieved:
    • Latency cut from 24+ hours to 15 minutes
    • Processing of only changed records
    • Support for incremental updates & deletions
    • Petabyte-scale data across 1,000+ pipelines

    Win: optimized cost & efficiency!

    Read the architectural deep dive on InfoQ 👉 bit.ly/4rMJB2H

    #SoftwareArchitecture #ChangeDataCapture
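
    The "processing of only changed records" point boils down to apply-semantics like these. A minimal sketch in plain Python, not Pinterest's framework; the record shape (`op`, `id`, `row`) is invented for illustration:

```python
# Apply a batch of CDC change records to a keyed table: only the keys that
# actually changed are touched, and deletes are first-class operations.
def apply_cdc(table, changes):
    for rec in changes:
        if rec["op"] in ("insert", "update"):
            table[rec["id"]] = rec["row"]   # upsert the changed row
        elif rec["op"] == "delete":
            table.pop(rec["id"], None)      # incremental deletion
    return table

table = {1: "a", 2: "b"}
changes = [
    {"op": "update", "id": 1, "row": "a2"},
    {"op": "delete", "id": 2},
    {"op": "insert", "id": 3, "row": "c"},
]
print(apply_cdc(table, changes))  # {1: 'a2', 3: 'c'}
```

    At Pinterest's scale the same idea runs through Kafka/Flink/Spark into Iceberg tables, but the latency win comes from exactly this: touching three records instead of rescanning the whole table.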

  5. #CaseStudy - Agoda consolidated multiple independent data pipelines into a central #ApacheSpark platform, eliminating financial data inconsistencies.

    A multi-layered quality framework - with automated checks, ML anomaly detection, and data contracts - ensures accurate financial metrics while handling millions of daily bookings.

    Deep dive into the architecture here ⇨ bit.ly/4a109NP

    #InfoQ #SoftwareArchitecture #AI #DataPipelines

  6. Discover how Decathlon, one of the world’s leading sports retailers, adopted the #opensource library #Polars to optimize its data workflows.

    By migrating from Apache Spark to Polars for small input datasets, Decathlon achieved:
    • Significant speedups
    • Meaningful cost savings

    👉 Learn more: bit.ly/4qmb2zc

    #InfoQ #AI #ApacheSpark
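
    One way to see why a single-node engine can beat Spark on small inputs is a back-of-envelope cost model: fixed orchestration overhead plus per-row work divided by parallelism. All numbers below are hypothetical, not Decathlon's measurements:

```python
# Toy cost model: total job time = fixed startup/orchestration overhead
# + (rows * per-row cost) / parallelism. On small inputs the fixed overhead
# of a distributed cluster dominates, so a single-node engine wins.
def job_time(rows, per_row_us, workers, startup_s):
    return startup_s + rows * per_row_us / 1e6 / workers

small = 1_000_000  # a "small" dataset by Spark standards

spark_t = job_time(small, per_row_us=2.0, workers=32, startup_s=30.0)  # hypothetical cluster overhead
polars_t = job_time(small, per_row_us=2.0, workers=8, startup_s=0.5)   # hypothetical single-node cost

print(spark_t, polars_t)  # 30.0625 0.75
```

    The crossover flips as `rows` grows large enough that per-row work swamps the startup term, which is why the migration targeted small input datasets specifically.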

  7. #CaseStudy - #Lyft rearchitected its ML platform, LyftLearn, into a hybrid system!

    Offline workloads now run on AWS SageMaker, while Kubernetes continues to power online model serving.

    The result❓ Read #InfoQ and find out 👉 bit.ly/3Y3hTBG

    #SoftwareArchitecture #AI #ML #ApacheSpark #Kubernetes

  8. Today is DBA Appreciation Day!

    Bring your DBAs a cake and a coffee, please. And don't drop any tables in production, pretty please. It's the weekend ...

    #PostgreSQL #SQLServer #Oracle #DB2 #MySQL #MariaDB #Snowflake #SQLite #Neo4j #Teradata #SAPHana #Aerospike #ApacheSpark #Clickhouse #Informix #WarehousePG #Greenplum #Adabas

  9. 🎃The October issue of #CheckpointChronicle is now out 🌟

    It covers Ververica's Fluss, #ApacheFlink 2.0, Iggy.rs, Strimzi's support for #ApacheKafka 4.0, tons of OTF material from @vanlightly, Christian Hollinger's write-up of ngrok's data platform, a nice detailed look at how SmartNews uses #ApacheIceberg with Flink and #ApacheSpark, a good write-up from Sudhendu Pandey on #ApachePolaris, notes from Kir Titievsky on Kafka's Avro serialisers, and much more!

    dcbl.link/cc-oct242

  10. anybody know if it is ok to run #apachespark and #apachehive on the same box? I have 969 #java processes on this #centos box, which seems like a lot, but not sure if it is actually a problem.

    Something is certainly a problem.

    #bigdata

  11. Claus Stadler is presenting their work behind SANSA: 'Scaling RML and SPARQL-based Knowledge Graph Construction with Apache Spark' now at the Knowledge Graph Construction Workshop!

    @eswc_conf @aksw

  12. “A really big deal”: Dolly is a free, open source, ChatGPT-style AI model

    On Wednesday, Databricks released... - arstechnica.com/?p=1931693

    #largelanguagemodels #machinelearning #textsynthesis #apachespark #databricks #eleutherai #finetuning #biz #pythia #dolly #llama #meta #ai

  13. Cloudera Data Platform One bundles all the tools required for data analysis and exploration as software-as-a-service, built on the lakehouse architecture.

    Data Science: Cloudera launches an all-in-one data service in the cloud

  14. Big Data Tools 1.6, a plug-in for accessing Zeppelin notebooks, now also supports monitoring Apache Flink and integrates the Hive Metastore.

    JetBrains' Big Data Tools 1.6 keeps an eye on Flink jobs

  15. The new release of the data science software SystemDS introduces a federated backend for multi-tenancy and completes the update to Java 11 and Spark 3.

    Data Science: Apache SystemDS 3.0 gains a backend for multi-tenancy

  16. Google and OpenMined are making the benefits of differential privacy available to the Python developer community as open source.

    PipelineDP: a differential privacy framework for the Python universe

  17. The extension for accessing Zeppelin notebooks and monitoring Spark and Hadoop applications is now available in version 1.0.

    Big Data Tools: JetBrains' plug-in for Apache Zeppelin leaves the preview phase

  18. The tools for ETL processes, data pipeline orchestration, automation, and monitoring are integrated into the Cloudera Data Platform as a Spark service.

    Cloudera launches a cloud-native service for data engineering

  19. The major release of the big data engine brings many improvements, along with new approaches that promise higher performance and broader compatibility.

    Apache Spark 3.0 delivers extended SQL functions and an update to the Python API

    #ApacheSpark #BigData #DataStreaming #Databricks