home.social

#reliabilityengineering — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #reliabilityengineering, aggregated by home.social.

  1. The Engineering Leadership Crisis Nobody Talks About 🚨 #EngineeringLeadership #SoftwareEngineering #PlatformEngineering #TechLeadership #Microservices #SRE

    Modern engineering teams are collapsing under platform complexity, AI chaos, organizational scaling failures, and unreliable architectures. This deep technical leadership guide explains how elite engineering leaders manage platform rewrites, reliability crises, organizational chaos, and large-scale modernization without destroying delivery velocity. #SoftwareArchitecture #EngineeringManagement #DevOps #CloudComputing #Leadership

    atozofsoftwareengineering.blog

  6. SRE is about sleeping well 🌙

    The goal is not midnight heroics.
    It is building systems that fail safely so humans can rest.

    webdad.eu/2026/05/14/%f0%9f%98

  7. SLIs, SLOs, and Error Budgets explained with pizza 🍕

    Reliability isn’t about perfection.
    It’s about delivering most pizzas on time — and having room to improve.

    webdad.eu/2026/03/19/%f0%9f%8d

  8. Global availability incident: Facebook.
    Meta confirmed service disruptions affecting account access, alongside widespread disruptions reported in Ads Manager and business APIs.
    Operational characteristics:
    • Sudden spike in user reports (~4:15 PM ET)
    • Global impact footprint
    • No immediate root cause transparency
    • Service restoration within ~2 hours
    Availability is a security pillar — and outages expose:
    - Centralization risk
    - Cascading dependency exposure
    - Business continuity gaps
    - API reliance vulnerabilities

    For security and reliability engineers:
    Are social platforms integrated into your risk register and DR modeling?

    Source: bleepingcomputer.com/news/tech

    Engage below.
    Follow @technadu for infrastructure resilience, cybersecurity, and outage intelligence.
    Repost to inform your network.

    #Infosec #ServiceAvailability #CloudRisk #Meta #FacebookOutage #BusinessContinuity #DigitalInfrastructure #ReliabilityEngineering #CyberResilience #PlatformRisk #ITOperations

  13. The 2024 CrowdStrike outage caused a worldwide Windows Blue Screen crash, impacting airlines, banks, and enterprises.
    This deep dive explains how DevOps & SRE teams mitigated impact, recovered systems, and prevented total failure.
    🔗 shorturl.at/VLqxz

    #CrowdStrikeOutage #DevOps #SRE #IncidentManagement #CyberResilience #CloudOps #PostMortem #ReliabilityEngineering #aws

  15. As I always say, when your teams tell you it’s “just a routine #upgrade”, be extra wary.

    In this case, a “routine upgrade” disabled emergency calling and contributed to two *deaths*.

    #networks #qa #reliability #reliabilityengineering #process

    theregister.com/2025/12/19/opt

  18. Most reliability incidents don’t start with a failure.
    They start with “we’ll fix it later.”

    A brittle deploy, a noisy alert, a manual process that “rarely runs.”
    Months pass, context fades, the system grows.

    Then something small breaks — and the deferred work becomes the incident.

    Reliability doesn’t fail all at once.
    It erodes quietly, then shows up loudly.

    #ReliabilityEngineering #TechDebt

  19. 🚀 AI is redefining Site Reliability Engineering (SRE)

    What started as ensuring web apps were fast and resilient has now entered a new era: AI Reliability Engineering. Inference workloads demand speed, trust, and observability far beyond traditional infrastructure challenges.

    RELIANOID’s role: SRE experts supporting organizations in building systems where AI can truly thrive.

    🔗 relianoid.com/blog/ai-reliabil

  20. Hey everyone!

    I just got a rejection from a long interview cycle. It's weighing on me very heavily right now, my migraine is making sure of it.

    Currently I am feeling emotionally drained and today I don't feel like I can keep doing this for much longer. Yet I must persist.

    If you know of anything having to do with Site Reliability, Observability, or Incident Management... please, please, send it my way. I will take anything at this point, junior roles, mid-level, I will do anything.

    Pretty soon our savings will be completely gone.

    #SRE #FediHire #ReliabilityEngineering #IncidentResponse #resilience

  25. Here's a new blog post from me! It's a small "book" review, which is actually a workbook with some essays at the beginning.

    The short book is Maj. John Schmitt's exercise book on Tactical Decision Games (TDGs) for the Marines (and likely other branches), and I noticed how much the philosophy behind them is shared with the sorts of Practice of Practice games we play to understand the system and prepare ourselves for incidents.

    sounding.com/2025/05/02/schmit

    #SRE #TDG #PracticeOfPractice #TabletopExercises #OperationalReadiness #IncidentResponse #TacticalDecisionGames #Resilience #ResilienceEngineering #ReliabilityEngineering

  30. My second report from #SREcon: Some of the same lessons -- and unsolved problems -- from supporting #machinelearning apps in production carry over to #generativeAI apps, but not all. Attendees discussed the similarities and important differences. #reliabilityengineering #ML #GenAI #LLM #AI #MLOps #LLMOps techtarget.com/searchitoperati

  35. Google SREs reveal how search handled record World Cup traffic spike: Google's Search Reliability team discusses challenges, strategies, and successes in maintaining service during peak events. ppc.land/google-sres-reveal-ho #GoogleSRE #WorldCup #SearchEngine #TechNews #ReliabilityEngineering
