#reliabilityengineering — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #reliabilityengineering, aggregated by home.social.
-
The Engineering Leadership Crisis Nobody Talks About 🚨 #EngineeringLeadership #SoftwareEngineering #PlatformEngineering #TechLeadership #Microservices #SRE
Modern engineering teams are collapsing under platform complexity, AI chaos, organizational scaling failures, and unreliable architectures. This deep technical leadership guide explains how elite engineering leaders manage platform rewrites, reliability crises, organizational chaos, and large-scale modernization without destroying delivery velocity. #SoftwareArchitecture #EngineeringManagement #DevOps #CloudComputing #Leadership
-
SRE is about sleeping well 🌙
The goal is not midnight heroics.
It is building systems that fail safely so humans can rest.
#SRE #ReliabilityEngineering #OnCall
https://webdad.eu/2026/05/14/%f0%9f%98%b4-sre-is-about-sleeping-well/
-
SLIs, SLOs, and Error Budgets explained with pizza 🍕
Reliability isn’t about perfection.
It’s about delivering most pizzas on time, and having room to improve.
#SRE #DevOps #ReliabilityEngineering
https://webdad.eu/2026/03/19/%f0%9f%8d%95-slis-slos-and-error-budgets-explained-with-pizza-delivery/
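To make the pizza analogy concrete, here is a minimal sketch of the arithmetic behind an SLI, an SLO, and an error budget. It is not taken from the linked post: the function names, the 95% on-time target, and the delivery counts are all assumptions chosen for illustration.

```python
def sli(on_time: int, total: int) -> float:
    """Service level indicator: fraction of deliveries that arrived on time."""
    return on_time / total


def error_budget(slo_target: float, total: int) -> int:
    """Number of late deliveries the SLO tolerates in the measurement window."""
    return round((1 - slo_target) * total)


def budget_remaining(slo_target: float, total: int, late: int) -> int:
    """Late deliveries still allowed before the SLO is missed (negative = SLO missed)."""
    return error_budget(slo_target, total) - late


# Hypothetical month: 1000 deliveries, 970 on time, SLO of 95% on time.
total, on_time = 1000, 970
late = total - on_time

print(f"SLI: {sli(on_time, total):.1%}")
print(f"Error budget: {error_budget(0.95, total)} late deliveries")
print(f"Budget remaining: {budget_remaining(0.95, total, late)}")
```

With these assumed numbers the shop is inside its objective and still has budget left, which is the "room to improve" the post refers to.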
-
Spacecraft controls don’t get a second chance — every input must be reliable, clean, and redundant. 🚀🔘
Learn more: https://zurl.co/zmWFi
#Smidmart #SpaceTech #AerospaceComponents #SealedSwitches #LowOutgassing #Spacecraft #ReliabilityEngineering #Redundancy #Avionics
-
Global availability incident: Facebook.
Meta confirmed service disruptions affecting account access, with significant disruptions also reported in Ad Manager and business APIs.
Operational characteristics:
• Sudden spike in user reports (~4:15 PM ET)
• Global impact footprint
• No immediate root cause transparency
• Service restoration within ~2 hours
Availability is a security pillar, and outages expose:
- Centralization risk
- Cascading dependency exposure
- Business continuity gaps
- API reliance vulnerabilities
For security and reliability engineers:
Are social platforms integrated into your risk register and DR modeling?
Engage below.
Follow @technadu for infrastructure resilience, cybersecurity, and outage intelligence.
Repost to inform your network.
#Infosec #ServiceAvailability #CloudRisk #Meta #FacebookOutage #BusinessContinuity #DigitalInfrastructure #ReliabilityEngineering #CyberResilience #PlatformRisk #ITOperations
-
The 2024 CrowdStrike outage caused a worldwide Windows Blue Screen crash, impacting airlines, banks, and enterprises.
This deep dive explains how DevOps & SRE teams mitigated impact, recovered systems, and prevented total failure.
🔗 https://shorturl.at/VLqxz
#CrowdStrikeOutage #DevOps #SRE #IncidentManagement #CyberResilience #CloudOps #PostMortem #ReliabilityEngineering #aws
-
As I always say, when your teams tell you it’s “just a routine #upgrade”, be extra wary.
In this case, a “routine upgrade” disabled emergency calling and contributed to two *deaths*.
#networks #qa #reliability #reliabilityengineering #process
https://www.theregister.com/2025/12/19/optus_emergency_outages_cause_report
-
Most reliability incidents don’t start with a failure.
They start with “we’ll fix it later.”
A brittle deploy, a noisy alert, a manual process that “rarely runs.”
Months pass, context fades, the system grows.
Then something small breaks, and the deferred work becomes the incident.
Reliability doesn’t fail all at once.
It erodes quietly, then shows up loudly.
-
🚀 AI is redefining Site Reliability Engineering (SRE)
What started as ensuring web apps were fast and resilient has now entered a new era: AI Reliability Engineering. Inference workloads demand speed, trust, and observability far beyond traditional infrastructure challenges.
RELIANOID’s role as SRE experts is to support organizations in building systems where AI can truly thrive.
🔗 https://www.relianoid.com/blog/ai-reliability-engineering-the-new-era-of-sre/
#AI #SRE #DevOps #ReliabilityEngineering #Observability #RELIANOID
-
Thanks for all the boosts y'all!!
Someone pointed out I should mention: I am in Southern California, fully remote.
#SRE #FediHire #ReliabilityEngineering #IncidentResponse #resilience
-
Four applications out today, thank you for the suggestions and links!
#SRE #FediHire #ReliabilityEngineering #IncidentResponse #resilience
-
Hey everyone!
I just got a rejection from a long interview cycle. It's weighing on me very heavy right now, my migraine is making sure of it.
Currently I am feeling emotionally drained and today I don't feel like I can keep doing this for much longer. Yet I must persist.
If you know of anything having to do with Site Reliability, Observability, or Incident Management... please, please, send it my way. I will take anything at this point, junior roles, mid-level, I will do anything.
Pretty soon our savings will be completely gone.
#SRE #FediHire #ReliabilityEngineering #IncidentResponse #resilience
-
Here's a new blog post from me! It's a small "book" review, which is actually a workbook with some essays at the beginning.
The short book is Maj. John Schmitt's exercise book on Tactical Decision Games (TDGs) for the Marines (and likely other branches), and I noticed how much the philosophy behind them is shared with the sorts of Practice of Practice games we play to understand the system and prepare ourselves for incidents.
https://www.sounding.com/2025/05/02/schmitt-tdg-sre/
#SRE #TDG #PracticeOfPractice #TabletopExercises #OperationalReadiness #IncidentResponse #TacticalDecisionGames #Resilience #ResilienceEngineering #ReliabilityEngineering
-
My second report from #SREcon: Some of the same lessons -- and unsolved problems -- from supporting #machinelearning apps in production carry over to #generativeAI apps, but not all. Attendees discussed the similarities and important differences. #reliabilityengineering #ML #GenAI #LLM #AI #MLOps #LLMOps https://www.techtarget.com/searchitoperations/news/366621071/Site-Reliability-Engineers-weigh-MLOps-vs-LLMOps
-
Google SREs reveal how search handled record World Cup traffic spike: Google's Search Reliability team discusses challenges, strategies, and successes in maintaining service during peak events. https://ppc.land/google-sres-reveal-how-search-handled-record-world-cup-traffic-spike-2/?utm_source=dlvr.it&utm_medium=mastodon #GoogleSRE #WorldCup #SearchEngine #TechNews #ReliabilityEngineering