#reliability — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #reliability, aggregated by home.social.
-
Calculate seven times, crash once: modeling incidents before deploy
A rocket is not sent into space merely because its engine and pump each passed bench tests separately. Before launch, engineers calculate the trajectory, model operating modes, and analyze failure scenarios. Calculation does not replace real tests, but it gives them a meaningful frame. In software, things usually work differently. A distributed user journey, such as checkout, is assembled from dozens of microservices, databases, and queues. Developers add a new dependency, see green tests, check local metrics, and ship the release. The assumption is that if something goes wrong during a failure, a properly configured observability system will surely show it. And it will, of course. But why, when designing microservices, are we so comfortable learning about an architecture's fragility mostly after an incident? This article is about how to get a rough estimate of system degradation before release. Not as a replacement for chaos engineering or monitoring, but as a step before them. I describe two experiments in which a topological model was automatically extracted from distributed traces and failure scenarios were then computed on it using the Monte Carlo method. I then compared the simulation results with real fault injections on the DeathStarBench and OpenTelemetry Demo testbeds. Two experiments, results, and code
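To make the idea of Monte Carlo failure modeling on a service topology concrete, here is a minimal sketch. It is not the author's code: the call graph, the service names, the per-node failure probabilities, and the rule that every dependency is a hard requirement are all assumptions made for illustration. In the post the topology is extracted from distributed traces, whereas here it is hard-coded.

```python
import random

# Hypothetical call graph: each service lists the downstream services it
# needs in order to answer. Hard-coded here purely for illustration.
CALL_GRAPH = {
    "checkout": ["cart", "payment"],
    "cart": ["cart-db"],
    "payment": ["payment-gateway"],
    "cart-db": [],
    "payment-gateway": [],
}

# Assumed, made-up probability that each node fails on its own per request.
FAILURE_PROB = {
    "checkout": 0.001,
    "cart": 0.002,
    "payment": 0.002,
    "cart-db": 0.005,
    "payment-gateway": 0.01,
}


def request_succeeds(service, down):
    """A service answers only if it is up and all its dependencies answer."""
    if service in down:
        return False
    return all(request_succeeds(dep, down) for dep in CALL_GRAPH[service])


def estimate_failure_rate(entry, forced_down=(), trials=20_000, seed=0):
    """Monte Carlo estimate of how often a request to `entry` fails."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Sample which nodes fail in this trial, then add any injected outage.
        down = {s for s, p in FAILURE_PROB.items() if rng.random() < p}
        down.update(forced_down)
        if not request_succeeds(entry, down):
            failures += 1
    return failures / trials


if __name__ == "__main__":
    baseline = estimate_failure_rate("checkout")
    outage = estimate_failure_rate("checkout", forced_down=("cart-db",))
    print(f"baseline failure rate: {baseline:.4f}")
    print(f"with cart-db down:     {outage:.4f}")
```

Passing a service in `forced_down` plays the role of a simulated fault injection. Comparing that estimate with the baseline gives the kind of rough pre-release degradation figure the post describes. A more realistic model would also need soft dependencies, retries, and fallbacks, which this sketch leaves out.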
-
I've got a new article up on the Resilience in Software Foundation (RISF) website!
This post is an introduction to the concept of practicing together in teams and points to some resources for learning more, including the RISF event this coming Wednesday where we'll play through one of the games!
https://resilienceinsoftware.org/news/11517597
#PracticeOfPractice #Expertise #ConnectiveLabor #CommonGround #SRE #Resilience #Reliability
-
#California #Energy Commission is having a workshop today on Summer #Reliability and the slides are available here: https://links-2.govdelivery.com/CL0/https:%2F%2Fefiling.energy.ca.gov%2FGetDocument.aspx%3FDocumentContentId=106957%26tn=269794%26utm_medium=email%26utm_source=govdelivery/1/0101019df3e7758a-44e5bca1-186f-46f6-8155-f02060140cd6-000000/wKGA0WpyZqu9TrZlWPrSioe_TaVcs32sTv1AEC4aXyQ=452
-
#GitHub is prioritising #availability, #capacity, and new #features to improve #reliability and handle the rapid #growth of #softwaredevelopment workflows. Recent incidents, including a merge queue regression and a search-related outage, highlighted the need for increased isolation and reduced single points of failure. https://github.blog/news-insights/company-news/an-update-on-github-availability/?eicker.news #tech #media #news
-
GitHub update: Some explanation of the failures at an essential piece of software development infrastructure. They're under 90% uptime for the month according to the unofficial status page.
https://github.blog/news-insights/company-news/an-update-on-github-availability/
#development #reliability #scaling #github
-
the best infrastructure tends to disappear into the background when it works
-
What makes this notable is not only the total distance, but the claim that the car retained both its original engine and transmission. That gives the story unusual weight in discussions of real-world durability and long-term maintenance.
#ToyotaCorolla #Reliability #Automotive #HighMileage #CarLongevity #Transport
-
Incident reviews get sharper when they ask which decision made the failure more likely. If every postmortem ends at 'human error', the system gets a free pass. #Reliability #Governance #EA
-
Hi @jon,
#JS being more powerful is the case against it. The other is the irresponsibility of dropping features at will. #XSLT was a #w3c standard. If that's not a thing to rely on and publish documents for long-term availability, then there isn't such a thing. Fits with TLS and short-lived certs.
That's all toward card houses. And I hate any of it.
It's not about XSLT, it's about #reliability.
-
Respect For Others Often Shows Up In The Smallest Actions
As the Metamorphosis coach, I teach that reliability is a form of leadership. When you honor time, you honor people. And that consideration leaves a lasting impression long after the moment has passed.
Punctuality is respect made visible. 🤍
💭 Where could honoring time more deeply strengthen your relationships or leadership?
#MetamorphosisCoach #Respect #Leadership #Reliability #Integrity #MindsetShift #PersonalGrowth
-
https://www.europesays.com/dk/59835/ A Major Reliability Challenge in Large-Scale LLM Training (TU Berlin) #berlin #FaultInjection #Germany #GPUs #HardwareSecurity #LLMTraining #LLMs #reliability #SDC #SilentDataCorruption #TechnischeUniversitätBerlin
-
Silent Data Corruption: A Major Reliability Challenge in Large-Scale LLM Training (TU Berlin)
A new technical paper, “Exploring Silent Data Corruption as a Reliability Challenge in LLM Training,” was published by…
#Germany #DE #Europe #EU #Europa #Berlin #faultinjection #GPUs #hardwaresecurity #LLMtraining #LLMs #reliability #SDC #silentdatacorruption #TechnischeUniversitätBerlin
https://www.europesays.com/germany/4039/
-
The #syslog_ng April newsletter is now available on-line:
- #Automatic configuration of the syslog-ng wildcard-file() source
- What to fix next in syslog-ng?
- #UDP #reliability improved in syslog-ng #Debian packaging
Read more at: https://www.syslog-ng.com/community/b/blog/posts/the-syslog-ng-insider-2026-04-wildcard-file-fix-udp
-
💾🎩 Oh look, another digital hipster waxing poetic about the sheer brilliance of cramming an entire store into a single #SQLite file. Because nothing screams 'enterprise-scale reliability' like a solitary file just waiting to be corrupted. 🚀🥴
https://ultrathink.art/blog/sqlite-in-production-lessons #digitalhipsters #enterprise #reliability #data #integrity #techhumor #HackerNews #ngated
-
Microsoft is proud to announce that after an intensive program focused on reliability improvements, Azure Cloud has now reached the gold standard of five eights uptime.
#Azure #AzureCloud #FiveEights #FiveNines #reliability #down