home.social

#postmortem — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #postmortem, aggregated by home.social.

  1. When two Hetzner servers died at the same time

    On May 12, 2026, two of my Arch Linux + LUKS servers at Hetzner became unreachable at the same moment. Both had been running for 4+ months without issue. Both had received the same pacman -Syyu the day before, but had stayed on the old kernel until the morning the websites stopped responding. I rebooted — SSH never came back. nmap -Pn -p 22 showed filtered from anywhere. No ping. No banner. The Hetzner Robot panel insisted the hardware was fine.

    Several hours went into hypotheses that turned out to be wrong:

    • The encryptssh initcpio hook referencing a /usr/lib/initcpio/udev/11-dm-initramfs.rules file that no longer exists. Real bug, no boot impact — the initramfs rebuilds anyway.
    • PermitRootLogin no in sshd_config. Real misconfiguration, fixed it, didn’t help. A refusing sshd shows closed, not filtered.
    • Predictable interface-naming drift after the systemd 260 upgrade. Patched the .network config to match by MAC. Useful hardening; not the cause.
    • Stale GRUB stage1 + core.img in the MBR. Arch never re-runs grub-install after a grub package upgrade. Refreshed it. Still filtered.
    • Kernel 7.0.5 regression. Downgraded to 6.18.3, the kernel that had run for 4 months. Still filtered. So the kernel itself wasn’t it either.

    The clue was in the persistent journal: a single recorded boot from December 31 to May 12 10:13 UTC, and absolutely nothing after. Every reboot since the upgrade was failing before systemd-journald could flush to disk — so the failure had to be in the initramfs, before the root filesystem was even mounted.

    What it almost certainly was

    Hetzner Dedicated servers configure the initramfs network with ip=dhcp on the kernel command line. That depends on Hetzner’s DHCP server replying to whatever request format the current kernel sends. Somewhere between kernel 6.18 / iproute2 6.18 and kernel 7.0 / iproute2 7.0, the request format changed enough that Hetzner’s DHCP stopped responding. Effects:

    • Old kernel at runtime kept the interface already configured (Phase A — 32 hours of healthy operation after the package upgrade).
    • New kernel cold-boots, hits DHCP, never gets an IP, dropbear cannot listen, port 22 stays filtered.

    Hetzner’s own documentation has been quietly moving away from ip=dhcp toward static IPv4 in the kernel command line. The fix is exactly that:

    GRUB_CMDLINE_LINUX="cryptdevice=/dev/md1:cryptroot ip=A.B.C.D::GATEWAY:255.255.255.255:hostname:eth0:none"
    

    One line in /etc/default/grub, grub-mkconfig, reboot. No more dependency on Hetzner’s DHCP responding to whatever your current kernel sends.

    Why it matters for anyone running this stack

    If you run Arch on Hetzner Dedicated with full-disk encryption and remote unlock via dropbear, the ip=dhcp shipped by installimage is a latent bug. It can keep working for years and then break overnight, on every machine you have, after a routine pacman -Syyu. The static-IP version is what Hetzner now recommends and removes the entire dependency.

    Tooling

    While debugging, I turned the whole rescue / chroot / diagnose / fix workflow into a Python CLI (hal) — including hal fix static-ip, which derives the static cmdline directly from your existing systemd-networkd .network file:

    github.com/kevinveenbirkenbach/hetzner-arch-luks

    Single command, idempotent, reversible (the original /etc/default/grub is backed up to .hal-backup). If you’re on this stack, switch to static IP before the next kernel upgrade catches you.

    #ArchLinux #bootFailure #debugging #DevOps #DHCP #Dropbear #fullDiskEncryption #GRUB #Hetzner #initramfs #kernelUpgrade #Linux #LUKS #mkinitcpio #pacman #postmortem #PythonCLI #serverOutage #sysadmin #systemdNetworkd
  2. When two Hetzner servers died at the same time

    On May 12, 2026, two of my Arch Linux + LUKS servers at Hetzner became unreachable at the same moment. Both had been running for 4+ months without issue. Both had received the same pacman -Syyu the day before, but had stayed on the old kernel until the morning the websites stopped responding. I rebooted — SSH never came back. nmap -Pn -p 22 showed filtered from anywhere. No ping. No banner. The Hetzner Robot panel insisted the hardware was fine.

    Several hours went into hypotheses that turned out to be wrong:

    • The encryptssh initcpio hook referencing a /usr/lib/initcpio/udev/11-dm-initramfs.rules file that no longer exists. Real bug, no boot impact — the initramfs rebuilds anyway.
    • PermitRootLogin no in sshd_config. Real misconfiguration, fixed it, didn’t help. A refusing sshd shows closed, not filtered.
    • Predictable interface-naming drift after the systemd 260 upgrade. Patched the .network config to match by MAC. Useful hardening; not the cause.
    • Stale GRUB stage1 + core.img in the MBR. Arch never re-runs grub-install after a grub package upgrade. Refreshed it. Still filtered.
    • Kernel 7.0.5 regression. Downgraded to 6.18.3, the kernel that had run for 4 months. Still filtered. So the kernel itself wasn’t it either.

    The clue was in the persistent journal: a single recorded boot from December 31 to May 12 10:13 UTC, and absolutely nothing after. Every reboot since the upgrade was failing before systemd-journald could flush to disk — so the failure had to be in the initramfs, before the root filesystem was even mounted.

    What it almost certainly was

    Hetzner Dedicated servers configure the initramfs network with ip=dhcp on the kernel command line. That depends on Hetzner’s DHCP server replying to whatever request format the current kernel sends. Somewhere between kernel 6.18 / iproute2 6.18 and kernel 7.0 / iproute2 7.0, the request format changed enough that Hetzner’s DHCP stopped responding. Effects:

    • Old kernel at runtime kept the interface already configured (Phase A — 32 hours of healthy operation after the package upgrade).
    • New kernel cold-boots, hits DHCP, never gets an IP, dropbear cannot listen, port 22 stays filtered.

    Hetzner’s own documentation has been quietly moving away from ip=dhcp toward static IPv4 in the kernel command line. The fix is exactly that:

    GRUB_CMDLINE_LINUX="cryptdevice=/dev/md1:cryptroot ip=A.B.C.D::GATEWAY:255.255.255.255:hostname:eth0:none"
    

    One line in /etc/default/grub, grub-mkconfig, reboot. No more dependency on Hetzner’s DHCP responding to whatever your current kernel sends.

    Why it matters for anyone running this stack

    If you run Arch on Hetzner Dedicated with full-disk encryption and remote unlock via dropbear, the ip=dhcp shipped by installimage is a latent bug. It can keep working for years and then break overnight, on every machine you have, after a routine pacman -Syyu. The static-IP version is what Hetzner now recommends and removes the entire dependency.

    Tooling

    While debugging, I turned the whole rescue / chroot / diagnose / fix workflow into a Python CLI (hal) — including hal fix static-ip, which derives the static cmdline directly from your existing systemd-networkd .network file:

    github.com/kevinveenbirkenbach/hetzner-arch-luks

    Single command, idempotent, reversible (the original /etc/default/grub is backed up to .hal-backup). If you’re on this stack, switch to static IP before the next kernel upgrade catches you.

    #ArchLinux #bootFailure #debugging #DevOps #DHCP #Dropbear #fullDiskEncryption #GRUB #Hetzner #initramfs #kernelUpgrade #Linux #LUKS #mkinitcpio #pacman #postmortem #PythonCLI #serverOutage #sysadmin #systemdNetworkd
  3. When two Hetzner servers died at the same time

    On May 12, 2026, two of my Arch Linux + LUKS servers at Hetzner became unreachable at the same moment. Both had been running for 4+ months without issue. Both had received the same pacman -Syyu the day before, but had stayed on the old kernel until the morning the websites stopped responding. I rebooted — SSH never came back. nmap -Pn -p 22 showed filtered from anywhere. No ping. No banner. The Hetzner Robot panel insisted the hardware was fine.

    Several hours went into hypotheses that turned out to be wrong:

    • The encryptssh initcpio hook referencing a /usr/lib/initcpio/udev/11-dm-initramfs.rules file that no longer exists. Real bug, no boot impact — the initramfs rebuilds anyway.
    • PermitRootLogin no in sshd_config. Real misconfiguration, fixed it, didn’t help. A refusing sshd shows closed, not filtered.
    • Predictable interface-naming drift after the systemd 260 upgrade. Patched the .network config to match by MAC. Useful hardening; not the cause.
    • Stale GRUB stage1 + core.img in the MBR. Arch never re-runs grub-install after a grub package upgrade. Refreshed it. Still filtered.
    • Kernel 7.0.5 regression. Downgraded to 6.18.3, the kernel that had run for 4 months. Still filtered. So the kernel itself wasn’t it either.

    The clue was in the persistent journal: a single recorded boot from December 31 to May 12 10:13 UTC, and absolutely nothing after. Every reboot since the upgrade was failing before systemd-journald could flush to disk — so the failure had to be in the initramfs, before the root filesystem was even mounted.

    What it almost certainly was

    Hetzner Dedicated servers configure the initramfs network with ip=dhcp on the kernel command line. That depends on Hetzner’s DHCP server replying to whatever request format the current kernel sends. Somewhere between kernel 6.18 / iproute2 6.18 and kernel 7.0 / iproute2 7.0, the request format changed enough that Hetzner’s DHCP stopped responding. Effects:

    • Old kernel at runtime kept the interface already configured (Phase A — 32 hours of healthy operation after the package upgrade).
    • New kernel cold-boots, hits DHCP, never gets an IP, dropbear cannot listen, port 22 stays filtered.

    Hetzner’s own documentation has been quietly moving away from ip=dhcp toward static IPv4 in the kernel command line. The fix is exactly that:

    GRUB_CMDLINE_LINUX="cryptdevice=/dev/md1:cryptroot ip=A.B.C.D::GATEWAY:255.255.255.255:hostname:eth0:none"
    

    One line in /etc/default/grub, grub-mkconfig, reboot. No more dependency on Hetzner’s DHCP responding to whatever your current kernel sends.

    Why it matters for anyone running this stack

    If you run Arch on Hetzner Dedicated with full-disk encryption and remote unlock via dropbear, the ip=dhcp shipped by installimage is a latent bug. It can keep working for years and then break overnight, on every machine you have, after a routine pacman -Syyu. The static-IP version is what Hetzner now recommends and removes the entire dependency.

    Tooling

    While debugging, I turned the whole rescue / chroot / diagnose / fix workflow into a Python CLI (hal) — including hal fix static-ip, which derives the static cmdline directly from your existing systemd-networkd .network file:

    github.com/kevinveenbirkenbach/hetzner-arch-luks

    Single command, idempotent, reversible (the original /etc/default/grub is backed up to .hal-backup). If you’re on this stack, switch to static IP before the next kernel upgrade catches you.

    #ArchLinux #bootFailure #debugging #DevOps #DHCP #Dropbear #fullDiskEncryption #GRUB #Hetzner #initramfs #kernelUpgrade #Linux #LUKS #mkinitcpio #pacman #postmortem #PythonCLI #serverOutage #sysadmin #systemdNetworkd
  4. When two Hetzner servers died at the same time

    On May 12, 2026, two of my Arch Linux + LUKS servers at Hetzner became unreachable at the same moment. Both had been running for 4+ months without issue. Both had received the same pacman -Syyu the day before, but had stayed on the old kernel until the morning the websites stopped responding. I rebooted — SSH never came back. nmap -Pn -p 22 showed filtered from anywhere. No ping. No banner. The Hetzner Robot panel insisted the hardware was fine.

    Several hours went into hypotheses that turned out to be wrong:

    • The encryptssh initcpio hook referencing a /usr/lib/initcpio/udev/11-dm-initramfs.rules file that no longer exists. Real bug, no boot impact — the initramfs rebuilds anyway.
    • PermitRootLogin no in sshd_config. Real misconfiguration, fixed it, didn’t help. A refusing sshd shows closed, not filtered.
    • Predictable interface-naming drift after the systemd 260 upgrade. Patched the .network config to match by MAC. Useful hardening; not the cause.
    • Stale GRUB stage1 + core.img in the MBR. Arch never re-runs grub-install after a grub package upgrade. Refreshed it. Still filtered.
    • Kernel 7.0.5 regression. Downgraded to 6.18.3, the kernel that had run for 4 months. Still filtered. So the kernel itself wasn’t it either.

    The clue was in the persistent journal: a single recorded boot from December 31 to May 12 10:13 UTC, and absolutely nothing after. Every reboot since the upgrade was failing before systemd-journald could flush to disk — so the failure had to be in the initramfs, before the root filesystem was even mounted.

    What it almost certainly was

    Hetzner Dedicated servers configure the initramfs network with ip=dhcp on the kernel command line. That depends on Hetzner’s DHCP server replying to whatever request format the current kernel sends. Somewhere between kernel 6.18 / iproute2 6.18 and kernel 7.0 / iproute2 7.0, the request format changed enough that Hetzner’s DHCP stopped responding. Effects:

    • Old kernel at runtime kept the interface already configured (Phase A — 32 hours of healthy operation after the package upgrade).
    • New kernel cold-boots, hits DHCP, never gets an IP, dropbear cannot listen, port 22 stays filtered.

    Hetzner’s own documentation has been quietly moving away from ip=dhcp toward static IPv4 in the kernel command line. The fix is exactly that:

    GRUB_CMDLINE_LINUX="cryptdevice=/dev/md1:cryptroot ip=A.B.C.D::GATEWAY:255.255.255.255:hostname:eth0:none"
    

    One line in /etc/default/grub, grub-mkconfig, reboot. No more dependency on Hetzner’s DHCP responding to whatever your current kernel sends.

    Why it matters for anyone running this stack

    If you run Arch on Hetzner Dedicated with full-disk encryption and remote unlock via dropbear, the ip=dhcp shipped by installimage is a latent bug. It can keep working for years and then break overnight, on every machine you have, after a routine pacman -Syyu. The static-IP version is what Hetzner now recommends and removes the entire dependency.

    Tooling

    While debugging, I turned the whole rescue / chroot / diagnose / fix workflow into a Python CLI (hal) — including hal fix static-ip, which derives the static cmdline directly from your existing systemd-networkd .network file:

    github.com/kevinveenbirkenbach/hetzner-arch-luks

    Single command, idempotent, reversible (the original /etc/default/grub is backed up to .hal-backup). If you’re on this stack, switch to static IP before the next kernel upgrade catches you.

    #ArchLinux #bootFailure #debugging #DevOps #DHCP #Dropbear #fullDiskEncryption #GRUB #Hetzner #initramfs #kernelUpgrade #Linux #LUKS #mkinitcpio #pacman #postmortem #PythonCLI #serverOutage #sysadmin #systemdNetworkd
  5. When two Hetzner servers died at the same time

    On May 12, 2026, two of my Arch Linux + LUKS servers at Hetzner became unreachable at the same moment. Both had been running for 4+ months without issue. Both had received the same pacman -Syyu the day before, but had stayed on the old kernel until the morning the websites stopped responding. I rebooted — SSH never came back. nmap -Pn -p 22 showed filtered from anywhere. No ping. No banner. The Hetzner Robot panel insisted the hardware was fine.

    Several hours went into hypotheses that turned out to be wrong:

    • The encryptssh initcpio hook referencing a /usr/lib/initcpio/udev/11-dm-initramfs.rules file that no longer exists. Real bug, no boot impact — the initramfs rebuilds anyway.
    • PermitRootLogin no in sshd_config. Real misconfiguration, fixed it, didn’t help. A refusing sshd shows closed, not filtered.
    • Predictable interface-naming drift after the systemd 260 upgrade. Patched the .network config to match by MAC. Useful hardening; not the cause.
    • Stale GRUB stage1 + core.img in the MBR. Arch never re-runs grub-install after a grub package upgrade. Refreshed it. Still filtered.
    • Kernel 7.0.5 regression. Downgraded to 6.18.3, the kernel that had run for 4 months. Still filtered. So the kernel itself wasn’t it either.

    The clue was in the persistent journal: a single recorded boot from December 31 to May 12 10:13 UTC, and absolutely nothing after. Every reboot since the upgrade was failing before systemd-journald could flush to disk — so the failure had to be in the initramfs, before the root filesystem was even mounted.

    What it almost certainly was

    Hetzner Dedicated servers configure the initramfs network with ip=dhcp on the kernel command line. That depends on Hetzner’s DHCP server replying to whatever request format the current kernel sends. Somewhere between kernel 6.18 / iproute2 6.18 and kernel 7.0 / iproute2 7.0, the request format changed enough that Hetzner’s DHCP stopped responding. Effects:

    • Old kernel at runtime kept the interface already configured (Phase A — 32 hours of healthy operation after the package upgrade).
    • New kernel cold-boots, hits DHCP, never gets an IP, dropbear cannot listen, port 22 stays filtered.

    Hetzner’s own documentation has been quietly moving away from ip=dhcp toward static IPv4 in the kernel command line. The fix is exactly that:

    GRUB_CMDLINE_LINUX="cryptdevice=/dev/md1:cryptroot ip=A.B.C.D::GATEWAY:255.255.255.255:hostname:eth0:none"
    

    One line in /etc/default/grub, grub-mkconfig, reboot. No more dependency on Hetzner’s DHCP responding to whatever your current kernel sends.

    Why it matters for anyone running this stack

    If you run Arch on Hetzner Dedicated with full-disk encryption and remote unlock via dropbear, the ip=dhcp shipped by installimage is a latent bug. It can keep working for years and then break overnight, on every machine you have, after a routine pacman -Syyu. The static-IP version is what Hetzner now recommends and removes the entire dependency.

    Tooling

    While debugging, I turned the whole rescue / chroot / diagnose / fix workflow into a Python CLI (hal) — including hal fix static-ip, which derives the static cmdline directly from your existing systemd-networkd .network file:

    github.com/kevinveenbirkenbach/hetzner-arch-luks

    Single command, idempotent, reversible (the original /etc/default/grub is backed up to .hal-backup). If you’re on this stack, switch to static IP before the next kernel upgrade catches you.

    #ArchLinux #bootFailure #debugging #DevOps #DHCP #Dropbear #fullDiskEncryption #GRUB #Hetzner #initramfs #kernelUpgrade #Linux #LUKS #mkinitcpio #pacman #postmortem #PythonCLI #serverOutage #sysadmin #systemdNetworkd
  6. A quotation from George Washington

    Though in reviewing the incidents of my Administration I am unconscious of intentional error, I am nevertheless too sensible of my defects not to think it probable that I may have committed many errors. Whatever they may be, I fervently beseech the Almighty to avert or mitigate the evils to which they may tend. I shall also carry with me the hope that my country will never cease to view them with indulgence, and that, after forty-five years of my life dedicated to its service with an upright zeal, the faults of incompetent abilities will be consigned to oblivion, as myself must soon be to the mansions of rest.

    George Washington (1732–1799) American military leader, Founding Father, US President (1789–1797)
    Letter (1796-09-17), "Farewell Address" [with J. Madison, A. Hamilton]

    More about this quote: wist.info/washington-george/81…

    #quote #quotes #quotation #qotd #georgewashington #selfcriticism #selfassessment #retirement #administration #blame #error #fault #forgiveness #honestmistake #humility #mistake #oblivion #postmortem #president #review #selfdeprecation

  7. Как анализировать инциденты. История об ошибках

    Стоимость минуты простоя в iGaming может приносить миллионы упущенной прибыли и более тяжелые репутационные потери. Когда real‑time ставки замирают, а букмекерские терминалы уходят в ступор — это не просто баг. Это экзамен на зрелость команды и процессов. Что мы делаем после — определяет, повторится ли он снова.

    habr.com/ru/articles/933964/

    #rca #postmortem #incident #problem #highload #itil #itsm