rm -rf /var
Within Mythic Beasts we have an internal chat room that uses IRC (this is like Slack but free and securely stores all the history on our servers). Our monitoring system is called Ankou, named after Death’s henchman that watches over the dead, and has an IRC bot that alerts through our chat room.
This story starts with Ankoubot, who was the first to notice something was wrong with the world.
15:25:31
managed vds:abcdefg-ssh [NNNN-ssh]: 46.235.N.N => bad banner from `46.235.N.N':[46.235.N.N – VDShost:vds-hex-f]
managed vds:abcdefg-web [NNNN-web]: http://www.abcdefg.co.uk/ => Status 404 (<html> <head><title>404 Not Found</title></head> <body bgcolor="white"> <center><h1>404 Not Found</h1></center> <hr><center>nginx/1.10.3</center> </body> </html…) [46.235.N.N www.abcdefg.co.uk VDShost:vds-hex-f]
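For readers who haven’t met it, a “bad banner” alert typically means the SSH port did not answer with the identification line an SSH server sends on connect. Below is a minimal Python sketch of that kind of check, assuming the line format from RFC 4253. It is not Ankou’s actual code, and the names `fetch_first_line` and `check_ssh_banner` are hypothetical.

```python
import re
import socket

# RFC 4253 section 4.2: "SSH-protoversion-softwareversion SP comments CR LF".
# softwareversion is printable ASCII without spaces or '-'.
_BANNER_RE = re.compile(
    rb"SSH-(?:2\.0|1\.99)-[\x21-\x2c\x2e-\x7e]+(?: [^\r\n]*)?\r?\n?"
)


def fetch_first_line(host: str, port: int = 22, timeout: float = 5.0) -> bytes:
    """Connect and read the first line the server sends (hypothetical helper)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return sock.makefile("rb").readline(256)


def check_ssh_banner(first_line: bytes) -> str:
    """Return "ok" for a well-formed identification line, else a short error."""
    if _BANNER_RE.fullmatch(first_line):
        return "ok"
    return f"bad banner: {first_line[:40]!r}"
```

A monitor built this way would call `check_ssh_banner(fetch_first_line(host))` on a schedule and alert on anything other than `"ok"`. A server whose sshd cannot start properly would fail this check even though the port may still accept connections.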
15:31:42
I can’t get ssh in, I’m on the console.
15:38:16
This is an extremely broken install. ssh is blocked, none of the bind mounts work
Debugging is difficult because /var/log is missing. systemd appears completely unable to function and we have no functioning logging. Unable to get ssh to start, and fighting multiple broken tools due to missing mounts, we restart the server and mail the customer explaining what we’ve discovered so far. The restart doesn’t help: the boot hangs while attempting to configure NFS mounts.
15:53:53
To: support@mythic-beasts.com
Date: Tue, 18 Jul 2017 15:53:53 +0100
Subject: CRITICAL Customer NNNN: Broke our install
I just managed to do the dumbest of dumb and did an rm -rf on /var instead of ./var which broke most things and killed my ssh session before very long.
Can you let me know if that’s the case, and if you’ll be able to restore from the most recent backup?
15:56:35
Boot to recovery media completes ready for restore from backup.
15:58:34
From: support@mythic-beasts.com
Date: Tue, 18 Jul 2017 15:58:34 +0100
Subject: Re: CRITICAL Customer NNNN: Broke our install
Hi,
I just managed to do the dumbest of dumb and did an rm -rf on /var instead of ./var which broke most things and killed my ssh session before very long.
Ah!
We just had our alerts go off and I’d taken it to recovery mode to boot it again. I confess my thoughts had got to ‘how did this ever boot?’ and that explains why.
We’ve a backup from 00:14 this morning. I take it restoring from that would be a good plan? I’ll keep a copy of the fs in case there are any newer changes you need to recover.
Pete
--
Mythic Support <support@mythic-beasts.com>
Live service status: http://status.mythic-beasts.com
Follow us on Twitter: https://twitter.com/Mythic_Beasts
16:05:07
To: support@mythic-beasts.com
Date: Tue, 18 Jul 2017 16:05:07 +0100
Subject: Re: CRITICAL Customer NNNN: Broke our install
[…]
Cheers, the restore from this morning should be fine, I don’t believe anything should have changed since then, on the filesystem or in the dbs.
16:08:36
managed vds:abcdefg-ssh [NNNN-ssh]: back to normal
managed vds:abcdefg-web [NNNN-web]: back to normal
16:14:22
From: support@mythic-beasts.com
Date: Tue, 18 Jul 2017 16:14:22 +0100
Subject: Re: CRITICAL Customer NNNN: Broke our install
Hi,
Can you let me know if that’s the case, and if you’ll be able to restore from the most recent backup?
I’ve restored /var from the backup, booted the machine back up and done a mysql restore for you.
Our alarms have cleared and it looks like the website is back.
Can you confirm that’s all you need?
Regards,
Pete
--
Mythic Support <support@mythic-beasts.com>
Live service status: http://status.mythic-beasts.com
Follow us on Twitter: https://twitter.com/Mythic_Beasts
16:31:42
Customer confirms everything is restored and functional, and gives permission to write up the incident anonymously for our blog, including the following quote:
Mythic Beasts had come highly recommended to me for the level of support provided, and when it came to crunch time they were reacting to the problem before I’d even raised a support ticket.
This is exactly what we were looking for in a managed hosting provider, and I’m really glad we made the choice. Hopefully however, I won’t be causing quite the same sort of problem for a looooong while.
In total the customer was offline for slightly over 30 minutes, after what can best be described as a catastrophic administrator error.
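The slip behind this incident, /var instead of ./var, is a single character. As an illustration only, here is a minimal Python sketch of one way a cleanup script could refuse to delete anything outside its own working directory. The helper name `resolve_inside` is hypothetical; this is not something taken from our tooling or the customer’s.

```python
from pathlib import Path


def resolve_inside(base: Path, target: str) -> Path:
    """Resolve target relative to base and refuse anything outside base.

    An absolute target such as "/var" replaces base entirely when joined,
    so it resolves outside the tree and is rejected.
    """
    base = base.resolve()
    candidate = (base / target).resolve()
    if base not in candidate.parents:
        raise ValueError(f"refusing to touch {candidate}: not inside {base}")
    return candidate
```

A script would then call something like `shutil.rmtree(resolve_inside(Path.cwd(), "./var"))`. With `"/var"` the helper raises `ValueError` before anything is removed. The check also rejects the base directory itself and `..` escapes, since neither resolves to a path strictly below `base`.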