Lessons from the recent 3-day outage

Bady

git.fosscommunity.in was down for 3 days, from Oct 4 to Oct 7, 2018. First of all, I would like to confess that it was mainly my own mistakes that made the issue worse.

Mistake 1) Lack of proper communication.
Mistake 2) Trying to debug the wrong issue, which only created another, bigger set of issues.

On Oct 3 a user reported in the Matrix room that he was unable to create new repos (see attachment1). The issue went unresolved that night as the server admins were not available. Since I had access to the server I logged in and checked the logs but couldn't find anything related to the reported issue.

On Oct 4 I woke up to a message that the GitLab service was down and the server was showing "502 Bad Gateway". Since I was in a hurry to attend an event, I had to wait until I was on the train to check what went wrong. On further inspection I learned that, after investigating the logs the previous night, I had reloaded the gitlab service, but the reload had somehow failed without giving any feedback, and that was what was causing the 502.

Before starting the service again, I thought it would be better to try to resolve the repo creation issue. On running an apt upgrade there was an error regarding lvm2. I thought it might have something to do with the actual error, so I tried to troubleshoot it and tried some solutions from stackexchange sites, but nothing worked. I am not sure which solutions I tried, but in the end the copy-paste solutions resulted in the entire system being downgraded, and as a result many packages became broken, including gitlab and apt. Not knowing what to do, I spent some more time researching the issue online. After some time I lost the ssh connection and wasn't able to get back into the server anymore.

That's when I called @praveenarimbrathod and asked if he could take a look, but by then it was already too late: he restarted the server without knowing everything I had done! I hadn't called him earlier because I knew he was busy and thought I shouldn't disturb him. But if I had communicated my activities in the Matrix room, I could have avoided the server restart, and hence the ssh connection would not have been lost.

The only option left was to open a support ticket at gandi.net, where the server is hosted, and wait for their response, so I did that. The next day, Oct 5, I got a reply saying that we had to create a new server and attach the old hard disk to it to investigate what went wrong. But then another issue came up: there was a mismatch between the available credits shown in the new gandi v5 interface and those shown in the old v4 interface. As a result, it wasn't possible to create a new server from the v5 interface without paying an additional amount, even though enough credits were already available in the account. So I opened another ticket.

While I was waiting for the reply, on Oct 6, @praveenarimbrathod asked me if I could try creating the new server from the v4 interface. Fortunately it worked (using the existing credits). After creating the new server I attached the hard disk to it and started troubleshooting. As mentioned above, apt was broken, so apt update returned an error (see attachment2). Then @praveenarimbrathod asked if I could run apt -t stretch-backports -f install. It worked, and many of the downgraded packages got upgraded again. It also resolved the issues with apt.

Early in the morning on Oct 7, @praveenarimbrathod suggested that, since the apt issue was solved, we could now detach the hard disk from the new server and attach it back to the original server. So I did that and ran an apt upgrade, after which the GitLab service worked fine.

Important lessons learned

1) Communicate important activities through proper channels on the go.
2) It is not necessary to try to solve every issue you encounter: "if it ain't broke, don't fix it."
3) Do not blindly depend on online solutions without trying to understand them properly (they may work on PCs, but don't try this on prod servers!).
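One way to act on lesson 3 is to preview what a suggested apt command would do before running it for real. The sketch below is only an illustration, not the procedure that was used during the outage. It assumes the apt-get simulation flag (-s) and the English "DOWNGRADED" / "REMOVED" section headers that apt-get prints. The function names and the sample output are made up for the example.

```python
import os
import subprocess

# Section headers apt-get prints before listing affected packages
# (English locale assumed; simulate() below forces LC_ALL=C).
RISKY_HEADERS = {
    "The following packages will be DOWNGRADED:": "downgraded",
    "The following packages will be REMOVED:": "removed",
}


def risky_changes(simulated_output: str) -> dict[str, list[str]]:
    """Collect the package names listed under the DOWNGRADED/REMOVED headers."""
    found = {"downgraded": [], "removed": []}
    current = None
    for line in simulated_output.splitlines():
        stripped = line.strip()
        if stripped in RISKY_HEADERS:
            current = RISKY_HEADERS[stripped]
        elif current and line.startswith(" ") and stripped:
            # Package lists are indented under their header; a trailing "*"
            # marks a package that would be purged.
            found[current].extend(name.rstrip("*") for name in stripped.split())
        else:
            current = None
    return found


def simulate(args: list[str]) -> str:
    """Run apt-get in simulation mode (-s); nothing is installed or removed."""
    env = {**os.environ, "LC_ALL": "C"}
    result = subprocess.run(
        ["apt-get", "-s", *args], capture_output=True, text=True, env=env
    )
    return result.stdout


# Invented sample output, only to show the shape the parser expects.
SAMPLE = """\
Reading package lists... Done
The following packages will be DOWNGRADED:
  libexample1 example-tool
The following packages will be REMOVED:
  old-example*
0 upgraded, 0 newly installed, 2 downgraded, 1 to remove and 0 not upgraded.
"""

print(risky_changes(SAMPLE))
```

On a real server one would call something like risky_changes(simulate(["upgrade"])) and stop to ask in the Matrix room if either list is non-empty.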

Suggestions

1) Document a detailed, step-by-step maintenance procedure at wiki.fsci.org.in and ask maintainers to follow it as the standard procedure.
2) Document how and where to check for error logs.
3) Document known errors and issues and their impact level.
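As a possible starting point for such a documented procedure, a small check run after every reload or restart could confirm that the web front end actually answers, since in this outage the reload failed without any feedback. This is a minimal sketch using only the Python standard library. The function names and result strings are hypothetical, and treating 502/503/504 as "backend down" is an assumption about how the reverse proxy in front of GitLab behaves.

```python
import urllib.error
import urllib.request


def interpret_status(code: int) -> str:
    """Map an HTTP status code to a short, human-readable verdict."""
    if 200 <= code < 400:
        return "ok"
    if code in (502, 503, 504):
        # The proxy answered, but the application behind it did not.
        return "backend down"
    return "error"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL once and report whether the service looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return interpret_status(response.status)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError and carry the status code.
        return interpret_status(err.code)
    except OSError as err:
        # URLError, timeouts and refused connections are all OSError subclasses.
        return f"unreachable: {err}"


# Example: print(probe("https://git.fosscommunity.in/"))
```

Running the probe right after a service reload, and posting the result in the Matrix room, would cover both the missing feedback and the communication gap described above.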

Feedback

Thanks to all who supported me during this crisis, especially to @praveenarimbrathod for his valuable advice and cooperation. It was a valuable experience and I'll keep the lessons learned in mind for the future.

Pirate Praveen Wed 10 Oct 2018 7:04AM

@bady thanks a lot for the detailed report and your persistent efforts to get the service back online. It is part and parcel of being an administrator to face, troubleshoot and fix issues. The important thing is to learn from the mistakes and take steps to avoid such mistakes in future.

Pirate Praveen Wed 10 Oct 2018 7:24AM

Updated https://wiki.fsci.org.in/index.php/GitLab#Setup with links to debian package specific documentation. Suggest missing parts (that you learned the hard way) via merge request to master branch.

Akshay Thu 11 Oct 2018 5:49AM

This is the best thing about any issue. We get a lot of documentation! And that helps immensely in the future.