I finally did it!
It took me all of yesterday and all of today to fix the server, but I did it!
Tomorrow I’ll try to catch up with the posts which should have gone up while I was doing the repairs.
The culprit was fail2ban
No clue why or how it happened, but every time I restored the backup (fail2ban included), I was unable to log back in after a reboot.
I basically got no response from the server over SSH.
Having no way of looking at the logs, I just assumed at first it had something to do with my SSH config.
I was wrong.
Why did it take me so long?
First things first: when you configure Amazon S3 to move your files to Glacier after a while, well, Amazon obeys.
But when you need the files, you’ve got to unfreeze them.
If you don’t want to pay extra, that’s gonna take 3-5 hours.
You only pay this waiting time once if you tell Amazon to keep the files unfrozen for long enough.
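If you would rather script the unfreezing than click through the console, here is a minimal sketch using boto3’s `restore_object` call. The bucket name, the keys and the 7-day default are placeholders made up for illustration, not my actual setup. I am also assuming the "Standard" retrieval tier, which is the one that usually matches the 3-5 hour wait.

```python
VALID_TIERS = ("Expedited", "Standard", "Bulk")


def build_restore_request(days=7, tier="Standard"):
    """Build the RestoreRequest payload for S3's RestoreObject call.

    days: how long the restored copy stays readable (assumed default: 7).
    tier: retrieval speed; faster tiers cost more.
    """
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown retrieval tier: {tier!r}")
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}


def request_restore(bucket, keys, days=7, tier="Standard"):
    """Ask S3 to unfreeze every key in `keys` (hypothetical helper)."""
    import boto3  # imported here so the payload helper works without boto3

    s3 = boto3.client("s3")
    for key in keys:
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest=build_restore_request(days, tier),
        )


# Example call (placeholder names, needs AWS credentials):
# request_restore("my-backup-bucket", ["backups/server.tar.gz"], days=14)
```

Picking a generous `days` value is what makes the wait a one-off: the restored copies stay readable for that long, so every later retry can download straight away.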
Then, reinitialising the server takes 10 minutes every time you mess up.
Installing the necessary packages to load the backup takes 5 minutes.
Downloading the entire backup usually takes around 10 minutes on the server.
Copying the files from the backup folder to the live system takes another 5.
So that’s 30 minutes just to reset the server to a point where you can try to experiment with stuff.
That’s a HUGE iteration time, so progress is naturally quite slow.
You could skip all this if your host provides you with a KVM console (keyboard, video and mouse over the network), which is basically virtual physical access to the server.
But mine is so bad, I can’t even enter the login password before the connection resets :/
The oh so long search…
The first few hours were spent investigating the SSH config and making sure SSH gets started after a reboot…
After a while, I decided to only copy some files to see if I could pin down the culprit.
I immediately found out that the ‘etc/’ folder contained the issue (I looked at this one first because I initially suspected SSH).
After a few failed attempts, I changed strategy and tried to get the server running by only copying the folders I thought necessary.
After each modification, I had to reboot to check if it was the culprit.
At some point, after a long search already, I opened the auth log and saw a few failed login attempts from China.
So I decided it was time to copy fail2ban over to deal with those login attempts.
I restarted and there it was! I couldn’t connect anymore!
It’s finally working!
So I went through the whole process of loading the backup again, except that this time I excluded the fail2ban folder.
And everything worked.
Just like that.
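The "copy everything except fail2ban" step can be sketched in Python like this. It is not the exact command I ran, just a minimal illustration, and the function name and paths are made up. Note that `shutil.ignore_patterns` matches by name at every level, so anything called `fail2ban` anywhere in the tree gets skipped, and that `copytree` does not preserve file ownership.

```python
import shutil


def restore_tree(backup_dir, live_dir, excluded=("fail2ban",)):
    """Copy backup_dir into live_dir, skipping entries named in `excluded`.

    Hypothetical helper: existing directories are merged into
    (dirs_exist_ok needs Python 3.8+), and symlinks are copied as symlinks.
    """
    shutil.copytree(
        backup_dir,
        live_dir,
        ignore=shutil.ignore_patterns(*excluded),
        symlinks=True,
        dirs_exist_ok=True,
    )


# Example call (placeholder paths):
# restore_tree("/root/backup/etc", "/etc")
```

Passing more names in `excluded` also makes it easy to leave out several suspects at once and add them back one reboot at a time.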
The lessons learned
1: chown is dangerous! Don’t be lazy: look up the command syntax if you’re unsure, and don’t just assume things!
2: I’ve done something like 10 full backup restores, so I should be good when I have to do the next one
3: If you can’t connect back to the server, make sure it isn’t fail2ban doing stuff
4: Use a KVM if available!
5: Unfreeze Glacier files beforehand!
6: I learned a lot of collateral knowledge while scavenging my way through this. I feel less like a toddler wandering around the Unix ecosystem and more like a small child understanding some of the basics required to make some things wörk.
Also, don’t break something which is working fine.
Unless you want to learn how it works.
Then yeah, go for it, break stuff, and fix the mess yourself!
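For lesson 3, here is a minimal sketch of how you could check whether fail2ban is currently banning an address, assuming you still have some way onto the machine (a KVM console or another IP) and root rights. The jail name `sshd` is an assumption, since it depends on your config, and the parser relies on the `Banned IP list:` line that `fail2ban-client status <jail>` prints on the versions I know.

```python
import subprocess


def parse_banned_ips(status_output):
    """Extract the addresses from the 'Banned IP list:' line, if present."""
    for line in status_output.splitlines():
        if "Banned IP list:" in line:
            return line.split("Banned IP list:", 1)[1].split()
    return []


def banned_ips(jail="sshd"):
    """Return the IPs currently banned in `jail` (jail name is an assumption)."""
    result = subprocess.run(
        ["fail2ban-client", "status", jail],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_banned_ips(result.stdout)


def unban(ip, jail="sshd"):
    """Lift the ban on `ip` in `jail`."""
    subprocess.run(["fail2ban-client", "set", jail, "unbanip", ip], check=True)


# Example (run as root on the server; the address is a placeholder):
# if "203.0.113.7" in banned_ips():
#     unban("203.0.113.7")
```

This only helps if the lockout really is a ban. In my case I never found out what fail2ban was actually doing, so treat it as a first check rather than a diagnosis.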
Live long and prosper!