Recovering from an Upgrade Failure at Boot
Twice now, upgrading a server from "Karmic Koala" to "Lucid Lynx" (the latter being an LTS release) has failed. Each server refused to boot after the upgrade, and whilst I cannot explain why, I can at least try to find a way around the failure and restore the system to working order. This article is my attempt to recover from such a disaster.
The failures occurred on a "Dell 1425 Poweredge Server" and a "Compaq Proliant DL380 G3 Server", so I am leaning towards this being a generic failure rather than a hardware-related one. Both systems use 4 separate partitions to store the file system and data. Although the hardware is very different, the operating system and data are partitioned identically.
- partition 1 /boot 2 GB
- partition 2 swap 8 GB (twice installed memory)
- partition 3 / 10 GB (operating system)
- partition 4 /home (whatever space remains)
All of the above are primary partitions. I only needed 4, so all 4 primary slots were used; a drive can cope with at most 4 primary partitions and cannot have more without including an extended partition. What is interesting here is that the Dell Poweredge 1425 uses a single SATA hard disk while the Compaq Proliant uses 6 disks in a RAID 5 configuration. However, the Proliant still presents this as a single drive with four partitions. So there is no similarity between the disk hardware, and again the failure may be considered a generic software fault.
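One way to confirm this layout from a shell is to read the kernel's own partition listing. What follows is only a minimal sketch, assuming a Linux system that exposes /proc/partitions; the function name list_partitions is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Print the name and size in GiB of every block device found in a
# /proc/partitions-style file (columns: major minor #blocks name).
# The #blocks column counts 1 KiB blocks, so 1048576 blocks = 1 GiB.
# With no argument the function reads /proc/partitions itself.
list_partitions() {
    awk '$1 ~ /^[0-9]+$/ && NF == 4 { printf "%s %.1f\n", $4, $3 / 1048576 }' \
        "${1:-/proc/partitions}"
}
```

Calling list_partitions with no arguments on either server should show the whole disk followed by its four partitions, which makes it easy to check the sizes against the list above.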
If you are considering an upgrade of server software, take the server offline and work on it in a local network environment, or at least be prepared to do so if the upgrade fails. Give yourself plenty of time to carry out the work: days, not hours, in case the server fails. Be sure to warn your audience beforehand by announcing the service downtime, and carry out the upgrade at the server's location. Don't be tempted to do this remotely; updates are fine, but server distro upgrades should be handled on site. You may never need any of this advice, but if you do, you will be ready to get on with the server repairs.
If your server has failed to boot you are probably just about to panic; however, if you have followed my advice above you still have plenty of time left. So far so good. Go and have a coffee before starting the next task. Not that it is all that difficult; you just need to be clear, cool, calm and collected.
- Remove the server from the rack and switch it on while connected to a local Ethernet network
- Be sure the BIOS is set to boot from the CD-ROM drive or USB port (boot option)
- Insert into the CD tray a copy of a recent server install CD suitable for the architecture
- Select your language and then select Rescue a Broken System
- Work through the usual language, keyboard etc. preamble
- Accept the default hostname; we don't care, as we only intend to recover the system
- Select "Yes" for time zone
- Select your "/" rootfs (root filesystem); in my case this was "/dev/sda3" on the Dell and "/dev/cciss/c0d0p3" on the Compaq, both nice and easy to remember
- Select Execute a shell in "sda3" for the Dell or "c0d0p3" for the Compaq
- Select continue
- As my boot directory is on a different partition to that of the "/" rootfs, it needs to be mounted onto "/boot". The directory is there, but it is empty until mounted. A console is now available at the bottom of the screen and you are user root, so type the following between the quotes: "mount /dev/sda1 /boot" (for the Dell server) or "mount /dev/cciss/c0d0p1 /boot" (for the Compaq server)
- During the boot of the rescue system, provided you are on a local network, DHCP has retrieved most of the network information but not all; this is due to the rootfs not being mounted before the network search. So we must edit a file in /etc called "resolv.conf". Use vim or some other command line editor and comment out the assigned nameserver rather than deleting it, as you will want to change it back afterwards. Now add a nameserver line with the IP of your local router gateway and save the file.
- Test that your network is in fact working using ping to a known server on the WAN
- When a ping is returned, test that resolv.conf is correct by running "apt-get update". If this fails even though you can ping a server on the network, then resolv.conf is wrong; edit it and try again. If the updates are working, move to the next task
- Now type "apt-get upgrade"
- If no upgrade takes place, type "apt-get install update-manager-core" and then repeat both "apt-get update" and "apt-get upgrade"
- What you are looking for, more than anything else, is Grub being updated. If this happens (and it will tell you if it has), proceed to reboot the server
- Press CTRL-ALT-DEL
- Leave the CD in the CD-ROM drive and select your language at boot, but this time select First hard disk
- If this boots into the operating system on the hard disk, you can leave well alone or try again to upgrade your distribution
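The rescue-shell steps above can be sketched as a pair of shell functions. This is only an illustration under stated assumptions, not a tested recovery script: the function names use_nameserver and rescue_upgrade are hypothetical, and the device, gateway and ping target are arguments you must supply yourself. Nothing runs until a function is called.

```shell
#!/bin/sh
# Comment out the existing nameserver lines in a resolv.conf-style file
# and append the given address, so the old entries are easy to restore.
# $1 = file to edit, $2 = nameserver address (e.g. your router gateway).
use_nameserver() {
    file=$1
    addr=$2
    sed 's/^nameserver/#&/' "$file" > "$file.new" &&
        echo "nameserver $addr" >> "$file.new" &&
        cat "$file.new" > "$file" &&
        rm -f "$file.new"
}

# Mount /boot, point DNS at the gateway, check the network, then
# refresh the package lists and upgrade. Stops at the first failure.
# $1 = boot partition, $2 = gateway address, $3 = host on the WAN to ping.
rescue_upgrade() {
    mount "$1" /boot &&
        use_nameserver /etc/resolv.conf "$2" &&
        ping -c 3 "$3" &&
        apt-get update &&
        apt-get upgrade
}
```

On the Dell this would be called as rescue_upgrade /dev/sda1 followed by your own gateway address and a WAN host of your choice; on the Compaq the first argument would be /dev/cciss/c0d0p1. To undo the DNS change afterwards, remove the added line and take the "#" off the original nameserver entries.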