Our appliance products, SCB and SSB, are built on heavily customized Ubuntu distributions. Most of them are based on the Dapper release, but starting with SCB 3.1 we began migrating them to the newest LTS, Lucid Lynx. Upgrading directly from a four-year-old OS while switching from 32-bit to 64-bit at the same time was an enormous task that turned up several interesting problems. I’d like to share the one I feel was the most important, and which may be the most valuable for others as well.
To understand our problems, I have to start by briefly describing how the boot process works in our products. We have two firmwares: the boot and the core firmware. The first one is responsible for the early stages of the boot process, some low-level things and the HA operation, while the core firmware is responsible for the actual production work (that is, receiving logs in SSB or auditing connections in SCB). A simplified description of what happens at boot time is as follows:
- the bootloader is started and it selects the necessary boot firmware
- the linuxrc script in the initrd sets up the filesystem for the boot firmware and starts it
- the boot firmware sets up the HA interface and the DRBD filesystem between the HA nodes
- the boot firmware starts heartbeat, which works out whether this node is the master or the slave; if it is the master, the core firmware is started
- when the core firmware is started, the DRBD filesystem under it is pulled up to be the primary on this node, some more filesystem magic is done, and
- the init process is started on the core firmware
This last step is where our problems with the Lucid boot process started. Starting the boot process on the purely System V init-based Dapper was simple: calling “/etc/init.d/rcS” and “/etc/init.d/rc 2” did the job. These scripts looked for initscripts in the /etc/rc*.d directories and called them in order. The entries there were symlinks to scripts in /etc/init.d, which started and stopped the services nicely.
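To make the ordering concrete, here is a minimal Python sketch of how a sequential run over one runlevel directory picks its start scripts. It is an illustration only, not the code of the real “/etc/init.d/rc” (which is a shell script); the function names start_order and start_commands are made up for this example, and the commands are only built, never executed.

```python
import os


def start_order(entries):
    """Sort rc directory entries the way a sequential rc run would start them.

    Only names beginning with "S" are start links. The two digits after the
    "S" are the sequence number, so plain lexicographic order is enough.
    """
    return sorted(name for name in entries if name.startswith("S"))


def start_commands(rc_dir):
    """Build (but do not run) the start commands for one runlevel directory."""
    return [[os.path.join(rc_dir, name), "start"]
            for name in start_order(os.listdir(rc_dir))]


if __name__ == "__main__":
    # Hypothetical directory listing; "K" links are kill links and are skipped.
    print(start_order(["S20ssh", "K20nfs", "S10sysklogd", "S99rc.local"]))
```

Calling start_commands("/etc/rc2.d") on a real system would list what a plain “rc 2” run would invoke, in order.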
Now, the boot process on Lucid has been partially converted to use Upstart, where things work a bit differently. After the low-level boot (bootloader, kernel, initrd-linuxrc etc.) is done, execution is passed to “/sbin/init”, which is the Upstart daemon, and this is where the changes begin. In the traditional System V boot, after a bit of fiddling around, the init process would do the same thing we do in our core firmware: call the initscripts in /etc/rc*.d in the proper sequence. Upstart instead parses its own configuration files in “/etc/init”, which describe the services that need to be started along with the preconditions for each of them (e.g. NTP should be started only after the primary network interface is up and running), and then figures out the best sequence to start them. It’s a nice method: it allows parallelization and thus a faster boot, and the whole configuration is much cleaner because the system administrator does not have to juggle explicit sequence numbers to make sure one service is started before another. That always reminded me of programming BASIC on the C64, where you always left a gap of 10 or 20 between line numbers, just in case something had to be inserted there later…
What’s more, as not everything (including the users’ minds) has been converted to Upstart, a nice compatibility layer is maintained as well: Upstart still runs the remaining old-school /etc/rc*.d scripts, and the initscripts of Upstart-converted jobs can still mostly be found in /etc/init.d as symlinks to a script called “upstart-job”, which just tells us that the job has been converted to Upstart but otherwise does everything as needed. So everything would be fine and work just as it did, unless… [dramatic drumroll]
Well, there’s a small hiccup: Upstart simply does not work in a chrooted environment. There are several reasons for this (it relies on running as PID 1, it communicates in a simple client-server architecture using D-Bus, etc.), but mainly it seems that it simply wasn’t designed to work in a chroot. A quick search turns up reports from lots of complaining users and even a long-open Launchpad bug entry, but overall it seems that this is not something that can be trivially fixed, nor do the Ubuntu guys plan to fix it soon. The only way they addressed the issue was to suggest symlinking /sbin/initctl to /bin/true in a chrooted environment, so that at least no error is triggered when the postinst script run by an “apt-get install” tries to start the service it has just installed.
At this point, we were a bit desperate. Fixing Upstart is not an easy task and it’s probably not the best idea, either: it’s simply not designed to work the way we want to use it. We could scrap Upstart altogether, but then we’d need to write and maintain separate System V-style initscripts for all the services that have been converted to use Upstart in upstream Ubuntu, which is never a good thing to do. So we decided to take a third path: write a new, minimal Upstart that does exactly what we need, that is, take the stock, Ubuntu-provided Upstart config file for a service, start and stop it as necessary, and be able to do so in a chroot.
Thus we wrote “upstart-dummy”, an almost-drop-in replacement for Upstart. It’s a Python script that can be put in place of the standard Upstart binary at /sbin/initctl. The commands “start”, “stop”, “restart”, “reload” and “status” are symlinks to /sbin/initctl by default, and the script is able to handle being called through these symlinks. It tries to find the appropriate config file for the service, parses it and runs the necessary commands, including the pre-start, post-stop etc. scripts that can be found in it. What it cannot do, and this is why it’s not a complete replacement, is work as a proper standalone init process: it is not able to figure out the dependencies and does not know which services it needs to start on a given runlevel. But if you’re using it on a system that kept the System V init compatibility layer (and Lucid is one of them), all you need to do is re-add the symlinks in “/etc/rc*.d” that point to the entries in “/etc/init.d/”. Each of those entries is itself a symlink to “/lib/init/upstart-job”, which calls this fake Upstart, which, this time, does indeed work in a chroot. After this, you can simply start “/etc/init.d/rcS” and “/etc/init.d/rc 2” (which are still there in a stock Lucid), and they will boot the chrooted system properly.
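To illustrate the idea (this is not the actual upstart-dummy code), here is a minimal Python 3 sketch of the two pieces described above: working out the action from the name the script was called by, and pulling the “exec”, “script” and “expect” stanzas out of a job file. All function names, the sample job and the “exampled” daemon are hypothetical. Running stanzas through “/bin/sh -e -c” is an assumption modelled on how Upstart runs script sections, and every other stanza (start on, respawn, …) is simply ignored.

```python
import os

ACTIONS = ("start", "stop", "restart", "reload", "status")
HOOKS = ("pre-start", "post-start", "pre-stop", "post-stop")

# Hypothetical job file, for illustration only.
SAMPLE_JOB = """\
description "example daemon"
start on filesystem
expect fork
pre-start script
    mkdir -p /var/run/exampled
end script
exec /usr/sbin/exampled --config /etc/exampled.conf
"""


def action_from_argv(argv):
    """Work out (action, job name) from how the script was invoked."""
    called_as = os.path.basename(argv[0])
    if called_as in ACTIONS:        # e.g. /sbin/start ssh
        return called_as, argv[1]
    return argv[1], argv[2]         # e.g. /sbin/initctl start ssh


def split_word(line):
    """Split off the first word; missing parts come back as empty strings."""
    parts = line.split(None, 1) + ["", ""]
    return parts[0], parts[1]


def parse_job(text):
    """Collect the main command, the hook commands and the 'expect' mode."""
    job = {"expect": None}
    lines = iter(text.splitlines())
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name = "main"
        first, rest = split_word(line)
        if first in HOOKS:          # "pre-start exec ..." or "pre-start script"
            name = first
            first, rest = split_word(rest)
        if first == "exec":
            job[name] = ("exec", rest.strip())
        elif first == "script":
            body = []
            for inner in lines:     # consume lines up to "end script"
                if inner.strip() == "end script":
                    break
                body.append(inner)
            job[name] = ("script", "\n".join(body))
        elif first == "expect" and name == "main":
            job["expect"] = rest.strip()
    return job


def shell_argv(stanza):
    """Turn a parsed stanza into an argv list (assumption: run via /bin/sh -e)."""
    kind, body = stanza
    if kind == "exec":
        body = "exec " + body
    return ["/bin/sh", "-e", "-c", body]
```

With this, parse_job(SAMPLE_JOB)["main"] holds the daemon’s command line and shell_argv() turns it into something that could be handed to subprocess; a real replacement additionally has to handle environment stanzas, respawn and the rest of the job syntax.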
Unfortunately, one additional thing is needed for this to work. To stop a service reliably, we need to know its process ID. Traditionally, these services forked themselves and wrote their PIDs to a pidfile, which the “start-stop-daemon” used by the initscripts in /etc/init.d could then use to send a SIGHUP, SIGTERM or SIGKILL. When converting these services to use Upstart, the maintainers decided (logically, I have to add) that this was no longer needed: as it’s the init process itself that starts these services, it can keep track of the forks and the PIDs and so knows which process to kill when a service needs to be stopped. Now, while it is possible to re-add this functionality (saving the PIDs into pidfiles) to these services, since it would only mean calling them with some additional command-line arguments, doing so would conflict with our original goal of being able to use the stock Ubuntu config files. To keep that goal, we’ve also made an addition to “start-stop-daemon”: we gave it a “--trace-pid” argument that makes it track the PID of the process it started and save it to the place given in the “--pidfile” argument. For this to work properly, it needs to know whether the process will fork once or twice, but fortunately Upstart also needs to know that, so the information is already in the stock Upstart config files.
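As a rough illustration of the two halves of this, here is a hedged Python 3 sketch. It assumes the fork count comes from the job file’s “expect” stanza (“expect fork” means the service forks once, “expect daemon” means twice, and no stanza means it stays in the foreground), and it shows a pidfile-based stop. It does not reproduce the patched start-stop-daemon; the function names are made up for this example.

```python
import os
import signal

# How many times the service is expected to fork, keyed by "expect" mode.
FORKS_BY_EXPECT = {None: 0, "fork": 1, "daemon": 2}


def forks_expected(expect):
    """Map an 'expect' mode from a job file to a number of forks to follow."""
    if expect not in FORKS_BY_EXPECT:
        raise ValueError("unsupported expect mode: %r" % (expect,))
    return FORKS_BY_EXPECT[expect]


def read_pid(pidfile):
    """Read the PID stored in a pidfile."""
    with open(pidfile) as handle:
        return int(handle.read().strip())


def stop_from_pidfile(pidfile, sig=signal.SIGTERM):
    """Signal the process recorded in pidfile; return False if it is gone."""
    try:
        os.kill(read_pid(pidfile), sig)
    except ProcessLookupError:
        return False
    return True
```

Passing 0 as the signal only checks whether the recorded process still exists, which is a cheap way to implement a “status” query on top of the same pidfile.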
Let’s summarize the whole thing:
- Upstart simply won’t work in a chrooted environment.
- Lots of services in Lucid have been converted to use Upstart but a compatibility layer for System V init is maintained.
- It is not practical to re-write the old initscripts for all these services.
- A possible way to avoid that and still be able to boot a chrooted Lucid properly:
  - replace the Upstart-provided /sbin/initctl with the upstart-dummy script downloadable below
  - compile a new start-stop-daemon with the patch available below applied (the “dpkg” source package provides start-stop-daemon)
  - if some of the services do not have symlinks in /etc/rc*.d, re-add them the usual way using “update-rc.d” (most of them will have them, so there’s little to do here)
  - boot the chroot by starting “/etc/init.d/rcS” and “/etc/init.d/rc 2” after everything else is set up properly
The code for upstart-dummy and the start-stop-daemon patch is available here and here. The former is released under the GPLv2 or later; the latter is hereby put into the public domain, as the original start-stop-daemon code is public domain as well. Please note that although we’ve tested them and they seem to work just fine, we have a very specific environment where only a limited number of services need to be started and stopped. As usual, any questions, feedback and patches are welcome.
UPDATE: I’ve fixed a bunch of small bugs and created a git repo for upstart-dummy. It can be reached here: http://git.balabit.hu/?p=gyp/upstart-dummy.git;a=summary