Improving Linux Startup Time

Linux provides an amazing suite of functionality for embedded systems, and it is hard to overstate the benefits of using such a feature-rich and well-tested base.

This functionality comes at a price, however: compared to non-Linux systems it generally means increased complexity, larger binaries, and, more often than not, longer startup times. For some systems this longer startup time can be mitigated by never fully shutting down, relying instead on sleep modes to reduce power consumption when the system is not in use. However for other systems, such as non-battery-backed devices where each use requires a full power-on sequence, it is important that the cold power-on time is minimised.

As is often the case with complex systems, slow boot speed is generally not caused by one single issue, but rather by a large collection of smaller factors. Accordingly, a fast overall startup tends to be achieved through the accumulation of perhaps 10 to 20 minor improvements.

When looking at startup time, we typically break the boot time of a Linux system down into three broad stages:

  1. Power-on to Linux kernel load. This is typically referred to as the bootloader, but on some systems this can be entirely performed by the boot ROM. This phase starts when power is applied to the system, and is finished when the Linux kernel has been loaded from non-volatile storage into RAM.

  2. Linux boot. This covers the time it takes the Linux kernel to unpack itself and initialise sufficient hardware to mount a root filesystem and begin executing application (userspace) software.

  3. Application startup. This phase covers not just the start of the final application itself, but also all of the dependencies it might have - such as mounting filesystems, initialising network interfaces, performing any pre-run checks and so on.

Before attempting any improvements, it is important to analyse the existing system to determine the baseline boot speed.

For some systems this can be done by simply monitoring the console output, but this has two drawbacks: it cannot capture any time taken before the serial port is initialised, and, as the serial port is often one of the slowest peripherals, writing to it will itself slow the boot down. For these reasons it is often worth instrumenting lower-level timing mechanisms. The simplest is to use spare GPIO pins, toggling them at various stages of the boot. Monitoring these signals on an oscilloscope gives highly accurate timing without interfering with the boot itself. Such instrumentation can also be left enabled in the production software, making regression testing of boot speed easier (or even automated).
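As a sketch of the GPIO approach, the helper below pulses a pin through the legacy sysfs GPIO interface (newer kernels also offer a character-device interface via libgpiod). The pin number is an assumption for a spare pin on the board, and the sysfs root is passed as a parameter so the function can be exercised off-target; a hypothetical `mark_boot_stage()` call would be placed at each boot stage of interest.

```c
/* Minimal boot-stage marker using the legacy sysfs GPIO interface.
 * The pin number is an assumption -- substitute a spare pin on your board. */
#include <stdio.h>

#define BOOT_MARK_GPIO "42" /* assumption: a free pin on the hardware design */

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fputs(val, f);
    return fclose(f);
}

/* Pulse the pin high then low so an oscilloscope can time-stamp this boot
 * stage.  gpio_root is normally "/sys/class/gpio"; it is a parameter so the
 * function can be exercised off-target against a temporary directory tree. */
int mark_boot_stage(const char *gpio_root)
{
    char path[256];

    /* Exporting may fail harmlessly if the pin is already exported. */
    snprintf(path, sizeof(path), "%s/export", gpio_root);
    write_str(path, BOOT_MARK_GPIO);

    snprintf(path, sizeof(path), "%s/gpio" BOOT_MARK_GPIO "/direction", gpio_root);
    if (write_str(path, "out"))
        return -1;

    snprintf(path, sizeof(path), "%s/gpio" BOOT_MARK_GPIO "/value", gpio_root);
    if (write_str(path, "1"))
        return -1;
    return write_str(path, "0");
}
```

On a scope, each pulse appears as a sharp edge whose offset from power-on can be measured directly, and the markers cost microseconds rather than the milliseconds a serial print can take.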

Once a baseline has been established, various techniques can then be used to begin making improvements. There are far more of these techniques than can reasonably be covered here, but the main areas we focus on are:

  • Ensuring that the initial kernel load is being performed as fast as the hardware can support. Backing storage interfaces, such as SPI flash, often support multiple modes and speeds. Conservative defaults are usually supplied for these to ensure maximum compatibility across hardware designs. However, if the particular hardware design uses faster chips, that extra speed may go unexploited in the default configuration.

  • Ensuring that busy-wait loops in the bootloader are minimised. Before Linux has started, the bootloader tends to be a much simpler, single-threaded system, meaning one operation is not started until the previous operation has fully completed. Overlapping some operations (such as beginning kernel decompression in parallel with loading it from storage) can therefore provide significant gains.

  • Ensuring that only the drivers used on the embedded system are included in the kernel. By default Linux provides support for a massive array of hardware. For desktop environments this is ideal, as it provides maximum flexibility. However embedded systems typically have a much narrower target, and as such do not benefit from this flexibility. Reducing the drivers in the kernel not only reduces the binary size (making for a faster kernel load), it also drops the driver initialisation code, which directly impacts boot speed.

  • Choosing the correct filesystem for the application. In particular, use a read-only filesystem for the core system and a read-write filesystem for user data. This allows the read-write filesystem to be initialised in parallel with application startup.

  • Ensuring parallel execution of application startup. This may require some rework of the application, but having a system that can cope with some subsystems not being immediately available can provide significant speedups. For instance, the user may be unable to save new data for the first 2-3 seconds after startup; meanwhile, all other functionality is fully available in that time, which can dramatically improve the perceived performance of the system.
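To illustrate the first point above, the clock rate used for SPI NOR flash is typically capped by a `spi-max-frequency` property in the board's device tree. The fragment below is a sketch only: the `&qspi` controller label and the 50 MHz figure are assumptions, to be checked against the flash chip's datasheet and the board layout before raising the conservative default.

```
/* Illustrative device-tree overlay: raising the SPI flash clock.
 * Node label and frequency are assumptions for a hypothetical board. */
&qspi {
        status = "okay";
        flash@0 {
                compatible = "jedec,spi-nor";
                reg = <0>;
                spi-max-frequency = <50000000>; /* e.g. up from a 25 MHz default */
        };
};
```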
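For driver trimming, one common workflow is to maintain a small configuration fragment and merge it over a defconfig with the kernel's scripts/kconfig/merge_config.sh. The options below are purely illustrative examples of subsystems a headless embedded board may not need; the right set depends entirely on the hardware.

```
# Illustrative kconfig fragment (merged with scripts/kconfig/merge_config.sh).
# These options are examples only -- audit against the actual board.
# CONFIG_SOUND is not set
# CONFIG_WLAN is not set
# CONFIG_DRM is not set
CONFIG_SPI=y
CONFIG_MMC=y
```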
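The filesystem split can be sketched as an /etc/fstab fragment. The device names and eMMC partition layout here are assumptions for illustration; squashfs is one common choice of read-only root, with a separate writable partition for user data.

```
# Illustrative fstab: read-only root, separate writable data partition.
# Device names assume a typical eMMC layout and will differ per board.
/dev/mmcblk0p2  /      squashfs  ro                0  0
/dev/mmcblk0p3  /data  ext4      defaults,noatime  0  2
```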
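The bootloader overlap idea can be sketched in C as a double-buffered pipeline, the way a bootloader with a DMA-capable flash controller might structure it. `dma_start`, `dma_wait` and `decompress_chunk` are hypothetical stand-ins, not a real controller API; the point is the shape of the loop, where chunk n is decompressed while chunk n+1 is still in flight.

```c
/* Single-threaded sketch of overlapping flash reads with decompression:
 * kick off the next chunk's transfer, process the current chunk while it
 * is in flight, then wait for completion.  The dma_* and decompress
 * bodies are stand-ins for real hardware and decompressor code. */
#include <stdint.h>
#include <string.h>

#define CHUNKS     8
#define CHUNK_SIZE 4096

static uint8_t chunk_buf[2][CHUNK_SIZE]; /* double buffer: one in flight, one in use */

/* Stand-in for programming a DMA transfer of chunk n into buf. */
static void dma_start(int n, uint8_t *buf) { (void)n; memset(buf, 1, CHUNK_SIZE); }

/* Stand-in for polling the controller's transfer-complete flag. */
static void dma_wait(void) { }

/* Stand-in for feeding one chunk to the decompressor; returns bytes consumed. */
static long decompress_chunk(const uint8_t *buf)
{
    long sum = 0;
    for (int i = 0; i < CHUNK_SIZE; i++)
        sum += buf[i];
    return sum;
}

long load_and_unpack(void)
{
    long total = 0;
    dma_start(0, chunk_buf[0]);                        /* prime the pipeline */
    for (int n = 0; n < CHUNKS; n++) {
        dma_wait();                                    /* chunk n has landed */
        if (n + 1 < CHUNKS)
            dma_start(n + 1, chunk_buf[(n + 1) % 2]);  /* next chunk in flight */
        total += decompress_chunk(chunk_buf[n % 2]);   /* overlaps the transfer */
    }
    return total;
}
```

With the transfer and decompression overlapped, the total time approaches the longer of the two operations rather than their sum.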
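The deferred-availability pattern from the last point can be sketched as follows. The function names and tick count are illustrative, not from any particular framework: the application's main loop starts immediately and simply refuses save requests until background storage initialisation has completed.

```c
/* Sketch of tolerating a late subsystem: the application comes up
 * immediately, while storage initialisation advances one step per main-loop
 * tick (a stand-in for asynchronous mount/check progress).  Saving is
 * refused until initialisation completes, but everything else is live. */
#include <stdbool.h>

#define INIT_TICKS 3 /* assumption: storage init takes a few loop ticks */

static int init_progress = 0;

/* Called from the main loop; returns true once storage is usable. */
bool storage_tick(void)
{
    if (init_progress < INIT_TICKS)
        init_progress++;
    return init_progress >= INIT_TICKS;
}

/* Returns 0 on success, -1 if storage is not available yet --
 * the UI can offer all other functionality in the meantime. */
int save_user_data(void)
{
    if (init_progress < INIT_TICKS)
        return -1;
    /* ... real write path would go here ... */
    return 0;
}
```

From the user's point of view the device is responsive immediately; only the save action is briefly unavailable, which is the trade described above.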

Boot speed is heavily discussed online, and the state of the art is constantly evolving as new hardware becomes available and new software is written. There are some excellent resources available on the topic, such as: