Column 8: If one CPU is not enough. (2011-08-17)

It has always been much easier to build two of the same CPUs than a single CPU that runs twice as fast. If you already have a machine with the fastest CPU available at any given time, buying a CPU that runs twice as fast is simply not possible. Buying more of the same CPUs is possible (if expensive), and so is building a machine that can use them all. Symmetric Multiprocessing (SMP) machines contain more than one CPU, but are otherwise a single system with a single main memory and a single I/O bus.

These days, multi-core CPUs are the standard, and it would be foolish if Coreboot could not support them. Coreboot must configure the hardware on an SMP system correctly, it must provide the correct tables to the Operating System (so that it knows which CPUs it should enable), and in most cases it has to run some initialization code on each CPU.

There are several types of SMP.

Classic SMP. A motherboard has several CPU sockets (typically two) and each socket may contain a CPU. Not all CPUs are capable of being used in a classic SMP system.
Multi-core CPU. A single CPU package contains more than one CPU core. Each CPU core is more or less a complete CPU (with its own L1 cache), but the cores share some resources, such as the L2 cache and, if it is part of the CPU package, the memory controller.
Hyperthreading. Only the register set (including the software-visible CPU state) is replicated; the execution units of the CPU are not. Two programs can be active on the CPU at once, and the CPU can switch between them with zero overhead. When one program is waiting for data from RAM (a cache miss), the other program can run. On the other hand, if both programs are doing computation-intensive work involving lots of multiply instructions, they must share the single hardware multiplier. To the software, one Hyperthreading CPU appears as two independent CPUs; for example, one thread could run in real mode while the other runs in protected mode. All threads share the same caches.
NUMA (Non-Uniform Memory Access). This architecture is used by SMP-capable AMD systems. Each physical CPU has its own RAM controller (northbridge functionality) and its own locally attached RAM. All RAM is accessible from all CPUs, but an access to the CPU's local RAM is faster than an access to the RAM attached to a different CPU. The operating system must know which RAM address ranges belong to which CPU, so that it can use the system efficiently. The job of Coreboot is much more complicated on such a system.
Clusters. This is not a form of SMP, but it is the next logical step after NUMA to increase the capacity of a multi-CPU system. A cluster contains several motherboards (each with its own local disk storage) interconnected via fast networks. Each motherboard runs its own instance of an operating system, but software distributes applications across the machines, so the system as a whole appears to the user as one giant machine.
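To make the NUMA point concrete, here is a minimal C sketch of the kind of information the operating system needs: a map from physical address ranges to the CPU (node) that owns them. The names struct numa_range and node_for_address are hypothetical, chosen for illustration; real firmware reports this information through its own tables, whose format is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one NUMA node: the RAM range attached
 * to one physical CPU's memory controller. */
struct numa_range {
    unsigned node;   /* physical CPU (socket) that owns this RAM */
    uint64_t base;   /* first physical address of the range */
    uint64_t size;   /* length of the range in bytes */
};

/* Return the node that owns physical address addr, or -1 if no node
 * claims it (for example a hole in the address map). */
int node_for_address(const struct numa_range *map, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        /* Written as a subtraction so base + size cannot overflow. */
        if (addr >= map[i].base && addr - map[i].base < map[i].size)
            return (int)map[i].node;
    }
    return -1;
}
```

With two sockets of 2 GiB each (node 0 at 0x00000000, node 1 at 0x80000000), an address such as 0x80001000 is owned by node 1, so an operating system would prefer to run code that uses it on that CPU.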

Different types of SMP can be combined in a single system. A top-of-the-line motherboard could have two physical CPUs (classic SMP), each having four cores (multi-core), each core being capable of Hyperthreading.
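That combined example works out as a simple product: 2 sockets × 4 cores × 2 threads gives 16 logical CPUs, each of which the operating system treats as a separate CPU. A small sketch, using a hypothetical struct cpu_topology that is not taken from Coreboot:

```c
#include <stdio.h>

/* Hypothetical description of a machine's CPU topology. */
struct cpu_topology {
    unsigned sockets;          /* physical CPU packages (classic SMP) */
    unsigned cores_per_socket; /* multi-core */
    unsigned threads_per_core; /* 2 with Hyperthreading, otherwise 1 */
};

/* Number of logical CPUs the operating system will see. */
unsigned logical_cpu_count(const struct cpu_topology *t)
{
    return t->sockets * t->cores_per_socket * t->threads_per_core;
}

/* Print every logical CPU as a (socket, core, thread) triple. */
void list_logical_cpus(const struct cpu_topology *t)
{
    unsigned n = 0;
    for (unsigned s = 0; s < t->sockets; s++)
        for (unsigned c = 0; c < t->cores_per_socket; c++)
            for (unsigned h = 0; h < t->threads_per_core; h++)
                printf("logical CPU %u: socket %u, core %u, thread %u\n",
                       n++, s, c, h);
}
```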

History

SMP systems were already common in the mainframe world in the 1970s. In the 1980s Sequent built SMP machines (running Unix) based on off-the-shelf microcomputers. In 1987 they introduced a model based on the Intel 80386. They pioneered many of the hardware and software principles found in modern SMP machines running Linux.

With the introduction of the Intel Pentium in 1993, SMP on PC-class machines began for real, although no mainstream operating system supported SMP at that time. Windows NT 4.0 (1996) and Linux 2.0 (1996) had some support for SMP. Until 2003, SMP was reserved for high-end motherboards and high-end CPUs. When Intel introduced its Pentium 4 with Hyperthreading, SMP-capable systems started to become common in home computers. By 2006, dual-core CPUs had become common, and as of 2011, essentially all PC-class machines have at least a dual-core CPU.

Hardware

The hardware of an SMP system is very complex. It involves (among other things) the following:

Connecting several CPUs via their Front Side Buses to a single northbridge (Intel) or connecting them to each other and to the southbridge via HyperTransport (AMD).
Ensuring cache coherency among the CPUs.
Delivering interrupts from devices to CPUs and among CPUs.

In a multi-core CPU, these tasks are performed on-chip.

SMP systems use APIC (Advanced Programmable Interrupt Controller) logic to deliver interrupts to one or more CPUs. The local APIC of each CPU is now part of the CPU itself. Device interrupts can be delivered to one or more CPUs, and CPUs can send interrupts to each other (Inter-Processor Interrupts, or IPIs). In particular, the boot CPU can start and stop the other CPUs in the system via a sequence of IPIs.
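As an illustration of how an IPI is described to the hardware, the sketch below builds the two halves of the xAPIC Interrupt Command Register (ICR) as documented in Intel's manuals: the low dword holds the vector (bits 7:0), the delivery mode (bits 10:8) and the level bit (bit 14), and the high dword holds the destination APIC ID (bits 31:24). Only the bit encoding is shown. On real hardware the high dword is written to offset 0x310 and then the low dword to offset 0x300 of the local APIC (default base 0xFEE00000); the write to the low dword is what sends the IPI. The function and enum names are hypothetical.

```c
#include <stdint.h>

/* Delivery modes, encoded in bits 10:8 of the ICR low dword. */
enum icr_delivery_mode {
    ICR_FIXED   = 0x0,
    ICR_NMI     = 0x4,
    ICR_INIT    = 0x5,
    ICR_STARTUP = 0x6,
};

/* Bit 14: level = assert. Required for every delivery mode except
 * the (legacy) INIT level de-assert message, which is not built here. */
#define ICR_LEVEL_ASSERT (1u << 14)

/* Low 32 bits of the ICR: physical destination mode (bit 11 = 0),
 * edge trigger (bit 15 = 0), no destination shorthand (bits 19:18 = 0). */
uint32_t icr_low(enum icr_delivery_mode mode, uint8_t vector)
{
    return ICR_LEVEL_ASSERT | ((uint32_t)mode << 8) | vector;
}

/* High 32 bits of the ICR: in xAPIC mode the 8-bit APIC ID of the
 * target CPU goes in bits 31:24. */
uint32_t icr_high(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}
```

For example, an INIT IPI encodes as 0x4500 and a STARTUP IPI with vector 0x08 as 0x4608.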

When a hardware reset occurs, only one CPU actually starts running and the others remain stopped. The CPU that starts running is the boot CPU; the others are called application processors. In an x86 SMP system, each application processor starts in real mode at a page-aligned address (a multiple of 4096), which has to lie in the real-mode address range (below 1M). The boot CPU has to send a sequence of IPIs (with the start address as a parameter) to start each application processor.
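The start address is passed as the 8-bit vector of a STARTUP IPI, and the application processor begins executing at physical address vector × 4096 (CS = vector << 8, IP = 0); this is why the address must be page-aligned and below 1 MiB. A minimal sketch, assuming hypothetical send_ipi and delay_us helpers (stubs that only print here) and the INIT, STARTUP, STARTUP sequence with 10 ms and 200 µs delays given in Intel's MultiProcessor Specification:

```c
#include <stdint.h>
#include <stdio.h>

/* Convert an AP start address into a STARTUP IPI vector.
 * Returns 0 on success, -1 if the address cannot be encoded. */
int sipi_vector_for(uint32_t start_addr, uint8_t *vector)
{
    if (start_addr & 0xFFFu)      /* must be page-aligned */
        return -1;
    if (start_addr >= 0x100000u)  /* must lie below 1 MiB */
        return -1;
    *vector = (uint8_t)(start_addr >> 12);
    return 0;
}

/* Hypothetical stand-ins for firmware helpers: here they only log. */
static void send_ipi(uint8_t apic_id, const char *type, uint8_t vector)
{
    printf("IPI %-7s -> APIC %u (vector 0x%02x)\n",
           type, (unsigned)apic_id, (unsigned)vector);
}

static void delay_us(unsigned us)
{
    printf("wait %u us\n", us);
}

/* Start one application processor at start_addr. */
int start_ap(uint8_t apic_id, uint32_t start_addr)
{
    uint8_t vector;
    if (sipi_vector_for(start_addr, &vector) != 0)
        return -1;
    send_ipi(apic_id, "INIT", 0);      /* reset the AP */
    delay_us(10000);                   /* 10 ms */
    for (int i = 0; i < 2; i++) {      /* STARTUP is sent twice */
        send_ipi(apic_id, "STARTUP", vector);
        delay_us(200);                 /* 200 us */
    }
    return 0;
}
```

A real implementation would also check the delivery-status bit after each IPI and wait for the application processor to signal that it is alive.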

BIOS and the Operating System

In short, the responsibilities of the BIOS and the Operating System are as follows: the BIOS has to supply the correct tables that specify which CPUs are available, and the Operating System has to start them up.

Nearly all of Coreboot is run by the boot CPU. Coreboot has to perform the following tasks for SMP systems:

Detect the presence of application CPUs. In classic SMP systems, some of the sockets could be empty, and Coreboot has to detect this, for instance by sending an IPI to each CPU and checking whether it is acknowledged. For multi-core and Hyperthreading CPUs, the CPU model determines how many cores and threads are present, and Coreboot can find this out via the CPUID instruction. A single motherboard could be fitted with different (related) CPU types, possibly with a different number of cores, or with and without Hyperthreading.
Configure any required hardware for the number of installed CPUs. This applies in particular to the HyperTransport links in AMD systems.
Create the correct tables that the Operating System will use to find out which CPUs are present. These are the MP Floating Pointer Structure and the MP Configuration Table.
In many cases Coreboot has to run a short piece of code on each of the CPUs (for instance to configure cache-related settings per CPU). Coreboot starts and stops each application processor, just like the Operating System would do.
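For the CPUID-based detection step, a sketch of decoding leaf 1: EDX bit 9 reports an on-chip local APIC, EDX bit 28 (HTT) says whether the package can hold more than one logical processor, EBX bits 23:16 then give the maximum number of addressable logical-processor IDs in the package, and EBX bits 31:24 give the initial APIC ID of the executing CPU. That count is an upper bound rather than an exact number of present CPUs; real code would also consult leaf 4 (Intel) or leaf 0x80000008 (AMD) for the number of cores. The struct and function names are hypothetical; __get_cpuid is the GCC/Clang helper from <cpuid.h>.

```c
#include <stdbool.h>
#include <stdint.h>

struct cpuid1_info {
    bool     has_apic;        /* EDX bit 9: on-chip local APIC */
    bool     htt;             /* EDX bit 28: multi-threaded/multi-core package */
    unsigned max_logical;     /* EBX bits 23:16, only valid if htt is set */
    unsigned initial_apic_id; /* EBX bits 31:24 */
};

/* Decode the raw EBX and EDX values returned by CPUID leaf 1. */
struct cpuid1_info decode_cpuid1(uint32_t ebx, uint32_t edx)
{
    struct cpuid1_info info;
    info.has_apic        = (edx >> 9) & 1u;
    info.htt             = (edx >> 28) & 1u;
    info.max_logical     = info.htt ? (ebx >> 16) & 0xFFu : 1u;
    info.initial_apic_id = (ebx >> 24) & 0xFFu;
    return info;
}

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
/* Read leaf 1 on the CPU executing this code. Returns -1 if the leaf
 * is not supported. */
int read_cpuid1(struct cpuid1_info *out)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    *out = decode_cpuid1(ebx, edx);
    return 0;
}
#endif
```

Because each CPU reports only its own initial APIC ID, code like this has to run on every CPU to build a complete picture, which is one reason Coreboot runs a short piece of code on each of them.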

The operating system has the following tasks to start the application processors up:

Read and parse the tables supplied by the BIOS to find out which application processors are available.
Provide start-up code for the application processor to run. In Linux, this code has to switch the CPU to protected mode, enable paging, and then jump to the appropriate address in the kernel. This start-up code is often called the "trampoline". It has to reside in real-mode memory (below 1M) at a page-aligned address.
Send a sequence of IPI messages to actually start the application processor.
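The first table the operating system looks for is the MP Floating Pointer Structure, which Coreboot has to build. According to Intel's MultiProcessor Specification, it is a 16-byte structure with the signature "_MP_", aligned on a 16-byte boundary, located in the first kilobyte of the Extended BIOS Data Area, the last kilobyte of base memory, or the BIOS ROM area between 0xF0000 and 0xFFFFF, and all of its bytes must sum to zero. The sketch below shows both sides against a plain buffer standing in for physical memory: filling in the structure with a valid checksum, and scanning for it. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* MP Floating Pointer Structure (MP specification 1.4). */
struct mp_floating_pointer {
    char     signature[4];  /* "_MP_" */
    uint32_t config_table;  /* physical address of the MP Configuration Table */
    uint8_t  length;        /* size in 16-byte units, normally 1 */
    uint8_t  spec_rev;      /* 1 = version 1.1, 4 = version 1.4 */
    uint8_t  checksum;      /* makes all bytes sum to 0 (mod 256) */
    uint8_t  feature[5];    /* feature[0] != 0 selects a default configuration */
};
_Static_assert(sizeof(struct mp_floating_pointer) == 16, "must be 16 bytes");

static uint8_t checksum8(const uint8_t *p, size_t n)
{
    uint8_t sum = 0;
    while (n--)
        sum += *p++;
    return sum;
}

/* Firmware side: fill in the structure and pick the checksum byte. */
void mp_fp_fill(struct mp_floating_pointer *fp, uint32_t config_table)
{
    memset(fp, 0, sizeof *fp);
    memcpy(fp->signature, "_MP_", 4);
    fp->config_table = config_table;
    fp->length = 1;
    fp->spec_rev = 4;
    /* checksum is still 0 here, so negate the sum of the other bytes. */
    fp->checksum = (uint8_t)(0 - checksum8((const uint8_t *)fp, sizeof *fp));
}

/* OS side: scan a memory region on 16-byte boundaries. Returns the
 * offset of a valid structure, or -1 if none is found. */
long find_mp_floating_pointer(const uint8_t *mem, size_t len)
{
    for (size_t off = 0; off + 16 <= len; off += 16) {
        const uint8_t *p = mem + off;
        if (memcmp(p, "_MP_", 4) != 0)
            continue;
        size_t size = (size_t)p[8] * 16;
        if (size == 0 || off + size > len)
            continue;
        if (checksum8(p, size) == 0)
            return (long)off;
    }
    return -1;
}
```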

Setting up NUMA

The AMD Opteron CPUs have their own integrated memory controllers and SMP capability, resulting in a NUMA system. Setting this up is somewhat harder than on most other systems. The main issue is that each CPU has to run its own RAM initialization code. In a system with many CPUs, the boot CPU may not be able to set up all HyperTransport links, so this task has to be delegated to application processors as well. A few notes:

The boot CPU can set up its own memory controller first. Application processors can use this RAM range right from the start and they do not have to rely on tricks such as "Cache as RAM".
The boot CPU can read all SPD EEPROMs from all memory modules and can pass this information to the application processors. The code running on the application processors only has to configure the RAM controller.
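The division of labor described in these notes can be sketched as follows: the boot CPU reads every SPD EEPROM once and stores the results in RAM behind its own, already initialized, memory controller, and each application processor later configures only its own controller from that data. Everything here is hypothetical (the limits, the struct layout, and the read_spd stub that pretends each node has one DIMM in slot 0); it is not Coreboot's actual data structure.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES      8   /* hypothetical limits */
#define DIMMS_PER_NODE 4
#define SPD_BYTES      128

struct dimm_spd {
    bool    present;
    uint8_t data[SPD_BYTES];
};

/* Per-node data the boot CPU hands to each application processor. It
 * lives in RAM behind the boot CPU's memory controller, which is set up
 * first, so the application processors can read it directly. */
struct node_mem_info {
    struct dimm_spd dimm[DIMMS_PER_NODE];
};

static struct node_mem_info mem_info[MAX_NODES];

/* Hypothetical SMBus read stub: only slot 0 is populated, with dummy data. */
static bool read_spd(unsigned node, unsigned slot, uint8_t *buf)
{
    (void)node;
    if (slot != 0)
        return false;
    memset(buf, 0, SPD_BYTES);
    return true;
}

/* Boot CPU: read every SPD EEPROM once, for all nodes. */
void bsp_collect_spd(unsigned nodes)
{
    if (nodes > MAX_NODES)
        nodes = MAX_NODES;
    for (unsigned n = 0; n < nodes; n++)
        for (unsigned s = 0; s < DIMMS_PER_NODE; s++)
            mem_info[n].dimm[s].present =
                read_spd(n, s, mem_info[n].dimm[s].data);
}

/* Application processor: configure only its own memory controller. */
unsigned ap_configure_memory(unsigned node)
{
    unsigned populated = 0;
    if (node >= MAX_NODES)
        return 0;
    for (unsigned s = 0; s < DIMMS_PER_NODE; s++) {
        if (!mem_info[node].dimm[s].present)
            continue;
        /* Real code would derive sizes and timings from
         * mem_info[node].dimm[s].data and program this node's
         * controller registers here. */
        populated++;
    }
    printf("node %u: %u DIMM(s) configured\n", node, populated);
    return populated;
}
```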