Column 8: If one CPU is not enough. (2011-08-17)
It has always been much easier to build two of the same CPUs than a
single CPU that runs twice as fast. If you purchase a machine with the
fastest CPU available at that time, buying a CPU that runs twice as
fast is simply not possible. Buying more of the same CPUs is
indeed possible (if expensive) and building a machine that can use
them all is also possible. Symmetric Multiprocessing (SMP) machines
contain more than one CPU, but are otherwise a single system with a
single main memory and a single I/O bus.
These days, multi-core CPUs are the standard and it would be foolish
if Coreboot could not support them. Coreboot has three jobs on an SMP
system: it must configure the hardware correctly, it must provide the
correct tables to the Operating System (so that it knows which CPUs it
should enable), and in most cases it has to run some initialization
code on each CPU.
There are several types of SMP.
- Classic SMP. A motherboard has several CPU sockets (typically
two) and each socket may contain a CPU. Not all CPUs are capable of
being used in a classic SMP system.
- Multi-core CPU. A single CPU package contains more than one CPU
core. Each CPU core is more or less a complete CPU (with its own L1
cache), but they share some resources (such as the L2 cache and the
memory controller if that is part of the CPU package).
- Hyperthreading. Only the register set (including the
software-visible CPU state) is replicated, but the actual execution
units of the CPU are not. Two programs can be active on the CPU at
once and the CPU can switch between them with zero overhead. When
one program is waiting for data from RAM (cache miss) the other
program can run. On the other hand, if two programs are doing
computation-intensive work involving lots of multiply
instructions, there is only one hardware multiplier that must be
shared by both programs. To the software, one Hyperthreading CPU appears
as two independent CPUs. For example, one thread could run in real
mode, the other one in protected mode. All threads share the same caches.
- NUMA (Non-Uniform Memory Access). This system is used by
SMP-capable AMD systems. Each CPU has its own RAM controller
(northbridge functionality) and each physical CPU has its own
RAM. All RAM is accessible from all CPUs, but of course an access to
the CPU's local RAM is faster than an access to the RAM of a
different CPU. The operating system must know what RAM address
ranges belong to which CPU, so it can utilize the system
efficiently. The job of Coreboot is much more complicated on such a
system.
- Clusters. This is not a form of SMP, but it is the next logical
step after NUMA to increase the capacity of a multi-CPU system. A
cluster contains several motherboards (each with its local disk
storage) that are interconnected via fast networks. Each
motherboard runs its own instance of an operating system, but there
is software to divide applications over each of the machines, so
the system as a whole appears to the user as one giant machine.
Different types of SMP can be combined in a single system. A
top-of-the-line motherboard could have two physical CPUs (classic SMP),
each having four cores (multi-core), each being capable of Hyperthreading.
History
SMP systems were already common in the mainframe world in the
1970s. In the 1980s Sequent built SMP machines (running Unix) based on
off-the-shelf microprocessors. In 1987 they introduced a model based on
the Intel 80386. They pioneered many of the hardware and software
principles found in modern SMP machines running Linux.
With the introduction of the Intel Pentium in 1993, SMP on PC-class
machines began for real. No mainstream operating system supported SMP
at that time. Windows NT 4.0 (1996) and Linux 2.0 (1996) had some
support for SMP. Until 2003, SMP was reserved for high-end
motherboards and high-end CPUs. When Intel introduced its Pentium 4
with Hyperthreading, SMP-capable systems started to be common in home
computers. By 2006, dual-core CPUs had become common. As of 2011,
essentially all PC-class machines have at least a dual-core CPU.
Hardware
The hardware of an SMP system is very complex. It involves (among
other things) the following:
- Connecting several CPUs via their Front Side Buses to a single
northbridge (Intel) or connecting them to each other and to the
southbridge via HyperTransport (AMD).
- Ensuring cache coherency among the CPUs.
- Delivering interrupts from devices to CPUs and among CPUs.
In a multi-core CPU these tasks are performed on-chip.
SMP systems use the APIC (Advanced Programmable Interrupt Controller)
logic to deliver interrupts to one or more CPUs. This logic is now
part of the CPU. Device interrupts can be delivered to one or more
CPUs and CPUs can send interrupts among themselves (Inter Processor
Interrupt or IPI). In particular the boot CPU can start and stop the
other CPUs in the system via a sequence of IPIs.
When a hardware reset occurs, one CPU starts actually running and the
others are stopped. The CPU that starts running is the boot CPU; the
others are called Application Processors. In an x86 SMP system, each
application processor starts in real mode at a page-aligned address
(multiple of 4096), which has to lie in the real mode address range.
The boot CPU has to send a sequence of IPIs (with the start address as
a parameter) to start each application processor.
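The relation between the start address and the IPI parameter can be sketched in C. This is a minimal illustration, not Coreboot's actual code: all function and macro names below are hypothetical, and the bit positions are those of the local APIC Interrupt Command Register in xAPIC mode. The sketch only computes the register values; writing them to the local APIC and waiting between the IPIs is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of the low dword of the Interrupt Command Register (ICR). */
#define ICR_DELIVERY_INIT    (5u << 8)   /* delivery mode 101b: INIT     */
#define ICR_DELIVERY_STARTUP (6u << 8)   /* delivery mode 110b: Start-up */
#define ICR_LEVEL_ASSERT     (1u << 14)

/* A start address is usable only if it is 4096-aligned and below 1 MiB. */
static bool ap_start_address_ok(uint32_t addr)
{
    return (addr & 0xFFFu) == 0 && addr < 0x100000u;
}

/* The Start-up IPI carries the page number of the start address. */
static uint8_t sipi_vector(uint32_t addr)
{
    return (uint8_t)(addr >> 12);
}

/* ICR low dword for the INIT IPI that precedes the Start-up IPIs. */
static uint32_t icr_low_init(void)
{
    return ICR_DELIVERY_INIT | ICR_LEVEL_ASSERT;
}

/* ICR low dword for a Start-up IPI pointing at addr. */
static uint32_t icr_low_startup(uint32_t addr)
{
    return ICR_DELIVERY_STARTUP | ICR_LEVEL_ASSERT | sipi_vector(addr);
}

/* ICR high dword: destination APIC ID in bits 24..31 (xAPIC mode). */
static uint32_t icr_high_dest(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}
```

Because the parameter is only eight bits wide, it can express nothing but a page number below 1 MiB, which is why the start address must be page-aligned and must lie in the real mode address range.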
BIOS and the Operating System
In short, the responsibilities of the BIOS and the Operating System
are as follows: the BIOS has to supply the correct tables that specify
which CPUs are available; the Operating System has to start them up.
Nearly all of Coreboot is run by the boot CPU. Coreboot has to perform
the following tasks for SMP systems:
- Detect the presence of application CPUs. For classic SMP systems,
some of the sockets could be empty. Coreboot has to detect this, for
instance by trying to send an IPI to the CPU and checking whether it
is acknowledged. For multi-core and Hyperthreading CPUs, the CPU
model specifies how many cores and threads are present. Coreboot can
find this out via the CPUID instruction. A single motherboard could
be fitted with different (related) CPU types, possibly with a
different number of cores or with and without Hyperthreading.
- Configure any required hardware for the number of installed
CPUs. This applies in particular to the HyperTransport links in AMD
systems.
- Create the correct tables that the Operating System will use to
find out which CPUs are present. These are the MP Floating Pointer
Structure and the MP Configuration Table.
- In many cases Coreboot has to run a short piece of code on each of
the CPUs (for instance to configure cache-related settings per
CPU). Coreboot starts and stops each application processor, just
like the Operating System would do.
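As an illustration of the CPUID-based detection mentioned in the list above, the following C sketch decodes the relevant fields of CPUID leaf 1. It is an assumption-laden example rather than Coreboot's code: the struct and function names are hypothetical, and the function takes the register values as arguments so that it does not depend on a particular compiler's way of executing CPUID. Note that the count in EBX is the maximum number of addressable logical processors in the package, which can be larger than the number actually present; newer CPUs provide more precise topology information through other CPUID leaves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected fields of CPUID leaf 1 (hypothetical helper type). */
struct cpuid1_info {
    bool    htt;             /* EDX bit 28: package may hold several logical CPUs */
    uint8_t logical_count;   /* EBX bits 16..23, meaningful only if htt is set    */
    uint8_t initial_apic_id; /* EBX bits 24..31                                   */
};

/* Decode the EBX and EDX values returned by CPUID with EAX = 1. */
static struct cpuid1_info decode_cpuid1(uint32_t ebx, uint32_t edx)
{
    struct cpuid1_info info;

    info.htt = ((edx >> 28) & 1u) != 0;
    /* Without the HTT flag the count field is not valid; assume one CPU. */
    info.logical_count = info.htt ? (uint8_t)((ebx >> 16) & 0xFFu) : 1;
    info.initial_apic_id = (uint8_t)(ebx >> 24);
    return info;
}
```

With GCC or Clang, the register values could be obtained with `__get_cpuid(1, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>` and then passed to `decode_cpuid1(ebx, edx)`.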
The operating system has the following tasks to start the application
processors up:
- Read and parse the tables supplied by the BIOS to find out what
application processors are available.
- Provide start-up code for the application processor to run. In
Linux, this code has to switch the CPU to protected mode and then it
has to enable paging. Then it must jump to the appropriate address in
the kernel. This start-up code is often called the "trampoline". It has
to reside in real-mode memory (below 1 MiB) and at a page-aligned address.
- Send a sequence of IPI messages to actually start the application
processor.
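The first step, finding the tables, can be sketched as follows. The floating pointer table (called the MP Floating Pointer Structure in Intel's MultiProcessor Specification) is 16 bytes long, starts with the signature "_MP_" on a 16-byte boundary, and carries a checksum byte chosen so that all its bytes add up to zero. The code below is a simplified sketch with hypothetical function names; it scans an ordinary byte buffer rather than the physical memory areas a real operating system would search, and it does not go on to parse the MP Configuration Table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MP_FLOATING_SIZE 16u

/* Sum of all bytes modulo 256; a valid structure sums to zero. */
static uint8_t byte_sum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum;
}

/* Return the offset of the first valid structure in buf, or -1. */
static long find_mp_floating(const uint8_t *buf, size_t len)
{
    for (size_t off = 0; off + MP_FLOATING_SIZE <= len; off += 16) {
        const uint8_t *p = buf + off;

        if (memcmp(p, "_MP_", 4) != 0)
            continue;               /* no signature here */
        if (p[8] != 1)
            continue;               /* length field counts 16-byte units */
        if (byte_sum(p, MP_FLOATING_SIZE) != 0)
            continue;               /* corrupt or accidental match */
        return (long)off;
    }
    return -1;
}

/* Firmware side: set the checksum byte (offset 10) so the sum is zero. */
static void mp_floating_fix_checksum(uint8_t *p)
{
    p[10] = 0;
    p[10] = (uint8_t)(0u - byte_sum(p, MP_FLOATING_SIZE));
}
```

The same checksum rule serves both sides: the firmware calls something like `mp_floating_fix_checksum()` after filling in the structure, and the operating system rejects any candidate whose bytes do not sum to zero.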
Setting up NUMA
The AMD Opteron CPUs have their own integrated memory controllers and
they have SMP capability, resulting in a NUMA system. The job of
setting this up is a bit tougher than for most other systems. The
main issue is that each CPU has to run its own RAM initialization
code. In a system with many CPUs, the boot CPU may not be able to set
up all HyperTransport links, so this has to be delegated to
application processors as well.
A few notes:
- The boot CPU can set up its own memory controller
first. Application processors can use this RAM range right from the
start and they do not have to rely on tricks such as "Cache as
RAM".
- The boot CPU can read all SPD EEPROMs from all memory modules and
can pass this information to the application processors. The
code running on the application processors only has to configure the
RAM controller.
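Once every memory controller is configured, the firmware knows how much RAM sits behind each CPU, and from that it can derive the map of address ranges per CPU that the operating system needs. The following C sketch shows the idea under a strong simplifying assumption: the RAM of the nodes is laid out back to back starting at address 0, with no memory holes and no interleaving. The type and function names are hypothetical.

```c
#include <stdint.h>

/* RAM owned by one CPU (node) in the physical address space. */
struct node_range {
    uint64_t base;  /* first physical address owned by this node       */
    uint64_t size;  /* bytes of RAM behind this node's RAM controller  */
};

/*
 * Assign each node a base address so that the ranges follow each other
 * without gaps. The size fields must be filled in beforehand.
 * Returns the total amount of RAM.
 */
static uint64_t assign_node_ranges(struct node_range *nodes, int count)
{
    uint64_t next = 0;

    for (int i = 0; i < count; i++) {
        nodes[i].base = next;
        next += nodes[i].size;
    }
    return next;
}

/* Return the index of the node whose RAM contains addr, or -1. */
static int node_for_address(const struct node_range *nodes, int count,
                            uint64_t addr)
{
    for (int i = 0; i < count; i++) {
        if (addr >= nodes[i].base && addr - nodes[i].base < nodes[i].size)
            return i;
    }
    return -1;
}
```

A lookup such as `node_for_address()` is what lets an operating system prefer memory that is local to the CPU a program runs on.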