In the second part of this two-part article, published here, we report on Len's 10 ACPI myths.
Len Brown can only be a glutton for punishment; he is, after all, the maintainer of the Linux ACPI subsystem.
That is a difficult position to be in: ACPI involves getting into the BIOS layer, an area of system software which is not always known for careful, high-quality work. Supporting ACPI is a complex task which, among other things, requires the embedding of a specialised interpreter within the kernel, a hard sell at best.
Even with that background in mind, one must wonder just how much masochism is required to lead one to deliver three separate talks at the 2007 Ottawa Linux Symposium. That is just what Len did, however; the end result was a good view into several aspects of the power management problem.
Getting more from tickless
The first talk (on the tickless kernel) was supposed to be given by Suresh Siddha, who was unable to attend the event. The dynamic tick patches have been covered here before. Suresh/Len's talk was not really about how these patches work, but, instead, about the work which remains to be done to take full advantage of the tickless design. It seems that the work which has been done so far is just the beginning.
The problem is that, on a system used by Suresh and company, the average processor sleep time was still less than 1ms even after the dynamic tick code was enabled. Given that, one of the driving reasons for dynamic tick was to let the processor sleep for long periods of time - thus saving power - this is a disappointing result. It turns out that there is a lot which can be done to improve the situation, though.
Step number one is to address a kernel-space problem: there are a lot of active kernel timers which tend to spread out over time. As a result, the kernel wakes up much more often than it would if the timers were sufficiently well coordinated to expire at the same time whenever possible. As it happens, many kernel timers do not need great precision; a timer which fires some number of milliseconds later than scheduled is not a problem. So, if the kernel could defer some timers to fire at the same time as others, it can reduce the number of wakeups.
The deferrable timers patch does exactly that; the round_jiffies() function added in 2.6.19 can also help the kernel line up events. Adding this code brought the average sleep time up to 20ms, with the system handling 90 interrupts per second.
Next is the problem of hardware timers. On the i386 architecture, the preferred timer is the local APIC (LAPIC) timer, which is built into the processor and very fast to program. Unfortunately, putting the processor into a deep sleep also puts the LAPIC timer to sleep, a situation Len compared to unplugging one's alarm clock before going to bed.
In either case, oversleeping can be the unwanted result. The programmable interval timer (PIT) remains awake and is easily used, but it has a maximum event time of 27ms. If one wants the processor to sleep for longer than that, another solution must be found. That solution is the high-precision event timer (HPET), which has a maximum interval of at least three seconds. Getting access to the HPET can be hard, though; good BIOS support is spotty at best and the HPET is often disabled. If it can be forced on, however, the system can go to an average sleep period of about 56ms, handling 32 interrupts per second.
Even better is to get the HPET out of the "legacy mode" currently used by Linux. This mode is simple to use, but it requires the rebroadcasting of timer interrupts on multiprocessor systems. But the HPET can work with per-CPU channels, eliminating this problem. The result: average sleep time grows to 74ms.
At this point, the problem moves to user space. Since the release of powertop, there has been a lot of progress in this area; user-space applications which cause frequent wakeups unnecessarily stand out immediately and can be fixed. But, as Len noted, "user space still sucks."
Len then moved away from power consumption toward its effects - the generation of heat, in particular. The creation of excess heat is not a welcome behaviour in any device, but it becomes especially undesirable in handheld devices. Devices which make the user's hand sweat are less fun to use; those which get too hot to hold comfortably can be entirely unusable. So temperature management is important. But the nature of these devices can make thermal regulation tricky: there's no room for fans in a Linux-powered cellular phone, and the dissipation of heat can be hard in general.
The ACPI 3.0 specification includes a complicated thermal model. The device is divided up into zones, and each component has its thermal contribution to each zone characterised. Implementing this specification is a complex and difficult task - enough so that the Linux ACPI developers have no intention of doing it. They will, instead, focus on something simpler.
That something is the ACPI 2.0 thermal model. It includes thermal zones, each of which comes with temperature sensors and a set of trip points. The "critical shutdown" trip point is set somewhere just short of where the device begins to melt; should things get that warm, the device just needs to turn itself off as quickly as possible. Various other trip points will be encountered first; they should bring about increasingly strong measures for controlling temperature.
These can include turning on fans (if they exist), throttling devices, or suspending the system to disk. ACPI 2.0 includes an embedded controller which monitors the system's temperature sensors and sends events to the CPU when something interesting happens.
The in-progress thermal regulation code just uses the existing critical shutdown mechanism built into ACPI. There is also support for some of the passive trip points which bring about CPU throttling. For the non-processor thermal zones, though, the best thing to do is to let user space figure out how to respond, so that's what the ACPI code will do. There will be a netlink interface through which temperature events can be sent, and a set of sysfs directories for reading sensor values. The sysfs tree will also include control files which can be used by a user-space daemon to throttle specific devices in response to temperature events.
In the end, the kernel is really just a conduit, conducting events and control settings between the components of the device and user space. There were some questions on whether there will be a standardised set of sysfs knobs for every device; the answer appears to be "no." Each device is different, with its own control parameters; it is hard to create any sort of standard which can describe them all. Beyond that, the target environment is embedded devices, each of which is unique. It is expected that each device will have its own special-purpose management daemon designed especially for it, so there is no real benefit in trying to make things generic.
The impression one gets from all these talks is that quite a bit is happening in the power management area - a part of Linux which, for some time, has been seen as falling short of what it really needs to be. The increasing use of Linux in embedded systems can only help in this regard; there are a number of vendors who have a strong interest in improved support for intelligent use of power. Given time and continued work, power management may soon be one of those past problems which is no longer an issue.
In the second part of this article, published here, we report on Len's 10 ACPI myths.