Raw events and the perf ABI

By Jonathan Corbet
May 3, 2011

The perf events subsystem often looks like it's on the path to take over the kernel; there is a great deal of development activity there, and it has become a sort of generalized event reporting mechanism. But the original purpose of perf events was to provide access to the performance monitoring counters made available by the hardware, and it is still used to that end. The merging of perf was a bit of a hard pill for users of alternative performance monitoring tools to swallow, but they have mostly done so. The recent discussion on "offcore" events shows that there are still some things to argue about in this area, even if everybody seems likely to get what they want in the end.

perf events 子系统貌似将在内核中全面应用:相关的开发非常活跃,并且已经形成一种通用的事件报告机制。不过 perf events 的原始设计是为了访问硬件所提供的性能监控计数器,而且现在仍然被用于该用途。对于使用其它性能监控部件的开发者,转向到 perf 子系统是一剂难以吞咽的苦药,好在基本上都转的差不多了。但最近关于 "offcore" events 的讨论,显示该领域仍然存在争议,even if everybody seems likely to get what they want in the end.

The performance monitoring unit (PMU) is normally associated with the CPU; each processor has its own PMU for monitoring its own specific events. Some newer processors (such as Intel's Nehalem series) also provide a PMU which is not tied to any CPU; in the Nehalem case it's part of the "uncore" which handles memory access at the package level. The off-core PMU has a viewpoint which allows it to provide a better picture of the overall memory behavior of the system, so there is interest in gaining access to events from that PMU. Current kernels, though, do not provide access to these offcore events.

性能监控单元(PMU)通常是和CPU联系在一起的。每个处理器都有自己的 PMU 用以监控自己的特定事件。一些最新的处理器(如Intel的 Nehalem系列)还提供了一个没有和任何CPU关联的 PMU,该例子里,PMU 是 "uncore" 的一部分,负责在包层面来处理内存访问的性能计数。这个 offcore PMU 能够为系统的内存行为提供更好的全貌,从这个 PMU 获得事件的访问会很有趣。然而,当前版本的内核并不提供对这些 offcore 事件的访问。

For a while, the 2.6.39-rc kernel did provide access to these events, following the merging of a patch by Andi Kleen in March. One piece that was missing, though, was a patch to the user-space perf tool to provide access to this functionality. There was an attempt to merge that piece toward the end of April, but it did not yield the desired results; rather than merge the additional change, perf maintainer Ingo Molnar removed the ability to access offcore events entirely.

依赖于 Andi Kleen 在三月份提供的一个补丁,2.6.39rc 版内核提供了这些事件的访问方式。有人尝试在四月末提供一个让用户空间的 perf 工具来访问该功能的补丁,但没有完成。因为任何相关的改变已经没有意义,perf维护者Ingo Molnar将访问 offcore 事件的能力全部移除了。

Needless to say, that action has led to some unhappiness in the perf user community; there are developers who had already been making use of those events. Normally, breaking things in this way would be considered a regression, and the patch would be backed out again. But, since this functionality never appeared in a released kernel, it cannot really be called a regression. That, of course, is part of the point of removing the feature now.

自然不必说,这种行为必然会导致 perf 开发社区的不快;有些开发者已经在用那些事件了。通常,这会被视为一个 regression,相关最新补丁被迫回滚。但是由于这些功能并未出现在正式发行中,所以也不能报告 bug 说发生了 regression,只是我们移除了一个特性而已。

Ingo's complaint is straightforward: the interface to these events was too low-level and too difficult to use. The rejected perf patch had an example command which looked like:

    perf stat -e r1b7:20ff -a sleep 1

Non-expert readers may, indeed, be forgiven for not immediately understanding that this command would monitor access to remote DRAM - memory which is hosted on a different socket. Ingo asserted that the feature should be more easily used, perhaps with a command like (from the patch removing the feature):

    perf record -e dram-remote ./myapp


    perf stat -e r1b7:20ff -a sleep 1

对非专家读者来说,无法理解这条命令是指示监控 remote DRAM(memory which is hosted on a different socket,貌似是关联在其它CPU插槽上的内存)访问。Ingo认为这些特性应该更容易使用才对,比如像下面这条命令:

    perf record -e dram-remote ./myapp

He also said:

But this kind of usability is absolutely unacceptable - users should not be expected to type in magic, CPU and model specific incantations to get access to useful hardware functionality.
The proper solution is to expose useful offcore functionality via generalized events - that way users do not have to care which specific CPU model they are using, they can use the conceptual event and not some model specific quirky hexa number.

他同时说道,这种不易用性绝对无法令人接受。用户不应该被期望能够通过键入魔术般地、CPU级的、特定模式的咒语去访问有用的硬件功能。合适的方案是通过广义的事件来揭示 offcore 功能。这种方式下,用户不必去关心自己用的CPU模式,可以使用集中的事件而不是靠离奇的十六进制数指定的模式。

The key is the call for "generalized events" which are mapped, within the kernel, onto whatever counters the currently-running hardware uses to obtain that information. Users need not worry about the exact type of processor they are running on, and they need not dig out the data sheet to figure out what numbers will yield the results they want.

关键是定义出广义事件("generalized events"),它们在内核中被映射为计数器,正在运行的硬件用之来获得那些信息。用户不用去关心他们运行的处理器的型号,不必去通过挖掘数据表来确定出哪些数字将给出他们想要的结果。

Criticism of this move has taken a few forms. Generalized events, it is said, are a fine thing to have, but they can never reflect all of the weird, hardware-specific counters that each processor may provide. These events should also be managed in user space where there is more flexibility and no need to bloat the kernel. There were some complaints about how some of the existing generalized events have not always been implemented correctly on all architectures. And, they say, there will always be people who want to know what's in a specific hardware counter without having the kernel trying to generalize it away from them. As Vince Weaver put it:

Blocking access to raw events is the wrong idea. If anything, the whole "generic events" thing in the kernel should be ditched. Wrong events are used at times (see AMD branch events a few releases back, now Nehalem cache events). This all belongs in userspace, as was pointed out at the start. The kernel has no business telling users which perf events are interesting, or limiting them!

批评主义者已经采取了一些行为。广义事件,是一个很好的东西,但是却无法反应所有的怪异的,每个处理器可以提供的特定硬件的计数器。这些事件应该在用户空间被管理,这样会更灵活,而且不必使内核膨胀。有些抱怨是关于为何已经存在的一部分广义事件并不能在所有的体系结构下总是被正确执行。而且,总是有人想去直接使用特定的硬件计数器而不是内核提供的抽象层。正如Vince Weaver所说:

阻止访问 raw events 并不是个好主意,甚至整个内核中的通用事件都应该被丢弃。(现况导致了)时而出现不合适的事件应用(比如以前一些版本里 AMD 的 branch events,现在的 Nehalem的 cache events)。这些都应该在最开始就明确是属于用户空间的事件。内核没必要告诉用户哪个 perf events 是感兴趣的,或者限制它们。

Ingo's response is that the knowledge and techniques around performance monitoring should be concentrated in one place:

Well, the thing is, i think users are helped most if we add useful, highlevel PMU features added and not just an opaque raw event pass-through engine. The problem with lowlevel raw ABIs is that the tool space fragments into a zillion small hacks and there's no good concentration of know-how. I'd like the art of performance measurement to be generalized out, as well as it can be.


我认为用户需要我们增加有用的,高层的 PMU 特征,而不仅仅是提供不透明原始事件传递的引擎。仅有最底层的原始ABIs的问题是,这将功能分割成无数的小的hack,而没有一种好的集中方式。我希望性能衡量的艺术更一般化,而且也可以做到。

Vince, meanwhile, went on to claim that perf was a reinvention of the wheel which has ignored a lot of the experience built into its predecessors. There are, it seems, still some scars from that series of events. Thomas Gleixner disagreed with the claim that perf is an exercise in wheel reinvention, but he did say that he thought the raw events should be made available:

The problem at hand which ignited this flame war is definitely borderline and I don't agree with Ingo that it should not made be available right now in the raw form. That's an hardware enablement feature which can be useful even if tools/perf has not support for it and we have no generalized event for it. That's two different stories. perf has always allowed to use raw events and I don't see a reason why we should not do that in this case if it enables a subset of the perf userbase to make use of it.

Vince同时宣称perf是一个重复的革新,它忽略了前辈的大量的经验。似乎看起来仍旧有些那些一系列事件的疤痕。Thomas Gleixner不同意这种perf是重复创新的实践,但是他也认为原始事件应该可被用(译者注:从ABI的层面被暴露出来?):

The problem at hand which ignited this flame war is definitely borderline,我并不同意 Ingo 所说没有必要提供原始事件接口。那是一个有用的硬件可实施的特性,即使perf 还没有支持它,而且我们也没有普遍的事件支持它。那是两种不同的故事。perf总是允许使用原始事件而且我也并没有看到一个我们不应该那样做的理由,如果它使得一部分用户群体利用它。

It turns out that Ingo is fine with raw events too. His stated concern is that access to raw events should not be the primary means by which most users gain access to those performance counters. So he is blocking the availability of those events for now for two reasons. One of those is that he wants the generalized mode of access to be available first so that users will see it as the normal way to access offcore events. If there is never any need to figure out hexadecimal incantations, many user-space developers will never bother; as a result, their commands and code should eventually work on other processors as well.

The other reason for blocking raw events now is that, as the interface to these events is thought through, the ABI by which they are exposed to user space may need to change. Releasing the initial ABI in a stable kernel seems almost certain to cement it in place, given that people were already using it. By deferring these events for one cycle (somebody will certainly come up with a way to export them in 2.6.40), he hopes to avoid being stuck with a second-rate interface which has to be supported forever.

讨论最后,Ingo 也认为原始事件很好,不过他的立场是大多数用户访问性能计数器不应该以访问 raw events 作为主要手段。现在他阻止这些事件的可获得性基于两个原因。一个是他想首先实现通用的访问模式,这样用户能以很普通的方式来访问 offcore 事件。如果没有任何必要去理解十六进制的咒语,许多用户层面的开发者将不再烦恼。同时他们的命令和代码也能在其他处理器上也正常工作。


There can be no doubt that managing this feature in this way makes life harder for some developers. The kernel process can be obnoxious to deal with at times. But the hope is that doing things this way will lead to a kernel that everybody is happier with five years from now. If things work out that way, most of us can deal with a bit of frustration and a one-cycle delay now.

无疑以这种方式来管理这个特性对一些开发者来说很难。Linux 内核开发过程的处理难免可恶。这种方式是希望五年之后有一个每个人都很满意的内核。