`perf-intel-pt` ( 1 )

Support for Intel Processor Trace within perf tools

QUICKSTART


It is important to start small. That is because it is easy to capture vastly more data than can possibly be processed.

The simplest thing to do with Intel PT is userspace profiling of small programs. Data is captured with perf record e.g. to trace ls userspace-only:

perf record -e intel_pt//u ls

And profiled with perf report e.g.

perf report

Tracing kernel space as well presents a problem, namely kernel self-modifying code. A fairly good kernel image is available in /proc/kcore, but to get an accurate image a copy of /proc/kcore needs to be made under the same conditions as the data capture. perf record can make a copy of /proc/kcore if the option --kcore is used, but access to /proc/kcore is restricted, e.g.

sudo perf record -o pt_ls --kcore -e intel_pt// -- ls

which will create a directory named pt_ls and put the perf.data file (named simply data) and copies of /proc/kcore, /proc/kallsyms and /proc/modules into it. The other tools understand the directory format, so using perf report becomes:

sudo perf report -i pt_ls

Because samples are synthesized after-the-fact, the sampling period can be selected for reporting. e.g. sample every microsecond

sudo perf report -i pt_ls --itrace=i1usge

See the sections below for more information about the --itrace option.

Beware: the smaller the period, the more samples are produced, and the longer it takes to process them.

Also note that the coarseness of Intel PT timing information will start to distort the statistical value of the sampling as the sampling period becomes smaller.
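To get a feel for how quickly the sample count grows as the period shrinks, the arithmetic can be sketched as below. This is only a rough upper bound under the assumption that the traced program runs continuously for the whole trace; the function name and the 2-second duration are hypothetical and not part of perf.

```python
def estimate_samples(trace_duration_ns, period_ns, cpus=1):
    """Rough upper bound on samples synthesized for a time-based period.

    Assumes tracing is active for the whole duration on every CPU,
    which real workloads rarely satisfy.
    """
    if period_ns <= 0:
        raise ValueError("period must be positive")
    return cpus * (trace_duration_ns // period_ns)


# Hypothetical 2-second trace on one CPU at three candidate periods.
for label, period_ns in (("1ms", 1_000_000), ("1us", 1_000), ("100ns", 100)):
    print(label, estimate_samples(2_000_000_000, period_ns))
```

Each tenfold reduction of the period multiplies the estimate by ten, which is why starting with a coarse period is usually the safer choice.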

To represent software control flow, "branches" samples are produced. By default a branch sample is synthesized for every single branch. To get an idea what data is available you can use the perf script tool with all itrace sampling options, which will list all the samples.

perf record -e intel_pt//u ls

perf script --itrace=ibxwpe

An interesting field that is not printed by default is flags which can be displayed as follows:

perf script --itrace=ibxwpe -F+flags

The flags are "bcrosyiABExgh" which stand for branch, call, return, conditional, system, asynchronous, interrupt, transaction abort, trace begin, trace end, in transaction, VM-entry, and VM-exit respectively.
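The letter-to-meaning mapping above can be written down as a small lookup table. The sketch below assumes the flags field has been extracted as a string of raw flag letters; perf script may print some common combinations in another form, which this helper (a hypothetical name, not part of perf) does not handle.

```python
# Mapping taken from the list above: letters "bcrosyiABExgh" in order.
FLAG_NAMES = {
    "b": "branch",
    "c": "call",
    "r": "return",
    "o": "conditional",
    "s": "system",
    "y": "asynchronous",
    "i": "interrupt",
    "A": "transaction abort",
    "B": "trace begin",
    "E": "trace end",
    "x": "in transaction",
    "g": "VM-entry",
    "h": "VM-exit",
}


def decode_flags(flags):
    """Translate a string of flag letters into a list of readable names."""
    return [FLAG_NAMES.get(ch, "unknown(%s)" % ch) for ch in flags.strip()]


print(decode_flags("bcs"))
```

Letters that are not in the table are reported as unknown rather than silently dropped, so an unexpected format is easy to spot.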

perf script also supports higher level ways to dump instruction traces:

perf script --insn-trace --xed

Dump all instructions. This requires installing the xed tool (see XED below).

Dumping all instructions in a long trace can be fairly slow. It is usually better to start with higher level decoding, like

perf script --call-trace

perf script --call-ret-trace

and then select a time range of interest. The time range can then be examined in detail with

perf script --time starttime,stoptime --insn-trace --xed

While examining the trace it's also useful to filter on specific CPUs using the -C option

perf script --time starttime,stoptime --insn-trace --xed -C 1

Dump all instructions in time range on CPU 1.
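When the same drill-down is repeated for several time ranges, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only the options shown above; the helper name and the timestamps are hypothetical, and the command is printed rather than executed.

```python
import shlex


def insn_trace_cmd(start, stop, cpu=None):
    """Build a perf script command for an instruction trace of one time range."""
    cmd = ["perf", "script", "--time", "%s,%s" % (start, stop),
           "--insn-trace", "--xed"]
    if cpu is not None:
        # Restrict the dump to a single CPU, as with -C above.
        cmd += ["-C", str(cpu)]
    return cmd


# Hypothetical timestamps picked from an earlier --call-trace run.
print(shlex.join(insn_trace_cmd("1234.500", "1234.600", cpu=1)))
```

The returned list can be passed directly to subprocess.run() on a machine where perf and xed are installed.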

Another interesting field that is not printed by default is ipc which can be displayed as follows:

perf script --itrace=be -F+ipc

There are two ways that instructions-per-cycle (IPC) can be calculated depending on the recording.

If the cyc config term (see config terms section below) was used, then IPC is calculated using the cycle count from CYC packets, otherwise MTC packets are used - refer to the mtc config term. When MTC is used, however, the values are less accurate because the timing is less accurate.

Because Intel PT does not update the cycle count on every branch or instruction, the values will often be zero. When there are values, they will be the number of instructions and number of cycles since the last update, and thus represent the average IPC since the last IPC for that event type. Note IPC for "branches" events is calculated separately from IPC for "instructions" events.

Also note that the IPC instruction count may or may not include the current instruction. If the cycle count is associated with an asynchronous branch (e.g. page fault or interrupt), then the instruction count does not include the current instruction, otherwise it does. That is consistent with whether or not that instruction has retired when the cycle count is updated.

Another note, in the case of "branches" events, non-taken branches are not presently sampled, so IPC values for them do not appear e.g. a CYC packet with a TNT packet that starts with a non-taken branch. To see every possible IPC value, "instructions" events can be used e.g. --itrace=i0ns
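The accounting described above, where most samples carry no IPC value and a value covers everything since the previous update, can be illustrated with a small model. This is a hypothetical sketch of the idea only, not the decoder's actual code; the input format (instructions per sample plus an optional cycle count) is an assumption made for the example.

```python
def ipc_per_sample(samples):
    """Model IPC reporting for a stream of samples.

    samples: iterable of (insn_count, cycles) pairs, where cycles is the
    number of cycles since the last update, or None if the cycle count
    was not updated at this sample.
    Returns one (insn_cnt, cyc_cnt, ipc) tuple per sample.
    """
    reported = []
    pending_insns = 0
    for insns, cycles in samples:
        pending_insns += insns
        if cycles:
            # Cycle count updated: report the average since the last update.
            reported.append((pending_insns, cycles, pending_insns / cycles))
            pending_insns = 0
        else:
            # No update: counts are zero and there is no IPC value.
            reported.append((0, 0, None))
    return reported


for row in ipc_per_sample([(5, None), (7, None), (8, 10)]):
    print(row)
```

In this made-up input only the third sample has a cycle update, so it alone reports a value, covering all 20 instructions over 10 cycles.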

While it is possible to create scripts to analyze the data, an alternative approach is available to export the data to a sqlite or postgresql database. Refer to script export-to-sqlite.py or export-to-postgresql.py for more details, and to script exported-sql-viewer.py for an example of using the database.
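As a first look at an exported sqlite database, it can be useful to list its tables and their sizes before writing real queries. The sketch below deliberately assumes nothing about the schema produced by export-to-sqlite.py and only reads sqlite's own catalogue; the function name is hypothetical.

```python
import sqlite3


def table_row_counts(con):
    """Return {table_name: row_count} for every table in an open database."""
    names = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    counts = {}
    for name in names:
        # Table names come from the catalogue itself, so quoting is enough.
        counts[name] = con.execute(
            'SELECT COUNT(*) FROM "%s"' % name).fetchone()[0]
    return counts
```

For example, after opening the exported file with sqlite3.connect() (the file name is whatever was given to the export script), printing table_row_counts(con) shows which tables hold the bulk of the data.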

There is also script intel-pt-events.py which provides an example of how to unpack the raw data for power events and PTWRITE. The script also displays branches, and supports 2 additional modes selected by option:

--insn-trace - instruction trace
--src-trace - source trace

As mentioned above, it is easy to capture too much data. One way to limit the data captured is to use snapshot mode. Refer to the sections "new snapshot option" and "Intel PT modes of operation" further below.

Another problem that will be experienced is decoder errors. They can be caused by inability to access the executed image, self-modified or JIT-ed code, or the inability to match side-band information (such as context switches and mmaps) which results in the decoder not knowing what code was executed.

There is also the problem of perf not being able to copy the data fast enough, resulting in data lost because the buffer was full. See Buffer handling below for more details.

Original text on man7.org
