It is important to start small, because it is easy to capture
vastly more data than can possibly be processed.
The simplest thing to do with Intel PT is userspace profiling of
small programs. Data is captured with perf record. For example,
to trace ls in userspace only:
perf record -e intel_pt//u ls
The data is then profiled with perf report e.g.
perf report
Tracing kernel space as well presents a problem, namely kernel
self-modifying code. A fairly good kernel image is available in
/proc/kcore, but to get an accurate image a copy of /proc/kcore
needs to be made under the same conditions as the data capture.
perf record can make a copy of /proc/kcore if the option --kcore
is used, but access to /proc/kcore is restricted, hence the use
of sudo e.g.
sudo perf record -o pt_ls --kcore -e intel_pt// -- ls
This creates a directory named pt_ls and puts the perf.data
file (named simply data) and copies of /proc/kcore,
/proc/kallsyms and /proc/modules into it. The other perf tools
understand the directory format, so using perf report becomes:
sudo perf report -i pt_ls
Because samples are synthesized after the fact, the sampling
period can be selected at reporting time. For example, to sample
every microsecond:
sudo perf report -i pt_ls --itrace=i1usge
See the sections below for more information about the --itrace
option.
Beware that the smaller the period, the more samples are
produced and the longer it takes to process them.
Also note that the coarseness of Intel PT timing information will
start to distort the statistical value of the sampling as the
sampling period becomes smaller.
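As a rough illustration of how quickly the sample count grows, the
following Python sketch estimates the number of samples for a given
trace length and sampling period. It assumes a single CPU that runs
continuously for the whole trace, which is a simplification. The
function name and the example durations are hypothetical.

```python
def estimated_samples(trace_ns, period_ns):
    """Rough estimate of samples for one continuously traced CPU."""
    if period_ns <= 0:
        raise ValueError("period must be positive")
    return trace_ns // period_ns


# One second of trace sampled every microsecond (as with i1us)
print(estimated_samples(1_000_000_000, 1_000))      # 1000000
# The same trace sampled every millisecond
print(estimated_samples(1_000_000_000, 1_000_000))  # 1000
```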
To represent software control flow, "branches" samples are
produced. By default a branch sample is synthesized for every
single branch. To get an idea of what data is available, use the
perf script tool with all itrace sampling options, which lists
all the samples:
perf record -e intel_pt//u ls
perf script --itrace=ibxwpe
An interesting field that is not printed by default is flags
which can be displayed as follows:
perf script --itrace=ibxwpe -F+flags
The flags are "bcrosyiABExgh" which stand for branch, call,
return, conditional, system, asynchronous, interrupt, transaction
abort, trace begin, trace end, in transaction, VM-entry, and
VM-exit respectively.
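The letter-to-meaning mapping above can be captured in a small
lookup table. The following Python sketch is only an illustration:
the names FLAG_NAMES and describe_flags are hypothetical, and it
handles only the single-letter form listed above (perf script may
print some common flag combinations in another form).

```python
# Mapping taken from the list of flag letters described above
FLAG_NAMES = {
    "b": "branch",
    "c": "call",
    "r": "return",
    "o": "conditional",
    "s": "system",
    "y": "asynchronous",
    "i": "interrupt",
    "A": "transaction abort",
    "B": "trace begin",
    "E": "trace end",
    "x": "in transaction",
    "g": "VM-entry",
    "h": "VM-exit",
}


def describe_flags(flags):
    """Expand a string of flag letters into a list of descriptions."""
    return [FLAG_NAMES.get(ch, "unknown(%s)" % ch) for ch in flags]


print(describe_flags("bc"))  # ['branch', 'call']
```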
perf script also supports higher level ways to dump instruction
traces:
perf script --insn-trace --xed
This dumps all instructions and requires the xed tool to be
installed (see XED below). Dumping all instructions in a long
trace can be fairly slow. It is usually better to start with
higher level decoding, like
perf script --call-trace
or
perf script --call-ret-trace
and then select a time range of interest. The time range can then
be examined in detail with
perf script --time starttime,stoptime --insn-trace --xed
While examining the trace, it is also useful to filter on specific
CPUs using the -C option. For example, to dump all instructions
in the time range on CPU 1:
perf script --time starttime,stoptime --insn-trace --xed -C 1
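When narrowing down repeatedly, it can help to build the command
line from a script. The following Python sketch assembles the
argument list using only the options shown above. The helper name
insn_trace_command is hypothetical, and starttime and stoptime
remain placeholders to be replaced with real timestamps.

```python
import shlex


def insn_trace_command(start, stop, cpu=None):
    """Build a 'perf script' argument list for a detailed instruction trace."""
    args = ["perf", "script", "--time", "%s,%s" % (start, stop),
            "--insn-trace", "--xed"]
    if cpu is not None:
        # Restrict the output to a single CPU
        args += ["-C", str(cpu)]
    return args


cmd = insn_trace_command("starttime", "stoptime", cpu=1)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run() once the
placeholders are replaced with values from an earlier
--call-trace run.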
Another interesting field that is not printed by default is ipc
which can be displayed as follows:
perf script --itrace=be -F+ipc
There are two ways that instructions-per-cycle (IPC) can be
calculated depending on the recording.
If the cyc config term (see config terms section below) was used,
then IPC is calculated using the cycle count from CYC packets,
otherwise MTC packets are used - refer to the mtc config term.
When MTC is used, however, the values are less accurate because
the timing is less accurate.
Because Intel PT does not update the cycle count on every branch
or instruction, the values will often be zero. When there are
values, they will be the number of instructions and number of
cycles since the last update, and thus represent the average IPC
since the last IPC for that event type. Note IPC for "branches"
events is calculated separately from IPC for "instructions"
events.
Also note that the IPC instruction count may or may not include
the current instruction. If the cycle count is associated with an
asynchronous branch (e.g. page fault or interrupt), then the
instruction count does not include the current instruction,
otherwise it does. That is consistent with whether or not that
instruction has retired when the cycle count is updated.
Note also that, in the case of "branches" events, non-taken
branches are not presently sampled, so IPC values for them do not
appear. An example is a CYC packet with a TNT packet that starts
with a non-taken branch. To see every possible IPC value,
"instructions" events can be used e.g. --itrace=i0ns
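Because each IPC value covers only the instructions and cycles
since the previous update, an overall figure is better obtained by
summing the counts than by averaging the individual ratios. The
following Python sketch does that for text lines of perf script
output. It assumes the ipc field is printed in the form
"IPC: 0.18 (18/97)", that is ratio (instructions/cycles). That
format, the helper name and the sample lines are assumptions made
for illustration, so check them against real output.

```python
import re

# Assumed shape of the ipc field: "IPC: <ratio> (<instructions>/<cycles>)"
IPC_RE = re.compile(r"IPC:\s*[\d.]+\s*\((\d+)/(\d+)\)")


def aggregate_ipc(lines):
    """Return (instructions, cycles, ipc) summed over lines with IPC values."""
    insns = cycles = 0
    for line in lines:
        m = IPC_RE.search(line)
        if m:  # most lines carry no IPC value and are skipped
            insns += int(m.group(1))
            cycles += int(m.group(2))
    ipc = insns / cycles if cycles else 0.0
    return insns, cycles, ipc


# Made-up lines for illustration only, not real perf output
sample_lines = [
    "ls 1234 [001] 100.000001: branches: ... IPC: 0.50 (10/20)",
    "ls 1234 [001] 100.000002: branches: ...",
    "ls 1234 [001] 100.000003: branches: ... IPC: 2.00 (60/30)",
]
print(aggregate_ipc(sample_lines))  # (70, 50, 1.4)
```

Since IPC for "branches" events is calculated separately from IPC
for "instructions" events, lines for the two event types should be
aggregated separately.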
While it is possible to create scripts to analyze the data, an
alternative approach is to export the data to a SQLite or
PostgreSQL database. Refer to script export-to-sqlite.py or
export-to-postgresql.py for more details, and to script
exported-sql-viewer.py for an example of using the database.
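Once the data is in a SQLite database, a first step is usually to
see which tables and views exist and how large they are. The
following Python sketch uses only the standard sqlite3 module and
makes no assumption about the exported schema. The helper names are
hypothetical, and the in-memory database with its example_table is
a stand-in for a real export.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables and views in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    )
    return [name for (name,) in rows]


def row_count(conn, table):
    """Count rows in a table whose name was obtained from list_tables()."""
    return conn.execute('SELECT COUNT(*) FROM "%s"' % table).fetchone()[0]


# Stand-in database; a real export would be opened by file name instead
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_table (id INTEGER)")
conn.executemany("INSERT INTO example_table VALUES (?)", [(1,), (2,)])

for name in list_tables(conn):
    print(name, row_count(conn, name))
```

For a real export, replace ":memory:" with the path of the database
file and drop the stand-in CREATE and INSERT statements.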
There is also script intel-pt-events.py which provides an example
of how to unpack the raw data for power events and PTWRITE. The
script also displays branches, and supports 2 additional modes
selected by option:
--insn-trace - instruction trace
--src-trace - source trace
As mentioned above, it is easy to capture too much data. One way
to limit the data captured is to use snapshot mode. Refer to the
sections "new snapshot option" and "Intel PT modes of operation"
further below.
Another problem that will be encountered is decoder errors. They
can be caused by the inability to access the executed image, by
self-modified or JIT-ed code, or by the inability to match
side-band information (such as context switches and mmaps), any
of which results in the decoder not knowing what code was
executed.
There is also the problem of perf not being able to copy the data
fast enough, resulting in data being lost because the buffer was
full. See "Buffer handling" below for more details.