|  | Akaros Profiling | 
|  | =========================== | 
|  |  | 
|  | Contents: | 
|  |  | 
|  | (*) Perf | 
|  | - Setup | 
|  | - Basic Example | 
|  | - More Complicated Examples | 
|  | - Differences From Linux | 
|  |  | 
|  | (*) mpstat | 
|  |  | 
|  |  | 
|  | =========================== | 
|  | PERF | 
|  | =========================== | 
|  | Akaros has limited support for perf_events.  perf is a tool which utilizes CPU | 
|  | performance counters for performance monitoring and troubleshooting. | 
|  |  | 
|  | Akaros has its own version of perf, similar in spirit to Linux's perf, that | 
|  | produces PERFILE2 ABI compliant perf.data files (if not, file a bug!).  The | 
|  | kernel generates traces under the direction of perf.  You then copy the traces | 
|  | to a Linux host and process them using Linux's perf. | 
|  |  | 
|  |  | 
|  | SETUP | 
|  | -------------------- | 
|  | To build Akaros's perf directly: | 
|  |  | 
|  | (linux) $ cd tools/dev-libs/elfutils ; make install; cd - | 
|  | (linux) $ cd tools/dev-util/perf ; make install; cd - | 
|  |  | 
|  | Or to build it along with all apps: | 
|  |  | 
|  | (linux) $ make apps-install | 
|  |  | 
|  | You will also need a suitably recent Linux perf to report the data (something | 
|  | that understands the PERFILE2 format).  Unpatched Linux 4.5 perf did the | 
|  | trick.  You'll also want libelf and maybe other libraries on your Linux | 
|  | machine. | 
|  |  | 
|  | First, install libelf according to your distro.  On Ubuntu: | 
|  | (linux) $ sudo apt-get install libelf-dev | 
|  |  | 
|  | Then try installing perf from your Linux distro, along with any needed | 
|  | dependencies.  On Ubuntu, you can install linux-tools-common and whatever else | 
|  | it asks for (something particular to your host kernel). | 
|  |  | 
|  | Linux perf changes a lot.  Newer versions are usually nicer.  I recommend | 
|  | building one of them:  Download Linux source, then | 
|  |  | 
|  | (linux) $ cd tools/perf/ | 
|  | (linux) $ make | 
|  |  | 
|  | Then use your new perf binary.  This all is just installing a recent perf - it | 
|  | has little to do with Akaros at this point.  If you run into incompatibilities | 
|  | between our perf.data format and the latest Linux, file a bug. | 
|  |  | 
|  |  | 
|  | BASIC EXAMPLE | 
|  | -------------------- | 
|  | Perf on Akaros supports record, stat, and a few custom options. | 
|  |  | 
|  | You should be able to do the following: | 
|  |  | 
|  | / $ perf record ls | 
|  |  | 
|  | Then scp perf.data to your Linux host and run the report there: | 
|  |  | 
|  | (linux) $ scp AKAROS_MACHINE:perf.data . | 
|  | (linux) $ perf report --kallsyms=obj/kern/ksyms.map --symfs=kern/kfs/ | 
|  |  | 
|  | Perf will look on your host machine for the kernel symbol table and for | 
|  | binaries.  We need to tell it kallsyms and symfs to override those settings. | 
|  |  | 
|  | It can be a hassle to type out the kallsyms and symfs, so we have a script that | 
|  | will automate that.  Use scripts/perf in any place that you'd normally use | 
|  | perf.  Set your $AKAROS_ROOT (default is ".") and optionally override $PERF_CMD | 
|  | (default is "perf").  For most people, this will just be: | 
|  |  | 
|  | (linux) $ ./scripts/perf report | 
|  |  | 
|  | The perf.data file is implied, so the above command is equivalent to: | 
|  |  | 
|  | (linux) $ ./scripts/perf report -i perf.data | 
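
To make the role of the two variables concrete, here is a minimal sketch of
the command line such a wrapper could assemble.  This is not the contents of
scripts/perf; the function name akaros_perf_cmdline is hypothetical, and the
only things taken from this document are the two variables, their defaults,
and the kallsyms/symfs paths.

```shell
# Hypothetical helper (NOT the real scripts/perf): print the Linux perf
# command line implied by $AKAROS_ROOT and $PERF_CMD.
akaros_perf_cmdline() (
    root="${AKAROS_ROOT:-.}"        # default "." as described above
    perf_cmd="${PERF_CMD:-perf}"    # default "perf" as described above
    subcmd="$1"                     # e.g. "report"
    shift
    # Any remaining arguments are appended unchanged after the overrides.
    echo "$perf_cmd $subcmd --kallsyms=$root/obj/kern/ksyms.map --symfs=$root/kern/kfs/${*:+ $*}"
)
```

With both variables unset, `akaros_perf_cmdline report` prints the same
command that was typed out by hand in the basic example, with "./" in front
of the two paths.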
|  |  | 
|  |  | 
|  | MORE COMPLICATED EXAMPLES | 
|  | -------------------- | 
|  | First, try perf --help for usage.  Then check out | 
|  | https://perf.wiki.kernel.org/index.php/Tutorial.  We strive to be mostly | 
|  | compatible with the usage of Linux perf. | 
|  |  | 
|  | perf stat runs a command and reports the count of events during the run of the | 
|  | command.  perf record runs a command and outputs perf.data, which contains | 
|  | backtrace samples from when the event counters overflowed.  For those familiar | 
|  | with other perfmon systems, perf stat is like PAPI and perf record is like | 
|  | Oprofile. | 
|  |  | 
|  | perf record and stat both track a set of events with the -e flag.  -e takes a | 
|  | comma-separated list of events.  Events can be expressed in one of three forms: | 
|  |  | 
|  | - Generic events (called "pre-defined" events on Linux) | 
|  | - Libpfm events | 
|  | - Raw events | 
|  |  | 
|  | Linux's perf only takes Generic and Raw events, so libpfm4 support is an added | 
|  | bonus. | 
|  |  | 
|  | Generic events consist of strings like "cycles" or "cache-misses".  Raw events | 
|  | are simple strings of the form "rXXX", where the X's are hex nibbles.  The hex | 
|  | codes are passed directly to the PMU.  You can actually have 2-4 Xs on Akaros. | 
|  |  | 
|  | Libpfm events are strings that correspond to events specific to your machine. | 
|  | Libpfm knows about PMU events for a given machine.  It figures out what machine | 
|  | perf is running on and selects events that should be available.  Check out | 
|  | http://perfmon2.sourceforge.net/ for more info. | 
|  |  | 
|  | To see the list of events available, use `perf list [regex]`, supplying an | 
|  | optional search regex.  For example, on a Haswell: | 
|  |  | 
|  | / $ perf list unhalted_reference_cycles | 
|  | #----------------------------- | 
|  | IDX      : 37748738 | 
|  | PMU name : ix86arch (Intel X86 architectural PMU) | 
|  | Name     : UNHALTED_REFERENCE_CYCLES | 
|  | Equiv    : None | 
|  | Flags    : None | 
|  | Desc     : count reference clock cycles while the clock signal on the specific core is running. The reference clock operates at a fixed frequency, irrespective of core frequency changes due to performance state transitions | 
|  | Code     : 0x13c | 
|  | Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean) | 
|  | Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean) | 
|  | Modif-02 : 0x02 : PMU : [e] : edge level (may require counter-mask >= 1) (boolean) | 
|  | Modif-03 : 0x03 : PMU : [i] : invert (boolean) | 
|  | Modif-04 : 0x04 : PMU : [c] : counter-mask in range [0-255] (integer) | 
|  | Modif-05 : 0x05 : PMU : [t] : measure any thread (boolean) | 
|  | #----------------------------- | 
|  | IDX      : 322961409 | 
|  | PMU name : hsw_ep (Intel Haswell EP) | 
|  | Name     : UNHALTED_REFERENCE_CYCLES | 
|  | Equiv    : None | 
|  | Flags    : None | 
|  | Desc     : Unhalted reference cycles | 
|  | Code     : 0x300 | 
|  | Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean) | 
|  | Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean) | 
|  | Modif-02 : 0x05 : PMU : [t] : measure any thread (boolean) | 
|  |  | 
|  | There are two different events for UNHALTED_REFERENCE_CYCLES (case | 
|  | insensitive).  libpfm will select the most appropriate one.  You can override | 
|  | this selection by specifying a PMU: | 
|  |  | 
|  | / $ perf stat -e ix86arch::UNHALTED_REFERENCE_CYCLES ls | 
|  |  | 
|  | Here's how to specify multiple events: | 
|  |  | 
|  | / $ perf record -e cycles,instructions ls | 
|  |  | 
|  | Events also take a set of modifiers.  For instance, you can specify running | 
|  | counters only in kernel mode or user mode.  Modifiers are separated by a ':'. | 
|  |  | 
|  | This will track only user cycles (default is user and kernel): | 
|  |  | 
|  | / $ perf record -e cycles:u ls | 
|  |  | 
|  | To use a raw event, you need to know the event number.  You can either look in | 
|  | your favorite copy of the SDM, or you can ask libpfm.  Though if you ask | 
|  | libpfm, you might as well just use its string processing.  For example: | 
|  |  | 
|  | / $ perf list FLUSH | 
|  | #----------------------------- | 
|  | IDX      : 322961462 | 
|  | PMU name : hsw_ep (Intel Haswell EP) | 
|  | Name     : TLB_FLUSH | 
|  | Equiv    : None | 
|  | Flags    : None | 
|  | Desc     : TLB flushes | 
|  | Code     : 0xbd | 
|  | Umask-00 : 0x01 : PMU : [DTLB_THREAD] : None : Count number of DTLB flushes of thread-specific entries | 
|  | Umask-01 : 0x20 : PMU : [STLB_ANY] : None : Count number of any STLB flushes | 
|  | Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean) | 
|  | Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean) | 
|  | Modif-02 : 0x02 : PMU : [e] : edge level (may require counter-mask >= 1) (boolean) | 
|  | Modif-03 : 0x03 : PMU : [i] : invert (boolean) | 
|  | Modif-04 : 0x04 : PMU : [c] : counter-mask in range [0-255] (integer) | 
|  | Modif-05 : 0x05 : PMU : [t] : measure any thread (boolean) | 
|  | Modif-06 : 0x07 : PMU : [intx] : monitor only inside transactional memory region (boolean) | 
|  | Modif-07 : 0x08 : PMU : [intxcp] : do not count occurrences inside aborted transactional memory region (boolean) | 
|  |  | 
|  | The raw code is 0xbd.  So the following are equivalent (but slightly buggy!): | 
|  |  | 
|  | / $ perf stat -e TLB_FLUSH ls | 
|  | / $ perf stat -e rbd ls | 
|  |  | 
|  | If you actually run those, rbd will have zero hits, and TLB_FLUSH will give you | 
|  | the error "Failed to parse event string TLB_FLUSH". | 
|  |  | 
|  | Some events are actually rather particular about their Umasks, and TLB_FLUSH is | 
|  | one of them.  TLB_FLUSH wants a Umask.  Umasks are selectors for specific | 
|  | sub-types of | 
|  | events.  In the case of TLB_FLUSH, we can choose between DTLB_THREAD and | 
|  | STLB_ANY.  Umasks are not always required - they just happen to be on my | 
|  | Haswell for TLB_FLUSH.  That being said, we can ask for the event like so: | 
|  |  | 
|  | / $ perf stat -e TLB_FLUSH:STLB_ANY ls | 
|  | / $ perf stat -e r20bd ls | 
|  |  | 
|  | Note that the Umask is placed before the Code.  These 16 bits are passed | 
|  | directly to the PMU, and on Intel the format is "umask:event". | 
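
The "umask:event" layout can be sketched as a tiny shell helper that glues
the two bytes together.  The name raw_event is made up for illustration, and
the sketch assumes the Intel ordering described above with one byte each for
the Umask and the Code.

```shell
# Hypothetical helper: build a raw event string from a Umask and a Code.
# Assumes the Intel layout described above: Umask byte first, then Code byte.
raw_event() {
    # printf accepts the 0x-prefixed arguments and prints two hex digits each.
    printf 'r%02x%02x\n' "$1" "$2"
}

raw_event 0x20 0xbd    # TLB_FLUSH:STLB_ANY from the listing; prints r20bd
```

The result can be passed straight to -e, e.g.
perf stat -e "$(raw_event 0x20 0xbd)" ls.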
|  |  | 
|  | perf record is based on recording samples when event counters overflow.  The | 
|  | number of events required to trigger a sample is referred to as the | 
|  | sample_period.  You can set it with -c, e.g. | 
|  |  | 
|  | / $ perf record -c 10000 ls | 
|  |  | 
|  |  | 
|  | DIFFERENCES FROM LINUX | 
|  | -------------------- | 
|  | For the most part, Akaros perf is similar to Linux.  A few things are | 
|  | different. | 
|  |  | 
|  | The biggest difference is that our perf does not follow processes around.  We | 
|  | count events for cores, not processes.  You can specify certain cores, but not | 
|  | certain processes.  Any options related to tracking specific processes are | 
|  | unsupported. | 
|  |  | 
|  | The -F option (frequency) is loosely supported.  The kernel cannot adjust the | 
|  | sampling count dynamically to meet a certain frequency.  Instead, we guess | 
|  | that -F is used with cycles, and pick a sample period that will generate | 
|  | samples at the desired frequency if the core is unhalted.  YMMV. | 
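
The arithmetic behind that guess is roughly "cycles per second divided by
samples per second".  The sketch below shows only that division; the helper
name sample_period and the 2.4 GHz clock are assumptions for illustration,
and the exact value and rounding Akaros perf uses are not stated here.

```shell
# Hypothetical helper: sample period that yields about $2 samples per second
# on a core that runs unhalted at $1 cycles per second.
sample_period() {
    cycles_per_sec=$1    # assumed core clock in Hz, e.g. 2400000000
    freq=$2              # desired sampling frequency, as given to -F
    echo $(( cycles_per_sec / freq ))
}

sample_period 2400000000 1000    # prints 2400000
```

Under that assumption, -F 1000 behaves about like -c 2400000 on the cycles
event.  A core that spends time halted will produce fewer samples than
requested.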
|  |  | 
|  | Akaros currently supports only PMU events.  In the future, we may add events | 
|  | like context-switches. | 
|  |  | 
|  |  | 
|  | =========================== | 
|  | mpstat | 
|  | =========================== | 
|  | Akaros has basic support for mpstat.  mpstat gives a high-level glance at where | 
|  | each core is spending its time. | 
|  |  | 
|  | For starters, bind kprof somewhere.  The basic ifconfig script binds it to | 
|  | /prof. | 
|  |  | 
|  | To see the CPU usage, cat mpstat: | 
|  |  | 
|  | / $ cat /prof/mpstat | 
|  | CPU:             irq             kern              user                 idle | 
|  | 0: 1.707136 (  0%), 24.978659 (  0%), 0.162845 (  0%), 13856.233909 ( 99%) | 
|  |  | 
|  | To reset the count: | 
|  |  | 
|  | / $ echo reset > /prof/mpstat | 
|  |  | 
|  | To see the output for a particular command: | 
|  |  | 
|  | / $ echo reset > /prof/mpstat ; COMMAND ; cat /prof/mpstat |
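
If you copy the mpstat text to a Linux host, a short awk filter can boil each
core down to busy versus idle time.  This is only a sketch: mpstat_busy is a
made-up name, and it assumes the column order shown in the header above (irq,
kern, user, idle) with times in the first number of each column.

```shell
# Hypothetical helper: read /prof/mpstat output on stdin and print, per core,
# the non-idle time (irq + kern + user) and the idle time.
mpstat_busy() {
    awk '
        {
            # Drop the "( NN%)" groups and commas so only the times remain.
            gsub(/\([^)]*\)|,/, "")
        }
        # Core lines start with "N:"; this skips the "CPU:" header line.
        $1 ~ /^[0-9]+:$/ {
            printf "%s busy=%.6f idle=%.6f\n", $1, $2 + $3 + $4, $5
        }
    '
}
```

For the sample output above, piping it through mpstat_busy gives one line for
core 0 with the three non-idle columns summed.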