
Actively measuring and profiling Erlang code

S. L. Fritchie

scott@basho.com

Introducing myself

  • Scott Lystig Fritchie
  • Senior Software Engineer at Basho Japan KK
  • Email: <scott@basho.com>
  • Twitter: @slfritchie
  • UNIX, Erlang and me:
    • UNIX sysadmin & programmer since 1986
    • Erlang programmer since 2000
    • Publications by USENIX and the ACM
    • Speaker at Erlang User Conference and Erlang Factory

Introducing today's presentation

Today's focus: profiling Erlang code

  • You have identified code to profile, based on the "USE Method" or another measurement-based methodology.
    • Examine a specific module
    • Examine a specific subsystem/OTP application
    • Not sure where to start: examine an entire system under a specific workload.
  • Guesses are wrong 127% of the time.
  • Be skeptical: doubt your measurements, verify them!
    • Multiple methods
    • Experiment while your test is running!
  • Beard definitely not required!

Today's focus: GOAL for profiling Erlang code

  • Identify the Erlang function(s) consuming the most CPU resources.
    • Either wall-clock time or "on scheduler" time
    • Depends on the measurement method

From Brendan Gregg: "Systems Performance" author

empty320x200.png 3T6mju_J.jpeg systemsperf_cover_123.jpeg

  • Most advice about benchmarking is also valid advice for profiling.
  • sources: https://pbs.twimg.com/profile_images/512451438572613632/3T6mju_J.jpeg, http://www.brendangregg.com/Images/systemsperf_cover_123.jpeg

Brendan Gregg: "Passive Benchmarking"

  1. Pick a benchmark tool.
  2. Run it with a variety of options.
  3. Make a slide deck of the results.
  4. Hand the slides to management.

Brendan Gregg: "Casual Benchmarking"

  • You benchmark A,
  • but actually measure B,
  • and conclude you've measured C

Brendan Gregg: "Active Benchmarking"


"You analyze performance while the benchmark is running – not just after it's done – using other tools."

Why? Confirm that you really understand:

  1. What does the test actually do?
  2. How does the system under test actually respond to the test?
  3. What are the true limiters of the system under test?
  4. What are the true limiters of the test itself?

Active Benchmarking -> Active profiling


"You analyze performance while the system is running – not just after it's done – using other tools."

Why? Confirm that you really understand:

  1. What does the workload actually do?
  2. How does the code under test actually respond to the workload?
  3. What are the true limiters of the code under test?
  4. What are the true limiters of the profiler tool itself?

From Martin Kjellin & Roberto Aloi: Profiling & Debugging Erlang Systems

Kjellin & Aloi: Profiling Mindset

  • Measure, don't guess.
  • Trust your measurements.
  • Measure before optimizing.
  • Know what you're measuring.

Lukas Larsson, Erlang VM Developer @ Erlang Solutions & Ericsson

empty320x200.png 1389811159312570lukas.jpg

"[Profiling is] Inherently difficult, you have to be aware of the entire stack."

sources: https://www.erlang-solutions.com/resources/webinars/understanding-erlang-scheduler (about minute 31), http://www.erlang-factory.com/euc2014/lukas-larsson

Garrett Smith: "Erlang is an operating system for your code"

  • You need to understand the major components of the system:
    • How OS threads are used by Erlang schedulers, async I/O pool, etc.
    • Memory management: allocation, garbage collection, VM allocators, …
    • Types of Erlang processes & what they do.
    • Message passing patterns: send/receive within a node? across nodes? Mailbox sizes
    • Erlang file I/O doesn't work like most other languages'
      • Message passing <-> file port driver!
      • Support for "diskless" mode is still there!
    • Data shared by ETS tables really isn't shared: it's copied on-demand in & out of ETS tables <-> Erlang process heaps
      • Except for "large" Erlang binary terms! ^_^

If you don't know where to start, try The USE Method

  1. Measure these three things:
    • Utilization (% busy/utilized)
    • Saturation rates (items queued due to 100% utilization)
    • Errors
  2. Find the largest source of errors, saturation, or utilization (see the shell sketch after this list for Erlang-level "U" & "S" signals).
  3. Re-focus USE method on that thing you found.
  4. Perhaps fix that bottleneck. Go to #1 if not fast enough.
  5. http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/
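
A minimal shell sketch of two of the three USE signals as seen from inside the Erlang VM itself: per-scheduler utilization ("U") and the run queue length ("S"). Only standard BIFs are used; the one-second sampling window is an arbitrary choice.

%% Enable scheduler wall-time accounting, sample twice, one second apart.
erlang:system_flag(scheduler_wall_time, true).
S0 = lists:sort(erlang:statistics(scheduler_wall_time)).
timer:sleep(1000).
S1 = lists:sort(erlang:statistics(scheduler_wall_time)).

%% Utilization = active time / total time, per scheduler, over the window.
[{Id, (A1 - A0) / (T1 - T0)} ||
    {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(S0, S1)].

%% Saturation hint: total length of all scheduler run queues right now.
erlang:statistics(run_queue).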

Kjellin & Aloi: What to look for!

  • OS level
    • memory usage, disk I/O & I/O wait, CPU usage, network usage,…
  • Erlang
    • elapsed wall clock time, message queues, reductions executed, garbage collection events, process scheduler events, … (a quick shell sketch follows)
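
A quick way to eyeball those Erlang-level numbers from the shell, using only standard BIFs (the "top 5 by message queue length" cutoff is an arbitrary choice):

%% Top 5 processes by message queue length.
Info = [{N, P} || P <- erlang:processes(),
                  {message_queue_len, N} <-
                      [erlang:process_info(P, message_queue_len)]].
lists:sublist(lists:reverse(lists:sort(Info)), 5).

%% Node-wide reduction and garbage collection counters.
erlang:statistics(reductions).           %% {TotalReductions, SinceLastCall}
erlang:statistics(garbage_collection).   %% {NumberOfGCs, WordsReclaimed, 0}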

Survey: Erlang profiling tools available today

  • Not enough time to talk about them all!
  • Use your favorite Web search tool to find more info about them.
  • Afterward, I'll focus on flame graphs

Survey: Inside OTP source & OTP docs:

  • cprof (call counts), eprof (time per function), fprof (time + call stacks)
  • dbg and the erlang:trace/trace_pattern BIFs
  • lcnt (VM lock contention), percept (concurrency profiling)
  • observer & etop (live system overview)

Survey: Outside of OTP but focused on Erlang

  • Erlang VM internals:
    • gprof, Valgrind, other C/assembly profilers, DTrace/SystemTap
  • Don't forget OS tools!
    • top, htop, sar, iostat, vmstat, strace, sysdig, …

Other measurement techniques

  • timer:tc/1, /2, /3 (see the sketch after this list)
    • Beware! Code typed at the Erlang shell is interpreted, so its timings differ from compiled code!
  • latency_histogram_tracer.erl
  • Adding dynamic tracing probes for DTrace/SystemTap measurements.
    • For example, dyntrace:pn(25,1) and dyntrace:pn(25,2) additions to your code.
    • Measure latencies with a D script between erlang$pid::user_probe-n25 probes with arg=1 vs. arg=2
    • Requires dyntrace support compiled into your VM
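
A minimal timer:tc sketch, runnable at the shell; lists:sort/1 and the reversed list are arbitrary stand-ins for the code you actually care about (and, per the warning above, put the real measurement inside a compiled module):

%% Wall-clock time of a single call, in microseconds.
L = lists:reverse(lists:seq(1, 100000)).
{Micros, _Sorted} = timer:tc(lists, sort, [L]).
Micros.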

latency_histogram_tracer example output

%% For example: create a histogram of call latencies
%% for bitcask:put/3. Watch for 10 seconds. units=usec

> latency_histogram_tracer:start(bitcask, put, 3, 10).
... wait 10 seconds ...
Histogram stats:     
[{min,0},
 {max,48},
 {arithmetic_mean,2.765411819271055},
 {median,3},
 {variance,3.5629207473971585},
 {standard_deviation,1.8875700642352746},
 {percentile,[{50,3},{75,4},{90,5},{95,6},{99,8},{999,14}]},
 {n,51746},
 {histogram,[{1,13436},
             {2,12304},
             {3,10789},
             {4,7397},
             {5,4191},

Variable CPU clock speed

  • You turned it off, didn't you?

Erlang tracing via dbg and Erlang BIFs

  • It's simple. And it's complex. It's both!
  • For each event, send a message to a single "event tracer" process/port.
    • The consumer/destination/sink for trace messages (minimal dbg sketch below)
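
A minimal dbg session as a sketch of how the pieces fit together; dbg starts the single tracer process for you, and lists:seq/2 is an arbitrary function to trace:

dbg:tracer().              %% start the single "event tracer" consumer
dbg:p(all, call).          %% every process becomes an event producer
dbg:tp(lists, seq, 2, []). %% trace global calls to lists:seq/2
lists:seq(1, 5).           %% generates one call trace event
dbg:stop_clear().          %% stop tracing and clear all trace patterns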

Erlang tracing: What can be an event producer?

  • Any Erlang process
  • Any Erlang port
  • The scheduler threads

Erlang tracing: What Erlang processes can generate events?

  • Any 1/N/all processes running right now
  • Any processes spawned after tracing starts
  • Any processes linked after tracing starts

Erlang tracing: Types of event messages

  • Calling a function (global or local)
    • filter based on:
      • module name
      • function name
      • # of arguments
      • Erlang-style pattern matching on the function's argument list! (match spec sketch after this list)
  • Sending/receiving a message
  • Process is starting/stopping GC
  • Process is running/stopped by scheduler thread
  • Process has exited
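
A sketch of that filtering with dbg and a literal match spec: trace calls to ets:lookup/2, but only when the key argument is the atom my_key (the module, arity, and key are arbitrary examples), and also report each matching call's return value:

dbg:tracer().
dbg:p(all, call).
%% Head ['_', my_key]: any table, key must be my_key.
%% Body {return_trace}: also emit a return_from event.
dbg:tpl(ets, lookup, 2, [{['_', my_key], [], [{return_trace}]}]).

ets:new(t, [named_table]).
ets:insert(t, {my_key, 1}).
ets:lookup(t, my_key).   %% traced
ets:lookup(t, other).    %% filtered out by the match spec head
dbg:stop_clear().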

What do trace messages look like?

%% Function call, no timestamp
{trace, Pid, call, MFA}

%% Function call, with timestamp (and optional extra binary() info)
{trace_ts, Pid, call, MFA, TS}
{trace_ts, Pid, call, MFA, BIN, TS}

%% Garbage collection events & stats, with timestamp
{trace_ts, Pid, gc_start, StatsTerm, TS}
{trace_ts, Pid, gc_end, StatsTerm, TS}

%% Process scheduler in & out events, with timestamp
{trace_ts, Pid, in, MFA, TS}
{trace_ts, Pid, out, MFA, TS}
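
For completeness, a sketch of a hand-rolled consumer using the raw BIFs instead of dbg: a plain Erlang process receives exactly the tuples shown above (the traced function lists:seq/2 and the io:format handling are arbitrary choices):

-module(tiny_tracer).
-export([start/0]).

start() ->
    Tracer = spawn(fun loop/0),
    %% All current and future processes produce call events, sent to Tracer.
    erlang:trace(all, true, [call, timestamp, {tracer, Tracer}]),
    erlang:trace_pattern({lists, seq, 2}, true, [global]),
    Tracer.

loop() ->
    receive
        {trace_ts, Pid, call, MFA, TS} ->
            io:format("~p called ~p at ~p~n", [Pid, MFA, TS]),
            loop();
        Other ->
            io:format("other trace message: ~p~n", [Other]),
            loop()
    end.

After tiny_tracer:start(), any call to lists:seq/2 anywhere on the node prints one line per call.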

Trace message examples

%% Full arguments of each traced function

    {trace_ts, <0.450.0>, call,
     {erlang,monitor,[<0.49.0>]}, {1426,225,821772}}

%% If 'arity' trace flag is specified, just arity #

    {trace_ts, <0.450.0>, call, {erlang,monitor,1},
     <<"Extra info here.....">>, {1426,225,821772}}
    {trace_ts, <0.450.0>, return_from,
     {erlang,monitor,1}, true, {1426,225,821774}}

Erlang tracing: Strengths

  • Turn on and off anytime
    • On developer's laptop
    • On production system!
  • The API is odd, but the pattern matching is powerful
  • "Event tracer"/consumer can execute any Erlang code
    • Turing complete, very flexible

Erlang tracing: Weaknesses

  • Turn on and off anytime
    • On production system! And forget to turn it off.
  • The API is frequently difficult to remember
    • That's the reason for so many tools to manage the low-level tracing API
  • Only 1 event tracer per VM
    • Easy to create "event tsunami" of millions of event messages
      • Killing entire VMs in production = not fun
    • TCP and file port tracers have lower overhead, but often not low enough (sketch after this list)
    • Capacity limit: 1 tracer = 1 CPU core of work available
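
A sketch of the file port tracer via dbg (the file name and traced function are arbitrary); trace events go straight to a binary file with no tracer process in the way, and can be replayed later with dbg:trace_client/3:

%% Write trace events to a file port instead of a tracer process.
dbg:tracer(port, dbg:trace_port(file, "/tmp/calls.trace")).
dbg:p(all, call).
dbg:tp(lists, seq, 2, []).
%% ... run the workload, then ...
dbg:stop_clear().

%% Later: replay the file through a handler fun.
dbg:trace_client(file, "/tmp/calls.trace",
                 {fun(Msg, Count) -> io:format("~p~n", [Msg]), Count + 1 end, 0}).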

The fprof profiling tool: Advantages

  • API is mostly easy to use (see the sketch after this list)
  • Measure wall-clock execution time of {Mod, Func, ArgList}
  • Easy to measure pure function {Mod, Func, ArgList} time!
  • Aware of call stacks:
    • Time executed by function F
    • Time executed by functions called by function F
  • Implemented by Erlang tracing
    • Use it in production, if you wish….
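
A sketch of the easy path through fprof (the profiled call and output file name are arbitrary): fprof:apply/3 runs the call under trace, fprof:profile/0 processes the raw trace file, and fprof:analyse/1 writes the report.

fprof:apply(lists, sort, [lists:reverse(lists:seq(1, 50000))]).
fprof:profile().                           %% read & process fprof.trace
fprof:analyse([{dest, "fprof.analysis"}]). %% write the call-stack report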

The fprof profiling tool: Disadvantages

  • Report format requires some practice to read correctly
    • erlgrind and eep have better (?) visualization tools
  • Not designed for multi-process measurements
    • Call stack interpretation can be ok or terrible
  • Implemented by Erlang tracing
    • High overhead: 1 message per function call or return
    • Only 1 consumer process/port
    • Can slow down a Riak system by 2-3 orders of magnitude!

Hyperbole: Flame Graphs are Awesome!

The "flame graph" profiling tool

Flame Graph Example

riak1.png

Source: www.snookles.com/scotttmp/eflame2/riak.1.svg

Flame Graph Example

riak1-smaller.png

  • Y axis: call stack depth
  • X axis: proportion of # of samples (or perhaps relative execution time)
    • Left-to-right order is alphabetical
    • Left-to-right order has nothing to do with passage of time

Flame Graph Example

riak1-smaller2.png

  • Y axis: call stack depth
  • X axis: proportion of # of samples (or perhaps relative execution time)
    • Left-to-right order is alphabetical
    • Left-to-right order has nothing to do with passage of time

Flame Graph of Experimental Riak 1

riak.0.png

Flame Graph of Experimental Riak 2

riak.1.png

Flame Graph of Experimental Riak 3

riak.2.png

Flame Graph of Experimental Riak 4

riak.3.png

Flame Graph of Experimental Riak 5

riak.10.png

Making flame graphs with Erlang

  • Method 1: profile the VM itself using standard profiling techniques + Gregg's flame graph toolkit
    • Good view into the VM's internal implementation
    • Not good view into Erlang code behavior
  • Method 2: Vlad Ki's eflame library
  • Method 3: Scott Lystig Fritchie's eflame2 library
    • https://github.com/slfritchie/eflame
    • Uses alternate call stack calculation
    • Can use Erlang tracing (all strengths & weaknesses)
    • Can use patched Erlang VM for much lower overhead cost
    • Easier-to-use API, perhaps, I hope?
    • Not always accurate call stack reporting

Steps to create a flame graph

  1. Start workload on system. (And keep it running!)
  2. Trigger measurement framework (previous slide)
  3. Convert raw measurement format to standard flame graph file format.
  4. Generate flame graph's native SVG format.
  5. Convert SVG to other formats.
    • Not all Web browsers have native SVG support yet {sigh}

Making flamegraphs with eflame2 library: getting ready

  1. git clone https://github.com/slfritchie/eflame.git
  2. cd eflame
  3. make
  4. Copy the ebin/eflame2.beam file to /tmp on the machine where you wish to generate a flame graph. (Call it the System Under Test or SUT.)
  5. Run code:add_pathz("/tmp"). at the Erlang CLI on the SUT.

Making flamegraphs with eflame2 library: collecting traces, make graph

  1. Start the workload on the SUT.
  2. Collect some traces:
    • For example, to gather trace events on all processes and all newly spawned processes, for all global function calls, for 10 seconds (10,000 milliseconds):
    • Run eflame2:write_trace(global_calls_plus_new_procs, "/tmp/ef.test.0", all, 10*1000).

Making flamegraphs with eflame2 library: collecting traces, make graph

  1. Convert the binary trace to an ASCII trace:
    • Run eflame2:format_trace("/tmp/ef.test.0", "/tmp/ef.test.0.out").
  2. Convert the ASCII trace to an SVG flame graph image:
    • Run at a BASH/C Shell/login shell:
      • cat /tmp/ef.test.0.out | flamegraph.riak-color.pl > output.svg

eflame2: Call stack calculation, version #1

  • Use the little-documented {process_dump} match spec action (installed as sketched after this list).
    • e.g., MatchSpec = [{'_',[],[{message,{process_dump}}]}]
  • Creates the same process summary text that's used for Erlang erl_crash.dump files at crash time!
  • On a Riak system, each trace message can be 15 KBytes or more!
  • Parse the Erlang function call info (and throw the rest away).
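
A sketch of installing that match spec with the raw tracing BIFs, tracing every local call in every process (exactly the kind of "event tsunami" warned about earlier); TracerPid is assumed to be an already-running consumer process, e.g. the loop from the tiny_tracer sketch shown earlier:

%% TracerPid: an existing consumer process (not shown here).
MatchSpec = [{'_', [], [{message, {process_dump}}]}].
erlang:trace_pattern({'_', '_', '_'}, MatchSpec, [local]).
erlang:trace(all, true, [call, timestamp, {tracer, TracerPid}]).
%% Each {trace_ts, Pid, call, MFA, Bin, TS} message now carries the
%% multi-KByte process dump text in Bin.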

eflame2: Call stack calculation, version #1 (verbose!)

=proc:<0.450.0>
State: Running
Spawned as: proc_lib:init_p/5
Last scheduled in for: gen:do_call/4
Spawned by: <0.151.0>
Started: Thu Mar 12 15:59:47 2015
Message queue length: 2
Message queue: [{'$gen_event',{riak_vnode_req_v1,1096126227998177188652763624537212264741949407232,{fsm,undefined,<0.16617.8>},{riak_kv_put_req_v1,{<<11 bytes>>,<<4 bytes>>},{r_object,<<11 bytes>>,<<4 bytes>>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<11 bytes>>,53,72,122,99,54,118,108,49,119,78,56,115,86,84,109,117,67,110,85,86,112,88]],[[<<5 bytes>>]],[],[[<<20 bytes>>|{1426,218816,52272}]],[],[]}}},<<4 bytes>>}],[{<<8 bytes>>,{19,63593438016}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined},101334680,63593438016,[]}}},{'$gen_event',{riak_vnode_req_v1,1096126227998177188652763624537212264741949407232,{fsm,undefined,<0.16618.8>},{riak_kv_put_req_v1,{<<11 bytes>>,<<4 bytes>>},{r_object,<<11 bytes>>,<<4 bytes>>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<11 bytes>>,50,80,116,110,68,53,82,119,105,85,65,56,113,77,51,103,85,121,80,54,72,76]],[[<<5 bytes>>]],[],[[<<20 bytes>>|{1426,218816,53566}]],[],[]}}},<<4 bytes>>}],[],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined},54710744,63593438016,[coord]}}}]
Number of heap fragments: 0
Heap fragment data: 0
Link list: []
Dictionary: [{#Ref<0.0.0.1416>,{bc_state,"./data/bitcask/1096126227998177188652763624537212264741949407232",{filestate,read_write,"./data/bitcask/1096126227998177188652763624537212264741949407232/1.bitcask.data",1,<0.1927.7>,<0.1929.7>,3479991403,35986},<<0 bytes>>,[{filestate,read_only,"./data/bitcask/1096126227998177188652763624537212264741949407232/1.bitcask.data",1,<0.4908.7>,undefined,0,0}],2147483648,[{expiry_secs,-1},{vnode_vclocks,true},{included_applications,[]},{allow_strfun,false},{reduce_js_vm_count,6},{anti_entropy_build_limit,{1,3600000}},{storage_backend,riak_kv_bitcask_backend},{hook_js_vm_count,2},{anti_entropy_leveldb_opts,[{write_buffer_size,4194304},{max_open_files,20}]},{listkeys_backpressure,true},{anti_entropy,{on,[]}},{mapred_name,"mapred"},{stats_urlpath,"stats"},{legacy_stats,true},{aae_throttle_limits,[{-1,0},{200,10},{500,50},{750,250},{900,1000},{1100,5000}]},{aae_throttle_sleep_time,0},{js_thread_stack,16},{anti_entropy_data_dir,"./data/anti_entropy"},{anti_entropy_expire,604800000},{warn_siblings,25},{anti_entropy_concurrency,2},{add_paths,[]},{fsm_limit,50000},{anti_entropy_tick,15000},{http_url_encoding,on},{map_js_vm_count,8},{object_format,v1},{mapred_2i_pipe,true},{warn_object_size,5000000},{js_max_vm_mem,8},{read_write,true}],<<0 bytes>>}},{bitcask_efile_port,#Port<0.11417>},{'$initial_call',{riak_core_vnode,init,1}},{bitcask_time_fudge,no_testing},{bitcask_file_mod,bitcask_file},{hashtree_tokens,35},{'$ancestors',[riak_core_vnode_sup,riak_core_sup,<0.148.0>]},{random_seed,{27839,21123,25074}}]
Reductions: 1941240
Stack+heap: 4181
OldHeap: 0
Heap unused: 1619
OldHeap unused: 0
Stack dump:
Program counter: 0x000000001c0c3ca8 (gen:do_call/4 + 576)
CP: 0x000000001c0c3c80 (gen:do_call/4 + 536)

0x0000000035f704b8 Return addr 0x000000001e81a220 (gen_server:call/3 + 128)
y(0)     #Ref<0.0.6.80948>
y(1)     'riak@127.0.0.1'
y(2)     Catch 0x000000001c0c3c90 (gen:do_call/4 + 552)
y(3)     infinity
y(4)     {file_pwrite,35986,[<<4 bytes>>,<<4 bytes>>,<<2 bytes>>,<<4 bytes>>,<<28 bytes>>,<<110 bytes>>]}
y(5)     '$gen_call'
y(6)     <0.1927.7>

[... lots more stuff omitted...]

eflame2: Call stack calculation, version #2

  • Patch the Erlang VM to provide an alternate way of generating a call stack report, without all of the extra/unwanted output that the process_dump match spec action produces.
  • Added a process_backtrace match spec action to write the stack trace only
  • Much faster, much smaller trace messages.
  • Use eflame2:write_trace_exp/4 or eflame2:write_trace_exp/6 to use the process_backtrace style trace generation.

Per-function-call tracing is expensive!

  • A Riak system, tracing all processes, can slow down 50x-300x!
  • Time-based sampling, as used by gprof or DTrace, has very low overhead and can (given enough time) generate statistically valid profiles.
  • But neither gprof nor DTrace can understand Erlang process stacks
    • View the VM's C internals and not individual Erlang process state
    • If you are a DTrace magician, please contact me!

Patching the VM for Scott's experimental hacks

  1. git clone https://github.com/slfritchie/otp.git
  2. cd otp
  3. git checkout slf-profiling-tracing-hack1-v16
  4. ./otp_build autoconf
  5. ./configure --all-your-favorite-build-flags...
  6. make && make install

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

  • Add new async thread that does nothing but:
    • Sleep periodically
    • Check global vars after waking up
    • If global vars say "yes", set flags in each scheduler thread's Process structure to ask for a tracing sample.

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

void
goofus_timer_loop(useconds_t usecs_to_sleep)
{
    int ix;

    for (;;) {
        usleep(usecs_to_sleep);
        if (goofus_timer_enabled) {
            /* Ask each scheduler to emit a stack sample at its next
             * opcode check (see GOOFUS_CHECK below). */
            for (ix = 0; ix <= erts_no_schedulers; ix++) {
                ERTS_SCHEDULER_IX(ix)->goofus_count = 42;
            }
        }
        /* Pick up any change to the sleep interval for the next round. */
        usecs_to_sleep = goofus_timer_sleep_msec * 1000;
    }
}

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

  • For each OpCase opcode inside of process_main()
    • Add a check to see if a time-based sample should be dumped now.
    • Added new field in Process structure.
    • If yes, write trace info directly to a file via fwrite(3)
#define GOOFUS_CHECK                               \
    if (ERTS_GET_SCHEDULER_DATA_FROM_PROC(c_p)     \
                              ->goofus_count != 0) \
        goofus_write_time_sample(c_p)
  • Add GOOFUS_CHECK to lots of places….

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

int goofus_write_time_sample2(Process *c_p)
{
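    /* Format an abbreviated stack dump for process c_p and append it to
     * the already-opened sample file goofus_fp; no-op if either is NULL. */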
    erts_dsprintf_buf_t *dsbufp = NULL;
    char pidbuf[64];
    Eterm esp;

    if (goofus_fp == NULL || c_p == NULL) { return 0; }

    /* Format the message */
    dsbufp = erts_create_tmp_dsbuf(0);
    dtrace_proc_str(c_p, pidbuf);
    erts_print(ERTS_PRINT_DSBUF, (void *) dsbufp, "Pid %s\n", pidbuf);
    erts_stack_dump_abbreviated(ERTS_PRINT_DSBUF, (void *) dsbufp, c_p);
    erts_print(ERTS_PRINT_DSBUF, (void *) dsbufp, ".\n");
    fwrite(dsbufp->str, 1, dsbufp->str_len, goofus_fp);
    erts_destroy_tmp_dsbuf(dsbufp);
    return 1;
}

VM hacking: weaknesses of the beam_emu.c opcode technique

  • Add extra (ugly) statements to each opcode's implementation
    • Could hide ugliness with even more macros! ^_^
  • Cannot "see" activity such as GC or port deferred work
  • Cannot "see" async file I/O activity directly
  • Might bias samples toward the functions that run immediately after a process is scheduled in? (Needs more investigation)
  • Very low overhead, even when sampling every few milliseconds
  • Appears to be more accurate (most of the time) than per-function trace events.

eflame2: Call stack calculation, version #3

  1. Clone & compile & install the Erlang/OTP repo described above.
  2. Start a workload on the SUT to be observed.
  3. Run the following at the Erlang shell:

VM hacking: using time-sampling for flame graphs

%% Open a trace file, /tmp/goofus.{OS-pid-number}.out
true = erlang:system_flag(debug_flags, open).

%% Set trace collection for every 7 milliseconds, then sleep for
%% about 1 second to wait for the change to take place.
erlang:system_flag(debug_flags, 7). timer:sleep(1000).

%% Turn off Erlang-trace-status-sensitive option and
%% turn on the asynchronous loop to request new samples
erlang:system_flag(debug_flags, disable_trace).
erlang:system_flag(debug_flags, asynchronous).

%% Let the workload run for some period of time.  Execute
%% the workload here if necessary.  Or wait a while with sleep.
timer:sleep(5*1000).

%% Turn off the tracing
erlang:system_flag(debug_flags, asynchronous).
true = erlang:system_flag(debug_flags, close).

VM hacking: using time-sampling for flame graphs

  1. Convert the trace file to flame-graph-ready input and then to an SVG flame graph.
/path/to/runtime_tools/priv/parse_goofus.pl \
    < /tmp/goofus.PID.out | sort | uniq -c | \
    awk '{ print $2, " ", $1}' | \
    flamegraph.riak-color.pl > /tmp/flame-graph.svg

What does the trace file look like?

Pid <0.34.0>
Program counter: 0x000000001a5b3ba8 (timer:sleep/1 + 64)
CP: 0x0000000000000000 (invalid)

0x1a201eb8 Return addr 0x1a4f43a0 (erl_eval:do_apply/6 + 376)
0x1a201ec8 Return addr 0x1a4b3d28 (shell:exprs/7 + 680)
0x1a201ee0 Return addr 0x1a4b3430 (shell:eval_exprs/7 + 144)
0x1a201f38 Return addr 0x1a4b3028 (shell:eval_loop/3 + 568)
0x1a201f70 Return addr 0x17fc8008 (<terminate process normally>)
.
Pid <0.7238.0>
[... repeat...]

Tutorial example, using Riak as a System Under Test

Reprise: Use "active benchmarking" / "active profiling"

  • Don't assume function X is being called: confirm it
  • Don't assume function X is fast: confirm it
    • These measurement techniques have flaws
    • Confirmation from multiple techniques builds confidence