Actively measuring and profiling Erlang code

S. L. Fritchie

scott@basho.com

Introducing myself

Scott Lystig Fritchie
Senior Software Engineer at Basho Japan KK
Email: <scott@basho.com>
Twitter: @slfritchie
UNIX, Erlang and me:
- UNIX sysadmin & programmer since 1986
- Erlang programmer since 2000
- Publications by USENIX and the ACM
- Speaker at Erlang User Conference and Erlang Factory

Introducing today's presentation

Today's presentation: http://twitter.com/slfritchie
- Long term: http://www.snookles.com/erlang/ef2015/
Today's flame graphs: https://github.com/slfritchie/eflame
Today's Erlang hacks: https://github.com/slfritchie/otp
Flame graph home: https://github.com/brendangregg/FlameGraph
- General introduction: http://www.brendangregg.com/flamegraphs.html

Today's focus: profiling Erlang code

You have identified code to profile, based on "USE Method" or other measurement-based methodology.
- Examine a specific module
- Examine a specific subsystem/OTP application
- Not sure where to start: examine an entire system under a specific workload.
Guesses are wrong 127% of the time.
Be skeptical: doubt your measurements, verify them!
- Multiple methods
- Experiment while your test is running!
Beard definitely not required!

Today's focus: GOAL for profiling Erlang code

Identify the Erlang function(s) that is consuming the most CPU resource.
- Either wall-clock time or "on scheduler" time
- Depends on the measurement method

From Brendan Gregg: "Systems Performance" author

Most advice about benchmarking is also valid advice for profiling.
source: https:

//pbs.twimg.com/profile_images/512451438572613632/3T6mju_J.jpeg, http: //www.brendangregg.com/Images/systemsperf_cover_123.jpeg

Brendan Gregg: "Passive Benchmarking"

Pick a benchmark tool.
Run it with a variety of options.
Make a slide deck of the results.
Hand the slides to management.

Brendan Gregg: "Casual Benchmarking"

You benchmark A,
but actually measure B,
and conclude you've measured C

Brendan Gregg: "Active Benchmarking"

"You analyze performance while the benchmark is running – not just after it's done – using other tools."

Why? Confirm that you really understand:

What does the test actually do?
How the system under test actually respond to the test?
What are the true limiters of the system under test?
What are the true limiters of the test itself?

Active Benchmarking -> Active profiling

"You analyze performance while the system is running – not just after it's done – using other tools."

Why? Confirm that you really understand:

What does the workload actually do?
How the code under test actually respond to the workload?
What are the true limiters of the code under test?
What are the true limiters of the profiler tool itself?

From Martin Kjellin & Roberto Aloi: Profiling & Debugging Erlang Systems

sources: https://vimeo.com/99011731, http://www.erlang-factory.com/euc2014/roberto-aloi, http://www.erlang-factory.com/euc2014/martin-kjellin

Kjellin & Aloi: Profiling Mindset

Measure, don't guess.
Trust your measurements.
Measure before optimizing.
Know what you're measuring.

Lukas Larsson, Erlang VM Developer @ Erlang Solutions & Ericsson

"[Profiling is] Inherently difficult, you have to be aware of the entire stack."

sources: https://www.erlang-solutions.com/resources/webinars/understanding-erlang-scheduler (about minute 31), http://www.erlang-factory.com/euc2014/lukas-larsson

Garrett Smith: "Erlang is an operating system for your code"

You need to understand the major components of the system:
- How OS threads are used by Erlang schedulers, async I/O pool, etc.
- Memory management: allocation, garbage collection, VM allocators, …
- Types of Erlang processes & what they do.
- Message passing patterns: send/receive within a node? across nodes? Mailbox sizes
- Erlang file I/O doesn't work like most other languages'
  - Message passing <-> file port driver!
  - Support for "diskless" mode is still there!
- Data shared by ETS tables really isn't shared: it's copied on-demand in & out of ETS tables <-> Erlang process heaps
  - Except for "large" Erlang binary terms! ^_^

If you don't know where to start, try The USE Method

Measure these three things:
- Utilization (% busy/utilized)
- Saturation rates (items queued due to 100% utilization)
- Errors
Find largest source of errors, or saturation, or utilization
Re-focus USE method on that thing you found.
Perhaps fix that bottleneck. Go to #1 if not fast enough.
http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/

Kjellin & Aloi: What to look for!

OS level
- memory usage, disk I/O & I/O wait, CPU usage, network usage,…
Erlang
- elapsed wall clock time, message queues, reductions executed, garbage collection events, process scheduler events, …

Survey: Erlang profiling tools available today

Not enough time to talk about them all!
Use your favorite Web search tool to find more info about them.
Afterward, I'll focus on flame graphs

Survey: Inside OTP source & OTP docs:

cprof and epof and fprof
etop and pman and observer
ttb
percept and percept2 (http://www.erlang-factory.com/upload/presentations/808/Percept2.pdf)
instrument
lcnt
"Efficiency Guide User's Guide" in OTP docs

Survey: Outside of OTP but focused on Erlang

https://github.com/massemanet/eper package: redbug and dtop
https://github.com/ferd/recon package and his "Stuff Goes Bad: Erlang in Anger" ebook
riak_sysmon & cluster_info @ https://github.com/basho/
https://github.com/virtan/eep + kcachegrind
https://github.com/isacssouza/erlgrind

Erlang VM internals:
- gprof, Valgrind, other C/assembly profilers, DTrace/SystemTap
Don't forget OS tools!
- top, htop, sar, iostat, vmstat, strace, sysdig, …

Other measurement techniques

timer:tc/1, /2, /3
- Beware! CLI code is interpreted differently than compiled code!
latency_histogram_tracer.erl
- Create detailed histogram stats on latencies of Mod:Fun/Arity calls.
- https://gist.github.com/slfritchie/159a8ce1f49fc03c77c6
- Requires folsom package for histograms
Adding dynamic tracing probes for DTrace/SystemTap measurements.
- For example, dyntrace:pn(25,1) and dyntrace:pn(25,2) additions to your code.
- Measure latencies with a D script between erlang$pid::user_probe-n25 probes with arg=1 vs. arg=2
- Requires dyntrace support compiled into your VM

`latency_histogram_tracer` example output

%% For example: create a histogram of call latencies
%% for bitcask:put/3. Watch for 10 seconds. units=usec

> latency_histogram_tracer:start(bitcask, put, 3, 10).
... wait 10 seconds ...
Histogram stats:     
[{min,0},
 {max,48},
 {arithmetic_mean,2.765411819271055},
 {median,3},
 {variance,3.5629207473971585},
 {standard_deviation,1.8875700642352746},
 {percentile,[{50,3},{75,4},{90,5},{95,6},{99,8},{999,14}]},
 {n,51746},
 {histogram,[{1,13436},
             {2,12304},
             {3,10789},
             {4,7397},
             {5,4191},

Variable CPU clock speed

You turned it off, didn't you?

Erlang tracing via `dbg` and `erlang` BIFs.

It's simple. And it's complex. It's both!
For each event, send a message to a single "event tracer" process/port.
- The consumer/destination/sink for trace messages

Erlang tracing: What can be an event producer?

Any Erlang process
Any Erlang port
The scheduler threads

Erlang tracing: What Erlang processes can generate events?

Any 1/N/all processes running right now
Any processes spawned after tracing starts
Any processes linked after tracing starts

Erlang tracing: Types of event messages

Calling a function (global or local)
- filter based on:
  - module name
  - function name
  - # of arguments
  - Erlang-style pattern matching on func's argument list!
Sending/receiving a message
Process is starting/stopping GC
Process is running/stopped by scheduler thread
Process has exited

What do trace messages look like?

%% Function call, no timestamp
{trace, Pid, call, MFA}

% Function call, with timestamp w/extra binary() info
{trace_ts, Pid, call, MFA, TS}
{trace_ts, Pid, call, MFA, BIN, TS}

% Garbage collection events & stats, with timestamp
{trace_ts, Pid, gc_start, StatsTerm, TS}
{trace_ts, Pid, gc_end, StatsTerm, TS}

% Process scheduler in & out events, with timestamp
{trace_ts, Pid, in, MFA, TS}
{trace_ts, Pid, out, MFA, TS}

Trace message examples

%% Full arguments of each traced function

    {trace_ts, <0.450.0>, call,
     {erlang,monitor,[<0.49.0>]}, {1426,225,821772}}

%% If 'arity' trace flag is specified, just arity #

    {trace_ts, <0.450.0>, call, {erlang,monitor,1},
     <<"Extra info here.....">>, {1426,225,821772}}
    {trace_ts, <0.450.0>, return_from,
     {erlang,monitor,1}, true, {1426,225,821774}}

Erlang tracing: Strengths

Turn on and off anytime
- On developer's laptop
- On production system!
The API is odd, but the pattern matching is powerful
"Event tracer"/consumer can execute any Erlang code
- Turing complete, very flexible

Erlang tracing: Weaknesses

Turn on and off anytime
- On production system! And forget to turn it off.
The API is frequently difficult to remember
- That's the reason for so many tools to manage the low-level tracing API
Only 1 event tracer per VM
- Easy to create "event tsunami" of millions of event messages
  - Killing entire VMs in production = not fun
- TCP and file port tracers have lower overhead but not low enough
- Capacity limit: 1 tracer = 1 CPU core of work available

The fprof profiling tool: Advantages

API is mostly easy-to-use
Measure wall-clock execution time of {Mod, Func, ArgList}
Easy to measure pure function {Mod, Func, ArgList} time!
Aware of call stacks:
- Time executed by function F
- Time executed by functions called by function F
Implemented by Erlang tracing
- Use it in production, if you wish….

The fprof profiling tool: Disadvantages

Report format requires some practice to read correctly
- erlgrind and eep have better (?) visualization tools
Not designed for multi-process measurements
- Call stack interpretation can be ok or terrible
Implemented by Erlang tracing
- High overhead: 1 message per function call or return
- Only 1 consumer process/port
- Can slow down a Riak system by 2-3 orders of magnitude!

Hyperbole: Flame Graphs are Awesome!

Source: https://twitter.com/mranney/status/578056023841230848

The "flame graph" profiling tool

Invented by Sun Microsystems alumni as an alternative way of visualizing profiling data with call stacks.
Originally based DTrace + Solaris
Very quickly ported to many other platforms
Links:
- http://www.brendangregg.com/flamegraphs.html
- https://github.com/brendangregg/FlameGraph

Flame Graph Example

Source: www.snookles.com/scotttmp/eflame2/riak.1.svg

Flame Graph Example

Y axis: call stack depth
X axis: proportion of # of samples (or perhaps relative execution time)
- Left-to-right order is alphabetical
- Left-to-right order has nothing to do with passage of time

Flame Graph Example

Y axis: call stack depth
X axis: proportion of # of samples (or perhaps relative execution time)
- Left-to-right order is alphabetical
- Left-to-right order has nothing to do with passage of time

Flame Graph of Experimental Riak 1

riak.0.png

Flame Graph of Experimental Riak 2

riak.1.png

Flame Graph of Experimental Riak 3

riak.2.png

Flame Graph of Experimental Riak 4

riak.3.png

Flame Graph of Experimental Riak 5

riak.10.png

Making flame graphs with Erlang

Method 1: profile the VM itself using standard profiling techniques + Gregg's flame graph toolkit
- Good view into the VM's internal implementation
- Not good view into Erlang code behavior
Method 2: Vlad Ki's eflame library
- https://github.com/proger/eflame
- Uses Erlang tracing (all strengths & weaknesses)
- Not always accurate call stack reporting
Method 3: Scott Lystig Fritchie's eflame2 library
- https://github.com/slfritchie/eflame
- Uses alternate call stack calculation
- Can use Erlang tracing (all strengths & weaknesses)
- Can use patched Erlang VM for much lower overhead cost
- Easier-to-use API, perhaps, I hope?
- Not always accurate call stack reporting

Steps to create a flame graph

Start workload on system. (And keep it running!)
Trigger measurement framework (previous slide)
Convert raw measurement format to standard flame graph file format.
Generate flame graph's native SVG format.
Convert SVG to other formats.
- Not all Web browsers have native SVG support yet {sigh}

Making flamegraphs with `eflame2` library: getting ready

git clone https://github.com/slfritchie/eflame.git
cd eflame
make
Copy the ebin/eflame2.beam file to /tmp on whatever machine it is that you wish to use to generate a flame graph. (Call it the System Under Test or SUT.)
Run code:add_pathz("/tmp"). at the Erlang CLI on the SUT.

Making flamegraphs with `eflame2` library: collecting traces, make graph

Start the workload on the SUT.
Collect some traces:
- For example, to gather trace events on all processes and all newly-spawned processes for all global function calls for 10 seconds (10,000 milliseconds).
- Run eflame2:write_trace( global_calls_plus_new_procs, "/tmp/ef.test.0", all, 10*1000).

Making flamegraphs with `eflame2` library: collecting traces, make graph

Convert the binary trace to an ASCII trace:
- Run eflame2:format_trace("/tmp/ef.test.0", "/tmp/ef.test.0.out").
Convert the ASCII trace to an SVG flame graph image:
- Run at a BASH/C Shell/login shell:
  - cat /tmp/ef.test.0.out | flamegraph.riak-color.pl > output.svg

`eflame2`: Call stack calculation, version #1

Use little-documented [process_dump] match spec action.
- e.g., MatchSpec = [{'_',[],[{message,{process_dump}}]}]
Create same process summary text that's used for Erlang erl_core.dump files at crash time!
On a Riak system, each trace message can be 15 KBytes or more!
Parse the Erlang function call info (and throw the rest away).

`eflame2`: Call stack calculation, version #1 (verbose!)

=proc:<0.450.0>
State: Running
Spawned as: proc_lib:init_p/5
Last scheduled in for: gen:do_call/4
Spawned by: <0.151.0>
Started: Thu Mar 12 15:59:47 2015
Message queue length: 2
Message queue: [{'$gen_event',{riak_vnode_req_v1,1096126227998177188652763624537212264741949407232,{fsm,undefined,<0.16617.8>},{riak_kv_put_req_v1,{<<11 bytes>>,<<4 bytes>>},{r_object,<<11 bytes>>,<<4 bytes>>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<11 bytes>>,53,72,122,99,54,118,108,49,119,78,56,115,86,84,109,117,67,110,85,86,112,88]],[[<<5 bytes>>]],[],[[<<20 bytes>>|{1426,218816,52272}]],[],[]}}},<<4 bytes>>}],[{<<8 bytes>>,{19,63593438016}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined},101334680,63593438016,[]}}},{'$gen_event',{riak_vnode_req_v1,1096126227998177188652763624537212264741949407232,{fsm,undefined,<0.16618.8>},{riak_kv_put_req_v1,{<<11 bytes>>,<<4 bytes>>},{r_object,<<11 bytes>>,<<4 bytes>>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<11 bytes>>,50,80,116,110,68,53,82,119,105,85,65,56,113,77,51,103,85,121,80,54,72,76]],[[<<5 bytes>>]],[],[[<<20 bytes>>|{1426,218816,53566}]],[],[]}}},<<4 bytes>>}],[],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined},54710744,63593438016,[coord]}}}]
Number of heap fragments: 0
Heap fragment data: 0
Link list: []
Dictionary: [{#Ref<0.0.0.1416>,{bc_state,"./data/bitcask/1096126227998177188652763624537212264741949407232",{filestate,read_write,"./data/bitcask/1096126227998177188652763624537212264741949407232/1.bitcask.data",1,<0.1927.7>,<0.1929.7>,3479991403,35986},<<0 bytes>>,[{filestate,read_only,"./data/bitcask/1096126227998177188652763624537212264741949407232/1.bitcask.data",1,<0.4908.7>,undefined,0,0}],2147483648,[{expiry_secs,-1},{vnode_vclocks,true},{included_applications,[]},{allow_strfun,false},{reduce_js_vm_count,6},{anti_entropy_build_limit,{1,3600000}},{storage_backend,riak_kv_bitcask_backend},{hook_js_vm_count,2},{anti_entropy_leveldb_opts,[{write_buffer_size,4194304},{max_open_files,20}]},{listkeys_backpressure,true},{anti_entropy,{on,[]}},{mapred_name,"mapred"},{stats_urlpath,"stats"},{legacy_stats,true},{aae_throttle_limits,[{-1,0},{200,10},{500,50},{750,250},{900,1000},{1100,5000}]},{aae_throttle_sleep_time,0},{js_thread_stack,16},{anti_entropy_data_dir,"./data/anti_entropy"},{anti_entropy_expire,604800000},{warn_siblings,25},{anti_entropy_concurrency,2},{add_paths,[]},{fsm_limit,50000},{anti_entropy_tick,15000},{http_url_encoding,on},{map_js_vm_count,8},{object_format,v1},{mapred_2i_pipe,true},{warn_object_size,5000000},{js_max_vm_mem,8},{read_write,true}],<<0 bytes>>}},{bitcask_efile_port,#Port<0.11417>},{'$initial_call',{riak_core_vnode,init,1}},{bitcask_time_fudge,no_testing},{bitcask_file_mod,bitcask_file},{hashtree_tokens,35},{'$ancestors',[riak_core_vnode_sup,riak_core_sup,<0.148.0>]},{random_seed,{27839,21123,25074}}]
Reductions: 1941240
Stack+heap: 4181
OldHeap: 0
Heap unused: 1619
OldHeap unused: 0
Stack dump:
Program counter: 0x000000001c0c3ca8 (gen:do_call/4 + 576)
CP: 0x000000001c0c3c80 (gen:do_call/4 + 536)

0x0000000035f704b8 Return addr 0x000000001e81a220 (gen_server:call/3 + 128)
y(0)     #Ref<0.0.6.80948>
y(1)     'riak@127.0.0.1'
y(2)     Catch 0x000000001c0c3c90 (gen:do_call/4 + 552)
y(3)     infinity
y(4)     {file_pwrite,35986,[<<4 bytes>>,<<4 bytes>>,<<2 bytes>>,<<4 bytes>>,<<28 bytes>>,<<110 bytes>>]}
y(5)     '$gen_call'
y(6)     <0.1927.7>

[... lots more stuff omitted...]

`eflame2`: Call stack calculation, version #2

Patch the Erlang VM to provide an alternate method of generating a call stack report, without all of the extra/unwanted reporting stuff that the process_dump match spec.
Added process_backtrace match spec to write stack trace only
Much faster, much smaller trace messages.
Use eflame2:write_trace_exp/4 or eflame2:write_trace_exp/6 to use the process_backtrace style trace generation.

Per-function-call tracing is expensive!

A Riak system, tracing all processes, can slow down 50x-300x!
Time-based sampling, like gprof or DTrace use, have very low overhead and can (given enough time) generate statistically valid profiles.
But neither gprof nor DTrace can understand Erlang process stacks
- View the VM's C internals and not individual Erlang process state
- If you are a DTrace magician, please contact me!

Patching the VM for Scott's experimental hacks

git clone https://github.com/slfritchie/otp.git
cd otp
git checkout slf-profiling-tracing-hack1-v16
./otp_build autoconf
./configure --all-your-favorite-build-flags...
make && make install

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

Add new async thread that does nothing but:
- Sleep periodically
- Check global vars after waking up
- If global vars say "yes", set flags in each scheduler thread's Process structure to ask for a tracing sample.

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

void
goofus_timer_loop(useconds_t usecs_to_sleep)
{
    int ix;

    usleep(usecs_to_sleep);
    if (goofus_timer_enabled) {
        for (ix = 0; ix <= erts_no_schedulers; ix++) {
            ERTS_SCHEDULER_IX(ix)->goofus_count = 42;
        }
    }
    goofus_timer_loop(goofus_timer_sleep_msec * 1000);
}

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

For each OpCase opcode inside of process_main()
- Add a check to see if a time-based sample should be dumped now.
- Added new field in Process structure.
- If yes, write trace info directly to a file via fwrite(3)

#define GOOFUS_CHECK 
    if (ERTS_GET_SCHEDULER_DATA_FROM_PROC(c_p)     \
                              ->goofus_count != 0) \
        goofus_write_time_sample(c_p)

Add GOOFUS_CHECK to lots of places….

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

int goofus_write_time_sample2(Process *c_p)
{
    erts_dsprintf_buf_t *dsbufp = NULL;
    char pidbuf[64];
    Eterm esp;

    if (goofus_fp == NULL || c_p == NULL) { return 0; }

    /* Format the message */
    dsbufp = erts_create_tmp_dsbuf(0);
    dtrace_proc_str(c_p, pidbuf);
    erts_print(ERTS_PRINT_DSBUF, (void *) dsbufp, "Pid %s\n", pidbuf);
    erts_stack_dump_abbreviated(ERTS_PRINT_DSBUF, (void *) dsbufp, c_p);
    erts_print(ERTS_PRINT_DSBUF, (void *) dsbufp, ".\n");
    fwrite(dsbufp->str, 1, dsbufp->str_len, goofus_fp);
    erts_destroy_tmp_dsbuf(dsbufp);
    return 1;
}

VM hacking: weaknesses of the beam_emu.c's opcode technique

Add extra (ugly) statements to each opcode's implementation
- Could hide ugliness with even more macros! ^_^
Cannot "see" activity such as GC or port deferred work
Cannot "see" async file I/O activity directly
Might add bias toward functions immediately after schedule in by scheduler? (Needs more investigation)
Very low overhead, even when sampling every few milliseconds
Appears to be more accurate (most of the time) than per-function trace events.

`eflame2`: Call stack calculation, version #3

Clone & compile & install the Erlang/OTP repo described above.
Start a workload on the SUT to be observed.
Run the following at the Erlang shell:

VM hacking: using time-sampling for flame graphs

%% Open a trace file, /tmp/goofus.{OS-pid-number}.out
true = erlang:system_flag(debug_flags, open).

%% Set trace collection for every 7 milliseconds, then sleep for
%% about 1 second to wait for the change to take place.
erlang:system_flag(debug_flags, 7). timer:sleep(1000).

%% Turn off Erlang-trace-status-sensitive option and
%% turn on the asynchronous loop to request new samples
erlang:system_flag(debug_flags, disable_trace).
erlang:system_flag(debug_flags, asynchronous).

%% Let the workload run for some period of time.  Execute
%% the workload here if necessary.  Or wait a while with sleep.
timer:sleep(5*1000).

%% Turn off the tracing
erlang:system_flag(debug_flags, asynchronous).
true = erlang:system_flag(debug_flags, close).

VM hacking: using time-sampling for flame graphs

Convert the trace file to flame-graph-ready input and then to an SVG flame graph.

/path/to/runtime_tools/priv/parse_goofus.pl \
    < /tmp/goofus.PID.out | sort | uniq -c | \
    awk '{ print $2, " ", $1}' | \
    flamegraph.riak-color.pl > /tmp/flame-graph.svg

What does the trace file look like?

Pid <0.34.0>
Program counter: 0x000000001a5b3ba8 (timer:sleep/1 + 64)
CP: 0x0000000000000000 (invalid)

0x1a201eb8 Return addr 0x1a4f43a0 (erl_eval:do_apply/6 + 376)
0x1a201ec8 Return addr 0x1a4b3d28 (shell:exprs/7 + 680)
0x1a201ee0 Return addr 0x1a4b3430 (shell:eval_exprs/7 + 144)
0x1a201f38 Return addr 0x1a4b3028 (shell:eval_loop/3 + 568)
0x1a201f70 Return addr 0x17fc8008 (<terminate process normally>)
.
Pid <0.7238.0>
[... repeat...]

Tutorial example, using Riak as a System Under Test

See https://github.com/slfritchie/eflame/blob/feature/alt-stack-method/README-Riak-Example.md for text and tutorial ideas here.

Reprise: Use "active benchmarking" / "active profiling"

Don't assume function X is being called: confirm it
Don't assume function X is fast: confirm it
- These measurement techniques have flaws
- Confirmation from multiple techniques builds confidence

S. L. Fritchie - Actively measuring and profiling Erlang code

Actively measuring and profiling Erlang code

S. L. Fritchie

scott@basho.com

Introducing myself

Introducing today's presentation

Today's focus: profiling Erlang code

Today's focus: GOAL for profiling Erlang code

From Brendan Gregg: "Systems Performance" author

Brendan Gregg: "Passive Benchmarking"

Brendan Gregg: "Casual Benchmarking"

Brendan Gregg: "Active Benchmarking"

Active Benchmarking -> Active profiling

From Martin Kjellin & Roberto Aloi: Profiling & Debugging Erlang Systems

Kjellin & Aloi: Profiling Mindset

Lukas Larsson, Erlang VM Developer @ Erlang Solutions & Ericsson

Garrett Smith: "Erlang is an operating system for your code"

If you don't know where to start, try The USE Method

Kjellin & Aloi: What to look for!

Survey: Erlang profiling tools available today

Survey: Inside OTP source & OTP docs:

Survey: Outside of OTP but focused on Erlang

Other measurement techniques

latency_histogram_tracer example output

Variable CPU clock speed

Erlang tracing via dbg and erlang BIFs.

Erlang tracing: What can be an event producer?

Erlang tracing: What Erlang processes can generate events?

Erlang tracing: Types of event messages

What do trace messages look like?

Trace message examples

Erlang tracing: Strengths

Erlang tracing: Weaknesses

The fprof profiling tool: Advantages

The fprof profiling tool: Disadvantages

Hyperbole: Flame Graphs are Awesome!

The "flame graph" profiling tool

Flame Graph Example

Flame Graph Example

Flame Graph Example

Flame Graph of Experimental Riak 1

Flame Graph of Experimental Riak 2

Flame Graph of Experimental Riak 3

Flame Graph of Experimental Riak 4

Flame Graph of Experimental Riak 5

Making flame graphs with Erlang

Steps to create a flame graph

Making flamegraphs with eflame2 library: getting ready

Making flamegraphs with eflame2 library: collecting traces, make graph

Making flamegraphs with eflame2 library: collecting traces, make graph

eflame2: Call stack calculation, version #1

eflame2: Call stack calculation, version #1 (verbose!)

eflame2: Call stack calculation, version #2

Per-function-call tracing is expensive!

Patching the VM for Scott's experimental hacks

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

VM hacking: adding extra code to beam_emu.c's opcode interpreter loop

VM hacking: weaknesses of the beam_emu.c's opcode technique

eflame2: Call stack calculation, version #3

VM hacking: using time-sampling for flame graphs

VM hacking: using time-sampling for flame graphs

What does the trace file look like?

Tutorial example, using Riak as a System Under Test

Reprise: Use "active benchmarking" / "active profiling"

`latency_histogram_tracer` example output

Erlang tracing via `dbg` and `erlang` BIFs.

Making flamegraphs with `eflame2` library: getting ready

Making flamegraphs with `eflame2` library: collecting traces, make graph

Making flamegraphs with `eflame2` library: collecting traces, make graph

`eflame2`: Call stack calculation, version #1

`eflame2`: Call stack calculation, version #1 (verbose!)

`eflame2`: Call stack calculation, version #2

`eflame2`: Call stack calculation, version #3