Introducing myself
- Scott Lystig Fritchie
- Senior Software Engineer at Basho Japan KK
- Email: <scott@basho.com>
- Twitter:
@slfritchie
- UNIX, Erlang and me:
- UNIX sysadmin & programmer since 1986
- Erlang programmer since 2000
- Publications by USENIX and the ACM
- Speaker at Erlang User Conference and Erlang Factory
Introducing today's presentation
- Today's presentation: http://twitter.com/slfritchie
- Long term: http://www.snookles.com/erlang/ef2015/
- Today's flame graphs: https://github.com/slfritchie/eflame
- Today's Erlang hacks: https://github.com/slfritchie/otp
- Flame graph home: https://github.com/brendangregg/FlameGraph
- General introduction: http://www.brendangregg.com/flamegraphs.html
Today's focus: profiling Erlang code
- You have identified code to profile, based on "USE Method" or
other measurement-based methodology.
- Examine a specific module
- Examine a specific subsystem/OTP application
- Not sure where to start: examine an entire system under a specific workload.
- Guesses are wrong 127% of the time.
- Be skeptical: doubt your measurements, verify them!
- Multiple methods
- Experiment while your test is running!
- Beard definitely not required!
Today's focus: GOAL for profiling Erlang code
- Identify the Erlang function(s) that is consuming the most CPU resource.
- Either wall-clock time or "on scheduler" time
- Depends on the measurement method
From Brendan Gregg: "Systems Performance" author
- Most advice about benchmarking is also valid advice for profiling.
- source: https:
//pbs.twimg.com/profile_images/512451438572613632/3T6mju_J.jpeg, http: //www.brendangregg.com/Images/systemsperf_cover_123.jpeg
Brendan Gregg: "Passive Benchmarking"
- Pick a benchmark tool.
- Run it with a variety of options.
- Make a slide deck of the results.
- Hand the slides to management.
Brendan Gregg: "Casual Benchmarking"
- You benchmark A,
- but actually measure B,
- and conclude you've measured C
Brendan Gregg: "Active Benchmarking"
.
"You analyze performance while the benchmark is running – not just after it's done – using other tools."
Why? Confirm that you really understand:
- What does the test actually do?
- How the system under test actually respond to the test?
- What are the true limiters of the system under test?
- What are the true limiters of the test itself?
Active Benchmarking -> Active profiling
.
"You analyze performance while the system is running – not just after it's done – using other tools."
Why? Confirm that you really understand:
- What does the workload actually do?
- How the code under test actually respond to the workload?
- What are the true limiters of the code under test?
- What are the true limiters of the profiler tool itself?
From Martin Kjellin & Roberto Aloi: Profiling & Debugging Erlang Systems
Kjellin & Aloi: Profiling Mindset
- Measure, don't guess.
- Trust your measurements.
- Measure before optimizing.
- Know what you're measuring.
Lukas Larsson, Erlang VM Developer @ Erlang Solutions & Ericsson
"[Profiling is] Inherently difficult, you have to be aware of the entire stack."
sources: https://www.erlang-solutions.com/resources/webinars/understanding-erlang-scheduler (about minute 31), http://www.erlang-factory.com/euc2014/lukas-larsson
Garrett Smith: "Erlang is an operating system for your code"
- You need to understand the major components of the system:
- How OS threads are used by Erlang schedulers, async I/O pool, etc.
- Memory management: allocation, garbage collection, VM allocators, …
- Types of Erlang processes & what they do.
- Message passing patterns: send/receive within a node? across nodes? Mailbox sizes
- Erlang file I/O doesn't work like most other languages'
- Message passing <-> file port driver!
- Support for "diskless" mode is still there!
- Data shared by ETS tables really isn't shared: it's copied
on-demand in & out of ETS tables <-> Erlang process heaps
- Except for "large" Erlang binary terms! ^_^
If you don't know where to start, try The USE Method
- Measure these three things:
- Utilization (% busy/utilized)
- Saturation rates (items queued due to 100% utilization)
- Errors
- Find largest source of errors, or saturation, or utilization
- Re-focus USE method on that thing you found.
- Perhaps fix that bottleneck. Go to #1 if not fast enough.
- http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/
Kjellin & Aloi: What to look for!
- OS level
- memory usage, disk I/O & I/O wait, CPU usage, network usage,…
- Erlang
- elapsed wall clock time, message queues, reductions executed, garbage collection events, process scheduler events, …
Survey: Erlang profiling tools available today
- Not enough time to talk about them all!
- Use your favorite Web search tool to find more info about them.
- Afterward, I'll focus on flame graphs
Survey: Inside OTP source & OTP docs:
cprof
andepof
andfprof
etop
andpman
andobserver
ttb
percept
andpercept2
(http://www.erlang-factory.com/upload/presentations/808/Percept2.pdf)instrument
lcnt
- "Efficiency Guide User's Guide" in OTP docs
Survey: Outside of OTP but focused on Erlang
- https://github.com/massemanet/eper package: redbug and dtop
- https://github.com/ferd/recon package and his "Stuff Goes Bad: Erlang in Anger" ebook
riak_sysmon
&cluster_info
@ https://github.com/basho/- https://github.com/virtan/eep +
kcachegrind
- https://github.com/isacssouza/erlgrind
- Erlang VM internals:
- gprof, Valgrind, other C/assembly profilers, DTrace/SystemTap
- Don't forget OS tools!
top
,htop
,sar
,iostat
,vmstat
,strace
,sysdig
, …
Other measurement techniques
- timer:tc/1, /2, /3
- Beware! CLI code is interpreted differently than compiled code!
latency_histogram_tracer.erl
- Create detailed histogram stats on latencies of Mod:Fun/Arity calls.
- https://gist.github.com/slfritchie/159a8ce1f49fc03c77c6
- Requires
folsom
package for histograms
- Adding dynamic tracing probes for DTrace/SystemTap measurements.
- For example,
dyntrace:pn(25,1)
anddyntrace:pn(25,2)
additions to your code. - Measure latencies with a D script between
erlang$pid::user_probe-n25
probes with arg=1 vs. arg=2 - Requires
dyntrace
support compiled into your VM
- For example,
latency_histogram_tracer
example output
%% For example: create a histogram of call latencies %% for bitcask:put/3. Watch for 10 seconds. units=usec > latency_histogram_tracer:start(bitcask, put, 3, 10). ... wait 10 seconds ... Histogram stats: [{min,0}, {max,48}, {arithmetic_mean,2.765411819271055}, {median,3}, {variance,3.5629207473971585}, {standard_deviation,1.8875700642352746}, {percentile,[{50,3},{75,4},{90,5},{95,6},{99,8},{999,14}]}, {n,51746}, {histogram,[{1,13436}, {2,12304}, {3,10789}, {4,7397}, {5,4191},
Variable CPU clock speed
- You turned it off, didn't you?
Erlang tracing via dbg
and erlang
BIFs.
- It's simple. And it's complex. It's both!
- For each event, send a message to a single "event tracer"
process/port.
- The consumer/destination/sink for trace messages
Erlang tracing: What can be an event producer?
- Any Erlang process
- Any Erlang port
- The scheduler threads
Erlang tracing: What Erlang processes can generate events?
- Any 1/N/all processes running right now
- Any processes spawned after tracing starts
- Any processes linked after tracing starts
Erlang tracing: Types of event messages
- Calling a function (global or local)
- filter based on:
- module name
- function name
- # of arguments
- Erlang-style pattern matching on func's argument list!
- filter based on:
- Sending/receiving a message
- Process is starting/stopping GC
- Process is running/stopped by scheduler thread
- Process has exited
What do trace messages look like?
%% Function call, no timestamp {trace, Pid, call, MFA} % Function call, with timestamp w/extra binary() info {trace_ts, Pid, call, MFA, TS} {trace_ts, Pid, call, MFA, BIN, TS} % Garbage collection events & stats, with timestamp {trace_ts, Pid, gc_start, StatsTerm, TS} {trace_ts, Pid, gc_end, StatsTerm, TS} % Process scheduler in & out events, with timestamp {trace_ts, Pid, in, MFA, TS} {trace_ts, Pid, out, MFA, TS}
Trace message examples
%% Full arguments of each traced function {trace_ts, <0.450.0>, call, {erlang,monitor,[<0.49.0>]}, {1426,225,821772}} %% If 'arity' trace flag is specified, just arity # {trace_ts, <0.450.0>, call, {erlang,monitor,1}, <<"Extra info here.....">>, {1426,225,821772}} {trace_ts, <0.450.0>, return_from, {erlang,monitor,1}, true, {1426,225,821774}}
Erlang tracing: Strengths
- Turn on and off anytime
- On developer's laptop
- On production system!
- The API is odd, but the pattern matching is powerful
- "Event tracer"/consumer can execute any Erlang code
- Turing complete, very flexible
Erlang tracing: Weaknesses
- Turn on and off anytime
- On production system! And forget to turn it off.
- The API is frequently difficult to remember
- That's the reason for so many tools to manage the low-level tracing API
- Only 1 event tracer per VM
- Easy to create "event tsunami" of millions of event messages
- Killing entire VMs in production = not fun
- TCP and file port tracers have lower overhead but not low enough
- Capacity limit: 1 tracer = 1 CPU core of work available
- Easy to create "event tsunami" of millions of event messages
The fprof profiling tool: Advantages
- API is mostly easy-to-use
- Measure wall-clock execution time of {Mod, Func, ArgList}
- Easy to measure pure function {Mod, Func, ArgList} time!
- Aware of call stacks:
- Time executed by function F
- Time executed by functions called by function F
- Implemented by Erlang tracing
- Use it in production, if you wish….
The fprof profiling tool: Disadvantages
- Report format requires some practice to read correctly
erlgrind
andeep
have better (?) visualization tools
- Not designed for multi-process measurements
- Call stack interpretation can be ok or terrible
- Implemented by Erlang tracing
- High overhead: 1 message per function call or return
- Only 1 consumer process/port
- Can slow down a Riak system by 2-3 orders of magnitude!
Hyperbole: Flame Graphs are Awesome!
The "flame graph" profiling tool
- Invented by Sun Microsystems alumni as an alternative way of visualizing profiling data with call stacks.
- Originally based DTrace + Solaris
- Very quickly ported to many other platforms
- Links:
Flame Graph Example
Source: www.snookles.com/scotttmp/eflame2/riak.1.svg
Flame Graph Example
- Y axis: call stack depth
- X axis: proportion of # of samples (or perhaps relative execution time)
- Left-to-right order is alphabetical
- Left-to-right order has nothing to do with passage of time
Flame Graph Example
- Y axis: call stack depth
- X axis: proportion of # of samples (or perhaps relative execution time)
- Left-to-right order is alphabetical
- Left-to-right order has nothing to do with passage of time
Flame Graph of Experimental Riak 1
Flame Graph of Experimental Riak 2
Flame Graph of Experimental Riak 3
Flame Graph of Experimental Riak 4
Flame Graph of Experimental Riak 5
Making flame graphs with Erlang
- Method 1: profile the VM itself using standard profiling
techniques + Gregg's flame graph toolkit
- Good view into the VM's internal implementation
- Not good view into Erlang code behavior
- Method 2: Vlad Ki's
eflame
library- https://github.com/proger/eflame
- Uses Erlang tracing (all strengths & weaknesses)
- Not always accurate call stack reporting
- Method 3: Scott Lystig Fritchie's
eflame2
library- https://github.com/slfritchie/eflame
- Uses alternate call stack calculation
- Can use Erlang tracing (all strengths & weaknesses)
- Can use patched Erlang VM for much lower overhead cost
- Easier-to-use API, perhaps, I hope?
- Not always accurate call stack reporting
Steps to create a flame graph
- Start workload on system. (And keep it running!)
- Trigger measurement framework (previous slide)
- Convert raw measurement format to standard flame graph file format.
- Generate flame graph's native SVG format.
- Convert SVG to other formats.
- Not all Web browsers have native SVG support yet {sigh}
Making flamegraphs with eflame2
library: getting ready
git clone https://github.com/slfritchie/eflame.git
cd eflame
make
- Copy the
ebin/eflame2.beam
file to/tmp
on whatever machine it is that you wish to use to generate a flame graph. (Call it the System Under Test or SUT.) - Run
code:add_pathz("/tmp").
at the Erlang CLI on the SUT.
Making flamegraphs with eflame2
library: collecting traces, make graph
- Start the workload on the SUT.
- Collect some traces:
- For example, to gather trace events on all processes and all newly-spawned processes for all global function calls for 10 seconds (10,000 milliseconds).
- Run
eflame2:write_trace( global_calls_plus_new_procs, "/tmp/ef.test.0", all, 10*1000).
Making flamegraphs with eflame2
library: collecting traces, make graph
- Convert the binary trace to an ASCII trace:
- Run
eflame2:format_trace("/tmp/ef.test.0", "/tmp/ef.test.0.out").
- Run
- Convert the ASCII trace to an SVG flame graph image:
- Run at a BASH/C Shell/login shell:
cat /tmp/ef.test.0.out | flamegraph.riak-color.pl > output.svg
- Run at a BASH/C Shell/login shell:
eflame2
: Call stack calculation, version #1
- Use little-documented [process_dump] match spec action.
- e.g.,
MatchSpec = [{'_',[],[{message,{process_dump}}]}]
- e.g.,
- Create same process summary text that's used for Erlang
erl_core.dump
files at crash time! - On a Riak system, each trace message can be 15 KBytes or more!
- Parse the Erlang function call info (and throw the rest away).
eflame2
: Call stack calculation, version #1 (verbose!)
=proc:<0.450.0> State: Running Spawned as: proc_lib:init_p/5 Last scheduled in for: gen:do_call/4 Spawned by: <0.151.0> Started: Thu Mar 12 15:59:47 2015 Message queue length: 2 Message queue: [{'$gen_event',{riak_vnode_req_v1,1096126227998177188652763624537212264741949407232,{fsm,undefined,<0.16617.8>},{riak_kv_put_req_v1,{<<11 bytes>>,<<4 bytes>>},{r_object,<<11 bytes>>,<<4 bytes>>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<11 bytes>>,53,72,122,99,54,118,108,49,119,78,56,115,86,84,109,117,67,110,85,86,112,88]],[[<<5 bytes>>]],[],[[<<20 bytes>>|{1426,218816,52272}]],[],[]}}},<<4 bytes>>}],[{<<8 bytes>>,{19,63593438016}}],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined},101334680,63593438016,[]}}},{'$gen_event',{riak_vnode_req_v1,1096126227998177188652763624537212264741949407232,{fsm,undefined,<0.16618.8>},{riak_kv_put_req_v1,{<<11 bytes>>,<<4 bytes>>},{r_object,<<11 bytes>>,<<4 bytes>>,[{r_content,{dict,3,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<11 bytes>>,50,80,116,110,68,53,82,119,105,85,65,56,113,77,51,103,85,121,80,54,72,76]],[[<<5 bytes>>]],[],[[<<20 bytes>>|{1426,218816,53566}]],[],[]}}},<<4 bytes>>}],[],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean|true]],[]}}},undefined},54710744,63593438016,[coord]}}}] Number of heap fragments: 0 Heap fragment data: 0 Link list: [] Dictionary: [{#Ref<0.0.0.1416>,{bc_state,"./data/bitcask/1096126227998177188652763624537212264741949407232",{filestate,read_write,"./data/bitcask/1096126227998177188652763624537212264741949407232/1.bitcask.data",1,<0.1927.7>,<0.1929.7>,3479991403,35986},<<0 bytes>>,[{filestate,read_only,"./data/bitcask/1096126227998177188652763624537212264741949407232/1.bitcask.data",1,<0.4908.7>,undefined,0,0}],2147483648,[{expiry_secs,-1},{vnode_vclocks,true},{included_applications,[]},{allow_strfun,false},{reduce_js_vm_count,6},{anti_entropy_build_limit,{1,3600000}},{storage_backend,riak_kv_bitcask_backend},{hook_js_vm_count,2},{anti_entropy_leveldb_opts,[{write_buffer_size,4194304},{max_open_files,20}]},{listkeys_backpressure,true},{anti_entropy,{on,[]}},{mapred_name,"mapred"},{stats_urlpath,"stats"},{legacy_stats,true},{aae_throttle_limits,[{-1,0},{200,10},{500,50},{750,250},{900,1000},{1100,5000}]},{aae_throttle_sleep_time,0},{js_thread_stack,16},{anti_entropy_data_dir,"./data/anti_entropy"},{anti_entropy_expire,604800000},{warn_siblings,25},{anti_entropy_concurrency,2},{add_paths,[]},{fsm_limit,50000},{anti_entropy_tick,15000},{http_url_encoding,on},{map_js_vm_count,8},{object_format,v1},{mapred_2i_pipe,true},{warn_object_size,5000000},{js_max_vm_mem,8},{read_write,true}],<<0 bytes>>}},{bitcask_efile_port,#Port<0.11417>},{'$initial_call',{riak_core_vnode,init,1}},{bitcask_time_fudge,no_testing},{bitcask_file_mod,bitcask_file},{hashtree_tokens,35},{'$ancestors',[riak_core_vnode_sup,riak_core_sup,<0.148.0>]},{random_seed,{27839,21123,25074}}] Reductions: 1941240 Stack+heap: 4181 OldHeap: 0 Heap unused: 1619 OldHeap unused: 0 Stack dump: Program counter: 0x000000001c0c3ca8 (gen:do_call/4 + 576) CP: 0x000000001c0c3c80 (gen:do_call/4 + 536) 0x0000000035f704b8 Return addr 0x000000001e81a220 (gen_server:call/3 + 128) y(0) #Ref<0.0.6.80948> y(1) 'riak@127.0.0.1' y(2) Catch 0x000000001c0c3c90 (gen:do_call/4 + 552) y(3) infinity y(4) {file_pwrite,35986,[<<4 bytes>>,<<4 bytes>>,<<2 bytes>>,<<4 bytes>>,<<28 bytes>>,<<110 bytes>>]} y(5) '$gen_call' y(6) <0.1927.7> [... lots more stuff omitted...]
eflame2
: Call stack calculation, version #2
- Patch the Erlang VM to provide an alternate method of generating a
call stack report, without all of the extra/unwanted reporting stuff
that the
process_dump
match spec. - Added
process_backtrace
match spec to write stack trace only - Much faster, much smaller trace messages.
- Use
eflame2:write_trace_exp/4
oreflame2:write_trace_exp/6
to use theprocess_backtrace
style trace generation.
Per-function-call tracing is expensive!
- A Riak system, tracing all processes, can slow down 50x-300x!
- Time-based sampling, like gprof or DTrace use, have very low overhead and can (given enough time) generate statistically valid profiles.
- But neither gprof nor DTrace can understand Erlang process stacks
- View the VM's C internals and not individual Erlang process state
- If you are a DTrace magician, please contact me!
Patching the VM for Scott's experimental hacks
git clone https://github.com/slfritchie/otp.git
cd otp
git checkout slf-profiling-tracing-hack1-v16
./otp_build autoconf
./configure --all-your-favorite-build-flags...
make && make install
VM hacking: adding extra code to beam_emu.c's opcode interpreter loop
- Add new async thread that does nothing but:
- Sleep periodically
- Check global vars after waking up
- If global vars say "yes", set flags in each scheduler thread's
Process
structure to ask for a tracing sample.
VM hacking: adding extra code to beam_emu.c's opcode interpreter loop
void goofus_timer_loop(useconds_t usecs_to_sleep) { int ix; usleep(usecs_to_sleep); if (goofus_timer_enabled) { for (ix = 0; ix <= erts_no_schedulers; ix++) { ERTS_SCHEDULER_IX(ix)->goofus_count = 42; } } goofus_timer_loop(goofus_timer_sleep_msec * 1000); }
VM hacking: adding extra code to beam_emu.c's opcode interpreter loop
- For each
OpCase
opcode inside ofprocess_main()
- Add a check to see if a time-based sample should be dumped now.
- Added new field in
Process
structure. - If yes, write trace info directly to a file via
fwrite(3)
#define GOOFUS_CHECK if (ERTS_GET_SCHEDULER_DATA_FROM_PROC(c_p) \ ->goofus_count != 0) \ goofus_write_time_sample(c_p)
- Add GOOFUS_CHECK to lots of places….
VM hacking: adding extra code to beam_emu.c's opcode interpreter loop
int goofus_write_time_sample2(Process *c_p) { erts_dsprintf_buf_t *dsbufp = NULL; char pidbuf[64]; Eterm esp; if (goofus_fp == NULL || c_p == NULL) { return 0; } /* Format the message */ dsbufp = erts_create_tmp_dsbuf(0); dtrace_proc_str(c_p, pidbuf); erts_print(ERTS_PRINT_DSBUF, (void *) dsbufp, "Pid %s\n", pidbuf); erts_stack_dump_abbreviated(ERTS_PRINT_DSBUF, (void *) dsbufp, c_p); erts_print(ERTS_PRINT_DSBUF, (void *) dsbufp, ".\n"); fwrite(dsbufp->str, 1, dsbufp->str_len, goofus_fp); erts_destroy_tmp_dsbuf(dsbufp); return 1; }
VM hacking: weaknesses of the beam_emu.c's opcode technique
- Add extra (ugly) statements to each opcode's implementation
- Could hide ugliness with even more macros!
^_^
- Could hide ugliness with even more macros!
- Cannot "see" activity such as GC or port deferred work
- Cannot "see" async file I/O activity directly
- Might add bias toward functions immediately after schedule in by scheduler? (Needs more investigation)
- Very low overhead, even when sampling every few milliseconds
- Appears to be more accurate (most of the time) than per-function trace events.
eflame2
: Call stack calculation, version #3
- Clone & compile & install the Erlang/OTP repo described above.
- Start a workload on the SUT to be observed.
- Run the following at the Erlang shell:
VM hacking: using time-sampling for flame graphs
%% Open a trace file, /tmp/goofus.{OS-pid-number}.out true = erlang:system_flag(debug_flags, open). %% Set trace collection for every 7 milliseconds, then sleep for %% about 1 second to wait for the change to take place. erlang:system_flag(debug_flags, 7). timer:sleep(1000). %% Turn off Erlang-trace-status-sensitive option and %% turn on the asynchronous loop to request new samples erlang:system_flag(debug_flags, disable_trace). erlang:system_flag(debug_flags, asynchronous). %% Let the workload run for some period of time. Execute %% the workload here if necessary. Or wait a while with sleep. timer:sleep(5*1000). %% Turn off the tracing erlang:system_flag(debug_flags, asynchronous). true = erlang:system_flag(debug_flags, close).
VM hacking: using time-sampling for flame graphs
- Convert the trace file to flame-graph-ready input and then to an SVG flame graph.
/path/to/runtime_tools/priv/parse_goofus.pl \ < /tmp/goofus.PID.out | sort | uniq -c | \ awk '{ print $2, " ", $1}' | \ flamegraph.riak-color.pl > /tmp/flame-graph.svg
What does the trace file look like?
Pid <0.34.0> Program counter: 0x000000001a5b3ba8 (timer:sleep/1 + 64) CP: 0x0000000000000000 (invalid) 0x1a201eb8 Return addr 0x1a4f43a0 (erl_eval:do_apply/6 + 376) 0x1a201ec8 Return addr 0x1a4b3d28 (shell:exprs/7 + 680) 0x1a201ee0 Return addr 0x1a4b3430 (shell:eval_exprs/7 + 144) 0x1a201f38 Return addr 0x1a4b3028 (shell:eval_loop/3 + 568) 0x1a201f70 Return addr 0x17fc8008 (<terminate process normally>) . Pid <0.7238.0> [... repeat...]
Tutorial example, using Riak as a System Under Test
See https://github.com/slfritchie/eflame/blob/feature/alt-stack-method/README-Riak-Example.md for text and tutorial ideas here.
Reprise: Use "active benchmarking" / "active profiling"
- Don't assume function X is being called: confirm it
- Don't assume function X is fast: confirm it
- These measurement techniques have flaws
- Confirmation from multiple techniques builds confidence