<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Humans and Hippos in Harmony</title>
	<atom:link href="http://www.snookles.com/slf-blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.snookles.com/slf-blog</link>
	<description>Scott Lystig Fritchie&#039;s whimsy about computers, film, music, and whatever else his crazy neurons come up with that cannot fit into a Twitter tweet (@slfritchie).</description>
	<lastBuildDate>Mon, 23 Jan 2012 01:41:07 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>TCP incast: What is it? How can it affect Erlang applications?</title>
		<link>http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/</link>
		<comments>http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/#comments</comments>
		<pubDate>Thu, 05 Jan 2012 07:34:11 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Riak]]></category>
		<category><![CDATA[TCP]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=220</guid>
		<description><![CDATA[So, what&#8217;s the deal with the &#8220;TCP incast&#8221; pattern?  Never heard of it?  Join the club. I&#8217;ve been wearing a developer hat for too long and not wearing my sysadmin and network manager hats.  And the publications about Ethernet microbursts &#8230; <a href="http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>So, what&#8217;s the deal with the &#8220;TCP incast&#8221; pattern?  Never heard of it?  Join the club.</p>
<p>I&#8217;ve been wearing a developer hat for too long and not wearing my sysadmin and network manager hats.  And the publications about Ethernet microbursts and the TCP incast pattern have been hiding in conference proceedings &amp; journals that I don&#8217;t follow.  (Note to self: change reading habits.)</p>
<p>If you&#8217;d rather read a paper about the problem, go read one or more of these:</p>
<ul>
<li><a href="http://www.cs.cmu.edu/~vrv/papers/PDSI07-Incast.pdf">http://www.cs.cmu.edu/~vrv/papers/PDSI07-Incast.pdf</a></li>
<li><a href="http://www.eecs.berkeley.edu/~ychen2/professional/TCPIncastWREN2009.pdf">http://www.eecs.berkeley.edu/~ychen2/professional/TCPIncastWREN2009.pdf</a></li>
</ul>
<p>TL;DR: you can&#8217;t pour two buckets of manure into one bucket.  (Credit: my grandfather.)  I&#8217;m sure that&#8217;s very helpful &#8230; now keep on reading.</p>
<h3>Assumptions</h3>
<p>Suppose you have a recipe using typical, modern, commodity computing hardware like this:</p>
<ul>
<li>Computers with fast CPU cores and lots of memory.</li>
<li>Very efficient network interface cards in those computers, and a reasonably modern OS to take advantage of them.</li>
<li>Gigabit (or 10 Gbit/sec) Ethernet connecting those computers together.  One switch or a tree/fabric/whatever of switches, it doesn&#8217;t matter too much which, though multiple switches might be a bit easier.</li>
<li>An application that has lots of many-to-one or many-to-many communication patterns.  The <a href="http://basho.com/products/riak-overview/">Riak</a> database by <a href="http://basho.com/">Basho Technologies</a> happens to do a large number of many-to-one operations, so I&#8217;ll use that as my example below.  The same principle applies to Hadoop and many other data-intensive distributed applications.</li>
</ul>
<h3>The Easy Bandwidth Problem</h3>
<p>Say you&#8217;ve got five machines using &#8220;scp&#8221; to copy data to a single destination.  Each source machine is capable of outputting 900+ Mbit/sec.  If your five machines each send 900+ Mbit/sec of traffic to a single recipient, your 1 Gbit/sec Ethernet switch will soon have no choice but to drop packets.</p>
<p>But the &#8220;5 machines sending lots of bulk data to 1 machine&#8221; case is easy to understand: the machines cannot all simultaneously send 900+ Mbit/sec because the receiver has only 1,000 Mbit/sec of (theoretical) bandwidth.  The switch will drop some packets, then each TCP connection will react and slow down.  Eventually, each sender will reach a steady state of sending roughly 200 Mbit/sec.  So, 5 * 200 Mbit/sec = 1,000 Mbit/sec.  Easy, right?</p>
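<p>The arithmetic for this easy case fits in a few lines of Python.  This is only an idealized sketch: it assumes the five flows share the bottleneck perfectly evenly, which real TCP connections only approximate, and the function names are made up for illustration.</p>
<pre># Idealized fair-share estimate: N long-lived TCP senders converging on
# one receiver port.  Real flows oscillate around this value.

def fair_share_mbit(link_mbit, senders):
    """Per-sender steady-state rate if the bottleneck is shared equally."""
    return link_mbit / senders

def oversubscription(offered_mbit_per_sender, senders, link_mbit):
    """How many times the offered load exceeds the receiver's link."""
    return offered_mbit_per_sender * senders / link_mbit

print(fair_share_mbit(1000, 5))        # 200.0
print(oversubscription(900, 5, 1000))  # 4.5</pre>
<p>The second number is the reason the switch has to drop packets at first: the receiver&#8217;s port is being offered 4.5 times what it can deliver.</p>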
<h3>The More Difficult Bandwidth Problem</h3>
<p>What if you have your Riak cluster of 20 machines, and each machine is using (on average) 250 Mbit/sec of both transmit &amp; receive bandwidth on their 1 Gbit/sec Ethernet interfaces?  You&#8217;re only using 25% of the theoretical peak rate on each machine&#8217;s interface.  You&#8217;ve got <em>plenty of room</em> to grow, right?  No, not really.  For clusters of data-intensive applications like Riak or <a href="http://hadoop.apache.org/">Hadoop</a>, that 25% rate may be halfway to your practical bandwidth limit.  Why?</p>
<p>Because you&#8217;re likely to have &#8220;microbursts&#8221; of traffic sent to a single cluster member.  The switch can&#8217;t buffer all of the frames in the microburst, so it drops some.  Then the TCP mechanisms intervene to slow things down.  If your cluster has average utilization of 25-35% and uses Ethernet switches with small buffers, you may already be dropping 40-100 frames per second per machine.  The frame drop rate will rise very sharply as utilization increases.  At 45-55% utilization, you&#8217;ll start seeing 100-500 dropped frames per second per machine.  Will TCP be able to run full-speed with frame drop rates that high?  No.  Instead, you&#8217;ll be stuck with a cluster that can barely run above 50-60% average utilization.</p>
<p>It sounds a bit crazy, but it&#8217;s a real phenomenon.  And it&#8217;s finally something that has happened to me.  It took a long time to realize what was really happening &#8230; because I never considered that a network that was 45-55% utilized was really near its peak usable bandwidth.</p>
<h3>An Illustration Using Riak</h3>
<p>The Riak database manages its data replication by storing several copies of any single piece of data on multiple machines.  Typically, the number of copies is three.  When your app tries to fetch a key from Riak, Riak will create a new process to coordinate the operation.  That new process will send &#8216;get&#8217; requests to three different nodes and await their replies.  Each server that receives a &#8216;get&#8217; request will send a copy of the key, its value, and its metadata dictionary back to the coordinator.</p>
<p>So, what if both of the following were true?</p>
<ol>
<li>All three &#8216;get&#8217; results are sent to the coordinator at exactly the same time.  (Here, &#8220;exactly&#8221; means something like &#8220;within the same millisecond or perhaps even the same microsecond&#8221;.)</li>
<li>All three &#8216;get&#8217; results are big, for example, 100 kilobytes each.</li>
</ol>
<p>Then you have three different computers trying to send their 100 kilobytes of data to a single machine at the exact same time.  The resulting microburst of data creates a problem for the Ethernet switch(es) that these four computers are using.  (For the sake of simplicity, assume that all four computers are using the same Ethernet switch.)</p>
<p>If all three &#8216;get&#8217; results arrive at the same instant, then the switch must do one of two things:</p>
<ol>
<li>Buffer all of the 300KB of packets.</li>
<li>Drop some packets.</li>
</ol>
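<p>To put a rough number on option 1, here is a back-of-the-envelope Python sketch.  It assumes things the scenario above does not spell out: every sender transmits at full GigE line rate, the receiver&#8217;s port drains at that same rate, &#8220;100 kilobytes&#8221; means 100,000 bytes, and Ethernet framing overhead is ignored.  The function name is hypothetical.</p>
<pre>def incast_burst(senders, burst_bytes, line_rate_mbit=1000):
    """Return (arrival_us, peak_backlog_bytes, drain_us) for one burst."""
    # bits divided by Mbit/sec gives microseconds
    arrival_us = burst_bytes * 8 / line_rate_mbit
    # While the burst arrives, the port receives `senders` bytes for
    # every byte it can forward, so the queue ends up holding the other
    # (senders - 1) shares.
    peak_backlog = (senders - 1) * burst_bytes
    # Time to empty the queue once the senders go quiet.
    drain_us = peak_backlog * 8 / line_rate_mbit
    return arrival_us, peak_backlog, drain_us

print(incast_burst(3, 100_000))  # (800.0, 200000, 1600.0)</pre>
<p>Under those assumptions the whole event is over in under 3 milliseconds, yet the switch needs about 200 KB of buffer on that one port to avoid dropping anything.</p>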
<p>If you&#8217;re using a typical, low-cost commodity Ethernet switch such as a Cisco Catalyst 3750, you don&#8217;t have large amounts of buffer space (compared to other switches on the market).  See a <a href="http://www.advizex.com/assets/1/7/Tolly211127HP3800SeriesTCOVsCisco3750XandJuniperEX4200.pdf">table from a Tolly Enterprises, LLC report number 211127</a>:</p>
<p><a href="http://www.snookles.com/slf-blog/wp-content/uploads/2012/01/tolly-group-211127-excerpt.png"><img class="alignnone size-full wp-image-233" title="tolly-group-211127-excerpt" src="http://www.snookles.com/slf-blog/wp-content/uploads/2012/01/tolly-group-211127-excerpt.png" alt="" width="765" height="633" /></a></p>
<p>But even if your Ethernet switch has a lot of buffer space (like the HP or Juniper switches in the Tolly figure), it is still possible to overrun that buffer space: large buffers only make buffer overruns less likely.  Remember, these are fast machines in this cluster.  And your switch may only have buffer space for a few dozen or a few hundred frames, depending on the frame size.</p>
<p>So, now imagine that the Riak cluster is much busier.  Each cluster member is taking in many thousands of queries per second, thus starting several thousand coordinator processes per second.  Each coordinator is getting (typically) 3 replies.  Each reply is big enough to use many Ethernet frames.  And it&#8217;s quite likely that enough frames will arrive from all over the cluster to overrun the Ethernet switch&#8217;s buffers for the coordinator machine&#8217;s Ethernet port.  Boom, &#8220;TCP incast&#8221; bites you, hard.</p>
<h3>Riak in Production at Voxer: a case study</h3>
<p>I&#8217;ve seen a <a href="http://www.voxer.com">customer&#8217;s cluster (Voxer)</a> do exactly this.  When average utilization of a GigE network (fed by Cisco Catalyst 3750 switches) rises above 50% (sampled @ 1 second intervals), TCP throughput collapses regularly.  The 1-second utilization rates will see-saw between 900 Mbit/sec and 200 Mbit/sec.  To make matters worse, the 200 Mbit/sec rate will happen much more frequently than the 900 Mbit/sec rate.</p>
<p>To show that we&#8217;re seeing packet drops, we use a couple of methods.  First, we use &#8220;tcpdump&#8221; and &#8220;tcptrace&#8221;.</p>
<pre>root# sh -c 'FILE=/tmp/slf.capt`date +%s` ; echo Putting tcpdump output in $FILE ; time tcpdump -i eth0 -c 200000 -w $FILE ; tcptrace -l $FILE | grep "rexmt data pkts" | head -5 ; echo Pseudo-count of TCP flows with retransmitted packets: ; tcptrace -l $FILE | grep "rexmt data pkts" | grep -v "rexmt data pkts: 0 rexmt data pkts: 0" | wc -l ; echo Total retransmitted packets and bytes: ; tcptrace -l $FILE | awk " /rexmt data pkts/ { p += \$4 + \$8 } /rexmt data bytes/ { b += \$4 + \$8 } END { print p, b }" '
Putting tcpdump output in /tmp/slf.capt1325142154
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
200000 packets captured
200011 packets received by filter
0 packets dropped by kernel
0.22user 0.43system 0:07.50elapsed 8%CPU (0avgtext+0avgdata 17648maxresident)k
136inputs+36960outputs (3major+738minor)pagefaults 0swaps
 rexmt data pkts: 0 rexmt data pkts: 0
 rexmt data pkts: 0 rexmt data pkts: 0
 rexmt data pkts: 0 rexmt data pkts: 0
 rexmt data pkts: 0 rexmt data pkts: 0
 rexmt data pkts: 0 rexmt data pkts: 0
Pseudo-count of TCP flows with retransmitted packets:
99
Total retransmitted packets and bytes:
837 5608410
root# for i in `seq 1 30`; do sh -c 'echo -n `date` " " ; ifconfig eth0 ; sleep 1 ; ifconfig eth0' | grep bytes | sed -e 's/.. bytes://g' | awk 'NR == 1 { rx = $1; tx = $4} NR == 2 { printf "rx %.1f Mbit/sec tx %.1f Mbit/sec ", ($1 - rx) * 8 / (1024*1024) / 1, ($4 - tx) * 8 / (1024*1024) / 1 }' ; date ; done
rx 226.3 Mbit/sec tx 293.7 Mbit/sec Wed Dec 28 23:03:17 PST 2011
rx 242.8 Mbit/sec tx 265.3 Mbit/sec Wed Dec 28 23:03:18 PST 2011
rx 260.4 Mbit/sec tx 302.1 Mbit/sec Wed Dec 28 23:03:25 PST 2011
rx 226.4 Mbit/sec tx 241.4 Mbit/sec Wed Dec 28 23:03:26 PST 2011
rx 238.1 Mbit/sec tx 277.9 Mbit/sec Wed Dec 28 23:03:27 PST 2011
rx 270.5 Mbit/sec tx 315.9 Mbit/sec Wed Dec 28 23:03:28 PST 2011
rx 257.4 Mbit/sec tx 287.5 Mbit/sec Wed Dec 28 23:03:29 PST 2011
^C</pre>
<p>That&#8217;s 837 retransmitted packets (our proxy for dropped packets) in 7.5 seconds of sampling, at an average 1-second throughput of 240-315 Mbit/sec.</p>
<p>Here is a more Linux-specific view, with a slightly more exact measurement of transmit bandwidth during the same one-second window in which we count retransmissions:</p>
<pre>root# netstat -s | egrep 'segments retransmited|OutOctets|requests sent out' ; sleep 1; netstat -s | egrep 'segments retransmited|OutOctets|requests sent out'
    1542387771 requests sent out
    521663296 segments retransmited
    OutOctets: 1222021825
    1542398375 requests sent out
    521663388 segments retransmited
    OutOctets: 1248437047</pre>
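<p>The two <tt>netstat -s</tt> samples above reduce to rates with a little arithmetic.  Here is a minimal Python sketch of that subtraction; the dictionary keys are my own labels rather than netstat field names, and the 1024*1024 divisor simply copies the awk one-liner used earlier in this post.</p>
<pre>def netstat_delta(before, after, interval_sec=1.0):
    """Turn two snapshots of cumulative counters into per-interval rates."""
    sent = after["sent"] - before["sent"]
    rexmt = after["retransmitted"] - before["retransmitted"]
    octets = after["out_octets"] - before["out_octets"]
    return {
        "sent": sent,
        "retransmitted": rexmt,
        "rexmt_percent": 100.0 * rexmt / sent,
        # Same Mbit definition as the awk one-liner: 1024*1024 bits.
        "tx_mbit": octets * 8 / (1024 * 1024) / interval_sec,
    }

before = {"sent": 1542387771, "retransmitted": 521663296,
          "out_octets": 1222021825}
after = {"sent": 1542398375, "retransmitted": 521663388,
         "out_octets": 1248437047}
d = netstat_delta(before, after)
print(d["sent"], d["retransmitted"])  # 10604 92
print(round(d["rexmt_percent"], 2))   # 0.87
print(round(d["tx_mbit"], 1))         # 201.5</pre>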
<p>That&#8217;s 10604 packets sent, 92 packets retransmitted, for 26415222 octets sent (201 Mbit/sec).  The packet retransmission rate is 0.9%.  (Yes, all other machines are using roughly the same 1-second bandwidth.)  At 201 Mbit/sec, we&#8217;re at 20% of allegedly full bandwidth.  If we double the average utilization of all machines to 400 Mbit/sec, the packet retransmission rate moves up to the 4-6% range.  And if all machines go up to 500 Mbit/sec, things get ugly.</p>
<pre>rx 364.4 Mbit/sec tx 916 Mbit/sec Wed Dec 21 15:43:31 PST 2011
rx 339.1 Mbit/sec tx 951 Mbit/sec Wed Dec 21 15:43:32 PST 2011
rx 361.6 Mbit/sec tx 952 Mbit/sec Wed Dec 21 15:43:33 PST 2011
rx 475.4 Mbit/sec tx 491 Mbit/sec Wed Dec 21 15:43:34 PST 2011
rx 529.2 Mbit/sec tx 415 Mbit/sec Wed Dec 21 15:43:36 PST 2011
rx 472.9 Mbit/sec tx 267 Mbit/sec Wed Dec 21 15:43:37 PST 2011
rx 505.3 Mbit/sec tx 269 Mbit/sec Wed Dec 21 15:43:38 PST 2011
rx 393.5 Mbit/sec tx 191 Mbit/sec Wed Dec 21 15:43:39 PST 2011
rx 175.1 Mbit/sec tx 8 Mbit/sec Wed Dec 21 15:43:40 PST 2011
rx 436.8 Mbit/sec tx 246 Mbit/sec Wed Dec 21 15:43:41 PST 2011
rx 487.6 Mbit/sec tx 246 Mbit/sec Wed Dec 21 15:43:42 PST 2011
rx 524.2 Mbit/sec tx 194 Mbit/sec Wed Dec 21 15:43:43 PST 2011
rx 441.9 Mbit/sec tx 699 Mbit/sec Wed Dec 21 15:43:44 PST 2011
rx 382.8 Mbit/sec tx 952 Mbit/sec Wed Dec 21 15:43:45 PST 2011
rx 331.5 Mbit/sec tx 951 Mbit/sec Wed Dec 21 15:43:46 PST 2011
rx 391.3 Mbit/sec tx 949 Mbit/sec Wed Dec 21 15:43:47 PST 2011
rx 538.6 Mbit/sec tx 396 Mbit/sec Wed Dec 21 15:43:48 PST 2011</pre>
<p>Bummer.  Over about 20 seconds, our transmit bandwidth ranges from a high of 952 Mbit/sec all the way down to 8 Mbit/sec.  EIGHT!  For an entire second!  And it took four more seconds before the transmit rate rose above 500 Mbit/sec.</p>
<p>So, one last question about the whole microburst and Ethernet switch buffering and &#8220;TCP incast&#8221; problem is &#8230; are the microbursts really happening?  Are the switch ports really being pushed to full line rate some of the time?</p>
<p>As far as I can tell, the answer is &#8220;yes&#8221;.  Using an Erlang program (Escript, actually), I have good timer resolution down to about 4 milliseconds, perhaps even 2 milliseconds.  To get polling accurate below that, I&#8217;d have to write a demo program in another language.  (Or write the Erlang program so that it avoids using timers and instead uses busy-wait loops: Erlang&#8217;s time-of-day clock has microsecond resolution.)</p>
<p>Here&#8217;s a <a href="http://www.snookles.com/scotttmp/poll-value-mbit.escript">link to the escript that I used</a>, and here&#8217;s some output, taken from an off-peak period last night. The first argument is the number of milliseconds between polling periods, and the second is the path to Linux&#8217;s <tt>/sys</tt> file system file for incoming network octets. (The {22,x,y} stuff is a timestamp, e.g. 22:29:21 Pacific time.  The &#8220;ratio&#8221; figure divides the maximum bandwidth observed during that second by the average bandwidth for the second.)</p>
<pre>root# ./poll-value-mbit.escript 500 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 184.7 Mbit/s Avg 170.3 Mbit/s Ratio 1.1 @ {22,29,21}
Max 206.6 Mbit/s Avg 185.7 Mbit/s Ratio 1.1 @ {22,29,22}
Max 230.1 Mbit/s Avg 206.0 Mbit/s Ratio 1.1 @ {22,29,23}
Max 192.1 Mbit/s Avg 185.4 Mbit/s Ratio 1.0 @ {22,29,24}
Max 183.7 Mbit/s Avg 167.4 Mbit/s Ratio 1.1 @ {22,29,25}
Max 212.5 Mbit/s Avg 191.8 Mbit/s Ratio 1.1 @ {22,29,26}
^C
root# ./poll-value-mbit.escript 250 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 174.0 Mbit/s Avg 157.8 Mbit/s Ratio 1.1 @ {22,29,32}
Max 225.1 Mbit/s Avg 180.2 Mbit/s Ratio 1.2 @ {22,29,33}
Max 210.7 Mbit/s Avg 182.6 Mbit/s Ratio 1.2 @ {22,29,34}
Max 188.3 Mbit/s Avg 172.3 Mbit/s Ratio 1.1 @ {22,29,35}
Max 193.5 Mbit/s Avg 177.5 Mbit/s Ratio 1.1 @ {22,29,36}
^C
root# ./poll-value-mbit.escript 125 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 256.2 Mbit/s Avg 200.5 Mbit/s Ratio 1.3 @ {22,29,44}
Max 245.3 Mbit/s Avg 209.4 Mbit/s Ratio 1.2 @ {22,29,45}
Max 271.8 Mbit/s Avg 212.4 Mbit/s Ratio 1.3 @ {22,29,46}
Max 214.6 Mbit/s Avg 189.4 Mbit/s Ratio 1.1 @ {22,29,47}
Max 261.8 Mbit/s Avg 199.5 Mbit/s Ratio 1.3 @ {22,29,48}
^C
root# ./poll-value-mbit.escript 64 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 458.6 Mbit/s Avg 311.9 Mbit/s Ratio 1.5 @ {22,30,0}
Max 389.1 Mbit/s Avg 236.0 Mbit/s Ratio 1.6 @ {22,30,1}
Max 267.2 Mbit/s Avg 162.8 Mbit/s Ratio 1.6 @ {22,30,2}
Max 276.8 Mbit/s Avg 167.6 Mbit/s Ratio 1.7 @ {22,30,3}
Max 229.3 Mbit/s Avg 172.3 Mbit/s Ratio 1.3 @ {22,30,4}
Max 346.6 Mbit/s Avg 193.1 Mbit/s Ratio 1.8 @ {22,30,5}
^C
root# ./poll-value-mbit.escript 32 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 372.7 Mbit/s Avg 204.2 Mbit/s Ratio 1.8 @ {22,30,9}
Max 305.6 Mbit/s Avg 166.9 Mbit/s Ratio 1.8 @ {22,30,10}
Max 356.0 Mbit/s Avg 192.0 Mbit/s Ratio 1.9 @ {22,30,11}
Max 410.0 Mbit/s Avg 174.4 Mbit/s Ratio 2.4 @ {22,30,12}
Max 349.7 Mbit/s Avg 187.7 Mbit/s Ratio 1.9 @ {22,30,13}
^C
root# ./poll-value-mbit.escript 16 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 441.1 Mbit/s Avg 171.8 Mbit/s Ratio 2.6 @ {22,30,19}
Max 451.1 Mbit/s Avg 179.5 Mbit/s Ratio 2.5 @ {22,30,20}
Max 352.6 Mbit/s Avg 166.3 Mbit/s Ratio 2.1 @ {22,30,21}
Max 424.7 Mbit/s Avg 177.1 Mbit/s Ratio 2.4 @ {22,30,22}
^C
root# ./poll-value-mbit.escript 8 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 598.1 Mbit/s Avg 199.4 Mbit/s Ratio 3.0 @ {22,30,30}
Max 724.1 Mbit/s Avg 205.9 Mbit/s Ratio 3.5 @ {22,30,31}
Max 637.2 Mbit/s Avg 158.8 Mbit/s Ratio 4.0 @ {22,30,32}
Max 727.2 Mbit/s Avg 187.3 Mbit/s Ratio 3.9 @ {22,30,33}
Max 832.3 Mbit/s Avg 221.8 Mbit/s Ratio 3.8 @ {22,30,34}
Max 436.6 Mbit/s Avg 162.6 Mbit/s Ratio 2.7 @ {22,30,35}
^C
root# ./poll-value-mbit.escript 4 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 826.5 Mbit/s Avg 200.2 Mbit/s Ratio 4.1 @ {22,30,41}
Max 858.9 Mbit/s Avg 166.1 Mbit/s Ratio 5.2 @ {22,30,42}
Max 746.8 Mbit/s Avg 176.0 Mbit/s Ratio 4.2 @ {22,30,43}
Max 838.7 Mbit/s Avg 190.0 Mbit/s Ratio 4.4 @ {22,30,44}
Max 674.0 Mbit/s Avg 173.5 Mbit/s Ratio 3.9 @ {22,30,45}
^C
root# ./poll-value-mbit.escript 2 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 950.2 Mbit/s Avg 188.9 Mbit/s Ratio 5.0 @ {22,30,51}
Max 960.0 Mbit/s Avg 197.2 Mbit/s Ratio 4.9 @ {22,30,52}
Max 992.8 Mbit/s Avg 236.1 Mbit/s Ratio 4.2 @ {22,30,53}
Max 997.5 Mbit/s Avg 191.7 Mbit/s Ratio 5.2 @ {22,30,54}
Max 962.5 Mbit/s Avg 172.5 Mbit/s Ratio 5.6 @ {22,30,55}
Max 957.5 Mbit/s Avg 187.6 Mbit/s Ratio 5.1 @ {22,30,56}
^C
root# ./poll-value-mbit.escript 1 /sys/devices/pci0000:00/0000:00:01.0/0000:0a:00.0/net/eth0/statistics/rx_bytes
Max 1645.6 Mbit/s Avg 182.2 Mbit/s Ratio 9.0 @ {22,31,2}
Max 1610.0 Mbit/s Avg 212.0 Mbit/s Ratio 7.6 @ {22,31,3}
Max 1623.1 Mbit/s Avg 206.3 Mbit/s Ratio 7.9 @ {22,31,4}
Max 1522.3 Mbit/s Avg 173.5 Mbit/s Ratio 8.8 @ {22,31,5}
Max 1607.7 Mbit/s Avg 199.0 Mbit/s Ratio 8.1 @ {22,31,6}</pre>
<p>It&#8217;s clear that as the polling period gets smaller, the ratio between maximum observed incoming rate and average rate gets larger.  It&#8217;s also clear that we don&#8217;t have good enough timer resolution for 1 millisecond polling: a GigE port cannot really receive 1,600 Mbit/sec.  2 millisecond resolution might be iffy; <span style="color: #000000;"><strong>however</strong>, a separate experiment to test Erlang timer resolution at 2 milliseconds shows that the accuracy is within roughly 10%.</span></p>
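<p>Why does the ratio depend on the polling period at all?  The Python sketch below is not the escript linked above, just an illustration of the same max-versus-average calculation applied to made-up <tt>rx_bytes</tt> counter readings.  The sample values are invented so that the arithmetic comes out round.</p>
<pre>def burst_ratio(samples, interval_ms):
    """samples: cumulative rx_bytes readings taken every interval_ms.

    Returns (peak_mbit, average_mbit, peak / average).
    """
    mbit = 1024 * 1024
    per_sec = 1000 / interval_ms
    rates = [(b - a) * 8 * per_sec / mbit
             for a, b in zip(samples, samples[1:])]
    peak = max(rates)
    avg = sum(rates) / len(rates)
    return peak, avg, peak / avg

# One second of invented traffic: three quiet quarters, one bursty one.
fine = [0, 1048576, 2097152, 3145728, 8388608]  # polled every 250 ms
coarse = [fine[0], fine[-1]]                    # polled every 1000 ms

print(burst_ratio(fine, 250))     # (160.0, 64.0, 2.5)
print(burst_ratio(coarse, 1000))  # (64.0, 64.0, 1.0)</pre>
<p>Same bytes, same second, but the coarse poll reports a perfectly smooth 64 Mbit/sec.  A burst only shows up when the polling period is short enough to isolate it.</p>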
<h3>What happens to Erlang programs when TCP incast strikes?</h3>
<p>When a few packets are dropped, TCP does a good job of figuring out which packet(s) need retransmission.  The TCP stack in Linux 2.6.32 seems to do well enough.  At low packet loss rates, throughput isn&#8217;t harmed much, and latency penalties are minimal.</p>
<p>At higher packet loss rates, however, you can hit &#8220;<a href="http://en.wikipedia.org/wiki/Slow-start">TCP slow start</a>&#8221;.  In the example above, where throughput fluctuates from 952 Mbit/sec down to 8 Mbit/sec, that&#8217;s what I believe happened.  (I don&#8217;t have proof, alas: I didn&#8217;t have a packet capture running at that time.)</p>
<p>Between any two Erlang nodes <strong>A</strong> and <strong>B</strong>, there is a single TCP connection that carries all Erlang messages between <strong>A</strong> and <strong>B</strong>.  If there is severe congestion between <strong>A</strong> and <strong>B</strong>, and TCP slow start happens on the A-to-B TCP connection, messaging between <strong>A</strong> and node <strong>C</strong> will not be impacted.</p>
<p>However, there&#8217;s a very strong backpressure/feedback mechanism built in to Erlang that is related to output to &#8220;ports&#8221;.  An Erlang port is a gateway to the outside world, e.g. file descriptors to TCP sockets and to local file systems.  Each port has a buffer associated with it.  If an Erlang process writes data to the port until the buffer is full, then that process will be descheduled: the process cannot run until the buffer is no longer full.  If Erlang system monitoring for busy ports is enabled, a <tt>busy_port</tt> message will be sent to the system monitor process.</p>
<p>If an Erlang process <strong>P</strong> on node <strong>A</strong> attempts to send a message to process <strong>R</strong> on node <strong>B</strong>, and if the A-to-B port&#8217;s buffer is full, then process <strong>P</strong> will be descheduled. If Erlang system monitoring for busy network distribution ports is enabled, a <tt>busy_dist_port</tt> message will be sent to the system monitor process.</p>
<p>In an application like Riak, however, the process scheduling problem caused by a <tt>busy_dist_port</tt> event is very costly.</p>
<ul>
<li>Assume process <strong>P</strong> on node <strong>A</strong> is a Riak KV vnode process.  This process is, for discussion purposes, an Erlang/OTP <tt>gen_server</tt> process.  Like all <tt>gen_server</tt> processes, it handles all requests serially.</li>
<li>Assume process <span style="color: #000000;"><strong>R</strong></span> (the soon-to-be-message-receiver in this example) on node <span style="color: #000000;"><strong>B</strong></span> is a Riak client &#8216;get&#8217; operation FSM (finite state machine) process.</li>
<li>Assume that the TCP connection between <strong>A</strong> and <span style="color: #000000;"><strong>B</strong> is congested.  In fact, it&#8217;s so congested that the buffer for the port for A-to-B messaging is completely full.</span></li>
<li><span style="color: #000000;">When process <strong>P</strong> gets <strong>R</strong>&#8217;s &#8216;get&#8217; request, it does its normal computation: pass the request to the <tt>riak_kv</tt> vnode handler, then down to (for example) the Bitcask local storage manager.  Eventually, a reply is calculated and ready to send back to the client, <strong>R</strong>.</span></li>
<li><span style="color: #000000;">Our server process <strong>P</strong> eventually uses the Erlang <tt>!</tt> operator to send the reply to <strong>R</strong>.</span></li>
<li><span style="color: #000000;">The A-to-B network distribution port buffer is 100% full.  The VM sends a <tt>busy_dist_port</tt> message to the system monitor process (which will write the event to the Riak application log file), and process <strong>P</strong> is descheduled.</span></li>
</ul>
<p>Given this sequence of events, our Riak vnode process <strong>P</strong> cannot answer any more queries until the A-to-B network link becomes uncongested.  If the cluster has 12 machines in it, <em>P cannot process any queries from the other 10 machines in the cluster</em> &#8230; even if the Ethernet ports to those other 10 machines are 0% utilized.  Service to those other 10 machines will be delayed for as long as it takes for the A-to-B TCP connection&#8217;s data to start flowing again.  If TCP slow start is triggered, that delay starts at 200 milliseconds: the Linux TCP stack&#8217;s default timeout before starting slow start.  And it can take another fraction of a second longer (at least) before full bandwidth can be utilized &#8230; which also assumes that there are no more packets dropped while TCP climbs out of slow start mode &#8230; which we now know is a faulty assumption.</p>
<p>(Careful readers will probably draw a connection from the process descheduling problem to a more general <a href="http://en.wikipedia.org/wiki/Head-of-line_blocking">head-of-line blocking</a> problem.)</p>
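<p>Here is a toy model of that blocking, as a hedged Python sketch rather than anything resembling real Riak or Erlang VM code.  It assumes a strictly serial server, 1 millisecond of work per request, and a fixed 200 millisecond stall whenever a reply goes to the congested node (a stand-in for being descheduled on a busy distribution port).  All of the names and numbers are invented for illustration.</p>
<pre>def serve_serially(requests, service_ms=1, stall_ms=200, congested="B"):
    """requests: list of (arrival_ms, client_node), sorted by arrival.

    Returns a list of (client_node, latency_ms).  The server handles one
    request at a time, like a gen_server; replying to the congested node
    blocks it for stall_ms before it can pick up the next request.
    """
    clock = 0
    latencies = []
    for arrival_ms, client in requests:
        start = max(clock, arrival_ms)
        clock = start + service_ms
        if client == congested:
            clock += stall_ms
        latencies.append((client, clock - arrival_ms))
    return latencies

reqs = [(0, "B"), (1, "C"), (2, "D"), (3, "E")]
print(serve_serially(reqs))
# [('B', 201), ('C', 201), ('D', 201), ('E', 201)]
print(serve_serially(reqs, stall_ms=0))
# [('B', 1), ('C', 1), ('D', 1), ('E', 1)]</pre>
<p>In this model nodes C, D, and E have idle links, yet each of their requests takes 201 milliseconds because it queued up behind the one reply to B.</p>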
<p>This causes Riak really big headaches.  Suddenly Riak query latencies become extremely unpredictable.  Overall memory usage can rise dramatically, 25% or more, for short periods of time and then fall back to normal as queues eventually drain.</p>
<h3>What&#8217;s the remedy?</h3>
<p>We don&#8217;t have a good remedy for this yet.  The options we&#8217;ve looked at so far all have problems:</p>
<ul>
<li>Using the <a href="http://research.microsoft.com/pubs/121386/dctcp-public.pdf">DCTCP</a> protocol isn&#8217;t an option for most <a href="http://www.basho.com/">Basho</a> customers.</li>
<li>Enabling support for <a href="http://en.wikipedia.org/wiki/Ethernet_flow_control">Ethernet &#8220;pause frames&#8221;</a> assumes that your switches support them correctly &#8212; transmission between switches is apparently not well supported.</li>
<li>Changing the 200ms TCP &#8220;RTO&#8221; timer (the retransmission timeout timer) may have a beneficial effect, but it&#8217;s difficult to measure because the Linux 2.6.32 kernel&#8217;s implementation for changing the RTO timer is buggy.</li>
<li>Changing other Linux TCP and NIC configuration knobs (e.g. TX queue, firmware ring size &amp; interrupt rates, MTU 9000) has negligible effect.</li>
<li>Limiting every cluster member&#8217;s outgoing bandwidth to 80% of line rate (which might reduce the ability of any single cluster member to cause TCP incast packet drops on any single receiver) appears difficult. Documentation for the Linux <tt>tc</tt> utility and the &#8220;token bucket filter&#8221; mechanism suggests that TBF might be 100% accurate up to 10 Mbits/sec.  Um, I need something that can handle a couple orders of magnitude more traffic than that.  And I&#8217;d like to be able to configure it without using units like packets/jiffy.</li>
</ul>
<p>We also have <a href="https://issues.basho.com/show_bug.cgi?id=1309">Bug 1309</a> open, to work around the worst of the head-of-line blocking problem.</p>
<h3>Postscript</h3>
<p>Many thanks to Matt Ranney at Voxer and his great staff for assisting in troubleshooting this networking problem.  The amount of time that he spent on the phone with various data center support staff is &#8230; I don&#8217;t want to try to sum it all up.  When we finally stumbled across the theory of the TCP incast pattern, my initial reaction was (paraphrasing), &#8220;That&#8217;s bullshit.&#8221;  But, indeed, I was wrong.  All subsequent research points to TCP incast as our main problem.</p>
<h3>Update: 2012-Jan-22</h3>
<p>There&#8217;s also a good collection of papers about the TCP incast pattern at the <a href="http://www.pdl.cmu.edu/Incast/">CMU PDL Projects</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
		<item>
		<title>RSS clients, please, stop pinging every 60 seconds</title>
		<link>http://www.snookles.com/slf-blog/2011/12/01/rss-clients-please-stop-pinging-every-60-seconds/</link>
		<comments>http://www.snookles.com/slf-blog/2011/12/01/rss-clients-please-stop-pinging-every-60-seconds/#comments</comments>
		<pubDate>Fri, 02 Dec 2011 02:25:06 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[None-of-the-above]]></category>
		<category><![CDATA[FreeBSD]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=216</guid>
		<description><![CDATA[Hi, everyone. I know that y&#8217;all is simply craving to hear everything that I have to say, as quickly as I say it.  That&#8217;s why you use an RSS client, to keep track of my blog.  You keep tabs on &#8230; <a href="http://www.snookles.com/slf-blog/2011/12/01/rss-clients-please-stop-pinging-every-60-seconds/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Hi, everyone. I know that y&#8217;all is simply craving to hear everything that I have to say, as quickly as I say it.  That&#8217;s why you use an RSS client, to keep track of my blog.  You keep tabs on me, because I&#8217;m worth keeping tabs on.  I&#8217;m <em>so</em> worth it that your RSS client is pinging my blog&#8217;s RSS feed every 60 seconds.</p>
<p>I might say something. At any time.</p>
<p>From this day forward, I&#8217;ll separate my blog postings by several hours, if not days or months.  You won&#8217;t miss anything, really, if you poll less frequently.  Besides, you&#8217;re probably a Twitter user.  Follow me on Twitter, <a href="http://twitter.com/slfritchie">@slfritchie</a>, to get your fix.</p>
<p>My firewall might say something.  To you, 50.16.238.123.  At any time.</p>
<p>Except that you&#8217;re likely running on an EC2 instance.  Fickle, mercurial.  Poised to multiply a thousandfold like tribbles, overrunning my modest ISP service, flooding its virtual storage bins with cute replicas of their fuzzy terror.</p>
<p>The firewall &#8220;man&#8221; page says nothing about filtering TDP, the Tribble DoS Protocol.  Drat.  I need an upgrade.  At an unknown time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2011/12/01/rss-clients-please-stop-pinging-every-60-seconds/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>DTrace, FreeBSD 9.0, and Erlang</title>
		<link>http://www.snookles.com/slf-blog/2011/11/26/dtrace-freebsd-9-0-and-erlang/</link>
		<comments>http://www.snookles.com/slf-blog/2011/11/26/dtrace-freebsd-9-0-and-erlang/#comments</comments>
		<pubDate>Sat, 26 Nov 2011 21:13:45 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[Erlang]]></category>
		<category><![CDATA[DTrace]]></category>
		<category><![CDATA[FreeBSD]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=177</guid>
		<description><![CDATA[While working on the DTrace probes for the Erlang R15 release (which I&#8217;ve blogged about earlier this month), I discovered that the build recipe for FreeBSD 9.0 was broken.  Oh no, my favorite OS doesn&#8217;t work, sound the alarm! FreeBSD &#8230; <a href="http://www.snookles.com/slf-blog/2011/11/26/dtrace-freebsd-9-0-and-erlang/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>While working on the DTrace probes for the Erlang R15 release (which I&#8217;ve <a href="http://www.snookles.com/slf-blog/2011/11/19/dtrace-and-erlang-a-status-report/">blogged about earlier this month</a>), I discovered that the build recipe for FreeBSD 9.0 was broken.  Oh no, my favorite OS doesn&#8217;t work, sound the alarm!</p>
<p>FreeBSD has supported kernel-space DTrace probes for a while.  However, support for <a href="http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_USDT">user-space probes, USDT</a>, did not arrive until FreeBSD 9.0RC1.  &#8220;RC1&#8221; means &#8220;Release Candidate #1&#8221;.  As of this writing, RC2 is the newest release.  The official release of FreeBSD 9.0 (to be called FreeBSD 9.0-RELEASE) will not be ready for at least another month or perhaps longer; see the <a href="http://wiki.freebsd.org/Releng/9.0TODO">9.0 wiki release schedule</a> for more details.</p>
<p>As I mentioned in the introduction to this article, USDT probes are a bit broken in FreeBSD 9.0RC2.  I&#8217;ve added two different kinds of probes to the Erlang virtual machine:</p>
<ol>
<li>Probes that are defined as part of the core virtual machine.  These probes fire in response to events within the virtual machine, e.g. spawning a new process or a garbage collection event.  These probes are defined in <a href="https://github.com/slfritchie/otp/blob/dtrace-review3/erts/emulator/beam/erlang_dtrace.d">erlang_dtrace.d</a>.</li>
<li>A probe that can be triggered directly by Erlang code, e.g., <code>dtrace:p(42, "Hello, world!").</code> This probe is defined in <a href="https://github.com/slfritchie/otp/blob/dtrace-review3/lib/dtrace/c_src/dtrace_user.d">dtrace_user.d</a> and by the Erlang module <a href="https://github.com/slfritchie/otp/blob/dtrace-review3/lib/dtrace/src/dtrace.erl">dtrace.erl</a>.</li>
</ol>
<p>The probe that can be fired directly from Erlang code must be compiled using position-independent code (PIC) and assembled into a shared library.  The Erlang function <code>dtrace:init()</code> will load the shared library into the virtual machine and initialize the Erlang&lt;-&gt;C glue necessary for an Erlang function to call a C function.</p>
<p>Neither FreeBSD 9.0RC1 nor FreeBSD 9.0RC2 can compile and link PIC code that contains DTrace probes.  So all of the probes in <code>erlang_dtrace.d</code> will work correctly (because they&#8217;re not part of a shared library), but the probe in <code>dtrace_user.d</code> will not work correctly.</p>
<p>There is a problem report (PR) for FreeBSD that discusses this PIC problem: <a href="http://www.FreeBSD.org/cgi/query-pr.cgi?pr=159046">kern/159046: [dtrace] [patch] dtrace library is linked with a wrong flags on -CURRENT.</a>  I emailed an update to that ticket this afternoon, asking that Alex Samorukov&#8217;s fix be folded into the 9.0-RELEASE of FreeBSD.  I have no idea if it will happen in time or not &#8230; but I have my fingers crossed.</p>
<p>In the meantime, I have a fix for those who are using one of the FreeBSD release candidates where the PIC bug is still present.  You have a couple of options:</p>
<ol>
<li>If you&#8217;re familiar with building &amp; installing FreeBSD kernels and FreeBSD user-space programs, then following the advice in the FreeBSD problem report (linked above) will work just fine.</li>
<li>Download PIC-compiled copies of the system libraries that are required for PIC-compiled trace code and install them manually.</li>
</ol>
<div>If you choose option #2, then I have a step-by-step recipe for you:</div>
<div>
<ul>
<li>Download the 5.5 KByte file <a href="http://www.snookles.com/scott/misc/freebsd-9.0rcX.dtrace-pic-libs.tar.gz">http://www.snookles.com/scott/misc/freebsd-9.0rcX.dtrace-pic-libs.tar.gz</a> to your FreeBSD 9.0RCx&#8217;s <code>/tmp</code> directory.</li>
<li>Run this command: <code>gunzip -c /tmp/freebsd-9.0rcX.dtrace-pic-libs.tar.gz | (cd / ; tar xvfp -)</code></li>
</ul>
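<p>Because that command unpacks into <code>/</code>, it may be worth listing the archive's contents first.  Here is a minimal sketch, assuming a POSIX shell; the function name <code>preview_tarball</code> is made up for this example and is not part of any FreeBSD tool:</p>

```shell
# Hypothetical helper: list the files in a gzipped tarball without
# extracting anything.  The first argument is the path to the .tar.gz file.
preview_tarball() {
    gunzip -c "$1" | tar tf -
}
```

<p>For example, <code>preview_tarball /tmp/freebsd-9.0rcX.dtrace-pic-libs.tar.gz</code> prints the archive's paths without writing anything to disk.</p>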
<p>Good luck, and have fun with DTrace.</p>
<h3>UPDATE 2011-Nov-29</h3>
<p>It looks like the FreeBSD problem report has now been resolved, and the patch will be included in the 9.0-RELEASE distribution when it is released in the next month or three.  Very good news!</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2011/11/26/dtrace-freebsd-9-0-and-erlang/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SystemTap and Erlang: a tutorial</title>
		<link>http://www.snookles.com/slf-blog/2011/11/19/systemtap-and-erlang-a-tutorial/</link>
		<comments>http://www.snookles.com/slf-blog/2011/11/19/systemtap-and-erlang-a-tutorial/#comments</comments>
		<pubDate>Sat, 19 Nov 2011 07:45:59 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[Erlang]]></category>
		<category><![CDATA[DTrace]]></category>
		<category><![CDATA[SystemTap]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=166</guid>
		<description><![CDATA[As mentioned in my previous posting, DTrace and Erlang: a status report, it&#8217;s also possible to use Linux&#8217;s SystemTap to watch the inner workings of the Erlang virtual machine.  This is possible via a DTrace compatibility layer built in to SystemTap. In theory, any Linux &#8230; <a href="http://www.snookles.com/slf-blog/2011/11/19/systemtap-and-erlang-a-tutorial/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>As mentioned in my previous posting, <a href="http://www.snookles.com/slf-blog/2011/11/19/dtrace-and-erlang-a-status-report">DTrace and Erlang: a status report</a>, it&#8217;s also possible to use Linux&#8217;s <a href="http://sourceware.org/systemtap/">SystemTap</a> to watch the inner workings of the Erlang virtual machine.  This is possible via a <a href="http://dtrace.org/blogs/">DTrace</a> compatibility layer built in to SystemTap.</p>
<p>In theory, any Linux system that supports user-space SystemTap should work just fine.</p>
<p>In practice, I highly recommend that you use a CentOS 5 machine.  Why?</p>
<ul>
<li>CentOS 5 is what I use for my Linux + SystemTap testing</li>
<li>CentOS 6 ought to work, but I&#8217;ve noticed strange problems when trying to run some SystemTap scripts &#8230; and have seen incorrect/weird behavior from other scripts.</li>
<li>Fedora 16 ought to work as well as CentOS 5, but I haven&#8217;t tried it yet.</li>
</ul>
<div>I owe many, many thanks to Andreas Schultz at <a href="http://www.travelping.com/">Travelping GmbH</a> for doing almost all the research and actual work to get the SystemTap stuff working.</div>
<h3>First, Read the README File</h3>
<div>As with any piece of open source software, it&#8217;s a good idea to read the README file.  In Erlang&#8217;s case, there are several README files.  I recommend that you quickly skim the following two files (though careful reading would be even better):</div>
<div>
<ul>
<li><a href="https://github.com/slfritchie/otp/blob/dtrace-review3/README.dtrace.md">README.dtrace.md</a>, the introduction to DTrace and Erlang</li>
<li><a href="https://github.com/slfritchie/otp/blob/dtrace-experiment+michal2/README.systemtap.md">README.systemtap.md</a>, the introduction to SystemTap and Erlang</li>
</ul>
<div>
<p>Please note that the newest, most up-to-date copies of these files can be found in the Erlang/OTP distribution, using the instructions below in the section &#8220;<a href="#build">Build Erlang/OTP With SystemTap Enabled</a>&#8221;.</p>
</div>
</div>
<h3>Installing SystemTap on a CentOS 5 Machine</h3>
<p>I started with a CentOS 5.6 32-bit machine, freshly installed via the installer CD. Using a 64-bit machine is fine.  My 32-bit machine is actually a virtual machine, running inside <a href="http://www.vmware.com/products/fusion/overview.html">VMware Fusion</a>.</p>
<p>First, edit the <tt>/etc/yum.repos.d/CentOS-Debuginfo.repo</tt> file to change the &#8220;<tt>enabled=0</tt>&#8221; line to be &#8220;<tt>enabled=1</tt>&#8221; instead.</p>
<p>Next, run the following &#8220;yum&#8221; commands:</p>
<pre>yum -y install kernel kernel-devel kernel-headers
yum -y install kernel-debuginfo kernel-debuginfo-common systemtap systemtap-sdt-devel systemtap-runtime systemtap-testsuite
yum -y install make gcc gcc-c++ kernel-devel m4 ncurses-devel openssl-devel</pre>
<p>Those three commands take care of the following:</p>
<ul>
<li>Upgrade your kernel</li>
<li>Install the kernel &amp; user packages required for SystemTap</li>
<li>Install the minimum packages needed to build the Erlang/OTP package.</li>
</ul>
<div>We also need to install Git.  This recipe comes from <a href="http://www.webtatic.com/packages/git17/">http://www.webtatic.com/packages/git17/</a>:</div>
<div>
<pre>rpm -Uvh http://repo.webtatic.com/yum/centos/5/latest.rpm
yum install --enablerepo=webtatic git</pre>
</div>
<p>When that&#8217;s done, reboot your machine.  The default kernel in your GRUB boot loader config should be the latest stable kernel for CentOS 5.</p>
<p>Now you&#8217;re ready for the next step &#8230; skip down to the section &#8220;<a href="#build">Build Erlang/OTP With SystemTap Enabled</a>&#8221;.</p>
<h3>Installing SystemTap on a CentOS 6 Machine</h3>
<p>Setting up CentOS 6 is slightly easier than CentOS 5.</p>
<p>However, I&#8217;ve had a more difficult time using SystemTap on CentOS 6.  There seem to be two (or three?) main variations of the SystemTap kernel &amp; user-space packaging.  Whichever one CentOS 6 uses doesn&#8217;t work very well for me.  If you have a different experience, please contact me!</p>
<p>OK, here are the packages to install via &#8220;yum&#8221;:</p>
<pre>yum install kernel kernel-devel kernel-headers
yum install systemtap systemtap-sdt-devel systemtap-runtime systemtap-testsuite
yum install make gcc gcc-c++ kernel-devel m4 ncurses-devel openssl-devel
yum install git</pre>
<p>Reboot your machine, then you&#8217;re ready for the next phase.</p>
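<h3><a name="build"></a>Build Erlang/OTP With SystemTap Enabled</h3>
<p>First, please make certain that the kernel that&#8217;s currently running supports user-space SystemTap; the section &#8220;Checking to See If User-Space SystemTap Works&#8221; below shows how.</p>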
<p>See the <a href="http://www.snookles.com/slf-blog/2011/11/19/dtrace-and-erlang-a-status-report/#howtotry">&#8220;How Do I Try It?&#8221;</a> section of the status report for instructions on how to compile the Erlang/OTP distribution.  Both the R14B04 and the still-under-development R15A releases are supported.</p>
<h3>Checking to See If User-Space SystemTap Works</h3>
<p>Please run the following command to make certain that user-space SystemTap is installed correctly on your system.</p>
<pre>    grep CONFIG_UTRACE /boot/config-`uname -r`</pre>
<p>The output should be:</p>
<pre>    CONFIG_UTRACE=y</pre>
<p>If not, you&#8217;re using the wrong kernel.</p>
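<p>The same check can be wrapped in a small shell function so that it can be reused in a setup script.  This is only a sketch: the name <code>kernel_has_utrace</code> is hypothetical, and it assumes the kernel config file lives under <code>/boot</code> as in the command above.</p>

```shell
# Hypothetical helper: succeed if a kernel config file enables utrace.
# With no argument, check the config file of the currently running kernel.
# The -s flag keeps grep quiet if the config file does not exist.
kernel_has_utrace() {
    grep -qs '^CONFIG_UTRACE=y$' "${1:-/boot/config-$(uname -r)}"
}
```

<p>Calling <code>kernel_has_utrace</code> with no argument checks the running kernel; passing a path checks some other config file, e.g. the one for a kernel that you plan to boot into.</p>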
<p>Next, check to see if your Erlang build has been compiled successfully with SystemTap via the DTrace compatibility mode:</p>
<pre>    PATH=/path/to/beam:$PATH stap -L 'process("beam").mark("*")'</pre>
<p>You should see output that looks like this:</p>
<pre>process("beam").mark("aio_pool__add")
process("beam").mark("aio_pool__get")
process("beam").mark("bif__entry")
process("beam").mark("bif__return")
process("beam").mark("copy__object")
process("beam").mark("copy__struct")
[...]</pre>
<p>Congratulations!</p>
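<p>To check for one particular probe instead of reading the whole listing, the output of <code>stap -L</code> can be piped through a filter.  A minimal sketch; the function name <code>has_marker</code> is invented for this example, and it expects the double-underscore spelling shown in the listing above:</p>

```shell
# Hypothetical helper: read "stap -L" output on stdin and succeed if the
# named marker appears in it.  The marker name is matched as a fixed string.
has_marker() {
    grep -qF "mark(\"$1\")"
}
```

<p>For example, pipe the <code>stap -L</code> command shown above into <code>has_marker bif__entry</code> and test the exit status.</p>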
<h3>Running Your First SystemTap Script on CentOS 5</h3>
<p>You will need two windows/login sessions:</p>
<ol>
<li>One window/login session to run an Erlang shell</li>
<li>One window/login session as the superuser (i.e., &#8220;root&#8221;) to run the SystemTap scripts.</li>
</ol>
<div>Using the first window, start an Erlang shell with SMP support disabled, e.g. <tt>/path/to/your/erlang/installation/bin/erl -smp disable</tt></div>
<p>On my machine, I have close to a dozen different versions of Erlang installed.  To keep my SystemTap/DTrace builds separate from everything else, I used &#8220;<tt>./configure --prefix=/usr/local/erlang/R15A.dtrace</tt>&#8221;.  Therefore, my startup command is &#8220;<tt>/usr/local/erlang/R15A.dtrace/bin/erl -smp disable</tt>&#8221;.</p>
<p>Your paths are probably different.  For the rest of this little tutorial, I assume that you know what the top-level path is.</p>
<p>Using your second window, use &#8220;<tt>ps axw | grep beam</tt>&#8221; to find the full path to the BEAM virtual machine.  This path is <strong>NOT THE SAME</strong> as the path used to start Erlang via the &#8220;<tt>erl</tt>&#8221; command.  On my system, the result is:</p>
<pre>[root@localhost ~]# ps axw | grep beam
 9606 pts/2    S+     0:00 /usr/local/erlang/R15A.dtrace-review3/lib/erlang/erts-5.9/bin/beam -- -root /usr/local/erlang/R15A.dtrace-review3/lib/erlang -progname erl -- -home /home/fritchie -- -smp disable</pre>
<p>The directory that contains the &#8220;<tt>beam</tt>&#8221; executable is what&#8217;s necessary for the &#8220;env PATH&#8221; commands that we will use later.  In my case, that directory is &#8220;<tt>/usr/local/erlang/R15A.dtrace-review3/lib/erlang/erts-5.9/bin</tt>&#8221;. Your directory name is probably different.</p>
<p>In the second window/login session, start a SystemTap script from the examples directory:</p>
<pre>cd /usr/local/erlang/R15A.dtrace/lib/erlang/lib/dtrace-0.8/examples
env PATH=/usr/local/erlang/R15A.dtrace-review3/lib/erlang/erts-5.9/bin:$PATH stap ./garbage-collection.systemtap</pre>
<p>Then, back in the first window, start typing commands at the Erlang shell prompt.  For example:</p>
<pre>Erlang R15A (erts-5.9) [source] [rq:1] [async-threads:0] [hipe] [kernel-poll:false] [dtrace]

Eshell V5.9  (abort with ^G)
1&gt; "this is a test of the SystemTap system".</pre>
<p>Over in the second window, you should see output that looks like this:</p>
<pre>GC minor start pid &lt;0.24.0&gt; need 3 words
GC minor end pid &lt;0.24.0&gt; reclaimed 3 words
GC minor start pid &lt;0.22.0&gt; need 14 words
GC minor end pid &lt;0.22.0&gt; reclaimed 14 words
GC minor start pid &lt;0.24.0&gt; need 8 words
GC minor end pid &lt;0.24.0&gt; reclaimed 8 words</pre>
<p>Congratulations, go have fun exploring Erlang and SystemTap!</p>
<h3>WARNING: Erlang SMP Support Uses a Different Executable Name!</h3>
<p>All of the SystemTap scripts in the &#8220;examples&#8221; directory contain the name of the Erlang virtual machine executable, &#8220;<tt>beam</tt>&#8221;.  We started Erlang using the &#8220;<tt>-smp disable</tt>&#8221; flag because we wanted to make certain that we were using &#8220;<tt>beam</tt>&#8221;, the executable that does <strong>NOT</strong> support SMP, i.e., symmetric multi-processing, i.e., multi-CPU and multi-core systems.</p>
<p>The executable that supports SMP is called &#8220;<tt>beam.smp</tt>&#8221;.  If you use the SMP version of Erlang, then you must also change the name of the executable in your SystemTap scripts!</p>
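<p>One way to make that change without editing each script by hand is a <tt>sed</tt> filter.  This is a sketch only; the function name <code>use_smp_executable</code> is made up here, and it assumes that the scripts refer to the executable exactly as <code>process("beam")</code>:</p>

```shell
# Hypothetical helper: rewrite every process("beam") probe point so that it
# names the SMP executable instead.  Reads a SystemTap script on stdin and
# writes the modified script to stdout; the original file is left untouched.
use_smp_executable() {
    sed -e 's/process("beam")/process("beam.smp")/g'
}
```

<p>Usage might look like <tt>use_smp_executable &lt; ./garbage-collection.systemtap &gt; /tmp/garbage-collection.smp.systemtap</tt>, where the output file name is just a placeholder.</p>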
<h3>Running Your First SystemTap Script on CentOS 6</h3>
<p>All of the directions given in the CentOS 5 section apply to CentOS 6 as well.</p>
<p>Why is there a separate CentOS 6 section?  Because things don&#8217;t work correctly for me, and I&#8217;m hoping that someone else will notice and then email me with a fix.  (Hint, hint, &#8230;. :-)</p>
<p>Here&#8217;s what I see when I run the same garbage collection script on my CentOS 6 machine:</p>
<pre>[root@localhost ~]# env PATH=/usr/local/erlang/R15A.dtrace-review3/lib/erlang/erts-5.9/bin:$PATH stap ./garbage-collection.systemtap
semantic error: no match while resolving probe point process("beam").mark("gc_major-start")
semantic error: no match while resolving probe point process("beam").mark("gc_minor-start")
semantic error: no match while resolving probe point process("beam").mark("gc_major-end")
Pass 2: analysis failed.  Try again with another '--vp 01' option.</pre>
<p>Actually, I know how to fix that one.  The problem comes from the DTrace convention of using underscore characters and hyphen characters in probe names.  With DTrace, if a probe definition file uses a name that contains two consecutive underscores, e.g., &#8220;gc_major__start&#8221;, then the DTrace probe name is &#8220;gc_major-start&#8221;.</p>
<p>The version of SystemTap that CentOS 5 uses seems to be aware of the double-underscore-becomes-a-hyphen convention.  The CentOS 6 version of SystemTap does not.</p>
<p>So, we fix our file with a bit of editing, so it now looks like this:</p>
<pre>[... comments at top of file omitted ...]
probe process("beam").mark("gc_major__start")
{
    printf("GC major start pid %s need %d words\n", user_string($arg1), $arg2);
}

probe process("beam").mark("gc_minor__start")
{
    printf("GC minor start pid %s need %d words\n", user_string($arg1), $arg2);
}

probe process("beam").mark("gc_major__end")
{
    printf("GC major end pid %s reclaimed %d words\n", user_string($arg1), $arg2);
}

probe process("beam").mark("gc_minor__end")
{
    printf("GC minor end pid %s reclaimed %d words\n", user_string($arg1), $arg2);
}</pre>
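<p>The same edit can be applied mechanically to the other example scripts.  A minimal sketch, assuming that the only hyphens that need changing sit inside <code>mark("...")</code> names; the function name <code>marks_to_double_underscore</code> is invented for this example:</p>

```shell
# Hypothetical helper: inside mark("...") names, replace each DTrace-style
# hyphen with a double underscore.  Reads a script on stdin and writes the
# result to stdout.  The sed loop (label "a", then "ta") repeats the
# substitution until no mark("...") name on the line contains a hyphen.
marks_to_double_underscore() {
    sed -e ':a' -e 's/\(mark("[^"-]*\)-/\1__/' -e 'ta'
}
```

<p>For example: <tt>marks_to_double_underscore &lt; ./messages.systemtap &gt; /tmp/messages.systemtap</tt>.  Hyphens elsewhere in the script, such as inside <code>printf</code> strings, are left alone.</p>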
<p>Now we try to run again.  This time we see garbage instead of the Erlang PID strings that we ought to see:</p>
<pre>[root@localhost ~]# env PATH=/usr/local/erlang/R15A.dtrace-review3/lib/erlang/erts-5.9/bin:$PATH stap /tmp/garbage-collection.systemtap
GC minor start pid ???? need 8 words
GC minor end pid ???? reclaimed 8 words
GC minor start pid ???? need 14 words
GC minor end pid ???? reclaimed 14 words
GC minor start pid need 16 words
GC minor end pid reclaimed 16 words
^C</pre>
<p>Other SystemTap scripts in the examples directory have bigger problems, such as the following (after changing the probe name hyphens to double-underscores):</p>
<pre>[root@localhost ~]# env PATH=/usr/local/erlang/R15A.dtrace-review3/lib/erlang/erts-5.9/bin:$PATH stap /tmp/messages.systemtap
semantic error: failed to retrieve location attribute for local 'arg4' (dieoffset: 0x193897): identifier '$arg4' at /tmp/messages.systemtap:79:9
        source:     if ($arg4 == 0 &amp;&amp; $arg5 == 0 &amp;&amp; $arg6 == 0) {
                        ^
Pass 2: analysis failed.  Try again with another '--vp 01' option.

[root@localhost ~]# env PATH=/usr/local/erlang/R15A.dtrace-review3/lib/erlang/erts-5.9/bin:$PATH stap -L 'process("beam").mark("*")' | grep message
process("beam").mark("message__queued") $arg1:char* $arg3:int $arg4:Sint $arg5:Sint $arg6:Sint
process("beam").mark("message__receive")
process("beam").mark("message__send") $arg1:char* $arg2:char* $arg4:Sint $arg5:Sint $arg6:Sint
process("beam").mark("message__send__remote") $arg1:char* $arg2:char* $arg3:char* $arg5:Sint $arg6:Sint $arg7:Sint</pre>
<p>So, the probe is there, it&#8217;s named correctly, and it&#8217;s got the proper number of arguments but &#8230; SystemTap doesn&#8217;t like it.  Perhaps there&#8217;s a different version of SystemTap syntax also?  Hrm, I don&#8217;t know the answer, but I&#8217;ll try to find out in the next week&#8230;.<br />
<a name="update1"></a></p>
<h3>Conclusion</h3>
<p>I hope you&#8217;ve enjoyed this little tutorial.  Now go have some fun and explore Erlang and SystemTap!</p>
<h3> UPDATE 1, Saturday Nov 19, 2011:</h3>
<p>It&#8217;s quite possible that despite the number 6 being greater than the number 5, CentOS 6 is using SystemTap version 1.2, and CentOS 5 is using SystemTap version 1.3.  If that&#8217;s true, then it&#8217;s probably better for CentOS 6 users to roll their own SystemTap support.  Many thanks to Andreas Schultz for pointing this out.</p>
<p>Another tip from Andreas: Newer versions of SystemTap are available using the <a href="http://wiki.centos.org/AdditionalResources/Repositories/CR">CentOS continuous release (CR) repository</a>.  Beware that packages there may not be 100% stable.<br />
<a name="update2"></a></p>
<h3>UPDATE 2, Tuesday Nov 29, 2011:</h3>
<p>There appears to be an additional problem, this time with CentOS 5 and its SystemTap version 1.3: probes defined by shared libraries do not work correctly.  To demonstrate, run this little demo in two different windows/login sessions.</p>
<p>In window/login session #1, run this command using a DTrace-enabled Erlang build. Depending on where your Erlang package was installed, you <strong>must</strong> change the directory paths to match your environment.</p>
<pre>/usr/local/erlang/R15A.dtrace-review-20111127/bin/erl -eval '{dtrace:init(), [{dtrace:p(42), io:format("."), timer:sleep(1000)} || _ &lt;- lists:seq(1,2000)]}.'</pre>
<p>This 1-liner will load the &#8220;dtrace.so&#8221; shared library into the Erlang virtual machine and then try to fire a probe within that library once per second.</p>
<p>In window/login session #2, as root/superuser, run the following two commands. Also, note the output from the first command should match what I show here.</p>
<pre># env PATH=/usr/local/erlang/R15A.dtrace-review-20111127/lib/erlang/lib/dtrace-0.8/priv/lib:/usr/local/erlang/R15A.dtrace-review-20111127/lib/erlang/erts-5.9.pre/bin:$PATH stap -l 'process("dtrace.so").mark("*")'
process("dtrace.so").mark("user_trace__i4s4")
process("dtrace.so").mark("user_trace__s1")

# env PATH=/usr/local/erlang/R15A.dtrace-review-20111127/lib/erlang/lib/dtrace-0.8/priv/lib:/usr/local/erlang/R15A.dtrace-review-20111127/lib/erlang/erts-5.9.pre/bin:$PATH stap -e 'probe process("dtrace.so").mark("*") {printf("Shared lib probe fired");}'</pre>
<p>If the probe is working correctly, then you should see the &#8220;Shared lib probe fired&#8221; message printed once per second. Unfortunately, on my CentOS 5 machine, the probe does nothing.  :-(</p>
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2011/11/19/systemtap-and-erlang-a-tutorial/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>DTrace and Erlang: a status report</title>
		<link>http://www.snookles.com/slf-blog/2011/11/19/dtrace-and-erlang-a-status-report/</link>
		<comments>http://www.snookles.com/slf-blog/2011/11/19/dtrace-and-erlang-a-status-report/#comments</comments>
		<pubDate>Sat, 19 Nov 2011 05:39:14 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[Erlang]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=159</guid>
		<description><![CDATA[I thought it&#8217;d be a good idea to write a status report about my work with Erlang, DTrace, and some SystemTap. Quick, I Want an Overview and Some Slides&#8230;. I gave a presentation at the Erlang User Conference  in Stockholm a couple &#8230; <a href="http://www.snookles.com/slf-blog/2011/11/19/dtrace-and-erlang-a-status-report/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I thought it&#8217;d be a good idea to write a status report about my work with Erlang, <a href="http://dtrace.org/blogs/">DTrace</a>, and some <a href="http://sourceware.org/systemtap/">SystemTap</a>.</p>
<h3>Quick, I Want an Overview and Some Slides&#8230;.</h3>
<p>I gave a presentation at the <a title="EUC Stockholm 2011" href="http://www.erlang-factory.com/conference/ErlangUserConference2011">Erlang User Conference</a>  in Stockholm a couple weeks ago.  The topic was about adding DTrace probes to the Erlang virtual machine.  The <a title="Scott's presentations at the 2011 EUC" href="http://www.erlang-factory.com/conference/ErlangUserConference2011/speakers/ScottLystigFritchie">presentation slides (and soon video, I hope)</a> are available online at the Erlang Factory website.  Click on the easel icon to fetch a PDF copy of the slides.  When the video is available, it will probably appear as an embedded video player/applet/thingie.</p>
<p>I&#8217;ve given a few other talks at EUC, Erlang Factory, and Erlang Workshop venues.  I&#8217;ve never had as many people tell me afterward that they both enjoyed the talk and are looking forward to using DTrace (or SystemTap) with Erlang programs.</p>
<h3>Getting DTrace into the Erlang/OTP R15 Release</h3>
<p><a title="erlang-patches mailing list posting #1" href="http://erlang.org/pipermail/erlang-patches/2011-November/002465.html">I submitted a set of four patches</a> to the erlang-patches mailing list on Wednesday night, November 16, 2011.  <a title="erlang-patches mailing list posting #2" href="http://erlang.org/pipermail/erlang-patches/2011-November/002469.html">The patches were accepted</a> by the OTP Team and are now sitting in the <a title="Erlang/OTP source repo @ GitHub, &quot;pu&quot; branch" href="https://github.com/erlang/otp/tree/pu">&#8220;pu&#8221; branch of the Erlang/OTP GitHub source repository</a>.  The name &#8220;pu&#8221; stands for &#8220;proposed updates&#8221;.  The &#8220;pu&#8221; branch is used for all externally-submitted patches which have passed the initial round of review.  Everything in the &#8220;pu&#8221; branch is tested regularly by Ericsson on all supported Erlang/OTP platforms (e.g. Solaris, Windows, Linux, &#8230;).</p>
<p>Now we wait and see what Ericsson&#8217;s tests show.  The DTrace patches were submitted well ahead of the code freeze date for the Erlang/OTP R15B release, which is tentatively scheduled for December 14, 2011.  If all goes according to plan, the patches will be in the official Erlang/OTP R15B release next month!</p>
<h3>What Is Traceable?</h3>
<p>Take a look at the &#8220;examples&#8221; directory in the dtrace application for some simple tracing scripts.  To see the actual definition of the DTrace probes and their arguments, please see the  <a href="https://github.com/slfritchie/otp/blob/dtrace-review3/erts/emulator/beam/erlang_dtrace.d">erlang_dtrace.d definition file</a>.  Please note that this link is a static one and may not always show the absolutely newest, freshest, most current version of the file.</p>
<p>Here&#8217;s a short summary of what&#8217;s traceable today using DTrace:</p>
<ul>
<li>Processes: spawn, exit, hibernate, scheduled, &#8230;</li>
<li>Messages: send, queued, received, exit signals</li>
<li>Memory: GC minor &amp; major, proc heap grow &amp; shrink</li>
<li>Data copy: within heap, across heaps</li>
<li>Function calls: function &amp; BIF &amp; NIF, entry &amp; return</li>
<li>Network distribution: monitor, port busy, output events</li>
<li>Ports: open, command, control, busy/not busy</li>
<li>Drivers: callback API 100% instrumented</li>
<li><tt>efile_drv.c</tt> file I/O driver: 100% instrumented</li>
</ul>
<p>As of today, there are 60 probes available.</p>
<h3>DTrace &amp; Erlang: Supported Platforms</h3>
<p>I&#8217;ve been testing the following platforms regularly:</p>
<ul>
<li>OS X 10.6.x, a.k.a. Snow Leopard.  I don&#8217;t know of any reason why it shouldn&#8217;t work with 10.7.x, a.k.a. Lion.</li>
<li>Solaris 10.  I&#8217;ve run into difficulties with long-running D scripts on my Solaris 10 machine, but I&#8217;ll write more about that in a later blog posting.</li>
<li>Linux, specifically CentOS 5 and CentOS 6 using <a href="http://sourceware.org/systemtap/">SystemTap</a>&#8217;s DTrace compatibility API.  I&#8217;ll include some recipes for setting up CentOS boxes to be capable of running user-space SystemTap in another blog posting.  Things should also work when using a Fedora 16 system, but I haven&#8217;t tested that yet.</li>
</ul>
<p>I&#8217;m hoping to be able to get FreeBSD 9 supported also.  At the time of this writing, <a href="http://www.freebsd.org/releases/9.0R/schedule.html">FreeBSD 9.0RC1 is now available for testing</a>.  Andrew Thompson @ Basho has started the FreeBSD 9 work and passed it on to me.  I&#8217;ll try to get it hammered into working condition over this weekend.<br />
<a name="howtotry"></a></p>
<h3>How Do I Try It?</h3>
<p>First, clone a copy of my source repository at GitHub:</p>
<pre>git clone git://github.com/slfritchie/otp.git</pre>
<p>Then you need to choose the branch that you wish to build:</p>
<ul>
<li>Run the following if you want to compile the R14B04 version of Erlang/OTP: &#8220;<tt>cd otp ; git checkout dtrace-r14b04</tt>&#8221;.</li>
<li>Run the following if you want to compile the R15A version of Erlang/OTP: &#8220;<tt>cd otp ; git checkout dtrace-r15</tt>&#8221;.  NOTE that this branch is not fully synchronized with Ericsson&#8217;s R15A development branch.</li>
<li>Run the following if you want to compile the patches as I submitted them to Ericsson for inclusion in Ericsson&#8217;s source repo: &#8220;<tt>cd otp ; git checkout dtrace-review3</tt>&#8221;.</li>
</ul>
<p>I also have a branch called &#8220;<tt>dtrace-experiment+michal2</tt>&#8221; that contains the primary copy of all of the changes that I later merge into the three branches listed above.</p>
<p>Here are the instructions for building the Erlang/OTP package.  I assume that you already have the toolchain dependencies (e.g. GCC compiler) and libraries (e.g. OpenSSL) installed.</p>
<p><code><br />
./otp_build autoconf<br />
./configure --enable-dtrace {put your other config options here}<br />
make<br />
make install<br />
</code></p>
<p>If you&#8217;re using a Linux box with SystemTap installed, you will still use the configure flag &#8220;<tt>--enable-dtrace</tt>&#8221;, despite the fact that you&#8217;re going to be using SystemTap.</p>
<p>After you&#8217;ve installed everything, then you need two windows:</p>
<ol>
<li>One window/login session to run an Erlang shell</li>
<li>One window/login session as the superuser (i.e., &#8220;root&#8221;) to run the D scripts.  If you&#8217;re using SystemTap, please wait until my next blog posting, which will be specifically about Erlang and SystemTap.</li>
</ol>
<p>Using the first window, start an Erlang shell, e.g. <tt>/path/to/your/erlang/installation/bin/erl</tt></p>
<p>Using your second window, start a D script from the examples directory:</p>
<p><code><br />
cd /path/to/your/erlang/installation/lib/erlang/lib/dtrace-0.8/examples<br />
dtrace -qs ./function-calls.d<br />
</code></p>
<p>Then, back in the first window, start typing commands at the Erlang shell prompt.  Watch the output in the second window.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2011/11/19/dtrace-and-erlang-a-status-report/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Been a while since I&#8217;ve written</title>
		<link>http://www.snookles.com/slf-blog/2011/11/19/been-a-while-since-ive-writte/</link>
		<comments>http://www.snookles.com/slf-blog/2011/11/19/been-a-while-since-ive-writte/#comments</comments>
		<pubDate>Sat, 19 Nov 2011 10:34:40 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[None-of-the-above]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=153</guid>
		<description><![CDATA[Life has been busy.  Very good, but also busy. Professionally, things at Basho are very good.  It&#8217;s a great bunch of smart, ego-firmly-in-check folks.  Basho sent me to the Erlang User Conference  in Stockholm a couple weeks ago to present my &#8230; <a href="http://www.snookles.com/slf-blog/2011/11/19/been-a-while-since-ive-writte/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Life has been busy.  Very good, but also busy.</p>
<p>Professionally, things at <a title="Basho Technologies" href="http://www.basho.com/">Basho</a> are very good.  It&#8217;s a great bunch of smart, ego-firmly-in-check folks.  Basho sent me to the <a title="EUC Stockholm 2011" href="http://www.erlang-factory.com/conference/ErlangUserConference2011">Erlang User Conference</a>  in Stockholm a couple weeks ago to present my work on adding <a href="http://dtrace.org/blogs/">DTrace</a> probes to the Erlang virtual machine.  Flying from Stockholm to San Francisco was a very long Sunday, but I managed to survive.  The meetings in San Francisco with my fellow Bashonians were great, as was a lunch with the old crowd at <a href="http://www.geminimobile.com/">Gemini Mobile</a> in Foster City.</p>
<p>I&#8217;m fighting a bad head cold at the moment.  That has made work on DTrace &amp; Erlang stuff difficult this week.  (More on DTrace in a later posting.)</p>
<p>We&#8217;re quite looking forward to the Thanksgiving break next week, then travel to Salt Lake City the following weekend to visit our new niece Autumn as well as budding musician Thai and proud parents Victor and Carla.  And perhaps someday we&#8217;ll make plans for Christmas&#8230;.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2011/11/19/been-a-while-since-ive-writte/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Blog Roused from Slumber: the MSPIFF 2011</title>
		<link>http://www.snookles.com/slf-blog/2011/04/04/blog-roused-from-slumber-the-mspiff-2011/</link>
		<comments>http://www.snookles.com/slf-blog/2011/04/04/blog-roused-from-slumber-the-mspiff-2011/#comments</comments>
		<pubDate>Mon, 04 Apr 2011 01:28:22 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[Film]]></category>
		<category><![CDATA[MSPIFF]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=146</guid>
		<description><![CDATA[Yep, I haven&#8217;t written for a while.  One excuse is that I&#8217;ve started using Twitter.  After being a Twitter skeptic for quite a while, I guess I&#8217;ve decided that Twitter&#8217;s 140 character limit is indeed enough for some things. &#8230; <a href="http://www.snookles.com/slf-blog/2011/04/04/blog-roused-from-slumber-the-mspiff-2011/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Yep, I haven&#8217;t written for a while.  One excuse is that I&#8217;ve started using Twitter.  After being a Twitter skeptic for quite a while, I guess I&#8217;ve decided that Twitter&#8217;s 140 character limit is indeed enough for some things.  For those three loyal readers of this blog, here is the link to my <a title="Scott Lystig Fritchie's twitter messages of near-infinite folly" href="https://twitter.com/slfritchie">Twitter messages of near-infinite wisdom</a>.</p>
<p>Meanwhile, it&#8217;s about time for the <a href="http://www.mspfilmfest.org/2011/">2011 Minneapolis/St. Paul International Film Festival</a> to start.  I bought my gold pass today and had a chance to visit with a couple of the organizers for a few minutes.  Almost all films have been chosen and scheduled, but the Web site is still undergoing a lot of development.  One missing feature is easy browsing of films by date &amp; time.</p>
<p>The printed program will likely be done in a week or so.</p>
<p>I haven&#8217;t looked at the whole program yet, but (strangely enough) the title <a href="http://www.imdb.com/title/tt1595379/">The Lutefisk Wars (IMDB)</a> managed to jump out at me&#8230;.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2011/04/04/blog-roused-from-slumber-the-mspiff-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hibari mentioned in InfoWorld article</title>
		<link>http://www.snookles.com/slf-blog/2010/10/26/hibari-mentioned-in-infoworld-article/</link>
		<comments>http://www.snookles.com/slf-blog/2010/10/26/hibari-mentioned-in-infoworld-article/#comments</comments>
		<pubDate>Tue, 26 Oct 2010 22:32:58 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Hibari]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=141</guid>
		<description><![CDATA[This article on Oct 25th mentions Hibari in the &#8220;Erlang&#8221; section. As a Basho guy nowadays, I&#8217;m very slightly sad that the author didn&#8217;t also mention Riak, but as a Hibari fan, I&#8217;m quite happy. Seven Programming Languages on the Rise: From &#8230; <a href="http://www.snookles.com/slf-blog/2010/10/26/hibari-mentioned-in-infoworld-article/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>This article on Oct 25th mentions Hibari in the &#8220;Erlang&#8221; section.  As a <a href="http://www.basho.com/">Basho</a> guy nowadays, I&#8217;m very slightly sad that the author didn&#8217;t also mention <a href="http://wiki.basho.com/">Riak</a>, but as a <a href="http://hibari.sourceforge.net/">Hibari</a> fan, I&#8217;m quite happy.</p>
<p style="padding-left: 30px;">Seven Programming Languages on the Rise: From Ruby to Erlang, once niche programming languages are gaining converts in today’s enterprise. <a href="http://infoworld.com/d/developer-world/7-programming-languages-the-rise-620">http://infoworld.com/d/developer-world/7-programming-languages-the-rise-620</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2010/10/26/hibari-mentioned-in-infoworld-article/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Setting Up Hibari from Scratch in a CentOS 5 Virtual Machine</title>
		<link>http://www.snookles.com/slf-blog/2010/10/10/setting-up-hibari-from-scratch-in-a-centos-5-virtual-machine/</link>
		<comments>http://www.snookles.com/slf-blog/2010/10/10/setting-up-hibari-from-scratch-in-a-centos-5-virtual-machine/#comments</comments>
		<pubDate>Sun, 10 Oct 2010 05:33:59 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Hibari]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=138</guid>
		<description><![CDATA[The following is based on a message that I&#8217;d sent a few weeks ago to the hibari-questions-en mailing list over at SourceForge. Since then, I&#8217;ve streamlined things slightly and added instructions on how to set up a CentOS 5 (inside &#8230; <a href="http://www.snookles.com/slf-blog/2010/10/10/setting-up-hibari-from-scratch-in-a-centos-5-virtual-machine/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>The following is based on a message that I&#8217;d sent a few weeks ago to the <a href="https://lists.sourceforge.net/lists/listinfo/hibari-patches-en">hibari-questions-en</a> mailing list over at <a href="http://sf.net/">SourceForge</a>.  Since then, I&#8217;ve streamlined things slightly and added instructions on how to set up a <a href="http://www.centos.org/">CentOS</a> 5 system (inside a VMware virtual machine, though using &#8220;real&#8221; hardware is certainly OK too) to act as the host for a Hibari server node.</p>
<h2>Installing CentOS 5 From Scratch</h2>
<p>I was installing on virtual hardware, using VMware Fusion for MacOS.  The virtual machine&#8217;s configuration summary: 256MB RAM, 1 CPU, and a 20GB expanding disk.  A Hibari server can consume far more than 256MB of RAM if you&#8217;re storing a lot of keys, but for experimentation &amp; development work, 256MB is fine.  Besides, it&#8217;s easy to change the amount of RAM when using virtual hardware.</p>
<p>Booting the CentOS 5 server DVD, I chose the graphical installation option.</p>
<p>Partitioning via LVM:</p>
<ul>
<li>LogVol00, /, 19584 MB</li>
<li>LogVol01, swap, 768 MB</li>
</ul>
<p>Clock <em>does not</em> use UTC.  Most people I know don&#8217;t use UTC like that, so why does Red Hat (and therefore CentOS) make it the default?</p>
<p>Software package choices:</p>
<ul>
<li> un-select: Desktop &#8211; Gnome</li>
<li>select: Server</li>
<li>Customize later</li>
</ul>
<p>After first reboot, &#8220;Setup Agent&#8221; runs automatically:</p>
<ul>
<li> Authentication: change nothing</li>
<li>Firewall config: Security level disabled, SELinux disabled</li>
<li>Network config: DNS config: set appropriately to your environment</li>
<li>Services config: turn off: avahi-daemon, bluetooth, cups, isdn, nfslock, pcscd, portmap, rpcgssd, rpcidmapd, sendmail, yum-updatesd.
<ul>
<li>There are a few other services that could be turned off, but those are mostly small quibbles not worth worrying about.</li>
<li>The SSH service is on by default, hooray.</li>
<li>The default sshd config allows &#8220;root&#8221; to log in, which I frequently use.  Your opinion may not agree.</li>
</ul>
</li>
</ul>
<p>Once booted fully to a login prompt, log in, install VMware tools, use defaults for everything, and then reboot.</p>
<h2>Building Erlang</h2>
<p>The <a href="http://erlangexamples.com/2010/04/30/building-erlang-b13r04-in-centos-5-4/">blog article at ErlangExamples.com</a> has the longer explanation.  A really short summary of how to install the prerequisite software packages is:</p>
<pre style="padding-left: 30px;"><code>sudo yum -y install make gcc gcc-c++ kernel-devel m4 ncurses-devel openssl-devel</code></pre>
<p>I put Erlang in a non-standard place because it&#8217;s easier to deal with multiple releases of the Erlang runtime on the same machine.  By default (i.e., without using the &#8220;<code>--prefix</code>&#8221; argument to &#8220;<code>./configure</code>&#8221;), the main Erlang commands &#8220;<code>erl</code>&#8221; and &#8220;<code>erlc</code>&#8221; are installed in <code>/usr/local/bin</code>.  With the commands below, those commands will be available instead at <code>/usr/local/erlang/R13B04/bin</code>.</p>
<pre>  tar zxvf otp_src_R13B04.tar.gz
  cd otp_src_R13B04
  ./configure --prefix=/usr/local/erlang/R13B04
  make
  make install
</pre>
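<p>A quick sanity check after <code>make install</code> can save some head-scratching later.  The sketch below is a hypothetical helper of my own naming (<code>check_erl</code> is not part of the Erlang distribution); it assumes an sh/bash shell and simply asks the freshly installed runtime for its release name.</p>
<pre>  # Hypothetical helper, not part of Erlang/OTP.
  # Prints the OTP release name if erl exists under the given prefix,
  # otherwise prints a message and returns 1.
  check_erl() {
    prefix=$1
    if [ -x "$prefix/bin/erl" ]; then
      "$prefix/bin/erl" -noshell -eval 'io:format("~s~n", [erlang:system_info(otp_release)]), halt().'
    else
      echo "erl not found under $prefix"
      return 1
    fi
  }
</pre>
<p>Running <code>check_erl /usr/local/erlang/R13B04</code> should print the release name of the runtime that was just built.</p>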
<h2>Building Hibari</h2>
<p>As root:</p>
<pre>  yum install libtool ncompress
</pre>
<p>Bummer: &#8220;yum install git&#8221; doesn&#8217;t know about Git on a stock CentOS 5 system, so I had to build it from source.</p>
<pre>  mkdir -p /usr/local/src/git
  cd /usr/local/src/git
  wget http://www.kernel.org/pub/software/scm/git/git-1.7.2.tar.bz2
  tar jxvf git-1.7.2.tar.bz2
  cd git-1.7.2
  ./configure
  make
  make install
</pre>
<p>As non-root (use only one of the first two lines, depending on your login shell; the parenthesized shell names at the end of those lines are labels, not part of the command):</p>
<pre>  set path = ($path /usr/local/erlang/R13B04/bin) (csh/tcsh style)
  export PATH=${PATH}:/usr/local/erlang/R13B04/bin (sh/bash style)
  mkdir -p ~/work/hibari
  cd ~/work/hibari
  git clone git://hibari.git.sourceforge.net/gitroot/hibari/bom .
  env BOM_GIT=git://hibari.git.sourceforge.net/gitroot/hibari/ \
  ./bom.sh co src/top/hibari/GDSS
  ./bom.sh make
  make
</pre>
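<p>If you would rather keep the sh/bash <code>PATH</code> setting above in a startup file such as <code>~/.profile</code>, something like the following sketch avoids adding the same directory again every time the file is sourced.  The function name <code>path_append</code> is my own invention, and this is sh/bash only; csh/tcsh users will need a different approach.</p>
<pre>  # Hypothetical helper: append a directory to PATH only if it is missing.
  path_append() {
    case ":$PATH:" in
      *":$1:"*) ;;              # already present, do nothing
      *) PATH="$PATH:$1" ;;     # not present, add it at the end
    esac
    export PATH
  }

  path_append /usr/local/erlang/R13B04/bin
</pre>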
<p>As root:</p>
<pre>  cd /path/to/hibari/pkg
  ./installer-gdss.sh -o silent</pre>
<p>And because I install Erlang in a slightly weird place, I needed a   symlink to point to the place where <code>/etc/init.d/gdss</code> expects to find it.   (As mentioned above, I install it at <code>/usr/local/erlang/R13B04</code>.)  As root:</p>
<pre>  ln -s /usr/local/erlang /usr/local/hibari/ert
</pre>
<p>Now to start Hibari and then configure the local node to be a standalone/1-node system.  Again as root:</p>
<pre>  /etc/init.d/gdss start
  /etc/init.d/gdss provision-standalone
</pre>
<p>That&#8217;s it, we&#8217;re all done.  You can point your Web browser at <code>http://localhost:23080/</code> (or substitute the Hibari machine&#8217;s DNS hostname or IP address, if that machine isn&#8217;t the localhost) to verify that the server is indeed running.</p>
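<p>On a server without a Web browser, the same check can be done from the shell.  This is only a sketch: it assumes that <code>curl</code> is installed, the function name <code>wait_for_http</code> is made up, and the retry count and pause are arbitrary.  It treats any HTTP response at all as &#8220;the server is up&#8221;.</p>
<pre>  # Hypothetical helper: poll a URL until it answers or the tries run out.
  # Assumes curl is installed; 10 tries and a 2 second pause are arbitrary.
  wait_for_http() {
    url=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
      # curl exits 0 as soon as the server sends any HTTP response
      if curl -s -o /dev/null "$url"; then
        echo "up: $url"
        return 0
      fi
      tries=$((tries - 1))
      sleep 2
    done
    echo "no answer from $url"
    return 1
  }
</pre>
<p>For the standalone node above, that would be <code>wait_for_http http://localhost:23080/</code>.</p>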
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2010/10/10/setting-up-hibari-from-scratch-in-a-centos-5-virtual-machine/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>basho_bench for Hibari</title>
		<link>http://www.snookles.com/slf-blog/2010/09/16/basho_bench-for-hibari/</link>
		<comments>http://www.snookles.com/slf-blog/2010/09/16/basho_bench-for-hibari/#comments</comments>
		<pubDate>Thu, 16 Sep 2010 06:57:43 +0000</pubDate>
		<dc:creator>slfritchie</dc:creator>
				<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Hibari]]></category>

		<guid isPermaLink="false">http://www.snookles.com/slf-blog/?p=128</guid>
		<description><![CDATA[Howdy, all. I&#8217;ve written a first draft of a basho_bench driver for Hibari. Instructions on how to use it follow. NOTE: You&#8217;ll need Git and Mercurial in order to check out the basho_bench source. You&#8217;ll also need the (very cool) &#8230; <a href="http://www.snookles.com/slf-blog/2010/09/16/basho_bench-for-hibari/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Howdy, all.  I&#8217;ve written a first draft of a <a href="http://bitbucket.org/basho/basho_bench">basho_bench</a> driver for <a href="http://hibari.sourceforge.net/">Hibari</a>.  Instructions on how to use it follow.</p>
<p>NOTE: You&#8217;ll need <a href="http://git-scm.com/">Git</a> and <a href="http://mercurial.selenic.com/">Mercurial</a> in order to check out the basho_bench source.</p>
<p>You&#8217;ll also need the (very cool) <a href="http://www.r-project.org/">statistics package R</a> installed in order to create basho_bench&#8217;s graphs.  For CentOS 5 users, I don&#8217;t have any good advice, sorry.  For Mac OS, once I&#8217;d installed <a href="http://www.macports.org/">MacPorts</a>, I think it was just a matter of &#8220;port install R&#8221;.</p>
<p>Now make a copy of the example config file:<br />
<code><br />
cp examples/hibari.config ./my-hibari.config<br />
</code><br />
&#8230; and edit it to reflect your local environment.  Sorry, I don&#8217;t have any extensive documentation on it right now.  There is basho_bench-specific documentation in the &#8220;docs&#8221; subdirectory.  All of the Hibari-specific stuff is explained (hopefully clearly?) in the comments of the example config file.</p>
<p>The &#8216;code_paths&#8217; section of the config will require pointers to a bunch of code directories in the Hibari source.  *ALSO*, there is one patch that you need to install in the &#8230;/hibari/src/erl-apps/gmt-util__HEAD directory.  Download the patch file and put it in /tmp.  The URL is: <a href="http://www.snookles.com/scott/hibari/gmt_config.erl.patch">http://www.snookles.com/scott/hibari/gmt_config.erl.patch</a></p>
<p>Then apply it and rebuild using:<br />
<code><br />
cd /your/path/to/hibari/src/erl-apps/gmt-util__HEAD<br />
patch -p1 &lt; /tmp/gmt_config.erl.patch<br />
rm .bom*<br />
cd ../../../<br />
make<br />
</code></p>
<p>OK, now back to editing &#8220;./my-hibari.config&#8221;.  Hopefully the comments are clear enough for you to figure out what to do.  As a first experiment, I suggest that you ignore the &#8220;Erlang native client&#8221; and instead use the EBF client.  Most of the entries in examples/hibari.config are good enough.</p>
<p>I suggest changing &#8216;duration&#8217; to 1 minute and &#8216;concurrent&#8217; to something smallish like 10 (until you&#8217;re familiar with the tool).  Change the code_paths to reflect where you&#8217;ve got the Hibari source.  Then change the &#8216;hibari_servers&#8217; list to point to your server(s).  For a single node system, make only one entry in this list.  :-)</p>
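<p>As a rough sketch, the first two of those edits would look something like this in &#8220;./my-hibari.config&#8221;.  basho_bench config files are plain Erlang terms, and &#8216;duration&#8217; is measured in minutes.  I&#8217;ve left out code_paths and hibari_servers here because their values depend entirely on your machine; follow the comments in examples/hibari.config for those.<br />
<code><br />
%% Run for one minute with ten concurrent workers.<br />
{duration, 1}.<br />
{concurrent, 10}.<br />
</code></p>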
<p>Now, time to run the tool.<br />
<code><br />
./basho_bench ./my-hibari.config<br />
</code></p>
<p>The output data will go into a local subdirectory called &#8220;tests&#8221;.  When the run is finished, you&#8217;ll see something like:<br />
<code><br />
=INFO REPORT==== 16-Sep-2010::01:21:31 ===<br />
application: basho_bench<br />
exited: stopped<br />
type: permanent<br />
Test completed after 1 mins.<br />
</code></p>
<p>The command &#8220;make results&#8221; will create a graph.  The graph is a PNG file that is located in the same directory as the test results, for example, &#8220;tests/20100916_012028&#8221;.  Note that there is also a symbolic link, &#8220;tests/current&#8221;, that points to the most recent test run results.  The PNG file can be viewed by most bitmap image viewers; for MacOS, I usually use:<br />
<code><br />
open ./tests/current/summary.png<br />
</code></p>
<p>I&#8217;ll include a sample config and graph.  The config is for a 10 minute run with 100 load generator threads.  This isn&#8217;t a very good test environment: the server is a small RAM, single disk machine.  The basho_bench machine is a MacBook Pro that uses a WiFi network to talk to the server.  This is not the kind of environment that you want to try to optimize, so don&#8217;t bother trying.  This is just meant to be a quickie example, to be emailed before I fall asleep.</p>
<p>Happy hacking!</p>
<p>-Scott</p>
<p>(Click on the image below to see a full-size, hopefully-undistorted image).</p>
<p><a href="http://www.snookles.com/slf-blog/wp-content/uploads/2010/09/summary.png"><img class="alignnone size-full wp-image-129" title="summary" src="http://www.snookles.com/slf-blog/wp-content/uploads/2010/09/summary.png" alt="" width="1024" height="768" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.snookles.com/slf-blog/2010/09/16/basho_bench-for-hibari/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
