[vox-tech] tcp tuning with wireshark?

Nick Schmalenberger nick at schmalenberger.us
Sat Mar 1 03:16:00 PST 2014


On Fri, Feb 28, 2014 at 08:27:13PM -0800, Bill Broadley wrote:
> On 02/26/2014 11:37 PM, Nick Schmalenberger wrote:
> > I need to increase throughput for long lived tcp connections over
> > ipsec over wan between Amazon in Ireland and a Level3 gigabit
> > link in Ashburn Virginia (currently running at about 20Mbps).
> 
> Do you mean that the link is 20mbit?  Or that the bandwidth you are
> achieving over it is 20 mbit?
The latter.
> 
> What is the largest MTU supported across that link?
> 
The interface MTU on the Ireland side is 9001, but the effective
MTU over the path is rather less. Tracepath, and ping with PMTU
discovery, both report 1500 after the first hop, but I have seen
with tcpdump and tshark that packets get fragmented beyond that
anyway. The actual path MTU seems to be 1452. I assume this is
because the Amazon ipsec router simply disregards the
don't-fragment bit that PMTU discovery relies on.
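As a sanity check on that 1452 figure, here is how the probing
commands and the resulting TCP segment size work out (a sketch; the
destination address and the 1452 path MTU are the values from this
thread):

```shell
# Probe the path MTU with standard Linux tools (illustrative):
#   tracepath 10.0.4.61            # reports pmtu per hop
#   ping -M do -s 1424 10.0.4.61   # DF set; 1424 data + 8 ICMP + 20 IP = 1452
# With a 1452-byte path MTU, the usable TCP payload per segment is:
PATH_MTU=1452
MSS=$(( PATH_MTU - 20 - 20 ))    # minus 20-byte IPv4 and 20-byte TCP headers
echo "MSS: $MSS bytes"           # MSS: 1412 bytes
```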

> > I've read various articles saying to enlarge the buffers and make
> > various other kernel tweaks. Some say to base it on the bandwidth
> > delay product, some say just on the link speed, and some say
> > don't bother, Linux does all that automatically now. A lot of it
> > seems random.
> 
> Heh, well there's quite a few variables to consider.  But in general
> bandwidth is much harder to utilize over a high latency link.  So BDP is
> definitely relevant.
> 
> What exactly are you trying to send/receive over this high latency link?
> 
The protocol is Kafka, sending video playing info back from
Europe for aggregation in Ashburn.
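On the BDP point: assuming roughly an 80 ms transatlantic RTT (an
assumed figure, not a measurement from this link; measure the real
value with ping), filling a gigabit pipe works out to about:

```shell
# Bandwidth-delay product: bytes that must be in flight to fill the pipe.
# RTT_MS=80 is an assumption for Ireland <-> Ashburn; measure with ping.
RTT_MS=80
LINK_MBPS=1000                               # gigabit link
BDP_BYTES=$(( LINK_MBPS * 1000000 / 8 * RTT_MS / 1000 ))
echo "BDP: $BDP_BYTES bytes"                 # BDP: 10000000 bytes (~10 MB)
```

So buffers much below ~10 MB would cap throughput on this path
regardless of the link speed.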
> > However, with wireshark I see that the "bytes in flight"
> > measurement which counts unacknowledged bytes from the source
> > never gets close to the window size sent by the destination. Does
> > this suggest anything in particular to tweak? I got some books on
> 
> One cheat/hack is to just do more TCP connections.  Generally the
> throughput will increase over high latency links with more TCP
> connections.  Up to a point of course.
> 
Yeah, we have had success with this, and compression in Kafka also
helps, but I'm hoping to increase throughput at a lower level
anyway, if it's possible.
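A rough model of why extra connections help: each connection's
steady-state throughput is capped at roughly window / RTT, so with
an assumed 256 KB effective window and 80 ms RTT (both illustrative
numbers, not measurements from this link):

```shell
# Per-connection throughput cap = window / RTT; N connections scale it
# roughly N times, up to a point. Window and RTT below are assumptions.
WINDOW_BYTES=262144          # 256 KB effective window (assumed)
RTT_MS=80                    # assumed transatlantic RTT
PER_CONN_MBPS=$(( WINDOW_BYTES * 8 * 1000 / RTT_MS / 1000000 ))
echo "1 connection:  ${PER_CONN_MBPS} Mbit/s"
echo "4 connections: $(( PER_CONN_MBPS * 4 )) Mbit/s"
```

A ~26 Mbit/s single-connection cap under those assumed numbers is in
the same ballpark as the 20 Mbps observed here, which would be
consistent with a window-limited transfer.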

> 
> 
> > wireshark, which were quite helpful in how to use the graphs and
> > filters, but on tcp performance they mostly just talked about the
> > effect of packet loss. I'm not certain, but I don't think packet
> > loss is the main thing holding back my performance, because there
> > is some which causes a brief dip in the window size and then it
> > recovers. Throughput stays pretty flat.
> > 
> > It would be really amazing if there was a flowchart on doing this
> > for linux that could be informed by wireshark io graphs and other
> > graphs. Has anybody ever seen such a chart? If this approach is
> > successful for me, and I can understand how to do it in several
> > scenarios, I think I will even like to make such a flow chart if
> > it doesn't already exist. Thanks for any tips and disabusement of
> > my misunderstandings about tcp in linux :) 
> 
> TCP defaults are definitely suboptimal for transatlantic links.  As are
> many assumptions that applications make.  Much depends on what you are
> trying to do.  I've seen various appliance like widgets that will proxy
> a given protocol for a high latency link so that servers/clients with
> poor assumptions don't take quite as much of a hit.
> 
> A pretty good overview of the related issues is:
>   http://www.psc.edu/index.php/networking/641-tcp-tune
> 
> What I'd do first is attempt to fix things manually by tinkering with
> the mentioned values.  It wouldn't be particularly hard to write a
> traffic generator that played with bitrate and number of simultaneous
> connections to analyze available performance.  Said tool could even
> explore the reasonable ranges for the various knobs you can tinker with
> by writing to /sys and /proc.  Personally I'd be more likely to parse
> the tcpdump logs myself so I could use an arbitrary number of filters,
> statistics, and post processing.   Shouldn't be too hard to graph the
> number of unack'd packets over time for instance.
>
The unack'd packets are exactly what I've been focusing on with
wireshark, and I think wireshark is actually the ideal tool to do
the graphing. It has a metric called "bytes in flight" which
counts unack'd bytes, and I graphed it against the window size
here: http://postimg.org/image/az0hspgkp/ . The spikes clearly
match up with retransmissions, which makes sense because
retransmissions continue until those segments are acked. I'm also
assuming the actual packet loss shown is negligible.

What I don't get, though, is why bytes in flight never seems to
get close to the window size (except for this one time:
http://postimg.org/image/8b3ia1qhh/ ). Also, why would the window
size sustain a higher level after these dips recover? My
expectation was that bytes in flight would show a sawtooth
pattern up to the window size. It's also possible that the window
size is being reached and I just didn't set up the graphs to show
it because of smoothing or aggregation, although I hoped that
selecting MAX for bytes in flight and MIN for window size to
represent each interval would avoid that.

I've heard of some of the appliances you mention, and I think it's
probably not a coincidence that Riverbed is the primary sponsor
of Wireshark! My boss has mentioned them as an option here, and I
think they are even supported in the Amazon environment. But
having a wireshark-informed approach to tuning Linux would be so
cool...

What I had in mind for a flowchart was, for example, that the max
value of net.ipv4.tcp_wmem does not override net.core.wmem_max
(according to
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt),
so there could be a logic of how the various values relate as
bottlenecks.
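A sketch of what that relationship implies in practice: the caps
have to be raised together, since the tcp_wmem max is still limited
by the core cap. The values below are illustrative, sized for a
~10 MB BDP, and not a tested recommendation for this link:

```
# Illustrative /etc/sysctl.d/ fragment; values sized for a ~10 MB BDP.
net.core.wmem_max = 16777216
net.core.rmem_max = 16777216
#                   min   default  max
net.ipv4.tcp_wmem = 4096  65536    16777216
net.ipv4.tcp_rmem = 4096  87380    16777216
```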

Then ideally various patterns in the wireshark graphs would
relate to tunables in linux in a way that fits into the
flowchart. For example, could a too small tx buffer explain
"bytes in flight" never reaching the window size sent from the
destination? Do the spikes in bytes in flight during
retransmission demonstrate the tx buffer is actually plenty
large, and the source application just can't sustain higher
output? Does not reaching the window size at least mean I should
focus on tuning the source host?
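For post-processing outside the GUI, the same "bytes in flight"
series can be pulled from a capture with tshark (capture.pcap is a
placeholder filename; the field names are Wireshark's own):

```shell
# Dump time vs. bytes-in-flight as CSV for custom graphing/statistics.
tshark -r capture.pcap -T fields -E separator=, \
  -e frame.time_relative -e tcp.analysis.bytes_in_flight \
  -Y tcp.analysis.bytes_in_flight > bytes_in_flight.csv
```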

The way I generated the traffic: I created a 100M file from
/dev/urandom and concatenated five copies of it to make a 500M
file (sometimes I just used the 100M file). Then I ran
nc -kl 10.0.4.61 8080 > /dev/null on the destination, and
curl -T /tmp/500Mrandomfile http://10.0.4.61:8080 > /dev/null on
the source.

On the source, /tmp is a tmpfs, so it should be quite high
performance, and between hosts on the LAN the same transfer runs
right up near a gigabit/s, so I think this should be a pretty
good HTTP-like benchmark. Thanks for all your suggestions!
Nick


