[vox-tech] Very slow TCP transfer over loopback

Mike Simons vox-tech@lists.lugod.org
Mon, 1 Mar 2004 12:41:37 -0500


On Sun, Feb 29, 2004 at 11:43:31PM -0800, Matt Roper wrote:
> On Sun, Feb 29, 2004 at 08:56:42PM -0500, Mike Simons wrote:
> > pps: if you move the RCVBUF setting to change the "accept_sock"
> >      instead of the "file", then the problem goes away regardless
> >      of what size.
>
> Hmm, I didn't think it was possible to change the RCVBUF and SNDBUF
> settings after you had already accepted the connection.  I just checked
> the tcp(7) manpage, and it contains the following:
>
>     "On individual connections, the socket buffer size must be set prior
>     to the listen() or connect() calls in order to  have  it take
>     effect."
[...]
> So I think you want to move the RCVBUF setting up to where you set the
> REUSEADDR option.

Yes, thanks for pointing that out, older man pages do not have that
phrase.  I agree that it should be done once before accept instead of
after every single accept.
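
For reference, a minimal sketch of that ordering (this is not the code from
the program under discussion; the helper name make_listener_socket and the
4096-byte size are made up for illustration).  It sets SO_RCVBUF next to
SO_REUSEADDR, before bind()/listen(), on the assumption that accepted
sockets then start out with the listener's buffer size:

```python
import socket

def make_listener_socket(rcvbuf_bytes):
    """Create a TCP socket with SO_RCVBUF set before listen() is ever called."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Set the receive buffer once here, instead of on every accepted socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    return sock

listener = make_listener_socket(4096)
# The kernel may report a larger value than requested (Linux doubles it).
print("SO_RCVBUF reported:", listener.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
listener.close()
```

The caller would then bind(), listen() and accept() on the returned socket
as usual.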

  Unfortunately the new man page is wrong.  Changing the buffer
size with SO_RCVBUF or SO_SNDBUF does have "an effect" after a socket
is accepted.

> Unfortunately I'm not sure why this would cause the
> delays you're experiencing; that sounds almost like something to do with
> the Nagle Algorithm (although that's a sending issue, not a receiving
> issue).

  Err nope, Nagle should not be causing this sort of behavior.
Nagle basically says ... if you can't send a full TCP frame and you have
any outstanding sends that have not yet been ACK'd, then wait until
the ACK arrives.  Both of those conditions fail in this case: there
are many full frames worth of data available to send... and there is
no outstanding data to ack.

===
RFC1122                  TRANSPORT LAYER -- TCP             October 1989
Internet Engineering Task Force                                [Page 98]

            DISCUSSION:
                 The Nagle algorithm is generally as follows:

                      If there is unacknowledged data (i.e., SND.NXT >
                      SND.UNA), then the sending TCP buffers all user
                      data (regardless of the PSH bit), until the
                      outstanding data has been acknowledged or until
                      the TCP can send a full-sized segment (Eff.snd.MSS
                      bytes; see Section 4.2.2.6).
===
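
To make the Nagle argument concrete, here is a small sketch of the rule as
summarized in the RFC text above.  It is not how any real TCP stack is
written; the function name and the sample numbers (k taken as 1024, MSS
taken to be about the 16k loopback MTU) are assumptions for illustration:

```python
def nagle_allows_send(unacked_bytes, queued_bytes, mss):
    """Return True if Nagle, as summarized in RFC 1122, lets data go out now."""
    if unacked_bytes == 0:
        return True              # nothing outstanding (SND.NXT == SND.UNA)
    return queued_bytes >= mss   # otherwise only a full-sized segment may go

# Assumed values for this thread: no unacked data, 64k queued, 16k MSS.
print(nagle_allows_send(unacked_bytes=0, queued_bytes=64 * 1024, mss=16 * 1024))
# prints True
```

With nothing unacknowledged and several full segments queued, Nagle by
itself never holds the data back here.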

Right around the paste above in RFC 1122 they describe a "sender's Silly
Window Syndrome avoidance algorithm"... this looks like it could be the
reason.

The Max(SND.WND) was 16k... right at the beginning of the connection.
The D is big, 64k or more (as seen in netstat)... lots to send.
The U is small, 1.5k (in the trace provided), but depends on what
      RCVBUF was reduced to.

I don't know what Fs is in Linux, but it appears to be 1/3
  (based on observation).
I don't know what Timeout is, but it appears to be 0.2 seconds.

===
         4.2.3.4  When to Send Data

            A TCP MUST include a SWS avoidance algorithm in the sender.
[...]
            IMPLEMENTATION:
                 The sender's SWS avoidance algorithm is more difficult
                 than the receivers's, because the sender does not know
                 (directly) the receiver's total buffer space RCV.BUFF.
                 An approach which has been found to work well is for
                 the sender to calculate Max(SND.WND), the maximum send
                 window it has seen so far on the connection, and to use
                 this value as an estimate of RCV.BUFF.  Unfortunately,
                 this can only be an estimate; the receiver may at any
                 time reduce the size of RCV.BUFF.  To avoid a resulting
                 deadlock, it is necessary to have a timeout to force
                 transmission of data, overriding the SWS avoidance
                 algorithm.  In practice, this timeout should seldom
                 occur.
[...]
                 Send data [only if]:
[...]
                 (3)  or if at least a fraction Fs of the maximum window
                      can be sent, i.e., if:

                          [SND.NXT = SND.UNA and]

                                  min(D,U) >= Fs * Max(SND.WND);


                 (4)  or if data is PUSHed and the override timeout
                      occurs.

                 Here Fs is a fraction whose recommended value is 1/2.
                 The override timeout should be in the range 0.1 - 1.0
                 seconds.  It may be convenient to combine this timer
                 with the timer used to probe zero windows (Section
                 4.2.2.17).

                 Finally, note that the SWS avoidance algorithm just
                 specified is to be used instead of the sender-side
                 algorithm contained in [TCP:5].
===
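
Plugging the observed numbers into condition (3) gives a rough check.  This
is only a sketch: it models the inequality alone, leaves out the bracketed
SND.NXT = SND.UNA test and the conditions elided from the quote, and assumes
k means 1024 bytes.  The function name is made up:

```python
def sws_condition_3(queued_d, usable_u, max_snd_wnd, fs):
    """Condition (3) only: min(D, U) >= Fs * Max(SND.WND)."""
    return min(queued_d, usable_u) >= fs * max_snd_wnd

K = 1024  # assumption: "k" in this thread means 1024 bytes

# D = 64k queued, U = 1.5k usable window, Max(SND.WND) = 16k.
# Try both the RFC's recommended Fs (1/2) and the observed guess (1/3).
for fs in (1 / 2, 1 / 3):
    may_send = sws_condition_3(64 * K, 1536, 16 * K, fs)
    print(f"Fs={fs:.2f}: condition (3) satisfied -> {may_send}")
```

For either value of Fs the 1.5k usable window is well under the threshold,
so under this model the sender would sit until the override timeout fires.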

It appears to be a combination of the Really Big loopback MTU (16k),
the application lowering RCVBUF after accept instead of before,
and this sender-side SWS avoidance that is leading to the problem.
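
As a back-of-the-envelope estimate of how slow that would be, the sketch
below assumes (and this is only an assumption, not something measured) that
each override timeout releases about one usable window of data.  The 1.5k
and 0.2 second figures are the observed values from above; the function
name is made up:

```python
def stalled_throughput(usable_window_bytes, override_timeout_s):
    """Bytes per second if each override timeout releases one usable window."""
    return usable_window_bytes / override_timeout_s

rate = stalled_throughput(1536, 0.2)        # 1.5k window, 0.2 s timeout
seconds_per_mib = (1024 * 1024) / rate      # time to move one MiB at that rate
print(f"about {rate:.0f} bytes/s, about {seconds_per_mib:.0f} s per MiB")
```

That is a few kilobytes per second, which would certainly count as a very
slow transfer for loopback.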