Continuum® Corner — Noah Davids, Technical Paper  

Understanding STCP Send and Receive Windows

Noah Davids

Support Engineer

Stratus Technologies

 

TCP is a windowing protocol; that is, only so many bytes can be sent at one time. The window is TCP’s flow control mechanism (see the section “Managing the Window” in RFC-793). There are actually three windows: congestion, send, and receive. TCP window size becomes important when you are sending large volumes of data over a connection whose bandwidth delay product is large, that is, a fast network (100 Mbps or gigabit Ethernet) with moderate delay (0.5 ms) or a slow network (T1) with large delay (300 ms). This article describes how TCP windows work, how to calculate the optimal window size, and how to tune STCP to provide those windows.

 

Window #1 – Congestion Window

This window has two purposes. The first is to control the “Slow Start” algorithm (RFC-2581). Basically, when a connection is first established, only one maximum segment size (MSS) sized packet can be sent. Once that is acknowledged, TCP will send two packets. When those are acknowledged, TCP will allow four packets, and so on. Eventually, the congestion window grows to the point where one of the other windows becomes the limiting factor.
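If it helps to picture that growth, here is a small illustration in C (not STCP code; the 1460-byte MSS and 32,768-byte limit are just example values) that prints the congestion window doubling every round trip until one of the other windows takes over as the limit.

#include <stdio.h>

/* Illustration only: congestion window growth during slow start.
 * Assumed values: 1460-byte MSS, 32,768-byte send/receive window limit. */
int main(void)
{
    const unsigned mss = 1460;      /* maximum segment size (example value) */
    const unsigned limit = 32768;   /* limit imposed by the other windows (example value) */
    unsigned cwnd = mss;            /* start with one MSS-sized segment */
    int rtt = 0;

    while (cwnd < limit) {
        printf("RTT %d: congestion window = %u bytes (%u segments)\n",
               rtt, cwnd, cwnd / mss);
        cwnd *= 2;                  /* roughly doubles each round trip */
        rtt++;
    }
    printf("after RTT %d the other windows (%u bytes) become the limit\n",
           rtt, limit);
    return 0;
}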

 

The congestion window is also used to reduce the number of packets that can be sent in response to a lost packet. The theory is that a lost packet means network congestion so TCP slows down. This is also discussed in RFC-2581. For the rest of this discussion I will assume that the congestion window is not the limiting window.

 

Window #2 – Send Window

This is the amount of data that the local host can buffer while waiting for the other end of the connection (the remote peer) to acknowledge it. Once this window is full, the sending process will either block in the send function or receive an error return from the send function indicating that (at least some of) the data could not be sent. There are several possible reasons why the local host must buffer data instead of sending it.

 

First, the remote peer may have closed its receive window (more on this in the next section), indicating that it has no more room to accept data. Second, a data packet or acknowledgement packet may have been lost; the sending host must continue to hold all sent data in its buffer until it gets an acknowledgment for that data. Third, there may be a problem actually sending the data out of the Ethernet card. This could happen if the Ethernet segment is half duplex and extremely busy, or has some kind of problem. Finally, the sending process may simply be sending data to the TCP stack faster than acknowledgements can be returned from the remote peer, for example a process that sends data without waiting for any kind of application layer response. FTP doing a put is a good example.
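Here is a minimal sketch of what that last case looks like to an application. It assumes a connected, non-blocking TCP socket; once the send window (and its buffer) fills, send returns an error instead of queuing more data. On a blocking socket the same call would simply block.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* sock is assumed to be a connected, non-blocking TCP socket. */
void fill_send_window(int sock)
{
    char buf[512];
    memset(buf, 'x', sizeof(buf));

    for (;;) {
        ssize_t n = send(sock, buf, sizeof(buf), 0);
        if (n >= 0)
            continue;                      /* data queued in the send window */
        if (errno == EWOULDBLOCK || errno == EAGAIN) {
            /* Send window is full; the peer has not yet acknowledged
             * enough data to free space. */
            printf("send window full, try again later\n");
            break;
        }
        perror("send");                    /* some other failure */
        break;
    }
}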

 

Window #3 – Receive Window

This is the amount of data the remote peer advertises that it can receive at the current time. When the local host sends data, it adjusts its copy of the remote peer’s receive window by subtracting the number of bytes just sent from the advertised value. When the remote peer acknowledges the data, the local copy of the remote peer’s receive window is adjusted based on the “window” field in the acknowledgement packet. Typically this value goes back up, but if the process reading the data on the remote peer is too slow, the advertised window may slowly go down until it reaches zero, at which point the local host must stop sending data. When a host sends a packet advertising a zero window, it is said to have closed its window. You can see the advertised window of both hosts by looking at the window field in any packet trace. You can identify which host the window is for by looking at the source IP address of the packet. Figure 1 shows a packet_monitor trace.

 

10:46:05.542 Rcvd Ether Dst 00:00:a8:41:3b:6e  Src 00:04:c1:09:78:60 Type 0800

+(IP)

IP   Ver/HL 45, ToS  0, Len  228, ID  eac, Flg/Frg    0, TTL 38,  Prtl  6

          Cksum  870c, Src c0a838b5, Dst a4984d22

TCP from 192.168.56.181.49187 to 164.152.77.34.5500

    seq  1368981463, ack  222929851, window  8192, 512 data bytes, flags Push Ac

+k.

    X/Off 05, Flags 18, Cksum ca9f,  Urg-> 0000

 

10:46:05.720 Xmit Ether Dst 00:04:c1:09:78:60  Src 00:00:a8:41:3b:6e Type 0800

+(IP)

IP   Ver/HL 45, ToS  0, Len   28, ID b7bb, Flg/Frg    0, TTL 3c,  Prtl  6

          Cksum  dbfc, Src a4984d22, Dst c0a838b5

TCP from 164.152.77.34.5500 to 192.168.56.181.49187

    seq   222929851, ack 1368981975, window  8192, 0 data bytes, flags Ack.

    X/Off 05, Flags 10, Cksum caa7,  Urg-> 0000

 

10:46:10.235 Rcvd Ether Dst 00:00:a8:41:3b:6e  Src 00:04:c1:09:78:60 Type 0800

+(IP)

IP   Ver/HL 45, ToS  0, Len  228, ID  eb4, Flg/Frg    0, TTL 38,  Prtl  6

          Cksum  8704, Src c0a838b5, Dst a4984d22

TCP from 192.168.56.181.49187 to 164.152.77.34.5500

    seq  1368981975, ack  222929851, window  8192, 512 data bytes, flags Push Ac

+k.

    X/Off 05, Flags 18, Cksum c89f,  Urg-> 0000

 

Figure 1 – packet_monitor trace showing advertised receive windows

 

Why should you care?

Let’s say that you just installed a new Stratus® ftServer® V Series system with gigabit cards and you are connecting it to a Stratus Continuum® system with K470/U714 gigabit cards. The round trip time from the V Series system to the Continuum and back again is 0.345 ms. Looking at the connection, you see the Continuum server advertise a receive window of 8,192 bytes. It takes six packets (5 at 1460 bytes + 1 at 892 bytes) to send those 8,192 bytes. Adding in the 78 bytes of overhead[1] for each packet gives 8,660 (78 * 6 + 8,192) bytes. At gigabit speed this can be sent in 0.06928 ms, but it will take 0.345 ms for the first acknowledgement packet to make its way back to the V Series system. So, out of those 0.345 ms, nothing is sent for 0.275 ms. More time is spent waiting for an acknowledgement than sending data! What happens next depends on the data-sending pattern of the application; however, let’s assume that an FTP put is being done. The TCP stack on the Continuum system (or any other host, it doesn’t really matter) will send an acknowledgement for every other packet received. So when the V Series system gets an ACK, it will be able to send out two more packets, at which time another ACK will come in. This works for the next six packets, but then there will be another 0.275 ms gap. Figure 2 shows this.

 

Figure 2 – timeline of packets sent and acknowledgments received


In general, the throughput that can be sustained over a connection is limited to

 

sustained throughput = window size / round trip time

 

(In this case, 8192 / 0.345 = 23,744.93 bytes / ms, or about 189,959,420 bits / second.) Of course, you are not guaranteed this speed. Other things, such as disk IO or bus speed, may place a lower limit on throughput. But why place unnecessary limitations on the connection’s throughput?

 

For any given data rate, the bandwidth delay product gives the minimum window size needed to sustain that rate: 

 

            Window size = sustained throughput * round trip time

 

For our 0.345 ms RTT on a gigabit adapter, using one gigabit per second as the sustained throughput rate, the window size works out to 43,125 bytes. This is of course slightly larger than absolutely necessary, since it does not take into account the 78 bytes of packet overhead.

 

I have seen recommendations that the receive window size be anywhere from equal to the bandwidth delay product, to 10% larger, to two or three times the bandwidth delay product. The rationale for making the receive window larger than the bandwidth delay product is to allow continued transmission during random temporary increases in the round trip time. The danger in making the receive window too large is that a lot of data may need to be retransmitted if a packet is lost and the connection does not support selective acknowledgment (SACK) (see RFC-2018). What you should actually set it to will vary, depending on your circumstances; for example, local connections are much less likely to drop packets and require retransmissions. The send window should always be as large as the receive window. One thing that you should do is make the receive window size an even multiple of the maximum segment size. This eliminates the possibility of a final single segment, or a segment smaller than the MSS, being sent and then having to wait for a delayed ACK because the remote system was waiting for two full-sized segments before sending an immediate ACK.
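As a quick sanity check of the arithmetic, here is a small sketch in C that computes the bandwidth delay product and rounds it up to an even multiple of the MSS. It uses the gigabit rate, 0.345 ms round trip time, and 1460-byte MSS from the examples above; it prints 43,125 bytes for the bandwidth delay product and 43,800 bytes (30 segments) for the rounded-up window.

#include <stdio.h>

int main(void)
{
    const double rate = 1000000000.0 / 8.0;  /* gigabit Ethernet, in bytes per second */
    const double rtt  = 0.345e-3;            /* example round trip time, in seconds */
    const unsigned mss = 1460;               /* typical Ethernet MSS */

    double bdp = rate * rtt;                                   /* bandwidth delay product */
    unsigned window = (((unsigned)bdp + mss - 1) / mss) * mss; /* round up to an MSS multiple */

    printf("bandwidth delay product: %.0f bytes\n", bdp);
    printf("receive window rounded to an MSS multiple: %u bytes (%u segments)\n",
           window, window / mss);
    return 0;
}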

 

As I said above, the TCP stack will send an ACK for every two full sized packets that it receives. If the stack receives only one segment or segments that are not full sized, and it has no data of its own to send back, it will wait for the “delayed ACK” time before sending an acknowledgment. This is to reduce the number of packets on the network that contain nothing but acknowledgments.

 

Monitoring the TCP windows of STCP connections

Currently, STCP sets the send window for a connection to the minimum of 32,768 bytes and the remote peer’s advertised receive window. Suggestion stcp-1976 proposes raising that limit to 65,535. It’s important to note that the STCP send window is not pre-allocated; that is, space is not allocated when the connection is created. It is really just a limit on the number of bytes that will be dynamically allocated.

 

You can check the send window of an STCP connection by dumping the TCB for the connection and matching on the string “snd”. The variable sndws is the remote peer’s advertised receive window. The variable maxsndws is the maximum receive window ever advertised by the remote peer. The variable sndmax is the maximum number of bytes that can actually be sent, currently the minimum of the remote peer’s advertised receive window and 32,768. What is of interest in this discussion is the maxsndws value. If it is greater than 32,768, then you know that the limiting value is our send window of 32,768. If it is less, then the limiting window is the remote peer’s receive window.

 

In the following example, Figure 3a shows the “snd” variables at the start of the connection and Figure 3b shows them after I let the send window fill up. The window filled up because the process that accepted the connection never did a read (this is the first of the reasons for the send window filling up mentioned above). Notice that in Figure 3b, netstat shows 8708 bytes queued even though the send window was only 8192. This is not a bug; it is just the way that netstat calculates the buffered data (there can also be data in the stream between the application and TCP, some of which is included by netstat).

 

netstat -numeric -PCB_addr

Active connections

PCB       Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)

. . .

84cc0340  tcp        0      0  164.152.77.34:49255 164.152.77.185:5500 ESTABLISH

+ED

. . .

 

as:  match snd; dump_onetcb 84cc0340

. . .

     sndws                         8192

     maxsndws                      8192

     sndmax                        8192

. . .

Figure 3a – netstat and send windows at start of the connection

 

netstat -numeric -PCB_addr

Active connections

PCB       Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)

. . .

84cc0340  tcp        0   8708  164.152.77.34:49255 164.152.77.185:5500 ESTABLISH

+ED

. . .

 

as:  match snd; dump_onetcb 84cc0340

. . .

     sndws                         0

     maxsndws                      8192

     sndmax                        8192

. . .

Figure 3b – netstat and filled up send window

 

 

Figure 4 is the start of a connection where the remote peer advertised a 64K receive window:

 

as:  match snd; dump_onetcb 84ce1080

. . .

     sndws                         65535

     maxsndws                      65535

     sndmax                        32768

. . .

Figure 4 – snd values when the remote peer advertises a 64K receive window

 

In VOS releases 14.7 and 15.0, all connections start out with a receive window of 8K. The size of the receive window may be increased automatically by STCP if there is space available. Space in this case means space in a receive window pool. STCP maintains pools for 64K, 32K, and 16K receive windows. By default, the number of connections allowed in each of these pools is 0, 10, and 10, respectively. A connection’s receive window will be increased if it is currently less than 16K, there is space in one of the larger pools, and the connection has sent an explicit ACK. An explicit ACK is sent under several conditions. The two most easily controlled conditions are when a client opens a connection and when two MSS-sized segments are received without any data being sent back. This increase may happen at any point in the connection’s life.
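The promotion rules just described can be boiled down to a small decision function. Keep in mind that this is only a model of the rules as stated above, not STCP source code; the names and structure are mine.

#include <stdbool.h>

/* Illustrative model of the receive-window enlargement rules described above. */
struct window_pool {
    unsigned size;   /* 65536, 32768, or 16384 */
    unsigned max;    /* the maxno_* limit for this pool */
    unsigned used;   /* connections currently holding a window from this pool */
};

bool may_enlarge(unsigned current_rcvws, bool sent_explicit_ack,
                 const struct window_pool pools[], int npools)
{
    int i;

    if (current_rcvws >= 16384)     /* only windows currently below 16K qualify */
        return false;
    if (!sent_explicit_ack)         /* the connection must have sent an explicit ACK */
        return false;
    for (i = 0; i < npools; i++)
        if (pools[i].used < pools[i].max)  /* space in one of the larger pools */
            return true;
    return false;
}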

 

Figure 5 shows how to determine a connection’s current receive window size. Figure 5a shows the connection at its start and Figure 5b shows it after I let the receive window fill up. As in Figure 3, I did this by not reading any of the received data. The variable rcvws is the current receive window and maxrcvws is the maximum advertised receive window.

 

netstat -numeric -PCB_addr

Active connections

PCB       Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)

. . .

84ce92c0  tcp        0      0  164.152.77.34:5500 192.168.56.181:49187 ESTABLISH

+ED

. . .

 

as:  match rcv; dump_onetcb 84ce92c0

. . .

     rcvws                         8192

     maxrcvws                      8192

. . .

 

Figure 5a – netstat and receive window at start of connection

 

 

netstat -numeric -PCB_addr

Active connections

PCB       Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)

. . .

84ce92c0  tcp     7680      0  164.152.77.34:5500 192.168.56.181:49187 ESTABLISH

+ED

. . .

 

as:  match rcv; dump_onetcb 84ce92c0

. . .

     rcvws                         0

     maxrcvws                      8192

. . .

 

Figure 5b – netstat and filled up receive window

 

 

Tuning the TCP window sizes of STCP connections

The standard way to adjust the window sizes of a connection is for the application to call the setsockopt procedure with the SO_SNDBUF and SO_RCVBUF socket options. In STCP, these options were supported for UDP sockets but not for TCP sockets. Starting with release 16.2.0ag (stcp-1447 and stcp-2387), these options are supported for TCP sockets as well.
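Here is a minimal sketch of how an application might use these options, assuming a release with the stcp-1447 and stcp-2387 fixes. The 64K request is just an example value and the error handling is abbreviated.

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Request larger send and receive buffers for a TCP socket. */
int set_window_buffers(int sock)
{
    int bufsize = 65536;   /* example request; STCP rounds the window as described below */

    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   (char *)&bufsize, sizeof(bufsize)) < 0) {
        perror("setsockopt SO_RCVBUF");
        return -1;
    }
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   (char *)&bufsize, sizeof(bufsize)) < 0) {
        perror("setsockopt SO_SNDBUF");
        return -1;
    }
    return 0;
}

Setting the options before the connection is established is the usual practice, so the larger buffers are in place from the first byte.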

When setting SO_RCVBUF, STCP's behavior depends on the caller's access to the STCP device. If the caller has write access to the >system>acl>stcp_access file, STCP ignores the limits set by the max_64k_windows, max_32k_windows, and max_16k_windows STCP parameters. It sets the advertised receive window to 64K if the requested buffer is greater than 64K, or otherwise to the smallest of 16K, 32K, or 64K that is larger than the requested buffer. In this case it is possible for the current 64K, 32K, or 16K window counts displayed by the analyze_system request list_stcp_params to exceed the maximum limits. If the caller does not have write access to the >system>acl>stcp_access file, the receive window size will be increased only if the current count does not exceed the maximum limit. Note that the buffer space is allocated even if the window size is not increased.

STCP automatically moves received bytes into the buffers even if the application stops reading data. The effect of a larger buffer is that more bytes can be received without STCP reducing the advertised receive window. As long as the advertised receive window is greater than the bandwidth delay product, the size of the window is not as important as the size of the buffer space. If the advertised window is less than the bandwidth delay product, then a smaller advertised window will reduce throughput.

If an application cannot use SO_RCVBUF, what can it do? The only thing it can do is try to improve the chances that it will get into one of the larger receive window pools. First, the receive window of clients will always be enlarged, if possible. This is because an explicit ACK is always sent as the last packet in TCP’s 3-way handshake (SYN, SYN/ACK, ACK) when a connection is established. So clients have a slightly better chance than servers of getting into a larger receive window pool.

 

If the data flow is toward the server, then the application protocol should be written to force an explicit ACK. This means that the client should send at least 2 * MSS bytes of data, and the server should read this data but not send anything back to the client for at least 200 ms. One complication is that the MSS value can vary based on the end points of the connection; using an MSS value of 1460 will cover all possible connections.
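Here is a sketch of that idea from the client side: right after the connection is established, the client pads its first protocol message out to 2 * 1460 bytes so the server’s stack receives two full-sized segments and sends an explicit ACK. The message content and function name are hypothetical; only the 2 * MSS padding comes from the discussion above.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#define FORCE_ACK_BYTES (2 * 1460)   /* at least 2 * MSS, using the largest possible MSS */

/* sock is assumed to be a freshly connected TCP socket.  Sending at least
 * two full-sized segments (which the server reads but does not answer for
 * 200 ms) forces an explicit ACK, improving the chance that the server's
 * receive window will be moved into one of the larger pools. */
int send_opening_message(int sock)
{
    char msg[FORCE_ACK_BYTES];

    memset(msg, 0, sizeof(msg));     /* pad the protocol's first message */
    strcpy(msg, "HELLO");            /* hypothetical application header */

    if (send(sock, msg, sizeof(msg), 0) < 0) {
        perror("send");
        return -1;
    }
    return 0;
}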

 

None of this will do any good if the larger window pools are already filled. As a system administrator, you can increase the sizes of the 16K, 32K, and 64K window pools, which allows more connections to have their receive windows enlarged. The variables that control this are maxno_16k_windows, maxno_32k_windows, and maxno_64k_windows. Currently, they can be set in analyze_system with the set_longword request (note that set_longword takes its value in hex, so the 14 in Figure 6a is 20 decimal).

 

as:  set_longword maxno_32k_windows 14

addr      from      to

BF0D200C  0000000A  00000014

Figure 6a — increasing the 32K window pool from 10 to 20

 

In releases 14.7.0ag, 15.0.0ab, and later (containing the fix for av-1352), you can change these window pool sizes with the analyze_system request set_stcp_param. The keywords are max_64k_windows, max_32k_windows, and max_16k_windows.

 

as:  set_stcp_param max_32k_windows 20

 

Changing maximum 32k windows (max_32k_windows)

 from 10 to 20

as:

Figure 6b — increasing the 32K window pool from 10 to 20 after av-1352 fix

 

The list_stcp_params request will also display the current number of sessions in each pool.

 

as:  list_stcp_params

 

STCP Parameters:

 

. . .

maximum send window size [4096-65535]    (max_send_ws)        32768 bytes

maximum 64k windows [>=0]                (max_64k_windows)    0

current 64k windows                                           0

maximum 32k windows [>=0]                (max_32k_windows)    10

current 32k windows                                           10

maximum 16k windows [>=0]                (max_16k_windows)    10

current 16k windows                                           10

big windows [off/on]                     (big_windows)        off

default recv window size [1024-16383]    (dft_recv_ws)        8192 bytes

Figure 7 — Using list_stcp_params to show window related values after av-1352 fix

 

Note that these defaults of 0, 10, 10 were changed in 15.1.0ah and again in 15.2 (stcp-2107).

 

 

         VOS 14.7    VOS 15.1.0ah    VOS 15.2
64K          0             0              0
32K         10           100        dynamic
16K         10             0              0

Table 1 – default number of 16K, 32K, and 64K windows for various VOS releases

 

The value dynamic in VOS 15.2 is based on the amount of streams memory configured for the system. For example, a 4 GB system with the default streams allocation of 1/8 will have 798 32K windows available.

 

If you are looking around in analyze_system, you may (or may not, depending on release) find a maxno_8k_windows variable set to 2. Since all connections start out with an 8K receive window, this pool is effectively unlimited. The maxno_8k_windows variable is left over from previous releases when things worked differently. However, there are some calculations that rely on this value, so DON’T CHANGE IT.

 

Besides changing the values of the maxno_* variables, it might be necessary to change the value of big_windows. The big_windows variable is a flag that simply indicates that one of the large window pools has an opening. Once all the pools are filled, the flag is set to zero; it is set back to one only when a connection using a larger pool is closed. It’s an optimization that prevents unnecessary searching. If you increase the size of a pool after the big_windows flag has been set to zero, the system will not use the bigger pool. Therefore, after changing the pool size, check the value of big_windows and, if it is zero, change it to one.

 

as:  d big_windows

BF0DE1C4  0  00000000                             |....            |

as:  set_longword big_windows 1

addr      from      to

BF0DE1C4  00000000  00000001

Figure 8a — checking and changing the big_windows flag

 

Again, in releases 14.7.0ag, 15.0.0ab, and later (containing the fix for av-1352), you can change the value of big_windows with the analyze_system request set_stcp_param.

 

as:  set_stcp_param big_windows on

 

Changing big windows (big_windows)

 from off to on

as:

Figure 8b — checking and changing the big_windows flag to on after av-1352 fix

 

The av-1352 fix also allows you to change the default receive window size from the current 8K to anything up to 16K-1; the keyword is dft_recv_ws.

 

as:  set_stcp_param dft_recv_ws 16383

 

Changing default recv window size (dft_recv_ws)

 from 8192 bytes to 16383 bytes

as:

Figure 9 – changing the default receive window size to 16383 bytes

 

There is a danger in increasing the receive window. If for some reason the applications stop processing incoming data, the receive windows will fill up. If enough of them do so, streams memory can be exhausted, and once streams memory is exhausted, STCP will start to behave erratically. This is the reason for the limited number of connections in each pool. You need to weigh the benefits of faster throughput against the risk of destabilizing the system.

 

STCP will set the send window equal to the remote peer’s advertised receive window, subject to certain limits. In release 14.7, the default maximum size of the send window is 32,768 bytes. You can change this by setting the max_send_ws size using the set_stcp_param request in analyze_system (assuming you are running at least VOS 14.7.0ag).

 

as:  set_stcp_param max_send_ws 65535

 

Changing maximum send window size (max_send_ws)

 from 32768 bytes to 65535 bytes

Figure 10 – changing the maximum send window size to 65535 bytes

 

Starting in release 15.1 the default maximum is 65535 bytes.

 

Finally, while you cannot change the big window sizes (16K, 32K, and 64K) to be an even multiple of the 1460-byte MSS, you can change the MSS value to be a factor of the window sizes. Since all the window sizes are even multiples of 1024, changing the MSS to 1024 will make all the window sizes even multiples of the MSS. In the last eNewsletter, I talked about tuning the MSS value and suggested setting it to the maximum of 1460. That is optimal (within limits, see that article) as long as the window size is not a limiting factor. To change the MSS, you need to adjust the default_min_mtu value to 1044. It is 1044, and not 1024, because the value includes the 20-byte TCP header while the MSS does not.

 

as:  set_longword default_min_mtu 414

addr      from      to

BF06390C  0000022C  00000414

Figure 11 – changing the default min MTU to adjust the default MSS

 

One thing to note is that this change will only affect connections going to remote networks. Connections to hosts on the local subnet will continue to use an MSS value of 1460. To change the MSS for local connections, you need to change the MTU for the interface. This can be done by setting the MTU argument to 1064 when the interface is configured. In this case the MTU includes not only the 20-byte TCP header but also the 20-byte IP header.

 

ifconfig #sdlmux1 164.152.77.203 -netmask 255.255.254.0 -mtu 1064 -add

Figure 12 – command to configure an interface with an MTU that will yield a 1024 MSS
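To double-check the header arithmetic behind the two settings (the 20-byte TCP and IP headers are as described above), here is a trivial sketch:

#include <stdio.h>

int main(void)
{
    /* Values from the text: default_min_mtu covers MSS + TCP header,
     * while the interface MTU covers MSS + TCP header + IP header. */
    unsigned default_min_mtu = 1044;
    unsigned interface_mtu   = 1064;

    printf("MSS for remote networks: %u\n", default_min_mtu - 20);     /* 1024 */
    printf("MSS for the local subnet: %u\n", interface_mtu - 20 - 20); /* 1024 */
    return 0;
}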

 

Another thing you can do (if you have the fix for av-1352) is change the default receive window to an even multiple of 1460. Since you can change the default to something less than 16K, and you want an even multiple of the MSS, the default should be 14,600. This will gain you something only if you do not change the MTU value. You will have to experiment to determine whether this gains you more than reducing the MTU and increasing the number of connections in the big window pools.

 

Finally, you may be wondering what happened to the congestion window. I know of no TCP stack that gives you any control over this window. I included it just for completeness, in case you have heard of it and wondered how it fits into the rest of the discussion.

 

 

 

Revision History

     Originally published in Vol 5 (Oct 2004) of the Stratus eCustomer/ePartner Newsletter

     Updated 07-01-11 to reflect changes in the available number of receive windows of each size and send window size allocation for newer releases

     Updated 08-05-07 to reflect the stcp-1447 and stcp-2387 fixes

 

Specifications and descriptions are summary in nature and subject to change without notice

 

Stratus, ftServer, and Continuum are registered trademarks of Stratus Technologies Bermuda Ltd.

 

All other trademarks and registered trademarks are property of their respective holders.

 



[1] Where did those 78 extra bytes come from:

Ethernet interframe gap    12

Ethernet preamble           8

Ethernet header            14

IP header                  20

TCP header                 20

Ethernet trailer            4

 

The interframe gap is not a real transmission of bits, but it takes time just like a real transmission.