Re: TSO / LSO
- Subject: Re: TSO / LSO
- From: Terry Lambert <email@hidden>
- Date: Wed, 3 Jan 2007 18:44:33 -0800
On Jan 3, 2007, at 10:28 AM, Adi Masputra wrote:
On Jan 3, 2007, at 6:44 AM, Andrew Gallatin wrote:
There are 3 problems with the above:
1) The NIC does not advertise any sort of MTU. The stack just sends
down the biggest frame it can (ip_len is 16 bits, so the max is 64K-1
for IPv4..). The NIC is responsible for splitting this into as many
packets as required.
Adi covered most of this below, but the real concern I had here was
the NIC advertising the largest buffer that it was willing to take at
a time. To me, this is the effective MTU to the card; you said 64K,
but from my reading on TSO, you might wish to limit this due to on
card memory buffer constraints vs. the number of active streams you
expect to deal with simultaneously (i.e. host quench based on
knowledge of card transmit buffer size).
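To make that concrete, here is the sort of thing I mean, as a sketch only (the names and numbers are made up, not any real driver's API): the driver, which knows its on-card buffer budget, publishes the largest TSO burst it is willing to take at once, and the stack clamps its writes to that instead of assuming 64K-1.

#define NIC_TX_BUF_BYTES   (256u * 1024u)  /* assumed on-card TX memory     */
#define NIC_EXPECTED_FLOWS 8u              /* streams we budget for at once */
#define IP_MAXPACKET_LEN   65535u          /* ip_len is 16 bits             */

/* Largest single TSO burst the card will accept; hypothetical, but this
 * is the "effective MTU to the card" I'm talking about. */
static unsigned int
nic_tso_max(void)
{
    unsigned int max = NIC_TX_BUF_BYTES / NIC_EXPECTED_FLOWS;

    if (max > IP_MAXPACKET_LEN)
        max = IP_MAXPACKET_LEN;
    return max;                /* 32K here, well under the 64K-1 ceiling */
}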
2) The NIC is implicitly informed of the path MTU by the TCP stack
tagging the frame with the TCP MSS (m->m_pkthdr.tso_segsize in FreeBSD
terms). The packets a NIC generates are limited to the MSS + the size
of the template TCP/IP + link-layer headers.
I'm not disagreeing with these points, and I personally know the
benefits of TCP LSO, having been involved with it in the past. But I
think Terry brought up some valid points, especially those related to
driver capability and the possibility of an interface filter
interposed between IP and the driver.
IIRC, in the TCP LSO case the hardware would generate the IP+TCP
headers based on the template; e.g. the IP ID field and friends get
generated by an entity other than the network stack. This may become
an issue with some interface filters, especially those which splice
or re-inject packets, when the filter is not LSO-aware and expects
the TCP segment to be no larger than the SMSS.
Right; any filter that operates on a stack-based buffer is going to
find itself limited to the kernel stack size for its data, or it'll
overflow (and panic the system, when it hits the guard page). 64K is
large enough that you have to worry about this for existing filters.
Add TCP splicing and DSR (Direct Server Return/"one armed load
balancer") sequence number management (say you are doing L4 load
balancing for HTTP that talks to CGI for a front end machine that can
serve only static content locally, for example), and things get real
complicated real fast.
For these situations, I think you'd need to let the filter request
that the capability be disabled on a per-stream basis (i.e. you'd want
to serve out the static content this way, but the DSR or Splice data
would need to go out without it).
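As a sketch of what that per-stream veto might look like (the flag and function names below are invented for illustration, not an existing interface): the filter flips a bit on the connection, and the output path consults it before handing the driver anything larger than the MSS.

struct conn_sketch {
    unsigned int flags;
};
#define CONN_FILTER_NO_TSO  0x0001

/* Called by a splicing/DSR filter when it needs wire-sized segments
 * for this stream. */
static void
filter_disable_tso(struct conn_sketch *cp)
{
    cp->flags |= CONN_FILTER_NO_TSO;
}

/* Consulted in the TCP output path before building a large burst. */
static int
tso_allowed(const struct conn_sketch *cp, int hw_can_tso)
{
    return hw_can_tso && (cp->flags & CONN_FILTER_NO_TSO) == 0;
}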
If you add something like a Cavium (or other) card to do SSL
acceleration, then obviously you'd want to run everything through the
box with the card on it, but offload everything but the communications
processing itself to a pool of back end servers, rather than spend
cycles processing CGIs.
It gets complicated real fast.
Minimally, there would be a need for a global override for protecting
legacy filters from TSO (i.e. a way to turn it off, if filters were
present).
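Something as blunt as this would do for a first cut (sketch only; the legacy-filter count is the hypothetical part, though FreeBSD does have a global TSO on/off sysctl in roughly this spirit, if I remember right):

static int tso_enabled = 1;          /* global override, sysctl-style      */
static int legacy_filter_count = 0;  /* bumped when a legacy (non-TSO-
                                      * aware) filter attaches, dropped on
                                      * detach */

static int
tso_globally_allowed(void)
{
    return tso_enabled && legacy_filter_count == 0;
}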
So issues like this need to be taken into account too; things are
much easier for statically configured filters (as in STREAMS's
autopush scheme), but it gets complicated when filters can be
loaded/unloaded at will. (I'm not saying this can't be solved; in
fact, it is solvable, but not that straightforward.)
Right; you'd effectively need to diddle the stack to a larger
allocated memory pool for the purposes of running filters, if you ran
into the filter issue, or you'd have to decide that if there was a
filter, you were going to implicitly disable the large writes, or, as
above, make it a global option.
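The "larger allocated memory pool" part could be as simple as handing the filter an off-stack scratch buffer sized for the worst-case burst, rather than letting it build frames in a stack-local array (sketch only, names invented):

#include <stdlib.h>

#define TSO_MAX_BURST 65535          /* ip_len is 16 bits */

struct filter_scratch {
    unsigned char *buf;
    size_t len;
};

/* Heap/pool allocation sized for a full TSO burst, so the filter never
 * has to put 64K of data on the kernel stack and risk the guard page. */
static int
filter_scratch_init(struct filter_scratch *fs)
{
    fs->buf = malloc(TSO_MAX_BURST);   /* a kernel pool alloc in real life */
    fs->len = fs->buf != NULL ? TSO_MAX_BURST : 0;
    return fs->buf != NULL ? 0 : -1;
}

static void
filter_scratch_free(struct filter_scratch *fs)
{
    free(fs->buf);
    fs->buf = NULL;
    fs->len = 0;
}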
Adi
3) There is no need to buffer up an entire window on the NIC. All you
really need to buffer on the NIC is one frame's worth of data (so
that you can properly send the frame on the wire).
Well, you also need the header template in hand for each in progress
connection for which the data hasn't already been sent (minimally),
etc..
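Roughly, per connection, the card ends up doing something like the following (sketch only: this struct is not any card's real descriptor format, checksum recomputation is omitted, and the MSS would come down from the stack, e.g. m_pkthdr.tso_segsize in FreeBSD terms):

#include <arpa/inet.h>   /* htons()/ntohs()/htonl()/ntohl() */
#include <stdint.h>
#include <string.h>

/* The stack hands down one header template plus a large payload and an
 * MSS; the card replays the template for each MSS-sized chunk, bumping
 * the IP ID and TCP sequence number and patching the IP length. */
struct tso_template {
    uint8_t  hdr[64];        /* link + IP + TCP header bytes           */
    size_t   hdr_len;        /* total template length                  */
    size_t   link_hdr_len;   /* e.g. 14 for Ethernet                   */
    size_t   ip_id_off;      /* offset of the IP ID field in hdr       */
    size_t   ip_len_off;     /* offset of the IP total-length field    */
    size_t   seq_off;        /* offset of the TCP sequence number      */
    uint16_t mss;            /* per-segment payload limit              */
};

static void
tso_emit_segments(const struct tso_template *t, const uint8_t *payload,
    size_t len, void (*send_frame)(const uint8_t *, size_t))
{
    uint8_t frame[64 + 9000];    /* template + one MSS (jumbo at most) */
    uint16_t ip_id, u16;
    uint32_t seq, u32;
    size_t off = 0;

    memcpy(&u16, t->hdr + t->ip_id_off, sizeof(u16));
    ip_id = ntohs(u16);
    memcpy(&u32, t->hdr + t->seq_off, sizeof(u32));
    seq = ntohl(u32);

    while (off < len) {
        size_t chunk = len - off;

        if (chunk > t->mss)
            chunk = t->mss;
        if (chunk > sizeof(frame) - t->hdr_len)   /* keep the sketch in bounds */
            chunk = sizeof(frame) - t->hdr_len;

        memcpy(frame, t->hdr, t->hdr_len);
        memcpy(frame + t->hdr_len, payload + off, chunk);

        /* Patch the per-segment fields in the copied header. */
        u16 = htons((uint16_t)(t->hdr_len - t->link_hdr_len + chunk));
        memcpy(frame + t->ip_len_off, &u16, sizeof(u16));
        u16 = htons(ip_id);
        memcpy(frame + t->ip_id_off, &u16, sizeof(u16));
        u32 = htonl(seq);
        memcpy(frame + t->seq_off, &u32, sizeof(u32));

        send_frame(frame, t->hdr_len + chunk);   /* one wire-sized frame */

        ip_id++;
        seq += chunk;
        off += chunk;
    }
}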
This assumes that either the transmit buffer data is contiguous (a bad
assumption, unless we are packing packet headers at the end of a page,
and only mapping page aligned data following the headers), or that the
NIC supports scatter/gather DMA.
If you consider 100% bus-on time for a NIC on a 64-bit-wide PCI bus,
you are talking about monopolizing the bus with data just to keep the
card properly fed. Breaking this up with scatter/gather would make
things worse by increasing the transaction overhead, rather than
amortizing it over a large buffer.
Practically, I think the card has to have on-board buffers, not a
single frame buffer, for this to be a happy thing; in the absolute
worst case you have a 0.1% lossy link, and otherwise you are going
back and redoing the DMA to do a retransmit, and having to step
through and recreate the packets leading up to the one being
retransmitted (consider that additional fragmentation on intermediate
routes, and RED queueing or QoS-based drops in intermediate routers,
may move things onto byte boundaries rather than nice clean packet
boundaries - ACKs in the middle of what you thought of as a packet,
etc.).
I expect that for TSO, you'd just eat the retransmit overhead, but you
might want to have the stack know that it was happening, and cut down
on the replicated data that was getting sent (and the amount of bus
traffic, as a result).
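Back of the envelope on both points, just for the order of magnitude (the bus, loss, and MSS figures below are assumptions, not measurements):

#include <stdio.h>

int
main(void)
{
    double bus_bw  = (64.0 / 8.0) * 133e6;  /* 64-bit/133 MHz PCI-X, ~1.06 GB/s peak */
    double line_bw = 10e9 / 8.0;            /* 10 GbE line rate, 1.25 GB/s           */
    double mss     = 1448.0;                /* typical Ethernet TCP MSS              */
    double burst   = 65535.0;               /* worst-case TSO burst, re-DMA'd whole  */
    double loss    = 0.001;                 /* 0.1% segment loss                     */

    printf("payload alone: %.0f%% of bus peak\n", 100.0 * line_bw / bus_bw);
    printf("re-DMA tax at 0.1%% loss: %.1f%% extra bus traffic\n",
        100.0 * loss * (burst / mss));
    return 0;
}

Even at 0.1% loss, re-DMAing whole bursts instead of single segments is a several-percent tax on a bus that, under these assumptions, is already oversubscribed at 10 GbE rates.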
-
In my experience, there's usually a much larger distance between
initial implementation and what you can commercially deploy than you
might think.
-- Terry