Re: TSO / LSO
- Subject: Re: TSO / LSO
- From: Terry Lambert <email@hidden>
- Date: Wed, 3 Jan 2007 18:44:33 -0800
On Jan 3, 2007, at 10:28 AM, Adi Masputra wrote:
On Jan 3, 2007, at 6:44 AM, Andrew Gallatin wrote:
There are 3 problems with the above:
1) The NIC does not advertise any sort of MTU. The stack just sends
down the biggest frame it can (ip_len is 16 bits, so the max is 64K-1
for IPv4..). The NIC is responsible for splitting this into as many
packets as required.
Adi covered most of this below, but the real concern I had here was
the NIC advertising the largest buffer that it was willing to take at
a time. To me, this is the effective MTU to the card; you said 64K,
but from my reading on TSO, you might wish to limit this due to on
card memory buffer constraints vs. the number of active streams you
expect to deal with simultaneously (i.e. host quench based on
knowledge of card transmit buffer size).
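To make that concrete, here is the sort of thing I mean, as a sketch only (the names and numbers are made up, not any real driver's API): the driver, which knows its on-card buffer budget, publishes the largest TSO burst it is willing to take at once, and the stack clamps its writes to that instead of assuming 64K-1.

#define NIC_TX_BUF_BYTES   (256u * 1024u)  /* assumed on-card TX memory     */
#define NIC_EXPECTED_FLOWS 8u              /* streams we budget for at once */
#define IP_MAXPACKET_LEN   65535u          /* ip_len is 16 bits             */

/* Largest single TSO burst the card will accept; hypothetical, but this
 * is the "effective MTU to the card" I'm talking about. */
static unsigned int
nic_tso_max(void)
{
    unsigned int max = NIC_TX_BUF_BYTES / NIC_EXPECTED_FLOWS;

    if (max > IP_MAXPACKET_LEN)
        max = IP_MAXPACKET_LEN;
    return max;                /* 32K here, well under the 64K-1 ceiling */
}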
2) The NIC is implicitly informed of the path MTU by the TCP stack
tagging the frame with the TCP MSS (m->m_pkthdr.tso_segsize in FreeBSD
terms). The packets a NIC generates are limited to the MSS + the size
of the template TCP/IP + link-layer headers.
I'm not disagreeing with these points, and I personally know the
benefits of TCP LSO, having been involved with it in the past. But I
think Terry brought up some valid points, especially those related to
driver capability and the possibility of an interface filter
interposed between IP and the driver.
IIRC, in the TCP LSO case the hardware would generate the IP+TCP
headers based on the template; e.g. the IP ID field and friends get
generated by an entity other than the network stack. This may become
an issue with some interface filters, especially those which splice
or re-inject packets, when the filter is not LSO-aware and expects
the TCP segment to be no larger than the SMSS.
Right; any filter that operates on a stack-based buffer is going to
find itself limited to the kernel stack size for its data, or it'll
overflow (and panic the system, when it hits the guard page). 64K is
large enough that you have to worry about this for existing filters.
Add TCP splicing and DSR (Direct Server Return/"one armed load
balancer") sequence number management (say you are doing L4 load
balancing for HTTP that talks to CGI for a front end machine that can
serve only static content locally, for example), and things get real
complicated real fast.
For these situations, I think you'd need to let the filter request
that the capability be disabled on a per-stream basis (i.e. you'd want
to serve out the static content this way, but the DSR or Splice data
would need to go out without it).
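As a sketch of what that per-stream veto might look like (the flag and function names below are invented for illustration, not an existing interface): the filter flips a bit on the connection, and the output path consults it before handing the driver anything larger than the MSS.

struct conn_sketch {
    unsigned int flags;
};
#define CONN_FILTER_NO_TSO  0x0001

/* Called by a splicing/DSR filter when it needs wire-sized segments
 * for this stream. */
static void
filter_disable_tso(struct conn_sketch *cp)
{
    cp->flags |= CONN_FILTER_NO_TSO;
}

/* Consulted in the TCP output path before building a large burst. */
static int
tso_allowed(const struct conn_sketch *cp, int hw_can_tso)
{
    return hw_can_tso && (cp->flags & CONN_FILTER_NO_TSO) == 0;
}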
If you add something like a Cavium (or other) card to do SSL
acceleration, then obviously you'd want to run everything through the
box with the card on it, but offload everything but the communications
processing itself to a pool of back end servers, rather than spend
cycles processing CGIs.
It gets complicated real fast.
Minimally, there would be a need for a global override for protecting
legacy filters from TSO (i.e. a way to turn it off, if filters were
present).
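Something as blunt as this would do for a first cut (sketch only; the legacy-filter count is the hypothetical part, though FreeBSD does have a global TSO on/off sysctl in roughly this spirit, if I remember right):

static int tso_enabled = 1;          /* global override, sysctl-style      */
static int legacy_filter_count = 0;  /* bumped when a legacy (non-TSO-
                                      * aware) filter attaches, dropped on
                                      * detach */

static int
tso_globally_allowed(void)
{
    return tso_enabled && legacy_filter_count == 0;
}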
So issues like this need to be taken into account too; things are
much easier for statically configured filters (as in STREAMS's
autopush scheme), but it gets complicated when filters can be
loaded/unloaded at will. (I'm not saying this can't be solved; in
fact, it is solvable, but not that straightforward.)
Right; you'd effectively need to diddle the stack to a larger
allocated memory pool for the purposes of running filters, if you ran
into the filter issue, or you'd have to decide that if there was a
filter, you were going to implicitly disable the large writes, or, as
above, make it a global option.
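The "larger allocated memory pool" part could be as simple as handing the filter an off-stack scratch buffer sized for the worst-case burst, rather than letting it build frames in a stack-local array (sketch only, names invented):

#include <stdlib.h>

#define TSO_MAX_BURST 65535          /* ip_len is 16 bits */

struct filter_scratch {
    unsigned char *buf;
    size_t len;
};

/* Heap/pool allocation sized for a full TSO burst, so the filter never
 * has to put 64K of data on the kernel stack and risk the guard page. */
static int
filter_scratch_init(struct filter_scratch *fs)
{
    fs->buf = malloc(TSO_MAX_BURST);   /* a kernel pool alloc in real life */
    fs->len = fs->buf != NULL ? TSO_MAX_BURST : 0;
    return fs->buf != NULL ? 0 : -1;
}

static void
filter_scratch_free(struct filter_scratch *fs)
{
    free(fs->buf);
    fs->buf = NULL;
    fs->len = 0;
}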
Adi
3) There is no need to buffer up an entire window on the NIC. All you
really need to buffer on the NIC is one frame's worth of data (so
that you can properly send the frame on the wire).
Well, you also need the header template in hand for each in progress
connection for which the data hasn't already been sent (minimally),
etc..
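Roughly, per connection, the card ends up doing something like the following (sketch only: this struct is not any card's real descriptor format, checksum recomputation is omitted, and the MSS would come down from the stack, e.g. m_pkthdr.tso_segsize in FreeBSD terms):

#include <arpa/inet.h>   /* htons()/ntohs()/htonl()/ntohl() */
#include <stdint.h>
#include <string.h>

/* The stack hands down one header template plus a large payload and an
 * MSS; the card replays the template for each MSS-sized chunk, bumping
 * the IP ID and TCP sequence number and patching the IP length. */
struct tso_template {
    uint8_t  hdr[64];        /* link + IP + TCP header bytes           */
    size_t   hdr_len;        /* total template length                  */
    size_t   link_hdr_len;   /* e.g. 14 for Ethernet                   */
    size_t   ip_id_off;      /* offset of the IP ID field in hdr       */
    size_t   ip_len_off;     /* offset of the IP total-length field    */
    size_t   seq_off;        /* offset of the TCP sequence number      */
    uint16_t mss;            /* per-segment payload limit              */
};

static void
tso_emit_segments(const struct tso_template *t, const uint8_t *payload,
    size_t len, void (*send_frame)(const uint8_t *, size_t))
{
    uint8_t frame[64 + 9000];    /* template + one MSS (jumbo at most) */
    uint16_t ip_id, u16;
    uint32_t seq, u32;
    size_t off = 0;

    memcpy(&u16, t->hdr + t->ip_id_off, sizeof(u16));
    ip_id = ntohs(u16);
    memcpy(&u32, t->hdr + t->seq_off, sizeof(u32));
    seq = ntohl(u32);

    while (off < len) {
        size_t chunk = len - off;

        if (chunk > t->mss)
            chunk = t->mss;
        if (chunk > sizeof(frame) - t->hdr_len)   /* keep the sketch in bounds */
            chunk = sizeof(frame) - t->hdr_len;

        memcpy(frame, t->hdr, t->hdr_len);
        memcpy(frame + t->hdr_len, payload + off, chunk);

        /* Patch the per-segment fields in the copied header. */
        u16 = htons((uint16_t)(t->hdr_len - t->link_hdr_len + chunk));
        memcpy(frame + t->ip_len_off, &u16, sizeof(u16));
        u16 = htons(ip_id);
        memcpy(frame + t->ip_id_off, &u16, sizeof(u16));
        u32 = htonl(seq);
        memcpy(frame + t->seq_off, &u32, sizeof(u32));

        send_frame(frame, t->hdr_len + chunk);   /* one wire-sized frame */

        ip_id++;
        seq += chunk;
        off += chunk;
    }
}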
This assumes that either the transmit buffer data is contiguous (a bad
assumption, unless we are packing packet headers at the end of a page,
and only mapping page aligned data following the headers), or that the
NIC supports scatter/gather DMA.
If you consider 100% bus-on time for a NIC on a 64-bit-wide PCI bus,
you are talking about monopolizing the bus with data just to keep the
card properly fed. Breaking this up with scatter/gather would make
things worse by increasing the transaction overhead, rather than
amortizing it over a large buffer.
Practically, I think the card has to have on-board buffers, not a
single frame buffer, for this to be a happy thing; in the absolute
worst case you have a 0.1% lossy link, and otherwise you are going
back and redoing the DMA to do a retransmit, and having to step
through and recreate the packets leading up to the one being
retransmitted (consider that additional fragmentation on intermediate
routes, and RED queueing or QoS-based drops in intermediate routers,
may move things onto byte boundaries rather than nice clean packet
boundaries - ACKs in the middle of what you thought of as a packet,
etc.).
I expect that for TSO, you'd just eat the retransmit overhead, but you
might want to have the stack know that it was happening, and cut down
on the replicated data that was getting sent (and the amount of bus
traffic, as a result).
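Back of the envelope on both points, just for the order of magnitude (the bus, loss, and MSS figures below are assumptions, not measurements):

#include <stdio.h>

int
main(void)
{
    double bus_bw  = (64.0 / 8.0) * 133e6;  /* 64-bit/133 MHz PCI-X, ~1.06 GB/s peak */
    double line_bw = 10e9 / 8.0;            /* 10 GbE line rate, 1.25 GB/s           */
    double mss     = 1448.0;                /* typical Ethernet TCP MSS              */
    double burst   = 65535.0;               /* worst-case TSO burst, re-DMA'd whole  */
    double loss    = 0.001;                 /* 0.1% segment loss                     */

    printf("payload alone: %.0f%% of bus peak\n", 100.0 * line_bw / bus_bw);
    printf("re-DMA tax at 0.1%% loss: %.1f%% extra bus traffic\n",
        100.0 * loss * (burst / mss));
    return 0;
}

Even at 0.1% loss, re-DMAing whole bursts instead of single segments is a several-percent tax on a bus that, under these assumptions, is already oversubscribed at 10 GbE rates.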
-
In my experience, there's usually a much larger distance between
initial implementation and what you can commercially deploy than you
might think.
-- Terry