Re: mbuf_outbound_finalize bug?

Subject: Re: mbuf_outbound_finalize bug?
From: "Bhavesh Davda" <email@hidden>
Date: Tue, 2 Jan 2007 18:08:40 -0800

Here's what I was doing in the sample enetlognke NKE from Apple, to
figure out what checksum offloading features are supported by the NICs
on my first generation MacBook Pro with a Marvell Yukon MAC, and on a
Mac Mini with an Intel E1000 MAC:

ifnet_offload_t offloadFeatures = ifnet_offload(enetifnet);
printf("Interface supports features 0x%x\n", offloadFeatures);

On my MBP, it reported 0x10006. On the Mac Mini it reported 0x10007.

It was common for older NICs to only support TCP/UDP offload, and not
IP header checksum offload, because IP header checksum offloads don't
take much CPU, and probably for some of the hardware
errata/limitations that Adi mentions below. So I'm not surprised that
the Yukon NIC doesn't support IP header checksum offload.

That was why I had asked what the mbuf csum_data field would be when
both TCP/UDP and IP header checksum offloading was requested on an
mbuf. I subsequently figured out that the NIC driver could easily
compute and communicate the offset of the IP header checksum to the
NIC (using whatever Tx descriptor entry fields it uses for such
purposes), without the TCP/IP stack having to pass that information
down to the NIC driver.

But I am disappointed that I still don't have a good way to make use
of the existing KPIs to compute the checksums in software, and that I
am having to resort to walking the mbuf chain and compute my own
checksums in my KEXT.

Thanks

- Bhavesh

On 1/2/07, Adi Masputra <email@hidden> wrote:


On Jan 2, 2007, at 2:28 PM, Terry Lambert wrote:

> I believe that if TCP or UDP checksum offload is supported, then IP
> checksum offload _MUST_ be supported by the hardware, since it's
> impossible to calculate the IP checksum without the payload
> contents being known.
>
> In other words, it doesn't make sense to hardware offload something
> that can change the IP payload without offloading the IP checksum
> as well.

I suppose if the transport payload is changed then that wouldn't have
any
effect on the IP header checksum; the latter covers only the header
portion
of IP.  But I agree that modern NICs should be smart enough to parse the
IP header (including options) and most, if not all, already do.  But
if I
recall there were issues with some options, in particular the IPv4
source
routing option related to calculating the pseudo-header, i.e. the actual
destination is different from the destination in the header which is
that
of the next-hop.  Not all NICs actually go deep into the IP options and
do the parsing, most probably for performance reasons.

On older NICs, the issue with IP options was more apparent.  Some of the
NICs in the Sun boxes (ge, ce, eri) limit the "start offset" width to
6-bits.
This essentially sets the largest value of the offset of 64-bytes
starting
from the MAC header.  When IP options are present, the cumulative MAC
and
IP headers exceed such a limit; this causes a wrap-around and the
checksum
value (calculated during DMA read) got "stuffed" at the wrong place.
Some
OSes like Solaris decides to do software checksum calculation upon the
presence of any IP options because of issues like this.

Adi

>
> An easy way to check this would be to intentionally change packet
> contents over a known stream with a filter, and verify that the IP
> checksum was properly recalculated.  This would be pretty trivial
> using something like Ethereal to monitor wire traffic, since it can
> flag bad checksums in packets for you (even if the receiver ignored
> them, and you didn't see a communications interruption over the bad
> checksums).
>
> -- Terry
>
> On Jan 2, 2007, at 10:45 AM, Bhavesh Davda wrote:
>> Thanks Josh.
>>
>> So on my MacBook Pro with the Marvell Yukon NIC, I noticed that its
>> ifnet_offload capabilities report IFNET_CSUM_TCP, IFNET_CSUM_UDP, and
>> IFNET_VLAN_TAGGING, but not IFNET_CSUM_IP.
>>
>> And correspondingly, when examining the mbuf in my iff_output_func, I
>> never notice MBUF_CSUM_REQ_IP set.
>>
>> So, your recommendation boils down to the following on my MBP:
>>
>>> - Put ip_len field in host byte order.
>>> - Call mbuf_outbound_finalize
>>> - Put ip_len back in network byte order.
>>
>> Which is what Adi had recommended I do in the code snippet I
>> linked to
>> [ http://lists.apple.com/archives/darwin-kernel/2006/Dec/
>> msg00072.html ]
>>
>> Yet, I was seeing those kernel panics at boot time, in the 3rd step
>> (Put ip_len back), corresponding to those mDNS UDP offloaded packets.
>>
>> I don't know why mbuf_outbound_finalize *only* on these packets was
>> leaving the mbuf trashed, but I have a feeling it has to do with the
>> mbuf pull-up code in in_delayed_cksum_offset, and the fact that these
>> mDNS mbufs are chained mbufs, with the pkthdr mbuf containing the L2
>> header, and the rest of the IP+UDP+payload packet in the next mbuf in
>> the chain.
>>
>> So I suspect mbuf_outbound_finalize is probably broken for all
>> chained
>> mbufs where the pkthdr mbuf doesn't contain the IP header, but I
>> can't
>> point you to the line of code in the xnu/bsd source that is the
>> culprit, sorry!
>>
>> Thanks
>> - Bhavesh
>>
>>
>> On 1/2/07, Josh Graessley <email@hidden> wrote:
>>>
>>> Sorry, I saw that stuff earlier. I wanted to give a more thorough
>>> explanation of what those flags mean and what the csum_data value
>>> means.
>>>
>>> Here's what I would try. I think this will work, but I may have
>>> glossed over something.
>>>
>>> If I remember correctly, the mbuf_outbound_finalize code is entirely
>>> broken because it calls a function to calculate the ip header where
>>> it is expecting the ip header to be in host byte order and it
>>> calls a
>>> function to calculate the TCP/UDP header where it expected the
>>> ip_len
>>> field to be in network byte order. The TCP/UDP is the trickier part
>>> to generate.
>>>
>>> - Check if MBUF_CSUM_REQ_IP is set
>>> - Put ip_len field in host byte order.
>>> - Call mbuf_outbound_finalize
>>> - Put ip_len back in network byte order.
>>> - If MBUF_CSUM_REQ_IP was set, re-calculate the ip header checksum.
>>>
>>> I think that works around the problem. In addition, it will be safe
>>> when the bug is fixed. The fix will handle ip_len in either byte
>>> order and do the right thing. With this solution, after the fix, you
>>> will simply be calculating the ip header checksum twice.
>>>
>>> -josh
>>>
>>> On Jan 2, 2007, at 9:36 AM, Bhavesh Davda wrote:
>>>
>>> > Thanks for the explanation, Josh!
>>> >
>>> > I had figured as much reading the xnu-792.13.8 code.
>>> >
>>> > However, if you look at the subject of this e-mail thread, and
>>> go back
>>> > to my original post about this, I had reported a bug in
>>> > mbuf_outbound_finalize and friends, that Adi acknowledged as a
>>> known
>>> > bug that will be fixed in a subsequent release of the kernel.
>>> >
>>> > Instead of reposting those e-mails, I'll just link to the
>>> appropriate
>>> > posts here:
>>> >
>>> > My original post:
>>> > http://lists.apple.com/archives/darwin-kernel/2006/Dec/
>>> msg00063.html
>>> >
>>> > Adi's response with a code snippet for a workaround:
>>> > http://lists.apple.com/archives/darwin-kernel/2006/Dec/
>>> msg00072.html
>>> >
>>> > Doing what Adi suggested worked for *most* cases, except that I
>>> > started seeing kernel panics when my kext loads at boot time,
>>> when Mac
>>> > OS (probably Bonjour) sends an mDNS packet, which is UDP checksum
>>> > offloaded, and my kext calls mbuf_outbound_finalize(). The panic
>>> > happens in the call to mbuf_copyback when it tries to restore the
>>> > tot_len field in the IP header in the mbuf in network-byte-order.
>>> >
>>> > When I looked at the mbuf after gdb attaching to the panicked
>>> kernel,
>>> > all the pointers looked trashed. Instead of further debugging
>>> > mbuf_outbound_finalize/in_delayed_cksum_offset/in_cksum_skip, I
>>> > decided that I had spent way too much time on this whole hw
>>> checksum
>>> > offloading mess, and decided that I would walk the chain of mbufs
>>> > myself, computing my own checksums.
>>> >
>>> > Yes, I do so only when my kext has determined that an outbound
>>> mbuf is
>>> > *interesting*, and I clone the original mbuf for forwarding to my
>>> > applications. So it isn't wasted effort.
>>> >
>>> > However, I would love for you to tell me that I don't need to
>>> do so :)
>>> > My poor-mans attempt at checksumming a block of data is
>>> probably not
>>> > going to perform as well as the tuned in_cksum_skip and assembly
>>> > helpers :)
>>> >
>>> > Thanks
>>> >
>>> > - Bhavesh
>>> >
>>> >
>>> >
>>> > On 1/1/07, Josh Graessley <email@hidden> wrote:
>>> >>
>>> >> Oh no, the crummy hardware checksum stuff has stuck to me! Help!
>>> >> Someone save me!
>>> >>
>>> >> Terry has most of it.
>>> >>
>>> >> The value of csum_data is a bit more complicated.
>>> >>
>>> >> Hardware checksums occur in two directions, inbound and
>>> outbound. The
>>> >> csum_data field has very different meanings depending on the
>>> >> direction.
>>> >>
>>> >> Inbound:
>>> >> The csum_flags will have bits set indicating which checksums were
>>> >> performed in hardware. In the event that the hardware looked
>>> at the
>>> >> checksum for the ip header in hardware, the MBUF_CSUM_DID_IP flag
>>> >> will be set. This flag indicates that the MBUF_CSUM_IP_GOOD
>>> flag is
>>> >> valid. When we get to ip_input, if MBUF_CSUM_DID_IP is set we
>>> will
>>> >> skip doing the checksum ourself. Instead, we will check
>>> >> MBUF_CSUM_IP_GOOD. If MBUF_CSUM_IP_GOOD is not set, we discard
>>> the
>>> >> packet as having a bad ip header checksum.
>>> >>
>>> >> In addition, the MBUF_CSUM_DID_DATA flag may be set. This flag
>>> >> indicates that the TCP or UDP checksum was calculated in
>>> hardware.
>>> >> The calculated value will appear in csum_data. When the packet
>>> gets
>>> >> in to tcp_input or udp_input, we will check this flag. If it
>>> is set,
>>> >> we will use the value in csum_data instead of calling in_cksum.
>>> >>
>>> >> Outbound:
>>> >> The csum_flags field will have bits set indicating which
>>> checksums
>>> >> should be performed by the hardware. MBUF_CSUM_REQ_IP
>>> indicates that
>>> >> the ip header checksum has not been calculated yet and should be
>>> >> calculated before the packet is sent. MBUF_CSUM_REQ_TCP and
>>> >> MBUF_CSUM_REQ_UDP indicate that the hardware should calculate
>>> the TCP
>>> >> or UDP checksum before the packet is sent. When either of
>>> these flags
>>> >> are set, csum_data will contain a pre-computed psuedo header
>>> checksum
>>> >> and the length to the offset of the location where the TCP or UDP
>>> >> checksum should be placed. There is also some goo in there for
>>> some
>>> >> CSUM_TCP_SUM16 thing. For that case, the csum_data value will
>>> mean
>>> >> something completely different. I can't recall the details but
>>> there
>>> >> is a lot of extra work done for a very special chip that had some
>>> >> very special requirements. When I say special, I don't mean
>>> that in a
>>> >> good way. 'nuff said.
>>> >>
>>> >> What's the best way to deal with hardware checksums in a kext?
>>> Get
>>> >> rid of them as fast as you possibly can. How do you get rid of
>>> them?
>>> >> You calculate them in software the way the maker intended. The
>>> kpi
>>> >> provides two primary functions for doing this:
>>> >>
>>> >> mbuf_outbound_finalize
>>> >> and
>>> >> mbuf_inbound_modified
>>> >>
>>> >> Whenever you receive a packet on the outbound path (stack to
>>> >> interface), call mbuf_outbound_finalize. Whenever you receive a
>>> >> packet on the inbound path, call mbuf_inbound_modified. From
>>> there,
>>> >> you can go about your business as though the entire nightmare of
>>> >> hardware checksums does not exist. For a little extra credit,
>>> only
>>> >> call mbuf_inbound_modified and mbuf_outbound_finalize on
>>> packets that
>>> >> you intend to modify, forward, encapsulate or in some other
>>> way do
>>> >> anything beyond inspecting.
>>> >>
>>> >> Do not try to modify the value of csum_data. The stack is
>>> responsible
>>> >> for setting that and the drivers are responsible for consuming
>>> it.
>>> >>
>>> >> Why am I so opposed to hardware checksums? Two reasons. First,
>>> >> checksums detect errors. If you introduce an error somewhere
>>> in the
>>> >> software just below the stack (say the driver) or even in the
>>> >> hardware itself (dma engine), the checksum will catch the error
>>> >> quickly, making it easier to narrow down the root cause. Second,
>>> >> layer violations like this lead to problems for network kernel
>>> >> extensions and results in little bits of code to handle this
>>> peppered
>>> >> throughout the stack. Offloading more of the TCP stack on to
>>> hardware
>>> >> leads to similar problems. Mmmm canned worms.
>>> >>
>>> >> Best of luck,
>>> >> -josh
>>> >>
>>> >> On Jan 1, 2007, at 5:28 PM, Terry Lambert wrote:
>>> >>
>>> >> > Josh G. is the expert here, but I'm pretty sure that you
>>> would end
>>> >> > up shooting your foot off if you did this.
>>> >> >
>>> >> > When you set these flags, they are generally done in the
>>> protocol
>>> >> > specific output functions, and since TCP and UDP are both
>>> layered
>>> >> > on top of IP, you are requesting that the checksum be
>>> calculated
>>> >> > both as if the packet were a TCP and a UDP packet, and then
>>> asking
>>> >> > for the IP encapsulation checksum to also be calculated.
>>> >> >
>>> >> > If you had hardware checksum assist (common on most hardware
>>> these
>>> >> > days), then you may end up with both the TCP and UDP checksums
>>> >> > calculated, and one of them stomping the other.
>>> >> >
>>> >> >
>>> >> > FWIW, csum_data is a value, not a pointer, for the 16 bit
>>> checksum
>>> >> > that's trying to be calculated, and is used as an
>>> accumulator for
>>> >> > the checksum calculation.  Practically, this means that for
>>> your
>>> >> > applications, you'd likely want to leave it alone, unless
>>> you were
>>> >> > setting flags requesting checksums.   If you were doing
>>> that, then
>>> >> > you'd need to know if it was on input or output, and since
>>> you are
>>> >> > doing a full checksum, then you'd want to set it to 0xffff
>>> or 0,
>>> >> > respectively (see the source code for the protocol specific
>>> input/
>>> >> > output mechansims for details.
>>> >> >
>>> >> >
>>> >> > Again, if you are doing an incremental update (i.e. you already
>>> >> > have a valid packet in hand, and are trying to replace a few
>>> data
>>> >> > fields in it without hanging packet size), then the only
>>> things you
>>> >> > will be touching are the data, the layer 4 checksump (TCP or
>>> UDP)
>>> >> > and the layer 3 checksum (IP).  You won't be setting these
>>> flags to
>>> >> > request anything, clearing them to avoid requesting
>>> anything, and
>>> >> > you won't be using the accumulator yourself.
>>> >> >
>>> >> >
>>> >> > I think at this point, it would be educational to download
>>> the xnu
>>> >> > sources and do a "grep -r" for the field names and flags to
>>> better
>>> >> > understand how the code works.
>>> >> >
>>> >> > -- Terry
>>> >> >
>>> >> > On Dec 29, 2006, at 9:31 AM, Bhavesh Davda wrote:
>>> >> >> Hi Terry,
>>> >> >>
>>> >> >> That's exactly what I ended up implementing. I haven't read
>>> RFC
>>> >> 1624,
>>> >> >> but I will.
>>> >> >>
>>> >> >> Related question: if ever csum_flags has both CSUM_IP and
>>> >> CSUM_TCP |
>>> >> >> CSUM_UDP set, then what is csum_data going to point to?
>>> >> >>
>>> >> >> Thanks!
>>> >> >>
>>> >> >> - Bhavesh
>>> >> >>
>>> >> >> On 12/28/06, Terry Lambert <email@hidden> wrote:
>>> >> >>> Are you changing the length of the packet, or only the
>>> contents?
>>> >> >>>
>>> >> >>> If you are only changing the contents, then I suggest an
>>> mbuf-
>>> >> chain
>>> >> >>> aware incremental checksum update as an alternative to
>>> copying it
>>> >> >>> to a
>>> >> >>> contiguous buffer, operating on it, converting it to
>>> mbufs, then
>>> >> >>> doing
>>> >> >>> a full software checksum of the whole packet.
>>> >> >>>
>>> >> >>> See RFC 1624.
>>> >> >>>
>>> >> >>> -- Terry
>>> >> >>>
>>> >> >>> On Dec 22, 2006, at 10:12 AM, Adi Masputra wrote:
>>> >> >>> > On Dec 22, 2006, at 9:58 AM, Bhavesh Davda wrote:
>>> >> >>> >>
>>> >> >>> >> I compare the result of mbuf_pkthdr_len agains the
>>> network-
>>> >> >>> byte-order
>>> >> >>> >> value of IP length on purpose, to know whether indeed the
>>> >> >>> value of IP
>>> >> >>> >> length in the packet is *not* in host-byte-order, as I
>>> thought
>>> >> >>> you
>>> >> >>> >> said mbuf_outbound_finalize() expects.
>>> >> >>> >
>>> >> >>> > Since your module is an interface filter, the field(s)
>>> in the
>>> >> >>> > header(s)
>>> >> >>> > will be in network byte-order.  All you need to do is to
>>> >> byte-swap
>>> >> >>> > the length at all times assuming it is an IPv4 packet.
>>> >> >>> >
>>> >> >>> >>
>>> >> >>> >> Oh boy. This is going to get quite expensive to do on a
>>> per
>>> >> >>> packet
>>> >> >>> >> basis. I thought I could simply get a pointer into the
>>> packet
>>> >> >>> payload
>>> >> >>> >> to the IP length field in the IP header, and do a
>>> memory read/
>>> >> >>> write,
>>> >> >>> >> rather than going through so many function calls and
>>> memcpy's,
>>> >> >>> albeit
>>> >> >>> >> small (2 byte) ones.
>>> >> >>> >
>>> >> >>> > Well, you are already requesting the stack to perform
>>> software
>>> >> >>> > checksum
>>> >> >>> > calculation on the packet which includes the transport
>>> >> payload; in
>>> >> >>> > terms
>>> >> >>> > of per-byte cost that is much more significant than
>>> >> traversing the
>>> >> >>> > mbuf
>>> >> >>> > chain to safely obtain the 2-octets IP length in the
>>> >> header.  :)
>>> >> >>> >
>>> >> >>> >>
>>> >> >>> >> Is there any other way to do it more optimally? I
>>> >> understand your
>>> >> >>> >> concern about non-linear mbufs, and not assuming that
>>> the IP
>>> >> >>> header
>>> >> >>> >> is
>>> >> >>> >> in the pkthdr mbuf. But realistically, is that ever
>>> going to
>>> >> >>> be the
>>> >> >>> >> case?
>>> >> >>> >>
>>> >> >>> >
>>> >> >>> > There's no guarantee that the headers will be in contiguous
>>> >> span
>>> >> >>> > although for the majority of cases they would.  Sorry,
>>> but as I
>>> >> >>> said
>>> >> >>> > the fix is already in order.
>>> >> >>> >
>>> >> >>> > Adi
>>> >> >>> >
>>> >> >>> >> Thanks!
>>> >> >>> >>
>>> >> >>> >> --
>>> >> >>> >> Bhavesh P. Davda
>>> >> >>> >
>>> >> >>> >  _______________________________________________
>>> >> >>> > Do not post admin requests to the list. They will be
>>> ignored.
>>> >> >>> > Darwin-kernel mailing list      (Darwin-
>>> email@hidden)
>>> >> >>> > Help/Unsubscribe/Update your Subscription:
>>> tlambert%
>>> >> >>> 40apple.com
>>> >> >>> >
>>> >> >>> > This email sent to email@hidden
>>> >> >>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Bhavesh P. Davda
>>> >> >
>>> >> > _______________________________________________
>>> >> > Do not post admin requests to the list. They will be ignored.
>>> >> > Darwin-kernel mailing list      (email@hidden)
>>> >> > Help/Unsubscribe/Update your Subscription:
>>> jgraessley%
>>> >> > 40apple.com
>>> >> >
>>> >> > This email sent to email@hidden
>>> >>
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>> > --
>>> > Bhavesh P. Davda
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Bhavesh P. Davda
>> _______________________________________________
>> Do not post admin requests to the list. They will be ignored.
>> Darwin-kernel mailing list      (email@hidden)
>> Help/Unsubscribe/Update your Subscription:
>> 40apple.com
>>
>> This email sent to email@hidden
>
> _______________________________________________
> Do not post admin requests to the list. They will be ignored.
> Darwin-kernel mailing list      (email@hidden)
> Help/Unsubscribe/Update your Subscription:
>
> This email sent to email@hidden

--
Bhavesh P. Davda
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Darwin-kernel mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden


Follow-Ups:

Re: mbuf_outbound_finalize bug?
From: Andrew Gallatin <email@hidden>


References:  
  >Re: mbuf_outbound_finalize bug? (From: Terry Lambert <email@hidden>)
  >Re: mbuf_outbound_finalize bug? (From: Josh Graessley <email@hidden>)
  >Re: mbuf_outbound_finalize bug? (From: "Bhavesh Davda" <email@hidden>)
  >Re: mbuf_outbound_finalize bug? (From: Josh Graessley <email@hidden>)
  >Re: mbuf_outbound_finalize bug? (From: "Bhavesh Davda" <email@hidden>)
  >Re: mbuf_outbound_finalize bug? (From: Terry Lambert <email@hidden>)
  >Re: mbuf_outbound_finalize bug? (From: Adi Masputra <email@hidden>)




Prev by Date:
Re: mbuf_outbound_finalize bug?

Next by Date:
Re: mbuf_outbound_finalize bug?

Previous by thread:
Re: mbuf_outbound_finalize bug?

Next by thread:
Re: mbuf_outbound_finalize bug?

Index(es):

Date
Thread