Re: mbuf_outbound_finalize bug?
Re: mbuf_outbound_finalize bug?
- Subject: Re: mbuf_outbound_finalize bug?
- From: "Bhavesh Davda" <email@hidden>
- Date: Tue, 2 Jan 2007 09:36:19 -0800
Thanks for the explanation, Josh!
I had figured as much reading the xnu-792.13.8 code.
However, if you look at the subject of this e-mail thread, and go back
to my original post about this, I had reported a bug in
mbuf_outbound_finalize and friends, that Adi acknowledged as a known
bug that will be fixed in a subsequent release of the kernel.
Instead of reposting those e-mails, I'll just link to the appropriate
posts here:
My original post:
http://lists.apple.com/archives/darwin-kernel/2006/Dec/msg00063.html
Adi's response with a code snippet for a workaround:
http://lists.apple.com/archives/darwin-kernel/2006/Dec/msg00072.html
Doing what Adi suggested worked for *most* cases, except that I
started seeing kernel panics when my kext loads at boot time, when Mac
OS (probably Bonjour) sends an mDNS packet, which is UDP checksum
offloaded, and my kext calls mbuf_outbound_finalize(). The panic
happens in the call to mbuf_copyback when it tries to restore the
tot_len field in the IP header in the mbuf in network-byte-order.
When I looked at the mbuf after gdb attaching to the panicked kernel,
all the pointers looked trashed. Instead of further debugging
mbuf_outbound_finalize/in_delayed_cksum_offset/in_cksum_skip, I
decided that I had spent way too much time on this whole hw checksum
offloading mess, and decided that I would walk the chain of mbufs
myself, computing my own checksums.
Yes, I do so only when my kext has determined that an outbound mbuf is
*interesting*, and I clone the original mbuf for forwarding to my
applications. So it isn't wasted effort.
However, I would love for you to tell me that I don't need to do so :)
My poor-mans attempt at checksumming a block of data is probably not
going to perform as well as the tuned in_cksum_skip and assembly
helpers :)
Thanks
- Bhavesh
On 1/1/07, Josh Graessley <email@hidden> wrote:
Oh no, the crummy hardware checksum stuff has stuck to me! Help!
Someone save me!
Terry has most of it.
The value of csum_data is a bit more complicated.
Hardware checksums occur in two directions, inbound and outbound. The
csum_data field has very different meanings depending on the direction.
Inbound:
The csum_flags will have bits set indicating which checksums were
performed in hardware. In the event that the hardware looked at the
checksum for the ip header in hardware, the MBUF_CSUM_DID_IP flag
will be set. This flag indicates that the MBUF_CSUM_IP_GOOD flag is
valid. When we get to ip_input, if MBUF_CSUM_DID_IP is set we will
skip doing the checksum ourself. Instead, we will check
MBUF_CSUM_IP_GOOD. If MBUF_CSUM_IP_GOOD is not set, we discard the
packet as having a bad ip header checksum.
In addition, the MBUF_CSUM_DID_DATA flag may be set. This flag
indicates that the TCP or UDP checksum was calculated in hardware.
The calculated value will appear in csum_data. When the packet gets
in to tcp_input or udp_input, we will check this flag. If it is set,
we will use the value in csum_data instead of calling in_cksum.
Outbound:
The csum_flags field will have bits set indicating which checksums
should be performed by the hardware. MBUF_CSUM_REQ_IP indicates that
the ip header checksum has not been calculated yet and should be
calculated before the packet is sent. MBUF_CSUM_REQ_TCP and
MBUF_CSUM_REQ_UDP indicate that the hardware should calculate the TCP
or UDP checksum before the packet is sent. When either of these flags
are set, csum_data will contain a pre-computed psuedo header checksum
and the length to the offset of the location where the TCP or UDP
checksum should be placed. There is also some goo in there for some
CSUM_TCP_SUM16 thing. For that case, the csum_data value will mean
something completely different. I can't recall the details but there
is a lot of extra work done for a very special chip that had some
very special requirements. When I say special, I don't mean that in a
good way. 'nuff said.
What's the best way to deal with hardware checksums in a kext? Get
rid of them as fast as you possibly can. How do you get rid of them?
You calculate them in software the way the maker intended. The kpi
provides two primary functions for doing this:
mbuf_outbound_finalize
and
mbuf_inbound_modified
Whenever you receive a packet on the outbound path (stack to
interface), call mbuf_outbound_finalize. Whenever you receive a
packet on the inbound path, call mbuf_inbound_modified. From there,
you can go about your business as though the entire nightmare of
hardware checksums does not exist. For a little extra credit, only
call mbuf_inbound_modified and mbuf_outbound_finalize on packets that
you intend to modify, forward, encapsulate or in some other way do
anything beyond inspecting.
Do not try to modify the value of csum_data. The stack is responsible
for setting that and the drivers are responsible for consuming it.
Why am I so opposed to hardware checksums? Two reasons. First,
checksums detect errors. If you introduce an error somewhere in the
software just below the stack (say the driver) or even in the
hardware itself (dma engine), the checksum will catch the error
quickly, making it easier to narrow down the root cause. Second,
layer violations like this lead to problems for network kernel
extensions and results in little bits of code to handle this peppered
throughout the stack. Offloading more of the TCP stack on to hardware
leads to similar problems. Mmmm canned worms.
Best of luck,
-josh
On Jan 1, 2007, at 5:28 PM, Terry Lambert wrote:
> Josh G. is the expert here, but I'm pretty sure that you would end
> up shooting your foot off if you did this.
>
> When you set these flags, they are generally done in the protocol
> specific output functions, and since TCP and UDP are both layered
> on top of IP, you are requesting that the checksum be calculated
> both as if the packet were a TCP and a UDP packet, and then asking
> for the IP encapsulation checksum to also be calculated.
>
> If you had hardware checksum assist (common on most hardware these
> days), then you may end up with both the TCP and UDP checksums
> calculated, and one of them stomping the other.
>
>
> FWIW, csum_data is a value, not a pointer, for the 16 bit checksum
> that's trying to be calculated, and is used as an accumulator for
> the checksum calculation. Practically, this means that for your
> applications, you'd likely want to leave it alone, unless you were
> setting flags requesting checksums. If you were doing that, then
> you'd need to know if it was on input or output, and since you are
> doing a full checksum, then you'd want to set it to 0xffff or 0,
> respectively (see the source code for the protocol specific input/
> output mechansims for details.
>
>
> Again, if you are doing an incremental update (i.e. you already
> have a valid packet in hand, and are trying to replace a few data
> fields in it without hanging packet size), then the only things you
> will be touching are the data, the layer 4 checksump (TCP or UDP)
> and the layer 3 checksum (IP). You won't be setting these flags to
> request anything, clearing them to avoid requesting anything, and
> you won't be using the accumulator yourself.
>
>
> I think at this point, it would be educational to download the xnu
> sources and do a "grep -r" for the field names and flags to better
> understand how the code works.
>
> -- Terry
>
> On Dec 29, 2006, at 9:31 AM, Bhavesh Davda wrote:
>> Hi Terry,
>>
>> That's exactly what I ended up implementing. I haven't read RFC 1624,
>> but I will.
>>
>> Related question: if ever csum_flags has both CSUM_IP and CSUM_TCP |
>> CSUM_UDP set, then what is csum_data going to point to?
>>
>> Thanks!
>>
>> - Bhavesh
>>
>> On 12/28/06, Terry Lambert <email@hidden> wrote:
>>> Are you changing the length of the packet, or only the contents?
>>>
>>> If you are only changing the contents, then I suggest an mbuf-chain
>>> aware incremental checksum update as an alternative to copying it
>>> to a
>>> contiguous buffer, operating on it, converting it to mbufs, then
>>> doing
>>> a full software checksum of the whole packet.
>>>
>>> See RFC 1624.
>>>
>>> -- Terry
>>>
>>> On Dec 22, 2006, at 10:12 AM, Adi Masputra wrote:
>>> > On Dec 22, 2006, at 9:58 AM, Bhavesh Davda wrote:
>>> >>
>>> >> I compare the result of mbuf_pkthdr_len agains the network-
>>> byte-order
>>> >> value of IP length on purpose, to know whether indeed the
>>> value of IP
>>> >> length in the packet is *not* in host-byte-order, as I thought
>>> you
>>> >> said mbuf_outbound_finalize() expects.
>>> >
>>> > Since your module is an interface filter, the field(s) in the
>>> > header(s)
>>> > will be in network byte-order. All you need to do is to byte-swap
>>> > the length at all times assuming it is an IPv4 packet.
>>> >
>>> >>
>>> >> Oh boy. This is going to get quite expensive to do on a per
>>> packet
>>> >> basis. I thought I could simply get a pointer into the packet
>>> payload
>>> >> to the IP length field in the IP header, and do a memory read/
>>> write,
>>> >> rather than going through so many function calls and memcpy's,
>>> albeit
>>> >> small (2 byte) ones.
>>> >
>>> > Well, you are already requesting the stack to perform software
>>> > checksum
>>> > calculation on the packet which includes the transport payload; in
>>> > terms
>>> > of per-byte cost that is much more significant than traversing the
>>> > mbuf
>>> > chain to safely obtain the 2-octets IP length in the header. :)
>>> >
>>> >>
>>> >> Is there any other way to do it more optimally? I understand your
>>> >> concern about non-linear mbufs, and not assuming that the IP
>>> header
>>> >> is
>>> >> in the pkthdr mbuf. But realistically, is that ever going to
>>> be the
>>> >> case?
>>> >>
>>> >
>>> > There's no guarantee that the headers will be in contiguous span
>>> > although for the majority of cases they would. Sorry, but as I
>>> said
>>> > the fix is already in order.
>>> >
>>> > Adi
>>> >
>>> >> Thanks!
>>> >>
>>> >> --
>>> >> Bhavesh P. Davda
>>> >
>>> > _______________________________________________
>>> > Do not post admin requests to the list. They will be ignored.
>>> > Darwin-kernel mailing list (email@hidden)
>>> > Help/Unsubscribe/Update your Subscription:
>>> 40apple.com
>>> >
>>> > This email sent to email@hidden
>>>
>>>
>>
>>
>> --
>> Bhavesh P. Davda
>
> _______________________________________________
> Do not post admin requests to the list. They will be ignored.
> Darwin-kernel mailing list (email@hidden)
> Help/Unsubscribe/Update your Subscription:
> 40apple.com
>
> This email sent to email@hidden
--
Bhavesh P. Davda
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Darwin-kernel mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden