streamlining TCP throughput in IOKit driver



Hi. I've got a long, but hopefully interesting question and would really appreciate any insight into our problem.

We have an IOKit driver that uses a TCP connection to talk to its device (it's not a network family driver; we're implementing a protocol that bridges over TCP/IP). Various changes we've made to the driver's architecture, threading, and synchronization have had surprisingly negative effects on performance. In general, I'm hoping for help diagnosing this issue, and I'm wondering how best to architect for TCP throughput.

During development we've gone through three major changes. Throughout, we've used a sockets-style abstraction layer whose send and recv calls are synchronous:

1) Two workloops:
A. has IOCommandGate, action routines for requests and data received
B. calls recv in action routine
Requests from the client to do work call an action routine through A's command gate; it calls send on the initial command and on the data units it builds, then returns to the client.
In B, when recv returns, it calls another action routine through A's command gate; that routine parses the response and data units, calls send on outgoing data units, invokes the callback to complete the client's requests, and returns so that B can start the next recv.


With (1), in general we had okay read performance and poor write performance. This code created a workloop solely to have a command gate, but never made use of the workloop's context. It probably took too long to return to the client after receiving a request, since it always did a synchronous send of the data before returning. (I'm not trying to say these are the explanations for the performance problems, just our take on the obvious inefficiencies.)

2) Three workloops!
A. has IOCommandGate, action routines for requests and data received (as before)
B. calls recv in action routine (as before)
C. calls send in action routine
Request from client on A's command gate, and recv on B as before, but outgoing command and data units are buffered and queued.
Action routine in workloop C dequeues buffers and calls send on them
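
For readers who want to picture the send path in (2), here is a minimal user-space sketch in plain C++, not IOKit code. It assumes a std::thread standing in for workloop C and a callback standing in for the synchronous send of the sockets layer; `SendQueue`, `enqueue`, `shutdown`, and `sendFn` are hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

// Hypothetical stand-in for workloop C: one thread drains a FIFO and makes
// the blocking send call, so the request path never waits on the socket.
class SendQueue {
public:
    // sendFn is a placeholder for the sockets layer's synchronous send.
    explicit SendQueue(std::function<void(const Buffer&)> sendFn)
        : send_(std::move(sendFn)) {
        assert(send_ != nullptr);
        worker_ = std::thread([this] { run(); });
    }

    ~SendQueue() { shutdown(); }

    // Called from the request path: queue the buffer and return at once.
    void enqueue(Buffer b) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            queue_.push_back(std::move(b));
        }
        cv_.notify_one();
    }

    // Lets the sender finish everything already queued, then stops it.
    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stopping_ = true;
        }
        cv_.notify_one();
        if (worker_.joinable()) worker_.join();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mu_);
        for (;;) {
            cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) return;   // only reached once stopping_ is set
            Buffer b = std::move(queue_.front());
            queue_.pop_front();
            lock.unlock();
            send_(b);                     // blocking call made outside the lock
            lock.lock();
        }
    }

    std::function<void(const Buffer&)> send_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<Buffer> queue_;
    bool stopping_ = false;
    std::thread worker_;
};
```

The point of the sketch is only that the lock is dropped around the blocking call, so enqueuing never waits for the socket. Whether a dedicated sender context is the right choice inside a driver is exactly the open question in this post.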


After changing to (2) we had better performance overall, though less improvement for incoming data, and it was still inconsistent at times. We were still doing work before returning to the client, but the meat of it was getting queued for later. The synchronization code at this point was getting messy :-)

3) Two workloops again:
A: has IOCommandGate, action routines for requests and data received (mostly as before)
B: calls recv in action routine (as before)
Requests from the client to do work are queued in a FIFO, and the call returns.
In B when recv returns, buffer is queued, another is allocated to do next recv.
A checkForWork function on A checks the queues and, on the workloop, dequeues tasks or dequeues received buffers to parse incoming data, and calls send on outgoing command and data units.
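
The dequeuing order in (3) can be written down as an explicit policy. The following is a single-threaded sketch in plain C++ rather than IOKit code; `Scheduler`, `Item`, `drainOnce`, and the `rxBudget` parameter are hypothetical names, and the policy shown (received buffers first, bounded per pass, then one client request) is just one ordering to experiment with, not the ordering the driver actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical work item: a label stands in for a received buffer or a
// queued client request.
struct Item {
    std::string label;
};

// Sketch of the pass a checkForWork-style routine might make over two FIFOs.
class Scheduler {
public:
    void queueReceived(Item i) { received_.push_back(std::move(i)); }
    void queueRequest(Item i)  { requests_.push_back(std::move(i)); }

    // One pass: handle up to rxBudget received buffers, then one request.
    // The budget keeps a burst of incoming data from starving requests.
    // Returns the labels in the order they were handled.
    std::vector<std::string> drainOnce(std::size_t rxBudget) {
        assert(rxBudget > 0);
        std::vector<std::string> handled;
        for (std::size_t n = 0; n < rxBudget && !received_.empty(); ++n) {
            handled.push_back(received_.front().label);
            received_.pop_front();
        }
        if (!requests_.empty()) {
            handled.push_back(requests_.front().label);
            requests_.pop_front();
        }
        return handled;
    }

    bool hasWork() const { return !received_.empty() || !requests_.empty(); }

private:
    std::deque<Item> received_;
    std::deque<Item> requests_;
};
```

Because the pass is deterministic, different budgets and orderings can be compared offline against a recorded sequence of arrivals before being tried in the driver.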


In cleaning things up for (3) we were trying to further reduce the work done before returning to the client, and to turn mysterious emergent driver behavior into an explicit scheduling problem. We expected to be able to tweak the dequeuing order and priority and come up with performance at least as good as before, with more consistency. Receiving, processing and sending, and command queuing were now never waiting for each other.

But we could not get performance in (3) to be as good as in (2). Moreover, incoming data, which had always been smoothly consistent, now had big hiccups, with an occasional >1sec pause in the data stream while waiting for data or response data from the device.

i. Can anyone suggest approaches to determine where our problems lie? One thought is something to do with buffering in the TCP layer below our sockets abstraction, and how we may be filling up shared buffers with outgoing data and causing receives to block. Or some similar effect within our device.
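
One low-tech way to narrow down where the >1sec pauses come from is to timestamp each point in the pipeline (recv returning, buffer dequeued, send returning) and look at the gaps. Below is a minimal sketch in plain C++, assuming timestamps are passed in from whatever clock is available; in a kernel driver a kernel uptime clock would replace `std::chrono`. `GapTracker` and its methods are hypothetical names for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: tracks the time between successive events of one
// kind, for example successive recv completions.
class GapTracker {
public:
    using Clock = std::chrono::steady_clock;
    using Duration = Clock::duration;

    explicit GapTracker(Duration stallThreshold) : threshold_(stallThreshold) {
        assert(stallThreshold > Duration::zero());
    }

    // Call once per event. `now` is injected so that the tracker can be
    // driven by any timestamp source.
    void onEvent(Clock::time_point now) {
        if (haveLast_) {
            Duration gap = now - last_;
            if (gap > maxGap_) maxGap_ = gap;
            if (gap >= threshold_) ++stalls_;
        }
        last_ = now;
        haveLast_ = true;
    }

    Duration maxGap() const { return maxGap_; }     // longest gap seen so far
    std::size_t stalls() const { return stalls_; }  // gaps at or above threshold

private:
    Duration threshold_;
    Duration maxGap_ = Duration::zero();
    Clock::time_point last_{};
    std::size_t stalls_ = 0;
    bool haveLast_ = false;
};
```

With one tracker per stage, a stall that shows up between recv completions but not between send completions would point toward the device or the receive side of the TCP connection, while a stall in both would point toward something shared, such as a blocked context.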

ii. Should we be calling send or recv with a particular block size, or throttling at all? (I understand that I'm not sharing much about our sockets API layer; I don't know much about that part of our driver at the moment, except that it's been independently debugged and profiled and performs very well in test cases.) It's hard to see anything wrong with receiving, in a dedicated context, the largest expected data unit at a time (without the WAITALL flag), queueing the result, and immediately looping to receive more. For sending, we queue a buffer for each command or data unit we build, whether it is a command of <64 bytes or one containing >8K of data, and do a single synchronous send for each dequeued buffer. TCP_NODELAY is turned on (and performance was poor until we did so). Should we be rebuffering ourselves to particular sizes?
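
To make the "rebuffering" option concrete: with TCP_NODELAY set, a small send is generally handed to the network right away instead of being held back to combine with later data, so a <64-byte command followed by its data can go out as separate small writes. One thing to try is coalescing whatever is already queued into a single send. The sketch below is plain C++ under stated assumptions: `nextBatch` and `maxBatchBytes` are hypothetical names, the 16 KB limit in the usage note is arbitrary, and the copy could be avoided with scatter/gather I/O if the sockets layer exposes it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

// Pops whole buffers from the front of `pending` and concatenates them into
// one batch of at most maxBatchBytes. A buffer that alone exceeds the limit
// is returned by itself, so nothing is ever split or left stuck.
Buffer nextBatch(std::deque<Buffer>& pending, std::size_t maxBatchBytes) {
    assert(maxBatchBytes > 0);
    Buffer batch;
    while (!pending.empty()) {
        const Buffer& next = pending.front();
        // Stop before overflowing, unless the batch is still empty.
        if (!batch.empty() && batch.size() + next.size() > maxBatchBytes) break;
        batch.insert(batch.end(), next.begin(), next.end());
        pending.pop_front();
    }
    return batch;
}
```

The sender would then call `nextBatch(pending, 16 * 1024)` each time it is ready to send and pass the result to one synchronous send. This only helps when several buffers are already waiting, so it adds no delay to a lone command.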

iii. Are we out to lunch in how we're architecting our driver? IOKit Fundamentals says a driver should rarely have to create its own workloops, and we're not exactly heeding its advice. Are we introducing too many context switches? The way we interface to the TCP stack seems to prevent anything different, lest we lockstep our reads and writes and/or lock up someone else's workloop.

Thanks in advance for whatever wise advice anyone can provide.
J. Houston

Darwin-drivers mailing list      (email@hidden)