Hi. I've got a long but hopefully interesting question and would
really appreciate any insight into our problem.
We have an IOKit driver that uses a TCP connection to talk to its device
(it's not a network family driver; we're implementing a protocol that
bridges over TCP/IP). Various changes we've made to the driver's
architecture, threading and synchronization and such, have had
surprising negative effects on performance. In general, I'm hoping for
help diagnosing this issue and wondering how to best architect for TCP
throughput.
During development we've gone through three major changes. Throughout,
we've used a sockets-style abstraction layer whose send and recv calls
are synchronous:
1) Two workloops:
A. has IOCommandGate, action routines for requests and data received
B. calls recv in action routine
Requests from the client to do work call an action routine through A's
command gate; it calls send on the initial command and any data units
built, then returns to the client.
In B, when recv returns, it calls another action routine through A's
command gate; that routine parses the response and data units, calls
send on outgoing data units, and invokes the callback to complete the
client's requests, then returns so B can start the next recv.
With (1), in general we had okay read performance and poor write
performance. This code was creating a workloop for the purposes of
having a command gate, but never made use of its context. It probably
took too long to return to the client after receiving a request, since
it always did a synchronous send before returning. (I'm not trying to
say these are the explanations for the performance problems, just our
take on the obvious inefficiencies)
2) Three workloops!
A. has IOCommandGate, action routines for requests and data received
(as before)
B. calls recv in action routine (as before)
C. calls send in action routine
Requests from the client go through A's command gate, and recv runs on
B as before, but outgoing command and data units are buffered and queued.
An action routine in workloop C dequeues buffers and calls send on them.
After changing to (2) we had better performance overall, but less
so for incoming, and it was still inconsistent at times. We were still
doing work before returning to the client, but the meat of it was
getting queued for later. The synchronization code at this point was
getting messy :-)
3) Two workloops again:
A: has IOCommandGate, action routines for requests and data received
(mostly as before)
B: calls recv in action routine (as before)
Requests from the client to do work are queued in a FIFO and the call
returns.
In B, when recv returns, the buffer is queued and another is allocated
for the next recv.
A checkForWork function on A checks the queues and, on the workloop,
dequeues tasks or dequeues received buffers to parse incoming data, and
calls send on outgoing command and data units.
Cleaning up things in (3) we were trying to further reduce the work
done before returning to the client, and to turn mysterious emergent
driver behavior into an explicit scheduling problem. We were expecting
to be able to tweak the dequeuing order and priority and come up with
performance at least as good as before and with more consistency.
Receiving, processing and sending, and command queuing were now never
waiting for each other.
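To make the scheduling idea in (3) concrete, here is a rough user-space sketch of what I mean by tweaking the dequeuing order. It is not our driver code: WorkScheduler, Policy, Request and RxBuffer are made-up names for illustration, and it is single-threaded, whereas the real queues are filled from other contexts and need locking.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for our real types; names are illustrative only.
struct Request  { std::string name; };
struct RxBuffer { std::vector<std::uint8_t> bytes; };

// Which queue is served first when both have entries.
enum class Policy { ReceivedFirst, RequestsFirst };

class WorkScheduler {
public:
    explicit WorkScheduler(Policy p) : policy_(p) {}

    // Client path: queue the request and return immediately.
    void enqueueRequest(Request r) { requests_.push_back(std::move(r)); }

    // recv context: queue the filled buffer.
    void enqueueReceived(RxBuffer b) { received_.push_back(std::move(b)); }

    // Handles at most one item per call and returns true if more work
    // remains. Each handled item is recorded in `trace`.
    bool checkForWork(std::vector<std::string>& trace) {
        bool takeRx = !received_.empty() &&
                      (policy_ == Policy::ReceivedFirst || requests_.empty());
        if (takeRx) {
            trace.push_back("rx:" + std::to_string(received_.front().bytes.size()));
            received_.pop_front();
        } else if (!requests_.empty()) {
            trace.push_back("req:" + requests_.front().name);
            requests_.pop_front();
        }
        return !received_.empty() || !requests_.empty();
    }

    // Drains both queues and returns the order in which items were handled.
    std::vector<std::string> drain() {
        std::vector<std::string> trace;
        while (checkForWork(trace)) {}
        assert(requests_.empty() && received_.empty());
        return trace;
    }

private:
    Policy policy_;
    std::deque<Request> requests_;
    std::deque<RxBuffer> received_;
};
```

With Policy::ReceivedFirst, a queued received buffer is always handled before any pending request; with Policy::RequestsFirst the order flips. That is the kind of knob we expected to be able to turn.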
But we could not get performance in (3) to be as good as (2). Moreover,
incoming data, which had always been smoothly consistent, was now having
big hiccups, with an occasional >1sec pause in the data stream waiting
for data or response data from the device.
i. Can anyone suggest approaches to determine where our problems lie?
One thought is something to do with buffering in the TCP layer below
our sockets abstraction, and how we may be filling up shared buffers
with outgoing data and causing receives to block. Or some similar
effect within our device.
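One approach we are considering is timestamping each stage (recv returned, buffer dequeued, send returned) and logging any gap above a threshold, to see which stage the >1sec pauses land in. A minimal user-space sketch of that idea follows. StallTracker and the stage names are hypothetical, and in the kext we would presumably read a kernel clock such as mach_absolute_time() rather than std::chrono.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Stall {
    std::string stage;              // stage that was reached late
    std::chrono::milliseconds gap;  // time since the previous mark
};

class StallTracker {
public:
    explicit StallTracker(std::chrono::milliseconds threshold)
        : threshold_(threshold) {}

    // Records that `stage` was reached at `now`. If the time since the
    // previous mark is at or above the threshold, the gap is remembered.
    void mark(const std::string& stage, Clock::time_point now = Clock::now()) {
        if (hasLast_) {
            assert(now >= last_);  // marks must be fed in time order
            auto gap = std::chrono::duration_cast<std::chrono::milliseconds>(now - last_);
            if (gap >= threshold_) {
                stalls_.push_back({stage, gap});
            }
        }
        last_ = now;
        hasLast_ = true;
    }

    const std::vector<Stall>& stalls() const { return stalls_; }

private:
    std::chrono::milliseconds threshold_;
    Clock::time_point last_{};
    bool hasLast_ = false;
    std::vector<Stall> stalls_;
};
```

Calling mark("recv"), mark("dequeue") and mark("send") at the matching points would attribute each long gap to the stage that was reached late. This only tells us where the time went, not why.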
ii. Should we be calling send or recv with a particular block size, or
throttling at all? (I understand that I'm not sharing much about our
sockets API layer; I don't know much about that part of our driver at the
moment except that it's been independently debugged and profiled and in
test cases performs very well.) It's hard to see anything wrong with
receiving in a dedicated context the largest expected data unit at a
time (without WAITALL flag), queueing the result and immediately
looping to receive more. For sending we are queueing buffers for each
command or data unit we build, be it a command with <64 bytes or
containing data >8K, and doing a single synchronous send for each
dequeued buffer. TCP_NODELAY is turned on (and performance was poor
until we did so). Should we be rebuffering ourselves to particular
sizes?
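By "rebuffering" I mean something like the following sketch, which coalesces whatever is queued into one batch so that a small command and the data that follows it go out in a single send rather than one send each. nextBatch and the maxBytes limit are hypothetical; I don't know what batch size, if any, would be right.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

// Pops queued buffers in FIFO order and concatenates them into one batch,
// stopping before the batch would exceed maxBytes. A single buffer larger
// than maxBytes is still returned on its own so nothing gets stuck.
Buffer nextBatch(std::deque<Buffer>& queue, std::size_t maxBytes) {
    assert(maxBytes > 0);
    Buffer batch;
    while (!queue.empty()) {
        const Buffer& head = queue.front();
        if (!batch.empty() && batch.size() + head.size() > maxBytes) {
            break;  // the next buffer goes into the following batch
        }
        batch.insert(batch.end(), head.begin(), head.end());
        queue.pop_front();
    }
    return batch;
}
```

The sender would then make one synchronous send per batch instead of one per queued buffer. The copy could probably be avoided with a scatter/gather send if our sockets layer exposes one, but I haven't checked.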
iii. Are we out to lunch in how we're architecting our driver?
IOKit Fundamentals says a driver should rarely have to create its own
workloops, and we're not exactly heeding its advice. Are we introducing too
many context switches? The way we interface to the TCP stack seems to
prevent anything different, lest we lockstep our reads and writes
and/or lock up someone else's workloop.
Thanks in advance for whatever wise advice anyone can provide.
J. Houston