Re: Socket read tuning on Mac OS X
On Apr 2, 2010, at 7:18 AM, Jakub Bednar wrote:

> Hello list,
>
> I have written an application that communicates using the HTTP
> protocol. During testing, I found that with high download speeds the
> CPU usage of my application is quite high on some systems.
> Performance testing with Shark showed that the problem is in the
> socket read() operation. I'm using the poll()/read()/write()
> mechanism specified by POSIX.
>
> On the systems where the CPU usage is normal, the read() system call
> regularly returns about 10 kB of data per read. These systems have a
> default Ethernet driver (e.g. my MacBook Pro with the nvenet.kext
> driver). On the systems with high CPU usage, the read() system call
> regularly returns only about 1.4 kB of data per read, leading to 10x
> the per-read overhead and hence higher CPU usage. Examples of such
> systems are an iMac with the AppleYukon2.kext driver and a MacBook
> Air with Ethernet over USB. I have checked all the sysctl options and
> there is no difference. It seems to me that this is a driver issue.
>
> If I put a small sleep before every read, more data is returned per
> read(), leading to a rapid drop in CPU usage at the cost of a slight
> drop in download speed.
>
> So I wonder, is there a way to reconfigure a socket to tell it
> something like: "Give me about xxx kB of data per read, but if that
> amount is not available in a specified time, give me what you have"?

It's technically an issue with amortization of overhead on non-chunked data streams in your code.

When data arrives at a protocol interface, it accumulates in a local hardware buffer until a hardware-dependent amount of time has elapsed (if the hardware supports this), a transfer unit size has been reached (if the hardware partitions its input buffer), or the buffer hits a high-water mark (to prevent hardware buffer overrun). Then it sends an interrupt. Various low-level drivers have various thresholds, depending on the device firmware, the buffer size the device accumulates before it has to do a data transfer, and whether or not the device allows the host to control them.

Once the data transfer happens, though, the upper-level driver thread wakes up and calls proto_input() to input the data into the network stack. This hashes based on the protocol ID to AF_INET, to ip_input(), and then on IPPROTO_TCP to tcp_input(). The TCP data is then processed into the receiving socket buffer (up to the administrative limit on the socket), and anything sleeping on the socket, like a poll() in the program that opened it, is woken up.

There's no guarantee about how much data will have arrived before the wakeup and be sitting in the socket buffer by the time you get around to reading (i.e. additional data may be received subsequent to the wakeup). In other words, there's a queue retention time in the hardware, a queue retention time in the stack, and a queue retention time in the socket buffer, with a propagation latency in each case, plus a latency between the notification, your process being scheduled, and you doing the read.
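To make the per-read overhead concrete, the kind of poll()/read() loop described in the question, instrumented to report the average bytes returned per read(), might look roughly like this. This is a minimal sketch, not code from the original thread: socket setup and error handling are elided, and the commented-out usleep() is the sleep-before-read workaround mentioned above.

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Minimal sketch of the poll()/read() loop under discussion.
     * 'fd' is assumed to be a connected TCP socket. */
    static void read_loop(int fd)
    {
        char buf[64 * 1024];
        unsigned long long total = 0, reads = 0;
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        while (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            /* Workaround from the question: sleep briefly so more
             * data accumulates in the socket buffer before reading. */
            /* usleep(2000); */

            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0)
                break;                  /* EOF or error */
            total += (unsigned long long)n;
            reads++;
            /* ... process n bytes ... */
        }
        if (reads)
            printf("avg bytes/read: %llu\n", total / reads);
    }

On the systems described above, this counter would show roughly 10 kB per read with one driver and roughly 1.4 kB with another, for the same download.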
> I have tried the SO_RCVLOWAT and SO_RCVTIMEO options, but that is
> not the way: poll() does not respect SO_RCVLOWAT, and the socket
> will not give me less data but instead returns an error when the
> SO_RCVTIMEO timeout expires.

SO_RCVLOWAT only applies to UNIX domain sockets and FIFOs, where there is not potentially an administrative or system limit that would prevent enough data from accumulating to satisfy the low-water mark. It has no effect because it's not supposed to have any effect: you could otherwise starvation-deadlock yourself, which could be used to stage a local DoS. This is true on most other UNIX-like systems as well.

One exception is systems with "accept filters", which do kernel pre-parsing of data contents to recognize complete messages; another is the NCP streams MUX in NetWare for UNIX, which also recognizes complete messages and does not send unchunked data up to user space. These are dangerous from a security perspective because they allow a user-space process to download interpreted code intended to run in supervisor mode.

NB: as to A.M.'s suggestion, NOTE_LOWAT for kqueue() is only respected for sockets, pipes, and BPF (the Berkeley Packet Filter); unless you are doing raw Ethernet frames via BPF, this will unfortunately not help you out.

> Please, any ideas how to solve this?

Use a read thread: wait for data to become available on the socket, then add your own latency after that, so that the read is delayed from the first byte becoming available until (hopefully) other bytes have arrived as well. By isolating this in a separate thread, you put the latency only on the individual read itself, and you avoid introducing the pre-read delay as a processing latency before operating on the data.

In other words, your read thread will always sleep after data becomes available, but as long as your data arrives in a continuous stream, you amortize only a single read latency across all the processing you end up doing (much in the same way that the sliding window in the TCP protocol amortizes the ACK round-trip time of 2*MSL over the entire data stream, rather than paying a per-packet penalty). This keeps the data-fetch overhead down while not slowing your processing down waiting for the delay before the data is read.

Use whatever producer/consumer design pattern you like best to move the received data buffers between the read thread and the processing thread. (I would do it by keeping a list of work items in each, and protecting insertion and deletion with a lock, so that transferring work is simply moving a data structure from one list to another under the lock, rather than copying the data.) Sketches of both pieces follow.
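A minimal sketch of such a read thread, assuming a connected TCP socket whose descriptor is passed in by the caller. The 2 ms delay is an illustrative tunable, and enqueue_buffer() is a hypothetical handoff to the processing thread (defined in the producer/consumer sketch that follows), not any particular API.

    #include <poll.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define READ_DELAY_US 2000          /* tunable post-wakeup delay */
    #define READ_CHUNK    (64 * 1024)

    /* Hypothetical handoff to the processing thread; see the
     * producer/consumer sketch below. */
    extern void enqueue_buffer(char *data, size_t len);

    static void *reader_thread(void *arg)
    {
        int fd = *(int *)arg;
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        for (;;) {
            if (poll(&pfd, 1, -1) <= 0 || !(pfd.revents & POLLIN))
                break;

            /* Data has started arriving; sleep briefly so more of
             * the stream accumulates in the socket buffer, amortizing
             * the per-read overhead across a larger transfer. */
            usleep(READ_DELAY_US);

            char *buf = malloc(READ_CHUNK);
            ssize_t n = buf ? read(fd, buf, READ_CHUNK) : -1;
            if (n <= 0) {
                free(buf);
                break;                  /* EOF, error, or OOM */
            }
            enqueue_buffer(buf, (size_t)n);  /* wake the consumer */
        }
        return NULL;
    }

Only the reader pays the usleep(); the processing thread keeps draining buffers that have already been handed off, so the delay is not added to the processing path.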
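And a matching sketch of the lock-protected work-list handoff described above: transferring a buffer from the reader to the processor is just relinking a node under a mutex, never copying the payload. enqueue_buffer()/dequeue_buffer() are the same hypothetical names used in the reader sketch; TAILQ comes from <sys/queue.h>.

    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/queue.h>

    struct work_item {
        TAILQ_ENTRY(work_item) link;
        char   *data;
        size_t  len;
    };

    static TAILQ_HEAD(, work_item) work_queue =
        TAILQ_HEAD_INITIALIZER(work_queue);
    static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  work_cv   = PTHREAD_COND_INITIALIZER;

    /* Producer side: called by the reader thread. Transferring work
     * is just linking the item into the list under the lock. */
    void enqueue_buffer(char *data, size_t len)
    {
        struct work_item *it = malloc(sizeof(*it));
        if (it == NULL) { free(data); return; }
        it->data = data;
        it->len  = len;
        pthread_mutex_lock(&work_lock);
        TAILQ_INSERT_TAIL(&work_queue, it, link);
        pthread_cond_signal(&work_cv);
        pthread_mutex_unlock(&work_lock);
    }

    /* Consumer side: called by the processing thread; blocks until a
     * buffer is available, then takes ownership of it. */
    struct work_item *dequeue_buffer(void)
    {
        pthread_mutex_lock(&work_lock);
        while (TAILQ_EMPTY(&work_queue))
            pthread_cond_wait(&work_cv, &work_lock);
        struct work_item *it = TAILQ_FIRST(&work_queue);
        TAILQ_REMOVE(&work_queue, it, link);
        pthread_mutex_unlock(&work_lock);
        return it;
    }

This sketch uses one shared queue rather than the two per-thread lists described above, to keep it short; the essential property is the same, in that the transfer moves a pointer under the lock instead of copying the data.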
Hope that helps!

-- Terry