Re: Socket read tuning on Mac OS X
On Apr 2, 2010, at 7:18 AM, Jakub Bednar wrote:

> Hello list,
>
> I have written an application that communicates using the HTTP
> protocol. During testing, I found that with high download speeds the
> CPU usage of my application is quite high on some systems.
> Performance testing with Shark showed that the problem is in the
> socket read() operation. I'm using the poll()/read()/write()
> mechanism specified by POSIX.
>
> On the systems where the CPU usage is normal, the read() system call
> regularly returns about 10 kB of data per read. These systems have a
> default Ethernet driver (e.g. my MacBook Pro with the nvenet.kext
> driver). On the systems with high CPU usage, the read() system call
> regularly returns only about 1.4 kB of data per read, leading to 10x
> the per-read overhead and hence higher CPU usage. Examples of such
> systems are an iMac with the AppleYukon2.kext driver and a MacBook
> Air with Ethernet over USB. I have checked all the sysctl options and
> there is no difference. It seems to me that this is a driver issue.
>
> If I put a small sleep before every read, more data is returned per
> read(), leading to a rapid drop in CPU usage at the cost of a slight
> drop in download speed.
>
> So I wonder, is there a way to reconfigure a socket to tell it
> something like: "Give me about xxx kB of data per read, but if that
> amount is not available in a specified time, give me what you have"?

It's technically an issue with amortization of overhead on non-chunked data streams in your code.

When data arrives at a protocol interface, it accumulates in a local hardware buffer until a hardware-dependent amount of time has elapsed (if the hardware supports this), a transfer unit size has been reached (if the hardware partitions its input buffer), or the buffer hits a high-water mark (to prevent hardware buffer overrun). Then it sends an interrupt. Various low-level drivers have various thresholds, depending on the device firmware, the buffer size the device accumulates before it has to do a data transfer, and whether or not the device allows the host to control them.

Once the data transfer happens, though, the upper-level driver thread wakes up and calls proto_input() to input the data into the network stack. This hashes based on the protocol ID to AF_INET, to ip_input(), and then on IPPROTO_TCP to tcp_input(). The TCP data is then processed into the receiving socket buffer (up to the administrative limit on the socket), and anything sleeping on the socket, like a poll() in the program that opened it, is woken up.

There's no guarantee about how much data will have arrived before the wakeup and be sitting in the socket buffer by the time you get around to reading (i.e. additional data may be received subsequent to the wakeup). In other words, there's a queue retention time in the hardware, a queue retention time in the stack, and a queue retention time in the socket buffer, with a propagation latency in each case, plus a latency between the notification, your process being scheduled, and you doing the read.
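To make the per-read overhead concrete, the kind of poll()/read() loop described in the question, instrumented to report the average bytes returned per read(), might look roughly like this. This is a minimal sketch, not code from the original thread: socket setup and error handling are elided, and the commented-out usleep() is the sleep-before-read workaround mentioned above.

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Minimal sketch of the poll()/read() loop under discussion.
     * 'fd' is assumed to be a connected TCP socket. */
    static void read_loop(int fd)
    {
        char buf[64 * 1024];
        unsigned long long total = 0, reads = 0;
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        while (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            /* Workaround from the question: sleep briefly so more
             * data accumulates in the socket buffer before reading. */
            /* usleep(2000); */

            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0)
                break;                  /* EOF or error */
            total += (unsigned long long)n;
            reads++;
            /* ... process n bytes ... */
        }
        if (reads)
            printf("avg bytes/read: %llu\n", total / reads);
    }

On the systems described above, this counter would show roughly 10 kB per read with one driver and roughly 1.4 kB with another, for the same download.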
> I have tried the SO_RCVLOWAT and SO_RCVTIMEO options, but that is
> not the way: poll() does not respect SO_RCVLOWAT, and the socket
> will not give me less data but instead returns an error when the
> SO_RCVTIMEO timeout expires.

SO_RCVLOWAT only applies to UNIX domain sockets and FIFOs, where there is not potentially an administrative or system limit that would prevent enough data from accumulating to satisfy the low-water mark. It has no effect because it's not supposed to have any effect: you could otherwise starvation-deadlock yourself, which could be used to stage a local DoS. This is true on most other UNIX-like systems as well.

One exception is systems with "accept filters", which do kernel pre-parsing of data contents to recognize complete messages; another is the NCP streams MUX in NetWare for UNIX, which also recognizes complete messages and does not send unchunked data up to user space. These are dangerous from a security perspective because they allow a user-space process to download interpreted code intended to run in supervisor mode.

NB: as to A.M.'s suggestion, NOTE_LOWAT for kqueue() is only respected for sockets, pipes, and BPF (the Berkeley Packet Filter); unless you are doing raw Ethernet frames via BPF, this will unfortunately not help you out.

> Please, any ideas how to solve this?

Use a read thread: wait for data to become available on the socket, then add your own latency after that, so that the read is delayed from the first byte becoming available until (hopefully) other bytes have arrived as well. By isolating this in a separate thread, you put the latency only on the individual read itself, and you avoid introducing the pre-read delay as a processing latency before operating on the data.

In other words, your read thread will always sleep after data becomes available, but as long as your data arrives in a continuous stream, you amortize only a single read latency across all the processing you end up doing (much in the same way that the sliding window in the TCP protocol amortizes the ACK round-trip time of 2*MSL over the entire data stream, rather than paying a per-packet penalty). This keeps the data-fetch overhead down while not slowing your processing down waiting for the delay before the data is read.

Use whatever producer/consumer design pattern you like best to move the received data buffers between the read thread and the processing thread. (I would do it by keeping a list of work items in each, and protecting insertion and deletion with a lock, so that transferring work is simply moving a data structure from one list to another under the lock, rather than copying the data.) Sketches of both pieces follow.
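A minimal sketch of such a read thread, assuming a connected TCP socket whose descriptor is passed in by the caller. The 2 ms delay is an illustrative tunable, and enqueue_buffer() is a hypothetical handoff to the processing thread (defined in the producer/consumer sketch that follows), not any particular API.

    #include <poll.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define READ_DELAY_US 2000          /* tunable post-wakeup delay */
    #define READ_CHUNK    (64 * 1024)

    /* Hypothetical handoff to the processing thread; see the
     * producer/consumer sketch below. */
    extern void enqueue_buffer(char *data, size_t len);

    static void *reader_thread(void *arg)
    {
        int fd = *(int *)arg;
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        for (;;) {
            if (poll(&pfd, 1, -1) <= 0 || !(pfd.revents & POLLIN))
                break;

            /* Data has started arriving; sleep briefly so more of
             * the stream accumulates in the socket buffer, amortizing
             * the per-read overhead across a larger transfer. */
            usleep(READ_DELAY_US);

            char *buf = malloc(READ_CHUNK);
            ssize_t n = buf ? read(fd, buf, READ_CHUNK) : -1;
            if (n <= 0) {
                free(buf);
                break;                  /* EOF, error, or OOM */
            }
            enqueue_buffer(buf, (size_t)n);  /* wake the consumer */
        }
        return NULL;
    }

Only the reader pays the usleep(); the processing thread keeps draining buffers that have already been handed off, so the delay is not added to the processing path.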
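And a matching sketch of the lock-protected work-list handoff described above: transferring a buffer from the reader to the processor is just relinking a node under a mutex, never copying the payload. enqueue_buffer()/dequeue_buffer() are the same hypothetical names used in the reader sketch; TAILQ comes from <sys/queue.h>.

    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/queue.h>

    struct work_item {
        TAILQ_ENTRY(work_item) link;
        char   *data;
        size_t  len;
    };

    static TAILQ_HEAD(, work_item) work_queue =
        TAILQ_HEAD_INITIALIZER(work_queue);
    static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  work_cv   = PTHREAD_COND_INITIALIZER;

    /* Producer side: called by the reader thread. Transferring work
     * is just linking the item into the list under the lock. */
    void enqueue_buffer(char *data, size_t len)
    {
        struct work_item *it = malloc(sizeof(*it));
        if (it == NULL) { free(data); return; }
        it->data = data;
        it->len  = len;
        pthread_mutex_lock(&work_lock);
        TAILQ_INSERT_TAIL(&work_queue, it, link);
        pthread_cond_signal(&work_cv);
        pthread_mutex_unlock(&work_lock);
    }

    /* Consumer side: called by the processing thread; blocks until a
     * buffer is available, then takes ownership of it. */
    struct work_item *dequeue_buffer(void)
    {
        pthread_mutex_lock(&work_lock);
        while (TAILQ_EMPTY(&work_queue))
            pthread_cond_wait(&work_cv, &work_lock);
        struct work_item *it = TAILQ_FIRST(&work_queue);
        TAILQ_REMOVE(&work_queue, it, link);
        pthread_mutex_unlock(&work_lock);
        return it;
    }

This sketch uses one shared queue rather than the two per-thread lists described above, to keep it short; the essential property is the same, in that the transfer moves a pointer under the lock instead of copying the data.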
Hope that helps!

-- Terry