We have kernel mode n/w file system for MAC OSx. Our kernel extension communicates with the server using socket apis from kpic_socket.h.
While writing huge file over the network, we observed that sock_send was hung almost for 1.5 mins [even though socket send timeout is set to 20secs]. sock_send exited only after the connection reset from server end. Same File system driver module works fine on previous version of OSx i.e. 10.7 and 10.6. To work around the issue on 10.8, I had to reset net.link.generic.system.flow_advisory to zero.
Below was call stack for the thread which has called sock_send and been blocked for almost 1.5 mins:
(gdb) bt
#0 0xffffff8015eb3d7e in machine_switch_context (old=0xffffff802590c550, continuation=0, new=0xffffff8035b71550) at /SourceCache/xnu/xnu-2050.18.24/osfmk/i386/pcb.c:511
#1 0xffffff8015e2e741 in thread_invoke (self=0xffffff802590c550, thread=0xffffff80164afae8, reason=0) at /SourceCache/xnu/xnu-2050.18.24/osfmk/kern/sched_prim.c:2231
#2 0xffffff8015e2da0c in thread_block_reason (continuation=<value temporarily unavailable, due to optimizations>, parameter=<value temporarily unavailable, due to optimizations>, reason=0) at /SourceCache/xnu/xnu-2050.18.24/osfmk/kern/sched_prim.c:2508
#3 0xffffff8015e26697 in lck_mtx_sleep_deadline (lck=0xffffff80366bdfe8, lck_sleep_action=0, event=<value temporarily unavailable, due to optimizations>, interruptible=<value temporarily unavailable, due to optimizations>, deadline=<value temporarily unavailable, due to optimizations>) at /SourceCache/xnu/xnu-2050.18.24/osfmk/kern/locks.c:551
#4 0xffffff8016169f46 in _sleep (chan=<value temporarily unavailable, due to optimizations>, pri=24, wmsg=<value temporarily unavailable, due to optimizations>, abstime=<value temporarily unavailable, due to optimizations>, continuation=<value temporarily unavailable, due to optimizations>, mtx=<value temporarily unavailable, due to optimizations>) at /SourceCache/xnu/xnu-2050.18.24/bsd/kern/kern_synch.c:199
#5 0xffffff801616a344 in msleep (chan=0xffffff80366bdc68, mtx=0xffffff80366bdfe8, pri=24, wmsg=0xffffff80162f2f6a "sbwait", ts=<value temporarily unavailable, due to optimizations>) at /SourceCache/xnu/xnu-2050.18.24/bsd/kern/kern_synch.c:347
#6 0xffffff801619cd5f in sbwait (sb=0xffffff80366bdc68) at /SourceCache/xnu/xnu-2050.18.24/bsd/kern/uipc_socket2.c:482
#7 0xffffff8016198326 in sosend (so=0xffffff80366bdb50, addr=0x0, uio=0xffffff80cd4d3100, top=0x0, control=0x0, flags=<value temporarily unavailable, due to optimizations>) at /SourceCache/xnu/xnu-2050.18.24/bsd/kern/uipc_socket.c:1439
#8 0xffffff80161b06ee in sock_send_internal (sock=<value temporarily unavailable, due to optimizations>, msg=<value temporarily unavailable, due to optimizations>, data="" flags=0, sentlen=0xffffff80cd4d39d0) at /SourceCache/xnu/xnu-2050.18.24/bsd/kern/kpi_socket.c:851
#9 0xffffff7f98060191 in sendMsgIntl (conn=0xffffff8035f3a008, buffer=0xffffff80e1c82000, num_elems=129, bytesSent=0xffffff80cd4d3b9c) at sdkrpc.c:82
#10 0xffffff7f98061bc8 in omSendData (server=0xffffff8035f3a008, item=0xffffff80e6dd53c8) at sdkrpc.c:505
#11 0xffffff7f98021d46 in cacheWriteSlice (cache=0xffffff80e1184000, pending=0xffffff80e19f4000, index=0, nothingToDo=0xffffff80cd4d3e98) at sdkcache.c:4655
#12 0xffffff7f98022dfe in flushCacheSlice (cache=0xffffff80e1184000) at sdkcache.c:4953
#13 0xffffff7f980235e8 in flushCache (cache=0xffffff80e1184000, force=0) at sdkcache.c:5105
#14 0xffffff7f9801eae5 in CacheWriter (arg=0x0) at sdkcache.c:3968
Further, checking 10.7 and 10.8 source code, we found that:
Below is code block where we are waiting for sbwait () - uipc_socket.c : line 1439
if ((space < resid + clen &&
(atomic || space < (int32_t)so->so_snd.sb_lowat || space < clen)) ||
(so->so_type == SOCK_STREAM && so_wait_for_if_feedback(so))) {
if ((so->so_state & SS_NBIO) || (flags & MSG_NBIO) ||
assumelock) {
return (EWOULDBLOCK);
}
sbunlock(&so->so_snd, 1);
*sblocked = 0;
error = sbwait(&so->so_snd);
if (error) {
if (so->so_flags & SOF_DEFUNCT)
goto defunct;
return (error);
}
goto restart;
}
After dumping local variables, it seems that only " (so->so_type == SOCK_STREAM && so_wait_for_if_feedback(so)" condition was true in our case. This condition is added newly in 10.8. To work around the issue, I had to change net.link.generic.system.flow_advisory to zero. After changing this setting, I am no longer seeing any socket send errors.
Has anyone seen such kind of errors before? Please let me know if you need any more information.
Thanks,
Pratima