On Feb 20, 2009, at 10:07 PM, Eric Gouriou wrote:
> On Feb 20, 2009, at 4:15 PM, Ryan McGann wrote:
>> A couple of weeks ago I asked about a machine-check panic I was getting. As it turns out, my suspicions about stack corruption were right. I disassembled a function in our kext and saw that, for some reason, GCC's function prologue was allocating around 1660 bytes of stack (in release; debug was slightly better at 1140 bytes). On other platforms compiled with GCC (FreeBSD and many different Linux distros) the stack usage is near 110 bytes, but for some reason gcc on Mac OS X is allocating almost 4K.
>>
>> We are currently using -O3 because the code is pretty compute-bound, and on other platforms -O3 gives a nice 5% boost compared to -O2. But changing it to -O2 doesn't even help; we have to go all the way to -O1 to get a usable stack of 400 bytes (still 4x larger than our Linux driver).
>
> What are your compile options, besides -O3 ?

There are a couple, but most are -I and -isysroot. Here's the trimmed-down command (all -I options removed) for the function that has the problem:

    g++ -W -Wall -Wcast-qual -Wcast-align -Wpointer-arith -Wsign-compare -Winline -Wunused -mmacosx-version-min=10.4 -Di386 -DDarwin -D__STDC_FORMAT_MACROS -fcheck-new -mkernel -Os -DNDEBUG -c -o BUILDTARGETS-RELEASE/Darwin-9.6.0-i386-Mac_OS_X-10.5.6/regexr.o regexr.cpp

The only options of real significance are -mkernel and -fcheck-new, neither of which AFAIK should cause a ton of stack usage.
Also, we have a userspace version of this library that uses the same options (except -mkernel), and it has stack sizes in the same neighborhood, so -mkernel is not the culprit here either.

> Do you have the same issue when using -mkernel -Os ?
> (-Os on Apple's gcc is mostly -O2 with a bit more emphasis on code size)

The -Os option is better but still bad. All the options, including no optimization, produce a pretty large stack frame. -mkernel -O2 and -mkernel -O3 produce a frame of over 1600 bytes. -mkernel -O1 is better at 400 bytes, but that's still 4x larger than -O3 on FreeBSD. -Os produces a frame of 1036 bytes, which is smaller than -O2 but way bigger than -O1.

>> The code is vanilla C++ without anything fancy, not even virtual functions. There are no warnings about temporaries being used, so I have no clue what is causing the stack usage. It's a huge function with a lot of switch statements and for loops, but not a lot of function calls; it mostly just computes on arrays of data. My best guess is that GCC is trying to optimize the intermediate operations and temporary results by placing them on the stack.
>
> You say "not a lot of function calls". Can you disable inlining or throttle it down to check that it's not the cause of the bloat? If so, -Os would help.

That's what I thought too: there is a lot of inlining, so I thought -Os would help, but in fact -Os seems to be only marginally better than -O2. This function is kind of the "heart" of our code, so it can be called a lot, and sometimes recursively, so we are looking for something in the 100-200 byte range. All the variables have been put into a struct that is (OS)Malloc'd, and the function was designed to be a leaf function (not many external function calls except for malloc/free) to minimize stack usage. The best I can get is 400 bytes, and that's with -O1. For something that is basically just a lot of array/pointer manipulation, I don't understand where the space is going.

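
For illustration, the two ideas in this exchange (keeping working data in a heap-allocated struct, and stopping the compiler from inlining helpers into the hot function) can be sketched as below. This is only a sketch under assumptions: ScanState, count_peak, and scan are hypothetical names, not the code from regexr.cpp, and std::malloc stands in for the kernel allocator. GCC's noinline attribute is used per function here; building with -fno-inline would be the blunt equivalent for a quick experiment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical working state; every name in this sketch is illustrative only.
struct ScanState {
    int counts[256];   // about 1 KB with 4-byte int, kept off the stack
};

// noinline asks GCC to keep this helper out of its caller, so whatever
// locals and spill slots it needs exist only while the helper is running.
__attribute__((noinline))
static int count_peak(ScanState *st, const unsigned char *data, std::size_t len) {
    for (std::size_t i = 0; i < 256; ++i) st->counts[i] = 0;
    for (std::size_t i = 0; i < len; ++i) st->counts[data[i]]++;
    int peak = 0;
    for (std::size_t i = 0; i < 256; ++i)
        if (st->counts[i] > peak) peak = st->counts[i];
    return peak;
}

// Recursive driver: returns the highest byte frequency found in data or in
// any of the next `depth` suffixes, or -1 if the top-level allocation fails.
int scan(const unsigned char *data, std::size_t len, int depth) {
    assert(data != 0);
    ScanState *st = static_cast<ScanState *>(std::malloc(sizeof(ScanState)));
    if (st == 0) return -1;
    int peak = count_peak(st, data, len);
    std::free(st);   // released before recursing, so nothing large stays live
    if (depth > 0 && len > 1) {
        int sub = scan(data + 1, len - 1, depth - 1);
        if (sub > peak) peak = sub;
    }
    return peak;
}
```

Comparing the prologue's subl amount for scan with and without the noinline attribute is one way to check whether inlined helpers are what inflates the frame; whether it helps for any particular function is something only the disassembly can confirm.
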
We've used this library on Linksys switches without problems.

My x86 assembly is pretty bad, and this function is several pages of C[++] code, but looking through the assembly I found something strange: although the prologue moves the stack pointer down by 1000+ bytes, and the epilogue moves it back up by the same amount, I didn't find any instructions that used anything beyond the 24th byte of the frame. It looks like the first couple of words on the stack are for the parameters to the function, and every time there's recursion, those words are retrieved from the stack to pass to the recursive call. I didn't see anything accessing an offset from %esp except in the prologue and epilogue, where %esp is used for the subl/addl, and I don't see any pushl except in the prologue. So perhaps it's an alignment issue -- but the default stack alignment should be 4 or 16 bytes, not 1600 bytes.

Also of note: this problem occurs on PowerPC as well, though the (abnormally large) stack sizes are different there, and it's not as bad. But our kext still panics on that platform too.

Anybody have ideas on how to show where GCC is allocating things in the frame, and how to reduce the stack usage? It's hard to distill this to a single issue because the function is so large, but I am tempted to file a bug, since GCC 4 optimizes things quite nicely on other platforms.

Thanks,
Ryan
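
One rough way to see how much stack each recursive call really consumes, without reading the prologue by hand, is to record the address of a local at every recursion level and divide the span by the number of calls. The sketch below assumes a GCC-compatible compiler; probe and bytes_per_call are hypothetical names, and the number it reports is for this toy function only. It does not show where inside the frame the space goes, only how large each frame is.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>

// Records the address of one local per recursion level. noinline keeps each
// level in its own real stack frame.
__attribute__((noinline))
static void probe(int depth, int levels, uintptr_t *addrs) {
    assert(depth >= 0 && depth < levels);
    volatile char marker = 0;                        // lives in this call's frame
    addrs[depth] = reinterpret_cast<uintptr_t>(&marker);
    if (depth + 1 < levels) probe(depth + 1, levels, addrs);
    marker = 1;   // a store after the call keeps it from being a tail call
}

// Average distance in bytes between the frames of successive probe() calls.
std::size_t bytes_per_call() {
    const int levels = 8;
    uintptr_t addrs[levels];
    probe(0, levels, addrs);
    uintptr_t first = addrs[0];
    uintptr_t last = addrs[levels - 1];
    uintptr_t span = first > last ? first - last : last - first;
    return static_cast<std::size_t>(span / (levels - 1));
}
```

The same trick could be applied to the real function by storing the address of one of its locals on each entry, built with the exact flags under test, and comparing the result across -O1, -O2, -O3, and -Os.
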