process exits with bus error: forced to power-down reboot to fix filesystem
process exits with bus error: forced to power-down reboot to fix filesystem
- Subject: process exits with bus error: forced to power-down reboot to fix filesystem
- From: Daniel Quinlan <email@hidden>
- Date: Wed, 7 Jul 2010 13:37:17 -0600
I'm not certain this is the right list for this problem, but it seems to me to be a kernel problem of some sort.
It's rather long, but I've tried to be comprehensive. I've also submitted this as a bug to the
apple bug reporter.
We have a problem with a ring buffer program which memory maps a large
file.
The normal first time start up mode is to create the large file by
opening it, seeking to the end and writing a null byte, closing it,
and then memory mapping the file. This approach has worked on Solaris
and Linux without any problem, and often works on Macs as well.
However, on mac's it can result in a bus error at some indeterminant
point, as the process tries to write somewhere in the file.
Oftentimes, this leaves the process in an exiting state, and the
filesystem in a screwed up state: if you try to read the output file
with another process, (e.g. cat), that process will also get locked into
an uninterruptible state -- that is, control-C and control-Z have no
apparent effect. (I've also seen `tail buffer` fail with a
segmentation error!). Attempts to remove the file also lock up.
Furthermore, the machine cannot be rebooted as it gets lost during the
shutdown with the twirling display on an otherwise blank screen. You
have to hold down the power button for 10 seconds to power down. On
the reboot, sometimes the extra partition where I'm running the
process will fail to mount. Disk Utility doesn't find anything wrong
with it and succeeds in mounting it, though. On the other hand,
changes to files during this period may vanish during the reboot,
suggesting (I guess) that the journaling is fouled up.
This message from the console following the reboot supports that idea:
7/6/10 4:52:33 PM fseventsd[58] event logs in /t/.fseventsd out of sync with volume. destroying old logs. (88483 1 88990)
7/6/10 4:52:33 PM fseventsd[58] log dir: /t/.fseventsd getting new uuid: 07C194F2-FDD3-4375-92F2-E33863DA9140
We can evade the problem by initializing the buffer file before
starting the ring buffer program, but this is operationally difficult
-- and it hasn't been necessary on other platforms.
The problem appears to be related to how quickly the ring buffer
program is writing to the buffer file -- if it's slow enough and the
buffer file is small enough, we typically don't see the problem. It
seems to need multiple threads writing in the operational program.
I've written a small, pretty simple, example program which emulates
the operational program which is failing. It regularly fails while
writing to the file. The filesystem problem doesn't always happen.
This program starts several threads which sequentially write to the
buffer, using a mutex to force just one thread at a time. In this
example, I started eight threads and am using a 6 Gigabyte buffer.
The bus error appears when the pointer is almost 1/3 of the way
through the buffer. You can see in this case that attempts to print
past this point using gdb also crash gdb.
% gdb simple
(gdb) run -n 8
Starting program: /opt/antelope/dev-64/bin/simple -n 8
Reading symbols for shared libraries +. done
creating file diskbuffer with size 6000.00 Mbytes
creating thread #0
creating thread #1
creating thread #2
Starting writer #0 with packet size 512
creating thread #3
Starting writer #2 with packet size 514
Starting writer #1 with packet size 513
creating thread #4
Starting writer #3 with packet size 515
creating thread #5
Starting writer #4 with packet size 516
creating thread #6
Starting writer #5 with packet size 517
creating thread #7
Starting writer #6 with packet size 518
Starting writer #7 with packet size 519
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 10 at address: 0x0000000165c0b000
[Switching to process 687]
0x00007fffffe00daf in __memcpy ()
(gdb) p BeginPtr
$1 = 0x100200000 "\n### Thread #0 0 0x0\n"
(gdb) p Ptr
$2 = 0x165c0af03 "\n### Thread #0 413481 0x65a0af03\n"
(gdb) p EndPtr
$3 = 0x265c0bc00 ""
(gdb) p Ptr-BeginPtr
$4 = 1705029379
(gdb) p Ptr+512
Bus error
A second time:
% gdb simple
(gdb) run
Starting program: /opt/antelope/dev-64/bin/simple
Reading symbols for shared libraries +. done
creating file diskbuffer with size 6000.00 Mbytes
creating thread #0
creating thread #1
Starting writer #1 with packet size 513
Starting writer #0 with packet size 512
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 10 at address: 0x0000000165c0b000
[Switching to process 943]
0x00007fffffe00aa2 in __memcpy ()
(gdb) where
#0 0x00007fffffe00aa2 in __memcpy ()
#1 0x0000000100001551 in __inline_memcpy_chk (__dest=0x165c0afdb, __src=0x265f00000, __len=512) at _string.h:58
#2 0x00000001000014ea in write_packet (mythread=0, my_packet=0x265f00000 "\n### Thread #0 1663436 0x65a0afdb\n", nbytes=512) at simple.c:105
#3 0x0000000100001707 in writer (varg=0x0) at simple.c:127
#4 0x00007fff838cb456 in _pthread_start ()
#5 0x00007fff838cb309 in thread_start ()
(gdb) p BeginPtr
$1 = 0x100200000 "\n### Thread #0 0 0x0\n"
(gdb) p Ptr
Bus error
Trying to look at disk buffer with less:
% less diskbuffer
.
.
.
less(1018) malloc: *** mmap(size=18446744071562067968) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
less(1018) malloc: *** mmap(size=18446744071562067968) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
less attempting to read diskbuffer.
53644 less danq 100.0 1 408 KB Intel (64 bit) 17.4 MB 596 KB
In one instance, the simple program got stuck attempting to unlink the
diskbuffer created by the previous execution.
Here's a report created automatically: Library/Logs/DiagnosticReports/simple_2010-07-06-171154_windom.crash
Process: simple [451]
Path: /opt/antelope/dev-64/bin/simple
Identifier: simple
Version: ??? (???)
Code Type: X86-64 (Native)
Parent Process: tcsh [186]
Date/Time: 2010-07-06 17:11:37.931 -0600
OS Version: Mac OS X 10.6.4 (10F569)
Report Version: 6
Exception Type: EXC_BAD_ACCESS (SIGBUS)
Exception Codes: 0x000000000000000a, 0x0000000165c0b000
Crashed Thread: 2
Thread 0: Dispatch queue: com.apple.main-thread
0 libSystem.B.dylib 0x00007fff838cceb6 __semwait_signal + 10
1 libSystem.B.dylib 0x00007fff838f3a36 pthread_join + 844
2 simple 0x00000001000019f0 main + 720 (simple.c:186)
3 simple 0x0000000100000fb4 start + 52
Thread 1:
0 libSystem.B.dylib 0x00007fff83892342 semaphore_wait_signal_trap + 10
1 libSystem.B.dylib 0x00007fff8389788d pthread_mutex_lock + 469
2 simple 0x00000001000016f0 writer + 400 (simple.c:127)
3 libSystem.B.dylib 0x00007fff838cb456 _pthread_start + 331
4 libSystem.B.dylib 0x00007fff838cb309 thread_start + 13
Thread 2 Crashed:
0 libSystem.B.dylib 0x00007fffffe00af7 __memcpy + 855
1 simple 0x0000000100001551 _inline_memcpy_chk + 65 (_string.h:58)
2 simple 0x00000001000014ea write_packet + 282 (simple.c:106)
3 simple 0x0000000100001707 writer + 423 (simple.c:127)
4 libSystem.B.dylib 0x00007fff838cb456 _pthread_start + 331
5 libSystem.B.dylib 0x00007fff838cb309 thread_start + 13
Thread 2 crashed with X86 Thread State (64-bit):
rax: 0x00007fffffe00ab9 rbx: 0x0000000265c8d000 rcx: 0xfffffffffffffe40 rdx: 0x000000000000003b
rdi: 0x0000000165c0b1c0 rsi: 0x0000000265e001c6 rbp: 0x0000000265c8cdf0 rsp: 0x0000000265c8cdf0
r8: 0x00007fffffe0082c r9: 0x0000000000000000 r10: 0x0000000100001be8 r11: 0x0000000165c0affa
r12: 0x0000000000000000 r13: 0x0000000000001003 r14: 0x0000000100001560 r15: 0x0000000000000001
rip: 0x00007fffffe00af7 rfl: 0x0000000000010212 cr2: 0x0000000165c0b000
Binary Images:
0x100000000 - 0x100001fff +simple ??? (???) <6BBCC819-B4C8-1FDA-691C-E737A8E26468> /opt/antelope/dev-64/bin/simple
0x7fff5fc00000 - 0x7fff5fc3bdef dyld 132.1 (???) <B536F2F1-9DF1-3B6C-1C2C-9075EA219A06> /usr/lib/dyld
0x7fff83891000 - 0x7fff83a51fef libSystem.B.dylib 125.2.0 (compatibility 1.0.0) <95E02DD0-ADEA-745B-E7FA-ABA064E4658C> /usr/lib/libSystem.B.dylib
0x7fff86b14000 - 0x7fff86b18ff7 libmathCommon.A.dylib 315.0.0 (compatibility 1.0.0) <95718673-FEEE-B6ED-B127-BCDBDB60D4E5> /usr/lib/system/libmathCommon.A.dylib
0x7fffffe00000 - 0x7fffffe01fff libSystem.B.dylib ??? (???) <95E02DD0-ADEA-745B-E7FA-ABA064E4658C> /usr/lib/libSystem.B.dylib
Similar behavior has been observed on various machines, but the
current one is an older intel Mac Pro, running the latest released OS:
Model Name: Mac Pro
Model Identifier: MacPro3,1
Processor Name: Quad-Core Intel Xeon
Processor Speed: 2.8 GHz
Number Of Processors: 2
Total Number Of Cores: 8
L2 Cache (per processor): 12 MB
Memory: 12 GB
Bus Speed: 1.6 GHz
Boot ROM Version: MP31.006C.B05
SMC Version (system): 1.25f4
Serial Number (system): G8806310XYL
Hardware UUID: 5150670E-61A1-5491-97F1-AF8F88B1AFAE
% uname -a
Darwin windom 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
I'm using the latest xcode 3.2.3, but the same behavior has been seen
with the previous version.
Here's the program and makefile:
% cat makefile
simple : simple.o FORCE
clang -o simple -g simple.o -m64
dsymutil simple
FORCE:
% cat simple.c
#include <errno.h>
#include <fcntl.h>
#include <math.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <synch.h>
#include <sys/mman.h>
#include <sys/param.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
char *BeginPtr, *EndPtr, *Ptr ;
long BufSize = 6000000000 ;
long PktSize = 512 ;
static void
usage (void)
{
fprintf(stderr, "Usage: %s [-P pktsiz] [-s size]\n", "simple" ) ;
exit(1) ;
}
double
mbytes ( long nbytes )
{
return (nbytes/1.0e6) ;
}
int
mkfile ( char *name, unsigned long nbytes )
{
int fd ;
int zero = 0 ;
if ( (fd = open ( name, O_CREAT | O_RDWR | O_TRUNC, 0664 )) < 0 ) {
fprintf( stderr, "Can't create new file '%s'\n", name ) ;
return -1 ;
}
if ( lseek (fd, nbytes-1, SEEK_SET) != (nbytes-1) ) {
fprintf( stderr, "Failed to seek to end of new file '%s'\n",
name ) ;
return -1 ;
}
if ( write ( fd, &zero, 1 ) != 1 ) {
fprintf( stderr, "Failed to write byte at end of new file '%s'\n",
name ) ;
return -1 ;
}
if ( truncate ( name, nbytes ) ) {
fprintf( stderr, "Can't truncate file '%s' to length %ld\n",
name, nbytes ) ;
return -1 ;
}
return 0 ;
}
long
wmapfile (char *filename, char **mmap_datap)
{
int fd ;
struct stat statbuf;
char *mmap_data;
long retcode = 0 ;
fd = open (filename, O_RDWR);
if (fd < 0) {
fprintf( stderr, "Couldn't open %s\n", filename);
retcode = -1 ;
} else if (fstat (fd, &statbuf)) {
fprintf( stderr, "fstat failed for '%s'\n", filename);
retcode = -1 ;
} else if (statbuf.st_size == 0) {
fprintf( stderr, "'%s' is empty\n", filename);
retcode = -1 ;
} else {
mmap_data = mmap (NULL, statbuf.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0L);
close (fd);
if (mmap_data == (char *) -1) {
fprintf( stderr, "Can't map file %s\n", filename);
retcode = -1 ;
} else {
*mmap_datap = mmap_data;
retcode = statbuf.st_size ;
}
}
return retcode ;
}
void
write_packet( long mythread, char *my_packet, long nbytes)
{
if ( EndPtr - Ptr < nbytes ) {
Ptr = BeginPtr ;
}
sprintf(my_packet+25, " %p\n", Ptr-BeginPtr) ;
memcpy(Ptr, my_packet, nbytes ) ;
Ptr += nbytes ;
}
void *
writer(void *varg)
{
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER ;
long mythread = (long) varg ;
long i, my_pktsize ;
long cnt = 0 ;
char *my_packet = 0 ;
my_pktsize = PktSize + mythread ;
my_packet = (char *) malloc(my_pktsize) ;
sprintf(my_packet, "\n### Thread #ld ", mythread) ;
for (i=strlen(my_packet) ; i<my_pktsize ; i++ ) {
my_packet[i] = ' ' ; // mythread*1000 + i ;
}
fprintf(stderr, "Starting writer #%ld with packet size %ld\n", mythread, my_pktsize) ;
for (;;) {
sprintf(my_packet+14, " ld", cnt++) ;
pthread_mutex_lock ( &lock) ;
write_packet(mythread, my_packet, my_pktsize) ;
pthread_mutex_unlock( &lock) ;
}
return (void *) 0 ;
}
int
main (int argc, char **argv)
{
int c,
verbose = 0,
errflg = 0;
long i, nthreads=2 ;
pthread_t child_thread;
char *filename = "diskbuffer" ;
void *status ;
while ((c = getopt (argc, argv, "n:s:S:vV")) != -1) {
switch (c) {
case 'n':
nthreads = atol(optarg) ;
break;
case 's':
BufSize = atol(optarg) ;
break;
case 'S':
PktSize = atol(optarg) ;
break;
case 'v':
verbose++ ;
break;
}
}
if (errflg)
usage ();
unlink(filename) ;
fprintf(stderr, "creating file %s with size %.2f Mbytes\n", filename, mbytes(BufSize) ) ;
if ( mkfile ( filename, BufSize) ) {
fprintf(stderr, "Can't create file %s of %.2f Mbytes\n", filename, mbytes(BufSize)) ;
exit(1) ;
}
BufSize = wmapfile ( filename, &BeginPtr ) ;
Ptr = BeginPtr ;
EndPtr = BeginPtr + BufSize ;
for (i=0 ; i<nthreads ; i++) {
fprintf (stderr, "creating thread #%ld\n", i ) ;
if ( pthread_create (&child_thread, 0, writer, (void *) i )) {
fprintf ( stderr, "failed to create client thread\n" ) ;
exit(1) ;
}
}
pthread_join (child_thread, &status ) ;
fprintf (stderr, "Quitting\n" ) ;
return 0;
}
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Darwin-kernel mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden