Re: Really big files and encodings
- Subject: Re: Really big files and encodings
- From: Greg Guerin <email@hidden>
- Date: Wed, 22 Apr 2009 10:34:49 -0700
Seth Willits wrote:
> In my app, I import data from potentially very large files. In the
> first pass, I simply mmap'd the entire file, created a string using
> CFStringCreateWithBytesNoCopy, and go about my business. This works
> great until it hits the address limit when it's running as a 32-bit
> process, so now in the second pass I want to rework it a bit to
> only mmap a chunk (128 MB) at a time.
>
> Now, if it were simply binary data, I could chop up the file
> however I wanted, but since the file I'm processing is actually a
> huge *text* file, I need to mmap an appropriate range so creating
> the string doesn't fail because a multi-byte character was split
> down the middle.
Change the buffer management.

Add a cushion to your mmap'ed chunk, say 1 MB, so you mmap in 129 MB
at a time. When parsing the first 128 MB, everything proceeds
normally, and there are no worries about splitting a multi-byte
character. A character that starts just before the 128 MB mark can
still be read in full, because its trailing bytes are mapped in the
cushion area.
When the get-next-string starting position moves into the cushion
area, then you re-mmap the next chunk (advance by 128 MB, i.e. buffer
minus cushion) and reposition your pointers in the buffer. Then you
have about 128 MB of no worries again.
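
A minimal sketch of that windowing in C, for illustration only. The
names (Window, window_base, window_len, window_seek) are hypothetical
and not part of any Apple API. It assumes a 64-bit off_t, and it relies
on 128 MB being a multiple of the page size, because mmap requires a
page-aligned file offset. The caller is assumed to fill in fd and
file_size and to set addr to NULL before the first call.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

#define CHUNK   ((off_t)128 << 20)  /* how far each re-mmap advances */
#define CUSHION ((off_t)1 << 20)    /* extra bytes mapped past the chunk */

/* Hypothetical bookkeeping for one mapped window of the file. */
typedef struct {
    int    fd;         /* open file descriptor */
    off_t  file_size;  /* total size of the file in bytes */
    off_t  base;       /* file offset where the current window starts */
    void  *addr;       /* current mapping, or NULL if nothing is mapped */
    size_t len;        /* length of the current mapping */
} Window;

/* File offset of the window that should hold position pos. */
off_t window_base(off_t pos) {
    return (pos / CHUNK) * CHUNK;
}

/* Bytes to map for a window at base: chunk plus cushion, clipped at EOF. */
size_t window_len(off_t file_size, off_t base) {
    assert(base >= 0 && base <= file_size);
    off_t remaining = file_size - base;
    off_t want = CHUNK + CUSHION;
    return (size_t)(remaining < want ? remaining : want);
}

/* Make sure the window covering pos is mapped. Returns 0 on success,
   -1 if pos is out of range or mmap fails. Once pos has moved into the
   cushion of the old window, window_base(pos) changes and we re-mmap. */
int window_seek(Window *w, off_t pos) {
    if (pos < 0 || pos >= w->file_size) return -1;
    off_t base = window_base(pos);
    if (w->addr != NULL && base == w->base) return 0;  /* still inside */
    if (w->addr != NULL) {
        munmap(w->addr, w->len);
        w->addr = NULL;
    }
    size_t len = window_len(w->file_size, base);
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, w->fd, base);
    if (p == MAP_FAILED) return -1;
    w->addr = p;
    w->len  = len;
    w->base = base;
    return 0;
}
```

A parser would call window_seek(&w, pos) before each get-next-string
and then read from (const char *)w.addr + (pos - w.base). At least
CUSHION bytes are mapped beyond pos unless the file ends sooner.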

Choose a cushion size suitable for the maximum length of a multi-byte
sequence. There's nothing magic about 1 MB; use something smaller if
it suffices. And don't forget combining character sequences, where
multiple multi-byte "characters" should remain together.
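
As a rough illustration of what "maximum length" means, here is a
sketch that assumes the text is UTF-8 (the original question does not
say which encoding is in use). In UTF-8 a single code point occupies
at most 4 bytes, so the check only needs to look at the last few bytes
of a buffer. The function names are hypothetical, and the sketch
deliberately does not handle combining sequences, which would need a
larger look-back.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte b,
   or 0 if b cannot start a sequence (e.g. it is a continuation byte). */
size_t utf8_seq_len(unsigned char b) {
    if (b < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Largest n <= len such that buf[0..n) does not end partway through a
   UTF-8 sequence. Malformed tails are returned unchanged so that the
   decoder, not this helper, reports the error. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len) {
    size_t i = len;
    size_t back = 0;
    /* Step back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0) return len;             /* no lead byte in range */
    size_t need = utf8_seq_len(buf[i - 1]);
    size_t have = len - (i - 1);        /* bytes from the lead byte on */
    if (need == 0 || have >= need) return len;  /* malformed or complete */
    return i - 1;                       /* truncated: cut before the lead */
}
```

Passing the result as the byte count when creating the string keeps a
split code point out of the current chunk; the remaining bytes are then
picked up at the start of the next one.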
-- GG