Really big files and encodings
- Subject: Really big files and encodings
- From: Seth Willits <email@hidden>
- Date: Tue, 21 Apr 2009 22:57:00 -0700
There's actually just one simple question, but there's a bit of
background for context:
--
In my app, I import data from potentially very large files. In the
first pass, I simply mmap'd the entire file, created a string using
CFStringCreateWithBytesNoCopy, and went about my business. This works
great until it hits the address space limit when running as a 32-bit
process, so in the second pass I want to rework it to mmap only one
chunk (128 MB) at a time.
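A minimal sketch of mapping one window of the file at a time, assuming a
POSIX system. The Window type and the window_map/window_unmap helpers are
hypothetical names made up for illustration, not part of any framework.
mmap requires the file offset to be a multiple of the page size, so the
sketch rounds the offset down and skips the extra leading bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type: one mapped window of a file. */
typedef struct {
    void                *base;     /* page-aligned address returned by mmap */
    size_t               map_len;  /* length to pass to munmap */
    const unsigned char *data;     /* first requested byte */
    size_t               len;      /* requested bytes actually available */
} Window;

/* Map bytes [offset, offset + want) of fd, clamped to file_size.
   Returns 0 on success (w->len may be 0 at end of file), -1 on failure. */
static int window_map(int fd, off_t file_size, off_t offset, size_t want,
                      Window *w)
{
    assert(w != NULL && offset >= 0 && offset <= file_size);

    off_t  page      = (off_t)sysconf(_SC_PAGESIZE);
    off_t  aligned   = offset - (offset % page);   /* round down to a page */
    size_t slack     = (size_t)(offset - aligned); /* extra leading bytes */
    off_t  remaining = file_size - offset;
    size_t len       = ((off_t)want < remaining) ? want : (size_t)remaining;

    w->base = NULL; w->map_len = 0; w->data = NULL; w->len = 0;
    if (len == 0)
        return 0;  /* nothing left to map */

    void *p = mmap(NULL, len + slack, PROT_READ, MAP_PRIVATE, fd, aligned);
    if (p == MAP_FAILED)
        return -1;

    w->base    = p;
    w->map_len = len + slack;
    w->data    = (const unsigned char *)p + slack;
    w->len     = len;
    return 0;
}

/* Release a window mapped by window_map and reset it. */
static void window_unmap(Window *w)
{
    if (w->base != NULL)
        munmap(w->base, w->map_len);
    w->base = NULL; w->map_len = 0; w->data = NULL; w->len = 0;
}
```

The caller would advance offset by however many bytes it actually
consumed from the window, which may be fewer than w->len once character
boundaries are taken into account.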
Now, if it were simply binary data, I could chop up the file however I
wanted, but since the file I'm processing is actually a huge *text*
file, I need to mmap an appropriate range so creating the string
doesn't fail because a multi-byte character was split down the middle.
Obviously, if the file's encoding always uses a single byte per
character (IOW, CFStringGetMaximumSizeForEncoding(1, encoding) returns
1), then I can just use any range I want. If it's UTF-8 or UTF-16, I can
check the high bits to figure out the right range before creating a
string. But I know next to nothing about the other encodings, so I
don't know which ones are fixed-width and which ones are variable-width
like UTF-8.
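For the UTF-8 and UTF-16 cases, the "check the high bits" idea can be
sketched roughly like this. The function names are hypothetical, and the
sketch only trims a chunk so it does not end in the middle of a
character; it does not validate the text, and malformed input is left for
the decoder to reject.

```c
#include <stddef.h>

/* Largest length <= n such that buf[0..length) does not end in the
   middle of a UTF-8 sequence. */
static size_t utf8_safe_length(const unsigned char *buf, size_t n)
{
    size_t i = n;
    size_t steps = 0;

    /* A UTF-8 sequence is at most 4 bytes, so look back at most 4. */
    while (i > 0 && steps < 4) {
        unsigned char b = buf[--i];
        steps++;
        if ((b & 0xC0) == 0x80)
            continue;              /* 10xxxxxx: continuation byte */

        size_t need;               /* sequence length implied by the lead byte */
        if (b < 0x80)                need = 1;
        else if ((b & 0xE0) == 0xC0) need = 2;
        else if ((b & 0xF0) == 0xE0) need = 3;
        else if ((b & 0xF8) == 0xF0) need = 4;
        else return n;             /* invalid lead byte: leave to the decoder */

        /* Cut before the lead byte if its sequence is incomplete. */
        return (n - i >= need) ? n : i;
    }
    return n;  /* empty, or no lead byte in the last 4 bytes (malformed) */
}

/* Same idea for UTF-16: keep whole 2-byte code units and do not end on
   a high surrogate (0xD800..0xDBFF), which is the first half of a pair. */
static size_t utf16_safe_length(const unsigned char *buf, size_t n,
                                int big_endian)
{
    n &= ~(size_t)1;               /* whole code units only */
    if (n < 2)
        return n;
    unsigned char hi = big_endian ? buf[n - 2] : buf[n - 1];
    if ((hi & 0xFC) == 0xD8)
        return n - 2;              /* drop the dangling high surrogate */
    return n;
}
```

Bytes trimmed from the end of one chunk become the start of the next
one.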
If I can identify an encoding as fixed-width and know what the width
is, it's really easy to handle the range by simply using a multiple of
the width. And I'd expect that all of the other variable-width
encodings would need some special handling like UTF-8.
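For the fixed-width case the arithmetic is just rounding the chunk
length down to a multiple of the character width. A tiny sketch, with a
hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Largest length <= n that holds a whole number of width-byte characters. */
static size_t fixed_width_safe_length(size_t n, size_t width)
{
    assert(width > 0);
    return n - (n % width);
}
```

A 128 MB chunk is already a multiple of 1, 2, and 4, so this only
matters for the final, shorter chunk or for an unusual chunk size.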
So, I generally know what I should do, but the problem is that I don't
know how to identify an encoding as fixed-width or variable-width. I
could spend the time to look up each and every encoding on the
internet, but there are kind of a lot of them :) And then my code
wouldn't be future-proof if an encoding is added.
Can anyone offer some insight into how I could dynamically determine
an encoding's characteristics? Or maybe I should just hard-code it by
hand, because there are really very few cases to handle.
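The hard-coded option might look something like the sketch below. The
table is hand-maintained and deliberately incomplete, and the string
keys are illustrative labels rather than CFStringEncoding values; real
code would key on whatever encoding identifier the app already uses. A
width of 0 marks a variable-width encoding, and -1 means the encoding is
not in the table and should be treated conservatively.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: bytes per character, or 0 if variable-width. */
typedef struct {
    const char *name;
    int         width;
} EncodingWidth;

static const EncodingWidth kKnownWidths[] = {
    { "US-ASCII",     1 },
    { "ISO-8859-1",   1 },
    { "MacRoman",     1 },
    { "windows-1252", 1 },
    { "UTF-32BE",     4 },
    { "UTF-32LE",     4 },
    { "UTF-8",        0 },  /* 1 to 4 bytes per character */
    { "UTF-16BE",     0 },  /* 2 or 4 bytes (surrogate pairs) */
    { "UTF-16LE",     0 },
    { "Shift_JIS",    0 },  /* 1 or 2 bytes */
    { "EUC-JP",       0 },  /* 1 to 3 bytes */
};

/* Returns the fixed width in bytes, 0 for variable-width encodings,
   or -1 if the encoding is not in the table. */
static int encoding_width(const char *name)
{
    size_t count = sizeof kKnownWidths / sizeof kKnownWidths[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(kKnownWidths[i].name, name) == 0)
            return kKnownWidths[i].width;
    }
    return -1;
}
```

This is not future-proof, but an unknown encoding at least fails
visibly instead of being silently chopped at an arbitrary byte.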
Thanks,
--
Seth Willits
_______________________________________________
Cocoa-dev mailing list (email@hidden)