
Really big files and encodings


  • Subject: Really big files and encodings
  • From: Seth Willits <email@hidden>
  • Date: Tue, 21 Apr 2009 22:57:00 -0700


There's actually just one simple question, but there's a bit of background for context:
--



In my app, I import data from potentially very large files. In the first pass, I simply mmap'd the entire file, created a string using CFStringCreateWithBytesNoCopy, and went about my business. This works great until it hits the address space limit when running as a 32-bit process, so in the second pass I want to rework it to mmap only one chunk (128 MB) at a time.


Now, if it were simply binary data, I could chop up the file however I wanted, but since the file I'm processing is actually a huge *text* file, I need to mmap an appropriate range so creating the string doesn't fail because a multi-byte character was split down the middle.
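To make the chunking concrete, here is a rough sketch of the bookkeeping for one chunk. Once a chunk boundary has been nudged to land on a character boundary, its start will generally not be page-aligned, while mmap wants a page-aligned file offset. The helper name chunk_window_for and the chunk_window struct are made up for illustration; only the arithmetic is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: which part of the file to map for one chunk. */
typedef struct {
    uint64_t map_offset;  /* page-aligned offset to hand to mmap */
    size_t   map_length;  /* number of bytes to map */
    size_t   skip;        /* bytes between map_offset and the wanted data */
} chunk_window;

/* Computes a page-aligned mapping that covers [want_offset, want_offset + want_length),
 * clamped to the end of the file. Returns a zero-length window if the
 * offset is at or past the end of the file. */
chunk_window chunk_window_for(uint64_t want_offset, size_t want_length,
                              uint64_t file_size, size_t page_size)
{
    chunk_window w = {0, 0, 0};
    if (want_offset >= file_size || page_size == 0)
        return w;

    /* Do not map past the end of the file. */
    uint64_t remaining = file_size - want_offset;
    if (want_length > remaining)
        want_length = (size_t)remaining;

    /* Round the start down to a page boundary and remember the slack. */
    w.map_offset = want_offset - (want_offset % page_size);
    w.skip = (size_t)(want_offset - w.map_offset);
    w.map_length = w.skip + want_length;
    return w;
}
```

The result would be used roughly as mmap(NULL, w.map_length, PROT_READ, MAP_PRIVATE, fd, w.map_offset), with the chunk's text starting at the returned base address plus w.skip, and page_size coming from sysconf(_SC_PAGESIZE).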

Obviously, if the file's encoding always uses a single byte per character (IOW, CFStringGetMaximumSizeForEncoding(1, encoding) returns 1), then I can just use any range I want. If it's UTF-8 or UTF-16, I can check the high bits to figure out the right range before creating a string. But I know next to nothing about most other encodings, so I don't know which ones are fixed-width and which ones are variable-width like UTF-8.
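For the UTF-8 and UTF-16 cases, the "check the high bits" idea might look something like the sketch below. Both function names are hypothetical, the input is assumed to be well-formed, and the UTF-16 version assumes the code units have already been read in native byte order (so the chunk's byte length must also be even).

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the largest n <= len such that buf[0..n) does not end in the
 * middle of a UTF-8 sequence. Assumes well-formed UTF-8; on malformed
 * input it returns len and leaves the problem to the decoder. */
size_t utf8_safe_length(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len;  /* no lead byte in range */

    /* buf[i - 1] should now be the lead byte of the last sequence. */
    unsigned char lead = buf[i - 1];
    size_t need;
    if (lead < 0x80)                need = 1;  /* 0xxxxxxx */
    else if ((lead & 0xE0) == 0xC0) need = 2;  /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0) need = 3;  /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0) need = 4;  /* 11110xxx */
    else return len;                           /* malformed */

    /* Keep the last sequence only if all of its bytes are present. */
    size_t have = len - (i - 1);
    return (have >= need) ? len : i - 1;
}

/* Returns the number of UTF-16 code units that can be used without
 * ending on a high surrogate whose low surrogate is in the next chunk. */
size_t utf16_safe_units(const uint16_t *units, size_t count)
{
    if (count > 0 && units[count - 1] >= 0xD800 && units[count - 1] <= 0xDBFF)
        return count - 1;
    return count;
}
```

Whatever gets trimmed off the end of one chunk simply becomes the start of the next one.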

If I can identify an encoding as fixed-width and know what the width is, it's really easy to handle the range by simply using a multiple of that width. And more or less, I'd expect that all of the other variable-width encodings would need some special handling like UTF-8.
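The fixed-width case really is just a rounding step. A minimal sketch, with fixed_width_safe_length being a made-up name and the width assumed to be already known for the encoding:

```c
#include <stddef.h>

/* Rounds a chunk length down to a whole number of fixed-width characters.
 * width is the size of one character in bytes (e.g. 4 for UTF-32).
 * A width of 0 is treated as "unknown" and yields 0. */
size_t fixed_width_safe_length(size_t len, size_t width)
{
    if (width == 0)
        return 0;
    return len - (len % width);
}
```

For example, with a 4-byte encoding a requested length of 134217730 bytes comes back as 134217728, and the two leftover bytes go into the next chunk.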

So, I generally know what I should do, but the problem is that I don't know how to identify an encoding as fixed-width or variable-width. I could spend the time to look up each and every encoding on the internet, but there are kind of a lot of them :) And then my code wouldn't be future-proof if an encoding is added.


Can anyone offer some insight into how I could dynamically determine an encoding's characteristics? Or maybe I should just hard code it/do it by hand because there are really very few cases to handle.



Thanks,

--
Seth Willits





  • Follow-Ups:
    • Re: Really big files and encodings
      • From: Michael Ash <email@hidden>
    • Re: Really big files and encodings
      • From: Alastair Houghton <email@hidden>