Re: Read lines from very large text file
- Subject: Re: Read lines from very large text file
- From: Michael Ash <email@hidden>
- Date: Tue, 3 Feb 2009 12:10:15 -0500
On Tue, Feb 3, 2009 at 11:13 AM, Joar Wingfors <email@hidden> wrote:
>
> On Feb 2, 2009, at 9:55 PM, Michael Ash wrote:
>
>> It hasn't been addressed because it's not really relevant to the
>> question at hand. Yes, you definitely need to either know or be able
>> to discover the text encoding of the text files you're dealing with.
>> But aside from both being about text files, that question is unrelated
>> to the question of how to process a large text file line-by-line.
>
>
> Would a correct implementation not depend on being able to iterate over
> characters, and not simply using a fixed step size? I wanted to call
> attention to this because I, perhaps incorrectly, assumed that it would be.
> I think that providing a general solution to this problem, one that works
> for all text encodings, is difficult enough that it should be provided by a
> library. That said, most developers (the OP included) probably wouldn't
> require a completely general solution, and might be able to cobble together
> something that works fine with the data that they have to deal with.

For 99% of cases, no, you don't have to care about the encoding to
do the basic parsing. A \r or \n byte will indicate an actual CR or LF
character in ASCII, in any 8-bit ASCII-compatible encoding, in most or
all of the language-specific double-byte encodings, and in UTF-8. It
will fail with EBCDIC and with UTF-16; in UTF-16, for example, a 0x0A
or 0x0D byte can be one half of an unrelated 16-bit code unit.
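
As a rough illustration of the point above, here is a small Python
sketch (Python is used only for brevity, and the helper name
split_on_lf_bytes is made up for this example). It splits raw bytes on
0x0A with no knowledge of the encoding and decodes the pieces
afterwards, then shows one UTF-16 case where the same scan goes wrong.

```python
def split_on_lf_bytes(data):
    """Split raw bytes on the 0x0A byte, with no knowledge of the encoding."""
    return data.split(b"\n")


text = "na\u00efve caf\u00e9\nsecond line"

# In UTF-8 and Latin-1 the byte 0x0A only ever means LF, so blind
# byte-level splitting finds exactly the one real line break.
for enc in ("utf-8", "latin-1"):
    pieces = split_on_lf_bytes(text.encode(enc))
    print(enc, [p.decode(enc) for p in pieces])

# UTF-16 is different: U+010A is stored as the bytes 0x0A 0x01 in
# little-endian order, so a 0x0A byte shows up with no line break present.
utf16 = "\u010a".encode("utf-16-le")
print(split_on_lf_bytes(utf16))  # the single character is cut in two
```

The last call returns two fragments for input that contains no line
break at all, which is the failure mode described above.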

Most text processing tools out there rely on precisely this sort of
blind, encoding-agnostic scanning, so character encodings tend to take
that into account. It would break a lot of tools to have a 0x0A or
0x0D byte show up in the stream but have a different meaning, so
encoding designers avoid it.
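
For the original question (a very large file read line by line), the
blind scan might look something like the following Python sketch. The
function name iter_lines and the 64 KB chunk size are arbitrary choices
for illustration, and the code assumes an ASCII-compatible encoding as
discussed above. It reads fixed-size chunks, splits on the LF byte, and
carries any incomplete line over to the next chunk.

```python
import io


def iter_lines(stream, chunk_size=64 * 1024):
    """Yield lines (as bytes, without the terminator) from a binary stream.

    Splits on LF and strips one trailing CR, so LF and CRLF files both
    work. Lone-CR line endings are not handled in this sketch.
    """
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last LF is complete; the remainder may be
        # the start of a line that continues in the next chunk.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line[:-1] if line.endswith(b"\r") else line
    if pending:
        # Final line with no trailing newline.
        yield pending[:-1] if pending.endswith(b"\r") else pending


# Usage: a tiny chunk size forces lines to straddle chunk boundaries.
data = "first\r\nsecond\nthird".encode("utf-8")
for raw in iter_lines(io.BytesIO(data), chunk_size=4):
    print(raw.decode("utf-8"))
```

Decoding happens per line, after the split, so a multi-byte character
that straddles a chunk boundary is never decoded in pieces.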

If you anticipate processing UTF-16 files then you'll have to write a
completely different code path for that, of course, but they tend to
be rare, and it would be entirely reasonable for such a tool to not
support UTF-16.
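
If UTF-16 support is wanted, one possible shape for the separate code
path is sketched below in Python. The names has_utf16_bom and
read_lines are hypothetical, and the sketch makes two loud assumptions:
that UTF-16 files start with a byte order mark, and that everything
else is in an ASCII-compatible encoding (UTF-8 by default here). It
does not try to tell UTF-16 apart from UTF-32, whose little-endian BOM
begins with the same two bytes.

```python
import codecs


def has_utf16_bom(head):
    """Return True if the first bytes of a file are a UTF-16 byte order mark."""
    return head[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def read_lines(path, fallback_encoding="utf-8"):
    """Yield decoded lines, using a separate path for BOM-marked UTF-16."""
    with open(path, "rb") as f:
        head = f.read(2)
    if has_utf16_bom(head):
        # UTF-16 path: decode first, then look for line breaks in the
        # decoded text rather than in the raw bytes.
        with open(path, "r", encoding="utf-16") as f:
            for line in f:
                yield line.rstrip("\r\n")
    else:
        # ASCII-compatible path: iterating a binary file splits on the
        # 0x0A byte, and each line is decoded only afterwards.
        with open(path, "rb") as f:
            for raw in f:
                yield raw.rstrip(b"\r\n").decode(fallback_encoding)
```

Files without a BOM would need some other way of discovering the
encoding, which is the separate question raised earlier in the thread.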

Mike