Re: Read lines from very large text file

Subject: Re: Read lines from very large text file
From: Michael Ash <email@hidden>
Date: Tue, 3 Feb 2009 12:10:15 -0500

On Tue, Feb 3, 2009 at 11:13 AM, Joar Wingfors <email@hidden> wrote:
>
> On Feb 2, 2009, at 9:55 PM, Michael Ash wrote:
>
>> It hasn't been addressed because it's not really relevant to the
>> question at hand. Yes, you definitely need to either know or be able
>> to discover the text encoding of the text files you're dealing with.
>> But aside from both being about text files, that question is unrelated
>> to the question of how to process a large text file line-by-line.
>
>
> Would a correct implementation not depend on being able to iterate over
> characters, and not simply using a fixed step size? I wanted to call
> attention to this because I, perhaps incorrectly, assumed that it would be.
> I think that providing a general solution to this problem, one that works
> for all text encodings, is difficult enough that it should be provided by a
> library. That said, most developers (the OP included) probably wouldn't
> require a completely general solution, and might be able to cobble together
> something that works fine with the data that they have to deal with.

In 99% of cases, no, you don't have to care about the encoding to do
the basic line splitting. A 0x0D (\r) or 0x0A (\n) byte indicates an
actual CR or LF character in ASCII, in any 8-bit ASCII-compatible
encoding, in most or all of the language-specific double-byte
encodings, and in UTF-8. The approach fails with EBCDIC and with UTF-16.
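
A rough way to check this claim for a given encoding, sketched in
Python rather than Cocoa for brevity. The function name
stray_newline_bytes is made up for illustration, and the check only
covers the Basic Multilingual Plane, so treat it as a sanity test under
those assumptions and not as a proof:

```python
def stray_newline_bytes(encoding):
    """Return BMP code points, other than CR and LF themselves, whose
    encoded form contains a 0x0A or 0x0D byte."""
    offenders = []
    for cp in range(0x10000):
        if cp in (0x0A, 0x0D):
            continue
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            # Not representable in this encoding (or a lone surrogate).
            continue
        if b"\n" in encoded or b"\r" in encoded:
            offenders.append(cp)
    return offenders


if __name__ == "__main__":
    for name in ("ascii", "latin-1", "utf-8", "shift_jis", "utf-16-le"):
        count = len(stray_newline_bytes(name))
        print(f"{name}: {count} code points contain a stray 0x0A/0x0D byte")
```

An empty result means a blind byte scan for 0x0A/0x0D is safe for that
encoding, at least for the code points tested. UTF-16 produces a
non-empty list because, for example, U+010A encodes to the bytes
0x0A 0x01 in little-endian order.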

Most text processing tools out there assume precisely this sort of
blind encoding-agnostic scanning, so character encodings tend to take
that into account. It would break a lot of tools to have a 0x0A or
0x0D byte show up in the stream but have a different meaning, so they
avoid it.
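
For what it's worth, a minimal sketch of that kind of blind scanning,
in Python rather than Cocoa to keep it short. The name iter_lines_bytes
and the 64 KB chunk size are arbitrary choices for illustration, not
anything from this thread. It reads fixed-size chunks, splits on the
0x0A byte without decoding anything, and strips a trailing 0x0D so that
CRLF files work too. Lone-CR (classic Mac) line endings are not handled.

```python
import io


def iter_lines_bytes(stream, chunk_size=64 * 1024):
    """Yield lines as bytes, terminators stripped, from a binary stream.

    Only the 0x0A and 0x0D bytes are inspected, so the same code works
    for ASCII, 8-bit ASCII-compatible encodings, and UTF-8.
    """
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last LF is complete; the tail may be a
        # partial line (or even a partial multi-byte character).
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line[:-1] if line.endswith(b"\r") else line
    if pending:
        # Final line with no trailing LF.
        yield pending[:-1] if pending.endswith(b"\r") else pending


if __name__ == "__main__":
    data = "naïve\r\ncafé\n日本語".encode("utf-8")
    # A tiny chunk size forces chunk boundaries inside multi-byte characters.
    for raw in iter_lines_bytes(io.BytesIO(data), chunk_size=4):
        print(raw.decode("utf-8"))
```

Decoding happens per line, after splitting, so a chunk boundary that
lands in the middle of a multi-byte character causes no trouble. For a
real file you would pass something like open(path, "rb") as the stream.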

If you anticipate processing UTF-16 files then you'll have to write a
completely different code path for that, of course, but they tend to
be rare, and it would be entirely reasonable for such a tool to not
support UTF-16.
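
If you did want that separate code path, one possible shape is sketched
below, again in Python and again with made-up names (iter_lines_utf16
is hypothetical). It assumes the file is UTF-16 with a BOM, or in the
platform's native byte order without one, and decodes incrementally so
that the split happens on characters and not on bytes:

```python
import codecs
import io


def iter_lines_utf16(stream, chunk_size=64 * 1024, encoding="utf-16"):
    """Yield lines as str, terminators stripped, from a UTF-16 binary stream."""
    # The incremental decoder buffers partial code units and surrogate
    # pairs across chunk boundaries.
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += decoder.decode(chunk)
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line[:-1] if line.endswith("\r") else line
    # Raises UnicodeDecodeError if the file ends mid code unit.
    pending += decoder.decode(b"", final=True)
    if pending:
        yield pending[:-1] if pending.endswith("\r") else pending


if __name__ == "__main__":
    # U+010A contains a 0x0A byte in UTF-16 but is not a line feed.
    data = "A\u010a\nB\r\nC".encode("utf-16")
    print(len(data.split(b"\n")), "pieces from a blind byte split")
    print(list(iter_lines_utf16(io.BytesIO(data), chunk_size=3)))
```

The blind byte split cuts this sample in the wrong places, both inside
U+010A and in the middle of each real line feed, which is the failure
described above.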

Mike