Re: Reading in dictionary from txt file: options for speed
Re: Reading in dictionary from txt file: options for speed
- Subject: Re: Reading in dictionary from txt file: options for speed
- From: Kyle Sluder <email@hidden>
- Date: Tue, 14 Apr 2009 15:03:42 -0400
On Tue, Apr 14, 2009 at 2:12 PM, Miles <email@hidden> wrote:
> I'm not super concerned about the 2MB of disk space the txt file takes up,
> although I wouldn't be mad about decreasing it somehow. And once I get the
> whole dictionary in an array, the searches are basically fast enough for my
> purposes. I've still been reading up on Huffman encoding if I decide to try
> to compress this. However, my main issue now is loading time, and it seems
> like this won't help me there.
Huffman coding won't help you with time (in fact it may hurt you
depending on how you reconstruct your entries) but it will definitely
help you with space. It performs excellently on plaintext because of
the distribution in letter frequency.
> And, I'm looking into creating a Trie (which is where the previous thread
> guided me), although I'm not sure this helps my current issue of loading
> time either. I'm thinking that creating a Trie will probably take just as
> long, or longer, than simply splitting the file using
> 'componentsSeparatedByString', right? So, is there some way to store the
> trie on disk so that the loading is my final data structure is faster? What
> other options do I have to speed this up?
I don't know anything about tries, so I'm afraid I can't help you there.
One thing that might help, however, is to store your data as a binary
blob on disk, glom it all in at once and treat it as a big structure.
To keep it simple, let's just stick with the array approach. Say you
define a structure like this:
typedef struct _word {
char len;
char text[];
} *word;
You can then store your dictionary as a binary dump of integer length
and word contents (with null terminator), like so:
\x06Hello\x00\x03Hi\x00
Then just read the entire file into memory at once, and treat the area
of memory as a word (aka struct _word *).
NSArray *wordsFromBlob(word *blob) {
// maybe store the entry count in the blob as well, and use
// that to give a hint to -[NSMutableArray initWithCapacity:]
NSMutableArray *result = [[NSMutableArray alloc] init];
word *cur;
for(cur = blob; cur != NULL; (*(char *)cur) += cur->len)
[result addObject:[NSString stringWithUTF8String:cur->text]];
return result;
}
Be careful: 1) You need to dump out your strings as UTF8, not ASCII;
2) Pointer access needs to be 4-byte aligned (on PC, don't know about
iPhone); 3) I wrote this code in the compose window and it's probably
horribly buggy. Hopefully it's sufficient to get my point across.
--Kyle Sluder
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden