Re: Avoiding duplicate records
- Subject: Re: Avoiding duplicate records
- From: Miguel Arroz <email@hidden>
- Date: Wed, 16 Jan 2008 01:15:19 +0000
Hi!
Well, on my 1.67 GHz PowerBook G4, with no conflicts yet:
1050 records saved one by one: 14 secs
1050 records saved in 50 record batches: 6 secs
1050 records saved in 500 record batches: 5 secs
10000 records saved in 50 record batches: 31 secs
10000 records saved in 500 record batches: 26 secs
[Curiosity] pasting a 10k-line text block into a Safari text area:
>1 minute!
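For easier comparison, the timings above can be converted to
throughput. This is plain arithmetic on the figures quoted in this
message; the labels are made up for the example, and the Safari paste
is left out because only a lower bound was given.

```python
# (label, records saved, seconds) taken from the timings listed above.
runs = [
    ("1050, one by one", 1050, 14),
    ("1050, batches of 50", 1050, 6),
    ("1050, batches of 500", 1050, 5),
    ("10000, batches of 50", 10000, 31),
    ("10000, batches of 500", 10000, 26),
]

# Records per second for each run.
rates = {label: records / secs for label, records, secs in runs}

for label, rate in rates.items():
    print(f"{label}: {rate:.0f} records/s")
```

For the 1050-record runs this gives 75 records/s one by one, against
175 and 210 records/s in batches of 50 and 500.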
Given these results, some preliminary conclusions:
1) This takes much less time than I expected. If it runs at
this speed on my old PowerPC, with a slow drive, lots of processes
running (including Eclipse), the WO app running in development mode,
etc., then on a real server (with Intel processors) it will run even
faster (much faster, from what I've seen of Java running on Intel).
2) There is no significant difference between batch sizes of 50
and 500. I was NSLogging every time I saved a batch, so I did
a lot more logging in the tests with batches of 50. Logging takes a
lot of time, so I think overall it's not that different.
3) Inserting one by one is noticeably slower, but not THAT much
slower (I logged every 50 inserts, so no extra logging time here).
So, I think what I'll do is write in batches of 50 or so, and
if a batch fails, write that batch's contacts one by one. It's
probably a bit slower than fetching, removing duplicates and saving,
but it's not that bad, it's much easier to code, and it won't fail
a second time if concurrent updates are being made (each contact will
be saved, or not, period). It's actually fast enough that it doesn't
need a background process; it can run in an AJAXed long response.
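The batch-then-fallback idea can be sketched as follows. This is not
WebObjects/EOF code: it uses Python's sqlite3 module as a stand-in,
and the `contact` table, the `save_contacts` function and the sample
addresses are all made up for the example. It assumes the unique
constraint is checked immediately, so a duplicate makes the whole
batch fail and roll back.

```python
import sqlite3

INSERT = "INSERT INTO contact (email) VALUES (?)"

def save_contacts(conn, emails, batch_size=50):
    """Insert emails in batches; if a batch hits a UNIQUE violation,
    retry that batch one row at a time. Returns (saved, skipped)."""
    saved = skipped = 0
    for start in range(0, len(emails), batch_size):
        batch = emails[start:start + batch_size]
        try:
            with conn:  # one transaction per batch, rolled back on error
                conn.executemany(INSERT, [(e,) for e in batch])
            saved += len(batch)
        except sqlite3.IntegrityError:
            # The whole batch was rolled back: save each contact alone.
            for e in batch:
                try:
                    with conn:
                        conn.execute(INSERT, (e,))
                    saved += 1
                except sqlite3.IntegrityError:
                    skipped += 1  # duplicate, already stored
    return saved, skipped

# Demo: one address already exists, so only the first batch falls back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (email TEXT NOT NULL UNIQUE)")
conn.execute(INSERT, ("user7@example.com",))
conn.commit()

emails = [f"user{i}@example.com" for i in range(120)]
result = save_contacts(conn, emails)
print(result)  # (119, 1)
```

Only the batch containing the duplicate pays the one-by-one price; the
other batches each commit in a single transaction.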
Thank you all for the help!
Yours
Miguel Arroz
On 2008/01/15, at 15:19, Mike Schrag wrote:
1) Do a fetch request to get the contacts with the emails of the
100-contact batch (i.e., blablabla where email = email1 or email =
email2 or email = email3 ...).
2) Remove duplicates in memory using a fast method, like putting
the stuff in NSSets or whatever.
3) Try to save again. Of course, it may still fail (concurrency
sucks) but the probability is much lower.
This all assumes that the UNIQUE-related exception is thrown when
the first offending object is inserted, so I won't get all the
information I need in one single exception, which I'm not yet 100%
sure is true.
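The fetch-and-dedupe steps quoted above might look like the sketch
below. Again this is Python with sqlite3 standing in for the real
stack, not EOF code; the `contact` table and `drop_known_emails` are
hypothetical names, and `IN (...)` is used as an equivalent of the
chain of `email = ... or email = ...` conditions.

```python
import sqlite3

def drop_known_emails(conn, batch):
    """Return the emails in `batch` that are not stored yet,
    in their original order and without repeats."""
    if not batch:
        return []
    marks = ",".join("?" * len(batch))  # "?,?,?" for three emails
    rows = conn.execute(
        f"SELECT email FROM contact WHERE email IN ({marks})", batch)
    seen = {row[0] for row in rows}  # emails already in the database
    fresh = []
    for e in batch:
        if e not in seen:  # also drops repeats inside the batch
            seen.add(e)
            fresh.append(e)
    return fresh

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (email TEXT NOT NULL UNIQUE)")
conn.execute("INSERT INTO contact (email) VALUES ('a@example.com')")
conn.commit()

batch = ["a@example.com", "b@example.com", "b@example.com", "c@example.com"]
fresh = drop_known_emails(conn, batch)
print(fresh)  # ['b@example.com', 'c@example.com']

# The save can still fail if another writer got there in between.
with conn:
    conn.executemany("INSERT INTO contact (email) VALUES (?)",
                     [(e,) for e in fresh])
```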
Depending on how your unique constraint is configured, it may throw
when the first conflicting insert happens or at the end of the
commit (this is that "deferrable initially deferred" thing, which
I've honestly never tried on a unique constraint, but presumably it
works the same).
The only thing I would consider is how frequent conflicts will be.
If conflicts will be frequent, it may be cheaper to fetch dupes
first to weed them out (so you're not constantly failing out 100-
insert blocks).
I think if I were in your position I would just benchmark:
1) committing one at a time -- this is logically the easiest, but
it may be that the overhead for this is way too high ... but WO
doesn't batch inserts ANYWAY, so who knows
2) fetching 100, comparing, deduping, then inserting and committing
3) inserting 100, committing, catch exception (fetch 100,
comparing, deduping, inserting, rinse and repeat)
You might also just benchmark the fetching and the inserting
independently so you know the relative cost of 100 of each for your
average data.
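A rough harness for that kind of benchmark could look like this. It
is only a sketch in Python against an in-memory SQLite database, so
the numbers it prints say nothing about WebObjects or a real server;
the table and all function names are invented for the example. It
covers the conflict-free case and times inserting and fetching
separately.

```python
import sqlite3
import time

INSERT = "INSERT INTO contact (email) VALUES (?)"

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contact (email TEXT NOT NULL UNIQUE)")
    return conn

def timed(fn, *args):
    """Run fn(*args) and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def insert_one_by_one(conn, emails):
    for e in emails:
        with conn:  # commit after every single insert
            conn.execute(INSERT, (e,))

def insert_in_batches(conn, emails, batch_size=100):
    for start in range(0, len(emails), batch_size):
        with conn:  # commit once per batch
            conn.executemany(
                INSERT, [(e,) for e in emails[start:start + batch_size]])

def fetch_in_batches(conn, emails, batch_size=100):
    for start in range(0, len(emails), batch_size):
        chunk = emails[start:start + batch_size]
        marks = ",".join("?" * len(chunk))
        conn.execute(
            f"SELECT email FROM contact WHERE email IN ({marks})",
            chunk).fetchall()

emails = [f"user{i}@example.com" for i in range(1050)]
db_a, db_b = make_db(), make_db()
timings = {
    "insert one by one": timed(insert_one_by_one, db_a, emails),
    "insert in batches of 100": timed(insert_in_batches, db_b, emails),
    "fetch 100 at a time": timed(fetch_in_batches, db_b, emails),
}
for label, secs in timings.items():
    print(f"{label}: {secs:.4f} s")
```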
ms
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Webobjects-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
40guiamac.com
This email sent to email@hidden
Miguel Arroz
http://www.terminalapp.net
http://www.ipragma.com