Re: Avoiding duplicate records
- Subject: Re: Avoiding duplicate records
- From: "Daniele Corti" <email@hidden>
- Date: Tue, 15 Jan 2008 16:21:08 +0100
2008/1/15, Miguel Arroz <email@hidden>:
Hi!
I'm thinking about how to approach the following problem, and I would
like to know opinions about this, because I may be overcomplicating
this, as I often do.
I need to manage contact lists. A contact is an object with an
email, first name, last name, and some flags. The important thing is
the email; that's what makes a contact unique.
A contact list may have tens of thousands of contacts (this is not
a theoretical limit, it's a requirement), and cannot have duplicate
records (ie, two contacts with the same email).
Well, my first approach is to create a restriction on the DB that
will prevent the existence of two records with the same email on the
same contact list.
Then, let's suppose I have a contact list with 10k contacts, and
I'm adding another 10k contacts. The basic approach is:
1) Divide the 10k in batches of 100, to make this manageable.
2) Try to insert the 100 contacts.
3) If an exception is raised due to the UNIQUE constraint, remove the
offending object and try again.
This has an obvious problem, which is the fact that in the worst
case, the 100 contacts may be repeated, making this very inefficient.
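The batching in step 1 could be sketched in plain Java roughly like this. The class and method names are made up for illustration, and nothing here touches EOF; it only splits a list into chunks of a given size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of EOF or WebObjects.
class ContactBatcher {
    /** Splits items into consecutive sublists of at most batchSize elements. */
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            result.add(new ArrayList<>(items.subList(start, end)));
        }
        return result;
    }
}
```

With 10k contacts and a batch size of 100 this yields 100 batches, each of which can be saved (and retried) on its own.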
So, what I thought was, if I have a failure:
1) Do a fetch request to get the contacts with the emails of the
100 contacts batch (ie, blablabla where email = email1 or email =
email2 or email = email3 ...).
Sorry, but this is very ugly... you should use something like InSetQualifier (e.g. "WHERE a IN (1, 2, 3, ...)").
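To show what the IN form looks like at the SQL level, here is a small sketch that builds a parameterized clause for one batch. It is plain Java, not the InSetQualifier API; the column name `email` and the class name are assumptions for illustration, and the values would be bound separately (for example with JDBC's PreparedStatement.setString).

```java
import java.util.Collections;

// Hypothetical helper; the column name "email" is an assumption.
class InClause {
    /** Builds "email IN (?, ?, ...)" with one placeholder per value. */
    static String emailInClause(int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        // One "?" per email, so the values are bound rather than concatenated.
        return "email IN (" + String.join(", ", Collections.nCopies(count, "?")) + ")";
    }
}
```

Some databases limit the number of items in an IN list, so one clause per batch of 100 fits this approach better than one clause for all 10k emails.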
2) Remove duplicates in memory using a fast method, like putting
the stuff in NSSets or whatever.
3) Try to save again. Of course, it may still fail (concurrency
sucks) but the probability is much lower.
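The in-memory filter of step 2 might look something like the sketch below. It uses java.util.HashSet instead of NSSet, the names are hypothetical, and comparing emails case-insensitively is an assumption that may not match your uniqueness rule.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of EOF.
class BatchFilter {
    /** Returns the batch emails that are not among the already stored ones. */
    static List<String> withoutExisting(List<String> batch, List<String> existing) {
        // Assumption: emails are compared case-insensitively.
        Set<String> stored = new HashSet<>();
        for (String email : existing) {
            stored.add(email.toLowerCase(Locale.ROOT));
        }
        List<String> fresh = new ArrayList<>();
        for (String email : batch) {
            if (!stored.contains(email.toLowerCase(Locale.ROOT))) {
                fresh.add(email);
            }
        }
        return fresh;
    }
}
```

Here `existing` would be the emails returned by the fetch in step 1. A concurrent insert can still cause a failure on save, so the retry in step 3 remains necessary.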
This is all thought with the assumption that the UNIQUE-related
exception is thrown when the first offending object is inserted, so I
won't get all the information I need in one single exception, which
I'm not 100% sure is true yet.
So... suggestions! Is this too crappy? :)
In which format do you have your contacts?
BTW, I think the best way is to change the approach and parse the contacts before the insert phase.
It could be an idea to look for duplicates within the list you are going to insert into the DB. I mean, if you have a list with:
email@hidden
email@hidden
email@hidden
email@hidden
you should remove the duplicates before trying to insert, and you can do this in Java, outside EOF. This should produce fewer exceptions during the insert.
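A minimal sketch of that pre-insert cleanup in plain Java, outside EOF. The class name is hypothetical, and normalizing by trimming and lower-casing is an assumption about what counts as "the same email".

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of EOF.
class InputDeduper {
    /** Returns the normalized emails with duplicates removed, keeping first-seen order. */
    static List<String> uniqueEmails(List<String> emails) {
        // LinkedHashSet drops repeats while preserving insertion order.
        Set<String> unique = new LinkedHashSet<>();
        for (String email : emails) {
            // Assumption: surrounding whitespace and letter case do not matter.
            unique.add(email.trim().toLowerCase(Locale.ROOT));
        }
        return new ArrayList<>(unique);
    }
}
```

This only removes duplicates inside the incoming list; duplicates against rows already in the database still have to be handled by a fetch or by the UNIQUE constraint.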
You can also reorder the contacts before inserting them, for example by ordering them by host name and fetching by host name: if the host isn't in the DB yet, you can insert every one of its contacts; otherwise you will have very few records to search in.
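Grouping by host name could be sketched as below, again in plain Java with made-up names. Treating everything after the last "@" as the host, and lower-casing it, are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of EOF.
class HostGrouper {
    /** Groups emails by host name; the map is sorted by host. */
    static Map<String, List<String>> byHost(List<String> emails) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String email : emails) {
            int at = email.lastIndexOf('@');
            // Addresses without "@" end up under the empty host.
            String host = at < 0 ? "" : email.substring(at + 1).toLowerCase(Locale.ROOT);
            groups.computeIfAbsent(host, key -> new ArrayList<>()).add(email);
        }
        return groups;
    }
}
```

Each group can then be checked against the DB with one fetch per host, and groups whose host is not present at all can be inserted without further checks.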
One more thing: which DB do you use?
Yours
Miguel Arroz
http://www.terminalapp.net
http://www.ipragma.com
_______________________________________________
Do not post admin requests to the list. They will be ignored.
AIM: S0CR4TE5
Messenger: email@hidden
--
Computers are like air conditioners -- they stop working properly if you open
WINDOWS
--
What about the four lusers of the apocalypse? I nominate:
"advertising", "can't log in", "power switch" and "what backup?"
--Alistair Young
Webobjects-dev mailing list (email@hidden)