Re: duplicates in a file
- Subject: Re: duplicates in a file
- From: "Arthur J. Knapp" <email@hidden>
- Date: Fri, 19 Jul 2002 16:44:28 -0400
> Date: Fri, 19 Jul 2002 09:09:53 -0500
> Subject: duplicates in a file
> From: Rick Norman <email@hidden>
>
> Anybody got any ideas on removing duplicate entries in a tab delimited file.
> I used the following script comparing "C1R1" against "C1R2" and so on. After
> looking a little closer, I realize that there needs to be a little more
> considered here, see the snip of the data to see what I'm referring to. This
> Excel script is excruciatingly slow, I'm hoping that this can be done with
> BBEdit.

So is the actual file tab-delimited, or is it in Excel format?
In any case, there are two separate issues, comparing and removing
duplicates, so you may want to look at this as two separate functions.

"Roger Jr	1234 Dalrymple Ct.	Jackson MS	39211
Roger Jr	4321 Highway 22	Madison MS	39110
Clark Hughes	1111 Arbour Court	Carthage MS	39501
Clark & Lucy Hughes	1111 Arbour Court	Carthage MS	39501
Daryl & Kirby Neal	P. O. Box 1234	Jackson MS	39207
Daryl Neal	PO Box 1234	Jackson MS	39207"
set str to result
(* Extract records
*)
set rows to str's paragraphs -- breaks at approx. 4050 lines
(* Extract columns
*)
set text item delimiters to tab
repeat with col in rows
    set col's contents to col's text items
end repeat
set text item delimiters to {""}
(* 2-loop traversal
*)
repeat with i from 2 to rows's length -- check row i
    repeat with j from 1 to (i - 1)
        if (rows's item j is not missing value) then
            DataIsDuplicate(rows's item i, rows's item j)
            if (result is true) then
                set rows's item i to missing value -- removes duplicate
                exit repeat -- row i is gone, so stop comparing it
            end if
        end if
    end repeat
end repeat
set rows to rows's lists --> remove missing values
(* ? Back to Tab-delimited ?
*)
set text item delimiters to tab
repeat with row in rows
    set row's contents to "" & row
end repeat
set text item delimiters to {""}
(* ? Back to Return-delimited ?
*)
set text item delimiters to return
set rows to "" & rows
set text item delimiters to {""}
rows --> "Roger Jr 1234 Dalrymple Ct. Jackson MS 39211..."
on DataIsDuplicate(row1, row2)
    -- item 1 = name
    -- 2 = street
    -- 3 = city state
    -- 4 = zip code
    set xName to 1
    set xStreet to 2
    set xCityState to 3
    set xZipCode to 4
    -- With the zip code provided, we can skip any check
    -- of the city/state column, (unless someone knows
    -- of two cities that share the same zip code [which
    -- is NOT the same as one city having more than one
    -- zip code]).
    (* "P. O. Box 1234" == "PO Box 1234"
    *)
    ignoring case, punctuation and white space
        row1's item xZipCode is not row2's item xZipCode
        if result then return false
        row1's item xStreet is not row2's item xStreet
        if result then return false
        return true
    end ignoring
end DataIsDuplicate
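
To see why the "ignoring" block treats the two PO Box spellings as the same street, here is a small stand-alone check. It is only a sketch: the variable names (boxA, boxB, looseMatch, strictMatch) are made up for illustration, and the strings are taken from the sample data above.

```applescript
-- Illustrative only: variable names are hypothetical, strings are from the sample data.
set boxA to "P. O. Box 1234"
set boxB to "PO Box 1234"

-- Same attributes the handler uses: periods and spaces are skipped.
ignoring case, punctuation and white space
    set looseMatch to (boxA is boxB)
end ignoring

-- With punctuation and white space considered, every character counts.
considering punctuation and white space
    set strictMatch to (boxA is boxB)
end considering

{looseMatch, strictMatch} --> {true, false}
```

Note that the handler never looks at the name column, so "Clark Hughes" and "Clark & Lucy Hughes" count as duplicates purely because their street and zip code match.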
{ Arthur J. Knapp, of <http://www.STELLARViSIONs.com>
a r t h u r @ s t e l l a r v i s i o n s . c o m
}
_______________________________________________
applescript-users mailing list | email@hidden