Removing duplicate records in a database

beno · Post by **beno** » Mon May 23, 2005 5:41 pm

Hi,

Well my question is not about NB, is about DBs but I put it here, maybe someone can hekp me.

I have a NBDB with 11,000 records. Different persons use it and I'm affraid now it has duplicate records.

This database is a medical app wich stores patients records. Each record is for a particular patient. We have its name, sex, age, history, evolution, etc.

I know that if I sort the name field, duplicates will appear together... but if I have 11,000 it is going to be a bad idea to try to check the list manually.

There is no key for each record, just the fields described..

Mmmm, If I try to compare each one with the rest... this is going to be real slow...

Maybe I can compare each person name with the records that only starts with the same letter... as I type this some ideas are comming to my neurons...

Thanks for your time...

Saludos

beno

Leos · Post by **Leos** » Mon May 23, 2005 6:43 pm

hi Beno,

My suggestion:

in this case i use dbfquery command

....
dbfQuery "sample.dbf" "Name==Beno AND cidade==mexico"

if [dbfQueryResult]>0
... you have great chances that it is a duplicated regist

so, use

dbfDeleteRecord
dbfshowall
dbfNext

you can do this procedure with a 'loop', or 'while'

I think this is not to slow with 11.000 records (i do this with a database with over 70.000 records, without problems... (just about one hour)

THIS is one more useful command that i suggest to include in the next update from this GREAT plugin

...find duplicated regists

Gaev · Post by **Gaev** » Mon May 23, 2005 6:47 pm

beno:

If I sort the name field, duplicates will appear together... If I try to compare each one with the rest... this is going to be real slow...

You only have to compare each current record with the previous (not all the rest) ... something like this ...

dbfSort "myDB" ...
SetVar "[Previous.Name]" ""
SetVar "[Previous.Age]" ""

Loop "1" "[dbfRecordCount]" "[counter]"

If "[myDB.Name]" "=" "[Previous.Name]"
If "[myDB.Age]" "=" "[Previous.Age]"
... same as previous record ... delete it ?
EndIf
EndIf

SetVar "[Previous.Name]" "[myDB.Name]"
SetVar "[Previous.Age]" "[myDB.Age]"
dbfNext "myDB"

EndLoop

... the key question is going to be which of the duplicate records should be deleted ? or are they 100% identical ?

If looping through 11,000 records takes too long ... the only thing I can suggest is ... a faster (make that super fast) machine.

beno · Post by **beno** » Mon May 23, 2005 7:06 pm

Gaev,

You are right!

Once sorted the DB I have only to compare with previous and next.

Possibly do a loop if there are many duplicates.

Thanks a lot,

beno

PS thanks for your time too Leos!

Sam Cox · Post by **Sam Cox** » Tue May 24, 2005 6:12 am

Hi beno,

Gaev's solution is a good one -- and it's now in my notebook!

Can you improve your medical record database program so that continued use will not result in more duplicate entries?

-- Sam

beno · Post by **beno** » Tue May 24, 2005 6:56 am

Hi Sam,

Good idea, now I think I can add your idea: check the existence of records using the patient's name before adding it to the database...

Thanks for your help,

beno

cp4w · Post by **cp4w** » Thu Aug 18, 2005 4:10 pm

Have not the patients a personal code, like fiscal code or medical public service code?

Bye
Licia