nitingupta183 wrote:
> Hi all,
>
> I am supposed to add a feature in which my app will detect the duplicate
> contacts of a user on the basis of their name, email, mobile number
> etc.(i.e. Contacts Duplicate Killer kind of feature). The simplest algo i
> can think of is find all the contacts on the basis of their name, email and
> mobile and then run the loop to determine which all contacts have similar
> entries. But i think this algo will have worst performance.

Try to prune your search space. It is reasonable to assume that there are
not too many duplicates overall. You can use IndexReader.terms() to get a
list of terms and then docFreq() to check the number of documents
containing each term.

For example, enumerate all email terms and process only those whose
docFreq is > 1. Add the documents for each such email term to a
"possibly identical contacts" container. Repeat the same with birth
dates, phone numbers and names, preferably with some normalization.

Then merge those "possibly identical contacts" containers that share a
common document. Example:

Container 1    Container 2          Merged container
A, B           B, C          --->   A, B, C

(Implementation note: keep track of the list of containers each document
is in using a look-up table: A -> 1; B -> 1,2,3,6; C -> 2; etc.)

Then compare the documents inside each container with each other and
decide which contacts you want to merge and which not.

> I am currently using Hibernate. I got to know about Hibernate Search/Lucene.
> Can I use these solutions for this task. I am asking this on the basis that
> Lucene already implements algos such as Levenshtein_distance. May be I can
> harness the Lucene power to make this task efficient.

Try using a Soundex or Metaphone analyzer for similarity. They map
similar-sounding strings to a single value and are much easier to handle
in the Lucene framework than numeric measures like Levenshtein distance.
There are examples in Lucene contrib.

--
Rene Wiermer (Software Developer/Systems Engineer)
LWsystems GmbH & Co. KG
http://www.lw-systems.de/impressum
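
The pruning step from the reply (keep only terms with docFreq > 1) can be sketched roughly as follows. This is only an illustration: a plain in-memory map stands in for the Lucene index, so nothing here is Lucene API, and the names DuplicateCandidates, normalizeEmail and candidateContainers are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch only: a map of docId -> email stands in for the indexed field.
class DuplicateCandidates {

    /** Hypothetical normalization: trim whitespace and lower-case. */
    static String normalizeEmail(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Groups document ids by normalized term and keeps only the groups
     * with more than one document (the "docFreq > 1" filter).
     */
    static List<Set<Integer>> candidateContainers(Map<Integer, String> emailByDocId) {
        Map<String, Set<Integer>> postings = new TreeMap<>();
        for (Map.Entry<Integer, String> e : emailByDocId.entrySet()) {
            postings.computeIfAbsent(normalizeEmail(e.getValue()), k -> new TreeSet<>())
                    .add(e.getKey());
        }
        List<Set<Integer>> containers = new ArrayList<>();
        for (Set<Integer> docs : postings.values()) {
            if (docs.size() > 1) {
                containers.add(docs); // a "possibly identical contacts" container
            }
        }
        return containers;
    }

    public static void main(String[] args) {
        Map<Integer, String> emails = new HashMap<>();
        emails.put(1, "anna@example.org");
        emails.put(2, " Anna@Example.org ");
        emails.put(3, "bob@example.org");
        System.out.println(candidateContainers(emails)); // prints [[1, 2]]
    }
}
```

The same grouping would be repeated per field (phone, birth date, name), each with its own normalization, before the containers are merged.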
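
Merging containers that share a document, as in the A,B + B,C -> A,B,C example, could look something like the sketch below. Note that it uses a union-find structure rather than the document -> container look-up table from the implementation note; that is a substitution for brevity, not what the reply prescribes. ContainerMerger and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch only: union-find over document ids (here plain strings).
class ContainerMerger {

    /** Returns the representative of x, compressing the path on the way. */
    static String find(Map<String, String> parent, String x) {
        parent.putIfAbsent(x, x);
        String root = x;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        String cur = x;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    /** Merges all containers that share at least one document. */
    static List<Set<String>> merge(List<Set<String>> containers) {
        Map<String, String> parent = new HashMap<>();
        for (Set<String> container : containers) {
            String first = null;
            for (String doc : container) {
                if (first == null) {
                    first = doc;
                    find(parent, doc); // register the document
                } else {
                    // link this document's group to the group of the first one
                    parent.put(find(parent, doc), find(parent, first));
                }
            }
        }
        Map<String, Set<String>> merged = new TreeMap<>();
        for (String doc : new ArrayList<>(parent.keySet())) {
            merged.computeIfAbsent(find(parent, doc), k -> new TreeSet<>()).add(doc);
        }
        return new ArrayList<>(merged.values());
    }

    public static void main(String[] args) {
        List<Set<String>> containers = new ArrayList<>();
        containers.add(new TreeSet<>(Arrays.asList("A", "B")));
        containers.add(new TreeSet<>(Arrays.asList("B", "C")));
        containers.add(new TreeSet<>(Arrays.asList("D", "E")));
        System.out.println(merge(containers)); // prints [[A, B, C], [D, E]]
    }
}
```

Only the documents inside one merged container then need the expensive pairwise comparison.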
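
To show what "map similar-sounding strings to a single value" means, here is a small hand-rolled version of classic American Soundex. It is not the Lucene contrib analyzer and not a library implementation; SimpleSoundex is a hypothetical name, and the code simply skips anything outside A-Z (so umlauts are ignored). In practice an existing, tested encoder would probably be the better choice.

```java
import java.util.Locale;

// Sketch only: simplified American Soundex for ASCII letters.
class SimpleSoundex {

    // Code per letter A..Z: digits are consonant classes,
    // '0' marks vowels and Y, '-' marks H and W.
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // assumption: non-ASCII letters are ignored
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
            } else if (code == '-') {
                continue; // H and W do not separate equal codes
            } else {
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code; // a vowel resets the "previous code"
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad to letter + three digits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert")); // prints R163
        System.out.println(encode("Rupert")); // prints R163
        System.out.println(encode("Meyer"));  // prints M600
        System.out.println(encode("Meier"));  // prints M600
    }
}
```

Indexing such a code as an extra field would let names like "Meyer" and "Meier" end up under the same term, so the docFreq-based pruning applies to them as well.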