Re: the class SimilarityComparator in similarity.py. The final check is:

    # Here, we have found at least one common account with a close
    # amount. Now, we require that the set of accounts are equal or that
    # one be a subset of the other.
    return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)

I've instead been using a slightly modified version that only requires the two account sets to intersect:

    # Accept any pair of entries that share at least one account.
    return bool(accounts1.intersection(accounts2))

For my use cases, this has worked better in every case. The common case is a credit card transaction that gets modified after import (e.g. its expense posting is recategorized). On a subsequent import with an overlapping date range, the original heuristic fails to dedupe it, because neither account set is a subset of the other anymore; an intersection check still matches on the shared credit card account.
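To make that concrete, here's a small standalone sketch of the two checks. The account names are hypothetical, and the close-amount test that SimilarityComparator also performs is left out:

    # Hypothetical example: account sets of a credit card transaction as
    # imported, and of the same transaction after its expense posting was
    # recategorized in the ledger.
    imported = {"Liabilities:CreditCard", "Expenses:Uncategorized"}
    edited = {"Liabilities:CreditCard", "Expenses:Groceries"}

    # Original heuristic: one account set must contain the other.
    print(imported.issubset(edited) or edited.issubset(imported))  # False -> not deduped

    # Intersection heuristic: a single shared account is enough.
    print(bool(imported.intersection(edited)))                     # True -> deduped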
I can't help but wonder whether this would be universally better for everyone. Thoughts? If not, perhaps an option could help users fine-tune the behavior for their use cases. Suggestions:

    --aggressive_match
    --heuristic=match_on_one_common_posting   (--heuristic would take a list)

Making dedupe detection better further cuts down ingest effort (see the five-minute ledger update article):
https://reds-rants.netlify.app/personal-finance/the-five-minute-ledger-update/

Martin, would you be opposed to one of the approaches above?

Thanks,
-red