On 30/03/2021 12:57, redst...@gmail.com wrote:
> Reg. class SimilarityComparator in similarity.py:
>
> The final check is:
>
>     # Here, we have found at least one common account with a close
>     # amount. Now, we require that the set of accounts are equal or that
>     # one be a subset of the other.
>     return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)
>
> I've been instead using a slightly modified version, where I just check
> for intersection:
>
>     return accounts1.intersection(accounts2)
>
> For my use cases, this has worked better in every case. The common case
> is an import of a credit card transaction that is modified post-import.
> On a subsequent import (with an overlapping date range), dedupe does not
> work with the original heuristic.
>
> I can't help but wonder if this would be universally better for
> everyone. Thoughts?
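To make the difference concrete, here is a standalone sketch of just the final account-set check for the credit-card case you describe (this is not beancount code; the account names are invented, and the close-amount check is assumed to have already matched):

    # Standalone illustration, not beancount code. Account names are invented.
    def subset_check(accounts1, accounts2):
        # Original heuristic: one account set must contain the other.
        return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)

    def intersection_check(accounts1, accounts2):
        # Relaxed heuristic: any shared account is enough.
        return bool(accounts1 & accounts2)

    # Transaction as imported from the credit card statement.
    imported = {"Liabilities:CreditCard", "Expenses:Uncategorized"}
    # The same transaction after being re-categorized in the ledger.
    edited = {"Liabilities:CreditCard", "Expenses:Dining"}

    print(subset_check(imported, edited))        # False -> not flagged as duplicate
    print(intersection_check(imported, edited))  # True  -> flagged as duplicate

With the original check, the re-imported transaction is not matched against its edited counterpart, which is the false negative you are seeing.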
What counts as a duplicate entry in the fuzzy matching implemented by SimilarityComparator depends on how the ledger is organized. For example, I have the impression that your relaxed check would mark as duplicates all pairs of transactions that use a transfer account to record transactions posted and cleared on different days.

Deduplication in beancount.ingest is handled by a hook that can be customized, so I don't think there is a need to provide command line switches. In Beangulp, the successor of beancount.ingest, deduplication will be delegated to the importer (with a default implementation doing pretty much what the current one does), which will allow finer (and easier) customization.

Cheers,
Dan
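P.S. To illustrate the transfer-account case with invented account names (again not beancount code, just the two set checks in isolation, assuming the two amounts happen to be close enough to pass the earlier amount check):

    # Paying a credit card through a transfer account, with the outgoing and
    # incoming legs booked on different days as two separate transactions.
    leg_out = {"Assets:Checking", "Assets:Transfer"}
    leg_in = {"Liabilities:CreditCard", "Assets:Transfer"}

    # Relaxed check: the shared transfer account alone flags a duplicate.
    print(bool(leg_out & leg_in))  # True

    # Original check: neither account set contains the other, so no duplicate.
    print(leg_out.issubset(leg_in) or leg_in.issubset(leg_out))  # False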