On 30/03/2021 12:57, redst...@gmail.com wrote:
> Reg. class SimilarityComparator in similarity.py:
>
> The final check is:
>
>     # Here, we have found at least one common account with a close
>     # amount. Now, we require that the set of accounts are equal or that
>     # one be a subset of the other.
>     return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)
>
> I've been instead using a slightly modified version, where I just check
> for intersection:
>
>     return accounts1.intersection(accounts2)
>
> For my use cases, this has worked better in every case. The common case
> is an import of a credit card transaction that is modified post-import.
> On a subsequent import (with an overlapping date range), dedupe does not
> work with the original heuristic.
>
> I can't help but wonder if this would be universally better for
> everyone. Thoughts?
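To make the difference concrete, here is a standalone sketch of just the final account-set check for the credit-card case you describe (this is not beancount code; the account names are invented, and the close-amount check is assumed to have already matched):

    # Standalone illustration, not beancount code. Account names are invented.
    def subset_check(accounts1, accounts2):
        # Original heuristic: one account set must contain the other.
        return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)

    def intersection_check(accounts1, accounts2):
        # Relaxed heuristic: any shared account is enough.
        return bool(accounts1 & accounts2)

    # Transaction as imported from the credit card statement.
    imported = {"Liabilities:CreditCard", "Expenses:Uncategorized"}
    # The same transaction after being re-categorized in the ledger.
    edited = {"Liabilities:CreditCard", "Expenses:Dining"}

    print(subset_check(imported, edited))        # False -> not flagged as duplicate
    print(intersection_check(imported, edited))  # True  -> flagged as duplicate

With the original check, the re-imported transaction is not matched against its edited counterpart, which is the false negative you are seeing.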
What counts as a duplicate entry in the fuzzy matching implemented by SimilarityComparator depends on how the ledger is organized. For example, I have the impression that your relaxed check would mark as duplicates all pairs of transactions that use a transfer account to record transactions posted and cleared on different days.

Deduplication in beancount.ingest is handled by a hook that can be customized, so I don't think there is a need to provide command line switches. In Beangulp, the successor of beancount.ingest, deduplication will be delegated to the importer (with a default implementation doing pretty much what the current one does), which will allow finer (and easier) customization.

Cheers,
Dan
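P.S. To illustrate the transfer-account case with invented account names (again not beancount code, just the two set checks in isolation, assuming the two amounts happen to be close enough to pass the earlier amount check):

    # Paying a credit card through a transfer account, with the outgoing and
    # incoming legs booked on different days as two separate transactions.
    leg_out = {"Assets:Checking", "Assets:Transfer"}
    leg_in = {"Liabilities:CreditCard", "Assets:Transfer"}

    # Relaxed check: the shared transfer account alone flags a duplicate.
    print(bool(leg_out & leg_in))  # True

    # Original check: neither account set contains the other, so no duplicate.
    print(leg_out.issubset(leg_in) or leg_in.issubset(leg_out))  # False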