I'm increasingly convinced this is a bug and submitted a PR to address it.

https://github.com/beancount/beangulp/pull/159

The deduplicate logic merges the newly imported & deduped entries into the 
existing entries because it looks at the existing entries to decide what to 
deduplicate. But handling it this way pollutes the existing_entries for any 
hooks that get run later. As one other example of the problem: if you run 
ML training on "existing_data" it is wrong (with varying degrees of 
wrongness) because your newly imported entries (which btw only have a 
single leg so aren't fully valid beancount yet) have been merged into the 
training data.

I think the right way to handle this is just track existing_entries and 
"newly imported and deduped entries" separately for purposes of 
deduplication.

Cheers,
Justus

On Monday, March 3, 2025 at 2:31:20 PM UTC+10:30 Justus Pendleton wrote:

> According to examples/import.py hooks take two parameters
>
> Args:
>     extracted_entries_list: A list of (filename, entries) pairs, where
> 'entries' are the directives extract from 'filename'.
>      ledger_entries: If provided, a list of directives from the existing
> ledger of the user. This is non-None if the user provided their
> ledger file as an option.
>
> Returns:
> A possibly different version of extracted_entries_list, a list of
> (filename, entries), to be printed.
>
> But "ledger_entries" is never None -- that is, it is non-None even if a 
> user didn't provide their ledger file as an option.
>
> This is because of the deduplicate logic in __init__.py/_extract which 
> extends existing entries before calling hooks
>
>     # Deduplicate.
>     for filename, entries, account, importer in extracted:
>         importer.deduplicate(entries, existing_entries)
>         existing_entries.extend(entries)
>
>     # Invoke hooks.
>     for func in ctx.hooks:
>         extracted = func(extracted, existing_entries)
>
> This was somewhat surprising to me (especially since it was contrary to 
> the quasi-documentation/comment I quoted) as I wouldn't expect (or want) to 
> have the newly imported entries merged into the existing entries before 
> hooks have run.
>
> Is this intended? Is there another easy way to get the pristine set of 
> entries from within a hook short of just running beancount.loader myself?
>
> My actual use case: I want to know the most recent Balance statement of an 
> account in the ledger, which I am using as a proxy for "last imported 
> date". But the most recent Balance statement I actually find will the one 
> auto-generated by beangulp and then merged into existing_entries.
>
> Cheers,
> Justus
>

-- 
You received this message because you are subscribed to the Google Groups 
"Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to beancount+unsubscr...@googlegroups.com.
To view this discussion visit 
https://groups.google.com/d/msgid/beancount/02d76426-ce8f-4617-992b-36e5f0344efbn%40googlegroups.com.

Reply via email to