I'm increasingly convinced this is a bug and submitted a PR to address it. https://github.com/beancount/beangulp/pull/159
The deduplicate logic merges the newly imported and deduplicated entries into the existing entries because it looks at the existing entries to decide what to deduplicate. But handling it this way pollutes existing_entries for any hooks that run later.

As one other example of the problem: if you run ML training on "existing_data", the result is wrong (to varying degrees) because your newly imported entries (which, by the way, only have a single leg and so aren't fully valid beancount yet) have been merged into the training data.

I think the right way to handle this is to track existing_entries and the "newly imported and deduped entries" separately for purposes of deduplication.

Cheers,
Justus

On Monday, March 3, 2025 at 2:31:20 PM UTC+10:30 Justus Pendleton wrote:

> According to examples/import.py hooks take two parameters
>
>     Args:
>       extracted_entries_list: A list of (filename, entries) pairs, where
>         'entries' are the directives extract from 'filename'.
>       ledger_entries: If provided, a list of directives from the existing
>         ledger of the user. This is non-None if the user provided their
>         ledger file as an option.
>
>     Returns:
>       A possibly different version of extracted_entries_list, a list of
>       (filename, entries), to be printed.
>
> But "ledger_entries" is never None -- that is, it is non-None even if a
> user didn't provide their ledger file as an option.
>
> This is because of the deduplicate logic in __init__.py/_extract which
> extends existing entries before calling hooks
>
>     # Deduplicate.
>     for filename, entries, account, importer in extracted:
>         importer.deduplicate(entries, existing_entries)
>         existing_entries.extend(entries)
>
>     # Invoke hooks.
>     for func in ctx.hooks:
>         extracted = func(extracted, existing_entries)
>
> This was somewhat surprising to me (especially since it was contrary to
> the quasi-documentation/comment I quoted) as I wouldn't expect (or want) to
> have the newly imported entries merged into the existing entries before
> hooks have run.
>
> Is this intended? Is there another easy way to get the pristine set of
> entries from within a hook short of just running beancount.loader myself?
>
> My actual use case: I want to know the most recent Balance statement of an
> account in the ledger, which I am using as a proxy for "last imported
> date". But the most recent Balance statement I actually find will be the one
> auto-generated by beangulp and then merged into existing_entries.
>
> Cheers,
> Justus
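
P.S. To make the "track them separately" idea concrete, here is a toy sketch. It is not the code from the PR and it does not import beangulp: entries are plain strings, toy_deduplicate is a hypothetical stand-in for importer.deduplicate (simplified to drop duplicates in place), and run_hooks_shared / run_hooks_separate are names I made up for illustration. The first function mirrors the loop quoted above; the second keeps the newly imported entries in their own list so hooks still see the untouched ledger.

```python
def toy_deduplicate(entries, seen):
    """Hypothetical stand-in for importer.deduplicate: drop entries already in seen."""
    entries[:] = [entry for entry in entries if entry not in seen]


def run_hooks_shared(extracted, existing_entries, hooks):
    """Mirrors the quoted loop: new entries are appended to existing_entries."""
    for _filename, entries in extracted:
        toy_deduplicate(entries, existing_entries)
        existing_entries.extend(entries)  # the ledger list is mutated here
    for hook in hooks:
        extracted = hook(extracted, existing_entries)
    return extracted


def run_hooks_separate(extracted, existing_entries, hooks):
    """Sketch of the alternative: newly imported entries live in their own list."""
    newly_imported = []
    for _filename, entries in extracted:
        # Deduplicate against ledger + earlier imports without touching the ledger.
        toy_deduplicate(entries, existing_entries + newly_imported)
        newly_imported.extend(entries)
    for hook in hooks:
        extracted = hook(extracted, existing_entries)
    return extracted


seen_by_hook = []


def recording_hook(extracted, ledger_entries):
    """Example hook that records a copy of the ledger it was handed."""
    seen_by_hook.append(list(ledger_entries))
    return extracted


def sample():
    """Fresh toy data: a one-entry ledger and two files with overlapping entries."""
    ledger = ["2025-01-01 balance"]
    extracted = [
        ("a.csv", ["2025-01-01 balance", "2025-03-01 txn"]),
        ("b.csv", ["2025-03-01 txn", "2025-03-02 txn"]),
    ]
    return ledger, extracted


if __name__ == "__main__":
    ledger, extracted = sample()
    run_hooks_shared(extracted, ledger, [recording_hook])
    print("shared:  ", seen_by_hook[-1])

    ledger, extracted = sample()
    run_hooks_separate(extracted, ledger, [recording_hook])
    print("separate:", seen_by_hook[-1])
```

In the shared version the hook receives the ledger plus both newly imported transactions; in the separate version it receives only the original balance entry, while cross-file duplicates are still removed.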