On Sun, Aug 7, 2016 at 12:36 PM, Simon Michael <[email protected]> wrote:
> Using a checksum for deduplication won't handle identical CSV records
> well, right? Those are unlikely with our usual banks but I think quite
> possible if you consider CSV data generally.
>
> Here's my recent plan for hledger.

Why reinvent the wheel? Write a custom HLedger printer (very simple) and
reuse all the Beancount tools for importing. See also:
https://bitbucket.org/blais/beancount/src/093c1ca595409bd86782f48a9abd915945484fe6/src/python/beancount/reports/convert_reports.py

> Let reading CSV files work as it currently does, but add a separate
> import command which does some additional things: append the converted
> entries to the main journal file, and save a position marker in
> CSVFILE.lastimport. Also, when reading, if the marker file is found,
> skip any CSV records before the marked position.
>
> This much is generic and could be used with any data format, but I think
> it makes sense mainly for "import" data which you want to move into a
> main journal, and which is somewhat sequential, e.g. CSV, OFX, QIF. It
> would be harder, and is probably not needed, for e.g. journal or
> timeclock files.
>
> For CSV specifically, I'm thinking the position marker will be the last
> CSV record processed. It could be elaborated later to consider
> timestamps, checksums, etc. if needed.
>
> On 8/6/16 10:19 PM, Erik Hetzner wrote:
>
>> Hi Martin,
>>
>> On Sat, 06 Aug 2016 21:16:40 -0700,
>> Martin Blais <[email protected]> wrote:
>>
>>> Storing a checksum for the imported row suffers from the problem that
>>> if the user does not immediately copy the result of our conversion, it
>>> will not be imported further; it could get lost.
>>>
>>> Beancount cross-checks extracted transactions against the contents of
>>> its destination ledger, but because the user often massages the
>>> transactions, it has to use heuristics in order to perform an
>>> approximate match to determine which transactions have already been
>>> seen. The heuristic I have in place doesn't work too well at the
>>> moment (but it could easily be improved, to be honest).
>>>
>>> A better idea would be to store a unique tag computed from the
>>> checksum of the input row and to cross-check the imported transactions
>>> against that special tag. That uses both your insight about validating
>>> the input instead of the resulting transaction, and uses the ledger
>>> instead of a temporary cache. It's the best of both worlds.
>>
>> Thanks for the comment. I'm not sure I understand the distinction that
>> you are making here. What I do (and I admit I only thought it through
>> for a few minutes, as I don't actually use Mint but just wanted a
>> simple CSV format for examples) is:
>>
>> 1. Take the input key-value pairs for the row, e.g. Date=2016/01/10.
>> 2. Sort by key.
>> 3. Generate a string from the key-value pairs and calculate the MD5
>>    checksum of that string.
>> 4. Check against a metadata value in ledger using the checksum:
>>    a. If the row has already been imported, do nothing.
>>    b. If the row is new (no match), import it.
>>
>> Here is an example of a generated ledger transaction:
>>
>>     2016/08/02 Amazon
>>         ; csvid: mint.a7c028a73d76956453dab634e8e5bdc1
>>         1234                  $29.99
>>         Expenses:Shopping    -$29.99
>>
>> As you can see, the csvid metadata field is what we query against,
>> using ledger, to see if the transaction is already present.
>>
>>> Similarly, Beancount has a powerful but admittedly immature CSV
>>> importer growing:
>>> https://bitbucket.org/blais/beancount/src/9f3377eb58fe9ec8cfea8d9e3d56f2446d05592f/src/python/beancount/ingest/importers/csv.py
>>>
>>> I've switched to using this and CSV file formats whenever I have them
>>> available: banking, credit cards, 401k.
>>>
>>> I'd like to make a routine to try to auto-detect the columns
>>> eventually; at the moment, they must be configured when creating the
>>> importer configuration.
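The four-step csvid scheme Erik describes above can be sketched in Python. The exact serialization he uses between steps 2 and 3 isn't shown in the thread, so the `key=value` join format and the helper names below are assumptions for illustration, not his actual code; only the `mint.<md5>` shape of the result is taken from his example.

```python
import hashlib

def csv_row_id(row, source="mint"):
    """Compute a dedup id for one CSV row: sort the key-value pairs by
    key, serialize them to a string, and take the MD5 hex digest,
    prefixed with a source name as in Erik's `mint.a7c0...` example."""
    serialized = ",".join(f"{k}={v}" for k, v in sorted(row.items()))
    digest = hashlib.md5(serialized.encode("utf-8")).hexdigest()
    return f"{source}.{digest}"

def should_import(row, known_ids):
    """Step 4: a row whose id already appears as a csvid metadata value
    in the ledger is skipped; a new row (no match) is imported."""
    return csv_row_id(row) not in known_ids
```

In practice `known_ids` would be the set of csvid values queried out of the existing ledger file before the import run; sorting by key makes the id stable regardless of column order in the source file.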
>>
>> Thanks for the pointer - it does look a lot more flexible than my
>> implementation.
>>
>> I decided it was simpler, for my needs, to require a new class for each
>> type of CSV file. It was too much trouble to try to make it
>> configurable. The core code handles reading the CSV file,
>> deduplicating, and all of that. The CSV class simply implements a
>> `convert(row)` method which returns a `Transaction` data structure. I
>> hope that if others need to import a particular type of CSV file, e.g.
>> from their bank, they can contribute it back to the project.
>>
>> best, Erik
>> --
>> Sent from my free software system <http://fsf.org/>.
>
> --
> You received this message because you are subscribed to the Google
> Groups "Ledger" group. To unsubscribe from this group and stop
> receiving emails from it, send an email to
> [email protected].

--
You received this message because you are subscribed to the Google Groups
"Beancount" group. To view this discussion on the web visit
https://groups.google.com/d/msgid/beancount/CAK21%2BhNZ7yPWpa00B7wQ7w03zPP2UKbYumV%3DsB93M_sNatHcZQ%40mail.gmail.com.
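Erik's class-per-file design (a core that reads and deduplicates, plus a small subclass per bank implementing `convert(row)`) can be sketched as follows. All class, field, and column names here are hypothetical; the thread only specifies the `convert(row)` -> `Transaction` contract.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Posting:
    account: str
    amount: str

@dataclass
class Transaction:
    date: date
    payee: str
    postings: list

class CsvImporter:
    """Core-side base class: the framework reads the CSV file and
    deduplicates rows; subclasses only describe how one row maps to a
    transaction."""
    def convert(self, row):
        raise NotImplementedError

class MyBankImporter(CsvImporter):
    """One importer per CSV format; columns and accounts are for an
    imaginary bank, purely illustrative."""
    def convert(self, row):
        y, m, d = (int(part) for part in row["Date"].split("/"))
        return Transaction(
            date=date(y, m, d),
            payee=row["Description"],
            postings=[
                Posting("Assets:Checking", "-" + row["Amount"]),
                Posting("Expenses:Unknown", row["Amount"]),
            ],
        )
```

With `csv.DictReader` feeding each row as a dict, contributing support for a new bank reduces to writing one such subclass, which is the extension point Erik hopes others will use.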

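Simon's proposed hledger import flow (save a position marker in CSVFILE.lastimport, and on later reads skip records up to the marked position) could look roughly like this. This is a language-agnostic sketch written in Python (hledger itself is Haskell), and the marker file name is the only detail taken from the proposal; everything else is assumed.

```python
import os

def read_new_records(csv_path, records):
    """Return only the CSV records not yet imported, using a
    CSVFILE.lastimport marker holding the last record processed, then
    advance the marker.  Treats each record as an opaque string."""
    marker_path = csv_path + ".lastimport"
    marker = None
    if os.path.exists(marker_path):
        with open(marker_path) as f:
            marker = f.read().rstrip("\n")
    if marker is not None and marker in records:
        # Skip everything up to and including the marked record.
        records = records[records.index(marker) + 1:]
    if records:
        with open(marker_path, "w") as f:
            f.write(records[-1] + "\n")
    return records
```

As Simon notes, a bare last-record marker assumes the data is sequential; elaborations (timestamps, checksums) would be needed for formats where identical records can legitimately repeat.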