(Sorry, hit "Send" too soon on the last try!) On Thu, 20 Dec 2018 at 17:22, Chris Angelico <ros...@gmail.com> wrote: > > On Fri, Dec 21, 2018 at 2:44 AM Paul Moore <p.f.mo...@gmail.com> wrote: > > > > I'm looking for a way to incrementally decode a JSON file. I know this > > has come up before, and in general the problem is not soluble (because > > in theory the JSON file could be a single object). In my particular > > situation, though, I have a 9GB file containing a top-level array > > object, with many elements. So what I could (in theory) do is to parse > > an element at a time, yielding them. > > > > The problem is that the stdlib JSON library reads the whole file, > > which defeats my purpose. What I'd like is if it would read one > > complete element, then just enough far ahead to find out that the > > parse was done, and return the object it found (it should probably > > also return the "next token", as it can't reliably push it back - I'd > > check that it was a comma before proceeding with the next list > > element). > > It IS possible to do an incremental parse, but for that to work, you > would need to manually strip off the top-level array structure. What > you'd need to use would be this: > > https://docs.python.org/3/library/json.html#json.JSONDecoder.raw_decode > > It'll parse stuff and then tell you about what's left. Since your data > isn't coming from a ginormous string, but is coming from a file, > you're probably going to need something like this: > > def get_stuff_from_file(f): > buffer = "" > dec = json.JSONDecoder() > while "not eof": > while "no object yet": > try: obj, pos = dec.raw_decode(buffer) > except JSONDecodeError: buffer += f.read(1024) > else: break > yield obj > buffer = buffer[pos:].lstrip().lstrip(",")
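Fleshing that sketch out a little, here's roughly what I intend to try.
The name iter_json_array and the 64KB default are just my choices, and
it's completely untested against the real 9GB file. It also assumes the
elements are objects, arrays or strings - a bare number that happened to
straddle a read boundary could be mis-parsed as complete:

    import json

    def iter_json_array(f, chunk_size=65536):
        """Yield the elements of a top-level JSON array read from the
        file object f, without loading the whole document into memory."""
        dec = json.JSONDecoder()
        buf = f.read(chunk_size).lstrip()
        # Strip the opening "[" by hand; raw_decode() would otherwise
        # try to parse the entire array in one go.
        if not buf.startswith("["):
            raise ValueError("expected a top-level JSON array")
        buf = buf[1:]
        while True:
            buf = buf.lstrip()
            # Refill until the next separator or element is visible.
            while not buf:
                chunk = f.read(chunk_size)
                if not chunk:
                    raise ValueError("unexpected EOF inside array")
                buf = chunk.lstrip()
            if buf.startswith("]"):
                return  # closing bracket: no more elements
            if buf.startswith(","):
                buf = buf[1:]  # separator before the next element
                continue
            try:
                obj, pos = dec.raw_decode(buf)
            except json.JSONDecodeError:
                # No complete element in the buffer yet - read more.
                chunk = f.read(chunk_size)
                if not chunk:
                    raise ValueError("unexpected EOF inside array")
                buf += chunk
            else:
                yield obj
                buf = buf[pos:]  # keep whatever wasn't consumed

Making the read size a parameter means I can tune it towards the
average element size, per your point about disk access below.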
> Proper error handling is left as an exercise for the reader, both in
> terms of JSON errors and file errors. Also, the code is completely
> untested. Have fun :)

Yeah, once you have the insight that you can attempt to parse a block
at a time, the rest is just a "simple matter of programming" :-)

> The basic idea is that you keep on grabbing more data till you can
> decode an object, then you keep whatever didn't get used up ("pos"
> points to whatever didn't get consumed). Algorithmic complexity should
> be O(n) as long as your objects are relatively small, and you can
> optimize disk access by tuning your buffer size to be at least the
> average size of an object.

Got it, thanks.

> Hope that helps.

Yes it does, a lot. Much appreciated.

Paul
--
https://mail.python.org/mailman/listinfo/python-list