On 12/20/2018 10:42 AM, Paul Moore wrote:
> I'm looking for a way to incrementally decode a JSON file. I know this
> has come up before, and in general the problem is not soluble (because
> in theory the JSON file could be a single object).

AFAIK, a JSON file always represents a single JSON item and is translated (decoded) to a single Python object.

The json encoder has an iterencode method, but the decoder does not have an iterdecode method. I think it plausibly should have one that would iterate through a top-level list or dict (JSON array or object).
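To make the asymmetry concrete, here is a small sketch using only the stdlib json module. The sample data is made up for illustration; iterencode yields the encoded text in pieces, while JSONDecoder offers no matching method.

```python
import json

# Hypothetical sample value, chosen only for illustration.
sample = [1, {"a": 2}]

encoder = json.JSONEncoder()

# iterencode yields the JSON text piece by piece instead of as one string.
chunks = list(encoder.iterencode(sample))
joined = "".join(chunks)
print(chunks)
print(joined)

# The decoder class has no incremental counterpart.
print(hasattr(json.JSONDecoder, "iterdecode"))
```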

> In my particular
> situation, though, I have a 9GB file containing a top-level array
> object, with many elements. So what I could (in theory) do is to parse
> an element at a time, yielding them.

So your file format, ignoring whitespace and the absence of a ',' after the last item, is '[' (item ',')* ']'. You want to skip over the '[' instead of creating an empty list, then yield each item rather than appending it to the list.

> The problem is that the stdlib JSON library reads the whole file,
> which defeats my purpose. What I'd like is if it would read one
> complete element, then just enough far ahead to find out that the
> parse was done, and return the object it found (it should probably
> also return the "next token", as it can't reliably push it back - I'd
> check that it was a comma before proceeding with the next list
> element).

I looked at json.decoder and json.scanner. After reading the whole file into a string, json decodes the string an item at a time with a scan_once(string, index) function that finds the end of the first item in the string. It then returns the decoded item and the index of where to continue scanning for the next item. If the string does not begin with a complete representation of an item, json.decoder.JSONDecodeError is raised. (The bare scan_once can also signal this with StopIteration, which JSONDecoder.raw_decode converts to JSONDecodeError.)
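A small sketch of that behaviour, using the public JSONDecoder.raw_decode method, which wraps scan_once. The sample string is made up; its last item is deliberately cut off to show the failure case.

```python
import json

decoder = json.JSONDecoder()

# Hypothetical sample text: two complete items, then a truncated string.
text = '{"a": 1}, [2, 3], "tail'

# Decode the item starting at index 0; the second value is where scanning stopped.
first, end1 = decoder.raw_decode(text, 0)          # ({'a': 1}, 8)

# raw_decode does not skip separators, so step over the ', ' by hand.
second, end2 = decoder.raw_decode(text, end1 + 2)  # ([2, 3], 16)

# The last item is cut off mid-string, so decoding it fails.
try:
    decoder.raw_decode(text, end2 + 2)
    truncated = False
except json.JSONDecodeError as exc:
    truncated = True
    print("incomplete item:", exc.msg)
```

Note that the error alone does not say whether the text is merely incomplete or actually malformed; the caller has to decide that from context.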

So I believe you could fairly easily write a function roughly as follows.
  open file and read and check the initial '['
  s = ''; idx = 0
  scan_once = make_scanner(context)
  # I did not figure out what 'context' should be
  while more in file:
    s += large chunk
    try:
      while True:
        ob, idx = scan_once(s, idx)
        yield ob
        skip whitespace and the ',' after the item (stop at ']')
    except (JSONDecodeError, StopIteration) as e:
      check that problem is incompleteness rather than bad format
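The outline above can be turned into something runnable along these lines. This is only a sketch under some assumptions, not a tested solution for a 9GB file: it uses the public JSONDecoder.raw_decode rather than make_scanner directly (JSONDecoder.__init__ calls make_scanner(self), so 'context' appears to be a decoder instance); the names iter_json_array, chunk_size and TERMINATORS are made up here; and it does not really distinguish "incomplete" from "bad format", it simply keeps reading and only gives up at end of file.

```python
import json

WHITESPACE = " \t\n\r"
TERMINATORS = WHITESPACE + ",]"   # characters that may legally follow an array element


def iter_json_array(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array from text file fp, one at a time."""
    decoder = json.JSONDecoder()
    buf = ""

    def more():
        # Append the next chunk to buf; return False at end of file.
        nonlocal buf
        chunk = fp.read(chunk_size)
        buf += chunk
        return bool(chunk)

    def skip_ws():
        # Drop leading whitespace, reading until a non-space character is available.
        nonlocal buf
        buf = buf.lstrip(WHITESPACE)
        while not buf:
            if not more():
                raise ValueError("unexpected end of file")
            buf = buf.lstrip(WHITESPACE)

    skip_ws()
    if buf[0] != "[":
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    skip_ws()
    if buf[0] == "]":
        return                      # empty array
    while True:
        skip_ws()
        while True:
            complete = False
            try:
                obj, end = decoder.raw_decode(buf)
                # Accept the item only if a terminator follows it; otherwise a
                # number such as 4.5 could be cut short at a chunk boundary.
                complete = end < len(buf) and buf[end] in TERMINATORS
            except json.JSONDecodeError:
                pass                # assume the item is incomplete and read more
            if complete:
                break
            if not more():
                raise ValueError("truncated or malformed JSON array")
        yield obj
        buf = buf[end:]             # discard consumed text so memory stays bounded
        skip_ws()
        if buf[0] == "]":
            return
        if buf[0] != ",":
            raise ValueError("expected ',' or ']' after array element")
        buf = buf[1:]
```

It would be used by opening the file in text mode and looping over iter_json_array(f). Two known weaknesses: a genuinely malformed element makes it read the rest of the file before raising, and a very large element is re-parsed from its start each time a chunk is added. Anything after the closing ']' is ignored.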


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list
