Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread Greg Ewing via Python-list

On 2/10/24 10:03 am, Left Right wrote:

> Consider also an interesting
> consequence of SCSI not being able to have infinite words: this means,
> besides other things, that fsync() is nonsense! :) If you aren't
> familiar with the concept: UNIX filesystem API suggests that it's
> possible to destage an arbitrarily large file (or a chunk of a file) to disk.
> But SCSI is built of finite "words" and to describe an arbitrarily large
> file you'd need to list all the blocks that constitute the file!


I don't follow. What fsync() does is ensure that any data buffered
in the kernel relating to the file is sent to the storage device.
It can send as many blocks of data over SCSI as required to
achieve this. There's no requirement for it to be atomic at the
level of the interface between the kernel and the hardware.

Some devices do their own buffering in ways that are invisible to
the software, so fsync() can't guarantee that the data is actually
written to the storage medium. But that's a problem stemming from
the design of the hardware, not the design of the protocol for
communicating with the hardware.

> the only way to implement fsync() in compliance with the
> standard is to sync _everything_

Again I'm not sure what you mean here. It may be difficult for the
kernel to track down exactly what data is relevant to a particular file,
and so the kernel programmers take the easy way out and just implement
fsync() as sync(). But again that has nothing to do with the protocol.
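
For concreteness, a minimal sketch of how a program asks for this from
Python, using only the standard library (the file name is made up):

    import os

    # Write some data and ask the OS to push it to the storage device.
    # fsync() returns once the kernel has handed the file's dirty buffers
    # to the device; it cannot see past any caching the device itself does.
    with open("data.json", "wb") as f:
        f.write(b'{"example": true}')
        f.flush()              # flush the userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to destage the file's blocks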

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


RE: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread AVI GROSS via Python-list
This discussion has become less useful.

We can all agree that in Computer Science, real infinities are avoided, and,
frankly, need not be taken seriously in any practical program.

You can represent all kinds of infinite expansions quite compactly, as with
a number whose digits you can derive to as many decimal places as you like.
Want 1/7 to a thousand decimal places? No problem. You can be given a digit
1 and a digit 7 and asked to carry out the division to as many digits as you
wish, deterministically. I can think of quite a few generators that could
easily supply the next digit, or simply keep yielding the elements of 142857
in a circular loop.
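
Both ideas in a minimal standard-library sketch:

    from decimal import Decimal, getcontext
    from itertools import cycle, islice

    # 1/7 to a thousand decimal places, computed deterministically on demand.
    getcontext().prec = 1001
    print(Decimal(1) / Decimal(7))

    # Or a generator that just keeps yielding the repeating block.
    def sevenths_digits():
        yield from cycle("142857")

    print("".join(islice(sevenths_digits(), 12)))  # -> 142857142857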

Sines, cosines, pi, e and so on can often be calculated to arbitrary
precision by evaluating something like an infinite Taylor series until the
terms fall below the precision of the data type holding the result.

Similar ideas allow generators to give you as many primes as you want, and
no more.
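
For instance, a lazy prime generator along these lines (a sketch, not tuned
for speed):

    from itertools import islice

    def primes():
        # Yield primes indefinitely; the caller decides when to stop.
        found = []
        candidate = 2
        while True:
            if all(candidate % p for p in found):
                found.append(candidate)
                yield candidate
            candidate += 1

    print(list(islice(primes(), 10)))  # the first ten primes, and no more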

So, if you can store arbitrary Python code as part of your JSON, you can
send quite a bit of somewhat compressed data.

The real problem is how the JSON is set up. If you take umpteen data
structures and wrap them all in something like a list, then it may be a tad
hard to stream as you may not necessarily be examining the contents till the
list finishes gigabytes later. But if, instead, you send lots of smaller
parts, such as perhaps sending each row of something like a data.frame
individually, the other side can recombine them incrementally to a larger
structure such as a data.frame and do some logic on it as it streams, such
as keeping only some columns and discarding the rest, or applying filters
that only keep rows you care about. And, of course, all rows could be
appended to one (or more) .CSV files as well, so if you need multiple
passes on the data, it can now be processed locally in various modes,
including "streamed".
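
A minimal sketch of that row-at-a-time idea, assuming the sender emits one
JSON object per line (JSON Lines); the field names here are made up:

    import json

    # Stream rows, filter as they arrive, and append survivors to a .CSV
    # so later passes can be done locally.
    with open("rows.jsonl") as src, open("kept.csv", "a") as dst:
        for line in src:
            row = json.loads(line)          # one small parse per row
            if row["score"] > 7.5:          # keep only rows you care about
                dst.write(f'{row["vuln_id"]},{row["score"]}\n')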

I think that for some purposes, it makes some sense to not stream anything
but results. I mean consider any database that allows a remote login and SQL
commands that only stream results. If I only want info on records about
company X between July 1 and September 15 of a particular year and only if
the amount paid remains zero or is less than the amount owed, ...
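
In that spirit, a sketch with the standard sqlite3 module; the WHERE clause
means only matching rows ever reach the application (table and column names
are made up):

    import sqlite3

    con = sqlite3.connect("billing.db")
    query = """
        SELECT company, owed, paid FROM invoices
        WHERE company = ? AND invoice_date BETWEEN ? AND ?
          AND (paid = 0 OR paid < owed)
    """
    # Iterating the cursor fetches rows incrementally, not all at once.
    for company, owed, paid in con.execute(query, ("X", "2024-07-01", "2024-09-15")):
        print(company, owed, paid)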


-Original Message-
From: Python-list  On
Behalf Of Greg Ewing via Python-list
Sent: Tuesday, October 1, 2024 5:48 PM
To: python-list@python.org
Subject: Re: Help with Streaming and Chunk Processing for Large JSON Data
(60 GB) from Kenna API

On 1/10/24 8:34 am, Left Right wrote:
> You probably forgot that it has to be _streaming_. Suppose you parse
> the first digit: can you hand this information over to an external
> function to process the parsed data? -- No! because you don't know the
> magnitude yet.

By that definition of "streaming", no parser can ever be streaming,
because there will be some constructs that must be read in their
entirety before a suitably-structured piece of output can be
emitted.

The context of this discussion about integers is the claim that
they *could* be parsed incrementally if they were written little
endian instead of big endian, but the same argument applies either
way.

-- 
Greg
-- 
https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread Dan Sommers via Python-list
On 2024-10-01 at 23:03:01 +0200,
Left Right  wrote:

> > If I recognize the first digit, then I *can* hand that over to an
> > external function to accumulate the digits that follow.
> 
> And what is that external function going to do with this information?
> The point is you didn't parse anything if you just sent the digit.
> You just delegated the parsing further. Parsing is only meaningful if
> you extracted some information, but your idea is, essentially "what if
> I do nothing?".

If the parser detects the first digit of a number, then the parser can
read digits one at a time (i.e., "streaming"), assimilate and accumulate
the value of the number being parsed, and successfully finish parsing
the number when it reads a non-digit.  Whether the function that accumulates
the value during the process is internal or external isn't relevant; the
point is that it is possible to parse integers from most significant
digit to least significant digit under a streaming model (and if you're
sufficiently clever, you can even write partial results to external
storage and/or another transmission protocol, thus allowing for numbers
bigger (as measured by JSON or your internal representation) than your
RAM).
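
A minimal sketch of such a parser; the accumulate callback stands in for
the hypothetical "external function":

    def parse_int_msd_first(chars, accumulate):
        # Consume digits one at a time, most significant digit first.
        # accumulate() is called once per digit with the running value,
        # so partial results can be handed off (or written to external
        # storage) while the stream is still arriving.
        value = 0
        for ch in chars:
            if not ch.isdigit():
                return value, ch        # remember the lookahead character
            value = value * 10 + int(ch)
            accumulate(value)
        return value, None

    print(parse_int_msd_first(iter("12345,"), print))  # stops at the comma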

At most, the parser has to remember the non-digit character it read so
that it (the parser) can begin to parse whatever comes after the number.
Does that break your notion of "streaming"?

Why do I have to start with the least significant digit?

> > Under that constraint, I'm not sure I can parse anything.  How can I
> > parse a string (and hand it over to an external function) until I've
> > found the closing quote?
> 
> Nobody says that parsing a number is the only pathological case.  You,
> however, exaggerate by saying you cannot parse _anything_. You can
> parse booleans or null, for example.  There's no problem there.

My intent was only to repeat what you implied:  that any parser that
reads its input until it has parsed a value is not streaming.

So how much information can the parser keep before you consider it not
to be "streaming"?

[...]

> In principle, any language that has infinite words will have the same
> problem with streaming [...]

So what magic allows anyone to stream any JSON file over SCSI or IP?
Let alone some kind of "live stream" that by definition is indefinite,
even if it only lasts a few tenths of a second?

> [...] If you ever pondered h/w or low-level
> protocols s.a. SCSI or IP [...]

I spent a good deal of my career designing and implementing all manner
of communications protocols, from transmitting and receiving single
bits over a wire all the way up to what are now known as session and
presentation layers.  Some imposed maximum lengths in certain places;
some allowed for indefinite amounts of data to be transferred from one
end to the other without stopping, resetting, or overflowing.  And yet
somehow, the universe never collapsed.

If you believe that some implementation of fsync fails to meet a
specification, or fails to work correctly on files containing JSON, then
file a bug report.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread Greg Ewing via Python-list

On 2/10/24 12:26 pm, avi.e.gr...@gmail.com wrote:

> The real problem is how the JSON is set up. If you take umpteen data
> structures and wrap them all in something like a list, then it may be a tad
> hard to stream as you may not necessarily be examining the contents till the
> list finishes gigabytes later.


Yes, if you want to process the items as they come in, you might
be better off sending a series of separate JSON strings, rather than
one JSON string containing a list.

Or, use a specialised JSON parser that processes each item of the
list as soon as it's finished parsing it, instead of collecting the
whole list first.
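
For the second option, a sketch assuming the third-party ijson package and
a single top-level JSON array (a hypothetical process() stands in for the
real work):

    import ijson  # pip install ijson

    # ijson yields each array element as soon as it has been parsed, so
    # memory use stays bounded by the size of one item, not the whole list.
    with open("huge.json", "rb") as f:
        for item in ijson.items(f, "item"):
            process(item)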

--
Greg

--
https://mail.python.org/mailman/listinfo/python-list


[RELEASE] Python 3.13.0rc3 and 3.12.7 released.

2024-10-01 Thread Thomas Wouters via Python-list
This is not the release you’re looking for…

(unless you’re looking for 3.12.7.)

Because no plan survives contact with reality, instead of the actual Python
3.13.0 release we have a new Python 3.13 release candidate today. Python
3.13.0rc3 rolls back the incremental cyclic garbage collector (GC), which
was added in one of the alpha releases. The incremental GC had more
significant performance regressions in specific workloads than we expected.
Rather than try to fiddle with its details in the hope of fixing them (and
not making anything else worse), we decided to revert to the old GC in
3.13. Work on the incremental GC will continue in 3.14. We also took the
opportunity to fix some other (rare) bugs and issues found in 3.13.0rc2. The
final release of Python 3.13.0 will now happen next week, on Monday, October 7th.

In an effort to return to normalcy, we’ve also released Python 3.12.7 as
scheduled, despite the expedited release a month ago. It’s important to be
regular!
3.13.0rc3

https://www.python.org/downloads/release/python-3130rc3/

The final cut of 3.13.0 (really, honest). Besides the incremental GC revert,
it contains a small number of other fixes, as well as many documentation and
test suite improvements (~145 changes in total).

Call to action

We strongly encourage maintainers of third-party Python projects to prepare
their projects for 3.13 compatibility during this phase, and where
necessary publish Python 3.13 wheels on PyPI to be ready for the final
release of 3.13.0. Any binary wheels built against Python 3.13.0rc1 and
later will work with future versions of Python 3.13. As always, report any
issues to the Python bug tracker.

Please keep in mind that this is a preview release, and while it’s as close
to the final release as we can get it, its use is not recommended for
production environments. Next week, though!

New features in Python 3.13

   - A new and improved interactive interpreter, based on PyPy's, featuring
   multi-line editing and color support, as well as colorized exception
   tracebacks.
   - An *experimental* free-threaded build mode, which disables the Global
   Interpreter Lock, allowing threads to run more concurrently. The build
   mode is available as an experimental feature in the Windows and macOS
   installers as well.
   - A preliminary, *experimental* JIT, providing the ground work for
   significant performance improvements.
   - The locals() builtin function (and its C equivalent) now has
   well-defined semantics when mutating the returned mapping, which allows
   debuggers to operate more consistently.
   - A modified version of mimalloc is now included, optional but enabled by
   default if supported by the platform, and required for the free-threaded
   build mode.
   - Docstrings now have their leading indentation stripped, reducing memory
   use and the size of .pyc files. (Most tools handling docstrings already
   strip leading indentation.)
   - The dbm module has a new dbm.sqlite3 backend that is used by default
   when creating new files.
   - The minimum supported macOS version was changed from 10.9 to 10.13
   (High Sierra). Older macOS versions will not be supported going forward.
   - WASI is now a Tier 2 supported platform. Emscripten is no longer an
   officially supported platform (but Pyodide continues to support
   Emscripten).
   - iOS is now a Tier 3 supported platform.
   - Android is now a Tier 3 supported platform as well.
Python 3.12.7

https://www.python.org/downloads/release/python-3127/

A small release since 3.12.6 was only a month ago, but nevertheless 3.12.7
contains ~120 bug fixes, build improvements and documentation changes.


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread Dan Sommers via Python-list
On 2024-09-30 at 18:48:02 -0700,
Keith Thompson via Python-list  wrote:

> 2qdxy4rzwzuui...@potatochowder.com writes:
> [...]
> > In Common Lisp, you can write integers as #nnR[digits], where nn is the
> > decimal representation of the base (possibly without a leading zero),
> > the # and the R are literal characters, and the digits are written in
> > the intended base.  So the input #16f is read as the integer 65535.
> 
> Typo: You meant #16R, not #16f.

Yep.  Sorry.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread Left Right via Python-list
> What am I missing?  Handwavingly, start with the first digit, and as
> long as the next character is a digit, multipliy the accumulated result
> by 10 (or the appropriate base) and add the next value.  Oh, and handle
> scientific notation as a special case, and perhaps fail spectacularly
> instead of recovering gracefully in certain edge cases.  And in the
> pathological case of a single number with 60 billion digits, run out of
> memory (and complain loudly to the person who claimed that the file
> contained a "dataset").  But why do I need to start with the least
> significant digit?

You probably forgot that it has to be _streaming_. Suppose you parse
the first digit: can you hand this information over to an external
function to process the parsed data? -- No! because you don't know the
magnitude yet.  What about two digits? -- Same thing.  You cannot
leave the parser code until you know the magnitude (otherwise the
information is useless to the external code).

So, even if you have enough memory and don't care about special cases
like scientific notation: yes, you will be able to parse it, but it
won't be a streaming parser.
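
As a side note on the gzip exchange quoted below: incremental decompression
is also possible with the standard library alone. A minimal sketch, assuming
the compressed bytes arrive in chunks:

    import zlib

    def gunzip_chunks(chunks):
        # Yield decompressed bytes as compressed chunks arrive; wbits =
        # 16 + MAX_WBITS tells zlib to expect the gzip container format.
        d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
        for chunk in chunks:
            yield d.decompress(chunk)
        yield d.flush()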

On Mon, Sep 30, 2024 at 9:30 PM Left Right  wrote:
>
> > Streaming won't work because the file is gzipped.  You have to receive
> > the whole thing before you can unzip it. Once unzipped it will be even
> > larger, and all in memory.
>
> GZip is specifically designed to be streamed.  So, that's not a
> problem (in principle), but you would need to have a streaming GZip
> parser, quick search in PyPI revealed this package:
> https://pypi.org/project/gzip-stream/ .
>
> On Mon, Sep 30, 2024 at 6:20 PM Thomas Passin via Python-list
>  wrote:
> >
> > On 9/30/2024 11:30 AM, Barry via Python-list wrote:
> > >
> > >
> > >> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list 
> > >>  wrote:
> > >>
> > >>
> > >> import polars as pl
> > >> pl.read_json("file.json")
> > >>
> > >>
> > >
> > > This is not going to work unless the computer has a lot more than 60GiB of
> > > RAM.
> > >
> > > As later suggested a streaming parser is required.
> >
> > Streaming won't work because the file is gzipped.  You have to receive
> > the whole thing before you can unzip it. Once unzipped it will be even
> > larger, and all in memory.
> > --
> > https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread Keith Thompson via Python-list
2qdxy4rzwzuui...@potatochowder.com writes:
[...]
> In Common Lisp, you can write integers as #nnR[digits], where nn is the
> decimal representation of the base (possibly without a leading zero),
> the # and the R are literal characters, and the digits are written in
> the intended base.  So the input #16f is read as the integer 65535.

Typo: You meant #16R, not #16f.
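
For comparison, a sketch of how Python spells the same idea, via the
explicit base argument to int():

    # Read "ffff" in base 16, analogous to Common Lisp's #16rffff.
    print(int("ffff", 16))  # -> 65535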

-- 
Keith Thompson (The_Other_Keith) keith.s.thompso...@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread Dan Sommers via Python-list
On 2024-09-30 at 21:34:07 +0200,
Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data (60 
GB) from Kenna API,"
Left Right via Python-list  wrote:

> > What am I missing?  Handwavingly, start with the first digit, and as
> > long as the next character is a digit, multiply the accumulated result
> > by 10 (or the appropriate base) and add the next value.  Oh, and handle
> > scientific notation as a special case, and perhaps fail spectacularly
> > instead of recovering gracefully in certain edge cases.  And in the
> > pathological case of a single number with 60 billion digits, run out of
> > memory (and complain loudly to the person who claimed that the file
> > contained a "dataset").  But why do I need to start with the least
> > significant digit?
> 
> You probably forgot that it has to be _streaming_. Suppose you parse
> the first digit: can you hand this information over to an external
> function to process the parsed data? -- No! because you don't know the
> magnitude yet.  What about two digits? -- Same thing.  You cannot
> leave the parser code until you know the magnitude (otherwise the
> information is useless to the external code).

If I recognize the first digit, then I *can* hand that over to an
external function to accumulate the digits that follow.

> So, even if you have enough memory and don't care about special cases
> like scientific notation: yes, you will be able to parse it, but it
> won't be a streaming parser.

Under that constraint, I'm not sure I can parse anything.  How can I
parse a string (and hand it over to an external function) until I've
found the closing quote?

How much state can a parser maintain (before it invokes an external
function) and still be considered streaming?  I fear that we may be
getting hung up on terminology rather than solving the problem at hand.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread Left Right via Python-list
> If I recognize the first digit, then I *can* hand that over to an
> external function to accumulate the digits that follow.

And what is that external function going to do with this information?
The point is you didn't parse anything if you just sent the digit.
You just delegated the parsing further. Parsing is only meaningful if
you extracted some information, but your idea is, essentially "what if
I do nothing?".

> Under that constraint, I'm not sure I can parse anything.  How can I
> parse a string (and hand it over to an external function) until I've
> found the closing quote?

Nobody says that parsing a number is the only pathological case.  You,
however, exaggerate by saying you cannot parse _anything_. You can
parse booleans or null, for example.  There's no problem there.

Again, I think you misunderstand what streaming is for. Let me remind:
it's for processing information as it comes, potentially,
indefinitely. This has far more important implications than what you
find in computer science. For example, some mathematicians use the
same argument to show that real numbers are either fiction or useless:
consider adding two real numbers (where real numbers are potentially
infinite strings of decimal digits after the period) -- there's no way
to prove that such an addition is possible because you would need an
infinite proof for that (because you need to start adding from the
least significant digit).

In principle, any language that has infinite words will have the same
problem with streaming. If you ever pondered h/w or low-level
protocols s.a. SCSI or IP, you'd see that they are specifically
designed in such a way as to never have infinite words (because they
must be amenable to streaming). Consider also an interesting
consequence of SCSI not being able to have infinite words: this means,
besides other things, that fsync() is nonsense! :) If you aren't
familiar with the concept: UNIX filesystem API suggests that it's
possible to destage an arbitrarily large file (or a chunk of a file) to disk.
But SCSI is built of finite "words" and to describe an arbitrarily large
file you'd need to list all the blocks that constitute the file!  And
that's why fsync() and family are so hated by people who deal with
storage: the only way to implement fsync() in compliance with the
standard is to sync _everything_ (and it hurts!)

On Tue, Oct 1, 2024 at 5:49 PM Dan Sommers via Python-list
 wrote:
>
> On 2024-09-30 at 21:34:07 +0200,
> Regarding "Re: Help with Streaming and Chunk Processing for Large JSON Data 
> (60 GB) from Kenna API,"
> Left Right via Python-list  wrote:
>
> > > What am I missing?  Handwavingly, start with the first digit, and as
> > > long as the next character is a digit, multiply the accumulated result
> > > by 10 (or the appropriate base) and add the next value.  Oh, and handle
> > > scientific notation as a special case, and perhaps fail spectacularly
> > > instead of recovering gracefully in certain edge cases.  And in the
> > > pathological case of a single number with 60 billion digits, run out of
> > > memory (and complain loudly to the person who claimed that the file
> > > contained a "dataset").  But why do I need to start with the least
> > > significant digit?
> >
> > You probably forgot that it has to be _streaming_. Suppose you parse
> > the first digit: can you hand this information over to an external
> > function to process the parsed data? -- No! because you don't know the
> > magnitude yet.  What about two digits? -- Same thing.  You cannot
> > leave the parser code until you know the magnitude (otherwise the
> > information is useless to the external code).
>
> If I recognize the first digit, then I *can* hand that over to an
> external function to accumulate the digits that follow.
>
> > So, even if you have enough memory and don't care about special cases
> > like scientific notation: yes, you will be able to parse it, but it
> > won't be a streaming parser.
>
> Under that constraint, I'm not sure I can parse anything.  How can I
> parse a string (and hand it over to an external function) until I've
> found the closing quote?
>
> How much state can a parser maintain (before it invokes an external
> function) and still be considered streaming?  I fear that we may be
> getting hung up on terminology rather than solving the problem at hand.
> --
> https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-10-01 Thread Greg Ewing via Python-list

On 1/10/24 8:34 am, Left Right wrote:

> You probably forgot that it has to be _streaming_. Suppose you parse
> the first digit: can you hand this information over to an external
> function to process the parsed data? -- No! because you don't know the
> magnitude yet.


By that definition of "streaming", no parser can ever be streaming,
because there will be some constructs that must be read in their
entirety before a suitably-structured piece of output can be
emitted.

The context of this discussion about integers is the claim that
they *could* be parsed incrementally if they were written little
endian instead of big endian, but the same argument applies either
way.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list