Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Thomas Passin via Python-list

On 9/30/2024 11:30 AM, Barry via Python-list wrote:
>> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list wrote:
>>
>> import polars as pl
>> pl.read_json("file.json")
>
> This is not going to work unless the computer has a lot more than 60 GiB of RAM.
>
> As later suggested, a streaming parser is required.

Streaming won't work because the file is gzipped.  You have to receive
the whole thing before you can unzip it. Once unzipped it will be even
larger, and all in memory.

--
https://mail.python.org/mailman/listinfo/python-list


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Barry via Python-list



> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list 
>  wrote:
> 
> 
> import polars as pl
> pl.read_json("file.json")
> 
> 

This is not going to work unless the computer has a lot more than 60 GiB of RAM.

As later suggested, a streaming parser is required.

Barry




Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Grant Edwards via Python-list
On 2024-09-30, Left Right via Python-list  wrote:
> Whether and to what degree you can stream JSON depends on JSON
> structure. In general, however, JSON cannot be streamed (but commonly
> it can be).
>
> Imagine a pathological case of this shape: 1... <60GB of digits>. This
> is still a valid JSON (it doesn't have any limits on how many digits a
> number can have). And you cannot parse this number in a streaming way
> because in order to do that, you need to start with the least
> significant digit.

Which is how Arabic numbers were originally parsed, but when
westerners adopted them from a R->L written language, they didn't flip
them around to match the L->R written language into which they were
being adopted.

So now long numbers can't be parsed as a stream in software. They
should have anticipated this problem back in the 13th century and
flipped the numbers around.






Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Chris Angelico via Python-list
On Tue, 1 Oct 2024 at 02:20, Thomas Passin via Python-list
 wrote:
>
> On 9/30/2024 11:30 AM, Barry via Python-list wrote:
> >
> >
> >> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list 
> >>  wrote:
> >>
> >>
> >> import polars as pl
> >> pl.read_json("file.json")
> >>
> >>
> >
> > This is not going to work unless the computer has a lot more than 60 GiB
> > of RAM.
> >
> > As later suggested a streaming parser is required.
>
> Streaming won't work because the file is gzipped.  You have to receive
> the whole thing before you can unzip it. Once unzipped it will be even
> larger, and all in memory.

Streaming gzip is perfectly possible. You may be thinking of PKZip
which has its EOCD at the end of the file (although it may still be
possible to stream-decompress if you work at it).
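
For anyone following along, here is a minimal sketch of chunk-by-chunk gzip
decompression using only the standard library. The helper name
`iter_gunzip` and the 16-byte chunk size are made up for illustration, and
the in-memory bytes merely stand in for data arriving from the network;
nothing here is specific to the Kenna API.

```python
import gzip
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes for an iterable of gzip-compressed chunks.

    Hypothetical helper: it never needs the whole compressed file at once.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decomp.decompress(chunk)
        if data:
            yield data
    tail = decomp.flush()
    if tail:
        yield tail


# Stand-in for a download: compress a payload, then feed it back in small
# pieces, the way a socket or HTTP client would deliver it.
payload = b'{"vulnerabilities": [1, 2, 3]}' * 1000
compressed = gzip.compress(payload)
pieces = [compressed[i:i + 16] for i in range(0, len(compressed), 16)]

restored = b"".join(iter_gunzip(pieces))
```

The sketch assumes a single gzip member; a file made of several
concatenated members would need a fresh decompressor for whatever is left
in `unused_data`.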

ChrisA


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Left Right via Python-list
Whether and to what degree you can stream JSON depends on the JSON
structure. In the general case JSON cannot be streamed, although in
practice it commonly can be.

Imagine a pathological case of this shape: 1... <60GB of digits>. This
is still valid JSON (the format puts no limit on how many digits a
number can have). And you cannot parse this number in a streaming way
because in order to do that, you need to start with the least
significant digit.

Typically, however, JSON can be parsed incrementally. The format is
conceptually very simple to write a parser for. There are plenty of
parsers that do that, for example, this one:
https://pypi.org/project/json-stream/ . But I'd encourage you to do
it yourself.  It's fun, and the resulting parser should come to fewer
than 50 or so LoC.  It also lets you tie your desired output more
closely into your parser.
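
As a rough illustration of incremental parsing (not the from-scratch parser
suggested above), the sketch below leans on the standard library's
`json.JSONDecoder.raw_decode` to pull elements out of a top-level array one
at a time. It assumes the document is a single top-level array, which may
or may not match the actual export; `iter_array_items` is a hypothetical
name, and the parser is lenient about stray commas.

```python
import json


def iter_array_items(chunks):
    """Yield the elements of a top-level JSON array one at a time.

    Hypothetical helper: `chunks` is any iterable of str pieces of the
    document. Only the not-yet-parsed tail of the input is buffered.
    """
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not started:
                if not buf:
                    break
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf, started = buf[1:], True
                continue
            # Drop the separator left over from the previous element.
            if buf.startswith(","):
                buf = buf[1:].lstrip()
            if buf.startswith("]"):
                return
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # element is incomplete: wait for more input
            rest = buf[end:].lstrip()
            if not rest or rest[0] not in ",]":
                break  # e.g. "1e" so far: the element may still continue
            yield item
            buf = rest
    raise ValueError("input ended before the array was closed")


# Feed a small document in 7-character pieces to mimic a stream.
doc = '[{"id": 1, "score": 7.5}, {"id": 2, "score": 1e3}, "x", [1, 2], null]'
pieces = [doc[i:i + 7] for i in range(0, len(doc), 7)]
items = list(iter_array_items(pieces))
```

An incomplete element is re-parsed from its start every time a new chunk
arrives, so this is fine for many modest records but wasteful for a few
huge ones; a real streaming parser such as json-stream avoids that.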

On Mon, Sep 30, 2024 at 8:44 AM Asif Ali Hirekumbi via Python-list
 wrote:
>
> Thanks Abdur Rahmaan.
> I will give it a try !
>
> Thanks
> Asif
>
> On Mon, Sep 30, 2024 at 11:19 AM Abdur-Rahmaan Janhangeer <
> arj.pyt...@gmail.com> wrote:
>
> > Idk if you tried Polars, but it seems to work well with JSON data
> >
> > import polars as pl
> > pl.read_json("file.json")
> >
> > Kind Regards,
> >
> > Abdur-Rahmaan Janhangeer
> > about | blog | github
> > Mauritius
> >
> >
> > On Mon, Sep 30, 2024 at 8:00 AM Asif Ali Hirekumbi via Python-list <
> > python-list@python.org> wrote:
> >
> >> Dear Python Experts,
> >>
> >> I am working with the Kenna Application's API to retrieve vulnerability
> >> data. The API endpoint provides a single, massive JSON file in gzip
> >> format, approximately 60 GB in size. Handling such a large dataset in
> >> one go is proving to be quite challenging, especially in terms of
> >> memory management.
> >>
> >> I am looking for guidance on how to efficiently stream this data and
> >> process it in chunks using Python. Specifically, I am wondering if there’s
> >> a way to use the requests library or any other libraries that would allow
> >> us to pull data from the API endpoint in a memory-efficient manner.
> >>
> >> Here are the relevant API endpoints from Kenna:
> >>
> >>    - Kenna API Documentation
> >>    - Kenna Vulnerabilities Export
> >>
> >>
> >> If anyone has experience with similar use cases or can offer any advice,
> >> it would be greatly appreciated.
> >>
> >> Thank you in advance for your help!
> >>
> >> Best regards
> >> Asif Ali


ANN: Python Meeting Düsseldorf - 02.10.2024

2024-09-30 Thread eGenix Team via Python-list


/This announcement targets a local user group meeting in Düsseldorf,
Germany (translated from the German original)/



   Announcement

   Python Meeting Düsseldorf - October 2024

   A meeting of Python enthusiasts and interested people
   in a relaxed atmosphere.

   *02.10.2024, 18:00*
   Room 1, 2nd floor, Bürgerhaus Stadtteilzentrum Bilk
   Düsseldorfer Arcaden, Bachstr. 145,
   40217 Düsseldorf



   Program

Talks registered so far:

 * Detlef Lannert:
   /*pyinfra as an alternative to Ansible*/
 * Marc-André Lemburg:
   /*Rapid web app development with Panel*/
 * Detlef Lannert:
   /*Low-cost objects as alternatives to dictionaries?*/
 * Charlie Clark:
   /*Editing ZIP files with Python*/

Further talks are welcome. If you are interested, please contact
i...@pyddf.de.



   Start time and location

We meet at 18:00 in the Bürgerhaus in the Düsseldorfer Arcaden.

The Bürgerhaus shares its entrance with the swimming pool and is located
on the side of the Düsseldorfer Arcaden where the underground car park
entrance is.


Above the entrance there is a large "Schwimm’ in Bilk" logo. Behind the
door, turn directly left to the two elevators, then go up to the 2nd floor.
The entrance to Room 1 is directly on the left as you leave the elevator.


>>> Entrance in Google Street View

*⚠️ Important*: Please register only if you are absolutely sure that you
will attend. Given the limited number of places, we have no sympathy for
last-minute cancellations or no-shows.



   Introduction

The Python Meeting Düsseldorf is a regular event in Düsseldorf aimed at
Python enthusiasts from the region.


Our PyDDF YouTube channel, where we publish videos of the talks after the
meetings, gives a good overview of the talks.


The meeting is organized by eGenix.com GmbH, Langenfeld, in cooperation
with Clark Consulting & Research, Düsseldorf.



   Format

The Python Meeting Düsseldorf uses a mix of (lightning) talks and open
discussion.


Talks can be registered in advance or brought up spontaneously during the
meeting. A projector with HDMI and Full HD resolution is available.


To register a (lightning) talk, please send an informal email to
i...@pyddf.de




   Contribution to costs

The Python Meeting Düsseldorf is organized by Python users for Python
users.


Since the meeting room, projector, internet and drinks incur costs, we ask
participants for a contribution of EUR 10.00 incl. 19% VAT. School pupils
and students pay EUR 5.00 incl. 19% VAT.


We ask all participants to bring the amount in cash.


   Registration

Since we can only host 25 people in the rented room, we ask you to
register in advance.


   *Meeting registration* please via Meetup



   Further information

Further information is available on the meeting's website:

https://pyddf.de/

Have fun!

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Sep 30 2024)

Python Projects, Coaching and Support ...https://www.egenix.com/
Python Product Development ...https://consulting.egenix.com/



::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48

D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   https://www.egenix.com/company/contact/
 https://www.malemburg.com/


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Thomas Passin via Python-list

On 9/30/2024 1:00 PM, Chris Angelico via Python-list wrote:
> On Tue, 1 Oct 2024 at 02:20, Thomas Passin via Python-list wrote:
>> On 9/30/2024 11:30 AM, Barry via Python-list wrote:
>>>> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list wrote:
>>>>
>>>> import polars as pl
>>>> pl.read_json("file.json")
>>>
>>> This is not going to work unless the computer has a lot more than 60 GiB of RAM.
>>>
>>> As later suggested, a streaming parser is required.
>>
>> Streaming won't work because the file is gzipped.  You have to receive
>> the whole thing before you can unzip it. Once unzipped it will be even
>> larger, and all in memory.
>
> Streaming gzip is perfectly possible. You may be thinking of PKZip
> which has its EOCD at the end of the file (although it may still be
> possible to stream-decompress if you work at it).
>
> ChrisA

You're right, that's what I was thinking of.



Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Dan Sommers via Python-list
On 2024-09-30 at 11:44:50 -0400,
Grant Edwards via Python-list  wrote:

> On 2024-09-30, Left Right via Python-list  wrote:
> > Whether and to what degree you can stream JSON depends on JSON
> > structure. In general, however, JSON cannot be streamed (but commonly
> > it can be).
> >
> > Imagine a pathological case of this shape: 1... <60GB of digits>. This
> > is still a valid JSON (it doesn't have any limits on how many digits a
> > number can have). And you cannot parse this number in a streaming way
> > because in order to do that, you need to start with the least
> > significant digit.
> 
> Which is how Arabic numbers were originally parsed, but when
> westerners adopted them from a R->L written language, they didn't flip
> them around to match the L->R written language into which they were
> being adopted.

Interesting.

> So now long numbers can't be parsed as a stream in software. They
> should have anticipated this problem back in the 13th century and
> flipped the numbers around.

What am I missing?  Handwavingly, start with the first digit, and as
long as the next character is a digit, multiply the accumulated result
by 10 (or the appropriate base) and add the next value.  Oh, and handle
scientific notation as a special case, and perhaps fail spectacularly
instead of recovering gracefully in certain edge cases.  And in the
pathological case of a single number with 60 billion digits, run out of
memory (and complain loudly to the person who claimed that the file
contained a "dataset").  But why do I need to start with the least
significant digit?
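
The handwaving above fits in a few lines of Python. This is only a sketch:
`parse_digits` is a hypothetical name, and signs, fractions and exponents
are deliberately left out.

```python
def parse_digits(chars, base=10):
    """Accumulate a non-negative integer from digit characters,
    consumed one at a time, most significant digit first."""
    value = 0
    for ch in chars:
        digit = int(ch, base)  # raises ValueError on a non-digit
        value = value * base + digit
    return value


# The digits can come from any iterator, e.g. characters read off a stream.
from_decimal = parse_digits(iter("65535"))
from_hex = parse_digits(iter("ffff"), base=16)
```

Only the running total is kept, so the input never has to be buffered; the
total itself still grows with the number of digits, which is the
out-of-memory case mentioned above.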


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Chris Angelico via Python-list
On Tue, 1 Oct 2024 at 04:30, Dan Sommers via Python-list
 wrote:
>
> But why do I need to start with the least
> significant digit?

If you start from the most significant, you don't know anything about
the number until you finish parsing it. There's almost nothing you can
say about a number given that it starts with a particular sequence
(since you don't know how MANY digits there are). However, if you know
the LAST digits, you can make certain statements about it (trivial
examples being whether it's odd or even).
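
A tiny sketch of that point, assuming a non-negative decimal integer held
as a string (`residue_from_tail` is a made-up name): the last k digits
alone give the value modulo 10**k, whereas the leading digits give no such
residue until the length is known.

```python
def residue_from_tail(digits, k):
    """Return int(digits) % 10**k by looking only at the last k digits."""
    return int(digits[-k:])


number = "31415926535897932384626433832795"
last_three = residue_from_tail(number, 3)
is_even = residue_from_tail(number, 1) % 2 == 0
```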

It's not very, well, significant. But there's something to it. And it
extends nicely to p-adic numbers, which can have an infinite number of
nonzero digits to the left of the decimal:

https://en.wikipedia.org/wiki/P-adic_number

ChrisA


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Thomas Passin via Python-list

On 9/30/2024 11:30 AM, Barry via Python-list wrote:
>> On 30 Sep 2024, at 06:52, Abdur-Rahmaan Janhangeer via Python-list wrote:
>>
>> import polars as pl
>> pl.read_json("file.json")
>
> This is not going to work unless the computer has a lot more than 60 GiB of RAM.
>
> As later suggested, a streaming parser is required.

There is also the json-stream library, on PyPI at

https://pypi.org/project/json-stream/




Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Dan Sommers via Python-list
On 2024-10-01 at 09:09:07 +1000,
Chris Angelico via Python-list  wrote:

> On Tue, 1 Oct 2024 at 08:56, Grant Edwards via Python-list
>  wrote:
> >
> > On 2024-09-30, Dan Sommers via Python-list  wrote:
> >
> > > In Common Lisp, integers can be written in any integer base from two
> > > to thirty six, inclusive.  So knowing the last digit doesn't tell
> > > you whether an integer is even or odd until you know the base
> > > anyway.
> >
> > I had to think about that for an embarrassingly long time before it
> > clicked.
> 
> The only part I'm not clear on is what identifies the base. If you're
> going to write numbers little-endian, it's not that hard to also write
> them with a base indicator before the digits [...]

In Common Lisp, you can write integers as #nnR[digits], where nn is the
decimal representation of the base (possibly without a leading zero),
the # and the R are literal characters, and the digits are written in
the intended base.  So the input #16rffff is read as the integer 65535.

You can also set or bind the global variable *read-base* (yes, the
asterisks are part of the name) to an integer between 2 and 36, and then
anything that looks like an integer in that base is interpreted as such
(including literals in programs).  The literals I described above are
still handled correctly no matter the current value of *read-base*.  So
if the value of *read-base* is 16, then the input ffff is read as the
integer 65535 (as is the input #16rffff).
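
For readers more at home in Python than Lisp, here is a rough analogue of
reading the #nnR notation. It is only a sketch: `read_radix_literal` is a
hypothetical helper, it ignores signs and *read-base*, and it is not how
any Lisp reader is actually implemented.

```python
import re

# "#", a decimal radix, "r" or "R", then the digits in that radix.
_RADIX_LITERAL = re.compile(r"#(\d+)[rR]([0-9a-zA-Z]+)")


def read_radix_literal(text):
    """Parse a Common Lisp style #nnR literal such as '#16rffff'."""
    match = _RADIX_LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a #nnR literal: {text!r}")
    base = int(match.group(1))
    if not 2 <= base <= 36:
        raise ValueError(f"radix must be between 2 and 36, got {base}")
    # int() rejects digits that are not valid in the given base.
    return int(match.group(2), base)


hex_value = read_radix_literal("#16rffff")
binary_value = read_radix_literal("#2R1010")
```

The base prefix arrives before the digits, so a reader knows how to weigh
each digit as soon as it sees it.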

(Pedants may point out details I omitted.  I admit to omitting them.)

IIRC, certain [old 8080 and Z-80?] assemblers used to put the base
indicator at the end.  So 10 meant, well, 10, but 10H meant 16 and 10b
meant 2 (IDK; the capital H and the lower case b both look right to me).

I don't recall numbers written from least significant digit to most
significant digit (big and little endian *storage*, yes, but not the
digits when presented to or read from a human).


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Left Right via Python-list
> Streaming won't work because the file is gzipped.  You have to receive
> the whole thing before you can unzip it. Once unzipped it will be even
> larger, and all in memory.

GZip is specifically designed to be streamed.  So that's not a
problem (in principle), but you would need a streaming GZip
parser. A quick search of PyPI turned up this package:
https://pypi.org/project/gzip-stream/ .
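
The standard library can also do this without a third-party package. The
sketch below wraps any readable file-like object in `gzip.GzipFile` and
reads the decompressed data in bounded blocks. `iter_decompressed` is a
hypothetical name; the `io.BytesIO` object merely stands in for a real
network stream (for example, and this is an untested assumption about how
one would wire it up, the `raw` attribute of a `requests` response opened
with `stream=True`).

```python
import gzip
import io


def iter_decompressed(raw, chunk_size=64 * 1024):
    """Yield decompressed blocks from a file-like object holding gzip data.

    `raw` only needs a read() method; it does not have to be seekable.
    """
    with gzip.GzipFile(fileobj=raw, mode="rb") as gz:
        while True:
            block = gz.read(chunk_size)
            if not block:
                break
            yield block


# Stand-in for the network stream: gzip data held in memory.
payload = b'{"id": 1}\n' * 5000
stream = io.BytesIO(gzip.compress(payload))

blocks = list(iter_decompressed(stream, chunk_size=1000))
restored = b"".join(blocks)
```

Each block can be handed straight to an incremental JSON parser, so neither
the compressed nor the decompressed file ever has to sit in memory whole.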



Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Grant Edwards via Python-list
On 2024-09-30, Dan Sommers via Python-list  wrote:
> On 2024-09-30 at 11:44:50 -0400,
> Grant Edwards via Python-list  wrote:
>
>> On 2024-09-30, Left Right via Python-list  wrote:
>> > [...]
>> > Imagine a pathological case of this shape: 1... <60GB of digits>. This
>> > is still a valid JSON (it doesn't have any limits on how many digits a
>> > number can have). And you cannot parse this number in a streaming way
>> > because in order to do that, you need to start with the least
>> > significant digit.
>> 
>> Which is how Arabic numbers were originally parsed, but when
>> westerners adopted them from a R->L written language, they didn't
>> flip them around to match the L->R written language into which they
>> were being adopted.
>
> Interesting.
>
>> So now long numbers can't be parsed as a stream in software. They
>> should have anticipated this problem back in the 13th century and
>> flipped the numbers around.
>
> What am I missing?  Handwavingly, start with the first digit, and as
> long as the next character is a digit, multiply the accumulated
> result by 10 (or the appropriate base) and add the next value.
> [...]  But why do I need to start with the least significant digit?

Excellent question.  That's actually a pretty standard way to parse
numeric literals. I accepted at face value the claim that in JSON
there is something that requires parsing numeric literals from the
least significant end -- but I can't think of why the usual algorithms
used by other languages' lexers for yonks wouldn't work for JSON.

--
Grant


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Grant Edwards via Python-list
On 2024-09-30, Dan Sommers via Python-list  wrote:

> In Common Lisp, integers can be written in any integer base from two
> to thirty six, inclusive.  So knowing the last digit doesn't tell
> you whether an integer is even or odd until you know the base
> anyway.

I had to think about that for an embarrassingly long time before it
clicked.


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Chris Angelico via Python-list
On Tue, 1 Oct 2024 at 08:56, Grant Edwards via Python-list
 wrote:
>
> On 2024-09-30, Dan Sommers via Python-list  wrote:
>
> > In Common Lisp, integers can be written in any integer base from two
> > to thirty six, inclusive.  So knowing the last digit doesn't tell
> > you whether an integer is even or odd until you know the base
> > anyway.
>
> I had to think about that for an embarrassingly long time before it
> clicked.

The only part I'm not clear on is what identifies the base. If you're
going to write numbers little-endian, it's not that hard to also write
them with a base indicator before the digits. But, whatever. This is a
typical tangent and people are argumentative for no reason. I was just
trying to add some explanatory notes to why little-endian does make
more sense than big-endian.

ChrisA


Re: Help with Streaming and Chunk Processing for Large JSON Data (60 GB) from Kenna API

2024-09-30 Thread Dan Sommers via Python-list
On 2024-10-01 at 04:46:35 +1000,
Chris Angelico via Python-list  wrote:

> On Tue, 1 Oct 2024 at 04:30, Dan Sommers via Python-list
>  wrote:
> >
> > But why do I need to start with the least
> > significant digit?
> 
> If you start from the most significant, you don't know anything about
> the number until you finish parsing it. There's almost nothing you can
> say about a number given that it starts with a particular sequence
> (since you don't know how MANY digits there are). However, if you know
> the LAST digits, you can make certain statements about it (trivial
> examples being whether it's odd or even).

But that wasn't the question.  Sure, under certain circumstances and for
specific use cases and/or requirements, there might be arguments to read
potential numbers as strings and possibly not have to parse them
completely before accepting or rejecting them.

And if I start with the least significant digit and the number happens
to be written in scientific notation and/or has a decimal point, then I
can't tell whether it's odd or even until I further process the whole
thing anyway.

> It's not very, well, significant. But there's something to it. And it
> extends nicely to p-adic numbers, which can have an infinite number of
> nonzero digits to the left of the decimal:
> 
> https://en.wikipedia.org/wiki/P-adic_number

In Common Lisp, integers can be written in any integer base from two to
thirty six, inclusive.  So knowing the last digit doesn't tell you
whether an integer is even or odd until you know the base anyway.

Curiously, we agree:  if you move the goal posts arbitrarily, then
some algorithms that parse JSON numbers will fail.