Wonderful, I'll give that try then.  Thanks again Andreas.

Oscar

Oscar Bastidas
Research Associate
University of Minnesota

On Mon, May 3, 2021, 6:26 AM Andreas Reichel <[email protected]>
wrote:

> Oscar,
>
> unfortunately I am just an accountant and so my answer might not be
> technically accurate.
> In my understanding, you will read row by row and write the cells into
> a BufferedOutPutStream which is feeding into a FileOutPutStream.
>
> You could even put a ZIP/BZIP2 stream in between to compress the output
> on the fly.
>
> This won't take much memory (especially not in comparison to POI
> holding the whole XSSF Objects in Memory).
> 60k lines are nothing. I wold expects that -Xmx250MB would serve
> already, probably less.
>
> Cheers
> Andreas
>
>
> On Mon, 2021-05-03 at 06:19 -0500, Oscar Bastidas wrote:
> > Great.  Thanks Andreas, I'll check this one out too.
> >
> > I just started taking a quick look at the first Github you sent me
> > and I
> > didn't see anything explicit about writing the streamed data from the
> > Excel
> > workbook to a csv file.  Pardon my elementary question, but is it
> > possible
> > to write a big and single csv file using these streaming methods
> > where only
> > a handful of rows are kept in memory at a time?  Or would I need to
> > write a
> > bunch of mini csv files and then concatenate the file contents later
> > after
> > the Github code is finished?  I ask this because my only experience
> > with my
> > past work is where a file is read and held in memory at once and then
> > written to the file vs. the streaming technique.  Thanks again.
> >
> > Oscar
> >
> > Oscar Bastidas
> > Research Associate
> > University of Minnesota
> >
> > On Mon, May 3, 2021, 6:12 AM Andreas Reichel
> > <[email protected]>
> > wrote:
> >
> > > Oscar,
> > >
> > > there is also a fork of Excel Streamer which is worth to
> > > mention: https://github.com/pjfanning/excel-streaming-reader
> > > Personally I prefer that fork, but I have no good reason or
> > > argument.
> > > In the past this fork picked up version changes of POI faster.
> > >
> > > Cheers
> > > Andreas
> > >
> > >
> > >
> > > On Mon, 2021-05-03 at 06:05 -0500, Oscar Bastidas wrote:
> > > > Awesome.  Thanks, I'll give this a try.
> > > >
> > > > Oscar
> > > >
> > > > Oscar Bastidas
> > > > Research Associate
> > > > University of Minnesota
> > > >
> > > > On Mon, May 3, 2021, 6:04 AM Andreas Reichel
> > > > <[email protected]>
> > > > wrote:
> > > >
> > > > > Greetings.
> > > > >
> > > > > Please use the Excel Streaming Reader when reading large
> > > > > files: https://github.com/monitorjbl/excel-streaming-reader
> > > > >
> > > > > import com.monitorjbl.xlsx.StreamingReader;
> > > > >
> > > > > InputStream is = new FileInputStream(new
> > > > > File("/path/to/workbook.xlsx"));
> > > > > Workbook workbook = StreamingReader.builder()
> > > > >         .rowCacheSize(100)    // number of rows to keep in
> > > > > memory
> > > > > (defaults to 10)
> > > > >         .bufferSize(4096)     // buffer size to use when
> > > > > reading
> > > > > InputStream to file (defaults to 1024)
> > > > >         .open(is);            // InputStream or File for XLSX
> > > > > file
> > > > > (required)
> > > > >
> > > > >
> > > > >
> > > > > With the code above you can loop through your rows and write it
> > > > > to
> > > > > CSV.
> > > > > Best regards
> > > > > Andreas
> > > > >
> > > > >
> > > > > On Mon, 2021-05-03 at 05:31 -0500, Oscar Bastidas wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I am trying to read a large Excel spreadsheet (60,000 rows)
> > > > > > but I
> > > > > > get
> > > > > > what
> > > > > > appears to be a memory leak error from the JVM when I use the
> > > > > > *XSSFWorkbook
> > > > > > *API.  I learned recently that there are size limitations on
> > > > > > Excel
> > > > > > files
> > > > > > being read in this way and the Apache POI website
> > > > > > specifically
> > > > > > recommends
> > > > > > reading the file in a streaming fashion instead of taking the
> > > > > > whole
> > > > > > file in
> > > > > > memory.  To do this, POI recommends using something called
> > > > > > *XLSX2CSV*
> > > > > > but
> > > > > > the provided link to teach how to use this returns a "page
> > > > > > not
> > > > > > found
> > > > > > error."
> > > > > >
> > > > > > Would someone please point me in the direction of how to
> > > > > > handle
> > > > > > reading my
> > > > > > big Excel file?
> > > > > >
> > > > > > The Apache POI URL that contains the link to *XLSX2CSV* is:
> > > > > >
> > > > > > http://poi.apache.org/components/spreadsheet/limitations.html
> > > > > >
> > > > > > Thanks for any help anyone can provide.
> > > > > >
> > > > > > Oscar
> > > > > >
> > > > > > Oscar Bastidas
> > > > > > Research Associate
> > > > > > University of Minnesota
> > > > >
> > > > >
> > >
> > >
>
>

Reply via email to