Wonderful, I'll give that try then. Thanks again Andreas. Oscar
Oscar Bastidas Research Associate University of Minnesota On Mon, May 3, 2021, 6:26 AM Andreas Reichel <[email protected]> wrote: > Oscar, > > unfortunately I am just an accountant and so my answer might not be > technically accurate. > In my understanding, you will read row by row and write the cells into > a BufferedOutPutStream which is feeding into a FileOutPutStream. > > You could even put a ZIP/BZIP2 stream in between to compress the output > on the fly. > > This won't take much memory (especially not in comparison to POI > holding the whole XSSF Objects in Memory). > 60k lines are nothing. I wold expects that -Xmx250MB would serve > already, probably less. > > Cheers > Andreas > > > On Mon, 2021-05-03 at 06:19 -0500, Oscar Bastidas wrote: > > Great. Thanks Andreas, I'll check this one out too. > > > > I just started taking a quick look at the first Github you sent me > > and I > > didn't see anything explicit about writing the streamed data from the > > Excel > > workbook to a csv file. Pardon my elementary question, but is it > > possible > > to write a big and single csv file using these streaming methods > > where only > > a handful of rows are kept in memory at a time? Or would I need to > > write a > > bunch of mini csv files and then concatenate the file contents later > > after > > the Github code is finished? I ask this because my only experience > > with my > > past work is where a file is read and held in memory at once and then > > written to the file vs. the streaming technique. Thanks again. > > > > Oscar > > > > Oscar Bastidas > > Research Associate > > University of Minnesota > > > > On Mon, May 3, 2021, 6:12 AM Andreas Reichel > > <[email protected]> > > wrote: > > > > > Oscar, > > > > > > there is also a fork of Excel Streamer which is worth to > > > mention: https://github.com/pjfanning/excel-streaming-reader > > > Personally I prefer that fork, but I have no good reason or > > > argument. > > > In the past this fork picked up version changes of POI faster. > > > > > > Cheers > > > Andreas > > > > > > > > > > > > On Mon, 2021-05-03 at 06:05 -0500, Oscar Bastidas wrote: > > > > Awesome. Thanks, I'll give this a try. > > > > > > > > Oscar > > > > > > > > Oscar Bastidas > > > > Research Associate > > > > University of Minnesota > > > > > > > > On Mon, May 3, 2021, 6:04 AM Andreas Reichel > > > > <[email protected]> > > > > wrote: > > > > > > > > > Greetings. > > > > > > > > > > Please use the Excel Streaming Reader when reading large > > > > > files: https://github.com/monitorjbl/excel-streaming-reader > > > > > > > > > > import com.monitorjbl.xlsx.StreamingReader; > > > > > > > > > > InputStream is = new FileInputStream(new > > > > > File("/path/to/workbook.xlsx")); > > > > > Workbook workbook = StreamingReader.builder() > > > > > .rowCacheSize(100) // number of rows to keep in > > > > > memory > > > > > (defaults to 10) > > > > > .bufferSize(4096) // buffer size to use when > > > > > reading > > > > > InputStream to file (defaults to 1024) > > > > > .open(is); // InputStream or File for XLSX > > > > > file > > > > > (required) > > > > > > > > > > > > > > > > > > > > With the code above you can loop through your rows and write it > > > > > to > > > > > CSV. > > > > > Best regards > > > > > Andreas > > > > > > > > > > > > > > > On Mon, 2021-05-03 at 05:31 -0500, Oscar Bastidas wrote: > > > > > > Hello, > > > > > > > > > > > > I am trying to read a large Excel spreadsheet (60,000 rows) > > > > > > but I > > > > > > get > > > > > > what > > > > > > appears to be a memory leak error from the JVM when I use the > > > > > > *XSSFWorkbook > > > > > > *API. I learned recently that there are size limitations on > > > > > > Excel > > > > > > files > > > > > > being read in this way and the Apache POI website > > > > > > specifically > > > > > > recommends > > > > > > reading the file in a streaming fashion instead of taking the > > > > > > whole > > > > > > file in > > > > > > memory. To do this, POI recommends using something called > > > > > > *XLSX2CSV* > > > > > > but > > > > > > the provided link to teach how to use this returns a "page > > > > > > not > > > > > > found > > > > > > error." > > > > > > > > > > > > Would someone please point me in the direction of how to > > > > > > handle > > > > > > reading my > > > > > > big Excel file? > > > > > > > > > > > > The Apache POI URL that contains the link to *XLSX2CSV* is: > > > > > > > > > > > > http://poi.apache.org/components/spreadsheet/limitations.html > > > > > > > > > > > > Thanks for any help anyone can provide. > > > > > > > > > > > > Oscar > > > > > > > > > > > > Oscar Bastidas > > > > > > Research Associate > > > > > > University of Minnesota > > > > > > > > > > > > > > > > > >
