As part of my corpora research work I have to work with very large text files. Wikipedia dumps are bzip2-compressed, so I have been working with:
commons/compress/compressors/bzip2/BZip2CompressorInputStream.html

and I consistently notice that it just stops processing without an error of any kind. I checked the file at the offset where it stops, and I also checked the file with the Linux bzip2 utility; nothing seems to be wrong in any way.

The source file I used is enwiki-20141008-pages-articles.xml.bz2, which you can get from:

http://torrentz.pl/search?f=articles%20enwiki&safe=0

I am using exactly the code example you have in your user guide:

commons-compress/commons-compress_User Guide.html

aBZ2IFl = IFl.getCanonicalPath();
File OFl = new File(aOFlNm);
aOFlNm = OFl.getCanonicalPath();
// __
InputStream NwIS = Files.newInputStream(Paths.get(aBZ2IFl));
BufferedInputStream BIS = new BufferedInputStream(NwIS);
BZip2CompressorInputStream bz2IS = new BZip2CompressorInputStream(BIS);
OutputStream NwOS = Files.newOutputStream(Paths.get(aOFlNm));
int n = 0;
while (-1 != (n = bz2IS.read(bArBfr))) {
  NwOS.write(bArBfr, 0, n);
  lTtlByts += n;
}
NwOS.close();
bz2IS.close();

but it stops abruptly:

// __ aOFlNm: |enwiki-20141008-pages-articles-multistream_20201012174009.440.xml|
// __ |2601| total bytes compressed into |12081280894| processed in |2586| (ms), |1| (bytes/ms)

real 0m2.955s
user 0m2.996s
sys 0m0.176s
~

_OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
ls -l "${_OFL}"
wc -l "${_OFL}"

$ _OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
$ ls -l "${_OFL}"
-r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ wc -l "${_OFL}"
41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ md5sum --text "${_OFL}"
75c87a6650433b5cea4fef0bdae1cc1f enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ sha1sum --text "${_OFL}"
2799934309372685af919c17798e78c1796637ef enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ file --brief "${_OFL}"
ASCII text
$

// __ originally downloaded file checked and decompressed using Linux bzip2 Version 1.0.6, 6-Sept-2010:

$ which bzip2
/bin/bzip2

_BZ2="bzip2_--version.txt"
bzip2 --version > "${_BZ2}" 2>&1
cat "${_BZ2}" | head -n 1
rm -f "${_BZ2}"

$ _BZ2="bzip2_--version.txt"
$ bzip2 --version > "${_BZ2}" 2>&1
$ cat "${_BZ2}" | head -n 1
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
$ rm -f "${_BZ2}"

// __ "testing" bz2 file

$ _IFL="enwiki-20141008-pages-articles-multistream.xml.bz2"
$ time bzip2 --test --verbose "${_IFL}"
enwiki-20141008-pages-articles-multistream.xml.bz2: ok

real 93m51.202s
user 92m31.600s
sys 0m35.188s

// __ decompressing bz2 file

$ time bzip2 --decompress --verbose --keep "${_IFL}"
enwiki-20141008-pages-articles-multistream.xml.bz2: done

real 129m39.665s
user 108m15.368s
sys 7m18.684s
$

// __ decompressed file

_IFL="enwiki-20141008-pages-articles-multistream.xml"
ls -l "${_IFL}"
time wc -l "${_IFL}"
time md5sum --text "${_IFL}"
time sha1sum --text "${_IFL}"
file --brief "${_IFL}"

$ _IFL="enwiki-20141008-pages-articles-multistream.xml"
$ ls -l "${_IFL}"
-r--r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 22 2014 enwiki-20141008-pages-articles-multistream.xml
$ time wc -l "${_IFL}"
800855553 enwiki-20141008-pages-articles-multistream.xml

real 26m13.664s
user 1m3.308s
sys 1m30.616s
$ time md5sum --text "${_IFL}"
1cfabd688427728794e7ae75dc93e84c enwiki-20141008-pages-articles-multistream.xml

real 27m39.208s
user 4m14.884s
sys 1m33.788s
$ time sha1sum --text "${_IFL}"
e337572c1957a5a4d7625e3180e16f20e77749b1 enwiki-20141008-pages-articles-multistream.xml

real 30m40.383s
user 8m39.852s
sys 1m32.864s
$ file --brief "${_IFL}"
HTML document, UTF-8 Unicode text, with very long lines
$

// __ file decompressed using common compress bz2 (decompressing worked fine!)

_IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
ls -l "${_IFL}"
time wc -l "${_IFL}"
time md5sum --text "${_IFL}"
time sha1sum --text "${_IFL}"
file --brief "${_IFL}"

$ _IFL="enwiki-latest-pages-articles_20201013002000.103.xml"
$ ls -l "${_IFL}"
-rw-r--r-- 1 lbrtchx lbrtchx 50151236957 Oct 13 03:35 enwiki-latest-pages-articles_20201013002000.103.xml
$ time wc -l "${_IFL}"
800855553 enwiki-latest-pages-articles_20201013002000.103.xml

real 14m44.535s
user 3m55.816s
sys 1m22.816s
$ time md5sum --text "${_IFL}"
1cfabd688427728794e7ae75dc93e84c enwiki-latest-pages-articles_20201013002000.103.xml

real 16m14.680s
user 3m19.256s
sys 1m30.488s
$ time sha1sum --text "${_IFL}"
e337572c1957a5a4d7625e3180e16f20e77749b1 enwiki-latest-pages-articles_20201013002000.103.xml

real 17m45.103s
user 7m29.988s
sys 1m29.540s
$ file --brief "${_IFL}"
HTML document, UTF-8 Unicode text, with very long lines
$

// __ file decompressed using common compress bz2 (decompressing somehow abruptly stopped)

_OFL="enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"

$ ls -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
-r--r--r-- 1 lbrtchx lbrtchx 2601 Oct 12 17:40 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ wc -l "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
41 enwiki-20141008-pages-articles-multistream_20201012174009.440.xml
$ cat "enwiki-20141008-pages-articles-multistream_20201012174009.440.xml"
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/ http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.25wmf1</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
      <namespace key="118" case="first-letter">Draft</namespace>
      <namespace key="119" case="first-letter">Draft talk</namespace>
      <namespace key="446" case="first-letter">Education Program</namespace>
      <namespace key="447" case="first-letter">Education Program talk</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace>
      <namespace key="711" case="first-letter">TimedText talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      <namespace key="2600" case="first-letter">Topic</namespace>
    </namespaces>
  </siteinfo>
$

// __ first 45 lines of decompressed file using Linux bzip2

_IFL="enwiki-20141008-pages-articles-multistream.xml"
head -n 45 "${_IFL}"

$ head -n 45 "${_IFL}"
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.9/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.9/ http://www.mediawiki.org/xml/export-0.9.xsd" version="0.9" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.25wmf1</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
      <namespace key="118" case="first-letter">Draft</namespace>
      <namespace key="119" case="first-letter">Draft talk</namespace>
      <namespace key="446" case="first-letter">Education Program</namespace>
      <namespace key="447" case="first-letter">Education Program talk</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace>
      <namespace key="711" case="first-letter">TimedText talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      <namespace key="2600" case="first-letter">Topic</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
$
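
One hypothesis that may be worth ruling out, given that the failing input is the "multistream" dump: such a file is a concatenation of many independent bzip2 streams, and the one-argument BZip2CompressorInputStream constructor stops after the first stream, while the two-argument constructor BZip2CompressorInputStream(InputStream, boolean decompressConcatenated) called with true keeps reading until the end of the input. The 2601 bytes and 41 lines written above would be consistent with only the first stream (the siteinfo header) being decompressed. The sketch below is a rough, library-free diagnostic under stated assumptions: it counts byte-aligned occurrences of "BZh", a level digit 1 to 9, and the block magic 0x314159265359, so a count above 1 suggests concatenated streams. The class and method names (Bz2StreamHeaderCounter, accept, countIn) are hypothetical and not part of Commons Compress, and empty streams are not counted.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper (not part of Commons Compress): counts
// byte-aligned bzip2 stream headers by sliding a 10-byte window over the input.
final class Bz2StreamHeaderCounter {
    // 48-bit block magic that follows "BZh" + level digit in a non-empty stream.
    private static final int[] BLOCK_MAGIC = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};

    private final int[] ring = new int[10]; // the last 10 bytes seen
    private long seen = 0;                  // total number of bytes fed so far
    private long count = 0;                 // number of header matches so far

    // Feed one byte; the window is checked once at least 10 bytes have arrived.
    void accept(int b) {
        ring[(int) (seen % 10)] = b & 0xFF;
        seen++;
        if (seen >= 10 && windowIsHeader()) {
            count++;
        }
    }

    long count() {
        return count;
    }

    private boolean windowIsHeader() {
        int start = (int) (seen % 10); // index of the oldest byte in the ring
        if (at(start, 0) != 'B' || at(start, 1) != 'Z' || at(start, 2) != 'h') {
            return false;
        }
        int level = at(start, 3);
        if (level < '1' || level > '9') {
            return false;
        }
        for (int i = 0; i < BLOCK_MAGIC.length; i++) {
            if (at(start, 4 + i) != BLOCK_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    private int at(int start, int offset) {
        return ring[(start + offset) % 10];
    }

    // Convenience for in-memory data.
    static long countIn(byte[] data) {
        Bz2StreamHeaderCounter counter = new Bz2StreamHeaderCounter();
        for (byte b : data) {
            counter.accept(b);
        }
        return counter.count();
    }

    // Usage: java Bz2StreamHeaderCounter <file.bz2>
    public static void main(String[] args) throws IOException {
        Bz2StreamHeaderCounter counter = new Bz2StreamHeaderCounter();
        byte[] chunk = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                for (int i = 0; i < n; i++) {
                    counter.accept(chunk[i]);
                }
            }
        }
        System.out.println(counter.count() + " candidate bzip2 stream header(s)");
    }
}
```

If the count for the .bz2 file is greater than 1, rerunning the copy loop with new BZip2CompressorInputStream(BIS, true) would be the natural next experiment. A chance match of the 10-byte pattern inside compressed data is possible in principle, so the count is an indication, not proof.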