There is the splittable gzip Hadoop input format; maybe someone could extend that to support bgzip?
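For reference, a minimal sketch of how that existing codec (the third-party nl.basjes.hadoop.io.compress.SplittableGzipCodec, not part of Hadoop itself) can be registered with a Spark session; a bgzip-aware variant would presumably be wired up the same way. The bucket path is a placeholder, and the codec jar has to be on the classpath:

# Hedged sketch: register the third-party splittable gzip codec with the
# Hadoop configuration so large .gz files can be read in parallel splits.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("splittable-gzip-read")
    .config(
        "spark.hadoop.io.compression.codecs",
        "nl.basjes.hadoop.io.compress.SplittableGzipCodec",
    )
    .getOrCreate()
)

# With the codec registered, a single large .gz file is no longer read by one task.
df = spark.read.text("s3a://<bucket>/data/large-file.gz")
df.show(10)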
On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

> Hello Chris,
>
> Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but to start reading from somewhere other than the beginning of the file, you would need an index to tell you where the blocks start. Originally, a Tabix index was used, and it is still the popular choice, although other types of indices also exist.
>
> Best, Oliver
>
> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth <cnaur...@apache.org> wrote:
>
>> Sorry, I misread that in the original email.
>>
>> This is my first time looking at bgzip. I see from the documentation that it puts some additional framing around gzip and produces a series of small blocks, such that you can create an index of the file and decompress individual blocks instead of the whole file. That's interesting, because it could potentially support a splittable format. (Plain gzip isn't splittable.)
>>
>> I also noticed that it states it is "compatible with" gzip. I tried a basic test of running bgzip on a file, which produced a .gz output file, and then running the same spark.read.text code sample from earlier. Sure enough, I was able to read the data. This implies there is at least some basic compatibility, so you could read files created by bgzip. However, that read would not be optimized in any way to take advantage of an index file. There also would not be any way to produce bgzip-style output like in the df.write.option code sample. Achieving either of those would require writing a custom Hadoop compression codec to integrate more closely with the data format.
>>
>> Chris Nauroth
>>
>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>>
>>> Hello,
>>>
>>> Thanks for the response, but I mean compressed with bgzip <http://www.htslib.org/doc/bgzip.html>, not bzip2.
>>>
>>> Best, Oliver
>>>
>>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org> wrote:
>>>
>>>> Hello Oliver,
>>>>
>>>> Yes, Spark makes this possible using the Hadoop compression codecs and the Hadoop-compatible FileSystem interface [1]. Here is an example of reading:
>>>>
>>>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
>>>> df.show(10)
>>>>
>>>> This is using a test data set of the complete works of Shakespeare, stored as text and compressed to a single .bz2 file. The code sample didn't need to do anything special to declare that it's working with bzip2 compression, because the Hadoop compression codecs detect the .bz2 extension and automatically decompress the file before presenting it to our code in the DataFrame as text.
>>>>
>>>> On the write side, if you want to declare a particular kind of output compression, you can do it with a write option like this:
>>>>
>>>> df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec").text("gs://<GCS bucket>/data/shakespeare-bz2-copy")
>>>>
>>>> This writes the contents of the DataFrame, stored as text and compressed to .bz2 files in the destination path.
>>>>
>>>> My example is testing with a GCS bucket (scheme "gs:"), but you can also switch the Hadoop file system interface to target other file systems like S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the S3AFileSystem, including how to pass credentials for access to the S3 bucket [2].
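A minimal sketch of what that S3A setup can look like in PySpark, assuming the hadoop-aws connector and AWS SDK are on the classpath; the credential values and bucket name are placeholders, and in practice credentials often come from the environment or an instance profile instead of hard-coded keys:

# Hedged sketch: pass placeholder S3A credentials through the Hadoop
# configuration and read the same text data set from an S3 bucket.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-read")
    .config("spark.hadoop.fs.s3a.access.key", "<access key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret key>")
    .getOrCreate()
)

df = spark.read.text("s3a://<S3 bucket>/data/shakespeare-bz2")
df.show(10)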
>>>> Note that for big data use cases, other compression codecs like Snappy are generally preferred for greater efficiency. (Of course, we're not always in complete control of the data formats we're given, so the support for bz2 is there.)
>>>>
>>>> [1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>>>> [2] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>>>
>>>> Chris Nauroth
>>>>
>>>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Is it possible to read/write a DataFrame from/to a set of bgzipped files? Can it read from/write to AWS S3? Thanks!
>>>>>
>>>>> Best, Oliver
>>>>>
>>>>> --
>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
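As a concrete illustration of the Snappy point quoted above, a minimal sketch that assumes the df from Chris's read example; the output path is a placeholder, and Snappy is already the default Parquet codec, so the option is shown only for explicitness:

# Hedged sketch: write the DataFrame as Parquet with Snappy compression,
# the kind of columnar-format-plus-Snappy combination preferred for big data use cases.
df.write.option("compression", "snappy").parquet("gs://<GCS bucket>/data/shakespeare-parquet")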