There is the splittable gzip Hadoop input format; maybe someone could
extend that to support bgzip?
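
For what it's worth, the key piece such an extension would need is cheap
discovery of block boundaries: bgzip records each block's compressed size
(BSIZE) in a gzip extra-header subfield, so a reader can hop from block to
block without decompressing anything. A rough, hypothetical Python sketch of
that scan (illustrative only, not an existing codec):

import struct

def bgzf_block_offsets(path):
    """List the file offsets of BGZF blocks by following the BSIZE field."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        while True:
            header = f.read(12)          # fixed gzip header plus XLEN
            if len(header) < 12:
                break
            xlen = struct.unpack("<H", header[10:12])[0]
            extra = f.read(xlen)
            bsize = None
            i = 0
            while i + 4 <= len(extra):   # walk the extra-field subfields
                si1, si2 = extra[i], extra[i + 1]
                slen = struct.unpack("<H", extra[i + 2:i + 4])[0]
                if si1 == 66 and si2 == 67 and slen == 2:   # the 'BC' subfield
                    bsize = struct.unpack("<H", extra[i + 4:i + 6])[0]
                i += 4 + slen
            if bsize is None:
                raise ValueError("not a BGZF block at offset %d" % pos)
            offsets.append(pos)
            pos += bsize + 1             # BSIZE is the block size minus one
            f.seek(pos)
    return offsets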

On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>      Hello Chris,
>
>   Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but
> to start reading from somewhere other than the beginning of the file, you
> would need to use an index to tell you where the blocks start. Originally,
> a Tabix index was used and remains the most popular choice, although other
> types of indices also exist.
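>
>   A quick way to see both behaviors from Python (a hypothetical sketch,
> assuming pysam is installed and the file already has a Tabix index; paths
> are illustrative):
>
> import gzip
> import pysam
>
> # Whole-file decompression works with ordinary gzip tooling, because a
> # BGZF file is a valid multi-member gzip stream.
> with gzip.open("variants.vcf.gz", "rt") as f:
>     first_line = f.readline()
>
> # Random access needs the index: pysam uses the Tabix (.tbi) index to find
> # the blocks covering a region and starts decompressing there.
> tbx = pysam.TabixFile("variants.vcf.gz")
> for row in tbx.fetch("chr1", 100000, 200000):
>     print(row)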
>
>      Best, Oliver
>
> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth <cnaur...@apache.org> wrote:
>
>> Sorry, I misread that in the original email.
>>
>> This is my first time looking at bgzip. I see from the documentation that
>> it is putting some additional framing around gzip and producing a series of
>> small blocks, such that you can create an index of the file and decompress
>> individual blocks instead of the whole file. That's interesting, because it
>> could potentially support a splittable format. (Plain gzip isn't
>> splittable.)
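>>
>> As an illustration (a hypothetical sketch, assuming an index has already
>> told you where a block starts): because each block is a complete gzip
>> member, you can seek straight to it and decompress just that block:
>>
>> import zlib
>>
>> def read_one_bgzf_block(path, block_offset):
>>     # BGZF blocks are at most 64 KiB compressed, so reading 64 KiB from
>>     # the block's start is guaranteed to cover the whole block.
>>     with open(path, "rb") as f:
>>         f.seek(block_offset)
>>         chunk = f.read(65536)
>>     # wbits=31 tells zlib to expect a gzip wrapper; decompressobj stops at
>>     # the end of the first gzip member and ignores the rest of the chunk.
>>     d = zlib.decompressobj(wbits=31)
>>     return d.decompress(chunk)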
>>
>> I also noticed that it states it is "compatible with" gzip. I tried a
>> basic test of running bgzip on a file, which produced a .gz output file,
>> and then running the same spark.read.text code sample from earlier. Sure
>> enough, I was able to read the data. This implies there is at least some
>> basic compatibility, so that you could read files created by bgzip.
>> However, that read would not be optimized in any way to take advantage of
>> an index file. There also would not be any way to produce bgzip-style
>> output like in the df.write.option code sample. Achieving either of those
>> would require writing a custom Hadoop compression codec that integrates
>> more closely with the data format.
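>>
>> For concreteness, the read side of that test looked roughly like this
>> (paths are illustrative):
>>
>> # The input was compressed with bgzip, which wrote shakespeare.txt.gz;
>> # the .gz extension alone is enough for the Hadoop gzip codec to kick in.
>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare.txt.gz")
>> df.show(10)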
>>
>> Chris Nauroth
>>
>>
>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>      Hello,
>>>
>>>   Thanks for the response, but I mean compressed with bgzip
>>> <http://www.htslib.org/doc/bgzip.html>, not bzip2.
>>>
>>>      Best, Oliver
>>>
>>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org>
>>> wrote:
>>>
>>>> Hello Oliver,
>>>>
>>>> Yes, Spark makes this possible using the Hadoop compression codecs and
>>>> the Hadoop-compatible FileSystem interface [1]. Here is an example of
>>>> reading:
>>>>
>>>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
>>>> df.show(10)
>>>>
>>>> This is using a test data set of the complete works of Shakespeare,
>>>> stored as text and compressed to a single .bz2 file. This code sample
>>>> didn't need to do anything special to declare that it's working with bzip2
>>>> compression, because the Hadoop compression codecs detect that the file has
>>>> a .bz2 extension and automatically assume it needs to be decompressed
>>>> before presenting it to our code in the DataFrame as text.
>>>>
>>>> On the write side, if you wanted to declare a particular kind of output
>>>> compression, you can do it with a write option like this:
>>>>
>>>> df.write.option("compression",
>>>> "org.apache.hadoop.io.compress.BZip2Codec").text("gs://<GCS
>>>> bucket>/data/shakespeare-bz2-copy")
>>>>
>>>> This writes the contents of the DataFrame as text, compressed to .bz2
>>>> files in the destination path.
>>>>
>>>> My example is testing with a GCS bucket (scheme "gs:"), but you can
>>>> also switch the Hadoop file system interface to target other file systems
>>>> like S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure
>>>> the S3AFileSystem, including how to pass credentials for access to the S3
>>>> bucket [2].
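>>>>
>>>> One common way to supply those credentials, assuming static access keys
>>>> (a sketch only; the Hadoop docs in [2] cover other credential providers):
>>>>
>>>> from pyspark.sql import SparkSession
>>>>
>>>> spark = (SparkSession.builder
>>>>     .config("spark.hadoop.fs.s3a.access.key", "<access key>")
>>>>     .config("spark.hadoop.fs.s3a.secret.key", "<secret key>")
>>>>     .getOrCreate())
>>>> df = spark.read.text("s3a://<bucket>/data/shakespeare-bz2")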
>>>>
>>>> Note that for big data use cases, other compression codecs like Snappy
>>>> are generally preferred for greater efficiency. (Of course, we're not
>>>> always in complete control of the data formats we're given, so the support
>>>> for bz2 is there.)
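>>>>
>>>> For example, Spark's Parquet writer uses Snappy by default, and you can
>>>> also request it explicitly (sketch):
>>>>
>>>> df.write.option("compression", "snappy").parquet("gs://<GCS bucket>/data/shakespeare-parquet")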
>>>>
>>>> [1]
>>>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>>>> [2]
>>>> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>>>
>>>> Chris Nauroth
>>>>
>>>>
>>>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
>>>> oliv...@broadinstitute.org> wrote:
>>>>
>>>>>
>>>>>      Hello,
>>>>>
>>>>>   Is it possible to read/write a DataFrame from/to a set of bgzipped
>>>>> files? Can it read from/write to AWS S3? Thanks!
>>>>>
>>>>>      Best, Oliver
>>>>>
>>>>> --
>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>> <http://www.broadinstitute.org/>
>>>>>
>>>>
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
