I have not had a chance to look into this but at least wanted to share: Log Search and Analytics Hub on Amazon S3.
https://chaossearch.io/
You can listen to a podcast about it if interested:
https://www.dataengineeringpodcast.com/chaos-search-with-pete-cheslock-and-thomas-hazel-episode-47/

Thanks,
Bill Glennon

On Wed, Sep 19, 2018 at 10:26 AM Brian Bowman <brian.bow...@sas.com> wrote:

> Gerlando is correct that S3 objects, once created, are immutable. They
> cannot be updated in place, appended to, or even renamed. However, S3
> supports seeking to offsets within the object being read. The challenge is
> knowing where to read within the S3 object, which, to perform well, will
> require metadata that can be derived with minimal I/O prior to
> seeking/reading the needed parts of the S3 object.
>
> -Brian
>
> On 9/19/18, 9:23 AM, "Gerlando Falauto" <gerlando.fala...@gmail.com> wrote:
>
> Hi,
>
> I'm looking for a way to store huge amounts of logging data in the cloud
> from about 100 different data sources, each producing about 50MB/day (so
> it's something like 5GB/day).
> The target storage would be an S3 object store, for cost-efficiency
> reasons.
> I would like to be able to store (i.e. append) data in real time, and
> retrieve data based on time frame and data source with fast access. I was
> thinking of partitioning the data by data source and calendar day, so as
> to have one file per day per data source, each about 50MB.
>
> I played around with pyarrow and Parquet (using s3fs) and came across the
> following limitations:
>
> 1) I found no way to append to existing files. I believe that's a
> limitation of S3, but it could be worked around by using datasets instead.
> In principle, I believe I could also trigger some daily job which
> coalesces today's data into a single file, if having too much
> fragmentation causes any disturbance. Would that make any sense?
>
> 2) When reading, if I'm only interested in a small portion of the data
> (for instance, based on a timestamp field), I obviously wouldn't want to
> have to read (i.e. download) the whole file. I believe Parquet was
> designed to handle huge amounts of data with relatively fast access. Yet I
> fail to understand whether there's some way to allow for random access,
> particularly when dealing with a file stored within S3.
> The following code snippet refers to a 150MB dataset composed of 1000 row
> groups of 150KB each. I was expecting it to run very fast, yet it
> apparently downloads the whole file (pyarrow 0.9.0):
>
> import pyarrow.parquet as pq
> import s3fs
>
> fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
>                        client_kwargs=client_kwargs)
> with fs.open(bucket_uri) as f:
>     pf = pq.ParquetFile(f)
>     print(pf.num_row_groups)  # yields 1000
>     pf.read_row_group(1)
>
> 3) I was also expecting to be able to perform some sort of query, but I'm
> also failing to see how to specify index columns or such.
>
> What am I missing? Did I get it all wrong?
>
> Thank you!
> Gerlando
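
Regarding 1) in the quoted mail: below is a rough sketch of the "append via datasets" idea, in case it helps. It assumes a reasonably recent pyarrow with pyarrow.parquet.write_to_dataset; the bucket path, column names, and credential variables are placeholders, not anything from the original setup.

    # Hypothetical sketch: each incoming batch becomes its own Parquet file
    # under source=<id>/day=<date>/ partitions; a daily job can later
    # coalesce one partition's files if fragmentation becomes a problem.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)  # placeholders

    batch = pd.DataFrame({
        "timestamp": pd.date_range("2018-09-19 10:00", periods=3, freq="min"),
        "source": ["sensor-1"] * 3,
        "day": ["2018-09-19"] * 3,
        "message": ["a", "b", "c"],
    })
    table = pa.Table.from_pandas(batch, preserve_index=False)

    # Each call writes a new file into the partition directories, which is
    # effectively an append as far as readers of the dataset are concerned.
    pq.write_to_dataset(table,
                        root_path="my-bucket/logs",
                        partition_cols=["source", "day"],
                        filesystem=fs)

Readers then see each partition as a growing collection of files rather than a single growing object, and the daily coalescing job mentioned in the mail would just rewrite one day's directory.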
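
Regarding 2) and 3): a hedged sketch of row-group-level selection using the footer metadata, assuming the writer stored column statistics and that the installed pyarrow exposes them for the timestamp column; lo and hi are placeholders for the query window. How much extra data actually gets transferred still depends on s3fs's block size and read-ahead, so this is not guaranteed to avoid large downloads on older versions.

    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)  # placeholders

    with fs.open(bucket_uri) as f:
        pf = pq.ParquetFile(f)
        md = pf.metadata

        # Locate the timestamp column within the row-group metadata.
        rg0 = md.row_group(0)
        ts_col = next(j for j in range(rg0.num_columns)
                      if rg0.column(j).path_in_schema == "timestamp")

        # Keep only row groups whose [min, max] statistics overlap the
        # query window [lo, hi]; row groups without statistics must be read.
        wanted = []
        for i in range(md.num_row_groups):
            stats = md.row_group(i).column(ts_col).statistics
            if stats is None or (stats.max >= lo and stats.min <= hi):
                wanted.append(i)

        result = pa.concat_tables([pf.read_row_group(i) for i in wanted])

As for 3), there is no secondary index as such: the partition keys (e.g. source and day above) act as the coarse "index", and newer pyarrow releases can prune partitions via the filters argument of pq.ParquetDataset before any row-group statistics come into play.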