Re: (Ab)using parquet files on S3 storage for a huge logging database

Bill Glennon Wed, 19 Sep 2018 10:51:52 -0700

Also, you may want to take a look at https://aws.amazon.com/athena/.


Thanks,
Bill

On Wed, Sep 19, 2018 at 1:43 PM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Gerlando,
>
> I believe AWS offers an entire logging pipeline. If you want
> something quick, perhaps look into that offering.
>
> What you describe is pretty much the classic approach to log aggregation:
> partition data, gather data incrementally, then later consolidate. A while
> back, someone invented the term "lambda architecture" for this idea. You
> should be able to find examples of how others have done something similar.
>
> Drill can scan directories of files. So, in each of your (source-date)
> bucket directories, you can have multiple files. If you receive data, say, every 5
> or 10 minutes, you can just create a separate file for each new drop of
> data. You'll end up with many files, but you can query the data as it
> arrives.
>
> Then, later, say once per day, you can consolidate the files into a few
> big files. The only trick is the race condition of doing the consolidation
> while running queries. Not sure how to do that on S3, since you can't
> exploit rename operations as you can on Linux. Anyone have suggestions for
> this step?
>
> Thanks,
> - Paul
>
>
>
>     On Wednesday, September 19, 2018, 6:23:13 AM PDT, Gerlando Falauto <
> gerlando.fala...@gmail.com> wrote:
>
>  Hi,
>
> I'm looking for a way to store huge amounts of logging data in the cloud
> from about 100 different data sources, each producing about 50MB/day (so
> it's something like 5GB/day).
> The target storage would be S3 object storage, for cost-efficiency
> reasons.
> I would like to be able to store (i.e. append-like) data in realtime, and
> retrieve data based on time frame and data source with fast access. I was
> thinking of partitioning data based on data source and calendar day, so as to
> have one file per day, per data source, each 50MB.
>
> I played around with pyarrow and parquet (using s3fs), and came across the
> following limitations:
>
> 1) I found no way to append to existing files. I believe that's a
> limitation of S3, but it could be worked around by using datasets
> instead. In principle, I believe I could also trigger some daily job which
> coalesces today's data into a single file, if having too much
> fragmentation causes any disturbance. Would that make any sense?
>
> 2) When reading, if I'm only interested in a small portion of the data (for
> instance, based on a timestamp field), I obviously wouldn't want to have to
> read (i.e. download) the whole file. I believe Parquet was designed to
> handle huge amounts of data with relatively fast access. Yet I fail to
> understand if there's some way to allow for random access, particularly
> when dealing with a file stored within S3.
> The following code snippet refers to a 150MB dataset composed of 1000
> rowgroups of 150KB each. I was expecting it to run very fast, yet it
> apparently downloads the whole file (pyarrow 0.9.0):
>
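> import pyarrow.parquet as pq
> import s3fs
>
> # access_key, secret_key, client_kwargs and bucket_uri are defined elsewhere
> fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
>                        client_kwargs=client_kwargs)
> with fs.open(bucket_uri) as f:
>     pf = pq.ParquetFile(f)
>     print(pf.num_row_groups)  # yields 1000
>     pf.read_row_group(1)  # expected to fetch only this one row group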
>
> 3) I was also expecting to be able to perform some sort of query, but I'm
> also failing to see how to specify index columns or such.
>
> What am I missing? Did I get it all wrong?
>
> Thank you!
> Gerlando
>
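The partitioning scheme discussed in the thread (one directory per data source and calendar day, one new file per drop of data) can be sketched roughly as below. This is only an illustration: the `logs/` root, the `source=.../date=...` naming, and the function names `drop_key` and `prefixes_for` are assumptions made for the example, not something Drill or pyarrow requires.

```python
from datetime import date, datetime, timedelta, timezone
from typing import List


def drop_key(source: str, ts: datetime) -> str:
    """Object key for one incoming drop of data (hypothetical layout).

    `ts` must be timezone-aware; it is normalized to UTC so that a drop
    always lands in the same calendar-day directory.
    """
    utc = ts.astimezone(timezone.utc)
    return (
        f"logs/source={source}/date={utc.date().isoformat()}/"
        f"drop-{utc.strftime('%H%M%S')}.parquet"
    )


def prefixes_for(source: str, start: date, end: date) -> List[str]:
    """Directory prefixes to scan for one source and an inclusive date range."""
    days = (end - start).days
    return [
        f"logs/source={source}/date={(start + timedelta(days=i)).isoformat()}/"
        for i in range(days + 1)
    ]


if __name__ == "__main__":
    when = datetime(2018, 9, 19, 13, 25, 0, tzinfo=timezone.utc)
    print(drop_key("sensor042", when))
    print(prefixes_for("sensor042", date(2018, 9, 18), date(2018, 9, 19)))
```

With a layout like this, a query for one source and a time frame only has to list and read the objects under a handful of prefixes, instead of the whole bucket.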

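On the consolidation race raised in the thread: one pattern sometimes used on object stores without rename is to keep a small manifest object per partition that lists the "live" files, write the merged file under a new key, and only then overwrite the manifest. The sketch below simulates this with an in-memory dict standing in for the bucket. Everything in it (the `_manifest.json` name, the helper names, the single-writer assumption) is hypothetical, and it only helps if readers consult the manifest rather than scanning the directory, which a plain directory scan would not do.

```python
import json
from typing import Callable, Dict, List

# In-memory stand-in for a bucket (key -> bytes). Purely illustrative;
# a real store would use PUT/GET requests instead of dict access.
Bucket = Dict[str, bytes]


def manifest_key(partition: str) -> str:
    return partition + "_manifest.json"


def live_files(bucket: Bucket, partition: str) -> List[str]:
    """Keys a reader should scan: whatever the manifest lists right now."""
    raw = bucket.get(manifest_key(partition))
    return json.loads(raw)["files"] if raw else []


def publish(bucket: Bucket, partition: str, files: List[str]) -> None:
    """Replace the manifest in a single write."""
    body = json.dumps({"files": sorted(files)}).encode()
    bucket[manifest_key(partition)] = body


def add_drop(bucket: Bucket, partition: str, name: str, data: bytes) -> None:
    """Upload a new data file, then make it visible via the manifest.

    Assumes a single writer per partition (read-modify-write of the manifest).
    """
    key = partition + name
    bucket[key] = data
    publish(bucket, partition, live_files(bucket, partition) + [key])


def consolidate(bucket: Bucket, partition: str, merged_name: str,
                merge: Callable[[List[bytes]], bytes]) -> List[str]:
    """Write the merged file under a new key, then switch the manifest.

    Returns the superseded keys; the caller deletes them later, once
    queries that started against the old manifest have finished.
    """
    old = live_files(bucket, partition)
    merged_key = partition + merged_name
    bucket[merged_key] = merge([bucket[k] for k in old])
    publish(bucket, partition, [merged_key])
    return old
```

Readers that fetched the manifest before the switch keep reading the small files, which still exist; readers that fetch it afterwards see only the merged file. Deleting the old files is deferred so that neither group sees a half-finished state.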