The effect of a rename can be approximated by keeping a small inventory file that is updated atomically.
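Roughly what I have in mind (an untested sketch; the bucket/key names are
made up): readers get the list of live data files from a tiny manifest
object instead of listing the bucket. Since S3 replaces a whole object
atomically on PUT, a reader sees either the old list or the new one, never
a partial state, so a consolidation job can write its big files, flip the
manifest, and only then delete the small fragments.

import json
import s3fs

fs = s3fs.S3FileSystem()  # credentials as in the snippet further down

MANIFEST = "my-logs/manifest.json"  # hypothetical bucket/key

def read_manifest():
    # the list of files a reader should consider "live"
    with fs.open(MANIFEST, "rb") as f:
        return json.loads(f.read().decode("utf-8"))

def publish(new_files, replaced_files):
    # swap consolidated files in and the small fragments out in one step
    manifest = read_manifest()
    live = (set(manifest["files"]) - set(replaced_files)) | set(new_files)
    body = json.dumps({"files": sorted(live)}).encode("utf-8")
    with fs.open(MANIFEST, "wb") as f:  # single PUT -> atomic replace
        f.write(body)
    # delete the fragments only after the manifest no longer references them
    for key in replaced_files:
        fs.rm(key)

Queries would then open the files named in the manifest (e.g.
pq.ParquetDataset(read_manifest()["files"], filesystem=fs)) rather than
globbing the bucket. A rough sketch of the consolidation job itself is at
the very bottom of this mail.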
Having real file semantics is sooo much nicer, though.

On Wed, Sep 19, 2018 at 1:51 PM Bill Glennon <wglen...@gmail.com> wrote:

> Also, may want to take a look at https://aws.amazon.com/athena/.
>
> Thanks,
> Bill
>
> On Wed, Sep 19, 2018 at 1:43 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:
>
> > Hi Gerlando,
> >
> > I believe AWS has an entire logging pipeline they offer. If you want
> > something quick, perhaps look into that offering.
> >
> > What you describe is pretty much the classic approach to log aggregation:
> > partition data, gather data incrementally, then later consolidate. A while
> > back, someone invented the term "lambda architecture" for this idea. You
> > should be able to find examples of how others have done something similar.
> >
> > Drill can scan directories of files. So, in your bucket's (source, date)
> > directories, you can have multiple files. If you receive data, say, every 5
> > or 10 minutes, you can just create a separate file for each new drop of
> > data. You'll end up with many files, but you can query the data as it
> > arrives.
> >
> > Then, later, say once per day, you can consolidate the files into a few
> > big files. The only trick is the race condition of doing the consolidation
> > while running queries. Not sure how to do that on S3, since you can't
> > exploit rename operations as you can on Linux. Anyone have suggestions for
> > this step?
> >
> > Thanks,
> > - Paul
> >
> >
> > On Wednesday, September 19, 2018, 6:23:13 AM PDT, Gerlando Falauto <
> > gerlando.fala...@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm looking for a way to store huge amounts of logging data in the cloud
> > from about 100 different data sources, each producing about 50MB/day (so
> > it's something like 5GB/day in total).
> > The target storage would be an S3 object store, for cost-efficiency reasons.
> > I would like to be able to store (i.e. append) data in real time, and
> > retrieve data based on time frame and data source with fast access. I was
> > thinking of partitioning data by data source and calendar day, so as to
> > have one file per day per data source, each around 50MB.
> >
> > I played around with pyarrow and parquet (using s3fs), and came across the
> > following limitations:
> >
> > 1) I found no way to append to existing files. I believe that's a
> > limitation of S3, but it could be worked around by using datasets instead.
> > In principle, I believe I could also trigger some daily job which
> > coalesces today's data into a single file, if having too much
> > fragmentation causes any disturbance. Would that make any sense?
> >
> > 2) When reading, if I'm only interested in a small portion of the data
> > (for instance, based on a timestamp field), I obviously wouldn't want to
> > have to read (i.e. download) the whole file. I believe Parquet was
> > designed to handle huge amounts of data with relatively fast access. Yet I
> > fail to understand whether there's some way to allow for random access,
> > particularly when dealing with a file stored in S3.
> > The following code snippet refers to a 150MB dataset composed of 1000
> > row groups of 150KB each.
> > I was expecting it to run very fast, yet it apparently downloads the
> > whole file (pyarrow 0.9.0):
> >
> > import pyarrow.parquet as pq
> > import s3fs
> >
> > fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
> >                        client_kwargs=client_kwargs)
> > with fs.open(bucket_uri) as f:
> >     pf = pq.ParquetFile(f)
> >     print(pf.num_row_groups)  # yields 1000
> >     pf.read_row_group(1)
> >
> > 3) I was also expecting to be able to perform some sort of query, but I'm
> > also failing to see how to specify index columns or such.
> >
> > What am I missing? Did I get it all wrong?
> >
> > Thank you!
> > Gerlando
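To make the "coalesce once per day" step discussed above a bit more
concrete, here is a rough, untested sketch. The source=/date= partition
layout and the file names are assumptions for illustration, not anything
pyarrow or Drill requires; the race condition Paul mentions is why the
inventory file sketched at the top gets flipped before the fragments are
deleted.

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()  # credentials as in Gerlando's snippet

def consolidate(bucket, source, day):
    # e.g. my-logs/source=sensor42/date=2018-09-19/
    prefix = "{}/source={}/date={}".format(bucket, source, day)
    fragments = [p for p in fs.glob(prefix + "/*.parquet")
                 if not p.endswith("consolidated.parquet")]
    if len(fragments) <= 1:
        return None, fragments
    # read all of the day's small drops as one table
    table = pq.ParquetDataset(fragments, filesystem=fs).read()
    # write them back as a single file with reasonably large row groups
    out = prefix + "/consolidated.parquet"
    with fs.open(out, "wb") as f:
        pq.write_table(table, f, row_group_size=500000)
    # don't delete the fragments here: first point the inventory file at
    # the consolidated file, then remove them once no query can see them
    return out, fragments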