Hi Gerlando,

I believe AWS offers an entire logging pipeline as a managed service. If you 
want something quick, perhaps look into that offering.

What you describe is pretty much the classic approach to log aggregation: 
partition the data, gather it incrementally, then consolidate it later. A while 
back, someone coined the term "lambda architecture" for this idea, so you 
should be able to find examples of how others have done something similar.

Drill can scan directories of files, so each of your (source, date) 
directories can hold multiple files. If you receive data every 5 or 10 
minutes, say, you can simply write a separate file for each new drop of data. 
You'll end up with many files, but you can query the data as it arrives.
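
Just to make that concrete, a per-drop write could look roughly like this 
(the bucket name, the source=/date= path layout, and the incoming batch 
table are only placeholders, not a tested recipe):

import datetime

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)

def write_drop(source_id, batch):
    # Write one incoming batch (a pyarrow.Table) as its own Parquet file
    # under <bucket>/source=<id>/date=<day>/ so Drill can scan the directory.
    now = datetime.datetime.utcnow()
    path = "my-log-bucket/source={}/date={}/{}.parquet".format(
        source_id, now.date().isoformat(), now.strftime("%H%M%S"))
    with fs.open(path, "wb") as f:
        pq.write_table(batch, f)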

Then, later, say once per day, you can consolidate the files into a few big 
files. The only trick is the race condition between consolidation and any 
queries running at the same time. I'm not sure how to handle that on S3, since 
you can't exploit atomic rename operations as you can on Linux. Anyone have 
suggestions for this step?
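
The consolidation itself is simple enough; a rough sketch with pyarrow might 
look like the following (paths are placeholders, and it deliberately leaves 
the swap-over problem above unsolved, since the small files would still have 
to be deleted once no query can be reading them):

import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)

def consolidate_day(source_id, day):
    # Merge all small files for one (source, day) partition into one file.
    prefix = "my-log-bucket/source={}/date={}".format(source_id, day)
    small = [p for p in fs.ls(prefix)
             if p.endswith(".parquet") and not p.endswith("consolidated.parquet")]
    tables = []
    for path in small:
        with fs.open(path, "rb") as f:
            tables.append(pq.read_table(f))
    merged = pa.concat_tables(tables)
    # Write the consolidated copy under a new key; cleaning up the small
    # files without racing readers is the open question above.
    with fs.open(prefix + "/consolidated.parquet", "wb") as f:
        pq.write_table(merged, f)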

Thanks,
- Paul

 

    On Wednesday, September 19, 2018, 6:23:13 AM PDT, Gerlando Falauto 
<gerlando.fala...@gmail.com> wrote:  
 
 Hi,

I'm looking for a way to store huge amounts of logging data in the cloud
from about 100 different data sources, each producing about 50MB/day (so
it's something like 5GB/day).
The target storage would be an S3 object storage for cost-efficiency
reasons.
I would like to be able to store (i.e. append) data in real time, and to
retrieve data by time frame and data source with fast access. I was
thinking of partitioning the data by data source and calendar day, so as
to have one file per day per data source, each about 50MB.

I played around with pyarrow and parquet (using s3fs), and came across the
following limitations:

1) I found no way to append to existing files. I believe that's a
limitation of S3, but it could be worked around by using datasets
instead. In principle, I could also trigger a daily job which coalesces
today's data into a single file, if too much fragmentation causes
problems. Would that make any sense?
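
Just to show what I mean by using datasets, I was picturing something
along these lines, where every batch becomes a new file under a
partitioned directory tree (the root path and column names are made up,
and I'm assuming write_to_dataset works with an s3fs filesystem):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
                       client_kwargs=client_kwargs)

# Each incoming pyarrow.Table lands as a new file under
# source=<id>/date=<day>/ instead of appending to an existing object.
pq.write_to_dataset(batch_table, "my-log-bucket/logs",
                    partition_cols=["source", "date"], filesystem=fs)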

2) When reading, if I'm only interested in a small portion of the data (for
instance, based on a timestamp field), I obviously wouldn't want to have to
read (i.e. download) the whole file. I believe Parquet was designed to
handle huge amounts of data with relatively fast access. Yet I fail to
see whether there's a way to get random access, particularly when
dealing with a file stored on S3.
The following code snippet refers to a 150MB dataset composed of 1000
row groups of 150KB each. I was expecting it to run very fast, yet it
apparently downloads the whole file (pyarrow 0.9.0):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
                       client_kwargs=client_kwargs)
with fs.open(bucket_uri) as f:
    pf = pq.ParquetFile(f)
    print(pf.num_row_groups)  # yields 1000
    pf.read_row_group(1)      # expected ~150KB of traffic, pulls the whole file
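
What I was hoping for is to read only the footer, pick the row groups
whose statistics overlap the time window, and fetch just those, roughly
as below (the "ts" column and the t_start/t_end bounds are placeholders,
and I suspect read_row_groups and per-column statistics need a newer
pyarrow than 0.9.0):

with fs.open(bucket_uri) as f:
    pf = pq.ParquetFile(f)
    wanted = []
    for i in range(pf.num_row_groups):
        rg = pf.metadata.row_group(i)
        for j in range(rg.num_columns):
            col = rg.column(j)
            # keep row groups whose "ts" min/max overlap [t_start, t_end]
            if col.path_in_schema == "ts" and col.statistics is not None:
                if col.statistics.max >= t_start and col.statistics.min <= t_end:
                    wanted.append(i)
    table = pf.read_row_groups(wanted)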

3) I was also expecting to be able to perform some sort of query, but I
fail to see how to specify index columns or anything similar.
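
Concretely, I was hoping for something like predicate pushdown on the
partition columns, along these lines (I gather newer pyarrow releases
accept a filters= argument on ParquetDataset, though I don't think 0.9.0
does; the column names are again made up):

dataset = pq.ParquetDataset("my-log-bucket/logs", filesystem=fs,
                            filters=[("source", "=", "sensor-42"),
                                     ("date", "=", "2018-09-19")])
table = dataset.read()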

What am I missing? Did I get it all wrong?

Thank you!
Gerlando
  
