I have not had a chance to look into this but at least wanted to share: Log Search and Analytics Hub on Amazon S3.
https://chaossearch.io/
You can listen to a podcast about it if interested:
https://www.dataengineeringpodcast.com/chaos-search-with-pete-cheslock-and-thomas-hazel-episode-47/

Thanks,
Bill Glennon

On Wed, Sep 19, 2018 at 10:26 AM Brian Bowman <brian.bow...@sas.com> wrote:

> Gerlando is correct that S3 objects, once created, are immutable. They
> cannot be updated in place, appended to, or even renamed. However, S3
> supports seeking to offsets within the object being read. The challenge is
> knowing where to read within the S3 object, which, to perform well, will
> require metadata that can be derived with minimal I/O prior to
> seeking/reading the needed parts of the S3 object.
>
> -Brian
>
> On 9/19/18, 9:23 AM, "Gerlando Falauto" <gerlando.fala...@gmail.com> wrote:
>
> Hi,
>
> I'm looking for a way to store huge amounts of logging data in the cloud
> from about 100 different data sources, each producing about 50MB/day (so
> it's something like 5GB/day).
> The target storage would be an S3 object store, for cost-efficiency
> reasons.
> I would like to be able to store (i.e. append) data in real time, and
> retrieve data based on time frame and data source with fast access. I was
> thinking of partitioning the data by data source and calendar day, so as
> to have one file per day per data source, each about 50MB.
>
> I played around with pyarrow and Parquet (using s3fs) and came across the
> following limitations:
>
> 1) I found no way to append to existing files. I believe that's a
> limitation of S3, but it could be worked around by using datasets instead.
> In principle, I believe I could also trigger some daily job which
> coalesces today's data into a single file, if having too much
> fragmentation causes any disturbance. Would that make any sense?
>
> 2) When reading, if I'm only interested in a small portion of the data
> (for instance, based on a timestamp field), I obviously wouldn't want to
> have to read (i.e. download) the whole file. I believe Parquet was
> designed to handle huge amounts of data with relatively fast access. Yet I
> fail to understand whether there's some way to allow for random access,
> particularly when dealing with a file stored within S3.
> The following code snippet refers to a 150MB dataset composed of 1000 row
> groups of 150KB each. I was expecting it to run very fast, yet it
> apparently downloads the whole file (pyarrow 0.9.0):
>
> import pyarrow.parquet as pq
> import s3fs
>
> fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
>                        client_kwargs=client_kwargs)
> with fs.open(bucket_uri) as f:
>     pf = pq.ParquetFile(f)
>     print(pf.num_row_groups)  # yields 1000
>     pf.read_row_group(1)
>
> 3) I was also expecting to be able to perform some sort of query, but I'm
> also failing to see how to specify index columns or such.
>
> What am I missing? Did I get it all wrong?
>
> Thank you!
> Gerlando
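
Regarding 1) in the quoted mail: below is a rough sketch of the "append via datasets" idea, in case it helps. It assumes a reasonably recent pyarrow with pyarrow.parquet.write_to_dataset; the bucket path, column names, and credential variables are placeholders, not anything from the original setup.

    # Hypothetical sketch: each incoming batch becomes its own Parquet file
    # under source=<id>/day=<date>/ partitions; a daily job can later
    # coalesce one partition's files if fragmentation becomes a problem.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)  # placeholders

    batch = pd.DataFrame({
        "timestamp": pd.date_range("2018-09-19 10:00", periods=3, freq="min"),
        "source": ["sensor-1"] * 3,
        "day": ["2018-09-19"] * 3,
        "message": ["a", "b", "c"],
    })
    table = pa.Table.from_pandas(batch, preserve_index=False)

    # Each call writes a new file into the partition directories, which is
    # effectively an append as far as readers of the dataset are concerned.
    pq.write_to_dataset(table,
                        root_path="my-bucket/logs",
                        partition_cols=["source", "day"],
                        filesystem=fs)

Readers then see each partition as a growing collection of files rather than a single growing object, and the daily coalescing job mentioned in the mail would just rewrite one day's directory.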
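
Regarding 2) and 3): a hedged sketch of row-group-level selection using the footer metadata, assuming the writer stored column statistics and that the installed pyarrow exposes them for the timestamp column; lo and hi are placeholders for the query window. How much extra data actually gets transferred still depends on s3fs's block size and read-ahead, so this is not guaranteed to avoid large downloads on older versions.

    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)  # placeholders

    with fs.open(bucket_uri) as f:
        pf = pq.ParquetFile(f)
        md = pf.metadata

        # Locate the timestamp column within the row-group metadata.
        rg0 = md.row_group(0)
        ts_col = next(j for j in range(rg0.num_columns)
                      if rg0.column(j).path_in_schema == "timestamp")

        # Keep only row groups whose [min, max] statistics overlap the
        # query window [lo, hi]; row groups without statistics must be read.
        wanted = []
        for i in range(md.num_row_groups):
            stats = md.row_group(i).column(ts_col).statistics
            if stats is None or (stats.max >= lo and stats.min <= hi):
                wanted.append(i)

        result = pa.concat_tables([pf.read_row_group(i) for i in wanted])

As for 3), there is no secondary index as such: the partition keys (e.g. source and day above) act as the coarse "index", and newer pyarrow releases can prune partitions via the filters argument of pq.ParquetDataset before any row-group statistics come into play.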