Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Gerlando Falauto
> Gerlando, AFAIK Parquet does not yet support indexing. I believe it does store min/max values at the row batch (or maybe it's page) level, which may help eliminate large "swaths" of data depending on how actual data values …
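For readers unfamiliar with those statistics, here is a minimal sketch of how they can be inspected with pyarrow; the file name and the choice of the first column are hypothetical, not taken from the thread.

    import pyarrow.parquet as pq

    # Inspect per-row-group min/max statistics (the "row batch" level
    # mentioned above). "logs.parquet" is a hypothetical file name.
    pf = pq.ParquetFile("logs.parquet")
    meta = pf.metadata
    for rg in range(meta.num_row_groups):
        col = meta.row_group(rg).column(0)  # first column's chunk metadata
        stats = col.statistics
        if stats is not None and stats.has_min_max:
            print(rg, stats.min, stats.max)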

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Gerlando Falauto
> … level, which may help eliminate large "swaths" of data depending on how actual data values corresponding to a search predicate are distributed across large Parquet files. I have an interest in the future of indexing within the native Parquet structure as …
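As a rough illustration of that idea, the sketch below skips row groups whose min/max statistics cannot satisfy a range predicate. The file name and the "day" column (assumed to hold ISO date strings) are assumptions for the example, not details from the thread.

    import pyarrow as pa
    import pyarrow.parquet as pq

    def read_matching_row_groups(path, column, lo, hi):
        """Read only row groups whose [min, max] range can overlap [lo, hi]."""
        pf = pq.ParquetFile(path)
        keep = []
        for rg in range(pf.metadata.num_row_groups):
            rg_meta = pf.metadata.row_group(rg)
            for c in range(rg_meta.num_columns):
                col = rg_meta.column(c)
                if col.path_in_schema != column:
                    continue
                stats = col.statistics
                # Keep the row group unless its statistics prove it cannot match.
                if stats is None or not stats.has_min_max or (stats.max >= lo and stats.min <= hi):
                    keep.append(rg)
                break
        tables = [pf.read_row_group(rg) for rg in keep]
        return pa.concat_tables(tables) if tables else None

    # Hypothetical usage, filtering on a string-typed "day" column:
    # day = read_matching_row_groups("logs.parquet", "day", "2018-09-01", "2018-09-02")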

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Gerlando Falauto
> … but you can query the data as it arrives. Then, later, say once per day, you can consolidate the files into a few big files. The only trick is the race condition of doing the consolidation while running queries. Not sure …
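A minimal sketch of such a consolidation step with pyarrow, assuming hypothetical local paths and a matching schema across the small files:

    import glob
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Gather the small files written as data arrived (hypothetical layout).
    small_files = sorted(glob.glob("incoming/2018-09-19/*.parquet"))

    # Concatenate them (schemas must match) and rewrite as one large file.
    big = pa.concat_tables([pq.read_table(f) for f in small_files])
    pq.write_table(big, "consolidated/2018-09-19.parquet")

    # To soften the race condition mentioned above, write the consolidated
    # file to a new location first and only delete the small files once
    # readers have switched over to it.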

(Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Gerlando Falauto
Hi, I'm looking for a way to store huge amounts of logging data in the cloud from about 100 different data sources, each producing about 50MB/day (so it's something like 5GB/day). The target storage would be S3 object storage for cost-efficiency reasons. I would like to be able to store (i.e. a…
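For reference, a sketch of what the ingest side could look like with pyarrow and s3fs; the bucket name, column names, and partitioning by source and day are assumptions rather than anything decided in the thread.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()  # credentials taken from the environment

    # A small batch of log records from one source (made-up sample data).
    batch = pd.DataFrame({
        "source": ["sensor-001"] * 3,
        "day": ["2018-09-19"] * 3,
        "message": ["start", "warn", "stop"],
    })

    # Each call appends a new Parquet file under source=/day= directories,
    # matching the "many small files, consolidated later" approach above.
    pq.write_to_dataset(
        pa.Table.from_pandas(batch),
        root_path="my-log-bucket/logs",
        partition_cols=["source", "day"],
        filesystem=fs,
    )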

[jira] [Created] (ARROW-2616) Cross-compiling Pyarrow

2018-05-20 Thread Gerlando Falauto (JIRA)
Gerlando Falauto created ARROW-2616:
Summary: Cross-compiling Pyarrow
Key: ARROW-2616
URL: https://issues.apache.org/jira/browse/ARROW-2616
Project: Apache Arrow
Issue Type: Bug