There's been a bunch of work on adding page indices to Parquet: https://github.com/apache/parquet-format/blob/master/PageIndex.md
I haven't followed progress in detail, but I think the Java implementation supports this now.
It looks pretty similar to what kdb or ClickHouse do, so with fitting storage underneath it should enable very fast point lookups.

Andreas
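For illustration, here is a minimal Python sketch of the same pruning idea, using the row-group-level min/max statistics that pyarrow exposes today; the page index applies the same principle at per-page granularity. The file name, the "timestamp" column and the point_lookup helper are assumptions for this sketch, not something from the thread.

import pyarrow as pa
import pyarrow.parquet as pq

def point_lookup(path, column, value):
    # Keep only the row groups whose [min, max] statistics could contain the value.
    pf = pq.ParquetFile(path)
    meta = pf.metadata
    candidates = []
    for rg in range(meta.num_row_groups):
        rg_meta = meta.row_group(rg)
        for i in range(rg_meta.num_columns):
            chunk = rg_meta.column(i)
            if chunk.path_in_schema != column:
                continue
            stats = chunk.statistics
            if stats is None or not stats.has_min_max:
                candidates.append(rg)   # no statistics: cannot prune, must read
            elif stats.min <= value <= stats.max:
                candidates.append(rg)   # the value may live in this row group
            break
    # Only the surviving row groups are actually read (and, on S3, fetched).
    tables = [pf.read_row_group(rg) for rg in candidates]
    return pa.concat_tables(tables) if tables else None

# e.g. point_lookup("logs.parquet", "timestamp", some_value)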
> On 19.09.2018, at 20:30, Brian Bowman <brian.bow...@sas.com> wrote:
>
> Gerlando,
>
> AFAIK Parquet does not yet support indexing. I believe it does store min/max
> values at the row group (or maybe it's the page) level, which may help eliminate
> large "swaths" of data, depending on how the actual data values corresponding to a
> search predicate are distributed across large Parquet files.
>
> I have an interest in the future of indexing within the native Parquet
> structure as well. It will be interesting to see where this discussion goes
> from here.
>
> -Brian
>
> On 9/19/18, 3:21 PM, "Gerlando Falauto" <gerlando.fala...@gmail.com> wrote:
>
> Thank you all, guys, you've been extremely helpful with your ideas.
> I'll definitely have a look at all your suggestions to see what others have
> been doing in this respect.
>
> What I forgot to mention was that while the service uses the S3 API, it's
> not provided by AWS, so any solution should be based on a cloud offering
> from a different big player (it's the three-letter big-blue one, in case
> you're wondering).
>
> However, I'm still not clear as to how Drill (or pyarrow) would be able to
> gather data with random access. In any database, you just build an index on
> the fields you're going to run most of your queries over, and then the
> database takes care of everything else.
>
> With Parquet, as I understand it, you can do folder-based partitioning (is
> that called "hive" partitioning?) so that you can get random access over,
> let's say, source=source1/date=20180918/*.parquet.
> I assume Drill could be instructed into doing this, or could even figure it
> out by itself just by looking at the folder structure.
> What I still don't get, though, is how to "index" the Parquet file(s), so
> that random (rather than sequential) access can be performed over the whole
> file.
> Brian mentioned metadata; I had a quick look at the Parquet specification
> and I sort of understand that it somehow resembles an index.
> Yet I fail to understand how such an index could be built (if at all
> possible), for instance using pyarrow (or any other tool, for that matter),
> for reading and/or writing.
>
> Thank you!
> Gerlando
>
> On Wed, Sep 19, 2018 at 7:55 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> The effect of rename can be had by handling a small inventory file that is
>> updated atomically.
>>
>> Having real file semantics is sooo much nicer, though.
>>
>> On Wed, Sep 19, 2018 at 1:51 PM Bill Glennon <wglen...@gmail.com> wrote:
>>
>>> Also, you may want to take a look at https://aws.amazon.com/athena/.
>>>
>>> Thanks,
>>> Bill
>>>
>>> On Wed, Sep 19, 2018 at 1:43 PM Paul Rogers <par0...@yahoo.com.invalid>
>>> wrote:
>>>
>>>> Hi Gerlando,
>>>>
>>>> I believe AWS has an entire logging pipeline they offer. If you want
>>>> something quick, perhaps look into that offering.
>>>>
>>>> What you describe is pretty much the classic approach to log aggregation:
>>>> partition data, gather data incrementally, then later consolidate. A while
>>>> back, someone invented the term "lambda architecture" for this idea. You
>>>> should be able to find examples of how others have done something similar.
>>>>
>>>> Drill can scan directories of files. So, in your bucket's (source-date)
>>>> directories, you can have multiple files. If you receive data, say, every 5
>>>> or 10 minutes, you can just create a separate file for each new drop of
>>>> data. You'll end up with many files, but you can query the data as it
>>>> arrives.
>>>>
>>>> Then, later, say once per day, you can consolidate the files into a few
>>>> big files. The only trick is the race condition of doing the consolidation
>>>> while running queries. Not sure how to do that on S3, since you can't
>>>> exploit rename operations as you can on Linux. Anyone have suggestions for
>>>> this step?
>>>>
>>>> Thanks,
>>>> - Paul
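As a rough sketch of the layout described above, assuming local paths for simplicity (the same pyarrow calls also accept a filesystem argument, e.g. an s3fs filesystem, for S3-compatible stores): each incremental drop becomes its own small file under source=<id>/date=<YYYYMMDD>/, and a daily job merges them. The directory layout, file names and helper functions are made up for illustration.

import glob
import os
import time

import pyarrow as pa
import pyarrow.parquet as pq

def write_drop(base_dir, source, date, table):
    # Write one incremental drop (e.g. 5-10 minutes of data) as its own small file.
    part_dir = os.path.join(base_dir, f"source={source}", f"date={date}")
    os.makedirs(part_dir, exist_ok=True)
    path = os.path.join(part_dir, f"drop-{int(time.time() * 1000)}.parquet")
    pq.write_table(table, path)
    return path

def consolidate(base_dir, source, date):
    # Once a day: merge all small drops for one partition into a single file.
    part_dir = os.path.join(base_dir, f"source={source}", f"date={date}")
    drops = sorted(glob.glob(os.path.join(part_dir, "drop-*.parquet")))
    if not drops:
        return None
    merged = pa.concat_tables([pq.read_table(p) for p in drops])
    out = os.path.join(part_dir, "consolidated.parquet")
    pq.write_table(merged, out)
    # Deleting the small files while queries may still be reading them is
    # exactly the race condition mentioned above; an atomically updated
    # inventory file, as Ted suggests, is one way around it.
    for p in drops:
        os.remove(p)
    return out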
>>>> On Wednesday, September 19, 2018, 6:23:13 AM PDT, Gerlando Falauto
>>>> <gerlando.fala...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm looking for a way to store huge amounts of logging data in the cloud
>>>> from about 100 different data sources, each producing about 50MB/day (so
>>>> it's something like 5GB/day in total).
>>>> The target storage would be an S3 object store, for cost-efficiency
>>>> reasons.
>>>> I would like to be able to store (i.e. append) data in realtime, and
>>>> retrieve data based on time frame and data source with fast access. I was
>>>> thinking of partitioning data by data source and calendar day, so as to
>>>> have one 50MB file per day, per data source.
>>>>
>>>> I played around with pyarrow and Parquet (using s3fs), and came across the
>>>> following limitations:
>>>>
>>>> 1) I found no way to append to existing files. I believe that's a
>>>> limitation of S3, but it could be worked around by using datasets
>>>> instead. In principle, I believe I could also trigger some daily job which
>>>> coalesces today's data into a single file, if having too much
>>>> fragmentation causes any disturbance. Would that make any sense?
>>>>
>>>> 2) When reading, if I'm only interested in a small portion of the data (for
>>>> instance, based on a timestamp field), I obviously wouldn't want to have to
>>>> read (i.e. download) the whole file. I believe Parquet was designed to
>>>> handle huge amounts of data with relatively fast access. Yet I fail to
>>>> understand whether there's some way to allow for random access, particularly
>>>> when dealing with a file stored within S3.
>>>> The following code snippet refers to a 150MB dataset composed of 1000
>>>> row groups of 150KB each. I was expecting it to run very fast, yet it
>>>> apparently downloads the whole file (pyarrow 0.9.0):
>>>>
>>>> import s3fs
>>>> import pyarrow.parquet as pq
>>>>
>>>> fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
>>>>                        client_kwargs=client_kwargs)
>>>> with fs.open(bucket_uri) as f:
>>>>     pf = pq.ParquetFile(f)
>>>>     print(pf.num_row_groups)  # yields 1000
>>>>     pf.read_row_group(1)
>>>>
>>>> 3) I was also expecting to be able to perform some sort of query, but I'm
>>>> also failing to see how to specify index columns or such.
>>>>
>>>> What am I missing? Did I get it all wrong?
>>>>
>>>> Thank you!
>>>> Gerlando
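For comparison, a hedged sketch of how the same read looks with the much newer pyarrow.dataset API, which pushes filters down to hive-style partition directories and row-group statistics so that only matching files and row groups are fetched. The root path, partition keys and column names are assumptions for illustration; for an S3-compatible store a filesystem object (e.g. from s3fs) can be passed via the filesystem argument of ds.dataset().

import pyarrow.dataset as ds

# Assumed layout: logs/source=<id>/date=<YYYYMMDD>/*.parquet
dataset = ds.dataset(
    "logs",
    format="parquet",
    partitioning="hive",   # derive source/date columns from the directory names
)

# Partition pruning plus row-group statistics limit what is actually read;
# columns not listed are never downloaded.
table = dataset.to_table(
    columns=["timestamp", "message"],
    filter=ds.field("source") == "source1",
)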