[jira] [Created] (ARROW-3277) [Python] Validate manylinux1 builds with crossbow instead of each Travis CI build

2018-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3277: Summary: [Python] Validate manylinux1 builds with crossbow instead of each Travis CI build. Key: ARROW-3277. URL: https://issues.apache.org/jira/browse/ARROW-3277

[jira] [Created] (ARROW-3276) [Packaging] Add support Parquet related Linux packages

2018-09-19 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3276: Summary: [Packaging] Add support Parquet related Linux packages. Key: ARROW-3276. URL: https://issues.apache.org/jira/browse/ARROW-3276. Project: Apache Arrow

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Gerlando Falauto
Hi Sam, thanks a lot for all your help! As I just said, the perfect solution for me would be to run your approach remotely, close to the data center, and then just deliver the matching results to the requesting user with their narrowband connection. Have you figured out some meaningful way to mod

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Gerlando Falauto
Hi Paul, I see your point. I'm probably worrying too much about indices, inasmuch as partitioning already reduces the problem down to a bearable size. I have to understand better how Apache Drill can be interfaced with. If it could be easily deployed somewhere in the cloud -- where fetching file

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Sam Kennerly
Hi Gerlando, I'm also using remote Parquet files as a pseudo-database for long-term storage of log-like records. Here's what I do: # save log files 0. Every (second|minute|hour|whatever), parse new logs and combine them into 1 pyarrow.Table in RAM on 1 machine. 1. Use pyarrow.parquet.write_to_dat
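A minimal pyarrow sketch of the kind of pipeline described above; the bucket, column names, and s3fs setup are illustrative assumptions, not details from the thread:

    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs  # assumption: the S3-compatible store is reachable through s3fs

    # 0. Parse new logs and combine them into one pyarrow.Table in RAM
    table = pa.Table.from_arrays(
        [pa.array(["sensor-1", "sensor-2"]),
         pa.array([1537340400, 1537340401]),
         pa.array(["boot", "heartbeat"])],
        names=["source", "ts", "message"],
    )

    # 1. Append the batch to a partitioned Parquet dataset on object storage
    fs = s3fs.S3FileSystem()  # endpoint/credential configuration omitted
    pq.write_to_dataset(
        table,
        root_path="my-bucket/logs",   # hypothetical bucket/prefix
        partition_cols=["source"],    # one directory per data source
        filesystem=fs,
    )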

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Brian Bowman
Distributed row-level indexing has been done well in a particular large-scale data system that I'm very familiar with, albeit within a row-wise organization. -Brian

[jira] [Created] (ARROW-3275) [Python] Add documentation about inspecting Parquet file metadata

2018-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3275: Summary: [Python] Add documentation about inspecting Parquet file metadata. Key: ARROW-3275. URL: https://issues.apache.org/jira/browse/ARROW-3275. Project: Apache Arrow

[jira] [Created] (ARROW-3274) [Packaging] Missing glog dependency on conda-forge builds

2018-09-19 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-3274: Summary: [Packaging] Missing glog dependency on conda-forge builds. Key: ARROW-3274. URL: https://issues.apache.org/jira/browse/ARROW-3274. Project: Apache Arrow

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Andreas Heider
There's been a bunch of work on adding page indices to parquet: https://github.com/apache/parquet-format/blob/master/PageIndex.md I haven't followed progress in detail but I think the Java implementation supports this now. Look

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Paul Rogers
Hi Gerlando, Parquet does not allow row-level indexing because some data for a row might not even exist on its own; it is encoded together with data about a group of similar rows. In the world of Big Data, it seems that the most common practice is to simply scan all the data to find the bits you want. Indexing is ve

[jira] [Created] (ARROW-3273) [Java] checkstyle - fix javadoc style

2018-09-19 Thread Bryan Cutler (JIRA)
Bryan Cutler created ARROW-3273: Summary: [Java] checkstyle - fix javadoc style. Key: ARROW-3273. URL: https://issues.apache.org/jira/browse/ARROW-3273. Project: Apache Arrow. Issue Type: Sub-task

[jira] [Created] (ARROW-3272) [Java] Document deviations from Google Style

2018-09-19 Thread Bryan Cutler (JIRA)
Bryan Cutler created ARROW-3272: Summary: [Java] Document deviations from Google Style. Key: ARROW-3272. URL: https://issues.apache.org/jira/browse/ARROW-3272. Project: Apache Arrow

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Brian Bowman
Gerlando, AFAIK Parquet does not yet support indexing. I believe it does store min/max values at the row group (or maybe the page) level, which may help eliminate large "swaths" of data depending on how the actual data values corresponding to a search predicate are distributed across large Parquet
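A small sketch, assuming pyarrow, of how those per-row-group min/max statistics can be inspected to decide which row groups a predicate can skip (the file path is hypothetical):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("logs/part-0.parquet")   # hypothetical path
    meta = pf.metadata
    for rg in range(meta.num_row_groups):
        for col in range(meta.row_group(rg).num_columns):
            chunk = meta.row_group(rg).column(col)
            stats = chunk.statistics
            if stats is not None and stats.has_min_max:
                # A predicate falling outside [min, max] means this row group can be skipped
                print(rg, chunk.path_in_schema, stats.min, stats.max)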

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Gerlando Falauto
Thank you all guys, you've been extremely helpful with your ideas. I'll definitely have a look at all your suggestions to see what others have been doing in this respect. What I forgot to mention was that while the service uses the S3 API, it's not provided by AWS so any solution should be based o

Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Wes McKinney
On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi wrote: > > Hi Johan, Wes, and Jacques - many thanks for your comments: > > @Johan - > 1. I also do not suspect that there is any inherent drawback in Java or C++ > due to the Arrow format. I mentioned C++ because Wes pointed out that Java > routines

Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Animesh Trivedi
Hi Johan, Wes, and Jacques - many thanks for your comments: @Johan - 1. I also do not suspect that there is any inherent drawback in Java or C++ due to the Arrow format. I mentioned C++ because Wes pointed out that Java routines are not the most optimized ones (yet!). And naturally one would expec

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Ted Dunning
The effect of rename can be had by handling a small inventory file that is updated atomically. Having real file semantics is sooo much nicer, though.
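A hedged sketch of that "small inventory file" idea using boto3; the bucket, key, and manifest layout are assumptions for illustration. Because an S3 PUT replaces the whole object, readers see either the old or the new inventory, never a partial one:

    import json
    import boto3

    s3 = boto3.client("s3")  # pass endpoint_url=... for non-AWS S3-compatible stores

    def publish_inventory(bucket, data_files):
        # Rewrite the manifest in full; readers list files from it instead of
        # relying on ListObjects or rename semantics.
        body = json.dumps({"files": sorted(data_files)}).encode("utf-8")
        s3.put_object(Bucket=bucket, Key="logs/_inventory.json", Body=body)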

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Bill Glennon
Also, may want to take a look at https://aws.amazon.com/athena/. Thanks, Bill

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Paul Rogers
Hi Gerlando, I believe AWS has an entire logging pipeline they offer. If you want something quick, perhaps look into that offering. What you describe is pretty much the classic approach to log aggregation: partition data, gather data incrementally, then later consolidate. A while back, someone in
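A rough pyarrow sketch of the "consolidate later" step, assuming the incremental files for one partition are simply read back, concatenated, and rewritten as a single larger file (paths are illustrative):

    import pyarrow as pa
    import pyarrow.parquet as pq

    small_files = [
        "logs/source=sensor-1/part-0.parquet",
        "logs/source=sensor-1/part-1.parquet",
    ]

    merged = pa.concat_tables([pq.read_table(path) for path in small_files])
    pq.write_table(
        merged,
        "logs/source=sensor-1/2018-09-19.parquet",
        row_group_size=1_000_000,  # larger row groups scan better than many tiny files
    )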

[jira] [Created] (ARROW-3271) [Python] Manylinux1 builds timing out in Travis CI

2018-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3271: Summary: [Python] Manylinux1 builds timing out in Travis CI. Key: ARROW-3271. URL: https://issues.apache.org/jira/browse/ARROW-3271. Project: Apache Arrow

[jira] [Created] (ARROW-3270) [Release] Adjust release verification scripts to recent parquet migration

2018-09-19 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-3270: Summary: [Release] Adjust release verification scripts to recent parquet migration. Key: ARROW-3270. URL: https://issues.apache.org/jira/browse/ARROW-3270

Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Jacques Nadeau
My big question is what is the use case and how/what are you trying to compare? Arrow's implementation is more focused on interacting with the structure than transporting it. Generally speaking, when we're working with Arrow data we frequently are just interacting with memory locations and doing di

Re: Arrow dev sync call 12pm US Eastern

2018-09-19 Thread Jacques Nadeau
Quick notes from call. Attendees: Jacques, Wes, Li, Pearu, Shiv, Camilo, Kristian, Charles. 0.11 release: - About 20 “grindy” Python patches left; Wes is working on them but maybe others can help? - Outstanding packaging issues: zlib dynamic linking with Conda packages. - Would be good to get Java Guava removal and Jacks

Arrow dev sync call 12pm US Eastern

2018-09-19 Thread Wes McKinney
All are welcome at: https://meet.google.com/vtm-teks-phx

Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Wes McKinney
hi Animesh, Per Johan's comments, the C++ library is essentially going to be IO/memory bandwidth bound since you're interacting with raw pointers. I'm looking at your code private void consumeFloat4(FieldVector fv) { Float4Vector accessor = (Float4Vector) fv; int valCount = accessor.getV

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Bill Glennon
I have not had a chance to look into this but at least wanted to share. Log Search and Analytics Hub on Amazon S3 https://chaossearch.io/ You can listen to a podcast about it if interested. https://www.dataengineeringpodcast.com/chaos-search-with-pete-cheslock-and-thomas-hazel-episode-47/ Thanks

[jira] [Created] (ARROW-3269) [Python] Fix warnings in unit test suite

2018-09-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3269: Summary: [Python] Fix warnings in unit test suite. Key: ARROW-3269. URL: https://issues.apache.org/jira/browse/ARROW-3269. Project: Apache Arrow

Re: (Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Brian Bowman
Gerlando is correct that S3 objects, once created, are immutable. They cannot be updated in place, appended to, nor even renamed. However, S3 supports seeking to offsets within the object being read. The challenge is knowing where to read within the S3 object, which to perform well will require
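A brief sketch, assuming pyarrow with s3fs, of how a reader exploits that seekability: the Parquet footer is read first, and then only the byte ranges of the needed row group and columns are fetched (the path and column names are hypothetical):

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()  # S3-compatible endpoint configuration omitted
    with fs.open("my-bucket/logs/part-0.parquet", "rb") as f:
        pf = pq.ParquetFile(f)
        # Reads the footer metadata, then only the selected row group's columns
        table = pf.read_row_group(0, columns=["ts", "message"])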

[jira] [Created] (ARROW-3268) [CI] Reduce conda times on AppVeyor

2018-09-19 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3268: Summary: [CI] Reduce conda times on AppVeyor. Key: ARROW-3268. URL: https://issues.apache.org/jira/browse/ARROW-3268. Project: Apache Arrow

(Ab)using parquet files on S3 storage for a huge logging database

2018-09-19 Thread Gerlando Falauto
Hi, I'm looking for a way to store huge amounts of logging data in the cloud from about 100 different data sources, each producing about 50MB/day (so it's something like 5GB/day). The target storage would be an S3 object storage for cost-efficiency reasons. I would like to be able to store (i.e. a
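For the retrieval side of this use case, a hedged pyarrow sketch of reading back only one source's partition from such a dataset; the dataset layout, column names, and filter values are assumptions:

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()
    table = pq.read_table(
        "my-bucket/logs",                       # hypothetical partitioned dataset root
        filesystem=fs,
        columns=["ts", "message"],
        filters=[("source", "=", "sensor-1")],  # only matching partitions are read
    )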

Re: [JAVA] Arrow performance measurement

2018-09-19 Thread Johan Peltenburg - EWI
Hello Animesh, I browsed a bit in your sources, thanks for sharing. We have performed some similar measurements to your third case in the past for C/C++ on collections of various basic types such as primitives and strings. I can say that in terms of consuming data from the Arrow format versus

[JAVA] Arrow performance measurement

2018-09-19 Thread Animesh Trivedi
Hi all, A week ago, Wes and I had a discussion about the performance of the Arrow/Java implementation on the Apache Crail (Incubating) mailing list ( http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser). In a nutshell: I am investigating the performance of various file formats (

[jira] [Created] (ARROW-3267) [Python] Create empty table from schema

2018-09-19 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-3267: Summary: [Python] Create empty table from schema. Key: ARROW-3267. URL: https://issues.apache.org/jira/browse/ARROW-3267. Project: Apache Arrow