Re: HFile vs Parquet for very wide table

2016-01-22 Thread Jerry He
Parquet may be more efficient in your use case, coupled with an upper-layer query engine. But Parquet has a schema. The schema can evolve, though, e.g. by adding columns in new Parquet files. HBase would be able to do the job too, and it is schema-less -- you can add columns freely. Jerry On Fri, Jan 22, 2016
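
A minimal sketch of what "add columns freely" looks like with the HBase client API, written in Scala; the table name, column family, and qualifier below are made up for illustration:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("matrix"))

// No schema change is needed: any new column qualifier can be written on the fly.
val put = new Put(Bytes.toBytes("row-00001"))
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("col_600001"), Bytes.toBytes(0.42))
table.put(put)

table.close()
conn.close()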

Re: HFile vs Parquet for very wide table

2016-01-22 Thread Krishna
Thanks Ted, Jerry. Computing pairwise similarity is the primary purpose of the matrix. This is done by extracting all rows for a set of columns at each iteration. On Thursday, January 21, 2016, Jerry He wrote: > What do you want to do with your matrix data? How do you want to use it? > Do you
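
One way that iteration could look in Spark if the matrix were stored as Parquet -- a rough sketch only: the path, column names, batch size, and the use of MLlib's columnSimilarities are assumptions, and sqlContext is the Spark 1.x shell context:

import org.apache.spark.sql.functions.col
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One iteration: pull a batch of columns, then compute pairwise column similarities.
val batch = (0 until 100).map(i => s"c$i")
val df    = sqlContext.read.parquet("/data/matrix").select(batch.map(col): _*)

val rows = df.rdd.map(r => Vectors.dense(batch.indices.map(r.getDouble).toArray))
val sims = new RowMatrix(rows).columnSimilarities()  // upper-triangular CoordinateMatrix of cosine similarities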

Re: HFile vs Parquet for very wide table

2016-01-21 Thread Jerry He
What do you want to do with your matrix data? How do you want to use it? Do you need random read/write or point queries? Do you need to get the row/record or many, many columns at a time? If yes, HBase is a good choice for you. Parquet is good as a storage format for large scans, aggregations, on li
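
A small sketch of the two access patterns being contrasted, in Scala; the table, column family, column, and path names are made up, and sqlContext is the Spark 1.x entry point:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// Point query: HBase fetches one row (or a few of its columns) directly by key.
val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("matrix"))
val get   = new Get(Bytes.toBytes("row-00001"))
get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("col_42"))
val cell  = table.get(get).getValue(Bytes.toBytes("d"), Bytes.toBytes("col_42"))

// Large scan / aggregation: Spark SQL over Parquet touches only the referenced column.
val avg = sqlContext.read.parquet("/data/matrix").selectExpr("avg(c42)").first()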

Re: HFile vs Parquet for very wide table

2016-01-21 Thread Ted Yu
I have very limited knowledge of Parquet, so I can only answer from the HBase point of view. Please see the recent thread on the number of columns in a row in HBase: http://search-hadoop.com/m/YGbb3NN3v1jeL1f There are a few Spark HBase connectors. See this thread: http://search-hadoop.com/m/q3RTt4cp9Z4p37s

HFile vs Parquet for very wide table

2016-01-21 Thread Krishna
We are evaluating Parquet and HBase for storing a dense & very, very wide matrix (it can have more than 600K columns). I have the following questions: - Is there a limit on the # of columns in Parquet or HFile? We expect to query [10-100] columns at a time using Spark - what are the performance im
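
For the Spark/Parquet side of the question, the read pattern described above would roughly look like the sketch below (the path and column names are hypothetical; sqlContext is the Spark 1.x entry point). Because Parquet is columnar, selecting ~100 of 600K columns reads only those columns from storage:

import org.apache.spark.sql.functions.col

// Select a small subset of the columns; Parquet column pruning skips the rest.
val wanted = Seq("c17", "c105", "c99234")
val subset = sqlContext.read.parquet("/data/matrix").select(wanted.map(col): _*)
subset.show()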