Parquet may be more efficient in your use case, coupled with an upper-layer
query engine.
But Parquet has a schema. The schema can evolve, though, e.g. by adding
columns in new Parquet files.
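For illustration, here is a minimal sketch of that kind of schema evolution with Spark (the path and column names below are made up): newer files simply carry the extra column, and reading with mergeSchema gives the union of the schemas.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-schema-evolution").getOrCreate()
import spark.implicits._

// Older files: columns (id, c1, c2)
Seq((1L, 0.1, 0.2)).toDF("id", "c1", "c2")
  .write.mode("append").parquet("/data/matrix")

// Newer files add a column c3 -- the old files are not rewritten
Seq((2L, 0.3, 0.4, 0.5)).toDF("id", "c1", "c2", "c3")
  .write.mode("append").parquet("/data/matrix")

// mergeSchema reconciles the files; old rows come back with c3 = null
val df = spark.read.option("mergeSchema", "true").parquet("/data/matrix")
df.printSchema()
```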
HBase would be able to do the job too, and it is schema-less -- you can add
columns freely.
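As a rough sketch of that schema-less model (the table, family, and column names here are made up), a row can carry as many qualifiers as you write into it; only the column family has to exist when the table is created.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.util.Random

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("matrix"))

// One row, arbitrarily many qualifiers c0, c1, ... under family "m";
// later rows can introduce new qualifiers without any DDL change
val put = new Put(Bytes.toBytes("row-000001"))
(0 until 1000).foreach { i =>
  put.addColumn(Bytes.toBytes("m"), Bytes.toBytes(s"c$i"), Bytes.toBytes(Random.nextDouble()))
}
table.put(put)

table.close()
conn.close()
```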
Jerry
On Fri, Jan 22, 2016
Thanks Ted, Jerry.
Computing pairwise similarity is the primary purpose of the matrix. This is
done by extracting all rows for a set of columns at each iteration.
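For what it's worth, a sketch of one such iteration over a Parquet layout (the path and column names are assumptions: a long row_id column plus double-valued feature columns c0, c1, ...). Because X split into column blocks [X_1 | X_2 | ...] gives X X^T = sum_k X_k X_k^T, the pairwise row dot products can be accumulated one column batch at a time, and Parquet only reads the batch that is selected; cosine similarity then only needs the row norms on top of this.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val spark = SparkSession.builder().appName("pairwise-sim-batch").getOrCreate()

val batchCols = (0 until 100).map(i => s"c$i")          // one batch of ~100 columns
val df = spark.read.parquet("/data/matrix")
  .select("row_id", batchCols: _*)                       // only these columns are read

val rows = df.rdd.map { r =>
  val values = batchCols.indices.map(i => r.getDouble(i + 1)).toArray
  IndexedRow(r.getLong(0), Vectors.dense(values))
}

// Partial Gram matrix for this batch: its entries are this batch's contribution
// to every pairwise row dot product (note: the result is N x N, so this is only a sketch)
val x = new IndexedRowMatrix(rows).toBlockMatrix()
val partialGram = x.multiply(x.transpose)
```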
On Thursday, January 21, 2016, Jerry He wrote:
What do you want to do with your matrix data? How do you want to use it?
Do you need random read/write or point queries? Do you need to get a whole
row/record, or many columns at a time?
If yes, HBase is a good choice for you.
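For the point-query case, a rough sketch with the plain HBase client (table, family, and qualifier names are made up): a Get restricted to a few qualifiers fetches just those cells for one row.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("matrix"))

// Random read of a single row, restricted to the columns of interest
val get = new Get(Bytes.toBytes("row-000001"))
Seq("c10", "c42", "c99").foreach { q =>
  get.addColumn(Bytes.toBytes("m"), Bytes.toBytes(q))
}
val result = table.get(get)
val c42 = Bytes.toDouble(result.getValue(Bytes.toBytes("m"), Bytes.toBytes("c42")))

table.close()
conn.close()
```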
Parquet is good as a storage format for large scans and aggregations on a
limited set of columns.
I have very limited knowledge of Parquet, so I can only answer from the HBase
point of view.
Please see the recent thread on the number of columns in a row in HBase:
http://search-hadoop.com/m/YGbb3NN3v1jeL1f
There are a few Spark-HBase connectors.
See this thread:
http://search-hadoop.com/m/q3RTt4cp9Z4p37s
We are evaluating Parquet and HBase for storing a dense & very, very wide
matrix (can have more than 600K columns).
I have the following questions:
- Is there a limit on the # of columns in Parquet or HFile? We expect to
query [10-100] columns at a time using Spark - what are the performance
implications?
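One thing that may help while evaluating the Parquet side (the path and column names below are made up): select only the columns you need and check the physical plan; the scan should list just those columns, so the very wide schema is not materialized in full.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

val spark = SparkSession.builder().appName("wide-matrix-read").getOrCreate()

// Read only a batch of ~100 columns out of the very wide matrix
val wanted = (0 until 100).map(i => col(s"c$i"))
val batch = spark.read.parquet("/data/matrix").select(wanted: _*)

// The physical plan shows which columns the Parquet scan actually reads
batch.explain()
batch.agg(avg("c0")).show()
```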