Thank you sir. Very helpful
On Thursday, 29 October 2015, 15:22, Alan Gates <alanfga...@gmail.com> wrote:

Ashok Kumar October 28, 2015 at 22:43

> hi gurus,
>
> kindly clarify the following please
>
> - Hive currently does not support indexes or indexes are not used in the query

Mostly true. There is a create index, but Hive does not use the resulting index by default. Some storage formats (ORC, Parquet I think) have their own indices they use internally to speed access.

> - The lowest granularity for concurrency is partition. If table is partitioned, then partition will be lucked in DML operation

lucked = locked? I'm not sure what you intended here. If you mean locked, then it depends. By default Hive doesn't use locking. You can set it up to do locking via ZooKeeper or as part of Hive transactions. They have different locking models. See https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions and https://cwiki.apache.org/confluence/display/Hive/Locking for more information.

You can sub-partition using buckets, but for most queries partition is the lowest level of granularity. Hive does a lot of work to optimize only reading relevant partitions for a query.

> - What is the best file format to store Hive table in HDFS? Is this ORC or Avro that allow being split and support block compression?

It depends on what you want to do. ORC and Parquet do better for traditional data warehousing type queries because they are columnar formats and have lots of optimization built in for fast access, pushing filter down into the storage level etc. People like Avro and other self describing formats when their data brings its own structure. We very frequently see pipelines where people dump Avro, text, etc. into Hive and then ETL it into ORC.

> - Text/CSV files. By default if file type is not specified at creation time, Hive will default to text file?

Out of the box yes, but you can change that in your Hive installation by setting hive.default.fileformat in your hive-site.xml.

Alan.

> Thanks
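
For anyone finding this thread later, here is a rough HiveQL sketch of the "dump text into Hive, then ETL it into ORC" pipeline mentioned above, along with a session-level way to change the default file format. The table and column names (staging_events, events_orc, event_id, payload, event_date) are made up for illustration and do not come from the thread. Treat it as a starting point under those assumptions, not a recommended layout.

```sql
-- Staging table over raw comma-delimited text (hypothetical names).
-- TEXTFILE is also what you get out of the box when STORED AS is omitted.
CREATE TABLE staging_events (
  event_id   BIGINT,
  payload    STRING,
  event_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Target table: columnar ORC, partitioned by date (hypothetical names).
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- Let Hive derive the partition value from the data being inserted.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The ETL step: the partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE events_orc PARTITION (event_date)
SELECT event_id, payload, event_date
FROM staging_events;

-- A filter on the partition column lets Hive read only that partition.
SELECT COUNT(*) FROM events_orc WHERE event_date = '2015-10-28';

-- Optional: change the default for the current session only, so that a
-- later CREATE TABLE without STORED AS produces ORC instead of text.
SET hive.default.fileformat=ORC;
```

Setting hive.default.fileformat in hive-site.xml, as Alan describes, makes the same change for the whole installation rather than a single session.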