What is the physical query plan after you set spark.sql.hive.convertMetastoreParquet to true?
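For example, something like this in the spark-shell should show it (a rough sketch; I'm reusing the table, column, and filter values from the plan you posted below, so substitute your real ones):

    // spark-shell (Spark 1.6.x): sqlContext is already a HiveContext
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
    sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")

    sqlContext.sql(
      """SELECT rejection_reason FROM dbname.tablename
        |WHERE year = 2016 AND month = 12 AND day = 28
        |  AND hour = 2 AND venture = 'DEFAULT'""".stripMargin
    ).explain() // prints the physical plan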
Michael

> On Jan 17, 2017, at 6:51 PM, Raju Bairishetti <r...@apache.org> wrote:
>
> Thanks Michael for the response.
>
> On Wed, Jan 18, 2017 at 2:45 AM, Michael Allman <mich...@videoamp.com> wrote:
>
> Hi Raju,
>
> I'm sorry this isn't working for you. I helped author this functionality and will try my best to help.
>
> First, I'm curious why you set spark.sql.hive.convertMetastoreParquet to false?
>
> I had set it as suggested in SPARK-6910 and the corresponding pull requests. It did not work for me without setting the spark.sql.hive.convertMetastoreParquet property.
>
> Can you link specifically to the jira issue or spark pr you referred to? The first thing I would try is setting spark.sql.hive.convertMetastoreParquet to true. Setting that to false might also explain why you're getting parquet decode errors. If you're writing your table data with Spark's parquet file writer and reading with Hive's parquet file reader, there may be an incompatibility accounting for the decode errors you're seeing.
>
> https://issues.apache.org/jira/browse/SPARK-6910. My main motivation is to avoid fetching all the partitions. We reverted the spark.sql.hive.convertMetastoreParquet setting to true because of the decoding errors. After reverting it, Spark is fetching all partitions from the table again.
>
> Can you reply with your table's Hive metastore schema, including partition schema?
>
> col1 string
> col2 string
> year int
> month int
> day int
> hour int
>
> # Partition Information
> # col_name   data_type   comment
> year         int
> month        int
> day          int
> hour         int
> venture      string
>
> Where are the table's files located?
>
> In Hadoop, under a user directory.
>
> If you do a "show partitions <dbname>.<tablename>" in the spark-sql shell, does it show the partitions you expect to see? If not, run "msck repair table <dbname>.<tablename>".
>
> Yes. It is listing the partitions.
>
> Cheers,
>
> Michael
>
>> On Jan 17, 2017, at 12:02 AM, Raju Bairishetti <r...@apache.org> wrote:
>>
>> Had a high-level look into the code. It seems the getHiveQlPartitions method from HiveMetastoreCatalog is getting called irrespective of the metastorePartitionPruning conf value.
>>
>> It should not fetch all partitions if we set metastorePartitionPruning to true (the default value for this is false):
>>
>> def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] = {
>>   // Predicates are pushed to the metastore only when pruning is enabled
>>   val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) {
>>     table.getPartitions(predicates)
>>   } else {
>>     allPartitions
>>   }
>>   ...
>>
>> def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] =
>>   client.getPartitionsByFilter(this, predicates)
>>
>> lazy val allPartitions = table.getAllPartitions
>>
>> But somehow getAllPartitions is getting called even after setting metastorePartitionPruning to true. Am I missing something or looking at the wrong place?
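>> For reference, this is roughly how I am testing it in the spark-shell (a sketch only; dbname.tablename is a placeholder). My understanding, which may be wrong, is that only plain comparisons on partition columns can be converted into a metastore filter string, and that if nothing is convertible the client falls back to fetching all partitions:
>>
>> // Plain comparisons on partition columns should be pushed down
>> // to client.getPartitionsByFilter:
>> val pruned = sqlContext.table("dbname.tablename")
>>   .filter("year = 2016 AND month = 12 AND day = 28")
>> pruned.explain() // partition predicates should show up in the scan
>>
>> // Wrapping a partition column in a function makes the predicate
>> // non-convertible, which can end up fetching all partitions:
>> val unpruned = sqlContext.table("dbname.tablename")
>>   .filter("cast(year as string) = '2016'")
>> unpruned.explain()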
>> On Tue, Jan 17, 2017 at 4:01 PM, Raju Bairishetti <r...@apache.org> wrote:
>>
>> Hello,
>>
>> Spark SQL is generating the query plan with all partitions' information even when we apply filters on partition columns in the query. Due to this, the Spark driver/Hive metastore is hitting OOM, as each table has lots of partitions.
>>
>> We can confirm from the Hive audit logs that it tries to fetch all partitions from the Hive metastore:
>>
>> 2016-12-28 07:18:33,749 INFO [pool-4-thread-184]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x cmd=get_partitions : db=xxxx tbl=xxxxx
>>
>> We configured the following parameters in the Spark conf to fix the above issue (source: the Spark JIRA and GitHub pull requests):
>>
>> spark.sql.hive.convertMetastoreParquet false
>> spark.sql.hive.metastorePartitionPruning true
>>
>> plan: rdf.explain
>> == Physical Plan ==
>> HiveTableScan [rejection_reason#626], MetastoreRelation dbname, tablename, None, [(year#314 = 2016),(month#315 = 12),(day#316 = 28),(hour#317 = 2),(venture#318 = DEFAULT)]
>>
>> With these settings, the get_partitions_by_filter method is called and only the required partitions are fetched.
>>
>> But we are seeing parquet decode errors in our applications frequently after this. It looks like these decoding errors are because of changing the serde from the Spark built-in one to the Hive serde.
>>
>> I feel that fixing query plan generation in Spark SQL is the right approach instead of forcing users to use the Hive serde.
>>
>> Is there any workaround/way to fix this issue? I would like to hear more thoughts on this :)
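>> One more thought on the decode errors: if they come from Spark writing parquet files that the Hive serde cannot decode (for example, the newer encoding of decimal values), then a possibly relevant knob is spark.sql.parquet.writeLegacyFormat. A rough, untested sketch (the source table and output path are placeholders):
>>
>> // Write data in the legacy parquet format that Hive's reader expects
>> sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
>> val df = sqlContext.table("dbname.source_table")
>> df.write.mode("append")
>>   .partitionBy("year", "month", "day", "hour", "venture")
>>   .parquet("/user/rajub/tablename")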
>> On Tue, Jan 17, 2017 at 4:00 PM, Raju Bairishetti <r...@apache.org> wrote:
>> [same "Had a high-level look into the code" message as quoted above]
>>
>> On Mon, Jan 16, 2017 at 12:53 PM, Raju Bairishetti <r...@apache.org> wrote:
>> Waiting for suggestions/help on this...
>>
>> On Wed, Jan 11, 2017 at 12:14 PM, Raju Bairishetti <r...@apache.org> wrote:
>> [same "Spark SQL is generating the query plan with all partitions" message as quoted above]
>>
>> ------
>> Thanks,
>> Raju Bairishetti,
>> www.lazada.com
>
> ------
> Thanks,
> Raju Bairishetti,
> www.lazada.com