Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread @Sanjiv Singh
Yes, it is very strange, and also quite contrary to my belief about Spark SQL on hive tables. I am facing this issue on an HDP setup on which COMPACTION is required only once. On the other hand, the Apache setup doesn't require compaction even once. Maybe something got triggered on the meta-store after compaction

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread Varadharajan Mukundan
That's interesting. I'm not sure why the first compaction is needed but not on the subsequent inserts. Maybe it's just to create some metadata. Thanks for clarifying this :) On Tue, Feb 23, 2016 at 2:15 PM, @Sanjiv Singh wrote: > Try this, > > > hive> create table default.foo(id int) clustered by (id

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread @Sanjiv Singh
Try this,

hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);
scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because data is still in delta files

Now run
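A likely continuation, based on the compaction step described elsewhere in this thread (the exact commands after this point are an assumption, not from the truncated mail):

hive> ALTER TABLE default.foo COMPACT 'major';  -- rewrites the delta files into a base file
scala> sqlContext.table("default.foo").count    // should return 1 once the compaction job has completed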

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-23 Thread Varadharajan Mukundan
This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure if it's different. Please walk through the commands below in the same order to understand the sequence.

hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread @Sanjiv Singh
Hi Varadharajan, That is the point: Spark SQL is able to recognize delta files. See the directory structure below, ONE BASE (43 records) and one DELTA (created after the last insert). And I am able to see the last insert through Spark SQL. *See the complete scenario below:* *Steps:* - Inserted 43 records in
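For reference, the base/delta layout being described would look roughly like this on HDFS (warehouse path and transaction ids below are illustrative, not taken from the original mail):

hdfs dfs -ls /apps/hive/warehouse/mytable
  .../mytable/base_0000043             <- compacted base holding the 43 records
  .../mytable/delta_0000044_0000044    <- delta written by the last insert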

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread Varadharajan Mukundan
Hi Sanjiv, Yes. If we make use of Hive JDBC we should be able to retrieve all the rows, since it is Hive which processes the query. But I think the problem with Hive JDBC is that there are two layers of processing: Hive, and then Spark with the result set. And another one is that performance is limit
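A minimal sketch of the Hive JDBC path from the Spark shell (the HiveServer2 URL and credentials are assumptions; the query is executed by Hive itself, which is why rows in delta files are visible, at the cost of the double processing mentioned above):

scala> import java.sql.DriverManager
scala> Class.forName("org.apache.hive.jdbc.HiveDriver")
scala> val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
scala> val rs = conn.createStatement().executeQuery("select count(*) from default.foo")
scala> rs.next(); rs.getLong(1)   // Hive runs the query, so base and delta rows are both counted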

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread @Sanjiv Singh
Hi Varadharajan, Can you elaborate on (quoted from the previous mail): "I observed that hive transaction storage structure do not work with spark yet" If it is related to delta files created after each transaction, which Spark would not be able to recognize, then I have a table *mytable* (ORC , BU

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-22 Thread Varadharajan Mukundan
Actually, auto compaction, if enabled, is triggered based on the volume of changes. It doesn't automatically run after every insert. I think it's possible to reduce the thresholds, but that might reduce performance by a big margin. As of now, we do compaction after the batch insert completes. The o
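The thresholds referred to here are presumably the compactor settings in hive-site.xml; the values below are Hive's usual defaults and are shown only as an illustration:

hive.compactor.delta.num.threshold = 10    (number of delta directories that triggers a minor compaction)
hive.compactor.delta.pct.threshold = 0.1   (delta-to-base size ratio that triggers a major compaction)
hive.compactor.check.interval      = 300   (seconds between checks by the compaction initiator)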

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread @Sanjiv Singh
Compaction should have been triggered automatically, as the following properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION* property has not been set for these tables.

hive.compactor.initiator.on = true
hive.compactor.worker.threads = 1

Do
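For context, a sketch of how these settings typically appear in hive-site.xml (standard Hadoop property syntax, not copied from the original mail):

<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>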

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread Varadharajan Mukundan
Yes, I was burned by this issue a couple of weeks back. This also means that after every insert job, compaction should be run to access the new rows from Spark. Sad that this issue is not documented / mentioned anywhere. On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh wrote: > Hi Varadharajan, > >

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread @Sanjiv Singh
Hi Varadharajan, Thanks for your response. Yes, it is a transactional table; see *show create table* below. The table has hardly 3 records, and after triggering minor compaction on the table, it starts showing results in Spark SQL. > *ALTER TABLE hivespark COMPACT 'major';* > *show create table hiv

Re: Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread Varadharajan Mukundan
Hi, Is the transaction attribute set on your table? I observed that the hive transaction storage structure does not work with spark yet. You can confirm this by looking at the transactional attribute in the output of "desc extended <table name>" in the hive console. If you'd need to access a transactional table, conside
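As a concrete illustration (the table name here is just an example), the check would be along these lines:

hive> show create table mytable;
-- or: desc extended mytable;
-- look for transactional=true under TBLPROPERTIES / Table Parameters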

Spark SQL is not returning records for hive bucketed tables on HDP

2016-02-21 Thread @Sanjiv Singh
Hi, I have observed that Spark SQL is not returning records for hive bucketed ORC tables on HDP. In Spark SQL, I am able to list all tables, but queries on hive bucketed tables are not returning records. I have also tried the same for non-bucketed hive tables; it is working fine. Same is