That's interesting. I'm not sure why the first compaction is needed but not one after each subsequent insert. Maybe it's just to create some metadata. Thanks for clarifying this :)
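In case it helps others who hit this: below is a minimal sketch (Scala, spark-shell with the usual sqlContext, Hive JDBC driver on the classpath) of the workaround discussed further down in this thread, i.e. triggering a major compaction once the batch insert has finished and only then reading the table through sqlContext.table. The host/port, credentials and the default.foo table are placeholders taken from the examples below, not a tested recipe.

import java.sql.DriverManager

// Older setups may need the Hive JDBC driver registered explicitly.
Class.forName("org.apache.hive.jdbc.HiveDriver")

// Ask Hive to major-compact the table after the batch insert completes.
val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "hive", "")
try {
  val stmt = conn.createStatement()
  // The compaction request is only queued here and runs asynchronously;
  // SHOW COMPACTIONS in Hive shows when it has finished and produced a new base_* dir.
  stmt.execute("ALTER TABLE default.foo COMPACT 'major'")
  stmt.close()
} finally {
  conn.close()
}

// Once the new base directory exists, sqlContext.table sees the rows that
// were previously only in delta files.
val visibleRows = sqlContext.table("default.foo").count()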
On Tue, Feb 23, 2016 at 2:15 PM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:

> Try this,
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into default.foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> hive> insert into foo values(20);
>
> scala> sqlContext.table("default.foo").count // Gives 2, no compaction required.
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>
>> This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure if it's different.
>>
>> Please walk through the commands below in the same order to understand the sequence.
>>
>> hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
>> hive> insert into foo values(10);
>>
>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because data is still in delta files
>>
>> Now run major compaction:
>>
>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>
>> scala> sqlContext.table("default.foo").count // Gives 1
>>
>> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>
>>> Hi Varadharajan,
>>>
>>> That is the point: Spark SQL is able to recognize delta files. See the directory structure below, ONE BASE (43 records) and one DELTA (created after the last insert), and I am able to see the last insert through Spark SQL.
>>>
>>> See the complete scenario below.
>>>
>>> Steps:
>>>
>>> - Inserted 43 records into the table.
>>> - Ran major compaction on the table:
>>>   alter table mytable COMPACT 'major';
>>> - Disabled auto compaction on the table:
>>>   alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");
>>> - Inserted 1 record into the table.
>>>
>>> > hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable
>>> drwxrwxrwx - root hdfs 0 2016-02-23 11:43 /apps/hive/warehouse/mydb.db/mytable/base_0000087
>>> drwxr-xr-x - root hdfs 0 2016-02-23 12:02 /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>>>
>>> SPARK JDBC:
>>>
>>> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
>>> +------+
>>> | _c0  |
>>> +------+
>>> | 44   |
>>> +------+
>>> 1 row selected (1.196 seconds)
>>>
>>> HIVE JDBC:
>>>
>>> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
>>> +------+--+
>>> | _c0  |
>>> +------+--+
>>> | 44   |
>>> +------+--+
>>> 1 row selected (0.121 seconds)
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob : +091 9990-447-339
>>>
>>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>
>>>> Hi Sanjiv,
>>>>
>>>> Yes, if we make use of Hive JDBC we should be able to retrieve all the rows, since it is Hive which processes the query. But I think the problem with Hive JDBC is that there are two layers of processing: Hive first, then Spark on the result set. Another is that performance is limited to that single HiveServer2 node and its network.
>>>>
>>>> But if we make use of the sqlContext.table function in Spark to access Hive tables, it is supposed to read the files directly from HDFS, skipping the Hive layer. It doesn't read delta files, though; it only reads the contents of the base folder. Only after a major compaction are the delta files merged into the base folder and made visible to Spark SQL.
>>>>
>>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>
>>>>> Hi Varadharajan,
>>>>>
>>>>> Can you elaborate on this (from your previous mail): "I observed that the Hive transactional storage structure does not work with Spark yet"?
>>>>>
>>>>> If it is about the delta files created after each transaction and Spark not being able to recognize them: I have a table mytable (ORC, BUCKETED, NON-SORTED) that has already had lots of inserts, updates and deletes. I can see the delta files created in HDFS (see below), and I am still able to fetch consistent records through Spark JDBC and Hive JDBC.
>>>>>
>>>>> No compaction has been triggered for that table.
>>>>>
>>>>> > hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable
>>>>>
>>>>> drwxrwxrwx - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:40 /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:41 /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>>>
>>>>> Regards
>>>>> Sanjiv Singh
>>>>> Mob : +091 9990-447-339
>>>>>
>>>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>
>>>>>> Actually, auto compaction, if enabled, is triggered based on the volume of changes; it doesn't automatically run after every insert. I think it's possible to reduce the thresholds, but that might reduce performance by a big margin. As of now, we run compaction after the batch insert completes.
>>>>>>
>>>>>> The only other way to solve this problem as of now is to use the Hive JDBC API.
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>
>>>>>>> Compaction should have been triggered automatically, as the following properties are already set in hive-site.xml, and the NO_AUTO_COMPACTION property has not been set on these tables.
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>hive.compactor.initiator.on</name>
>>>>>>>   <value>true</value>
>>>>>>> </property>
>>>>>>> <property>
>>>>>>>   <name>hive.compactor.worker.threads</name>
>>>>>>>   <value>1</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> The documentation is upsetting sometimes.
>>>>>>>
>>>>>>> Regards
>>>>>>> Sanjiv Singh
>>>>>>> Mob : +091 9990-447-339
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, I was burned by this issue a couple of weeks back. This also means that after every insert job, a compaction has to be run for the new rows to be accessible from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Varadharajan,
>>>>>>>>>
>>>>>>>>> Thanks for your response.
>>>>>>>>>
>>>>>>>>> Yes, it is a transactional table; see the show create table output below.
>>>>>>>>>
>>>>>>>>> The table has hardly 3 records, and after triggering minor compaction on the table, it started showing results in Spark SQL.
>>>>>>>>>
>>>>>>>>> > ALTER TABLE hivespark COMPACT 'major';
>>>>>>>>>
>>>>>>>>> > show create table hivespark;
>>>>>>>>> CREATE TABLE `hivespark`(
>>>>>>>>>   `id` int,
>>>>>>>>>   `name` string)
>>>>>>>>> CLUSTERED BY (
>>>>>>>>>   id)
>>>>>>>>> INTO 32 BUCKETS
>>>>>>>>> ROW FORMAT SERDE
>>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>>> STORED AS INPUTFORMAT
>>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>>> OUTPUTFORMAT
>>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>>> LOCATION
>>>>>>>>>   'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>>> TBLPROPERTIES (
>>>>>>>>>   'COLUMN_STATS_ACCURATE'='true',
>>>>>>>>>   'last_modified_by'='root',
>>>>>>>>>   'last_modified_time'='1455859079',
>>>>>>>>>   'numFiles'='37',
>>>>>>>>>   'numRows'='3',
>>>>>>>>>   'rawDataSize'='0',
>>>>>>>>>   'totalSize'='11383',
>>>>>>>>>   'transactional'='true',
>>>>>>>>>   'transient_lastDdlTime'='1455864121') ;
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Sanjiv Singh
>>>>>>>>> Mob : +091 9990-447-339
>>>>>>>>>
>>>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Is the transactional attribute set on your table? I observed that the Hive transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of "desc extended <tablename>" in the Hive console.
>>>>>>>>>>
>>>>>>>>>> If you need to access a transactional table, consider doing a major compaction first and then try accessing the table.
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have observed that Spark SQL is not returning records for Hive bucketed ORC tables on HDP.
>>>>>>>>>>>
>>>>>>>>>>> In Spark SQL, I am able to list all tables, but queries on Hive bucketed tables are not returning records.
>>>>>>>>>>>
>>>>>>>>>>> I have also tried the same for non-bucketed Hive tables; those work fine.
>>>>>>>>>>>
>>>>>>>>>>> The same works on a plain Apache setup.
>>>>>>>>>>>
>>>>>>>>>>> Let me know if you need other details.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Sanjiv Singh
>>>>>>>>>>> Mob : +091 9990-447-339

--
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
-By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in
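P.S. To make the mismatch discussed above easy to spot, here is a small sketch under the same assumptions as the snippet at the top (spark-shell with sqlContext, Hive JDBC driver on the classpath, placeholder host/port and the default.foo table from the examples): count the same table through both paths and compare. If the Hive JDBC count is higher, there are delta files that Spark's direct file read is not picking up yet.

import java.sql.DriverManager

// Path 1: Spark reads the table's ORC files directly from HDFS;
// per the behaviour described in this thread, uncompacted deltas are missed.
val sparkCount = sqlContext.table("default.foo").count()

// Path 2: HiveServer2 executes the query itself and merges base + delta on the fly.
val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "hive", "")
val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM default.foo")
rs.next()
val hiveCount = rs.getLong(1)
conn.close()

println(s"spark sees $sparkCount rows, hive sees $hiveCount rows")
// If hiveCount > sparkCount, a major compaction (ALTER TABLE default.foo COMPACT 'major')
// should bring the Spark-side count back in line.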