That's interesting. I'm not sure why the first compaction is needed but not one after each subsequent insert. Maybe it's just to create some metadata. Thanks for clarifying this :)
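In case it helps others who hit this: below is a minimal sketch (Scala, spark-shell with the usual sqlContext, Hive JDBC driver on the classpath) of the workaround discussed further down in this thread, i.e. triggering a major compaction once the batch insert has finished and only then reading the table through sqlContext.table. The host/port, credentials and the default.foo table are placeholders taken from the examples below, not a tested recipe.

import java.sql.DriverManager

// Older setups may need the Hive JDBC driver registered explicitly.
Class.forName("org.apache.hive.jdbc.HiveDriver")

// Ask Hive to major-compact the table after the batch insert completes.
val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "hive", "")
try {
  val stmt = conn.createStatement()
  // The compaction request is only queued here and runs asynchronously;
  // SHOW COMPACTIONS in Hive shows when it has finished and produced a new base_* dir.
  stmt.execute("ALTER TABLE default.foo COMPACT 'major'")
  stmt.close()
} finally {
  conn.close()
}

// Once the new base directory exists, sqlContext.table sees the rows that
// were previously only in delta files.
val visibleRows = sqlContext.table("default.foo").count()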
On Tue, Feb 23, 2016 at 2:15 PM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:

> Try this,
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into default.foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> hive> insert into foo values(20);
>
> scala> sqlContext.table("default.foo").count // Gives 2, no compaction required.
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>
>> This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure if it's different.
>>
>> Please walk through the commands below in the same order to understand the sequence.
>>
>> hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
>> hive> insert into foo values(10);
>>
>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because data is still in delta files
>>
>> Now run major compaction:
>>
>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>
>> scala> sqlContext.table("default.foo").count // Gives 1
>>
>> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>
>>> Hi Varadharajan,
>>>
>>> That is the point: Spark SQL is able to recognize delta files. See the directory structure below, ONE BASE (43 records) and one DELTA (created after the last insert), and I am able to see the last insert through Spark SQL.
>>>
>>> See the complete scenario below.
>>>
>>> Steps:
>>>
>>> - Inserted 43 records into the table.
>>> - Ran major compaction on the table:
>>>   alter table mytable COMPACT 'major';
>>> - Disabled auto compaction on the table:
>>>   alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");
>>> - Inserted 1 record into the table.
>>>
>>> > hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable
>>> drwxrwxrwx - root hdfs 0 2016-02-23 11:43 /apps/hive/warehouse/mydb.db/mytable/base_0000087
>>> drwxr-xr-x - root hdfs 0 2016-02-23 12:02 /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>>>
>>> SPARK JDBC:
>>>
>>> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
>>> +------+
>>> | _c0  |
>>> +------+
>>> | 44   |
>>> +------+
>>> 1 row selected (1.196 seconds)
>>>
>>> HIVE JDBC:
>>>
>>> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
>>> +------+--+
>>> | _c0  |
>>> +------+--+
>>> | 44   |
>>> +------+--+
>>> 1 row selected (0.121 seconds)
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob : +091 9990-447-339
>>>
>>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>
>>>> Hi Sanjiv,
>>>>
>>>> Yes, if we make use of Hive JDBC we should be able to retrieve all the rows, since it is Hive which processes the query. But I think the problem with Hive JDBC is that there are two layers of processing: Hive first, then Spark on the result set. Another is that performance is limited to that single HiveServer2 node and its network.
>>>>
>>>> But if we make use of the sqlContext.table function in Spark to access Hive tables, it is supposed to read the files directly from HDFS, skipping the Hive layer. It doesn't read delta files, though; it only reads the contents of the base folder. Only after a major compaction are the delta files merged into the base folder and made visible to Spark SQL.
>>>>
>>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>
>>>>> Hi Varadharajan,
>>>>>
>>>>> Can you elaborate on this (from your previous mail): "I observed that the Hive transactional storage structure does not work with Spark yet"?
>>>>>
>>>>> If it is about the delta files created after each transaction and Spark not being able to recognize them: I have a table mytable (ORC, BUCKETED, NON-SORTED) that has already had lots of inserts, updates and deletes. I can see the delta files created in HDFS (see below), and I am still able to fetch consistent records through Spark JDBC and Hive JDBC.
>>>>>
>>>>> No compaction has been triggered for that table.
>>>>>
>>>>> > hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable
>>>>>
>>>>> drwxrwxrwx - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:40 /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:41 /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>>>
>>>>> Regards
>>>>> Sanjiv Singh
>>>>> Mob : +091 9990-447-339
>>>>>
>>>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>
>>>>>> Actually, auto compaction, if enabled, is triggered based on the volume of changes; it doesn't automatically run after every insert. I think it's possible to reduce the thresholds, but that might reduce performance by a big margin. As of now, we run compaction after the batch insert completes.
>>>>>>
>>>>>> The only other way to solve this problem as of now is to use the Hive JDBC API.
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>
>>>>>>> Compaction should have been triggered automatically, as the following properties are already set in hive-site.xml, and the NO_AUTO_COMPACTION property has not been set on these tables.
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>hive.compactor.initiator.on</name>
>>>>>>>   <value>true</value>
>>>>>>> </property>
>>>>>>> <property>
>>>>>>>   <name>hive.compactor.worker.threads</name>
>>>>>>>   <value>1</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> The documentation is upsetting sometimes.
>>>>>>>
>>>>>>> Regards
>>>>>>> Sanjiv Singh
>>>>>>> Mob : +091 9990-447-339
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, I was burned by this issue a couple of weeks back. This also means that after every insert job, a compaction has to be run for the new rows to be accessible from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Varadharajan,
>>>>>>>>>
>>>>>>>>> Thanks for your response.
>>>>>>>>>
>>>>>>>>> Yes, it is a transactional table; see the show create table output below.
>>>>>>>>>
>>>>>>>>> The table has hardly 3 records, and after triggering minor compaction on the table, it started showing results in Spark SQL.
>>>>>>>>>
>>>>>>>>> > ALTER TABLE hivespark COMPACT 'major';
>>>>>>>>>
>>>>>>>>> > show create table hivespark;
>>>>>>>>> CREATE TABLE `hivespark`(
>>>>>>>>>   `id` int,
>>>>>>>>>   `name` string)
>>>>>>>>> CLUSTERED BY (
>>>>>>>>>   id)
>>>>>>>>> INTO 32 BUCKETS
>>>>>>>>> ROW FORMAT SERDE
>>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>>> STORED AS INPUTFORMAT
>>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>>> OUTPUTFORMAT
>>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>>> LOCATION
>>>>>>>>>   'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>>> TBLPROPERTIES (
>>>>>>>>>   'COLUMN_STATS_ACCURATE'='true',
>>>>>>>>>   'last_modified_by'='root',
>>>>>>>>>   'last_modified_time'='1455859079',
>>>>>>>>>   'numFiles'='37',
>>>>>>>>>   'numRows'='3',
>>>>>>>>>   'rawDataSize'='0',
>>>>>>>>>   'totalSize'='11383',
>>>>>>>>>   'transactional'='true',
>>>>>>>>>   'transient_lastDdlTime'='1455864121') ;
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Sanjiv Singh
>>>>>>>>> Mob : +091 9990-447-339
>>>>>>>>>
>>>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Is the transactional attribute set on your table? I observed that the Hive transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of "desc extended <tablename>" in the Hive console.
>>>>>>>>>>
>>>>>>>>>> If you need to access a transactional table, consider doing a major compaction first and then try accessing the table.
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have observed that Spark SQL is not returning records for Hive bucketed ORC tables on HDP.
>>>>>>>>>>>
>>>>>>>>>>> In Spark SQL, I am able to list all tables, but queries on Hive bucketed tables are not returning records.
>>>>>>>>>>>
>>>>>>>>>>> I have also tried the same for non-bucketed Hive tables; those work fine.
>>>>>>>>>>>
>>>>>>>>>>> The same works on a plain Apache setup.
>>>>>>>>>>>
>>>>>>>>>>> Let me know if you need other details.
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> Sanjiv Singh
>>>>>>>>>>> Mob : +091 9990-447-339

--
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
-By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in
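P.S. To make the mismatch discussed above easy to spot, here is a small sketch under the same assumptions as the snippet at the top (spark-shell with sqlContext, Hive JDBC driver on the classpath, placeholder host/port and the default.foo table from the examples): count the same table through both paths and compare. If the Hive JDBC count is higher, there are delta files that Spark's direct file read is not picking up yet.

import java.sql.DriverManager

// Path 1: Spark reads the table's ORC files directly from HDFS;
// per the behaviour described in this thread, uncompacted deltas are missed.
val sparkCount = sqlContext.table("default.foo").count()

// Path 2: HiveServer2 executes the query itself and merges base + delta on the fly.
val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "hive", "")
val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM default.foo")
rs.next()
val hiveCount = rs.getLong(1)
conn.close()

println(s"spark sees $sparkCount rows, hive sees $hiveCount rows")
// If hiveCount > sparkCount, a major compaction (ALTER TABLE default.foo COMPACT 'major')
// should bring the Spark-side count back in line.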