Hi,

I have suffered from Hive Streaming and transactions enough, so I can share
my experience with you.

1) It's not a Spark problem. It happens because of "peculiarities" / bugs
of Hive Streaming. Hive Streaming and transactions are very raw
technologies. If you look at the Hive JIRA, you'll see several critical bugs
concerning Hive Streaming and transactions. Some of them are resolved only in
Hive 2+, but Cloudera & Hortonworks ship their distributions with an outdated
& buggy Hive.
So use Hive 2+. Earlier versions of Hive didn't run compaction at all.

2) In Hive 1.1, I issued the following statements:

ALTER TABLE default.foo COMPACT 'MAJOR';
SHOW COMPACTIONS;

My manual compaction request was shown, but it was never carried out.
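
Before blaming compaction itself, it's worth checking whether the compactor is
switched on at all. A minimal check from the Hive CLI (hive.compactor.initiator.on
and hive.compactor.worker.threads are the standard compactor properties;
default.foo is just the table from this thread):

SET hive.compactor.initiator.on;
SET hive.compactor.worker.threads;
ALTER TABLE default.foo COMPACT 'MAJOR';
SHOW COMPACTIONS;

SET with no value simply prints the current setting: the initiator has to be
true and the worker threads greater than 0, otherwise compaction requests sit
in the queue forever. And since the initiator and workers run inside the
metastore, the values that matter are the ones in the metastore's
hive-site.xml, not the client session.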

3) If you use Hive Streaming, it's not recommended (or even forbidden) to
insert rows into Hive Streaming tables manually. Only the process that
writes to such a table should insert incoming rows, sequentially. Otherwise
you'll get unpredictable behaviour.

4) Ordinary Hive tables are directories with text, ORC, etc. files.
Hive Streaming / transactional tables are directories that contain numerous
subdirectories with the "delta" prefix. Moreover, in some delta subdirectories
there are files with the "flush_length" suffix. A "flush_length" file is 8
bytes long. The presence of a "flush_length" file in a subdirectory means
that Hive is writing updates to that subdirectory right now. When Hive fails
or is restarted, it begins writing into a new delta subdirectory with a new
"flush_length" file, and the old "flush_length" file (the one used before the
failure) still remains.
One of the goals of compaction is to delete outdated "flush_length" files.
Not every application / library can read such a directory structure or knows
the details of the Hive Streaming / transactions implementation. Most
software solutions still expect ordinary Hive tables as input.
When they encounter the subdirectories or the special "flush_length" files,
applications / libraries either "see nothing" (return 0 or an empty result
set) or stumble over the "flush_length" files (return unexplainable errors).
(You can see this layout for yourself with a recursive listing of the table
directory; see the sketch below.)

For instance, Facebook Presto can't read the subfolders by default unless
you activate special parameters, and even then it stumbles over the
"flush_length" files, because Presto expects valid ORC files, not 8-byte side
files, in those folders.
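
If you want to look at this layout yourself, a recursive listing of the table
directory from the Hive CLI is enough (the warehouse path below is just an
example for the default.foo table from this thread; adjust it to your cluster):

dfs -ls -R /apps/hive/warehouse/foo;

For a transactional table you'll see delta_* subdirectories instead of plain
ORC files in the table root, and inside the deltas written by streaming ingest
there are small *_flush_length side files. As far as I understand, each of
them holds a single 8-byte length that says how much of the still-open delta
file is valid. After a successful major compaction the deltas are merged into
a base_* directory and the old deltas, together with their "flush_length"
files, are eventually cleaned up.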

So I don't advise you to use Hive Streaming and transactions right now in
real production systems (24/7/365) with hundreds of millions of events a
day.

On Sat, Mar 12, 2016 at 11:24 AM, @Sanjiv Singh <sanjiv.is...@gmail.com>
wrote:

> Hi All,
>
> I am facing this issue on an HDP setup, on which COMPACTION is required
> once on transactional tables before Spark SQL can fetch records.
> On the other hand, an Apache setup doesn't require compaction even once.
>
> Maybe something got triggered in the metastore after compaction, and Spark
> SQL started recognizing delta files.
>
> Let me know if you need other details to get to the root cause.
>
> Try this,
>
> *See complete scenario:*
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into default.foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> hive> insert into foo values(20);
>
> scala> sqlContext.table("default.foo").count // Gives 2, no compaction
> required.
>
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
