Hi, I have suffered from Hive Streaming and transactions enough, so I can share my experience with you.
1) It's not a problem of Spark. It happens because of "peculiarities" / bugs of Hive Streaming and transactions. Hive Streaming and transactions are very raw technologies. If you look at Hive JIRA, you'll see several critical bugs concerning Hive Streaming and transactions, and some of them are resolved only in Hive 2+. But Cloudera & Hortonworks ship their distributions with an outdated & buggy Hive. So use Hive 2+; in my experience, earlier versions of Hive didn't run compaction at all.

2) In Hive 1.1, I issued the following statements:

ALTER TABLE default.foo COMPACT 'MAJOR';
SHOW COMPACTIONS;

My manual compaction request showed up in SHOW COMPACTIONS, but it was never carried out. (There is a small Spark-side check for this situation at the very end of this message, below the quoted mail.)

3) If you use Hive Streaming, it's not recommended (or even forbidden) to insert rows into Hive Streaming tables manually. Only the process that writes to such a table should insert the incoming rows, sequentially. Otherwise you'll get unpredictable behaviour.

4) Ordinary Hive tables are directories containing text, ORC, etc. files. Hive Streaming / transactional tables are directories containing numerous subdirectories with a "delta" prefix. Moreover, some delta subdirectories contain files with a "flush_length" suffix, and these files are 8 bytes long. The presence of a "flush_length" file in a subdirectory means that Hive is writing updates to that subdirectory right now. When Hive fails or is restarted, it starts writing into a new delta subdirectory with a new "flush_length" file, and the old "flush_length" file (the one used before the failure) still remains. One of the goals of compaction is to delete such outdated "flush_length" files.

Not every application / library can read such a folder structure or knows the details of the Hive Streaming / transactions implementation. Most software solutions still expect ordinary Hive tables as input. When they encounter the subdirectories or the special "flush_length" files, applications / libraries either "see nothing" (return 0 or an empty result set) or stumble over the "flush_length" files (return inexplicable errors). For instance, Facebook Presto couldn't read the subdirectories by default unless you activated special parameters, and even then it stumbled over the "flush_length" files, because Presto expects valid ORC files in the table's folders, not 8-byte "flush_length" files. (There is a directory-listing sketch of this layout at the very end of this message.)

So, I don't advise you to use Hive Streaming and transactions right now in real production systems (24/7/365) with hundreds of millions of events a day.

On Sat, Mar 12, 2016 at 11:24 AM, Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
> Hi All,
>
> I am facing this issue on an HDP setup, where compaction is required (only
> once) on transactional tables before Spark SQL can fetch their records.
> On the other hand, the Apache setup doesn't require compaction even once.
>
> Maybe something gets triggered in the metastore after compaction, so that
> Spark SQL starts recognizing the delta files.
>
> Let me know if you need other details to get to the root cause.
>
> Try this.
>
> *See the complete scenario:*
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into default.foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because the data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> hive> insert into foo values(20);
>
> scala> sqlContext.table("default.foo").count // *Gives 2, no compaction
> required.*
>
>
>
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
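P.S. Regarding point 2: here is a minimal sketch (not a definitive recipe) of how I check from the Spark shell whether a requested major compaction has actually made the rows visible. It assumes a Spark 1.x spark-shell built with Hive support, so that sqlContext is backed by a HiveContext (refreshTable is a HiveContext method since Spark 1.3); the table name default.foo and the expected row count are taken from Sanjiv's scenario, and the polling interval is arbitrary.

// Run in spark-shell after issuing in Hive:
//   ALTER TABLE default.foo COMPACT 'MAJOR';
val tableName = "default.foo"   // if refreshTable rejects a qualified name in your version, pass just "foo"
val expectedRows = 1L           // whatever was inserted before compaction

var visible = 0L
var attempts = 0
while (visible < expectedRows && attempts < 30) {
  // Drop cached metadata / file listings so Spark re-reads the table layout.
  // The cast is harmless if sqlContext is already typed as HiveContext in your shell.
  sqlContext.asInstanceOf[org.apache.spark.sql.hive.HiveContext].refreshTable(tableName)
  visible = sqlContext.table(tableName).count()
  println(s"attempt $attempts: $visible row(s) visible")
  if (visible < expectedRows) Thread.sleep(10000)   // give the compactor some time
  attempts += 1
}
// If this never reaches the expected count, the compaction request was most
// likely queued but never executed (exactly the situation from point 2).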
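P.P.S. Regarding point 4: this rough sketch just lists a transactional table's directory and flags the delta subdirectories, the 8-byte "flush_length" side files, and any base_* directories produced by major compaction. It uses only the plain Hadoop FileSystem API, nothing Hive-specific. The warehouse path /apps/hive/warehouse/foo is an assumption (the HDP default warehouse dir plus a table in the default database); adjust it to your hive.metastore.warehouse.dir.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical path -- replace with <warehouse dir>/<db dir>/<table name>.
val tableDir = new Path("/apps/hive/warehouse/foo")

// In spark-shell you can pass sc.hadoopConfiguration instead of new Configuration().
val fs = FileSystem.get(new Configuration())

for (sub <- fs.listStatus(tableDir)) {
  val name = sub.getPath.getName
  if (sub.isDirectory && name.startsWith("delta_")) {
    println(s"delta subdirectory: $name")
    for (f <- fs.listStatus(sub.getPath)) {
      val fname = f.getPath.getName
      if (fname.endsWith("_flush_length"))
        // 8-byte side file: Hive is (or was, before a failure) writing to this delta
        println(s"  flush_length side file: $fname (${f.getLen} bytes)")
      else
        println(s"  data file: $fname (${f.getLen} bytes)")
    }
  } else if (sub.isDirectory && name.startsWith("base_")) {
    println(s"base subdirectory (produced by major compaction): $name")
  } else {
    println(s"plain file (ordinary, non-transactional layout): $name")
  }
}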