vinothchandar commented on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-830440547
@pengzhiwei2018 I did some basic testing of the functionality. Most of it works well.
Below is my summary. Can we look at each issue and see if we can fix/address
it? I did run into problems with partitioned tables and a custom merge
expression, so I am wondering what I am missing or whether this is expected.
## Writing
```
// Create gh sample data.
create table gh_raw using parquet location
'file:///Users/vs/Cache/lake-microbenchmarks/sample-parquet';
```
### Working
#### Creating a table over existing Hudi table
```
create table hudi_debug using hudi location
'file:///Users/vs/Cache/hudi-debug/junit6920417274916003231/dataset/';
```
#### Create table as select
```
create table hudi_managed using hudi as select type, public, payload, repo,
actor, org, id, other from gh_raw;
```
Issues:
1) Even if the CTAS fails, it still ends up creating the table (i.e. it is
not atomic)
```
java.lang.RuntimeException: Table default.hudi_managed already exists. You
need to drop it first.
at
org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:48)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
at
org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
```
2) Fails when selecting all columns (probably needs more tests across data types)
```
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Could not convert
field Type from TIMESTAMP to bigint for field created_at
at
org.apache.hudi.hive.util.HiveSchemaUtil.getSchemaDifference(HiveSchemaUtil.java:98)
at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:205)
at
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:155)
at
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:109)
... 52 more
```
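The error above suggests the Spark-to-Hive type mapping in the sync path resolves TIMESTAMP inconsistently. As a hedged illustration (not Hudi's actual code in `HiveSchemaUtil.getSchemaDifference`), a schema-sync check has to agree on a single target Hive type per source type, otherwise every sync reports a difference:

```python
# Illustrative sketch (not Hudi's actual mapping): a schema-sync check fails
# when the writer and the Hive table disagree on the type for a field.
SPARK_TO_HIVE = {
    "TimestampType": "timestamp",  # if the sync path maps this to "bigint"
    "LongType": "bigint",          # instead, the diff below is reported
    "StringType": "string",
}

def schema_difference(spark_schema, hive_schema):
    """Return fields whose Hive type does not match the expected mapping."""
    diffs = {}
    for field, spark_type in spark_schema.items():
        expected = SPARK_TO_HIVE[spark_type]
        if hive_schema.get(field) != expected:
            diffs[field] = (hive_schema.get(field), expected)
    return diffs

spark_schema = {"created_at": "TimestampType"}
hive_schema = {"created_at": "bigint"}  # what the error message reports
print(schema_difference(spark_schema, hive_schema))
# → {'created_at': ('bigint', 'timestamp')}
```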
#### Create table, external location
```
create table hudi_gh_ext using hudi location 'file:///tmp/hudi-gh' as select
type, public, payload, repo, actor, org, id, other from gh_raw;
```
#### Create table with schema, no location
```
create table hudi_gh_managed_fixed (id int, name string, price double, ts
long) using hudi options(primaryKey = 'id', precombineField = 'ts')
```
#### Truncate table
```
TRUNCATE table hudi_gh
```
Issues:
1) Truncation succeeds, but throws an error. This may be confusing to users.
```
21/04/30 14:34:46 WARN TruncateTableCommand: Exception when attempting to
uncache table `default`.`hudi_gh`
org.sparkproject.guava.util.concurrent.UncheckedExecutionException:
org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in
path Unable to find a hudi table for the user provided paths.
...
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.TableNotFoundException: Hoodie table
not found in path Unable to find a hudi table for the user provided paths.
at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:81)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:99)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:337)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:256)
```
2) Querying the truncated table throws an error instead of returning an empty
result set
```
Caused by: org.apache.hudi.exception.TableNotFoundException: Hoodie table
not found in path Unable to find a hudi table for the user provided paths.
at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:81)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:99)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:337)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:256)
at
org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
at
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
```
#### Drop table
The managed table is deleted; the external table is not. All good.
```
drop table hudi_managed;
drop table hudi_gh_ext;
```
#### Insert INTO .. Values
```
create table hudi_gh_ext_fixed (id int, name string, price double, ts long)
using hudi options(primaryKey = 'id', precombineField = 'ts') location
'file:///tmp/hudi-gh-fixed';
insert into hudi_gh_ext_fixed values(3, 'AMZN', 530, 120);
```
#### Insert INTO tbl SELECT * from anotherTbl
```
create table hudi_gh (type string, public boolean, payload string, id
string) using hudi;
insert into hudi_gh select type, public, payload, id from gh_raw;
```
#### Insert overwrite
```
insert overwrite table hudi_gh select type, public, payload, id from gh_raw
limit 10000;
```
#### Update Table
```
update hudi_gh_ext_fixed set price = 100.0 where name = 'UBER';
```
#### Merge Table
```
create table hudi_fixed (id int, name string, price double, ts long) using
hudi options(primaryKey = 'id', precombineField = 'ts') location
'file:///tmp/hudi-fixed';
insert into hudi_fixed values(1, 'UBER', 200, 120);
MERGE INTO hudi_fixed
USING hudi_gh_ext_fixed
ON hudi_fixed.id = hudi_gh_ext_fixed.id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED
THEN INSERT *;

MERGE INTO hudi_fixed
USING (select id, name, price, ts from hudi_gh_ext_fixed) updates
ON hudi_fixed.id = updates.id
WHEN MATCHED THEN
UPDATE SET hudi_fixed.price = hudi_fixed.price + updates.price
WHEN NOT MATCHED
THEN INSERT *
```
Issues:
1) Fails due to assignment field/schema mismatch
```
spark-sql> describe hudi_fixed;
...
_hoodie_commit_time string NULL
_hoodie_commit_seqno string NULL
_hoodie_record_key string NULL
_hoodie_partition_path string NULL
_hoodie_file_name string NULL
id int NULL
name string NULL
price double NULL
ts bigint NULL
Time taken: 0.027 seconds, Fetched 9 row(s)
spark-sql> describe hudi_gh_ext_fixed;
...
_hoodie_commit_time string NULL
_hoodie_commit_seqno string NULL
_hoodie_record_key string NULL
_hoodie_partition_path string NULL
_hoodie_file_name string NULL
id int NULL
name string NULL
price double NULL
ts bigint NULL
Time taken: 0.028 seconds, Fetched 9 row(s)
spark-sql>
```
```
java.lang.AssertionError: assertion failed: The number of update
assignments[1] must equal to the targetTable field size[4]
at scala.Predef$.assert(Predef.scala:223)
at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.$anonfun$checkUpdateAssignments$1(MergeIntoHoodieTableCommand.scala:307)
at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.$anonfun$checkUpdateAssignments$1$adapted(MergeIntoHoodieTableCommand.scala:305)
at
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
```
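The assertion message hints that `checkUpdateAssignments` compares the number of `SET` clauses against the full target field count, which rejects partial updates. A rough sketch of that check, reconstructed from the error (hypothetical, not the actual Scala code):

```python
# Reconstructed sketch of the failing check: a partial UPDATE SET with one
# assignment trips an assertion expecting one assignment per target field.
target_fields = ["id", "name", "price", "ts"]  # data fields of hudi_fixed
assignments = {"price": "hudi_fixed.price + updates.price"}  # partial update

def check_update_assignments(assignments, target_fields):
    # This mirrors the reported behavior; a check that permitted partial
    # updates would instead allow len(assignments) <= len(target_fields).
    assert len(assignments) == len(target_fields), (
        f"The number of update assignments[{len(assignments)}] must equal "
        f"to the targetTable field size[{len(target_fields)}]")

try:
    check_update_assignments(assignments, target_fields)
    failed = False
except AssertionError:
    failed = True
print(failed)  # True: the single-assignment merge trips the assertion
```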
2) Merges are only allowed on the primary key
```
java.lang.IllegalArgumentException: Merge Key[name] is not Equal to the
defined primary key[id] in table hudi_fixed
at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:429)
at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:156)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
```
3) Merge does not update to the new value
```
MERGE INTO hudi_fixed
USING hudi_gh_ext_fixed
ON hudi_fixed.id = hudi_gh_ext_fixed.id
WHEN MATCHED THEN
UPDATE SET hudi_fixed.price = hudi_gh_ext_fixed.price + hudi_fixed.price
WHEN NOT MATCHED
THEN INSERT *
```
I see no effect on the hudi_fixed table. Old values remain.
#### Delete Table
```
delete from hudi_gh_ext_fixed where _hoodie_record_key = 'id:3';
delete from hudi_gh_ext where type = 'GollumEvent';
```
Issues:
1) Non-PK-based deletes are not working at the moment
```
java.lang.AssertionError: assertion failed: There are no primary key in
table `default`.`hudi_gh_ext`, cannot execute delete operator
at scala.Predef$.assert(Predef.scala:223)
at
org.apache.spark.sql.hudi.command.DeleteHoodieTableCommand.buildHoodieConfig(DeleteHoodieTableCommand.scala:68)
at
org.apache.spark.sql.hudi.command.DeleteHoodieTableCommand.run(DeleteHoodieTableCommand.scala:48)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
```
2) Why do we have to encode the column name into each record key? i.e.
`_hoodie_record_key = '1'` vs being `_hoodie_record_key = 'id:1'`
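For context on the `id:3` format observed above: the encoding stores `field:value` pairs, which lets composite keys join multiple pairs with commas. A hedged sketch of what parsing that format looks like (illustrative, not Hudi's key generator code):

```python
# Illustrative parser for the "field:value" record key encoding seen in
# _hoodie_record_key. Composite keys join pairs with commas, e.g. "a:1,b:2".
def parse_record_key(record_key):
    pairs = record_key.split(",")
    return {field: value for field, value in (p.split(":", 1) for p in pairs)}

print(parse_record_key("id:3"))            # {'id': '3'}
print(parse_record_key("id:1,type:Push"))  # {'id': '1', 'type': 'Push'}
```

The upside of the encoding is that a composite key remains self-describing; the downside, as noted, is that single-key deletes must spell out the column name.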
### Not working:
#### Create or Replace table
```
spark-sql> create or replace table hudi_debug using hudi location
'file:///Users/vs/Cache/hudi-debug/junit6920417274916003231/dataset/';
Error in query: REPLACE TABLE is only supported with v2 tables.;
```
#### Create table, partitioned by
```
create table hudi_gh_ext using hudi partitioned by (type) location
'file:///tmp/hudi-gh-ext' as select type, public, payload, repo, actor, org,
id, other from gh_raw;
select count(*) from hudi_gh_ext;
0
```
Issues:
1) Throws an error after running the SQL (I physically deleted the external
table basepath and dropped the table, and yet the create still fails)
```
21/04/30 15:32:20 ERROR SparkSQLDriver: Failed in [create table hudi_gh_ext
using hudi partitioned by (type) location 'file:///tmp/hudi-gh-ext' as select
type, public, payload, repo, actor, org, id, other from gh_raw]
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at
org.apache.spark.sql.catalyst.catalog.CatalogTable.partitionSchema(interface.scala:259)
at
org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignOutputFields(InsertIntoHoodieTableCommand.scala:104)
at
org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:85)
at
org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:64)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
```
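The assertion originates in Spark's `CatalogTable.partitionSchema`, which requires the partition columns to be the trailing columns of the table schema, in order; in the CTAS above, `type` is the first selected column. A simplified sketch of that invariant (the column names match the `gh_raw` select, the helper is hypothetical):

```python
# Simplified sketch of Spark's CatalogTable.partitionSchema invariant:
# partition columns must be the trailing columns of the schema, in order.
def partition_schema(schema, partition_columns):
    trailing = schema[-len(partition_columns):]
    assert trailing == partition_columns, "assertion failed"
    return trailing

schema = ["type", "public", "payload", "repo", "actor", "org", "id", "other"]

# "type" is selected first, not last, so the assertion fires:
try:
    partition_schema(schema, ["type"])
    ok = True
except AssertionError:
    ok = False
print(ok)  # False

# Reordering the select so partition columns come last satisfies it:
reordered = ["public", "payload", "repo", "actor", "org", "id", "other", "type"]
print(partition_schema(reordered, ["type"]))  # ['type']
```

This suggests the CTAS path may need to reorder the query output (or the catalog schema) so partition columns land at the end before invoking the insert.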