vinothchandar commented on pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#issuecomment-830440547
@pengzhiwei2018 I did some basic testing of the functionality. Most of it works well.
Below is my summary. Can we look at each issue and see if we can fix/address
it? I did run into problems with partitioned tables and a custom merge
expression, so I am wondering what I am missing or whether this is expected.
## Writing
```
// Create gh sample data.
create table gh_raw using parquet location
'file:///Users/vs/Cache/lake-microbenchmarks/sample-parquet';
```
### Working
#### Creating a table over existing Hudi table
```
create table hudi_debug using hudi location
'file:///Users/vs/Cache/hudi-debug/junit6920417274916003231/dataset/';
```
#### Create table as select
```
create table hudi_managed using hudi as select type, public, payload, repo,
actor, org, id, other from gh_raw;
```
Issues:
1) Even if the CTAS fails, it still ends up creating the table (i.e. it is
not atomic)
```
java.lang.RuntimeException: Table default.hudi_managed already exists. You
need to drop it first.
at
org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:48)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
at
org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
```
2) Fails when selecting all columns (probably needs more tests across data types)
```
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Could not convert
field Type from TIMESTAMP to bigint for field created_at
at
org.apache.hudi.hive.util.HiveSchemaUtil.getSchemaDifference(HiveSchemaUtil.java:98)
at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:205)
at
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:155)
at
org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:109)
... 52 more
```
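The error above suggests the Spark-to-Hive type mapping in the sync path resolves TIMESTAMP inconsistently. As a hedged illustration (not Hudi's actual code in `HiveSchemaUtil.getSchemaDifference`), a schema-sync check has to agree on a single target Hive type per source type, otherwise every sync reports a difference:

```python
# Illustrative sketch (not Hudi's actual mapping): a schema-sync check fails
# when the writer and the Hive table disagree on the type for a field.
SPARK_TO_HIVE = {
    "TimestampType": "timestamp",  # if the sync path maps this to "bigint"
    "LongType": "bigint",          # instead, the diff below is reported
    "StringType": "string",
}

def schema_difference(spark_schema, hive_schema):
    """Return fields whose Hive type does not match the expected mapping."""
    diffs = {}
    for field, spark_type in spark_schema.items():
        expected = SPARK_TO_HIVE[spark_type]
        if hive_schema.get(field) != expected:
            diffs[field] = (hive_schema.get(field), expected)
    return diffs

spark_schema = {"created_at": "TimestampType"}
hive_schema = {"created_at": "bigint"}  # what the error message reports
print(schema_difference(spark_schema, hive_schema))
# → {'created_at': ('bigint', 'timestamp')}
```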
#### Create table, external location
```
create table hudi_gh_ext using hudi location 'file:///tmp/hudi-gh' as select
type, public, payload, repo, actor, org, id, other from gh_raw;
```
#### Create table with schema, no location
```
create table hudi_gh_managed_fixed (id int, name string, price double, ts
long) using hudi options(primaryKey = 'id', precombineField = 'ts')
```
#### Truncate table
```
TRUNCATE table hudi_gh
```
Issues:
1) Truncation succeeds, but throws an error. This may be confusing to users.
```
21/04/30 14:34:46 WARN TruncateTableCommand: Exception when attempting to
uncache table `default`.`hudi_gh`
org.sparkproject.guava.util.concurrent.UncheckedExecutionException:
org.apache.hudi.exception.TableNotFoundException: Hoodie table not found in
path Unable to find a hudi table for the user provided paths.
...
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.TableNotFoundException: Hoodie table
not found in path Unable to find a hudi table for the user provided paths.
at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:81)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:99)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:337)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:256)
```
2) Querying the truncated table throws an error instead of returning an empty
result set
```
Caused by: org.apache.hudi.exception.TableNotFoundException: Hoodie table
not found in path Unable to find a hudi table for the user provided paths.
at org.apache.hudi.DataSourceUtils.getTablePath(DataSourceUtils.java:81)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:99)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:337)
at
org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:256)
at
org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
at
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
```
#### Drop table
The managed table is deleted; the external table is not. All good.
```
drop table hudi_managed;
drop table hudi_gh_ext;
```
#### Insert INTO .. Values
```
create table hudi_gh_ext_fixed (id int, name string, price double, ts long)
using hudi options(primaryKey = 'id', precombineField = 'ts') location
'file:///tmp/hudi-gh-fixed';
insert into hudi_gh_ext_fixed values(3, 'AMZN', 530, 120);
```
#### Insert INTO tbl SELECT * from anotherTbl
```
create table hudi_gh (type string, public boolean, payload string, id
string) using hudi;
insert into hudi_gh select type, public, payload, id from gh_raw;
```
#### Insert overwrite
```
insert overwrite table hudi_gh select type, public, payload, id from gh_raw
limit 10000;
```
#### Update Table
```
update hudi_gh_ext_fixed set price = 100.0 where name = 'UBER';
```
#### Merge Table
```
create table hudi_fixed (id int, name string, price double, ts long) using
hudi options(primaryKey = 'id', precombineField = 'ts') location
'file:///tmp/hudi-fixed';
insert into hudi_fixed values(1, 'UBER', 200, 120);
MERGE INTO hudi_fixed
USING hudi_gh_ext_fixed
ON hudi_fixed.id = hudi_gh_ext_fixed.id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED
THEN INSERT *;

MERGE INTO hudi_fixed
USING (select id, name, price, ts from hudi_gh_ext_fixed) updates
ON hudi_fixed.id = updates.id
WHEN MATCHED THEN
UPDATE SET hudi_fixed.price = hudi_fixed.price + updates.price
WHEN NOT MATCHED
THEN INSERT *
```
Issues:
1) Fails due to assignment field/schema mismatch
```
spark-sql> describe hudi_fixed;
...
_hoodie_commit_time string NULL
_hoodie_commit_seqno string NULL
_hoodie_record_key string NULL
_hoodie_partition_path string NULL
_hoodie_file_name string NULL
id int NULL
name string NULL
price double NULL
ts bigint NULL
Time taken: 0.027 seconds, Fetched 9 row(s)
spark-sql> describe hudi_gh_ext_fixed;
...
_hoodie_commit_time string NULL
_hoodie_commit_seqno string NULL
_hoodie_record_key string NULL
_hoodie_partition_path string NULL
_hoodie_file_name string NULL
id int NULL
name string NULL
price double NULL
ts bigint NULL
Time taken: 0.028 seconds, Fetched 9 row(s)
spark-sql>
```
```
java.lang.AssertionError: assertion failed: The number of update
assignments[1] must equal to the targetTable field size[4]
at scala.Predef$.assert(Predef.scala:223)
at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.$anonfun$checkUpdateAssignments$1(MergeIntoHoodieTableCommand.scala:307)
at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.$anonfun$checkUpdateAssignments$1$adapted(MergeIntoHoodieTableCommand.scala:305)
at
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
```
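The assertion message hints that `checkUpdateAssignments` compares the number of `SET` clauses against the full target field count, which rejects partial updates. A rough sketch of that check, reconstructed from the error (hypothetical, not the actual Scala code):

```python
# Reconstructed sketch of the failing check: a partial UPDATE SET with one
# assignment trips an assertion expecting one assignment per target field.
target_fields = ["id", "name", "price", "ts"]  # data fields of hudi_fixed
assignments = {"price": "hudi_fixed.price + updates.price"}  # partial update

def check_update_assignments(assignments, target_fields):
    # This mirrors the reported behavior; a check that permitted partial
    # updates would instead allow len(assignments) <= len(target_fields).
    assert len(assignments) == len(target_fields), (
        f"The number of update assignments[{len(assignments)}] must equal "
        f"to the targetTable field size[{len(target_fields)}]")

try:
    check_update_assignments(assignments, target_fields)
    failed = False
except AssertionError:
    failed = True
print(failed)  # True: the single-assignment merge trips the assertion
```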
2) Merges are only allowed on the primary key
```
java.lang.IllegalArgumentException: Merge Key[name] is not Equal to the
defined primary key[id] in table hudi_fixed
at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.buildMergeIntoConfig(MergeIntoHoodieTableCommand.scala:429)
at
org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:156)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
```
3) Merge does not update to the new value
```
MERGE INTO hudi_fixed
USING hudi_gh_ext_fixed
ON hudi_fixed.id = hudi_gh_ext_fixed.id
WHEN MATCHED THEN
UPDATE SET hudi_fixed.price = hudi_gh_ext_fixed.price + hudi_fixed.price
WHEN NOT MATCHED
THEN INSERT *
```
I see no effect on the hudi_fixed table. Old values remain.
#### Delete Table
```
delete from hudi_gh_ext_fixed where _hoodie_record_key = 'id:3';
delete from hudi_gh_ext where type = 'GollumEvent';
```
Issues:
1) Non-PK-based deletes are not working at the moment
```
java.lang.AssertionError: assertion failed: There are no primary key in
table `default`.`hudi_gh_ext`, cannot execute delete operator
at scala.Predef$.assert(Predef.scala:223)
at
org.apache.spark.sql.hudi.command.DeleteHoodieTableCommand.buildHoodieConfig(DeleteHoodieTableCommand.scala:68)
at
org.apache.spark.sql.hudi.command.DeleteHoodieTableCommand.run(DeleteHoodieTableCommand.scala:48)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
```
2) Why do we have to encode the column name into each record key? i.e.
`_hoodie_record_key = '1'` vs being `_hoodie_record_key = 'id:1'`
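For context on the `id:3` format observed above: the encoding stores `field:value` pairs, which lets composite keys join multiple pairs with commas. A hedged sketch of what parsing that format looks like (illustrative, not Hudi's key generator code):

```python
# Illustrative parser for the "field:value" record key encoding seen in
# _hoodie_record_key. Composite keys join pairs with commas, e.g. "a:1,b:2".
def parse_record_key(record_key):
    pairs = record_key.split(",")
    return {field: value for field, value in (p.split(":", 1) for p in pairs)}

print(parse_record_key("id:3"))            # {'id': '3'}
print(parse_record_key("id:1,type:Push"))  # {'id': '1', 'type': 'Push'}
```

The upside of the encoding is that a composite key remains self-describing; the downside, as noted, is that single-key deletes must spell out the column name.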
### Not working:
#### Create or Replace table
```
spark-sql> create or replace table hudi_debug using hudi location
'file:///Users/vs/Cache/hudi-debug/junit6920417274916003231/dataset/';
Error in query: REPLACE TABLE is only supported with v2 tables.;
```
#### Create table, partitioned by
```
create table hudi_gh_ext using hudi partitioned by (type) location
'file:///tmp/hudi-gh-ext' as select type, public, payload, repo, actor, org,
id, other from gh_raw;
select count(*) from hudi_gh_ext;
0
```
Issues:
1) Throws an error after running the SQL (I physically deleted the external
table basepath and dropped the table, and yet the create still fails)
```
21/04/30 15:32:20 ERROR SparkSQLDriver: Failed in [create table hudi_gh_ext
using hudi partitioned by (type) location 'file:///tmp/hudi-gh-ext' as select
type, public, payload, repo, actor, org, id, other from gh_raw]
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at
org.apache.spark.sql.catalyst.catalog.CatalogTable.partitionSchema(interface.scala:259)
at
org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignOutputFields(InsertIntoHoodieTableCommand.scala:104)
at
org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:85)
at
org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:64)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
```
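The assertion originates in Spark's `CatalogTable.partitionSchema`, which requires the partition columns to be the trailing columns of the table schema, in order; in the CTAS above, `type` is the first selected column. A simplified sketch of that invariant (the column names match the `gh_raw` select, the helper is hypothetical):

```python
# Simplified sketch of Spark's CatalogTable.partitionSchema invariant:
# partition columns must be the trailing columns of the schema, in order.
def partition_schema(schema, partition_columns):
    trailing = schema[-len(partition_columns):]
    assert trailing == partition_columns, "assertion failed"
    return trailing

schema = ["type", "public", "payload", "repo", "actor", "org", "id", "other"]

# "type" is selected first, not last, so the assertion fires:
try:
    partition_schema(schema, ["type"])
    ok = True
except AssertionError:
    ok = False
print(ok)  # False

# Reordering the select so partition columns come last satisfies it:
reordered = ["public", "payload", "repo", "actor", "org", "id", "other", "type"]
print(partition_schema(reordered, ["type"]))  # ['type']
```

This suggests the CTAS path may need to reorder the query output (or the catalog schema) so partition columns land at the end before invoking the insert.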