schlichtanders commented on issue #6808:
URL: https://github.com/apache/hudi/issues/6808#issuecomment-1308766125

   Sorry, I was blocked by higher priorities.
   The production use case is proper unit tests of Spark Hudi jobs. As a workaround I was able to set up PostgreSQL as a local Hive metastore, and that worked. Switching the very same setup to Derby throws no error, but it also does not update the Derby database, i.e. Derby does not work. I am using hudi-spark3.2-bundle_2.12:0.12.0.
   
   # Minimal failing example (which works for Postgres MetaStore)
   
   Here is the code that was used for both PostgreSQL and in-memory Derby, with the in-memory Derby config enabled in `PYSPARK_SUBMIT_ARGS`.
   
   ```python
   from pyspark.sql import SparkSession
   import pyspark.sql.functions as sqlf
   from pathlib import Path
   from types import SimpleNamespace
   import os
   
   os.environ["SPARK_PRINT_LAUNCH_COMMAND"] = "true"
   
    # postgres is searched on classpath by hudi, but unfortunately the --packages are not
    # properly added to classpath. The --packages jars are always downloaded to .ivy2/jars
    # hence we can directly reference them from there
    # see https://stackoverflow.com/questions/43417216/spark-submit-packages-is-not-working-on-my-cluster-what-could-be-the-reason
    postgres = SimpleNamespace(
        domain="org.postgresql",
        package="postgresql",
        version="9.4.1207",
    )
    spark_packages_jars = Path.home() / ".ivy2" / "jars" / f"{postgres.domain}_{postgres.package}-{postgres.version}.jar"
   
   
    os.environ["PYSPARK_SUBMIT_ARGS"] = " ".join([
        # hudi, avro and postgresql
        f"--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.12.0,org.apache.spark:spark-avro_2.12:3.2.2,{postgres.domain}:{postgres.package}:{postgres.version}",
        f"--conf spark.driver.extraClassPath={spark_packages_jars}",
        f"--conf spark.executor.extraClassPath={spark_packages_jars}",

        # hudi config
        # -----------

        "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog",
        "--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
        # taken from the AWS example https://aws.amazon.com/blogs/big-data/part-1-integrate-apache-hudi-delta-lake-apache-iceberg-datasets-at-scale-aws-glue-studio-notebook/
        "--conf spark.sql.hive.convertMetastoreParquet=false",

        # hive metastore config
        # ---------------------

        # SparkSession.builder.enableHiveSupport()
        "--conf spark.sql.catalogImplementation=hive",
        # taken from dbt-spark example
        "--conf spark.hadoop.datanucleus.schema.autoCreateTables=true",
        "--conf spark.hadoop.datanucleus.fixedDatastore=false",
        "--conf spark.hadoop.hive.metastore.schema.verification=false",
        # "--conf spark.hadoop.hive.metastore.schema.verification.record.version=false",
        "--conf spark.driver.userClassPathFirst=true",

        # in memory derby metastore - does not work
        "--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.EmbeddedDriver",
        "--conf spark.hadoop.javax.jdo.option.ConnectionURL='jdbc:derby:memory:databaseName=metastore_db;create=true'",  # noqa

        # local postgresql metastore - works
        # "--conf spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:postgresql://localhost:5432/metastore?createDatabaseIfNotExist=true",
        # "--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver",
        # "--conf spark.hadoop.javax.jdo.option.ConnectionUserName=test",
        # "--conf spark.hadoop.javax.jdo.option.ConnectionPassword=test",

        # others
        # ------

        "--conf spark.eventLog.enabled=false",

        # necessary last string
        "pyspark-shell",
    ])
   
   spark = SparkSession.builder.getOrCreate()
   
   dst_database = "default"
   spark.sql(f"CREATE DATABASE IF NOT EXISTS {dst_database}")
   
   spark.sql(f"SHOW TABLES FROM `{dst_database}`").show(truncate=False)  # empty
   
   
   # Dummy Data
   # -----------------
   
   sc = spark.sparkContext
   sc.setLogLevel("WARN")
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(
       dataGen.generateInserts(5)
   )
   from pyspark.sql.functions import expr
   
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)).withColumn("part", sqlf.lit("partition"))
   
   
   # Write via saveAsTable works
   # -------------------------------
   
   df.write.mode("overwrite").saveAsTable("saveastable_table")
    spark.sql(f"SHOW TABLES FROM `{dst_database}`").show(truncate=False)  # the saveastable_table shows up
   
   
   # Write via Hudi does nothing
   # -------------------------------
   
   table = "test_hudi_pyspark_local"
   path = f"{Path('.').absolute()}/tmp/{table}"
   col_id = "uuid"
   col_sort = "ts"
   col_partition = "part"
   
   hudi_options = {
       'hoodie.table.name': table,
       'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
       'hoodie.datasource.write.recordkey.field': col_id,
       'hoodie.datasource.write.partitionpath.field': col_partition,
       'hoodie.datasource.write.table.name': table,
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': col_sort,
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.upsert.shuffle.parallelism': 2,
       'hoodie.insert.shuffle.parallelism': 2,
       'path': path,
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.database': dst_database,
       'hoodie.datasource.hive_sync.table': table,
       'hoodie.datasource.hive_sync.partition_fields': col_partition,
        'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.hive_sync.use_jdbc': 'false',
       'hoodie.datasource.hive_sync.mode': 'hms',
       "hoodie.index.type": "GLOBAL_BLOOM",
   }
   
   
    df.write.format("org.apache.hudi").options(**hudi_options).mode("overwrite").save()
    spark.sql(f"SHOW TABLES FROM {dst_database}").show(truncate=False)  # still only saveastable_table shows up
   ```
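   Not part of the original report, but since the long `--conf` list above is easy to mangle (especially values containing `;`, which need quoting), here is a small sketch of a helper that assembles `PYSPARK_SUBMIT_ARGS` from a dict. The function name `build_pyspark_submit_args` is made up for illustration; it relies on pyspark splitting the variable with shell-style rules, and on `pyspark-shell` being the required final token, as the listing above notes.

   ```python
import shlex

def build_pyspark_submit_args(packages, confs):
    """Assemble a PYSPARK_SUBMIT_ARGS string from a list of Maven
    coordinates and a dict of Spark conf key/value pairs."""
    parts = []
    if packages:
        parts.append("--packages " + ",".join(packages))
    for key, value in confs.items():
        # shlex.quote protects values containing ';' or spaces,
        # e.g. the derby in-memory ConnectionURL
        parts.append(f"--conf {key}={shlex.quote(str(value))}")
    parts.append("pyspark-shell")  # required last token for pyspark
    return " ".join(parts)
   ```

   The result can be assigned to `os.environ["PYSPARK_SUBMIT_ARGS"]` before `SparkSession.builder.getOrCreate()`, exactly as in the listing above.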
   
   The Hudi save outputs the following:
   ```
    22/11/09 14:18:44 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
    22/11/09 14:18:44 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
    22/11/09 14:18:44 WARN DataSourceOptionsHelper$: hoodie.datasource.write.storage.type is deprecated and will be removed in a later release; Please use hoodie.datasource.write.table.type
    22/11/09 14:18:45 WARN HoodieBackedTableMetadata: Metadata table was not found at path /path/to/test_hudi_pyspark_local/.hoodie/metadata
    22/11/09 14:18:48 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
    22/11/09 14:19:05 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
    22/11/09 14:19:05 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore [email protected]
    22/11/09 14:19:05 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
    22/11/09 14:19:07 WARN log: Updating partition stats fast for: test_hudi_pyspark_local
    22/11/09 14:19:07 WARN log: Updated size to 438288
   ```
   
   # Postgres local metastore
   
   For the Postgres metastore I started Postgres via Docker:
   ```
    docker run -p 5432:5432 -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=metastore -i postgres:9.6.17-alpine
   ```
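   Aside (not in the original report): when unit tests spin up this container themselves, the SparkSession may be created before Postgres accepts connections. A minimal sketch of a wait helper using plain sockets (no psycopg2 dependency); the name `wait_for_port` is made up for illustration:

   ```python
import socket
import time

def wait_for_port(host, port, timeout=30.0):
    """Poll until a TCP connection to host:port succeeds, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not up yet; retry until the deadline
    return False
   ```

   For the Docker command above, the call would be `wait_for_port("localhost", 5432)` before building the SparkSession.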
   In addition, you need to uncomment/comment the following lines within the above definition of `PYSPARK_SUBMIT_ARGS`:
   ```python
        # in memory derby metastore - does not work
        # "--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.EmbeddedDriver",
        # "--conf spark.hadoop.javax.jdo.option.ConnectionURL='jdbc:derby:memory:databaseName=metastore_db;create=true'",  # noqa

        # local postgresql metastore - works
        "--conf spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:postgresql://localhost:5432/metastore?createDatabaseIfNotExist=true",
        "--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver",
        "--conf spark.hadoop.javax.jdo.option.ConnectionUserName=test",
        "--conf spark.hadoop.javax.jdo.option.ConnectionPassword=test",
   ```
   Then everything works.
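   The comment/uncomment toggling can also be expressed as two conf dicts plus a selector, so tests switch backends with a single argument. This is a sketch, not from the report; the function name `metastore_confs` is made up, but the `javax.jdo` keys and values are exactly the ones used above (user/password `test` come from the Docker command):

   ```python
def metastore_confs(backend):
    """Return the spark.hadoop.javax.jdo.* confs for the chosen metastore."""
    prefix = "spark.hadoop.javax.jdo.option."
    if backend == "derby":
        # in-memory derby - currently does not sync (this issue)
        confs = {
            "ConnectionDriverName": "org.apache.derby.jdbc.EmbeddedDriver",
            "ConnectionURL": "jdbc:derby:memory:databaseName=metastore_db;create=true",
        }
    elif backend == "postgres":
        # local dockerized postgres - works
        confs = {
            "ConnectionDriverName": "org.postgresql.Driver",
            "ConnectionURL": "jdbc:postgresql://localhost:5432/metastore?createDatabaseIfNotExist=true",
            "ConnectionUserName": "test",
            "ConnectionPassword": "test",
        }
    else:
        raise ValueError(f"unknown metastore backend: {backend}")
    return {prefix + key: value for key, value in confs.items()}
   ```

   Each key/value pair then becomes one `--conf key=value` entry in `PYSPARK_SUBMIT_ARGS`.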
   
   # Links to derby tests
   
   Unfortunately, the links no longer work, probably because they refer to master/main, which has changed in the meantime.
   
   > btw, we have a script that we developed recently to test out local derby using a derby client. we developed this flow to test hive sync using spark-bundle. You can find the scripts here https://github.com/apache/hudi/tree/master/packaging/bundle-validation/spark-write-hive-sync
   >
   > specifically: https://github.com/apache/hudi/blob/master/packaging/bundle-validation/spark-write-hive-sync/Dockerfile
   > https://github.com/apache/hudi/blob/master/packaging/bundle-validation/spark-write-hive-sync/validate.scala
   
   # It would be great if in-memory Derby could be supported by Hudi
   
   We would like to use it to simplify our unit tests, which currently use a local PostgreSQL instance as a workaround.
   

