schlichtanders commented on issue #6808: URL: https://github.com/apache/hudi/issues/6808#issuecomment-1308766125
Sorry, I was blocked by higher priorities. The production use case is proper unit tests of Spark Hudi jobs. As a workaround I was able to set up a local PostgreSQL as the Hive metastore, which worked. Switching the very same setup to Derby does not throw an error, but also does not update the Derby database, i.e. Derby does not work. I am using hudi-spark3.2-bundle_2.12:0.12.0.

# Minimal failing example (which works with the Postgres metastore)

Here is the code that was used for both PostgreSQL and in-memory Derby, with the in-memory Derby config enabled in `PYSPARK_SUBMIT_ARGS`.

```python
from pathlib import Path
from types import SimpleNamespace
import os

from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlf

os.environ["SPARK_PRINT_LAUNCH_COMMAND"] = "true"

# postgres is searched on the classpath by hudi, but unfortunately the --packages
# jars are not properly added to the classpath. The --packages jars are always
# downloaded to .ivy2/jars, hence we can reference them directly from there, see
# https://stackoverflow.com/questions/43417216/spark-submit-packages-is-not-working-on-my-cluster-what-could-be-the-reason
postgres = SimpleNamespace(
    domain="org.postgresql",
    package="postgresql",
    version="9.4.1207",
)
spark_packages_jars = (
    Path.home() / ".ivy2" / "jars"
    / f"{postgres.domain}_{postgres.package}-{postgres.version}.jar"
)

os.environ["PYSPARK_SUBMIT_ARGS"] = " ".join([
    # hudi, avro and postgresql
    f"--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.12.0,org.apache.spark:spark-avro_2.12:3.2.2,{postgres.domain}:{postgres.package}:{postgres.version}",
    f"--conf spark.driver.extraClassPath={spark_packages_jars}",
    f"--conf spark.executor.extraClassPath={spark_packages_jars}",
    # hudi config
    # -----------
    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    "--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    # taken from AWS example https://aws.amazon.com/blogs/big-data/part-1-integrate-apache-hudi-delta-lake-apache-iceberg-datasets-at-scale-aws-glue-studio-notebook/
    "--conf spark.sql.hive.convertMetastoreParquet=false",
    # hive metastore config
    # ---------------------
    # SparkSession.builder.enableHiveSupport()
    "--conf spark.sql.catalogImplementation=hive",
    # taken from dbt-spark example
    "--conf spark.hadoop.datanucleus.schema.autoCreateTables=true",
    "--conf spark.hadoop.datanucleus.fixedDatastore=false",
    "--conf spark.hadoop.hive.metastore.schema.verification=false",
    # "--conf spark.hadoop.hive.metastore.schema.verification.record.version=false",
    "--conf spark.driver.userClassPathFirst=true",
    # in-memory derby metastore - does not work
    "--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.EmbeddedDriver",
    "--conf spark.hadoop.javax.jdo.option.ConnectionURL='jdbc:derby:memory:databaseName=metastore_db;create=true'",  # noqa
    # local postgresql metastore - works
    # "--conf spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:postgresql://localhost:5432/metastore?createDatabaseIfNotExist=true",
    # "--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver",
    # "--conf spark.hadoop.javax.jdo.option.ConnectionUserName=test",
    # "--conf spark.hadoop.javax.jdo.option.ConnectionPassword=test",
    # others
    # ------
    "--conf spark.eventLog.enabled=false",
    # necessary last string
    "pyspark-shell",
])

spark = SparkSession.builder.getOrCreate()

dst_database = "default"
spark.sql(f"CREATE DATABASE IF NOT EXISTS {dst_database}")
spark.sql(f"SHOW TABLES FROM `{dst_database}`").show(truncate=False)  # empty

# Dummy data
# ----------
sc = spark.sparkContext
sc.setLogLevel("WARN")
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(
    dataGen.generateInserts(5)
)
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2)).withColumn(
    "part", sqlf.lit("partition")
)

# Write via saveAsTable works
# ---------------------------
df.write.mode("overwrite").saveAsTable("saveastable_table")
spark.sql(f"SHOW TABLES FROM `{dst_database}`").show(truncate=False)
# the saveastable_table shows up

# Write via Hudi does nothing
# ---------------------------
table = "test_hudi_pyspark_local"
path = f"{Path('.').absolute()}/tmp/{table}"
col_id = "uuid"
col_sort = "ts"
col_partition = "part"
hudi_options = {
    "hoodie.table.name": table,
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": col_id,
    "hoodie.datasource.write.partitionpath.field": col_partition,
    "hoodie.datasource.write.table.name": table,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": col_sort,
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
    "path": path,
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": dst_database,
    "hoodie.datasource.hive_sync.table": table,
    "hoodie.datasource.hive_sync.partition_fields": col_partition,
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.index.type": "GLOBAL_BLOOM",
}
df.write.format("org.apache.hudi").options(**hudi_options).mode("overwrite").save()
spark.sql(f"SHOW TABLES FROM {dst_database}").show(truncate=False)
# still only saveastable_table shows up
```

The Hudi save outputs the following:

```
22/11/09 14:18:44 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/11/09 14:18:44 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
22/11/09 14:18:44 WARN DataSourceOptionsHelper$: hoodie.datasource.write.storage.type is deprecated and will be removed in a later release; Please use hoodie.datasource.write.table.type
22/11/09 14:18:45 WARN HoodieBackedTableMetadata: Metadata table was not found at path /path/to/test_hudi_pyspark_local/.hoodie/metadata
22/11/09 14:18:48 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
22/11/09 14:19:05 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
22/11/09 14:19:05 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore [email protected]
22/11/09 14:19:05 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
22/11/09 14:19:07 WARN log: Updating partition stats fast for: test_hudi_pyspark_local
22/11/09 14:19:07 WARN log: Updated size to 438288
```

# Postgres local metastore

For the Postgres metastore I started Postgres via docker:

```
docker run -p 5432:5432 -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=metastore -i postgres:9.6.17-alpine
```

In addition you need to uncomment/comment the following lines within the above definition of `PYSPARK_SUBMIT_ARGS`:

```python
    # in-memory derby metastore - does not work
    # "--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.EmbeddedDriver",
    # "--conf spark.hadoop.javax.jdo.option.ConnectionURL='jdbc:derby:memory:databaseName=metastore_db;create=true'",  # noqa
    # local postgresql metastore - works
    "--conf spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:postgresql://localhost:5432/metastore?createDatabaseIfNotExist=true",
    "--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver",
    "--conf spark.hadoop.javax.jdo.option.ConnectionUserName=test",
    "--conf spark.hadoop.javax.jdo.option.ConnectionPassword=test",
```

Then everything works.

# Links to derby tests

Unfortunately the links no longer work, probably because they refer to master/main, which has changed in the meantime.

> btw, we have a script that we developed recently to test out local derby using a derby client. we developed this flow to test hive sync using spark-bundle. You can find the scripts here https://github.com/apache/hudi/tree/master/packaging/bundle-validation/spark-write-hive-sync
>
> specifically: https://github.com/apache/hudi/blob/master/packaging/bundle-validation/spark-write-hive-sync/Dockerfile https://github.com/apache/hudi/blob/master/packaging/bundle-validation/spark-write-hive-sync/validate.scala

# It would be great if in-memory derby could be supported by hudi

We would like to use it to simplify our unit tests, which currently use a local PostgreSQL as a workaround.
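As a side note for anyone reusing this repro in unit tests: the uncomment/comment switching between the Derby and Postgres blocks can be avoided by assembling `PYSPARK_SUBMIT_ARGS` from structured config. The sketch below is a hypothetical helper (the names `build_submit_args` and `METASTORE_CONFS` are my own, not Spark or Hudi API); it is plain string assembly under the assumption that the conf values need no shell quoting:

```python
# Hypothetical helper: build PYSPARK_SUBMIT_ARGS from a package list and a dict
# of spark confs, so the metastore backend becomes a parameter of the test
# fixture instead of commented-out lines.

def build_submit_args(packages, confs):
    """Assemble a PYSPARK_SUBMIT_ARGS string from packages and conf key/values."""
    parts = [f"--packages {','.join(packages)}"]
    parts += [f"--conf {key}={value}" for key, value in confs.items()]
    parts.append("pyspark-shell")  # required trailing token for pyspark
    return " ".join(parts)


# The two metastore variants from the repro above, as data instead of comments.
METASTORE_CONFS = {
    "derby": {
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.apache.derby.jdbc.EmbeddedDriver",
        "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:derby:memory:databaseName=metastore_db;create=true",
    },
    "postgres": {
        "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:postgresql://localhost:5432/metastore?createDatabaseIfNotExist=true",
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": "test",
        "spark.hadoop.javax.jdo.option.ConnectionPassword": "test",
    },
}
```

A test fixture could then do `os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(hudi_packages, {**common_confs, **METASTORE_CONFS["postgres"]})` before creating the `SparkSession`, keeping the Derby variant one parameter away once it works.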
