nsivabalan opened a new issue #4889: URL: https://github.com/apache/hudi/issues/4889
**Describe the problem you faced** Ran a deltastreamer job w/ inline clustering and metadata enabled w/ latest master. Clustering failed towards completion when building col stats index for metadata table. this is actually an integ test suite job that failed. property file: https://github.com/apache/hudi/pull/4884/files#diff-63e53dde6161fec3341cf85817ff0d8f2b0d33d75803da8818afaf3bb544722e yaml file: https://github.com/apache/hudi/blob/master/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions.yaml It is strange, given that I did not enable col stats and just left it to default settings. A clear and concise description of what you expected to happen. **Environment Description** * Hudi version : 0.11-snapshot * Spark version : 2.4.5 * Hive version : * Hadoop version : 2.7 * Storage (HDFS/S3/GCS..) : * Running on Docker? (yes/no) : docker **Additional context** Add any other context about the problem here. **Stacktrace** ``` . . 03:47 WARN: Timeline-server-based markers are configured as the marker type but embedded timeline server is not enabled. Falling back to direct markers. 22/02/23 16:22:11 ERROR DagScheduler: Exception executing node org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842 at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:394) at org.apache.hudi.client.SparkRDDWriteClient.completeTableService(SparkRDDWriteClient.java:473) at org.apache.hudi.client.SparkRDDWriteClient.cluster(SparkRDDWriteClient.java:360) at org.apache.hudi.client.BaseHoodieWriteClient.lambda$inlineClustering$15(BaseHoodieWriteClient.java:1196) at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) at org.apache.hudi.client.BaseHoodieWriteClient.inlineClustering(BaseHoodieWriteClient.java:1194) at org.apache.hudi.client.BaseHoodieWriteClient.runTableServicesInline(BaseHoodieWriteClient.java:502) at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:124) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:74) at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:173) at org.apache.hudi.integ.testsuite.HoodieTestSuiteWriter.commit(HoodieTestSuiteWriter.java:267) at org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.execute(InsertNode.java:54) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185) at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.hudi.index.columnstats.ColumnStatsIndexHelper.updateColumnStatsIndexFor(ColumnStatsIndexHelper.java:315) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateColumnsStatsIndex(HoodieSparkCopyOnWriteTable.java:219) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateMetadataIndexes(HoodieSparkCopyOnWriteTable.java:177) at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:386) ... 19 more 22/02/23 16:22:11 INFO DagScheduler: Forcing shutdown of executor service, this might kill running tasks 22/02/23 16:22:11 ERROR HoodieTestSuiteJob: Failed to run Test Suite java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842 at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:206) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.execute(DagScheduler.java:113) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.schedule(DagScheduler.java:68) at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:203) at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:845) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$anon$2.doSubmit(SparkSubmit.scala:920) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842 at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:146) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842 at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:394) at org.apache.hudi.client.SparkRDDWriteClient.completeTableService(SparkRDDWriteClient.java:473) at org.apache.hudi.client.SparkRDDWriteClient.cluster(SparkRDDWriteClient.java:360) at org.apache.hudi.client.BaseHoodieWriteClient.lambda$inlineClustering$15(BaseHoodieWriteClient.java:1196) at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) at org.apache.hudi.client.BaseHoodieWriteClient.inlineClustering(BaseHoodieWriteClient.java:1194) at org.apache.hudi.client.BaseHoodieWriteClient.runTableServicesInline(BaseHoodieWriteClient.java:502) at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:124) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:74) at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:173) at org.apache.hudi.integ.testsuite.HoodieTestSuiteWriter.commit(HoodieTestSuiteWriter.java:267) at org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.execute(InsertNode.java:54) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139) ... 6 more Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185) at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.hudi.index.columnstats.ColumnStatsIndexHelper.updateColumnStatsIndexFor(ColumnStatsIndexHelper.java:315) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateColumnsStatsIndex(HoodieSparkCopyOnWriteTable.java:219) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateMetadataIndexes(HoodieSparkCopyOnWriteTable.java:177) at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:386) ... 19 more Exception in thread "main" org.apache.hudi.exception.HoodieException: Failed to run Test Suite at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:208) at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:845) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$anon$2.doSubmit(SparkSubmit.scala:920) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842 at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:206) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.execute(DagScheduler.java:113) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.schedule(DagScheduler.java:68) at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:203) ... 13 more Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842 at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:146) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842 at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:394) at org.apache.hudi.client.SparkRDDWriteClient.completeTableService(SparkRDDWriteClient.java:473) at org.apache.hudi.client.SparkRDDWriteClient.cluster(SparkRDDWriteClient.java:360) at org.apache.hudi.client.BaseHoodieWriteClient.lambda$inlineClustering$15(BaseHoodieWriteClient.java:1196) at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) at org.apache.hudi.client.BaseHoodieWriteClient.inlineClustering(BaseHoodieWriteClient.java:1194) at org.apache.hudi.client.BaseHoodieWriteClient.runTableServicesInline(BaseHoodieWriteClient.java:502) at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:124) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:74) at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:173) at org.apache.hudi.integ.testsuite.HoodieTestSuiteWriter.commit(HoodieTestSuiteWriter.java:267) at org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.execute(InsertNode.java:54) at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139) ... 6 more Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185) at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.hudi.index.columnstats.ColumnStatsIndexHelper.updateColumnStatsIndexFor(ColumnStatsIndexHelper.java:315) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateColumnsStatsIndex(HoodieSparkCopyOnWriteTable.java:219) at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateMetadataIndexes(HoodieSparkCopyOnWriteTable.java:177) at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:386) ... 19 more ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
