nsivabalan commented on issue #4784:
URL: https://github.com/apache/hudi/issues/4784#issuecomment-1036330505


   I could not reproduce the partitioning issue you are facing.
   
   Local spark-shell repro:
   ```
   
   import java.sql.Timestamp
   import spark.implicits._
   
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   
   
   val df1 = Seq(
           ("row1", 1, "part1" ,1578283932000L ),
           ("row2", 1, "part1", 1578283942000L)
         ).toDF("row", "ppath", "preComb","eventTime")
   
   
   df1.write.format("hudi").
     options(getQuickstartWriteConfigs).
     option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
     option(RECORDKEY_FIELD_OPT_KEY, "row").
     option(PARTITIONPATH_FIELD_OPT_KEY, "preComb:simple,ppath:timestamp").
     option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator").
     option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS").
     option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy-MM-dd").
     option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
     option(TABLE_NAME, "timestamp_tbl4").
     mode(Overwrite).
     save("/tmp/hudi_timestamp_tbl4")
   
   
   val hudiDF4 = spark.read.format("hudi").load("/tmp/hudi_timestamp_tbl4")
   hudiDF4.createOrReplaceTempView("tbl4")
   spark.sql("describe tbl4").show()
   spark.sql("select * from tbl4 limit 3").show()
   
   ```
   
   Output
   ```
   spark.sql("select * from tbl4 limit 3").show()

   +-------------------+--------------------+------------------+----------------------+--------------------+----+-------------+-------+-----+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| row|    eventTime|preComb|ppath|
   +-------------------+--------------------+------------------+----------------------+--------------------+----+-------------+-------+-----+
   |  20220211102107283|20220211102107283...|              row1|      part1/1970-01-01|dfc23d4b-8177-4fa...|row1|1578283932000|  part1|    0|
   |  20220211102107283|20220211102107283...|              row2|      part1/1970-01-01|dfc23d4b-8177-4fa...|row2|1578283942000|  part1|    0|
   +-------------------+--------------------+------------------+----------------------+--------------------+----+-------------+-------+-----+
   ```
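
   For a quick sanity check on the same table, you can also just look at the distinct partition paths that were written (same path as the write above):
   ```
   // List the distinct partition paths Hudi generated for this table.
   spark.read.format("hudi").
     load("/tmp/hudi_timestamp_tbl4").
     select("_hoodie_partition_path").
     distinct().
     show(false)
   ```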
   
   Specifically, the values for _hoodie_partition_path are part1/1970-01-01: "part1" comes from the preComb:simple component and "1970-01-01" from the ppath:timestamp component (see the sketch below).
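
   Here is a minimal sketch of why the timestamp component resolves to 1970-01-01. This is just my illustration of what the timestamp-based key generator effectively does with the EPOCHMILLISECONDS type, yyyy-MM-dd output format and GMT+8:00 timezone configured above (not the actual Hudi code path):
   ```
   import java.text.SimpleDateFormat
   import java.util.TimeZone

   // Format a tiny epoch-millis value (like the ppath column above) using the
   // configured output dateformat and timezone.
   val fmt = new SimpleDateFormat("yyyy-MM-dd")
   fmt.setTimeZone(TimeZone.getTimeZone("GMT+8:00"))
   fmt.format(new java.util.Date(1L))   // -> "1970-01-01"
   ```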
   
   2: If you disable hive-style partitioning, you will not see the "fieldname=" prefix. But if you want to keep it enabled, I don't think Hudi allows changing the field name used for partition paths (see the sketch after point 3).
   3: I am not sure how to leverage partition pruning for custom key gen based tables. @xushiyan @YannByron @bhasudha: do you folks have any pointers here?
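
   For point 2, here is roughly what the same write would look like with hive-style partitioning enabled, just adding hoodie.datasource.write.hive_style_partitioning to the repro above. I'd expect the partition path to then come out as preComb=part1/ppath=1970-01-01 (I have not re-verified that here), with the field names taken from the configured partition path fields:
   ```
   // Same repro write as above, with hive-style partitioning turned on.
   df1.write.format("hudi").
     options(getQuickstartWriteConfigs).
     option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
     option(RECORDKEY_FIELD_OPT_KEY, "row").
     option(PARTITIONPATH_FIELD_OPT_KEY, "preComb:simple,ppath:timestamp").
     option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator").
     option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS").
     option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy-MM-dd").
     option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
     option("hoodie.datasource.write.hive_style_partitioning", "true").
     option(TABLE_NAME, "timestamp_tbl4").
     mode(Overwrite).
     save("/tmp/hudi_timestamp_tbl4")
   ```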
   
   
   
   
   

