longdpt opened a new issue, #8689: URL: https://github.com/apache/seatunnel/issues/8689
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.

### What happened

When importing data from MySQL to HDFS with the Zeta engine, in a fully configured environment with sufficient resources, sinking data in Parquet or TXT format, or sinking to a local file (LocalFile), sustains a write speed of 100,000+ rows per second. With the ORC format, however, the write speed starts at about 15,000 rows per second and gradually drops to about 3,000 rows per second.

**The configuration of the ORC sink and its execution results are as follows:**

```conf
sink {
  HdfsFile {
    fs.defaultFS = "hdfs://mycluster"
    path = "/tmp/hive/warehouse/test2"
    hdfs_site_path = "/data/hadoop-2.7.1/etc/hadoop/hdfs-site.xml"
    file_format_type = "orc"
    compress_codec = "snappy"
    remote_user = "hadoop"
    file_exists_action = "OVERWRITE"
  }
}
```

Statistic Information:

```
***********************************************
           Job Statistic Information
***********************************************
Start Time                : 2025-02-13 16:19:23
End Time                  : 2025-02-13 16:40:15
Total Time(s)             :                1251
Total Read Count          :            11568409
Total Write Count         :            11568409
Total Failed Count        :                   0
***********************************************
```

**The configuration of the Parquet sink and its execution results are as follows:**

```conf
sink {
  HdfsFile {
    fs.defaultFS = "hdfs://mycluster"
    path = "/tmp/hive/warehouse/test2"
    hdfs_site_path = "/data/hadoop-2.7.1/etc/hadoop/hdfs-site.xml"
    file_format_type = "parquet"
    compress_codec = "snappy"
    remote_user = "hadoop"
    file_exists_action = "OVERWRITE"
  }
}
```

Statistic Information:

```
***********************************************
           Job Statistic Information
***********************************************
Start Time                : 2025-02-13 16:42:32
End Time                  : 2025-02-13 16:45:22
Total Time(s)             :                 170
Total Read Count          :            11568987
Total Write Count         :            11568987
Total Failed Count        :                   0
***********************************************
```
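For context, the throughput implied by the two statistic blocks can be computed from the reported row counts and total times (my own arithmetic, not part of the job output):

```python
# Rough throughput comparison derived from the two "Job Statistic
# Information" blocks above: rows written divided by total seconds.
runs = {
    "orc":     {"rows": 11_568_409, "seconds": 1251},
    "parquet": {"rows": 11_568_987, "seconds": 170},
}

for fmt, r in runs.items():
    rate = r["rows"] / r["seconds"]
    print(f"{fmt}: ~{rate:,.0f} rows/s")

# Parquet comes out roughly 7x faster than ORC in this run.
ratio = (runs["parquet"]["rows"] / runs["parquet"]["seconds"]) / (
    runs["orc"]["rows"] / runs["orc"]["seconds"]
)
print(f"parquet/orc speedup: ~{ratio:.1f}x")
```

So on essentially the same ~11.5M-row dataset, the ORC sink averages around 9,000 rows/s end to end versus around 68,000 rows/s for Parquet, which matches the gradual slowdown described above.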
### SeaTunnel Version

2.3.9

### SeaTunnel Config

```conf
env {
  parallelism = 5
  job.mode = "BATCH"
}

source {
  Jdbc {
    url = "jdbc:mysql://10.101.xx.xx:3711/information_schema?serverTimezone=Asia/Shanghai&useUnicode=true&characterEncoding=UTF-8&rewriteBatchedStatements=true"
    driver = "com.mysql.cj.jdbc.Driver"
    connection_check_timeout_sec = 100
    user = "etldb"
    password = "xxxxx"
    query = "select xxx from yyrenting_mall.tb_trade_order t"
    partition_column = "id"
    split.size = 500000
    fetch_size = 20000
  }
}

sink {
  HdfsFile {
    fs.defaultFS = "hdfs://mycluster"
    path = "/tmp/hive/warehouse/test2"
    # path = "/ODS/YYRENTING_MALL/TB_TRADE_ORDER/etl_date=${etl_date}/child=${child}"
    hdfs_site_path = "/data/hadoop-2.7.1/etc/hadoop/hdfs-site.xml"
    # custom_filename = true
    file_format_type = "parquet"
    compress_codec = "snappy"
    remote_user = "hadoop"
    file_exists_action = "OVERWRITE"
  }
}
```

### Running Command

```shell
./bin/seatunnel.sh --config ./job/mysql2hdfs.cnf -n mysql2hdfs
```

### Error Exception

```log
None
```

### Zeta or Flink or Spark Version

_No response_

### Java or Scala Version

_No response_

### Screenshots

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@seatunnel.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org