[ 
https://issues.apache.org/jira/browse/HUDI-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Selvaraj periyasamy updated HUDI-1232:
--------------------------------------
    Description: 
I am using Hudi 0.5.0.  

 

Issue 1) I have 15 different transaction source tables written using 
COPY_ON_WRITE, all of them partitioned by transaction_day and 
transaction_hour. Each of them holds more than 6 months of data. I need to 
join all of them, and a few of the tables need a look-back of up to 6 months 
for the join condition. When I execute the join, I see the attached logs 
rolling for a long time. For example, for a single table it takes more than 
5 minutes just to list the files before the join starts, and the listing 
happens sequentially. When I had to join 6 months of data from two different 
tables, it easily took more than 10 minutes to list the files. I have grepped 
some of the lines and attached them; the file name is log1.txt.
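
For context, my read/join pattern is roughly the following. This is a minimal 
sketch only: the source paths, the join key, and the 6-month cut-off are 
placeholders for illustration, not the actual job. With HoodieROTablePathFilter 
registered (see the setting at the end), plain parquet reads of these 
COPY_ON_WRITE tables return only the latest file slices.

import org.apache.spark.sql.functions.col

// Placeholder paths; each table is laid out as <base>/<transaction_day>/<transaction_hour>
val tableA = sparkSession.read.parquet("hdfs://oprhqanameservice/projects/cdp/data/source_a/*/*")
val tableB = sparkSession.read.parquet("hdfs://oprhqanameservice/projects/cdp/data/source_b/*/*")

// ~6 months of look-back on one side; listing those partitions is what takes minutes
val joined = tableA
  .filter(col("transaction_day") >= "20200201")   // placeholder cut-off
  .join(tableB, Seq("transaction_id"))            // placeholder join key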

 

Issue 2) After joining all 15 tables as above, I am writing the result into 
another Hudi target table called trr. The other slowness I am seeing is that, 
as the number of partitions in the trr table grows, I see the logs below 
rolling for every individual partition even though my write touches only a 
couple of partitions, and it takes up to 4 to 5 minutes. I pasted only a few 
of the lines. I am wondering: in the future I will have 3 years' worth of 
data, and the write will be very slow every time even though I write into 
only a couple of partitions.
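
The write itself is an ordinary Hudi datasource upsert, roughly like the 
sketch below; the record key, precombine field, and partition field names are 
assumptions for illustration, not the actual job configuration.

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig

joined.write
  .format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "trr")
  .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
  .option(RECORDKEY_FIELD_OPT_KEY, "transaction_id")      // assumed record key
  .option(PRECOMBINE_FIELD_OPT_KEY, "transaction_time")   // assumed precombine field
  .option(PARTITIONPATH_FIELD_OPT_KEY, "transaction_day") // assumed; actual table partitions by day/hour
  .mode("append")
  .save("hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr")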

 

20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@fed0a8b
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200714/01, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1 files under hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1, ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@285c67a9
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200714/02, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1 files under hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1, ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@2edd9c8
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200714/03, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=4, FileGroupsCreationTime=1, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1 files under hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/03
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1, ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]

 

 

Could someone provide more insight on how to make these operations faster, 
and keep them scalable as the number of partitions increases?

 

I have the below setting in the Spark session.

 

// Register Hudi's read-optimized path filter so plain parquet reads
// see only the files belonging to the latest commit
sparkSession.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
  classOf[org.apache.hadoop.fs.PathFilter])

 

and I am also passing --conf spark.sql.hive.convertMetastoreParquet=false to 
spark2-submit.
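
For completeness, the same flag can also be set when the session is built; a 
sketch assuming a standard SparkSession builder (the app name is a 
placeholder, and this is not necessarily how the job is actually launched):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("trr-join")                                         // placeholder
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreParquet", "false")   // same effect as the --conf flag
  .getOrCreate()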

 

 

> Caching takes lot of time
> -------------------------
>
>                 Key: HUDI-1232
>                 URL: https://issues.apache.org/jira/browse/HUDI-1232
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Selvaraj periyasamy
>            Priority: Major
>         Attachments: log1.txt
>



