Yeah partition level statistics are good.I see hive orc reader is reading rows from s3 for each file in the hive server log.Will it be referring to orc metadata or it will be loading the whole file and then counting the rows.Is there any place to cache this information so that I don't need to scan all the files every time.
On Sun, Aug 26, 2018, 4:24 PM Jörn Franke <jornfra...@gmail.com> wrote: > You can partition it and only compute statistics for new partitions... > > On 26. Aug 2018, at 12:43, Prabhakar Reddy <prabha.cl...@gmail.com> wrote: > > Hello, > > Are there any properties that I can set to improve the performance of > Analyze table compute statistics statement.My data sits in s3 and I see > it's taking one second per file to read the schema of each file from s3. > > 2018-08-24T03:25:57,525 INFO [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 > main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC > rows from s3://file_1 > 2018-08-24T03:25:57,526 INFO [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 > main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:<init>(187)) - > Reader schema not provided -- using file schema > > 2018-08-24T03:25:58,395 INFO [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 > main([])]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(79)) - Reading ORC > rows from s3://file_2 > 2018-08-24T03:25:58,395 INFO [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 > main([])]: impl.RecordReaderImpl (RecordReaderImpl.java:<init>(187)) - > Reader schema not provided -- using file schema > > It takes around 80 seconds for 76 files with total size of 23 GB. > > > 2018-08-24T03:27:07,673 INFO [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 > main([])]: exec.Task (SessionState.java:printInfo(1111)) - Table > dept_data_services.lc_credit_2018_08_20_temp stats: [numFiles=76, > numRows=101341845, totalSize=26500166568, rawDataSize=294491898741] > 2018-08-24T03:27:07,673 INFO [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 > main([])]: ql.Driver (Driver.java:execute(2050)) - Completed executing > command(queryId=lcapp_20180824032545_aba21e71-ea4b-4214-8793-705a5e0367f0); > Time taken: 81.169 seconds > 2018-08-24T03:27:07,674 INFO [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 > main([])]: ql.Driver (SessionState.java:printInfo(1111)) - OK > 2018-08-24T03:27:07,681 INFO [2b2c0a06-7da5-4fcd-83a7-4931b8e1b4b1 > main([])]: CliDriver (SessionState.java:printInfo(1111)) - Time taken: > 81.992 seconds > > If I run the same command with few columns then the query runs 60% > faster.Is there any property that I can modify to reduce the time taken for > this read? > > Regards > Prabhakar Reddy > > > > >