ad 1) My files are not bigger than the block size. To be precise, all data from one day is up to 2 GB gzipped, ~15 GB unzipped. It is split within one folder into files smaller than the block size (the block size in my case is 128 MB; the files are ~100 MB). I can transform them to another format if you think it will speed up my queries, but as I understand the documentation, this structure should behave the same as splittable files. If I add one more partition dimension, code, so that I partition by date and code, single files are only 3-10 MB, because I have 3000 unique codes.
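For example, if converting the format is the way to go, is something like this the right idea? A sketch only - ORC is just one columnar option, and the table name json_table_orc plus the STRING-typed data column are placeholders, not my real schema:

    -- Sketch: copy the external JSON table into an ORC table,
    -- keeping the existing date partitioning. Names/types are placeholders.
    CREATE TABLE json_table_orc (
      code INT,
      customerId STRING,
      data STRING
    )
    PARTITIONED BY (partitionDate STRING)
    STORED AS ORC;

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE json_table_orc PARTITION (partitionDate)
    SELECT code, customerId, data, partitionDate
    FROM json_table;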
ad 2) I read your code and I use the same serde, only a newer version. Should I check anything more?

ad 4) Most queries will be about the last 3-6 months.

ad 3) My example queries are:

a) Code heatmap - how many users used some "code":

    SELECT A.code,
           A.NumberOfDistinctCustomers,
           B.NumberOfAllDistinctCustomers,
           A.NumberOfEntries,
           B.NumberOfAllEntries
    FROM (
      SELECT code,
             COUNT(DISTINCT customerId) AS NumberOfDistinctCustomers,
             COUNT(*) AS NumberOfEntries
      FROM json_table
      WHERE 1=1
        AND code IN (1,3,54,300,222,111)
        AND partitionDate > '20140922'
        AND partitionDate <= '20141022'
      GROUP BY code
    ) A
    JOIN (
      SELECT COUNT(DISTINCT customerId) AS NumberOfAllDistinctCustomers,
             COUNT(*) AS NumberOfAllEntries
      FROM json_table
      WHERE 1=1
        AND code IN (1,3,54,300,222,111)
        AND partitionDate > '20140922'
        AND partitionDate <= '20141022'
    ) B ON (1=1);

b) Select users who use some code but have some special attributes in a second table:

    SELECT json_table.customerId
    FROM json_table
    JOIN attribute_table
      ON (json_table.customerId = attribute_table.customerId)
     AND attribute_table.attribute_X1 = '1'
     AND attribute_table.attribute_X2 = '1'
     AND json_table.code IN (1,3,54,300,222,111);

c) A mix of the above.

On Tue, Oct 21, 2014 at 11:12 PM, Sanjay Subramanian <
sanjaysubraman...@yahoo.com> wrote:

> 1. The gzip files are not splittable, so gzip itself will make the
> queries slower.
>
> 2. As a reference for JSON serdes, here is an example from my blog:
> http://bigdatalatte.wordpress.com/2014/08/21/denormalizing-json-arrays-in-hive/
>
> 3. Need to see your query first to try and optimize it.
>
> 4. Even if you have datewise partitions and you have 5 years of data,
> i.e. about 1825 partitions:
> -- Trying to do a SELECT COUNT(*) without a WHERE clause might make Hive
> crawl.
>
> ------------------------------
> *From:* Ja Sam <ptrstp...@gmail.com>
> *To:* user@hive.apache.org
> *Sent:* Tuesday, October 21, 2014 10:37 AM
> *Subject:* Optimize hive external tables with serde
>
> *Part 1: my environment*
> I have the following files uploaded to Hadoop:
>
> 1. They are plain text.
> 2. Each line contains JSON like:
>
> {code:[int], customerId:[string], data:{[something more here]}}
>
> 1. code: numbers from 1 to 3000
> 2. customerId: up to 4 million in total, up to 0.5 million daily
> 3. All files are gzipped.
> 4. In Hive I created an external table with a custom JSON serde
> (let's call it CUSTOMER_DATA).
> 5. All files from each date are stored in a separate directory - and
> I use these directories as partitions in Hive tables.
>
> Most queries I do filter by date, code and customerId. I also have a
> second file (let's call it CUSTOMER_ATTRIBUTES) with the format:
> [customerId] [attribute_1] [attribute_2] ... [attribute_n], which
> contains data for all my customers, so it has up to 4 million rows.
> I query and filter my data in the following ways:
>
> 1. Filtering by date - partitions do the job here, using WHERE
> partitionDate IN (20141020,20141020)
> 2. Filtering by code, using a statement like WHERE code IN
> (1,4,5,33,6784)
> 3. Joining table CUSTOMER_ATTRIBUTES with CUSTOMER_DATA, with a
> condition query like SELECT customerId FROM CUSTOMER_DATA JOIN
> CUSTOMER_ATTRIBUTES ON
> (CUSTOMER_ATTRIBUTES.customerId=CUSTOMER_DATA.customerId) WHERE
> CUSTOMER_ATTRIBUTES.attribute_1=[something]
>
> *Part 2: question*
> Is there any efficient way to optimize my queries? I read about indexes
> and buckets, but I don't know if I can use them with external tables, or
> whether they will optimize my queries.
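PS Regarding the buckets from my original question quoted above: would something like this be the right direction for the customerId join? A sketch only - the bucket count, ORC storage and the bucket-map-join settings are my guesses, and it assumes managed copies of both tables rather than the external ones:

    -- Managed, bucketed copies of both tables, clustered on the join key.
    SET hive.enforce.bucketing=true;

    CREATE TABLE customer_data_bucketed (
      code INT,
      customerId STRING,
      data STRING
    )
    PARTITIONED BY (partitionDate STRING)
    CLUSTERED BY (customerId) INTO 64 BUCKETS
    STORED AS ORC;

    CREATE TABLE customer_attributes_bucketed (
      customerId STRING,
      attribute_1 STRING,
      attribute_2 STRING
    )
    CLUSTERED BY (customerId) INTO 64 BUCKETS
    STORED AS ORC;

    -- Then, at query time, ask Hive to consider a bucket map join:
    SET hive.optimize.bucketmapjoin=true;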