_____________________________
From: Gavin Yue <yue.yuany...@gmail.com>
Sent: Saturday, January 9, 2016 14:33
Subject: Re: How to merge two large tables and remove duplicates?
To: Ted Yu <yuzhih...@gmail.com>
Cc: Benyi Wang <bewang.t...@gmail.com>, user <user@spark.apache.org>, ayan guha <guha.a...@gmail.com>

So I tried to set the Parquet compression codec to LZO, but Hadoop does not have the LZO native libraries, while LZ4 is included. And I could not set the codec to LZ4; it only accepts LZO. Any solution here?

Thanks,
Gavin

On Sat, Jan 9, 2016 at 12:09 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:

I saw in the documentation that the value is LZO. Is it LZO or LZ4?

https://github.com/Cyan4973/lz4

Based on this benchmark, they differ quite a lot.

On Fri, Jan 8, 2016 at 9:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:

gzip is relatively slow and consumes a lot of CPU. Snappy is faster. LZ4 is faster than gzip and produces smaller output than Snappy.

Cheers

On Fri, Jan 8, 2016 at 7:56 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

Thank you. And speaking of compression, is there a big difference in performance between gzip and Snappy? And why does Parquet use gzip by default?

Thanks.

On Fri, Jan 8, 2016 at 6:39 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Cycling old bits: http://search-hadoop.com/m/q3RTtRuvrm1CGzBJ

Gavin: Which release of HBase did you play with? HBase has been evolving and is getting more stable.

Cheers

On Fri, Jan 8, 2016 at 6:29 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

I used to maintain an HBase cluster. The experience with it was not a happy one. I just tried querying the data day by day first and deduping with the smaller set, and the performance is acceptable, so I guess I will use this method.

Again, could anyone give advice about this:

Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you need to control the degree of parallelism post-shuffle using "SET spark.sql.shuffle.partitions=[num_tasks];".

Thanks.
Gavin

On Fri, Jan 8, 2016 at 6:25 PM, Ted Yu <yuzhih...@gmail.com> wrote:

bq. in a NoSQL db such as HBase

+1 :-)

On Fri, Jan 8, 2016 at 6:25 PM, ayan guha <guha.a...@gmail.com> wrote:

One option you may want to explore is writing the event table to a NoSQL db such as HBase. One inherent problem in your approach is that you always need to load either the full data set or a defined number of partitions to see whether an event has already arrived (and there is no guarantee it is foolproof, but it leads to unnecessary loading in most cases).

On Sat, Jan 9, 2016 at 12:56 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

Hey,

Thank you for the answer. I checked the settings you mentioned and they are all correct. I noticed that in the job there are always only 200 reducers for the shuffle read; I believe that is set by the SQL shuffle parallelism.

The doc mentions:

Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you need to control the degree of parallelism post-shuffle using "SET spark.sql.shuffle.partitions=[num_tasks];".

What would be the ideal number for this setting? Is it based on the hardware of the cluster?

Thanks,
Gavin
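A minimal sketch of adjusting the setting being asked about here, using the Spark 1.x API and assuming a spark-shell session where sc is already defined; the value 800 is only an illustrative starting point, not a recommendation from the thread:

    // Minimal sketch (Spark 1.x): control the number of post-shuffle partitions.
    // Assumes an existing SparkContext `sc`; the partition count is illustrative only.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Default is 200; here we raise it for a larger shuffle.
    sqlContext.setConf("spark.sql.shuffle.partitions", "800")

    // Equivalent SQL form, as quoted in the thread:
    sqlContext.sql("SET spark.sql.shuffle.partitions=800")

A common rule of thumb is two to three tasks per executor core, sized so that each post-shuffle partition stays modest in size, but the right number depends on the cluster and the data volume.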
On Fri, Jan 8, 2016 at 2:48 PM, Benyi Wang <bewang.t...@gmail.com> wrote:

I assume your Parquet files are compressed. Gzip or Snappy?

What Spark version did you use? It seems to be at least 1.4. If you use spark-sql and Tungsten, you might get better performance, but Spark 1.5.2 gave me a wrong result when the data was about 300~400GB, just for a simple group-by and aggregate.

Did you use Kryo serialization?

You should have spark.shuffle.compress=true; verify it.

How many tasks did you use? spark.default.parallelism=?

What about this:

Read the data day by day.
Compute a bucket id from the timestamp, e.g., the date and hour.
Write into different buckets (you probably need a special writer to write the data efficiently without shuffling it).
Run distinct for each bucket. Because each bucket is small, Spark can get it done faster than having everything in one run.

I think using groupBy(userId, timestamp) might be better than distinct. I guess distinct() compares every field.

On Fri, Jan 8, 2016 at 2:31 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

And the most frequent operation I am going to do is find the UserIDs that have certain events, then retrieve all the events associated with each UserID. In this case, how should I partition to speed up the process?

Thanks.

On Fri, Jan 8, 2016 at 2:29 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

Hey Ted,

The Event table is like this: UserID, EventType, EventKey, TimeStamp, MetaData. I just parse it from JSON and save it as Parquet; I did not change the partitioning.

Annoyingly, every day's incoming Event data contains duplicates. The same event could show up on Day1 and Day2 and probably Day3.

I only want to keep a single Event table, and each day arrives with so many duplicates. Is there a way I could just insert into Parquet and, if a duplicate is found, simply ignore it?

Thanks,
Gavin

On Fri, Jan 8, 2016 at 2:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Is your Parquet data source partitioned by date? Can you dedup within partitions?

Cheers

On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

I tried on three days' data. The total input is only 980GB, but the shuffle write is about 6.2TB, and then the job failed during the shuffle read step, which should be another 6.2TB of shuffle read.

I think that to dedup, the shuffle cannot be avoided. Is there anything I could do to stabilize this process?

Thanks.

On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

Hey,

I have each day's Event table and want to merge them into a single Event table, but there are so many duplicates among each day's data.

I use Parquet as the data source. What I am doing now is:

EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet file")

Each day's Events are stored in their own Parquet file.

But it fails at stage 2, which keeps losing the connection to one executor. I guess this is due to a memory issue. Any suggestion on how to do this efficiently?

Thanks,
Gavin

--
Best Regards,
Ayan Guha
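As a rough illustration of the deduplication discussed in this thread, here is a minimal sketch using the Spark 1.x DataFrame API; the input paths and the choice of key columns (UserID, EventType, EventKey, TimeStamp) are assumptions based on the schema Gavin describes, not code from the thread:

    // Minimal sketch (Spark 1.x DataFrame API), not the exact job from the thread.
    // Assumes an existing SparkContext `sc`; paths and column names are hypothetical.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    val day1 = sqlContext.read.parquet("/events/day1.parquet")
    val day2 = sqlContext.read.parquet("/events/day2.parquet")

    // Dedup on the identifying columns instead of comparing every field with distinct().
    val merged = day1.unionAll(day2)
      .dropDuplicates(Seq("UserID", "EventType", "EventKey", "TimeStamp"))

    merged.write.parquet("/events/merged.parquet")

Deduplicating on the key columns with dropDuplicates avoids comparing the MetaData field the way distinct() would, and writing the output partitioned by date (DataFrameWriter.partitionBy, available from Spark 1.4) would make the within-partition dedup suggested above straightforward.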