Github user liuqiyun commented on the pull request:

    https://github.com/apache/spark/pull/1242#issuecomment-68199942
  
    @rxin I am confused about the input parameters of GenSort.scala.
    It requires 3 parameters: " [num-parts] [records-per-part] [output-path]".
    If I want to generate and sort 100 GB of data using 4 partitions, is it 
correct to set the parameters to '4, 268435456, /tmp/sort-output'?
    
    It seems each row (record) is 100 bytes, so I computed the number of records 
(rows) as follows:
    100 GB = 107374182400 bytes = 1073741824 rows * 100 bytes/row = 268435456 
rows * 4 partitions * 100 bytes/row
    So each partition should produce 268435456 rows (records), right?
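    The arithmetic above can be sketched in Scala (a minimal check, assuming the 
100-byte record size mentioned above; names are illustrative, not from 
GenSort.scala itself):

    ```scala
    object SortSizeCheck {
      def main(args: Array[String]): Unit = {
        val totalBytes     = 100L * 1024 * 1024 * 1024  // 100 GB = 107374182400 bytes
        val bytesPerRecord = 100L                       // assumed 100-byte records
        val numParts       = 4L

        val totalRecords   = totalBytes / bytesPerRecord  // 1073741824 records
        val recordsPerPart = totalRecords / numParts      // 268435456 records/partition

        println(s"records per partition: $recordsPerPart")
      }
    }
    ```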
    
    However, if I save the output as a sequence file, the total output size is 
only 20 GB. If I save it as a text file instead, the output is 309.2 GB 
(77.3 GB * 4 partitions), not 100 GB. Why?



