Re: parquet vs orc files

2018-02-21 Thread Kane Kim
[...] on the data and the analysis you want to do.

> On 21. Feb 2018, at 21:54, Kane Kim wrote:
>
> Hello,
>
> Which format is better supported in spark, parquet or orc?
> Will spark use internal sorting of parquet/orc files (and how to test that)?
> Can spark save sorted parquet/orc files?
>
> Thanks!

parquet vs orc files

2018-02-21 Thread Kane Kim
Hello,

Which format is better supported in spark, parquet or orc?
Will spark use internal sorting of parquet/orc files (and how to test that)?
Can spark save sorted parquet/orc files?

Thanks!
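
On the sorted-output question, a minimal PySpark DataFrame sketch (paths and the derived "key" column are illustrative assumptions): sort() before writing gives globally sorted output across files, while sortWithinPartitions() keeps each output file internally sorted without a full shuffle.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sorted-parquet").getOrCreate()
    df = spark.range(0, 1000).withColumnRenamed("id", "key")

    # Sort within each partition so every output file is internally sorted,
    # then write as Parquet; ORC works the same way via .orc(...).
    df.sortWithinPartitions("key").write.mode("overwrite").parquet("/tmp/sorted_parquet")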

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
[...]e it look like your time is correct when it is skewed.

cheers

On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim wrote:
>> The thing is that my time is perfectly valid...
>>
>> On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das wrote:
>>
>>> Its with th[...]

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
> telnet s3.amazonaws.com 80
> GET / HTTP/1.0
>
> [image: Inline image 1]
>
> Thanks
> Best Regards
>
> On Wed, Feb 11, 2015 at 6:43 AM, Kane Kim wrote:
>
>> I'm getting this warning when using s3 input:
>> 15/02/11 00:58:37 WARN RestStorageService: [...]

spark, reading from s3

2015-02-10 Thread Kane Kim
I'm getting this warning when using s3 input:

15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 0 seconds. Retrying connection.

After that there are tons of 403/forbidden errors [...]
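
Not part of the original message, but a small Python 3 diagnostic sketch for measuring the skew S3 sees (RequestTimeTooSkewed normally means the gap exceeds roughly 15 minutes), using the Date header that even an anonymous error response carries:

    import urllib.request, urllib.error
    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime

    # Even a 403 response from S3 carries a Date header we can compare against.
    try:
        resp = urllib.request.urlopen("https://s3.amazonaws.com", timeout=10)
        date_header = resp.headers["Date"]
    except urllib.error.HTTPError as e:
        date_header = e.headers["Date"]

    skew = (datetime.now(timezone.utc) - parsedate_to_datetime(date_header)).total_seconds()
    print("clock skew vs S3: %.1f seconds" % skew)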

spark python exception

2015-02-10 Thread Kane Kim
sometimes I'm getting this exception:

Traceback (most recent call last):
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/daemon.py", line 162, in manager
    code = worker(sock)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/daemon.py", line 64, in worker
    outfile.flush()
IOError: [...]

Re: python api and gzip compression

2015-02-09 Thread Kane Kim
Found it - used saveAsHadoopFile.

On Mon, Feb 9, 2015 at 9:11 AM, Kane Kim wrote:
> Hi, how to compress output with gzip using python api?
>
> Thanks!
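
For reference, a hedged sketch of that saveAsHadoopFile route (sample data and output path are illustrative; sc is an existing SparkContext):

    # Gzip-compressed text output via the Hadoop OutputFormat API;
    # saveAsHadoopFile requires a key/value RDD.
    pairs = sc.parallelize([("k1", "v1"), ("k2", "v2")])
    pairs.saveAsHadoopFile(
        "/tmp/out_gz",
        "org.apache.hadoop.mapred.TextOutputFormat",
        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")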

python api and gzip compression

2015-02-09 Thread Kane Kim
Hi, how can I compress output with gzip using the Python API?

Thanks!

Re: spark on ec2

2015-02-05 Thread Kane Kim
> [...] on my integration EC2 cluster and got odd results for stopping the
> workers (no workers found) but the start script... seemed to work. My
> integration cluster was running and functioning after executing both
> scripts, but I also didn't make any changes to spark-env either.

spark on ec2

2015-02-05 Thread Kane Kim
Hi, I'm trying to change a setting as described here:
http://spark.apache.org/docs/1.2.0/ec2-scripts.html

export SPARK_WORKER_CORES=6

Then I ran ~/spark-ec2/copy-dir /root/spark/conf to distribute it to the slaves, but without any effect. Do I have to restart the workers? How do I do that with spark-ec2?

Thanks!
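
If the workers do need a restart, one hedged sketch (assuming the stock spark-ec2 layout, with Spark installed under /root/spark on the master) is to bounce the standalone daemons from the master node:

    /root/spark/sbin/stop-all.sh
    /root/spark/sbin/start-all.sh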

spark driver behind firewall

2015-02-05 Thread Kane Kim
I submit a Spark job from a machine behind a firewall and can't open any incoming connections to that box. Does the driver absolutely need to accept incoming connections? Is there any workaround for that case?

Thanks.
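
The executors do connect back to the driver, so a driver that accepts no inbound connections at all is a problem. Two common workarounds: run in cluster deploy mode so the driver itself lives inside the cluster, or pin the normally random driver-side ports so the firewall can open just those. A minimal sketch with arbitrary example port numbers (property names are the Spark 1.x ones):

    from pyspark import SparkConf, SparkContext

    # Pin the driver-side ports (normally random) so a firewall rule can
    # allow exactly these; the port numbers here are arbitrary examples.
    conf = (SparkConf()
            .setAppName("pinned-ports")
            .set("spark.driver.port", "51000")
            .set("spark.blockManager.port", "51100")
            .set("spark.fileserver.port", "51200"))
    sc = SparkContext(conf=conf)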

Re: pyspark - gzip output compression

2015-02-05 Thread Kane Kim
I'm getting "SequenceFile doesn't work with GzipCodec without native-hadoop code!" Where can I get those libs, and where do I put them in Spark? Also, can I save a plain text file (like saveAsTextFile) as gzip?

Thanks.

On Wed, Feb 4, 2015 at 11:10 PM, Kane Kim wrote:
> How to save [...]
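
On the plain-text part of the question: PySpark's saveAsTextFile accepts a codec class directly, and for text output Hadoop's GzipCodec falls back to java.util.zip, so no native libraries are needed. A minimal sketch with an illustrative path:

    rdd = sc.parallelize(["line one", "line two"])
    # Writes the part files as .gz; plain text with GzipCodec does not
    # require the native-hadoop libraries that SequenceFiles complain about.
    rdd.saveAsTextFile(
        "/tmp/out_text_gz",
        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")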

pyspark - gzip output compression

2015-02-04 Thread Kane Kim
How can I save an RDD with gzip compression?

Thanks.

processing large dataset

2015-01-22 Thread Kane Kim
I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. I spent the whole day today trying to get it processed, but never succeeded. I've tried deploying to EC2 with the script provided with Spark, on pretty beefy machines (100 r3.2xlarge nodes). Really frustrated tha[...]
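
Not from the thread, but one knob that commonly matters at this scale: spark.default.parallelism sets how many partitions reduceByKey uses when none is given explicitly, and the default is often far too low for 5TB. A sketch with an illustrative value, not a recommendation:

    from pyspark import SparkConf, SparkContext

    # Raise the default shuffle parallelism so each reduce task handles a
    # manageable slice of the data; 4096 here is an illustrative value.
    conf = SparkConf().set("spark.default.parallelism", "4096")
    sc = SparkContext(conf=conf)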

reducing number of output files

2015-01-22 Thread Kane Kim
How can I reduce the number of output files? Is there a parameter to saveAsTextFile?

Thanks.
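
There is no such parameter on saveAsTextFile itself: the save writes one part file per partition. A common sketch, assuming an existing RDD named rdd and an illustrative target of 10 files:

    # Shrink the partition count before saving; coalesce avoids a full
    # shuffle, so it is cheaper than repartition for reducing file count.
    rdd.coalesce(10).saveAsTextFile("/tmp/out_few_files")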

Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Kane Kim
I'm trying to process a large dataset. Mapping/filtering works OK, but as soon as I try to reduceByKey, I get out of memory errors: http://pastebin.com/70M5d0Bn

Any ideas how I can fix that? Thanks.
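
A hedged sketch of the usual first fix, assuming a pair RDD named pairs: give reduceByKey an explicit, larger partition count so each task's shuffle state stays small (4096 is illustrative):

    from operator import add

    # More reduce partitions means less per-task state during the shuffle,
    # which is what typically blows the heap in reduceByKey.
    counts = pairs.reduceByKey(add, numPartitions=4096)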

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Kane Kim
Related question - is the execution of different stages optimized? I.e., will a map followed by a filter require two loops, or will they be combined into a single one?

On Tue, Jan 20, 2015 at 4:33 AM, Bob Tiernay wrote:
> I found the following to be a good discussion of the same topic:
>
> http://apache-spar[...]
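
For the record: map and filter are narrow transformations, so they land in the same stage and are pipelined into a single pass over each partition. A small sketch that makes this visible via toDebugString:

    rdd = sc.parallelize(range(100))
    result = rdd.map(lambda x: x * 2).filter(lambda x: x > 50)

    # Both transformations appear inside one stage: no second loop.
    debug = result.toDebugString()  # bytes in some PySpark versions
    print(debug.decode("utf-8") if isinstance(debug, bytes) else debug)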

spark java options

2015-01-16 Thread Kane Kim
I want to add some Java options when submitting an application:

--conf "spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder"

But it looks like it doesn't get set. Where can I add it to make it work?

Thanks.
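
One thing that commonly bites here: spark.executor.extraJavaOptions must be in place before the executors launch, so it belongs in conf/spark-defaults.conf or in the SparkConf used to create the context, not on an already-running application; the driver's own JVM takes spark-submit's --driver-java-options instead. A minimal PySpark sketch:

    from pyspark import SparkConf, SparkContext

    # Must be set before the SparkContext (and hence the executors) starts.
    conf = SparkConf().set(
        "spark.executor.extraJavaOptions",
        "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder")
    sc = SparkContext(conf=conf)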