Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi, I'm trying to save a simple dataframe to S3 in ORC format. The code is as follows: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(1 to 1000).toDF() > df.write.format("orc").save("s3://logs/dummy")

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Jerrick Hoang
Anybody have any suggestions? On Fri, Aug 21, 2015 at 3:14 PM, Jerrick Hoang wrote: > Is there a workaround without updating Hadoop? Would really appreciate it if > someone can explain what Spark is trying to do here and what is an easy way > to turn this off. Thanks all! > > On Fri, Aug 21, 2015 at

How to set environment of worker applications

2015-08-23 Thread Jan Algermissen
Hi, I am starting a spark streaming job in standalone mode with spark-submit. Is there a way to make the UNIX environment variables with which spark-submit is started available to the processes started on the worker nodes? Jan

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
You may have seen this: http://search-hadoop.com/m/q3RTtdSyM52urAyI > On Aug 23, 2015, at 1:01 AM, lostrain A > wrote: > > Hi, > I'm trying to save a simple dataframe to S3 in ORC format. The code is as > follows: > > >> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

Re: Spark Mesos Dispatcher

2015-08-23 Thread bcajes
I'm currently having the same issues. The documentation for Mesos dispatcher is sparse. I'll also add that I'm able to see the framework running in the mesos and spark driver UIs, but when viewing the spark job ui on a slave, no job is seen.

Re: How to set environment of worker applications

2015-08-23 Thread Hemant Bhanawat
Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions in the following article. I think you can use -D to pass system vars: spark.apache.org/docs/latest/configuration.html#runtime-environment Hi, I am starting a spark streaming job in standalone mode with spark-submit. Is t
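A minimal sketch of Hemant's suggestion, assuming an illustrative property name my.app.env: a -D flag passed through spark.executor.extraJavaOptions shows up as a JVM system property inside executor tasks.

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical property name, passed to every executor JVM as a -D flag.
  val conf = new SparkConf()
    .setAppName("extra-java-options-sketch")
    .set("spark.executor.extraJavaOptions", "-Dmy.app.env=staging")
  val sc = new SparkContext(conf)

  // Each task reads the system property from its executor's JVM.
  sc.parallelize(1 to 4)
    .map(_ => sys.props.getOrElse("my.app.env", "unset"))
    .collect()
    .foreach(println)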

Re: Spark Mesos Dispatcher

2015-08-23 Thread Timothy Chen
Hi Bcjaes, Sorry I didn't see the previous thread so not sure what issues you are running into. In cluster mode the driver logs and results are all available through the Mesos UI, you need to look at terminated frameworks if it's a job that's already finished. I'll try to add more docs as we

Re: How to set environment of worker applications

2015-08-23 Thread Raghavendra Pandey
I think the only way to pass on environment variables to worker node is to write it in spark-env.sh file on each worker node. On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat wrote: > Check for spark.driver.extraJavaOptions and > spark.executor.extraJavaOptions in the following article. I think
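A small sketch of the spark-env.sh approach, assuming each worker's conf/spark-env.sh exports an illustrative variable MY_APP_ENV; executor code can then read it from the process environment.

  // Assuming conf/spark-env.sh on every worker contains:
  //   export MY_APP_ENV=staging
  // (sc is an existing SparkContext)
  sc.parallelize(1 to 4)
    .map(_ => sys.env.getOrElse("MY_APP_ENV", "unset"))
    .collect()
    .foreach(println)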

How to parse multiple event types using Kafka

2015-08-23 Thread Spark Enthusiast
Folks, I use the following Streaming API from KafkaUtils: public JavaPairInputDStream<String, String> inputDStream() { HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(","))); HashMap<String, String> kafkaParams = new HashMap<String, String>(); kafkaParams.put(Tokens.KAFKA_BROKER_LIST_TOKEN.getRealTokenName(), brokers);

Re: How to parse multiple event types using Kafka

2015-08-23 Thread Cody Koeninger
Each Spark partition will contain messages only from a single Kafka topic-partition. Use hasOffsetRanges to tell which Kafka topic-partition it's from. See the docs http://spark.apache.org/docs/latest/streaming-kafka-integration.html On Sun, Aug 23, 2015 at 10:56 AM, Spark Enthusiast wrote: > Fo
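A sketch of Cody's pointer, assuming the Spark 1.x direct stream API and placeholder broker/topic names; it tags each record with the Kafka topic it came from by reading the offset ranges of the underlying RDD.

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

  // (sc is an existing SparkContext; broker and topic names are placeholders)
  val ssc = new StreamingContext(sc, Seconds(10))
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val topics = Set("events_a", "events_b")

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  val tagged = stream.transform { rdd =>
    // Spark partition i maps to offsetRanges(i), so the source topic is known per partition.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.mapPartitionsWithIndex { (i, iter) =>
      val topic = offsetRanges(i).topic
      iter.map { case (_, value) => (topic, value) }
    }
  }
  tagged.print()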

Re: Spark streaming multi-tasking during I/O

2015-08-23 Thread Akhil Das
If you set the concurrentJobs flag to 2, then it lets you run two jobs in parallel. It will be a bit hard for you to predict the application behavior with this flag, thus debugging would be a headache. Thanks Best Regards On Sun, Aug 23, 2015 at 10:36 AM, Sateesh Kavuri wrote: > Hi Akhil, > > Think of the
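For reference, a minimal sketch of the flag Akhil mentions; spark.streaming.concurrentJobs is an undocumented setting, so treat the exact name and behavior as an assumption.

  import org.apache.spark.SparkConf

  // Allow two streaming output jobs to run at the same time.
  val conf = new SparkConf()
    .setAppName("concurrent-jobs-sketch")
    .set("spark.streaming.concurrentJobs", "2")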

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi Ted, Thanks for the reply. I tried setting both the key ID and access key via sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***") > sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**") However, the error still occurs for ORC format. If I change the format to JSON, although

Re: How to set environment of worker applications

2015-08-23 Thread Sathish Kumaran Vairavelu
spark-env.sh works for me in Spark 1.4 but not spark.executor.extraJavaOptions. On Sun, Aug 23, 2015 at 11:27 AM Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > I think the only way to pass on environment variables to worker node is to > write it in spark-env.sh file on each worker no

Spark YARN executors are not launching when using +UseG1GC

2015-08-23 Thread unk1102
Hi, I am hitting an issue of long GC pauses in my Spark job, and because of it YARN is killing executors one by one and the Spark job becomes slower and slower. I came across this article where they mentioned using G1GC; I tried to use the same command but something seems wrong: https://databricks.com/
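A hedged sketch of passing G1GC flags to YARN executors via spark.executor.extraJavaOptions; the exact GC flags shown are illustrative, and note that heap size (-Xmx) must not be set there, which is a common reason executors refuse to launch.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // GC flags only; use spark.executor.memory for heap size, never -Xmx here.
    .set("spark.executor.extraJavaOptions",
         "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc -XX:+PrintGCDetails")
    .set("spark.executor.memory", "4g")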

is there a 'knack' to docker and mesos?

2015-08-23 Thread Dick Davies
in-hadoop2.6.tgz' with os::net I0823 19:13:25.608620 3069 fetcher.cpp:135] Downloading 'http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1-bin-hadoop2.6.tgz' to '/var/mesos/slaves/20150823-110659-1862270986-5050-3230-S1/frameworks/20150823-191138-1862270986-5050-3768-/execut

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
In your case, I would specify "fs.s3.awsAccessKeyId" / "fs.s3.awsSecretAccessKey" since you use s3 protocol. On Sun, Aug 23, 2015 at 11:03 AM, lostrain A wrote: > Hi Ted, > Thanks for the reply. I tried setting both of the keyid and accesskey via > > sc.hadoopConfiguration.set("fs.s3n.awsAcces
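A short sketch of Ted's suggestion, mirroring the code from the original post: when the output URI uses the s3:// scheme, set the s3 (not s3n) credential keys on the Hadoop configuration. The key values are placeholders.

  sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
  sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
  import sqlContext.implicits._
  val df = sc.parallelize(1 to 1000).toDF()
  df.write.format("orc").save("s3://logs/dummy")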

B2i Healthcare "Powered by Spark" addition

2015-08-23 Thread Brandon Ulrich
Another addition to the Powered by Spark page: B2i Healthcare (http://b2i.sg) uses Spark in healthcare analytics with medical ontologies like SNOMED CT. Our Snow Owl MQ ( http://b2i.sg/snow-owl-mq) product relies on the Spark ecosystem to analyze ~1 billion health records with over 70 healthcare t

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Ted, Thanks for the suggestions. Actually I tried both s3n and s3 and the result remains the same. On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu wrote: > In your case, I would specify "fs.s3.awsAccessKeyId" / > "fs.s3.awsSecretAccessKey" since you use s3 protocol. > > On Sun, Aug 23, 2015 at 11:03

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Zhan Zhang
If you are using spark-1.4.0, probably it is caused by SPARK-8458. Thanks. Zhan Zhang On Aug 23, 2015, at 12:49 PM, lostrain A wrote: Ted, Thanks for the suggestions. Actually I tried both s3n and s3 an

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi Zhan, Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0 and it looks like this is most likely the reason. I'll verify this again once we make the upgrade. Best, los On Sun, Aug 23, 2015 at 1:25 PM, Zhan Zhang wrote: > If you are using spark-1.4.0, probably it is caused by

Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
SPARK-8458 is in the 1.4.1 release. You can upgrade to 1.4.1 or wait for the upcoming 1.5.0 release. On Sun, Aug 23, 2015 at 2:05 PM, lostrain A wrote: > Hi Zhan, > Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0 and it > looks like this is most likely the reason. I'll verify this

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Michael Armbrust
We should not be actually scanning all of the data of all of the partitions, but we do need to at least list all of the available directories so that we can apply your predicates to the actual values that are present when we are deciding which files need to be read in a given spark job. While this
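A minimal sketch of what partition pruning buys you, assuming a hypothetical table partitioned by a date column and a sqlContext with implicits imported: the filter restricts which partition directories are read, but the directory listing itself still has to happen first.

  val logs = sqlContext.read.parquet("s3://logs/partitioned_table")
  // Only files under date=2015-08-23 should be scanned; other partitions are pruned.
  val oneDay = logs.where($"date" === "2015-08-23")
  oneDay.count()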

Re: SparkSQL concerning materials

2015-08-23 Thread Michael Armbrust
Here's a longer version of that talk that I gave, which goes into more detail on the internals: http://www.slideshare.net/databricks/spark-sql-deep-dive-melbroune On Fri, Aug 21, 2015 at 8:28 AM, Sameer Farooqui wrote: > Have you seen the Spark SQL paper?: > https://people.csail.mit.edu/matei/pa

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Philip Weaver
1 minute to discover 1000s of partitions -- yes, that is what I have observed. And I would assert that is very slow. On Sun, Aug 23, 2015 at 7:16 PM, Michael Armbrust wrote: > We should not be actually scanning all of the data of all of the > partitions, but we do need to at least list all of th

DataFrame rollup with alias?

2015-08-23 Thread Isabelle Phan
Hello, I am new to Spark and just running some tests to get familiar with the APIs. When calling the rollup function on my DataFrame, I get different results when I alias the columns I am grouping on (see below for example data set). I was expecting alias function to only affect the column name.
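To make the question concrete, a small self-contained sketch with made-up data (assuming a sqlContext with implicits imported), comparing rollup on the raw columns with rollup on aliased columns.

  val df = sc.parallelize(Seq(("a", 1, 10), ("a", 2, 20), ("b", 1, 30)))
    .toDF("key", "bucket", "value")

  // Rollup on the columns as-is.
  df.rollup($"key", $"bucket").sum("value").show()

  // Rollup on aliased columns; only the output column names would be expected to change.
  df.rollup($"key".as("k"), $"bucket".as("b")).sum("value").show()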

Re: Spark GraphaX

2015-08-23 Thread Robineast
GraphX is a graph analytics engine rather than a graph database. Its typical use case is running large-scale graph algorithms like PageRank, connected components, label propagation and so on. It can be an element of complex processing pipelines that involve other Spark components such as Data Fra
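A minimal GraphX sketch along those lines, with a placeholder edge-list path: load a graph and run PageRank on it.

  import org.apache.spark.graphx.GraphLoader

  // Edge list file of "srcId dstId" pairs; the path is a placeholder.
  val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt")
  val ranks = graph.pageRank(0.0001).vertices
  ranks.take(10).foreach(println)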

How to remove worker node but let it finish first?

2015-08-23 Thread Romi Kuntsman
Hi, I have a Spark standalone cluster with 100s of applications per day, and it changes size (more or fewer workers) at various hours. The driver runs on a separate machine outside the Spark cluster. When a job is running and its worker is killed (because at that hour the number of workers is redu

Re: Memory-efficient successive calls to repartition()

2015-08-23 Thread Alexis Gillain
Hi Aurelien, The first code should create a new RDD in memory at each iteration (check the web UI). The second code will unpersist the RDD, but that's not the main problem. I think you have trouble due to long lineage, as .cache() keeps track of lineage for recovery. You should have a look at checkpo
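A rough sketch of the checkpointing idea, with placeholder paths and iteration counts: periodically checkpoint the iterated RDD so the lineage kept around by cache() does not grow without bound, and unpersist the previous iteration's cached blocks.

  sc.setCheckpointDir("hdfs:///tmp/checkpoints")

  var rdd = sc.parallelize(1 to 1000000).cache()
  for (i <- 1 to 100) {
    val prev = rdd
    rdd = prev.map(_ + 1).cache()
    if (i % 10 == 0) {
      rdd.checkpoint()   // cut the lineage every few iterations
    }
    rdd.count()          // materialize (and write the checkpoint on those iterations)
    prev.unpersist()     // release the previous iteration's cached blocks
  }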