Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Felix C
Thanks! It's good to know. --- Original Message --- From: "Eskilson,Aleksander" Sent: June 25, 2015 5:57 AM To: "Felix C", user@spark.apache.org Subject: Re: SparkR parallelize not found with 1.4.1? Hi there, Parallelize is part of the RDD API, which was made private

SparkR parallelize not found with 1.4.1?

2015-06-24 Thread Felix C
Hi, It must be something very straightforward... Not working: parallelize(sc) Error: could not find function "parallelize" Working: df <- createDataFrame(sqlContext, localDF) What did I miss? Thanks

RE: Using Pandas/Scikit Learning in Pyspark

2015-05-09 Thread Felix C
Your Python job runs in a Python process interacting with the JVM. You do need a matching Python version and the other dependent packages on the driver and all worker nodes if you run in YARN mode. --- Original Message --- From: "Bin Wang" Sent: May 8, 2015 9:56 PM To: "Apache.Spark.User" Subject: Usin

Re: Spark streaming with Kafka- couldnt find KafkaUtils

2015-04-07 Thread Felix C
Or you could build an uber jar (you could Google that): https://eradiating.wordpress.com/2015/02/15/getting-spark-streaming-on-kafka-to-work/ --- Original Message --- From: "Akhil Das" Sent: April 4, 2015 11:52 PM To: "Priya Ch" Cc: user@spark.apache.org, "dev" Subject: Re: Spark streaming w
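
For reference, a minimal build.sbt sketch for bundling the Kafka integration into an assembly ("uber") jar; the project name, versions, and the sbt-assembly coordinates below are assumptions to adapt to your own setup:

    // build.sbt (sketch) -- Spark itself is marked "provided" so only the Kafka
    // integration (and its transitive deps) end up inside the assembly jar.
    name := "kafka-streaming-job"
    scalaVersion := "2.10.4"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.3.0"
    )

    // project/plugins.sbt (sketch)
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

Running "sbt assembly" then produces a single jar you can pass to spark-submit, so KafkaUtils is always on the classpath.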

Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Felix C
The spark-csv package can handle the header row, and the code is at the link below. It can also use the header to infer field names in the schema. https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvRelation.scala --- Original Message --- From: "Dean Wam
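
A minimal sketch of reading a CSV with a header through spark-csv (assuming Spark 1.3's generic load() and the spark-csv package on the classpath; "data.csv" is a placeholder path):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc is an existing SparkContext
    // "header" -> "true" tells CsvRelation to take column names from the first row
    val df = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> "data.csv", "header" -> "true"))
    df.printSchema()   // field names come from the header instead of C0, C1, ...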

Re: Running Spark jobs via oozie

2015-03-04 Thread Felix C
We have gotten it to work... --- Original Message --- From: "nitinkak001" Sent: March 3, 2015 7:46 AM To: user@spark.apache.org Subject: Re: Running Spark jobs via oozie I am also starting to work on this one. Did you get any solution to this issue? -- View this message in context: http://a

Re: Executing hive query from Spark code

2015-03-02 Thread Felix C
It should work in CDH without having to recompile. http://eradiating.wordpress.com/2015/02/22/getting-hivecontext-to-work-in-cdh/ --- Original Message --- From: "Ted Yu" Sent: March 2, 2015 1:35 PM To: "nitinkak001" Cc: "user" Subject: Re: Executing hive query from Spark code Here is snippet
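
A minimal sketch of running a Hive query from Spark code (Spark 1.2/1.3-era API; "my_table" is a placeholder, and the Hive configuration is picked up from hive-site.xml on the classpath):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-query"))
    val hiveContext = new HiveContext(sc)
    // Any HiveQL the metastore knows about can be issued here.
    val rows = hiveContext.sql("SELECT key, value FROM my_table LIMIT 10")
    rows.collect().foreach(println)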

Re: Tools to manage workflows on Spark

2015-03-01 Thread Felix C
We use Oozie as well, and it has worked well. The catch is that each Oozie action is separate, so you cannot retain the SparkContext or RDDs, or leverage caching or temp tables, across actions. You could either save output to a file or put all Spark processing into one Oozie action. ---
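
A rough sketch of the "save output to file" handoff between two Oozie Spark actions (the HDFS paths and the toy computation are made up for illustration; each action really runs in its own JVM with its own SparkContext):

    import org.apache.spark.{SparkConf, SparkContext}

    // Action 1: compute something and persist it to HDFS before the action ends.
    val sc1 = new SparkContext(new SparkConf().setAppName("oozie-action-1"))
    sc1.parallelize(1 to 100).map(n => (n % 10, n)).saveAsTextFile("hdfs:///tmp/workflow/stage1")
    sc1.stop()

    // Action 2: a later action cannot see action 1's RDDs or temp tables,
    // so it reads the files back instead.
    val sc2 = new SparkContext(new SparkConf().setAppName("oozie-action-2"))
    val stage1 = sc2.textFile("hdfs:///tmp/workflow/stage1")
    println(stage1.count())
    sc2.stop()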

Re: Write ahead Logs and checkpoint

2015-02-23 Thread Felix C
Kafka 0.8.2 has built-in offset management; how would that affect the direct stream in Spark? Please see KAFKA-1012. --- Original Message --- From: "Tathagata Das" Sent: February 23, 2015 9:53 PM To: "V Dineshkumar" Cc: "user" Subject: Re: Write ahead Logs and checkpoint Exactly, that is the reas
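
For context, a sketch of the direct stream this refers to (Spark 1.3+ Kafka 0.8 integration; the broker address and topic name are placeholders). With this approach Spark tracks offsets itself through checkpointing rather than relying on Kafka/ZooKeeper offset storage, which is why broker-side offset management raises the question above:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoints")   // offsets are recovered from here, not from Kafka

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    stream.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()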

Re: HiveContext in SparkSQL - concurrency issues

2015-02-12 Thread Felix C
Your earlier call stack clearly states that it fails because the Derby metastore has already been started by another instance, so I think that is explained by your attempt to run this concurrently. Are you running Spark standalone? Do you have a cluster? You should be able to run Spark in yarn-

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Felix C
You would probably write to HDFS, or check out https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html You might be able to retrofit it to your use case. --- Original Message --- From: "Su She" Sent: February 11, 2015 10:55 PM To: "Felix C"
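
A small sketch of the streaming k-means idea from that post (MLlib's StreamingKMeans, available since Spark 1.2; the input path and the assumption of 2-dimensional comma-separated vectors are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("streaming-kmeans"), Seconds(5))
    // New files dropped into this directory become training batches.
    val training = ssc.textFileStream("hdfs:///tmp/kmeans-train")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(2, 0.0)   // 2-dimensional data, initial weight 0.0

    model.trainOn(training)             // cluster centers update as data arrives
    model.predictOn(training).print()   // assignments could be written out for visualization
    ssc.start()
    ssc.awaitTermination()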

Re: Can spark job server be used to visualize streaming data?

2015-02-11 Thread Felix C
: "Su She" Sent: February 11, 2015 10:23 AM To: "Felix C" Cc: "Kelvin Chu" <2dot7kel...@gmail.com>, user@spark.apache.org Subject: Re: Can spark job server be used to visualize streaming data? Thank you Felix and Kelvin. I think I'll def be using the k-means

Re: Strongly Typed SQL in Spark

2015-02-11 Thread Felix C
As far as I can tell from my tests, language-integrated query in Spark isn't type safe, i.e. query.where('cost == "foo") would compile and return nothing. If you want type safety, perhaps you want to map the SchemaRDD to an RDD of Product (your type, not scala.Product). --- Original Message --- From: "jay
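
A sketch of that mapping (Spark 1.2-era API, e.g. in spark-shell where sc already exists; the Item case class and column names are made up):

    import org.apache.spark.sql.SQLContext

    case class Item(name: String, cost: Double)

    val sqlContext = new SQLContext(sc)
    val items = sqlContext.createSchemaRDD(sc.parallelize(Seq(Item("a", 1.0), Item("b", 2.0))))
    items.registerTempTable("items")

    // Map each Row back to Item so later code is checked by the compiler.
    val typed = sqlContext.sql("SELECT name, cost FROM items")
      .map(row => Item(row.getString(0), row.getDouble(1)))
    val cheap = typed.filter(_.cost < 1.5)   // a typo like _.cst fails at compile time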

Re: Can spark job server be used to visualize streaming data?

2015-02-10 Thread Felix C
Check out https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html There are links in there to how that is done. --- Original Message --- From: "Kelvin Chu" <2dot7kel...@gmail.com> Sent: February 10, 2015 12:48 PM To: "Su She" Cc: user@spark.apache.org Subject: Re: Can s

RE: Open file limit settings for Spark on Yarn job

2015-02-10 Thread Felix C
Alternatively, is there another way to do it? groupByKey has been called out as expensive and should be avoided (it causes shuffling of data). I've generally found it possible to use reduceByKey instead. --- Original Message --- From: "Arun Luthra" Sent: February 10, 2015 1:16 PM To: user@spark.a
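
A quick sketch of the substitution (assuming an existing SparkContext sc and a toy pair RDD):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey ships every value for a key across the network before summing.
    val grouped = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values map-side first, so far less data is shuffled.
    val reduced = pairs.reduceByKey(_ + _)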

Re: ImportError: No module named pyspark, when running pi.py

2015-02-10 Thread Felix C
Agree. PySpark would call spark-submit. Check out the command line there. --- Original Message --- From: "Mohit Singh" Sent: February 9, 2015 11:26 PM To: "Ashish Kumar" Cc: user@spark.apache.org Subject: Re: ImportError: No module named pyspark, when running pi.py I think you have to run that

Re: Spark Job running on localhost on yarn cluster

2015-02-04 Thread Felix C
Is YARN_CONF_DIR set? --- Original Message --- From: "Aniket Bhatnagar" Sent: February 4, 2015 6:16 AM To: "kundan kumar" , "spark users" Subject: Re: Spark Job running on localhost on yarn cluster Have you set master in SparkConf/SparkContext in your code? Driver logs show in which mode the

RE: schemaRDD.saveAsParquetFile creates large number of small parquet files ...

2015-01-29 Thread Felix C
Try rdd.coalesce(1).saveAsParquetFile(...) http://spark.apache.org/docs/1.2.0/programming-guide.html#transformations --- Original Message --- From: "Manoj Samel" Sent: January 29, 2015 9:28 AM To: user@spark.apache.org Subject: schemaRDD.saveAsParquetFile creates large number of small parquet
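
Spelled out a little (Spark 1.2-era SchemaRDD API, assuming an existing SQLContext named sqlContext; the input and output paths are placeholders). Coalescing to one partition yields a single Parquet file, at the cost of funnelling all data through one writer task:

    val people = sqlContext.jsonFile("hdfs:///tmp/people.json")   // any SchemaRDD works here
    people.coalesce(1).saveAsParquetFile("hdfs:///tmp/people.parquet")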

Re: Using third party libraries in pyspark

2015-01-22 Thread Felix C
Python couldn't find your module. Do you have it on each worker node? You will need to have it on every one. --- Original Message --- From: "Davies Liu" Sent: January 22, 2015 9:12 PM To: "Mohit Singh" Cc: user@spark.apache.org Subject: Re: Using third party libraries in pyspark You need to

Re: Error for first run from iPython Notebook

2015-01-20 Thread Felix C
+1. I can confirm this. It says collect fails in Py4J --- Original Message --- From: "Dave" Sent: January 20, 2015 6:49 AM To: user@spark.apache.org Subject: Re: Error for first run from iPython Notebook Not sure if anyone who can help has seen this. Any suggestions would be appreciated, thanks