Re: Random pairs / RDD order

2015-04-19 Thread Aurélien Bellet
Hi Imran, Thanks for the suggestion! Unfortunately the type does not match. But I could write my own function that shuffles the sample, though. On 4/17/15 9:34 PM, Imran Rashid wrote: if you can store the entire sample for one partition in memory, I think you just want: val sample1 = rdd.sa
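A minimal sketch of the kind of per-partition shuffle Aurélien describes, assuming each partition's sample fits in memory (the RDD and fraction below are placeholders):

    import scala.util.Random

    val sampled = sc.parallelize(1 to 100000).sample(withReplacement = false, fraction = 0.1)
    // Shuffle each partition independently; toVector materializes the partition in memory.
    val shuffled = sampled.mapPartitions(
      it => Random.shuffle(it.toVector).iterator,
      preservesPartitioning = true)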

Re: spark application was submitted twice unexpectedly

2015-04-19 Thread Pengcheng Liu
Looking into the work folder of the problematic application, it seems the application keeps creating executors. The worker's error log is as below: Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs at org.apache.hadoop.security.UserG

Re: Can a map function return null

2015-04-19 Thread Evo Eftimov
I am on the move at the moment so I can't try it immediately, but from previous memory/experience I think if you return plain null you will get a Spark exception. Anyway, you can try it, see what happens, and then ask the question. If you do get an exception, try Optional instead of plain null. Se

Re: Spark Cassandra Connector

2015-04-19 Thread Ted Yu
1.2.0-rc3 can be found here: http://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10 Can you use Maven to build your project? Cheers > On Apr 18, 2015, at 9:02 PM, DStrip wrote: > > Hello, > > I am facing some difficulties on installing the Cassandra Spark connect
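For reference, the sbt equivalent of those Maven coordinates would be the line below; the thread itself is about Maven, so treat this as an illustration (assuming scalaVersion 2.10):

    // build.sbt; %% appends the Scala binary version (_2.10) to the artifact name
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0-rc3"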

Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Using Spark 1.2.0. Tried to register an RDD and got: scala.MatchError: class java.util.Date (of class java.lang.Class) I see it was resolved in https://issues.apache.org/jira/browse/SPARK-2562 (included in 1.2.0) Has anyone encountered this issue? Thanks, Lior

Re: Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Here's a code example: public class DateSparkSQLExample { public static void main(String[] args) { SparkConf conf = new SparkConf().setAppName("test").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); List itemsList = Lists.newArrayListWithCapacity
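A common workaround, sketched here in Scala under the assumption that the bean holds a java.util.Date: Spark SQL maps java.sql.Timestamp (and java.sql.Date) to SQL types, but not java.util.Date, so convert before registering. The case class below is hypothetical:

    import java.sql.Timestamp

    case class Item(name: String, createdAt: Timestamp) // hypothetical schema

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD // Spark 1.2 implicit; in 1.3 use .toDF()
    val items = sc.parallelize(Seq(Item("a", new Timestamp(System.currentTimeMillis))))
    items.registerTempTable("items")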

Re: Dataframes Question

2015-04-19 Thread Ted Yu
bq. SchemaRDD is not existing in 1.3? That's right. See this thread for more background: http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemardd&subj=renaming+SchemaRDD+gt+DataFrame On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh < abhis...@tetrationanalytics.com> wrote: > I am no

Re: spark with kafka

2015-04-19 Thread Cody Koeninger
Take a look at https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md if you haven't already. If you're fine with saving offsets yourself, I'd stick with KafkaRDD, as Koert said. I haven't tried 2-hour streaming batch durations, so I can't vouch for using createDirectStream in that
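A sketch of the "save offsets yourself" approach Cody mentions, using the Spark 1.3 batch API; the broker, topic, and offsets below are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    // Offsets loaded from your own store; reads up to (but excluding) untilOffset.
    val ranges = Array(OffsetRange("events", partition = 0, fromOffset = 0L, untilOffset = 1000L))
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)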

Re: Dataframes Question

2015-04-19 Thread Arun Patel
Thanks Ted. So whatever operations I am performing now are on DataFrames and not SchemaRDDs, is that right? Regards, Venkat On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu wrote: > bq. SchemaRDD is not existing in 1.3? > > That's right. > > See this thread for more background: > > http://search-hado

Re: Dataframes Question

2015-04-19 Thread Ted Yu
That's right. On Sun, Apr 19, 2015 at 8:59 AM, Arun Patel wrote: > Thanks Ted. > > So whatever operations I am performing now are on DataFrames and not > SchemaRDDs, is that right? > > Regards, > Venkat > > On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu wrote: > >> bq. SchemaRDD is not existing in 1

Re: MLlib -Collaborative Filtering

2015-04-19 Thread Christian S. Perone
The easiest way to do that is to use a similarity metric between the different user factors. On Sat, Apr 18, 2015 at 7:49 AM, riginos wrote: > Is there any way that I can see the similarity table of 2 users in that > algorithm? By that I mean the similarity between 2 users > > > > -- > View this

Aggregation by column and generating a json

2015-04-19 Thread dsub
I am exploring Spark SQL and DataFrames and trying to create an aggregation by column and generate a single JSON row with the aggregation. Any input on the right approach will be helpful. Here is my sample data user,sports,major,league,count [test1,Sports,Switzerland,NLA,6] [test1,Football,Australi
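One possible approach in Spark 1.3, sketched under the assumption that the data above is already loaded as a DataFrame df with those column names: groupBy/agg does the per-column aggregation, and toJSON renders each result row as a JSON string.

    import org.apache.spark.sql.functions.sum

    val aggregated = df.groupBy("user").agg(sum("count").as("total"))
    aggregated.toJSON.collect().foreach(println) // one JSON document per group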

Re: Can a map function return null

2015-04-19 Thread Steve Lewis
So you imagine something like this: JavaRDD<String> words = ... JavaRDD<Optional<String>> wordsFiltered = words.map(new Function<String, Optional<String>>() { @Override public Optional<String> call(String s) throws Exception { if ((s.length()) % 2 == 1) // drop strings of odd length return Optional.empty();

Skipped Jobs

2015-04-19 Thread James King
In the web UI I can see some jobs marked as 'skipped'. What does that mean? Why are these jobs skipped? Do they ever get executed? Regards jk

RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
Well, you can do another map to turn Optional into String: in the cases where the Optional is empty you can store e.g. "NULL" as the value of the RDD element. If this is not acceptable (based on the objectives of your architecture), and IF returning plain null instead of Optional does throw

RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
In fact you can return “NULL” from your initial map and hence not resort to Optional at all From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Sunday, April 19, 2015 9:48 PM To: 'Steve Lewis' Cc: 'Olivier Girardot'; 'user@spark.apache.org' Subject: RE: Can a map function return null
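The thread is about the Java API, but the same idea reads cleanly in Scala: emit an Option from the transformation and flatten it, so dropped records never become nulls (or sentinel "NULL" strings) at all. A minimal sketch:

    val words = sc.parallelize(Seq("ab", "abc", "abcd"))
    val evenLength = words.flatMap { s =>
      if (s.length % 2 == 1) None else Some(s) // drop strings of odd length
    }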

how to make a spark cluster ?

2015-04-19 Thread hnahak
Hi All, I've got a big physical machine with 16 CPUs, 256 GB RAM, and a 20 TB hard disk. I just need to know the best way to make a Spark cluster. If I need to process TBs of data, then: 1. Only one machine, which contains driver, executor, job tracker and task tracker, everything. 2. crea

Re: Skipped Jobs

2015-04-19 Thread Denny Lee
The job is skipped because the results are available in memory from a prior run. More info at: http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3ccakx7bf-u+jc6q_zm7gtsj1mihagd_4up4qxpd9jfdjrfjax...@mail.gmail.com%3E. HTH! On Sun, Apr 19, 2015 at 1:43 PM James King wrote: > In th

Data frames in GraphX

2015-04-19 Thread hnahak
To Spark-admin, I like the data frames in the 1.3 release. Is there any plan to integrate them with GraphX in 1.4 or later? Currently I keep a lot of information in the vertex properties; if I could use data frames to hold the properties instead of a VertexRDD, that would help me a lot. -- View this mess

Re: Skipped Jobs

2015-04-19 Thread Mark Hamstra
Almost. Jobs don't get skipped. Stages and Tasks do if the needed results are already available. On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee wrote: > The job is skipped because the results are available in memory from a > prior run. More info at: > http://mail-archives.apache.org/mod_mbox/spar
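A small way to see Mark's point in the UI, sketched with a reused shuffle: the second action reuses the shuffle output of the first, so its map-side stage shows up as skipped even though the job itself runs.

    val grouped = sc.parallelize(1 to 1000000).map(x => (x % 10, x)).reduceByKey(_ + _)
    grouped.count() // runs both stages
    grouped.count() // new job, but the shuffle map stage is skipped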

GraphX: unbalanced computation and slow runtime on livejournal network

2015-04-19 Thread harenbergsd
Hi all, I have been testing GraphX on the soc-LiveJournal1 network from the SNAP repository. Currently I am running on c3.8xlarge EC2 instances on Amazon. These instances have 32 cores and 60GB RAM per node, and so far I have run SSSP, PageRank, and WCC on a 1, 4, and 8 node cluster. The issues I

Re: GraphX: unbalanced computation and slow runtime on livejournal network

2015-04-19 Thread hnahak
Hi Steve, I did Spark 1.3.0 PageRank benchmarking on soc-LiveJournal1 on a 4-node cluster with 16, 16, 8, and 8 GB RAM respectively. The cluster has 4 workers, including the master, with 4, 4, 2, and 2 CPUs. I set executor memory to 3g and the driver to 5g. No. of iterations --> GraphX (mins) 1 --> 1 2

Re: Skipped Jobs

2015-04-19 Thread Denny Lee
Thanks for the correction Mark :) On Sun, Apr 19, 2015 at 3:45 PM Mark Hamstra wrote: > Almost. Jobs don't get skipped. Stages and Tasks do if the needed > results are already available. > > On Sun, Apr 19, 2015 at 3:18 PM, Denny Lee wrote: > >> The job is skipped because the results are avai

Re: newAPIHadoopRDD file name

2015-04-19 Thread hnahak
At the record reader level you can pass the file name as the key or value. sc.newAPIHadoopRDD(job.getConfiguration, classOf[AvroKeyInputFormat[myObject]], classOf[AvroKey[myObject]], classOf[Text] // can contain your file) AvroKeyInputFormat extends InputFormat { cretaRecor
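An alternative sketch that avoids writing a custom record reader (shown with TextInputFormat; the Avro case is analogous): the RDD returned by the new-Hadoop-API methods is a NewHadoopRDD at runtime, whose mapPartitionsWithInputSplit exposes the input split, and a FileSplit carries the file path.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.NewHadoopRDD

    val hadoopRdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data")
      .asInstanceOf[NewHadoopRDD[LongWritable, Text]]
    val withFileNames = hadoopRdd.mapPartitionsWithInputSplit { (split, it) =>
      val file = split.asInstanceOf[FileSplit].getPath.toString
      it.map { case (_, line) => (file, line.toString) } // pair each record with its file
    }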

Re: dataframe can not find fields after loading from hive

2015-04-19 Thread Yin Huai
Hi Cesar, Can you try 1.3.1 (https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it still shows the error? Thanks, Yin On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin wrote: > This is strange. cc the dev list since it might be a bug. > > > > On Thu, Apr 16, 2015 at 3:18 PM, C

compilation error

2015-04-19 Thread Brahma Reddy Battula
Hi All, I am getting the following error when compiling Spark. What did I miss? I even googled it and did not find an exact solution... [ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project spark-assembly_2.10: Error creating shaded jar:

Re: Can't get SparkListener to work

2015-04-19 Thread Shixiong Zhu
The problem is the code you use to test: sc.parallelize(List(1, 2, 3)).map(throw new SparkException("test")).collect(); is like the following example: def foo: Int => Nothing = { throw new SparkException("test") } sc.parallelize(List(1, 2, 3)).map(foo).collect(); So actually the Spark jobs do
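The corrected test, as a sketch: wrapping the throw in a function literal means the exception is thrown inside the tasks on the executors, so a job is actually submitted and the listener callbacks fire.

    import org.apache.spark.SparkException

    sc.parallelize(List(1, 2, 3)).map { x =>
      throw new SparkException("test") // now evaluated per element, inside the task
    }.collect()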

Re: compilation error

2015-04-19 Thread Ted Yu
What JDK release are you using? Can you give the complete command you used? Which Spark branch are you working with? Cheers On Sun, Apr 19, 2015 at 7:25 PM, Brahma Reddy Battula < brahmareddy.batt...@huawei.com> wrote: > Hi All, > > I am getting the following error when compiling Spark. What d

Code Deployment tools in Production

2015-04-19 Thread Arun Patel
Generally, what tools are used to schedule Spark jobs in production? How is Spark Streaming code deployed? I am interested in knowing the tools used, like cron, oozie, etc. Thanks, Arun

Re: Can't get SparkListener to work

2015-04-19 Thread Praveen Balaji
Thanks Shixiong. I'll try this. On Sun, Apr 19, 2015, 7:36 PM Shixiong Zhu wrote: > The problem is the code you use to test: > > > sc.parallelize(List(1, 2, 3)).map(throw new > SparkException("test")).collect(); > > is like the following example: > > def foo: Int => Nothing = { > throw new Spa

RE: compilation error

2015-04-19 Thread Brahma Reddy Battula
Hey Todd, Thanks a lot for your reply... Kindly check the following details: Spark version: 1.1.0; JDK: jdk1.7.0_60; command: mvn -Pbigtop-dist -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=V100R001C00 -DskipTests package Thanks & Regards Brahma Reddy Battula From

Re: compilation error

2015-04-19 Thread Ted Yu
bq. -Dhadoop.version=V100R001C00 First time I have seen the above hadoop version. It doesn't look like an Apache release. I checked my local maven repo but didn't find an impl under ~/.m2/repository/com/ibm/icu FYI On Sun, Apr 19, 2015 at 8:04 PM, Brahma Reddy Battula < brahmareddy.batt...@huawei.com> wrote: >

[STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges

2015-04-19 Thread RimBerry
Hi everyone, I am trying to use the direct approach in streaming-kafka-integration, pulling data from Kafka as follows: JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStr

Re: [STREAMING KAFKA - Direct Approach] JavaPairRDD cannot be cast to HasOffsetRanges

2015-04-19 Thread Sean Owen
You need to access the underlying RDD with .rdd() and cast that. That works for me. On Mon, Apr 20, 2015 at 4:41 AM, RimBerry wrote: > Hi everyone, > > i am trying to use the direct approach in streaming-kafka-integration >
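Sean's suggestion, sketched: in the Java API the JavaPairRDD is a wrapper, hence the .rdd() call before the cast. The equivalent pattern in Scala, assuming messages is the DStream returned by createDirectStream, looks like:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    messages.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach(o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
    }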

Re: compilation error

2015-04-19 Thread Sean Owen
Brahma, since you can see the continuous integration builds are passing, it's got to be something specific to your environment, right? This is not even an error from Spark, but from the Maven plugins. On Mon, Apr 20, 2015 at 4:42 AM, Ted Yu wrote: > bq. -Dhadoop.version=V100R001C00 > > First time I ha

RE: compilation error

2015-04-19 Thread Brahma Reddy Battula
Thanks a lot for your replies. @Ted: V100R001C00 is our internal hadoop version, which is based on hadoop 2.4.1. @Sean Owen: Yes, you are correct... I just wanted to know what leads to this problem... Thanks & Regards Brahma Reddy Battula From: Sean O

SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-19 Thread Ankit Patel
I am experiencing a problem with Spark Streaming (Spark 1.2.0): the onStart method is never called on my CustomReceiver when calling spark-submit against a master node with multiple workers. However, Spark Streaming works fine with no master node set. Has anyone noticed this issue?

Re: Code Deployment tools in Production

2015-04-19 Thread Vova Shelgunov
http://23.251.129.190:8090/spark-twitter-streaming-web/analysis/3fb28f76-62fe-47f3-a1a8-66ac610c2447.html On 20 Apr 2015 05:45, "Arun Patel" wrote: > Generally, what tools are used to schedule Spark jobs in production? > > How is Spark Streaming code deployed? > > I am interested in knowing the tools used, like cron, oozie, etc. > > Thank

Re: how to make a spark cluster ?

2015-04-19 Thread Jörn Franke
Hi, If you have just one physical machine then I would try out Docker instead of a full VM (which would be a waste of memory and CPU). Best regards On 20 Apr 2015 00:11, "hnahak" wrote: > Hi All, > > I've got a big physical machine with 16 CPUs, 256 GB RAM, and a 20 TB hard disk. I just > need to know what s

sparksql - HiveConf not found during task deserialization

2015-04-19 Thread Manku Timma
I am using spark-1.3 with the hadoop-provided, hive-provided, and hive-0.13.1 profiles. I am running a simple spark job on a yarn cluster by adding all hadoop2 and hive13 jars to the spark classpaths. If I remove hive-provided while building spark, I don't face any issue. But with hive-provided I

How to run spark programs in eclipse like mapreduce

2015-04-19 Thread sandeep vura
Hi Sparkers, I have written code in Python in Eclipse; now that code should execute on a Spark cluster, like MapReduce jobs on a Hadoop cluster. Can anyone please help me with instructions? Regards, Sandeep.v

Re: How to run spark programs in eclipse like mapreduce

2015-04-19 Thread ๏̯͡๏
I just do "Run As Application"/"Debug As Application" on the main program. On Mon, Apr 20, 2015 at 12:14 PM, sandeep vura wrote: > Hi Sparkers, > > I have written code in Python in Eclipse; now that code should execute on a > Spark cluster, like MapReduce jobs on a Hadoop cluster. Can anyone please help

Re: How to run spark programs in eclipse like mapreduce

2015-04-19 Thread Akhil Das
Why not build the project and submit the built jar with spark-submit? If you want to run it within Eclipse, then all you have to do is create a SparkContext pointing to your cluster, do a sc.addJar("/path/to/your/project/jar"), and then you can hit the run button to run the job (note that network
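Akhil's suggestion, sketched (the host name and jar path are placeholders): point the context at the cluster's master URI and ship the project jar so the executors can load your classes.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("from-eclipse")
      .setMaster("spark://master-host:7077") // as shown in the master web UI
    val sc = new SparkContext(conf)
    sc.addJar("/path/to/your/project/jar")  // make the job's classes available to executors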

Re: sparksql - HiveConf not found during task deserialization

2015-04-19 Thread Akhil Das
Looks like a missing jar, try to print the classpath and make sure the hive jar is present. Thanks Best Regards On Mon, Apr 20, 2015 at 11:52 AM, Manku Timma wrote: > I am using spark-1.3 with hadoop-provided and hive-provided and > hive-0.13.1 profiles. I am running a simple spark job on a yar

Re: SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-19 Thread Akhil Das
It would be good if you could paste your custom receiver code and the code that you used to invoke it. Thanks Best Regards On Mon, Apr 20, 2015 at 9:43 AM, Ankit Patel wrote: > > I am experiencing a problem with Spark Streaming (Spark 1.2.0): the onStart > method is never called on my CustomReceiver when

Running spark over HDFS

2015-04-19 Thread madhvi
Hi All, I am new to Spark and have installed a Spark cluster on top of my Hadoop cluster. I want to process data stored in HDFS through Spark. When I run code in Eclipse it gives the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted any r

Re: MLlib -Collaborative Filtering

2015-04-19 Thread Nick Pentreath
You will have to get the two user factor vectors from the ALS model and compute the cosine similarity between them. You can do this using Breeze vectors: import breeze.linalg._ val user1 = new DenseVector[Double](userFactors.lookup("user1").head) val user2 = new DenseVector[Double](userFactors.loo
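Completing the truncated snippet above, as a sketch; it assumes userFactors is an RDD of user factor vectors keyed by id (e.g. built from MatrixFactorizationModel.userFeatures), and computes the cosine similarity of the two users:

    import breeze.linalg._

    val user1 = new DenseVector[Double](userFactors.lookup("user1").head)
    val user2 = new DenseVector[Double](userFactors.lookup("user2").head)
    val cosineSimilarity = (user1 dot user2) / (norm(user1) * norm(user2))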

Re: sparksql - HiveConf not found during task deserialization

2015-04-19 Thread Manku Timma
Akhil, But the first case of creating HiveConf on the executor works fine (map case). Only the second case fails. I was suspecting some foul play with classloaders. On 20 April 2015 at 12:20, Akhil Das wrote: > Looks like a missing jar, try to print the classpath and make sure the > hive jar is
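A quick way to test the classloader suspicion, as a sketch: ask an executor which loader serves HiveConf and compare it with the driver's answer.

    // Runs on one executor; String.valueOf guards against a null (bootstrap) loader.
    val executorLoader = sc.parallelize(Seq(1), 1).map { _ =>
      String.valueOf(Class.forName("org.apache.hadoop.hive.conf.HiveConf").getClassLoader)
    }.collect().head
    val driverLoader =
      String.valueOf(Class.forName("org.apache.hadoop.hive.conf.HiveConf").getClassLoader)
    println(s"executor: $executorLoader, driver: $driverLoader")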

Re: shuffle.FetchFailedException in spark on YARN job

2015-04-19 Thread Akhil Das
Which version of Spark are you using? Did you try using spark.shuffle.blockTransferService=nio? Thanks Best Regards On Sat, Apr 18, 2015 at 11:14 PM, roy wrote: > Hi, > > My spark job is failing with following error message > > org.apache.spark.shuffle.FetchFailedException: > > /mnt/ephemeral12
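Where that property would be set, as a sketch: on the SparkConf before the context is created, or equivalently via --conf spark.shuffle.blockTransferService=nio on spark-submit.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.blockTransferService", "nio") // default in 1.2+ is netty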

NEWBIE/not able to connect to postgresql using jdbc

2015-04-19 Thread shashanksoni
I am using a spark 1.3 standalone cluster on my local Windows machine and trying to load data from one of our servers. Below is my code - import os os.environ['SPARK_CLASSPATH'] = "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\postgresql-9.2-1002.jdbc3.jar" from pyspark import SparkContext, Spar

Re: Running spark over HDFS

2015-04-19 Thread Akhil Das
In your Eclipse code, when you create your SparkContext, set the master URI as shown in the web UI's top-left corner, like spark://someIPorHost:7077, and it should be fine. Thanks Best Regards On Mon, Apr 20, 2015 at 12:22 PM, madhvi wrote: > Hi All, > > I am new to Spark and have installed a Spark cl
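Putting Akhil's advice together as a sketch (host names and paths are placeholders): the master URI comes from the standalone web UI, and HDFS paths then resolve against the cluster's namenode.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("hdfs-test").setMaster("spark://someIPorHost:7077"))
    val lines = sc.textFile("hdfs://namenode:9000/path/to/input")
    println(lines.count())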