spark application was submitted twice unexpectedly

2015-04-18 Thread Pengcheng Liu
Hi guys, I was trying to submit several Spark applications to a standalone cluster at the same time from a shell script. One issue I met is that sometimes an application may be submitted to the cluster twice unexpectedly (from the web UI, I can see the two applications of the same name were generated exactly at th

Re: Can't get SparkListener to work

2015-04-18 Thread Archit Thakur
Hi Praveen, Can you try once removing the throw exception in map? Do you still not get it? On Apr 18, 2015 8:14 AM, "Praveen Balaji" wrote: > Thanks for the response, Imran. I probably chose the wrong methods for > this email. I implemented all methods of SparkListener and the only > callback I get

Re: Custom partioner

2015-04-18 Thread Archit Thakur
Yes you can. Use the partitionBy method and pass a partitioner to it. On Apr 17, 2015 8:18 PM, "Jeetendra Gangele" wrote: > Ok, is there a way I can use hash partitioning so that I can improve the > performance? > > > On 17 April 2015 at 19:33, Archit Thakur > wrote: > >> By custom installation, I me
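
A minimal sketch of the suggestion above, assuming a pair RDD in Scala; the partition count of 8 is an arbitrary placeholder:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // partitionBy redistributes the pair RDD with the given Partitioner;
    // HashPartitioner buckets keys by hashCode, so equal keys land in the same partition
    val partitioned = pairs.partitionBy(new HashPartitioner(8))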

MLlib -Collaborative Filtering

2015-04-18 Thread riginos
Is there any way that I can see the similarity table of 2 users in that algorithm? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Collaborative-Filtering-tp22552.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: When querying ElasticSearch, score is 0

2015-04-18 Thread Nick Pentreath
ES-hadoop uses a scan & scroll search to efficiently retrieve large result sets. Scores are not tracked in a scan and sorting is not supported hence 0 scores. http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan — Sent from Mailbox

Re: MLlib -Collaborative Filtering

2015-04-18 Thread Nick Pentreath
What do you mean by similarity table of 2 users? Do you mean the similarity between 2 users? — Sent from Mailbox On Sat, Apr 18, 2015 at 11:09 AM, riginos wrote: > Is there any way that i can see the similarity table of 2 users in that > algorithm? > -- > View this message in context: >

Re: When querying ElasticSearch, score is 0

2015-04-18 Thread Andrejs Abele
Thank you for the information. Cheers, Andrejs On 04/18/2015 10:23 AM, Nick Pentreath wrote: > ES-hadoop uses a scan & scroll search to efficiently retrieve large > result sets. Scores are not tracked in a scan and sorting is not > supported hence 0 scores. > > http://www.elastic.co/guide/en/elast

spark with kafka

2015-04-18 Thread Shushant Arora
Hi, I want to consume messages from a Kafka queue using a Spark batch program, not Spark Streaming. Is there any way to achieve this other than using the low-level (simple) Kafka consumer API? Thanks

MLlib -Collaborative Filtering

2015-04-18 Thread riginos
Is there any way that I can see the similarity table of 2 users in that algorithm? By that I mean the similarity between 2 users -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Collaborative-Filtering-tp22553.html Sent from the Apache Spark User List m

Number of input partitions in SparkContext.sequenceFile

2015-04-18 Thread Wenlei Xie
Hi, I am wondering what mechanism determines the number of partitions created by SparkContext.sequenceFile? For example, although my file has only 4 splits, Spark would create 16 partitions for it. Is it determined by the file size? Is there any way to control it? (Looks like I can only tune
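
A minimal sketch of the usual knob, assuming Hadoop Writable key/value classes; the actual split count also depends on the file size and the Hadoop split settings, so minPartitions is only a lower-bound hint:

    import org.apache.hadoop.io.{IntWritable, Text}

    // The 4th argument is minPartitions, a hint passed to the underlying InputFormat;
    // Spark may still create more partitions based on the computed input splits
    val rdd = sc.sequenceFile("hdfs:///path/to/data", classOf[Text], classOf[IntWritable], 4)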

Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
I'm trying to solve a word-count-like problem; the difference is that I need the count of a specific word within a specific timespan in a social message stream. My data is in the format of (time, message), and I transformed (flatMap etc.) it into a series of (time, word_id); the time is rep

RE: spark with kafka

2015-04-18 Thread Ganelin, Ilya
Write Kafka stream to HDFS via Spark streaming then ingest files via Spark from HDFS. Sent with Good (www.good.com) -Original Message- From: Shushant Arora [shushantaror...@gmail.com] Sent: Saturday, April 18, 2015 06:44 AM Eastern Standard Time To:

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
Use KafkaRDD directly. It is in spark-streaming-kafka package On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora wrote: > Hi > > I want to consume messages from kafka queue using spark batch program not > spark streaming, Is there any way to achieve this, other than using low > level(simple api) of
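
A minimal sketch of reading a fixed offset range as a batch RDD, assuming Spark 1.3.x and the spark-streaming-kafka artifact; the broker, topic, and offsets below are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    // Each OffsetRange pins (topic, partition, fromOffset, untilOffset) for the batch read
    val offsetRanges = Array(OffsetRange("mytopic", 0, 0L, 1000L))
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)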

Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Ted Yu
Can you show us the function you passed to reduceByKey() ? What release of Spark are you using ? Cheers On Sat, Apr 18, 2015 at 8:17 AM, SecondDatke wrote: > I'm trying to solve a Word-Count like problem, the difference lies in > that, I need the count of a specific word among a specific times

Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Sean Owen
Do these datetime objects implement the notion of equality you'd expect? (This may be a dumb question; I'm thinking of the equivalent of equals() / hashCode() from the Java world.) On Sat, Apr 18, 2015 at 4:17 PM, SecondDatke wrote: > I'm trying to solve a Word-Count like problem, the differenc
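
For reference, a minimal sketch (in Scala rather than the original poster's Python) showing that reduceByKey handles non-numeric, composite keys as long as equality and hashing are consistent:

    // A (timestamp, word) pair used as the key; reduceByKey groups by equals()/hashCode()
    val counts = sc.parallelize(Seq(
        (("2009-10-06T03:00", "spark"), 1),
        (("2009-10-06T03:00", "spark"), 1),
        (("2009-10-06T04:00", "spark"), 1)))
      .reduceByKey(_ + _)
    // expected: (("2009-10-06T03:00","spark"),2) and (("2009-10-06T04:00","spark"),1)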

Re: spark with kafka

2015-04-18 Thread Ilya Ganelin
That's a much better idea :) On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers wrote: > Use KafkaRDD directly. It is in spark-streaming-kafka package > > On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora > wrote: > >> Hi >> >> I want to consume messages from kafka queue using spark batch program not

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
Thanks for your reply. I used a simple (a+b) function for reduceByKey(), nothing special. And I'm running Spark 1.3.1 RC3 (don't know how much it differs from the final release). Date: Sat, 18 Apr 2015 08:28:50 -0700 Subject: Re: Does reduceByKey only work properly for numeric keys? From: yuzhih.

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
The comparison of Python tuples is lexicographical (that is, two tuples are equal if and only if every element is the same and in the same position), and though I'm not very clear on what's going on inside Spark, I have tried comparing the duplicate keys in the result with the == operator, and hashing them

newAPIHadoopRDD file name

2015-04-18 Thread Manas Kar
I would like to get the file name along with the associated objects so that I can do further mapping on it. My code below gives me AvroKey[myObject], NullWritable but I don't know how to get the file that gave those objects. sc.newAPIHadoopRDD(job.getConfiguration, classOf[AvroKeyInputFo

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
Well, I'm changing the strategy. (Assuming we have N(words) < 10, interval between two time points > 1 hour) * Use only the timestamp as the key, no duplicate key. But it would only reduce to something like "the total number of words in a timespan". * Key = tuple(timestamp, word_id), duplicati

Re: Can't get SparkListener to work

2015-04-18 Thread Praveen Balaji
Thanks for the response, Archit. I get callbacks when I do not throw an exception from map. My use case, however, is to get callbacks for exceptions in transformations on executors. Do you think I'm going down the right route? Cheers -p On Sat, Apr 18, 2015 at 1:49 AM, Archit Thakur wrote: > Hi
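
A minimal sketch of one place where executor-side exceptions do surface on the driver, assuming Spark 1.3.x where SparkListener is a trait with no-op defaults: the failure arrives as the TaskEndReason of a task-end event rather than through a dedicated exception callback.

    import org.apache.spark.ExceptionFailure
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        taskEnd.reason match {
          // ExceptionFailure carries the class name and description of the exception
          // thrown inside the transformation on the executor
          case e: ExceptionFailure => println(s"task failed: ${e.description}")
          case _ =>
        }
    })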

shuffle.FetchFailedException in spark on YARN job

2015-04-18 Thread roy
Hi, My spark job is failing with following error message org.apache.spark.shuffle.FetchFailedException: /mnt/ephemeral12/yarn/nm/usercache/abc/appcache/application_1429353954024_1691/spark-local-20150418132335-0723/28/shuffle_3_1_0.index (No such file or directory) at org.apache.spark.s

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
I don't think my experiment is surprising, it's my fault: To move away from my case, I wrote a test program which generates data randomly and casts the key to string: import random; import operator; COUNT = 2; COUNT_PARTITIONS = 36; LEN = 233; rdd = sc.parallelize(((str(random.randint(1, LEN)), 1) for

Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Ted Yu
Looks like the two keys involving (datetime.datetime(2009, 10, 6, 3, 0) in your first email might have been processed by x86 node and x64 node, respectively. Cheers On Sat, Apr 18, 2015 at 11:09 AM, SecondDatke wrote: > I don't think my experiment is suprising, it's my fault: > > To move away f

Re: spark with kafka

2015-04-18 Thread Shushant Arora
Thanks !! I have a few more doubts: Does KafkaRDD use the simple API or the high-level API for the Kafka consumer, i.e. do I need to handle partition offsets myself or will that be taken care of by KafkaRDD? Plus, which one is better for batch programming? I have a requirement to read kafka messages by a spark

Can a map function return null

2015-04-18 Thread Steve Lewis
I find a number of cases where I have a JavaRDD and I wish to transform the data and, depending on a test, return 0 or 1 items (don't suggest a filter - the real case is more complex). So I currently do something like the following - perform a flatMap returning a list with 0 or 1 entries depending on

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
KafkaRDD uses the simple consumer api, and I think you need to handle offsets yourself, unless things changed since I last looked. I would do the second approach. On Sat, Apr 18, 2015 at 2:42 PM, Shushant Arora wrote: > Thanks !! > I have few more doubts : > > Does kafka RDD uses simpleAPI for kafk

Re: spark with kafka

2015-04-18 Thread Shushant Arora
Thanks Koert. So in short, for the high-level API I'll have to go with Spark Streaming only, and there the issue is handling cluster restarts. Is that why you opted for the second approach of a batch job, or was it due to the batch interval (2 hours is large for a streaming job), or some other reason? On Sun, Apr 19, 2015 a

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
Yeah, I think I would pick the second approach because it is simpler operationally in case of any failures. But of course the smaller the window gets, the more attractive the streaming solution gets. We do daily extracts, not every 2 hours. On Sat, Apr 18, 2015 at 2:57 PM, Shushant Arora wrote: > T

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
I mean to say it is simpler in case of failures, restarts, upgrades, etc. Not just failures. But they did do a lot of work on streaming from kafka in spark 1.3.x to make it simpler (streaming simply calls KafkaRDD for every batch if you use KafkaUtils.createDirectStream), so maybe I am wrong and s
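
For comparison, a minimal sketch of the 1.3.x direct stream mentioned above; the 2-hour batch interval, broker, and topic are placeholders taken from this thread's use case:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(2 * 60 * 60))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    // The direct stream computes one KafkaRDD per batch and tracks offsets itself
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))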

Re: Can a map function return null

2015-04-18 Thread Olivier Girardot
You can return an RDD with null values inside, and afterwards filter on "item != null". In Scala (or even in Java 8) you'd rather use Option/Optional, and in Scala they're directly usable from Spark. Example: sc.parallelize(1 to 1000).flatMap(item => if (item % 2 == 0) Some(item) else None).collec
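
The snippet above, completed as a runnable sketch against a plain SparkContext:

    // flatMap over Option yields 0 or 1 outputs per input without ever materializing nulls
    val evens = sc.parallelize(1 to 1000)
      .flatMap(item => if (item % 2 == 0) Some(item) else None)
    evens.collect()   // Array(2, 4, 6, ..., 1000)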

Dataframes Question

2015-04-18 Thread Arun Patel
Experts, I have a few basic questions on DataFrames vs Spark SQL. My confusion is more with DataFrames. 1) What is the difference between Spark SQL and DataFrames? Are they the same? 2) Documentation says SchemaRDD is renamed as DataFrame. Does this mean SchemaRDD does not exist in 1.3? 3) As per doc

spark sql error with proto/parquet

2015-04-18 Thread Abhishek R. Singh
I have created a bunch of protobuf-based parquet files that I want to read/inspect using Spark SQL. However, I am running into exceptions and am not able to proceed much further: This succeeds (probably because there is no action yet). I can also printSchema() and count() without any

Re: Dataframes Question

2015-04-18 Thread Abhishek R. Singh
I am no expert myself, but from what I understand DataFrame is grandfathering SchemaRDD. This was done for API stability as Spark SQL matured out of alpha as part of the 1.3.0 release. It is forward-looking and brings (dataframe-like) syntax that was not available with the older SchemaRDD. On Ap
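
A minimal sketch of the 1.3.0 API being discussed; the case class and column names are made up for illustration, and SchemaRDD survives only as a deprecated alias for DataFrame:

    import sqlContext.implicits._

    case class Person(name: String, age: Int)
    val df = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25))).toDF()

    // DataFrame operations replace the old SchemaRDD methods...
    df.filter(df("age") > 26).select("name").show()

    // ...while plain Spark SQL still works against the same data
    df.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 26").show()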

Spark-csv data source: infer data types

2015-04-18 Thread Oleg Shirokikh
I'm experimenting with Spark-CSV package (https://github.com/databricks/spark-csv) for reading csv files into Spark DataFrames. Everything works but all columns are assumed to be of StringType. As shown in Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html

Spark Cassandra Connector

2015-04-18 Thread DStrip
Hello, I am facing some difficulties installing the Cassandra Spark connector. Specifically, I am working with Cassandra 2.0.13 and Spark 1.2.1. I am trying to build (create the JAR for) the connector, but unfortunately I cannot "see" in which file I have to declare the dependencies libraryDe
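
For an sbt build, the declaration usually lives in build.sbt at the project root; a sketch, with the versions as assumptions to be checked against the connector's compatibility table for Cassandra 2.0.x / Spark 1.2.1:

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark"   %% "spark-core"                % "1.2.1" % "provided",
      "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0")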

Spark-csv data source: infer data types

2015-04-18 Thread barmaley
I'm experimenting with Spark-CSV package (https://github.com/databricks/spark-csv) for reading csv files into Spark DataFrames. Everything works but all columns are assumed to be of StringType. As shown in Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html), fo
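
Since the spark-csv options are truncated in this thread, here is a hedged sketch of a workaround that relies only on core Spark SQL 1.3 APIs: load the columns as strings and cast them afterwards (column names below are hypothetical):

    import org.apache.spark.sql.types.{DoubleType, IntegerType}

    val raw = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> "cars.csv", "header" -> "true"))

    // Every column arrives as StringType; cast the ones that should be numeric
    val typed = raw.select(
      raw("id").cast(IntegerType).as("id"),
      raw("price").cast(DoubleType).as("price"))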