How to stream all data out of a Kafka topic once, then terminate job?

2015-04-28 Thread dgoldenberg
Hi, I'm wondering about the use-case where you're not doing continuous, incremental streaming of data out of Kafka but rather want to publish data once with your Producer(s) and consume it once, in your Consumer, then terminate the consumer Spark job. JavaStreamingContext jssc = new JavaStreaming
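One batch-oriented way to do this (a minimal sketch, not the poster's code, assuming the direct Kafka API added in Spark 1.3 and a hypothetical topic/broker/offset range) is to skip the StreamingContext entirely and read a fixed offset range as a plain RDD, so the job terminates once the action finishes:

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val sc = new SparkContext(new SparkConf().setAppName("kafka-read-once"))

    // Hypothetical topic, partition and offsets; in practice look up the current offsets first.
    val offsetRanges = Array(OffsetRange("events", 0, 0L, 100000L))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Returns an RDD[(key, value)] covering exactly that offset range.
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)
    rdd.map(_._2).saveAsTextFile("hdfs:///tmp/kafka-dump")
    sc.stop()  // no StreamingContext involved, so nothing keeps running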

RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
I am not 100% sure how it's picking up the configuration. I copied the hbase-site.xml to the hdfs/spark cluster (single machine). I also included hbase-site.xml in the spark-job jar files. The spark-job jar file also has yarn-site and mapred-site and core-site.xml in it. One interesting thing is, when I run

Re: HBase HTable constructor hangs

2015-04-28 Thread Ted Yu
How did you distribute hbase-site.xml to the nodes ? Looks like HConnectionManager couldn't find the hbase:meta server. Cheers On Tue, Apr 28, 2015 at 9:19 PM, Tridib Samanta wrote: > I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0. > > Here is the jstack trace. Complete stack trace attached.

RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0. Here is the jstack trace. Complete stack trace attached. "Executor task launch worker-1" #58 daemon prio=5 os_prio=0 tid=0x7fd3d0445000 nid=0x488 waiting on condition [0x7fd4507d9000] java.lang.Thread.State: TIMED_WAITING (sleeping)

External Application Run Status

2015-04-28 Thread Nastooh Avessta (navesta)
Hi, In a multi-node setup, I am invoking a number of external apps, through Runtime.getRuntime.exec from an rdd.map function, and would like to track their completion status. Evidently, such calls spawn a separate thread, which is not tracked by the standalone scheduler, i.e., reduce or collect are cal

Re: Weird error/exception

2015-04-28 Thread Vadim Bichutskiy
I was having this issue when my batch interval was very big -- like 5 minutes. When my batch interval is smaller, I don't get this exception. Can someone explain to me why this might be happening? Vadim On Tue, Apr 28, 2015 at 4:26 PM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > I

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread bit1...@163.com
For the SparkContext#textFile, if a directory is given as the path parameter, then it will pick up the files in the directory, so the same thing will occur. bit1...@163.com From: Saisai Shao Date: 2015-04-29 10:54 To: Vadim Bichutskiy CC: bit1...@163.com; lokeshkumar; user Subject: Re: Re: S

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread Saisai Shao
I think it might be useful in Spark Streaming's file input stream, but I'm not sure it is useful in SparkContext#textFile, since we specify the file ourselves, so why would we still need to know the file name? I will open up a JIRA to mention this feature. Thanks Jerry 2015-04-29 10:49 GMT+08:00 V

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread Vadim Bichutskiy
I was wondering about the same thing. Vadim On Tue, Apr 28, 2015 at 10:19 PM, bit1...@163.com wrote: > Looks to me that the same thing also applies to the SparkContext.textFile > or SparkContext.wholeTextFile, there is no way in RDD to figure out the > file information where the data in RDD

Re: HBase HTable constructor hangs

2015-04-28 Thread Ted Yu
Can you give us more information ? Such as hbase release, Spark release. If you can pastebin jstack of the hanging HTable process, that would help. BTW I used http://search-hadoop.com/?q=spark+HBase+HTable+constructor+hangs and saw a very old thread with this subject. Cheers On Tue, Apr 28, 201

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread bit1...@163.com
Looks to me that the same thing also applies to the SparkContext.textFile or SparkContext.wholeTextFile, there is no way in RDD to figure out the file information where the data in RDD is from bit1...@163.com From: Saisai Shao Date: 2015-04-29 10:10 To: lokeshkumar CC: spark users Subject:
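For the batch (SparkContext) side of this, one partial workaround, assuming an existing SparkContext sc and files small enough to be read whole, is wholeTextFiles, which keeps the originating path with each record. A minimal sketch with a hypothetical input directory:

    // Each element is (filePath, fileContent), so the source file of every record is known.
    val files = sc.wholeTextFiles("hdfs:///data/input/")
    files.map { case (path, content) => (path, content.split("\n").length) }
      .collect()
      .foreach { case (path, lines) => println(s"$path -> $lines lines") }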

Question about Memory Used and VCores Used

2015-04-28 Thread bit1...@163.com
Hi guys, I have the following computation with 3 workers: spark-sql --master yarn --executor-memory 3g --executor-cores 2 --driver-memory 1g -e 'select count(*) from table' The resources used are shown below on the UI: I don't understand why the memory used is 15GB and vcores used is 5. I thin

Re: HBase HTable constructor hangs

2015-04-28 Thread tridib
I am having exactly the same issue. I am running hbase and spark in a docker container. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HBase-HTable-constructor-hangs-tp4926p22696.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread Saisai Shao
I think currently there's no API in Spark Streaming you can use to get the file names for file input streams. Actually it is not trivial to support this; maybe you could file a JIRA with the features you want the community to support, so anyone who is interested can take a crack at it. Thanks Jerry

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
btw, from spark web ui, the acl is marked with root Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr: http://www.flickr.com/photos/xulinhao/sets From: Dean Wampler To: Lin Hao Xu/China/IBM@IBMCN Cc: Hai Shan Wu/China/IBM@IBMCN, user Date: 2015/04/2

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
Actually, to simplify this problem, we run our program on a single machine with 4 slave workers. Since it is a single machine, I think all slave workers run with root privileges. BTW, if we have a cluster, how do we make sure slaves on remote machines run the program as root? Best regards, Lin Hao XU I

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Dean Wampler
Are the tasks on the slaves also running as root? If not, that might explain the problem. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
1. The full command line is written in a shell script: LIB=/home/spark/.m2/repository /opt/spark/bin/spark-submit \ --class spark.pcap.run.TestPcapSpark \ --jars $LIB/org/pcap4j/pcap4j-core/1.4.0/pcap4j-core-1.4.0.jar,$LIB/org/pcap4j/pcap4j-packetfactory-static/1.4.0/pcap4j-packetfactory-static-1

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-28 Thread Koert Kuipers
Our experience is that unless you can benefit from Spark features such as co-partitioning that allow for more efficient execution, Spark is slightly slower for disk-to-disk. On Apr 27, 2015 10:34 PM, "bit1...@163.com" wrote: > Hi, > > I am frequently asked why spark is also much faster than H

RE: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-28 Thread Mohammed Guller
One reason Spark on disk is faster than MapReduce is Spark’s advanced Directed Acyclic Graph (DAG) engine. MapReduce will require a complex job to be split into multiple Map-Reduce jobs, with disk I/O at the end of each job and beginning of a new job. With Spark, you may be able to express the s
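As a purely illustrative sketch (not from the thread, assuming an existing SparkContext sc and a hypothetical input path), a pipeline like the one below runs as a single Spark job whose stages are planned by the DAG scheduler, whereas the same logic in classic MapReduce would typically be chained as several Map-Reduce jobs with HDFS writes in between:

    // Several shuffle stages, one Spark job, no intermediate HDFS writes.
    val topWords = sc.textFile("hdfs:///logs/")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                      // first shuffle
      .filter { case (_, count) => count > 10 }
      .map { case (word, count) => (count, word) }
      .sortByKey(ascending = false)            // second shuffle
    topWords.take(20).foreach(println)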

Re: MLLib SVMWithSGD is failing for large dataset

2015-04-28 Thread sarathkrishn...@gmail.com
Hi, I'm just calling the standard SVMWithSGD implementation of Spark's MLLib. I'm not using any method like "collect". Thanks, Sarath On Tue, Apr 28, 2015 at 4:35 PM, ai he wrote: > Hi Sarath, > > It might be questionable to set num-executors as 64 if you only has 8 > nodes. Do you use any act

Re: MLLib SVMWithSGD is failing for large dataset

2015-04-28 Thread ai he
Hi Sarath, It might be questionable to set num-executors to 64 if you only have 8 nodes. Do you use any action like "collect" which will overwhelm the driver since you have a large dataset? Thanks On Tue, Apr 28, 2015 at 10:50 AM, sarath wrote: > > I am trying to train a large dataset consisting

Metric collection

2015-04-28 Thread Giovanni Paolo Gibilisco
Hi, I would like to collect some metrics from Spark and plot them with Graphite. I managed to do that with the metrics provided by org.apache.spark.metrics.source.JvmSource, but I would like to know if there are other sources available besides this one. Best, Giovanni

Re: Initial tasks in job take time

2015-04-28 Thread Anshul Singhle
yes On 29 Apr 2015 03:31, "ayan guha" wrote: > Are your driver running on the same m/c as master? > On 29 Apr 2015 03:59, "Anshul Singhle" wrote: > >> Hi, >> >> I'm running short spark jobs on rdds cached in memory. I'm also using a >> long running job context. I want to be able to complete my j

Re: Initial tasks in job take time

2015-04-28 Thread ayan guha
Are your driver running on the same m/c as master? On 29 Apr 2015 03:59, "Anshul Singhle" wrote: > Hi, > > I'm running short spark jobs on rdds cached in memory. I'm also using a > long running job context. I want to be able to complete my jobs (on the > cached rdd) in under 1 sec. > I'm getting

Re: Spark SQL 1.3.1 "saveAsParquetFile" will output tachyon file with different block size

2015-04-28 Thread Calvin Jia
Hi, You can apply this patch and recompile. Hope this helps, Calvin On Tue, Apr 28, 2015 at 1:19 PM, sara mustafa wrote: > Hi Zhang, > > How did you compile Spark 1.3.1 with Tachyon? when i changed Tachyon > version > to 0.6.3 in core/pom.xml, make-d

Re: Weird error/exception

2015-04-28 Thread Vadim Bichutskiy
I think I need to modify my code as discussed under "Design Patterns for using foreachRDD" in the docs. On Tue, Apr 28, 2015 at 4:26 PM, Vadim Bichutskiy < vadim.bichuts...@gmail.com> wrote: > I am using Spark Streaming to monitor an S3 bucket. Everything appears to > be fine. But every batch i

RE: Scalability of group by

2015-04-28 Thread Ulanov, Alexander
Richard, The same problem occurs with sort. I have enough disk space and tmp folder space. The errors in the logs say out of memory. I wonder what it holds in memory? Alexander From: Richard Marscher [mailto:rmarsc...@localytics.com] Sent: Tuesday, April 28, 2015 7:34 AM To: Ulanov, Alexander Cc: user@

Weird error/exception

2015-04-28 Thread Vadim Bichutskiy
I am using Spark Streaming to monitor an S3 bucket. Everything appears to be fine. But every batch interval I get the following: *15/04/28 16:12:36 WARN HttpMethodReleaseInputStream: Attempting to release HttpMethod in finalize() as its response data stream has gone out of scope. This attempt will

rdd.count with 100 elements taking 1 second to run

2015-04-28 Thread Anshul Singhle
Hi, I'm running the following code in my cluster (standalone mode) via spark shell - val rdd = sc.parallelize(1 to 100) rdd.count This takes around 1.2s to run. Is this expected or am I configuring something wrong? I'm using about 30 cores with 512MB executor memory As expected, GC time is ne

Initial tasks in job take time

2015-04-28 Thread Anshul Singhle
Hi, I'm running short spark jobs on rdds cached in memory. I'm also using a long running job context. I want to be able to complete my jobs (on the cached rdd) in under 1 sec. I'm getting the following job times with about 15 GB of data distributed across 6 nodes. Each executor has about 20GB of m

MLLib SVMWithSGD is failing for large dataset

2015-04-28 Thread sarath
I am trying to train a large dataset consisting of 8 million data points and 20 million features using SVMWithSGD. But it is failing after running for some time. I tried increasing num-partitions, driver-memory, executor-memory, driver-max-resultSize. Also I tried by reducing the size of dataset f
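For context, a minimal sketch of the standard MLlib 1.x call being used here (assuming an existing SparkContext sc and a hypothetical LIBSVM-formatted input path); it does not address the failure itself, but shows where the driver-side cost comes from:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.util.MLUtils

    // Hypothetical path; sparse LIBSVM input with a very high feature dimension.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm").cache()
    val model = SVMWithSGD.train(data, numIterations = 100)
    // Note: the trained model holds a dense weight vector of the full feature
    // dimension, so 20 million features already means ~160 MB on the driver
    // per weight vector, before any other overhead.
    println(model.weights.size)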

Re: JAVA_HOME problem

2015-04-28 Thread Marcelo Vanzin
Are you using a Spark build that matches your YARN cluster version? That seems like it could happen if you're using a Spark built against a newer version of YARN than you're running. On Thu, Apr 2, 2015 at 12:53 AM, 董帅阳 <917361...@qq.com> wrote: > spark 1.3.0 > > > spark@pc-zjqdyyn1:~> tail /etc/

How to setup this "false streaming" problem

2015-04-28 Thread Toni Cebrián
Hi, Just new to Spark and in need of some help framing the problem I have. A problem well stated is half solved, as the saying goes :) Let's say that I have a DStream[String] basically containing JSON of some measurements from IoT devices. In order to keep it simple, say that after unmars

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Silvio Fiorito
The initializer is a tuple (0, 0); it seems you just have 0. From: "subscripti...@prismalytics.io" Organization: PRISMALYTICS, LLC. Reply-To: "subscripti...@prismalytics.io" Date: Tuesday, April 28, 2015 at 1:28 PM To: Silvi

default number of reducers

2015-04-28 Thread Shushant Arora
In a normal MR job, can I configure (cluster-wide) a default number of reducers to use if I don't specify any reducers in my job?

How to run customized Spark on EC2?

2015-04-28 Thread Bo Fu
Hi experts, I have an issue. I added some timestamps in the Spark source code and built it using: mvn package -DskipTests I checked the new version on my own computer and it works. However, when I ran Spark on EC2, the Spark code the EC2 machines ran was the original version. Anyone know how to deplo

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread subscripti...@prismalytics.io
Thank you Todd, Silvio... I had to stare at Silvio's answer for a while. If I'm interpreting the aggregateByKey() statement correctly ... (Within-Partition Reduction Step) a: is a TUPLE that holds: (runningSum, runningCount). b: is a SCALAR that holds the next Value (Cross-Partiti

Code For Loading Graph from Edge Tuple File

2015-04-28 Thread geek2
Hi Everyone, Does anyone have example code for generating a graph from a file of edge name-edge name tuples? I've seen the example where a Graph is generated from an RDD of triplets composed of edge longs, but I'd like to see an example where a graph is built from an edge name-edge name file such
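One possible sketch (assuming an existing SparkContext sc and a hypothetical tab-separated file of srcName/dstName pairs): assign a numeric vertex id to every distinct name, then build the graph from the translated edges.

    import org.apache.spark.graphx.{Edge, Graph}

    // Hypothetical input: one "srcName\tdstName" pair per line.
    val rawEdges = sc.textFile("hdfs:///data/edges.tsv")
      .map(_.split("\t"))
      .map(a => (a(0), a(1)))

    // Give every distinct name a numeric vertex id.
    val vertexIds = rawEdges.flatMap { case (s, d) => Seq(s, d) }
      .distinct()
      .zipWithUniqueId()
      .cache()

    // Replace names with ids on both ends of each edge.
    val edges = rawEdges
      .join(vertexIds)                                   // (srcName, (dstName, srcId))
      .map { case (_, (dstName, srcId)) => (dstName, srcId) }
      .join(vertexIds)                                   // (dstName, (srcId, dstId))
      .map { case (_, (srcId, dstId)) => Edge(srcId, dstId, 1) }

    val vertices = vertexIds.map { case (name, id) => (id, name) }
    val graph = Graph(vertices, edges)
    println(s"vertices=${graph.numVertices} edges=${graph.numEdges}")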

Re: solr in spark

2015-04-28 Thread Nick Pentreath
Depends on your use case and search volume. Typically you'd have a dedicated ES cluster if your app is doing a lot of real time indexing and search. If it's only for spark integration then you could colocate ES and spark — Sent from Mailbox On Tue, Apr 28, 2015 at 6:41 PM, Jeetendra Gangel

Spark - Timeout Issues - OutOfMemoryError

2015-04-28 Thread ๏̯͡๏
I have a SparkApp that runs completes in 45 mins for 5 files (5*750MB size) and it takes 16 executors to do so. I wanted to run it against 10 files of each input type (10*3 files as there are three inputs that are transformed). [Input1 = 10*750 MB, Input2=10*2.5GB, Input3 = 10*1.5G], Hence i used

Re: Question regarding join with multiple columns with pyspark

2015-04-28 Thread Ali Bajwa
Thanks again Ayan! To close the loop on this issue, I have filed the below JIRA to track the issue: https://issues.apache.org/jira/browse/SPARK-7197 On Fri, Apr 24, 2015 at 8:21 PM, ayan guha wrote: > I just tested, your observation in DataFrame API is correct. It behaves > weirdly in case of

PySpark: slicing issue with dataframes

2015-04-28 Thread Ali Bajwa
Hi experts, Trying to use the "slicing" functionality in strings as part of a Spark program (PySpark) I get this error: Code import pandas as pd from pyspark.sql import SQLContext hc = SQLContext(sc) A = pd.DataFrame({'Firstname': ['James', 'Ali', 'Daniel'], 'Lastname': ['Jones', 'Bajw

Re: solr in spark

2015-04-28 Thread Jeetendra Gangele
Thanks for the reply. Will the Elasticsearch index be within my cluster, or do I need to host Elasticsearch separately? On 28 April 2015 at 22:03, Nick Pentreath wrote: > I haven't used Solr for a long time, and haven't used Solr in Spark. > > However, why do you say "Elasticsearch is not a good opt

Re: solr in spark

2015-04-28 Thread andy petrella
AFAIK DataStax is heavily looking at it; they have a good integration of Cassandra with it. The next step was clearly to have a strong combination of the three in one of the coming releases. Le mar. 28 avr. 2015 18:28, Jeetendra Gangele a écrit : > Does anyone tried using solr inside spark? > below is

Re: solr in spark

2015-04-28 Thread Nick Pentreath
I haven't used Solr for a long time, and haven't used Solr in Spark. However, why do you say "Elasticsearch is not a good option ..."? ES absolutely supports full-text search and not just filtering and grouping (in fact its original purpose was, and still is, text search, though filtering, grouping

solr in spark

2015-04-28 Thread Jeetendra Gangele
Has anyone tried using Solr inside Spark? Below is the project describing it: https://github.com/LucidWorks/spark-solr. I have a requirement in which I want to index 20 million company names and then search them as and when new data comes in. The output should be a list of companies matching the query

Re: How to deploy self-build spark source code on EC2

2015-04-28 Thread Nicholas Chammas
[-dev] [+user] This is a question for the user list, not the dev list. Use the --spark-version and --spark-git-repo options to specify your own repo and hash to deploy. Source code link. N

How to run self-build spark on EC2?

2015-04-28 Thread Bo Fu
Hi all, I have an issue. I added some timestamps in the Spark source code and built it using: mvn package -DskipTests I checked the new version on my own computer and it works. However, when I ran Spark on EC2, the Spark code the EC2 machines ran was the original version. Anyone know how to deploy th

Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread lokeshkumar
Hi Forum, Using spark streaming and listening to the files in HDFS using textFileStream/fileStream methods, how do we get the fileNames which are read by these methods? I used textFileStream which has file contents in JavaDStream and I got no success with fileStream as it is throwing me a compila

Re: Best practices on testing Spark jobs

2015-04-28 Thread Sourav Chandra
Hi, Can you give some tutorials/examples of how to write test cases based on the mentioned framework? Thanks, Sourav On Tue, Apr 28, 2015 at 9:22 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Sorry that’s correct, I was thinking you were maybe trying to mock > certain aspects of Spa
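Not official documentation, but a minimal ScalaTest sketch of what a unit test with spark-testing-base's SharedSparkContext can look like (the transformation under test here is made up):

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    class WordCountSuite extends FunSuite with SharedSparkContext {
      // SharedSparkContext provides a `sc` set up before and torn down after the suite.
      test("counts words") {
        val input = sc.parallelize(Seq("a b", "a"))
        val counts = input.flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .collectAsMap()
        assert(counts("a") === 2)
        assert(counts("b") === 1)
      }
    }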

Re: hive-thriftserver maven artifact

2015-04-28 Thread Ted Yu
Credit goes to Misha Chernetsov (see SPARK-4925) FYI On Tue, Apr 28, 2015 at 8:25 AM, Marco wrote: > Thx Ted for the info ! > > 2015-04-27 23:51 GMT+02:00 Ted Yu : > >> This is available for 1.3.1: >> >> http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10 >> >> FYI >
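For anyone wiring this into a build, the sbt equivalent of the Maven coordinate linked above (the Scala 2.10 build of Spark 1.3.1) would be roughly:

    // build.sbt
    libraryDependencies += "org.apache.spark" % "spark-hive-thriftserver_2.10" % "1.3.1"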

Re: Best practices on testing Spark jobs

2015-04-28 Thread Silvio Fiorito
Sorry that’s correct, I was thinking you were maybe trying to mock certain aspects of Spark core to write your tests. This is a library to help write unit tests by managing the SparkContext and StreamingContext. So you can test your transformations as necessary. More importantly on the streaming

Re: Best practices on testing Spark jobs

2015-04-28 Thread Michal Michalski
Thanks Silvio. I might be missing something, but it looks like this project is a kind of "framework" for setting up Spark for testing; after taking a quick look at the code, it doesn't seem like it's solving the problem with mocking, which is my main concern now - am I wrong? Kind regards, M

Re: Best practices on testing Spark jobs

2015-04-28 Thread Silvio Fiorito
Hi Michal, Please try spark-testing-base by Holden. I’ve used it and it works well for unit testing batch and streaming jobs https://github.com/holdenk/spark-testing-base Thanks, Silvio From: Michal Michalski Date: Tuesday, April 28, 2015 at 11:32 AM To: user Subject: Best practices on testing

Best practices on testing Spark jobs

2015-04-28 Thread Michal Michalski
Hi, I have two questions regarding testing Spark jobs: 1. Is it possible to use Mockito for that purpose? I tried to use it, but it looks like there are no interactions with mocks. I didn't dive into the details of how Mockito works, but I guess it might be because of the serialization and how Sp

Re: hive-thriftserver maven artifact

2015-04-28 Thread Marco
Thx Ted for the info ! 2015-04-27 23:51 GMT+02:00 Ted Yu : > This is available for 1.3.1: > > http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10 > > FYI > > On Mon, Feb 16, 2015 at 7:24 AM, Marco wrote: > >> Ok, so will it be only available for the next version (1.30

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Silvio Fiorito
If you need to keep the keys, you can use aggregateByKey to calculate an avg of the values: val step1 = data.aggregateByKey((0.0, 0))((a, b) => (a._1 + b, a._2 + 1), (a, b) => (a._1 + b._1, a._2 + b._2)) val avgByKey = step1.mapValues(i => i._1/i._2) Essentially, what this is doing is passing a
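Put end to end as a small sketch (assuming an existing SparkContext sc, with made-up data shaped like the date/value pairs from the original question):

    val data = sc.parallelize(Seq(
      ("2013-10-09", 7.60), ("2013-10-10", 9.32), ("2013-10-10", 28.26)))

    // Within each partition, fold every value into a running (sum, count);
    // across partitions, merge the partial (sum, count) pairs.
    val sumCounts = data.aggregateByKey((0.0, 0))(
      (acc, value) => (acc._1 + value, acc._2 + 1),
      (a, b)       => (a._1 + b._1, a._2 + b._2))

    val avgByKey = sumCounts.mapValues { case (sum, count) => sum / count }
    avgByKey.collect().foreach(println)  // e.g. (2013-10-10, 18.79)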

Re: Spark partitioning question

2015-04-28 Thread Silvio Fiorito
So the other issue could be due to the fact that, using mapPartitions after the partitionBy, you essentially lose the partitioning of the keys, since Spark assumes the keys were altered in the map phase. So really the partitionBy gets lost after the mapPartitions; that’s why you need to do it again.

Re: Scalability of group by

2015-04-28 Thread Richard Marscher
Hi, I can offer a few ideas to investigate in regards to your issue here. I've run into resource issues doing shuffle operations with a much smaller dataset than 2B. The data is going to be saved to disk by the BlockManager as part of the shuffle and then redistributed across the cluster as releva

Single stream with series of transformations

2015-04-28 Thread jc.francisco
Hi, I'm following the pattern of filtering data by a certain criteria, and then saving the results to a different table. The code below illustrates the idea. The simple integration test I wrote suggests it works, simply asserting filtered data should be in their respective tables after being filter

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Todd Nist
Can you simply apply the https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.util.StatCounter to this? You should be able to do something like this: val stats = RDD.map(x => x._2).stats() -Todd On Tue, Apr 28, 2015 at 10:00 AM, subscripti...@prismalytics.io < subscripti...
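A quick sketch of that suggestion (assuming an existing SparkContext sc); note that stats() here runs over all values with the keys dropped, so for per-key statistics you would still need something like the aggregateByKey approach above:

    // Pair RDD shaped like the original question's data.
    val rdd = sc.parallelize(Seq(("2013-10-09", 7.60), ("2013-10-10", 9.32)))
    val stats = rdd.map(_._2).stats()  // org.apache.spark.util.StatCounter
    println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev}")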

Re: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-28 Thread ๏̯͡๏
Worked now. On Mon, Apr 27, 2015 at 10:20 PM, Sean Owen wrote: > Works fine for me. Make sure you're not downloading the HTML > redirector page and thinking it's the archive. > > On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) > wrote: > > I downloaded 1.3.1 hadoop 2.4 prebuilt package (tar)

Re: Serialization error

2015-04-28 Thread ๏̯͡๏
The arguments are its values. The name of the argument is important, and all you need to do is specify those when you're creating the SparkConf object. Glad it worked. On Tue, Apr 28, 2015 at 5:20 PM, madhvi wrote: > Thankyou Deepak.It worked. > Madhvi > On Tuesday 28 April 2015 01:39 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
ah, just noticed that you are using an external package, you can add that like this conf = (SparkConf().set("spark.jars", jar_path)) or if it is a python package: sc.addPyFile() -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone

Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
Hi Mark, That does not look like a Python path issue; the spark-assembly jar should have those packaged, and should make them available to the workers. Have you built the jar yourself? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalon

Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread subscripti...@prismalytics.io
Hello Friends: I generated a Pair RDD with K/V pairs, like so: >>> >>> rdd1.take(10) # Show a small sample. [(u'2013-10-09', 7.60117302052786), (u'2013-10-10', 9.322709163346612), (u'2013-10-10', 28.264462809917358), (u'2013-10-07', 9.664429530201343), (u'2013-10-07', 12.461538461538463),

Re: Spark partitioning question

2015-04-28 Thread Marius Danciu
Thank you Silvio, I am aware of the groupBy limitations and this is subject to replacement. I did try repartitionAndSortWithinPartitions but then I end up with maybe too much shuffling, one from groupByKey and the other from repartition. My expectation was that since N records are partitioned to the

Re: Spark partitioning question

2015-04-28 Thread Silvio Fiorito
Hi Marius, What’s the expected output? I would recommend avoiding the groupByKey if possible since it’s going to force all records for each key to go to an executor which may overload it. Also if you need to sort and repartition, try using repartitionAndSortWithinPartitions to do it in one sho
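A rough sketch of that suggestion (assuming an existing SparkContext sc; the partitioner and data are placeholders), showing the single shuffle that both routes records by the partitioner and sorts them by key within each partition:

    import org.apache.spark.HashPartitioner

    // Stand-in for the pair RDD produced earlier in the pipeline.
    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (1, "x")))

    // One shuffle: partition by the given partitioner and sort by key inside each partition.
    val partitionedAndSorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

    // Downstream per-partition logic can now rely on key order within each partition.
    partitionedAndSorted.mapPartitionsWithIndex { (idx, iter) =>
      iter.map { case (k, v) => s"partition=$idx key=$k value=$v" }
    }.collect().foreach(println)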

Re: 1.3.1: Persisting RDD in parquet - "Conflicting partition column names"

2015-04-28 Thread ayan guha
Can you show your code please? On 28 Apr 2015 13:20, "sranga" wrote: > Hi > > I am getting the following error when persisting an RDD in parquet format > to > an S3 location. This is code that was working in the 1.2 version. The > version that it is failing to work is 1.3.1. > Any help is appreci

Re: How to add jars to standalone pyspark program

2015-04-28 Thread ayan guha
It's a Windows thing. Please escape the front slash in the string. Basically it is not able to find the file. On 28 Apr 2015 22:09, "Fabian Böhnlein" wrote: > Can you specifiy 'running via PyCharm'. how are you executing the script, > with spark-submit? > > In PySpark I guess you used --jars databricks-csv

Re: submitting to multiple masters

2015-04-28 Thread James King
Indeed, many thanks Michal for the help. On Tue, Apr 28, 2015 at 2:20 PM, michal.klo...@gmail.com < michal.klo...@gmail.com> wrote: > According to the docs it should go like this: > spark://host1:port1,host2:port2 > > > > https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-

Re: submitting to multiple masters

2015-04-28 Thread michal.klo...@gmail.com
According to the docs it should go like this: spark://host1:port1,host2:port2 https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper Thanks M > On Apr 28, 2015, at 8:13 AM, James King wrote: > > I have multiple masters running and I'm trying to submit an ap
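For completeness, the same comma-separated list (one spark:// scheme covering both hosts) can also be set programmatically; a sketch with hypothetical hostnames:

    import org.apache.spark.{SparkConf, SparkContext}

    // Both masters under a single spark:// prefix, comma-separated.
    val conf = new SparkConf()
      .setAppName("ha-example")
      .setMaster("spark://master01:7077,master02:7077")
    val sc = new SparkContext(conf)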

submitting to multiple masters

2015-04-28 Thread James King
I have multiple masters running and I'm trying to submit an application using spark-1.3.0-bin-hadoop2.4/bin/spark-submit with this config (i.e. a comma separated list of master urls) --master spark://master01:7077,spark://master02:7077 But getting this exception Exceptio

Spark partitioning question

2015-04-28 Thread Marius Danciu
Hello all, I have the following Spark (pseudo)code: rdd = mapPartitionsWithIndex(...) .mapPartitionsToPair(...) .groupByKey() .sortByKey(comparator) .partitionBy(myPartitioner) .mapPartitionsWithIndex(...) .mapPartitionsToPair( *f* ) The input data

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Dean Wampler
It's probably not your code. What's the full command line you use to submit the job? Are you sure the job on the cluster has access to the network interface? Can you test the receiver by itself without Spark? For example, does this line work as expected: List nifs = Pcaps.findAllDevs(); dean D

Re: How to add jars to standalone pyspark program

2015-04-28 Thread Fabian Böhnlein
Can you specify 'running via PyCharm'? How are you executing the script, with spark-submit? In PySpark I guess you used --jars databricks-csv.jar. With spark-submit you might need the additional --driver-class-path databricks-csv.jar. Both parameters cannot be set via the SparkConf object.

Re: Serialization error

2015-04-28 Thread madhvi
Thankyou Deepak.It worked. Madhvi On Tuesday 28 April 2015 01:39 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: val conf = new SparkConf() .setAppName(detail) .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize

Re: Understanding Spark's caching

2015-04-28 Thread ayan guha
Hi, I replied to you on SO. If option A had an action call then it should suffice too. On 28 Apr 2015 05:30, "Eran Medan" wrote: > Hi Everyone! > > I'm trying to understand how Spark's cache work. > > Here is my naive understanding, please let me know if I'm missing > something: > > val rdd1 = sc.text

Re: spark-defaults.conf

2015-04-28 Thread James King
So no takers regarding why spark-defaults.conf is not being picked up. Here is another one: If Zookeeper is configured in Spark why do we need to start a slave like this: spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh 1 spark://somemaster:7077 i.e. why do we need to specify the master url explic

Re: StandardScaler failing with OOM errors in PySpark

2015-04-28 Thread Rok Roskar
That's exactly what I'm saying -- I specify the memory options using spark options, but this is not reflected in how the JVM is created. No matter which memory settings I specify, the JVM for the driver is always made with 512Mb of memory. So I'm not sure if this is a feature or a bug? rok On Mon

Re: Understanding Spark's caching

2015-04-28 Thread Akhil Das
Option B would be fine; as the SO answer itself says, since RDD transformations merely build DAG descriptions without execution, in Option A, by the time you call unpersist, you still only have job descriptions and not a running execution. Also note, in Option A, you are not specifying any a
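A tiny made-up sketch of the point being made (assuming an existing SparkContext sc): if unpersist() runs before any action, nothing was ever materialized, so the cache never helped.

    val rdd = sc.textFile("hdfs:///data/input.txt").map(_.length).cache()

    // "Option A" shape: no action has run yet, so nothing is computed or cached;
    // calling rdd.unpersist() here would just drop an empty cache entry.

    // "Option B" shape: an action forces execution and populates the cache,
    // and only then is unpersisting meaningful.
    val total = rdd.reduce(_ + _)   // first action: computes and caches
    val count = rdd.count()         // second action: served from the cache
    rdd.unpersist()
    println(s"total=$total count=$count")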

Re: Serialization error

2015-04-28 Thread madhvi
On Tuesday 28 April 2015 01:39 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: val conf = new SparkConf() .setAppName(detail) .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get) .set("spark.kryose

Re: java.lang.UnsupportedOperationException: empty collection

2015-04-28 Thread Robineast
I've tried running your code through spark-shell on both 1.3.0 (pre-built for Hadoop 2.4 and above) and a recently built snapshot of master. Both work fine. Running on OS X yosemite. What's your configuration? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: How to debug Spark on Yarn?

2015-04-28 Thread Steve Loughran
On 27 Apr 2015, at 07:51, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote: Spark 1.3 1. View stderr/stdout from executor from Web UI: when the job is running i figured out the executor that am suppose to see, and those two links show 4 special characters on browser. 2. Tail on Yarn logs:

[Spark SQL] Problems creating a table in specified schema/database

2015-04-28 Thread James Aley
Hey all, I'm trying to create tables from existing Parquet data in different schemata. The following isn't working for me: CREATE DATABASE foo; CREATE TABLE foo.bar USING com.databricks.spark.avro OPTIONS (path '...'); -- Error: org.apache.spark.sql.AnalysisException: cannot recognize input nea

Spark Sql: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-04-28 Thread LinQili
Hi all. I was launching a Spark SQL job on my own machine, not on the spark cluster machines, and it failed. The exception info is: 15/04/28 16:28:04 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.RuntimeException: Unable to insta

Re: Serialization error

2015-04-28 Thread ๏̯͡๏
val conf = new SparkConf() .setAppName(detail) .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize" ).get) .set("spark.kryoserializer.buffer.max.mb", arguments.get( "maxbuffersize").g
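For reference, a self-contained version of that configuration (the buffer sizes are just example values; the *.mb keys are the Spark 1.x property names used in this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "24")        // initial serialization buffer
      .set("spark.kryoserializer.buffer.max.mb", "512")   // upper bound for large objects
    val sc = new SparkContext(conf)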

How to add jars to standalone pyspark program

2015-04-28 Thread mj
Hi, I'm trying to figure out how to use a third party jar inside a python program which I'm running via PyCharm in order to debug it. I am normally able to run spark code in python such as this: spark_conf = SparkConf().setMaster('local').setAppName('test') sc = SparkContext(conf=spark_co

Re: New JIRA - [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-28 Thread ayan guha
The alias function is not in Python yet. I suggest writing SQL if your data suits it. On 28 Apr 2015 14:42, "Don Drake" wrote: > https://issues.apache.org/jira/browse/SPARK-7182 > > Can anyone suggest a workaround for the above issue? > > Thanks. > > -Don > > -- > Donald Drake > Drake Consulting > http: