Spark Lost executor && shuffle.FetchFailedException

2015-09-21 Thread biyan900116
Hi All: When I write data to a Hive dynamic partition table, many errors and warnings like the following appear... Is the reason that the shuffle output is so large? = 15/09/21 14:53:09 ERROR cluster.YarnClusterScheduler: Lost executor 402 on dn03.datanode.com: remote Rpc client disassoc

Hbase Spark streaming issue.

2015-09-21 Thread Siva
Hi, I'm seeing a strange error while inserting data from Spark Streaming into HBase. I am able to write data from Spark (without streaming) to HBase successfully, but when I use the same code to write a DStream I see the error below. I tried setting the parameters below, but it still didn't hel

Deploying spark-streaming application on production

2015-09-21 Thread Jeetendra Gangele
Hi All, I have a Spark Streaming application with a batch interval of 10 ms which reads the MQTT channel and dumps the data from MQTT to HDFS. So suppose I have to deploy a new application jar (with changes to the Spark Streaming application), what is the best way to deploy it? Currently I am doing as below

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-21 Thread Petr Novak
> > It might be connected with my problems with gracefulShutdown in Spark > 1.5.0 2.11 > https://mail.google.com/mail/#search/petr/14fb6bd5166f9395 > > Maybe Ctrl+C corrupts checkpoints and breaks gracefulShutdown? > The provided link is obviously wrong. I haven't found it in the Spark mailing lists archi

Re: What's the best practice to parse JSON using spark

2015-09-21 Thread Petr Novak
Internally Spark is using json4s and the Jackson parser v3.2.10, AFAIK. So if you are using Scala they should be available without adding dependencies. There is v3.2.11 already available, but adding it to my app was causing a NoSuchMethod exception, so I would have to shade it. I'm simply staying on v3.2.10 f
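
For reference, a minimal sketch of parsing with the json4s/Jackson combination Petr mentions, assuming Scala and a flat record; the Event case class and its fields are made up for illustration:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    object Json4sSketch {
      // DefaultFormats is required implicitly by extract[...]
      implicit val formats: Formats = DefaultFormats

      case class Event(id: String, value: Double)

      def main(args: Array[String]): Unit = {
        val raw = """{"id": "a1", "value": 3.14}"""
        // parse builds a JValue AST; extract maps it onto the case class
        val event = parse(raw).extract[Event]
        println(event)
      }
    }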

Re: What's the best practice to parse JSON using spark

2015-09-21 Thread Petr Novak
Surprisingly I had the same issue when including json4s dependency at the same version v3.2.10. I had to remove json4s deps from my code. I'm using Scala 2.11, there might be some issue with mixing 2.10/2.11 and it could be just my environment. I haven't investigated much as depending on Spark prov

Re: Deploying spark-streaming application on production

2015-09-21 Thread Petr Novak
I think you would have to persist events somehow if you don't want to miss them. I don't see any other option there. Either in MQTT if it is supported there or routing them through Kafka. There is a WriteAheadLog in Spark but you would have to decouple MQTT stream reading and processing into 2 separate

Re: Deploying spark-streaming application on production

2015-09-21 Thread Petr Novak
I should read my posts at least once to avoid so many typos. Hopefully you are brave enough to read through. Petr On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak wrote: > I think you would have to persist events somehow if you don't want to miss > them. I don't see any other option there. Either i

passing SparkContext as parameter

2015-09-21 Thread Priya Ch
Hello All, How can I pass a SparkContext as a parameter to a method in an object? Passing the SparkContext is giving me a TaskNotSerializable exception. How can I achieve this? Thanks, Padma Ch

Re: Deploying spark-streaming application on production

2015-09-21 Thread Petr Novak
In short there is no direct support for it in Spark AFAIK. You will either manage it in MQTT or have to add another layer of indirection - either in-memory based (observable streams, in-mem db) or disk based (Kafka, hdfs files, db) which will keep your unprocessed events. Now realizing, there is su

Re: passing SparkContext as parameter

2015-09-21 Thread Petr Novak
add @transient? On Mon, Sep 21, 2015 at 11:36 AM, Petr Novak wrote: > add @transient? > > On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch > wrote: > >> Hello All, >> >> How can i pass sparkContext as a parameter to a method in an object. >> Because passing sparkContext is giving me TaskNotSerial

mongo-hadoop with Spark is slow for me, and adding nodes doesn't seem to make any noticeable difference

2015-09-21 Thread cscarioni
Hi, I appreciate any help or pointers in the right direction. My current test scenario is the following: I want to process a MongoDB collection, anonymising some fields on it, and store it in another collection. The size of the collection is around 900 GB with 2.5 million documents. Following is th

Re: Spark + Druid

2015-09-21 Thread Petr Novak
Great work. On Fri, Sep 18, 2015 at 6:51 PM, Harish Butani wrote: > Hi, > > I have just posted a Blog on this: > https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani > > regards, > Harish Butani. > > On Tue, Sep 1, 2015 at 11:46 PM, Paolo Platter > wr

spark with internal ip

2015-09-21 Thread ZhuGe
Hi there: We recently added one NIC to each node of the cluster (standalone) for larger bandwidth, and we modified the /etc/hosts file so the hostname points to the new NIC's IP address (internal). What we want to achieve is that communication between nodes goes through the new NIC. It seems th

Re: Spark does not yet support its JDBC component for Scala 2.11.

2015-09-21 Thread Petr Novak
Nice, thanks. So the note in build instruction for 2.11 is obsolete? Or there are still some limitations? http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 On Fri, Sep 11, 2015 at 2:19 PM, Petr Novak wrote: > Nice, thanks. > > So the note in build instruction for 2

Re: KafkaDirectStream can't be recovered from checkpoint

2015-09-21 Thread Petr Novak
We have tried on another cluster installation with the same effect. Petr On Mon, Sep 21, 2015 at 10:45 AM, Petr Novak wrote: > It might be connected with my problems with gracefulShutdown in Spark >> 1.5.0 2.11 >> https://mail.google.com/mail/#search/petr/14fb6bd5166f9395 >> >> Maybe Ctrl+C cor

Re: passing SparkContext as parameter

2015-09-21 Thread Priya Ch
Can I use this sparkContext on executors? In my application, I have a scenario of reading from a DB for certain records in an RDD. Hence I need the sparkContext to read from the DB (Cassandra in our case). If the sparkContext can't be sent to executors, what is the workaround for this? On Mon, Sep 21, 2

How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Dear , I have spent many days thinking about this issue, however, without any success... I would appreciate your kind help. There is an RDD rdd1; I would like to get a new RDD rdd2 where each row rdd2[ i ] = rdd1[ i ] - rdd1[ i - 1 ]. What kind of API or function would I use... Thanks very much! J

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Romi Kuntsman
RDD is a set of data rows (in your case numbers), there is no meaning for the order of the items. What exactly are you trying to accomplish? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21, 2015 at 2:29 PM, Zhiliang Zhu wrote: > Dear , > > I have took lots of days to

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
sparkContext is available on the driver, not on executors. To read from Cassandra, you can use something like this: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21, 2015 at 2:27 PM, Priya
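
A minimal sketch of the distributed Cassandra read Romi points to, assuming the DataStax spark-cassandra-connector is on the classpath; the host, keyspace and table names are placeholders:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object CassandraReadSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("cassandra-read")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder
        val sc = new SparkContext(conf)

        // cassandraTable comes from the connector's implicits on SparkContext;
        // the scan itself is executed by the workers, not the driver
        val rows = sc.cassandraTable("my_keyspace", "my_table")
        println(rows.count())

        sc.stop()
      }
    }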

Re: passing SparkContext as parameter

2015-09-21 Thread Ted Yu
You can use broadcast variable for passing connection information. Cheers > On Sep 21, 2015, at 4:27 AM, Priya Ch wrote: > > can i use this sparkContext on executors ?? > In my application, i have scenario of reading from db for certain records in > rdd. Hence I need sparkContext to read from

Re: DataGenerator for streaming application

2015-09-21 Thread Hemant Bhanawat
Why are you using rawSocketStream to read the data? I believe rawSocketStream waits for a big chunk of data before it can start processing it. I think what you are writing is a String and you should use socketTextStream which reads the data on a per line basis. On Sun, Sep 20, 2015 at 9:56 AM, Sa
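
A minimal sketch of Hemant's suggestion, assuming the generator writes newline-terminated strings to a plain TCP socket; host and port are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SocketTextSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("socket-text").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))

        // socketTextStream yields one record per text line as it arrives,
        // whereas rawSocketStream expects serialized blocks
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }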

Re: passing SparkContext as parameter

2015-09-21 Thread Priya Ch
Yes, but I need to read from the Cassandra DB within a Spark transformation... something like: dstream.foreachRDD { rdd => rdd.foreach { message => sc.cassandraTable() . . . } } Since rdd.foreach gets executed on workers, how can I make the sparkContext available on the workers?

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
foreach is something that runs on the driver, not the workers. if you want to perform some function on each record from cassandra, you need to do cassandraRdd.map(func), which will run distributed on the spark workers *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21, 20

Re: Spark does not yet support its JDBC component for Scala 2.11.

2015-09-21 Thread Ted Yu
I think the document should be updated to reflect the integration of SPARK-8013 Cheers On Mon, Sep 21, 2015 at 3:48 AM, Petr Novak wrote: > Nice, thanks. > > So the note in build instruction for 2.11 is obsolete? Or there are still > some limit

spark + parquet + schema name and metadata

2015-09-21 Thread Borisa Zivkovic
Hi, I am trying to figure out how to write parquet metadata when persisting DataFrames to parquet using Spark (1.4.1). I could not find a way to change the schema name (which seems to be hardcoded to root) nor how to add data to the key/value metadata in the parquet footer. org.apache.parquet.hadoop.met

Re: passing SparkContext as parameter

2015-09-21 Thread Cody Koeninger
That isn't accurate, I think you're confused about foreach. Look at http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Mon, Sep 21, 2015 at 7:36 AM, Romi Kuntsman wrote: > foreach is something that runs on the driver, not the workers.

Why are executors on slave never used?

2015-09-21 Thread Joshua Fox
I have a default AWS EMR cluster (1 master, 1 slave) with Spark. When I run a Spark process, it works fine -- but only on the master, as if it were standalone. The web-UI and logging code shows only 1 executor, the localhost. How can I diagnose this? (I create *SparkConf, *in Python, with *setMa

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Hi Romi, Thanks very much for your kind help and comments~~ In fact there is some valid background for the application; it is about R data analysis. #fund_nav_daily is an M x N (or M x 1) matrix or data.frame, each column is a daily fund return, each row is the daily date. #fund_return_daily needs to count the e

Count for select not matching count for group by

2015-09-21 Thread Michael Kelly
Hi, I'm seeing some strange behaviour with spark 1.5, I have a dataframe that I have built from loading and joining some hive tables stored in s3. The dataframe is cached in memory, using df.cache. What I'm seeing is that the counts I get when I do a group by on a column are different from what

sqlContext.read.avro broadcasting files from the driver

2015-09-21 Thread Daniel Haviv
Hi, I'm loading a 1000 files using the spark-avro package: val df = sqlContext.read.avro(*"/incoming/"*) When I'm performing an action on this df it seems like for each file a broadcast is being created and is sent to the workers (instead of the workers reading their data-local files): scala> df.

AWS_CREDENTIAL_FILE

2015-09-21 Thread Michel Lemay
Hi, It looks like spark does read AWS credentials from environment variable AWS_CREDENTIAL_FILE like awscli does. Mike

Python Packages in Spark w/Mesos

2015-09-21 Thread John Omernik
Hey all - Curious about the best way to include Python packages (such as NLTK) in my Spark installation. Basically I am running on Mesos, and would like to find a way to include the package in the binary distribution, since I don't want to install packages on all nodes. We should be able to includ

Re: Count for select not matching count for group by

2015-09-21 Thread Richard Hillegas
For what it's worth, I get the expected result that "filter" behaves like "group by" when I run the same experiment against a DataFrame which was loaded from a relational store: import org.apache.spark.sql._ import org.apache.spark.sql.types._ val df = sqlContext.read.format("jdbc").options( Ma
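
The snippet above is cut off; a self-contained sketch of the same kind of JDBC load might look like this, where the URL, table and driver are placeholders and sqlContext is assumed to exist as in spark-shell:

    // Scala, Spark 1.5-style DataFrame JDBC source
    val df = sqlContext.read
      .format("jdbc")
      .options(Map(
        "url"     -> "jdbc:postgresql://dbhost:5432/mydb",   // placeholder URL
        "dbtable" -> "public.events",                         // placeholder table
        "driver"  -> "org.postgresql.Driver"))
      .load()

    df.groupBy("some_column").count().show()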

Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-21 Thread Ellen Kraffmiller
Thank you for the link! I was using http://apache-spark-user-list.1001560.n3.nabble.com/, and I didn't see replies there. Regarding your code example, I'm doing the same thing and successfully creating the rdd, but the problem is that when I call a clustering algorithm like amap::hcluster(), I get

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Sujit Pal
Hi Zhiliang, Would something like this work? val rdd2 = rdd1.sliding(2).map(v => v(1) - v(0)) -sujit On Mon, Sep 21, 2015 at 7:58 AM, Zhiliang Zhu wrote: > Hi Romi, > > Thanks very much for your kind help comment~~ > > In fact there is some valid backgroud of the application, it is about R >
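
Sujit's one-liner relies on MLlib's RDDFunctions; a minimal self-contained sketch, with made-up input numbers:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.rdd.RDDFunctions._

    object SlidingDiffSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("sliding-diff").setMaster("local[2]"))

        val rdd1 = sc.parallelize(Seq(1.0, 4.0, 9.0, 16.0))
        // sliding(2) yields Array(previous, current) windows;
        // subtracting gives the difference of adjacent rows
        val rdd2 = rdd1.sliding(2).map(w => w(1) - w(0))
        rdd2.collect().foreach(println)   // 3.0, 5.0, 7.0

        sc.stop()
      }
    }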

Re: Why are executors on slave never used?

2015-09-21 Thread Hemant Bhanawat
When you specify master as local[2], it starts the spark components in a single jvm. You need to specify the master correctly. I have a default AWS EMR cluster (1 master, 1 slave) with Spark. When I run a Spark process, it works fine -- but only on the master, as if it were standalone. The web-UI

Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers, I just ran some very simple operations on a dataset. I was surprised by the execution plan of take(1), head() or first(). For your reference, this is what I did in pyspark 1.5: df=sqlContext.read.parquet("someparquetfiles") df.head() The above lines take over 15 minutes. I wa

Exception initializing JavaSparkContext

2015-09-21 Thread ekraffmiller
Hi, I’m trying to run a simple test program to access Spark though Java. I’m using JDK 1.8, and Spark 1.5. I’m getting an Exception from the JavaSparkContext constructor. My initialization code matches all the sample code I’ve found online, so not sure what I’m doing wrong. Here is my code: Sp

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Hi Sujit, I really appreciate your kind help~ It seems to be OK; however, do you know the corresponding Spark Java API equivalent... Is there any Java API like Scala's sliding? It seems that I cannot find the Spark Scala doc about sliding ... Thank you very much~ Zhiliang On Monday,

Re: Python Packages in Spark w/Mesos

2015-09-21 Thread Tim Chen
Hi John, Sorry haven't get time to respond to your questions over the weekend. If you're running client mode, to use the Docker/Mesos integration minimally you just need to set the image configuration 'spark.mesos.executor.docker.image' as stated in the documentation, which Spark will use this im

Re: Docker/Mesos with Spark

2015-09-21 Thread Tim Chen
Hi John, There is no other blog post yet, I'm thinking to do a series of posts but so far haven't get time to do that yet. Running Spark in docker containers makes distributing spark versions easy, it's simple to upgrade and automatically caches on the slaves so the same image just runs right awa

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Hi Sujit, Thanks very much for your kind help. I have found the sliding doc in both the Scala and Java Spark APIs; it is from MLlib's RDDFunctions, though in the doc there are never enough examples. Best Regards, Zhiliang On Monday, September 21, 2015 11:48 PM, Sujit Pal wrote: Hi Zhiliang

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry, Looks like it is a Python-specific issue. Can you create a JIRA? Thanks, Yin On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > Hi Spark Developers, > > I just ran some very simple operations on a dataset. I was surprise by the > execution plan of take(1), head() or first(). > > Fo

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 has the same problem? On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > Hi Jerry, > > Looks like it is a Python-specific issue. Can you create a JIRA? > > Thanks, > > Yin > > On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > >> Hi Spark Developers, >> >> I just ran some very s

Re: spark + parquet + schema name and metadata

2015-09-21 Thread Cheng Lian
Currently Spark SQL doesn't support customizing schema name and metadata. May I know why these two matters in your use case? Some Parquet data models, like parquet-avro, do support it, while some others don't (e.g. parquet-hive). Cheng On 9/21/15 7:13 AM, Borisa Zivkovic wrote: Hi, I am try

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Sujit Pal
Hi Zhiliang, Haven't used the Java API but found this Javadoc page, may be helpful to you. https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/mllib/rdd/RDDFunctions.html I think the equivalent Java code snippet might go something like this: RDDFunctions.fromRDD(rdd1, ClassTag$.apply(

how to get RDD from two different RDDs with cross column

2015-09-21 Thread Zhiliang Zhu
Dear Romi, Priya, Sujit, Shivaram and all, I have spent many days thinking about this issue, however, without any good enough solution... I would appreciate your kind help. There is an RDD rdd1, and another RDD rdd2 (rdd2 can be a PairRDD, or a DataFrame with two columns as ). StringDate colum

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue. On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > btw, does 1.4 has the same problem? > > On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > >> Hi Jerry, >> >> Looks like it is a Python-specific issue. Can you create a JIRA? >> >> Thanks, >> >> Yin >> >> On Mon,

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Yin, You are right! I just tried the scala version with the above lines, it works as expected. I'm not sure if it happens also in 1.4 for pyspark but I thought the pyspark code just calls the scala code via py4j. I didn't expect that this bug is pyspark specific. That surprises me actually a bi

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in the ticket. On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote: > Hi Yin, > > You are right! I just tried the scala version with the above lines, it > works as expected. > I'm not sure if it happens also in 1.4 for pyspark bu

Fwd: Issue with high no of skipped task

2015-09-21 Thread Saurav Sinha
-- Forwarded message -- From: "Saurav Sinha" Date: 21-Sep-2015 11:48 am Subject: Issue with high no of skipped task To: Cc: Hi Users, I am new to Spark and I have written a flow. When we deployed our code it was completing jobs in 4-5 min. But now it is taking 20+ min to complete with a

Re: Exception initializing JavaSparkContext

2015-09-21 Thread Marcelo Vanzin
What Spark package are you using? In particular, which hadoop version? On Mon, Sep 21, 2015 at 9:14 AM, ekraffmiller wrote: > Hi, > I’m trying to run a simple test program to access Spark though Java. I’m > using JDK 1.8, and Spark 1.5. I’m getting an Exception from the > JavaSparkContext const

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Hi Romi, I must show my sincere appreciation for your kind and helpful reply. One more question: currently I am using Spark for financial data analysis, so lots of operations on R data.frame/matrix and stat/regression are always called. However, SparkR currently is not that strong, most o

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is df.rdd does not work very well with limit. In scala, df.limit(1).rdd will also trigger the issue you observed. I will add this in the jira. On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam wrote: > I just noticed you found 1.4 has the same issue. I added that as well in > th

Re: DataGenerator for streaming application

2015-09-21 Thread Saiph Kappa
Thanks a lot. Now it's working fine. I wasn't aware of "socketTextStream", not sure if it was documented in the spark programming guide. On Mon, Sep 21, 2015 at 12:46 PM, Hemant Bhanawat wrote: > Why are you using rawSocketStream to read the data? I believe > rawSocketStream waits for a big chu

JDBCRdd issue

2015-09-21 Thread Saurabh Malviya (samalviy)
Hi, While using a reference within JdbcRDD, it is throwing a serialization exception. Does JdbcRDD not accept references from other parts of the code? confMap= ConfFactory.getConf(ParquetStreaming) val jdbcRDD = new JdbcRDD(sc, () => { Class.forName("org.

Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Ashish Soni
Hi All, Just wanted to find out if there are any benefits to installing Kafka brokers and Spark nodes on the same machine? Is it possible for Spark to pull data from Kafka locally, i.e. when the broker or partition is on the same machine? Thanks, Ashish

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-21 Thread Alan Braithwaite
That could be the behavior but spark.mesos.executor.home being unset still raises an exception inside the dispatcher preventing a docker from even being started. I can see if other properties are inherited from the default environment when that's set, if you'd like. I think the main problem is ju

Re: Exception initializing JavaSparkContext

2015-09-21 Thread Ellen Kraffmiller
I am including the Spark core dependency in my maven pom.xml: org.apache.spark spark-core_2.10 1.5.0 This is bringing these hadoop versions: hadoop-annotations-2.2.0 hadoop-auth-2.2.0 hadoop-client-2.2.0 hadoop-common-2.2.0 hadoop-core-0.20.204.0 hadoop-hdfs-2.2.0 followed by mapreduce and yarn

Serialization Error with PartialFunction / immutable sets

2015-09-21 Thread Chaney Courtney
Hi, I’m receiving a task not serializable exception using Spark GraphX (Scala 2.11.6 / JDK 1.8 / Spark 1.5) My vertex data is of type (VertexId, immutable Set), My edge data is of type PartialFunction[ISet[E], ISet[E]] where each ED has a precomputed function. My vertex program: val v

Re: Deploying spark-streaming application on production

2015-09-21 Thread Adrian Tanase
I'm wondering, isn't this the canonical use case for WAL + reliable receiver? As far as I know you can tune Mqtt server to wait for ack on messages (qos level 2?). With some support from the client libray you could achieve exactly once semantics on the read side, if you ack message only after wr

Re: Exception initializing JavaSparkContext

2015-09-21 Thread Ted Yu
bq. hadoop-core-0.20.204.0 How come the above got into play - it was from hadoop-1 On Mon, Sep 21, 2015 at 11:34 AM, Ellen Kraffmiller < ellen.kraffmil...@gmail.com> wrote: > I am including the Spark core dependency in my maven pom.xml: > > > org.apache.spark > spark-core_2.10 > 1.5.0 > > > Th

Re: Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Cody Koeninger
The direct stream already uses the kafka leader for a given partition as the preferred location. I don't run kafka on the same nodes as spark, and I don't know anyone who does, so that situation isn't particularly well tested. On Mon, Sep 21, 2015 at 1:15 PM, Ashish Soni wrote: > Hi All , > > J

Re: Using Spark for portfolio manager app

2015-09-21 Thread Adrian Tanase
1. reading from kafka has exactly once guarantees - we are using it in production today (with the direct receiver) * you will probably have 2 topics, loading both into spark and joining / unioning as needed is not an issue * tons of optimizations you can do there, assuming every

Re: Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Adrian Tanase
We do - using Spark streaming, Kafka, HDFS all collocated on the same nodes. Works great so far. Spark picks up the location information and reads data from the partitions hosted by the local broker, showing up as NODE_LOCAL in the UI. You also need to look at the locality options in the confi

Re: What's the best practice to parse JSON using spark

2015-09-21 Thread Adrian Tanase
I've been using spray-json for general JSON ser/deser in scala (spark app), mostly for config files and data exchange. Haven't used it in conjunction with jobs that process large JSON data sources, so can't speak for those use cases. -adrian _

Re: Exception initializing JavaSparkContext

2015-09-21 Thread Ellen Kraffmiller
I found the problem - the pom.xml I was using also contained an old dependency on a mahout library, which was including the old hadoop-core. Removing that fixed the problem. Thank you! On Mon, Sep 21, 2015 at 2:54 PM, Ted Yu wrote: > bq. hadoop-core-0.20.204.0 > > How come the above got into pl

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
Cody, that's a great reference! As shown there - the best way to connect to an external database from the workers is to create a connection pool on (each) worker. The driver may pass, via broadcast, the connection string, but not the connection object itself and not the spark context. On Mon, Sep 21
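
A minimal sketch of that pattern on the write side; ConnectionPool here is a hypothetical, lazily initialised singleton that lives on each executor, and only the connection string travels from the driver:

    import org.apache.spark.streaming.dstream.DStream

    // ConnectionPool is a hypothetical helper: one pool per executor JVM.
    def saveStream(stream: DStream[String], connectionString: String): Unit = {
      val connStr = stream.context.sparkContext.broadcast(connectionString)
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // obtained once per partition, reused for all records in it
          val conn = ConnectionPool.get(connStr.value)
          records.foreach(r => conn.insert(r))
          // the connection goes back to the pool rather than being closed per record
        }
      }
    }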

Re: how to get RDD from two different RDDs with cross column

2015-09-21 Thread Romi Kuntsman
Hi, If I understand correctly: rdd1 contains keys (of type StringDate) rdd2 contains keys and values and rdd3 contains all the keys, and the values from rdd2? I think you should make rdd1 and rdd2 PairRDD, and then use outer join. Does that make sense? On Mon, Sep 21, 2015 at 8:37 PM Zhiliang Zhu
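
A minimal sketch of the PairRDD outer join Romi suggests, assuming rdd1 holds StringDate keys and rdd2 holds (StringDate, value) pairs; all data here is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object KeyJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("key-join").setMaster("local[2]"))

        val rdd1 = sc.parallelize(Seq("2015-09-18", "2015-09-21", "2015-09-22"))
        val rdd2 = sc.parallelize(Seq(("2015-09-18", 1.1), ("2015-09-21", 2.2)))

        // keep every rdd1 key and attach the rdd2 value when one exists
        val rdd3 = rdd1.map(k => (k, ()))
          .leftOuterJoin(rdd2)
          .mapValues { case (_, v) => v.getOrElse(0.0) }

        rdd3.collect().foreach(println)
        sc.stop()
      }
    }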

Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list Hi Dirceu, The answer to whether throwing an exception is better or null is better depends on your use case. If you are debugging and want to find bugs with your program, you might prefer throwing an exception. However, if you are running on a large real-world dataset (i.e. data is dirt

HiveQL Compatibility (0.12.0, 0.13.0???)

2015-09-21 Thread Dominic Ricard
Hi, here's a statement from the Spark 1.5.0 Spark SQL and DataFrame Guide : *Compatibility with Apache Hive* Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Curre

Slow Performance with Apache Spark Gradient Boosted Tree training runs

2015-09-21 Thread vkutsenko
I'm experimenting with Gradient Boosted Trees learning algorithm from ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of th

Re: Why are executors on slave never used?

2015-09-21 Thread Andrew Or
Hi Joshua, What cluster manager are you using, standalone or YARN? (Note that standalone here does not mean local mode). If standalone, you need to do `setMaster("spark://[CLUSTER_URL]:7077")`, where CLUSTER_URL is the machine that started the standalone Master. If YARN, you need to do `setMaster
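
For reference, a minimal sketch of the two setMaster variants Andrew describes; the host name is a placeholder:

    import org.apache.spark.SparkConf

    // Standalone cluster: point at the machine running the standalone Master
    val standaloneConf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://master-host:7077")   // placeholder host

    // YARN: let the resource manager allocate the executors
    val yarnConf = new SparkConf()
      .setAppName("my-app")
      .setMaster("yarn-client")   // or "yarn-cluster" in Spark 1.x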

Re: Spark data type guesser UDAF

2015-09-21 Thread Ruslan Dautkhanov
Does it deserve to be a JIRA in Spark / Spark MLLib? How do you guys normally determine data types? Frameworks like h2o automatically determine data type scanning a sample of data, or whole dataset. So then one can decide e.g. if a variable should be a categorical variable or numerical. Another u

Mesos Tasks only run on one node

2015-09-21 Thread John Omernik
I have a happy healthy Mesos cluster (0.24) running in my lab. I've compiled spark-1.5.0 and it seems to be working fine, except for one small issue, my tasks all seem to run on one node. (I have 6 in the cluster). Basically, I have directory of compressed text files. Compressed, these 25 files

Spark Streaming distributed job

2015-09-21 Thread nibiau
Hello, Please could you explain to me what exactly is distributed when I launch a Spark Streaming job over a YARN cluster? My code is something like: JavaDStream customReceiverStream = ssc.receiverStream(streamConfig.getJmsReceiver()); JavaDStream incoming_msg = customReceiverStream.map(

Iterator-based streaming, how is it efficient ?

2015-09-21 Thread Samuel Hailu
Hi, In Spark's in-memory logic, without cache, elements are accessed in an iterator-based streaming style [ http://www.slideshare.net/liancheng/dtcc-14-spark-runtime-internals?next_slideshow=1 ] I have two questions: 1. if elements are read one line at a time from HDFS (disk) and then tr

Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Balaji Vijayan
Howdy, I'm a relative novice at Spark/Scala and I'm puzzled by some behavior that I'm seeing in 2 of my local Spark/Scala environments (Scala for Jupyter and Scala IDE) but not the 3rd (Spark Shell). The following code throws the following stack trace error in the former 2 environments but execute

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-21 Thread tridib
Did you get any solution to this? I am getting same issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24759.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: HiveQL Compatibility (0.12.0, 0.13.0???)

2015-09-21 Thread Michael Armbrust
In general we welcome pull requests for these kind of updates. In this case its already been fixed in master and branch-1.5 and will be updated when we release 1.5.1 (hopefully soon). On Mon, Sep 21, 2015 at 1:21 PM, Dominic Ricard < dominic.ric...@tritondigital.com> wrote: > Hi, >here's a s

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Ted Yu
Which release are you using? From the line number in ClosureCleaner, it seems you're using 1.4.x Cheers On Mon, Sep 21, 2015 at 4:07 PM, Balaji Vijayan wrote: > Howdy, > > I'm a relative novice at Spark/Scala and I'm puzzled by some behavior that > I'm seeing in 2 of my local Spark/Scala env

Re: how to get RDD from two different RDDs with cross column

2015-09-21 Thread Zhiliang Zhu
Hi Romi, Yes, you understand it correctly. And rdd1 keys cross with rdd2 keys, that is, there are lots of the same keys between rdd1 and rdd2, there are some keys in rdd1 but not in rdd2, and there are also some keys in rdd2 but not in rdd1. Then rdd3 keys would be the same as rdd1 keys, rdd3 will no

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
I noticed that some executors have issue with scratch space. I see the following in yarn app container stderr around the time when yarn killed the executor because it uses too much memory. -- App container stderr -- 15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache rdd_6_346 in

spark.mesos.coarse impacts memory performance on mesos

2015-09-21 Thread Utkarsh Sengar
I am running Spark 1.4.1 on mesos. The spark job does a "cartesian" of 4 RDDs (aRdd, bRdd, cRdd, dRdd) of size 100, 100, 7 and 1 respectively. Lets call it prouctRDD. Creation of "aRdd" needs data pull from multiple data sources, merging it and creating a tuple of JavaRdd, finally aRDD looks some

Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
I am processing a single file and want to remove duplicate rows by some key by always choosing the first row in the file for that key. The best solution I could come up with is to zip each row with the partition index and local index, like this: rdd.mapPartitionsWithIndex { case (partitionIndex,
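
A minimal sketch of the approach Philip describes, tagging each row with (partitionIndex, localIndex) and keeping the lowest tag per key; the data is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object FirstPerKeySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("first-per-key").setMaster("local[2]"))

        // (key, value) rows in file order
        val rows = sc.parallelize(Seq(("a", "row1"), ("b", "row2"), ("a", "row3")), 2)

        // tag every row so "first in file" is well defined
        val tagged = rows.mapPartitionsWithIndex { case (pi, it) =>
          it.zipWithIndex.map { case ((k, v), li) => (k, ((pi, li), v)) }
        }

        // keep the value with the smallest (partitionIndex, localIndex) per key
        val firstPerKey = tagged
          .reduceByKey((a, b) => if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b)
          .mapValues(_._2)

        firstPerKey.collect().foreach(println)
        sc.stop()
      }
    }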

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Igor Berman
Try to broadcasr header On Sep 22, 2015 08:07, "Balaji Vijayan" wrote: > Howdy, > > I'm a relative novice at Spark/Scala and I'm puzzled by some behavior that > I'm seeing in 2 of my local Spark/Scala environments (Scala for Jupyter and > Scala IDE) but not the 3rd (Spark Shell). The following co

Re: word count (group by users) in spark

2015-09-21 Thread Zhang, Jingyu
Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory. However, it flushes out the data to disk one key at a time - so if a single key has more key-value pairs than can fit in memory, an out of memory exception occurs. Cheers, Jingyu On

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Saisai Shao
I think you need to increase the memory size of executor through command arguments "--executor-memory", or configuration "spark.executor.memory". Also yarn.scheduler.maximum-allocation-mb in Yarn side if necessary. Thanks Saisai On Mon, Sep 21, 2015 at 5:13 PM, Alexander Pivovarov wrote: > I

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Sandy Ryza
The warning you're seeing in Spark is no issue. The scratch space lives inside the heap, so it'll never result in YARN killing the container by itself. The issue is that Spark is using some off-heap space on top of that. You'll need to bump the spark.yarn.executor.memoryOverhead property to give t
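
These settings are usually passed on the spark-submit command line (--executor-memory and --conf); a SparkConf sketch just to show the key names, with purely illustrative values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only -- tune to your workload and cluster.
    val conf = new SparkConf()
      .setAppName("yarn-memory-example")
      .set("spark.executor.memory", "8g")                 // executor heap
      .set("spark.yarn.executor.memoryOverhead", "1024")  // off-heap headroom in MB (YARN only)

    val sc = new SparkContext(conf)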

How does one use s3 for checkpointing?

2015-09-21 Thread Amit Ramesh
A lot of places in the documentation mention using s3 for checkpointing, however I haven't found any examples or concrete evidence of anyone having done this. 1. Is this a safe/reliable option given the read-after-write consistency for PUTS in s3? 2. Is s3 access broken for hadoop 2.6 (SP

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Alexis Gillain
As Igor said, the header must be available on each partition, so the solution is broadcasting it. About the difference between the REPL and Scala IDE, it may come from the SparkContext setup, as the REPL defines one by default. 2015-09-22 8:41 GMT+08:00 Igor Berman : > Try to broadcast header > On Sep 22, 2015

Re: How does one use s3 for checkpointing?

2015-09-21 Thread Jerry Lam
Hi Amit, Have you looked at Amazon EMR? Most people using EMR use s3 for persistency (both as input and output of spark jobs). Best Regards, Jerry Sent from my iPhone > On 21 Sep, 2015, at 9:24 pm, Amit Ramesh wrote: > > > A lot of places in the documentation mention using s3 for checkpoi

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
I repartitioned input RDD from 4,800 to 24,000 partitions After that the stage (24000 tasks) was done in 22 min on 100 boxes Shuffle read/write: 905 GB / 710 GB Task Metrics (Dur/GC/Read/Write) Min: 7s/1s/38MB/30MB Med: 22s/9s/38MB/30MB Max:1.8min/1.6min/38MB/30MB On Mon, Sep 21, 2015 at 5:55 PM,

Re: How does one use s3 for checkpointing?

2015-09-21 Thread Utkarsh Sengar
We are using "spark-1.4.1-bin-hadoop2.4" on mesos (not EMR) with s3 to read and write data and haven't noticed any inconsistencies with it, so 1 (mostly) and 2 definitely should not be a problem. Regarding 3, are you setting the file system impl in spark config? sparkContext.hadoopConfiguration().

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
I think foldByKey is much more what you want, as it has more a notion of building up some result per key by encountering values serially. You would take the first and ignore the rest. Note that "first" depends on your RDD having an ordering to begin with, or else you rely on however it happens to b

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
Hmm, I don't think that's what I want. There's no "zero value" in my use case. On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen wrote: > I think foldByKey is much more what you want, as it has more a notion > of building up some result per key by encountering values serially. > You would take the firs

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
The zero value here is None. Combining None with any row should yield Some(row). After that, combining is a no-op for other rows. On Tue, Sep 22, 2015 at 4:27 AM, Philip Weaver wrote: > Hmm, I don't think that's what I want. There's no "zero value" in my use > case. > > On Mon, Sep 21, 2015 at 8:
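
A minimal sketch of what Sean describes, lifting the values into Option so that None can serve as the zero value; the data is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object FoldFirstSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("fold-first").setMaster("local[2]"))

        val rows = sc.parallelize(Seq(("a", "row1"), ("b", "row2"), ("a", "row3")))

        // zero is None; None combined with a row yields Some(row),
        // and orElse then ignores every later row for that key
        val firstPerKey = rows
          .mapValues(v => Option(v))
          .foldByKey(None: Option[String])((acc, v) => acc.orElse(v))
          .mapValues(_.get)

        firstPerKey.collect().foreach(println)
        sc.stop()
      }
    }

As Sean notes further down the thread, whether this really preserves file order depends on the RDD's ordering and on how combineByKey merges partitions, so treat it as a sketch rather than a guarantee.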

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
Hmm, ok, but I'm not seeing why foldByKey is more appropriate than reduceByKey? Specifically, is foldByKey guaranteed to walk the RDD in order, but reduceByKey is not? On Mon, Sep 21, 2015 at 8:41 PM, Sean Owen wrote: > The zero value here is None. Combining None with any row should yield > Some

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
Yes, that's right, though "in order" depends on the RDD having an ordering, but so does the zip-based solution. Actually, I'm going to walk that back a bit, since I don't see a guarantee that foldByKey behaves like foldLeft. The implementation underneath, in combineByKey, appears that it will act
