spark vs flink low memory available

2015-08-11 Thread parö
hi community, i have built a spark and flink k-means application. my test case is clustering 1 million points on a 3-node cluster. when memory becomes a bottleneck, flink begins to spill to disk and works slowly, but it works. however spark loses executors if the memory is full and starts again (infinite loo

Re: ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver

2015-08-11 Thread Sadaf Khan
okay. Then do you have any idea how to avoid this error? Thanks On Tue, Aug 11, 2015 at 12:08 AM, Tathagata Das wrote: > I think this may be expected. When the streaming context is stopped > without the SparkContext, then the receivers are stopped by the driver. The > receiver sends back the me

Re: 答复: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Yu
HBase will not have query engine. It will provide better support to query engines. Cheers > On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc wrote: > > Ted, > > I’m in China now, and seem to experience difficulty to access Apache Jira. > Anyways, it appears to me that HBASE-14181 attempts to

Re:spark vs flink low memory available

2015-08-11 Thread jun
your detail of log file? At 2015-08-10 22:02:16, "Pa Rö" wrote: hi community, i have build a spark and flink k-means application. my test case is a clustering on 1 million points on 3node cluster. in memory bottlenecks begins flink to outsource to disk and work slowly but works. however spark lose

Re: spark vs flink low memory available

2015-08-11 Thread Ted Yu
Pa: Can you try 1.5.0 SNAPSHOT ? See SPARK-7075 Project Tungsten (Spark 1.5 Phase 1) Cheers On Tue, Aug 11, 2015 at 12:49 AM, jun wrote: > your detail of log file? > > At 2015-08-10 22:02:16, "Pa Rö" wrote: > > hi community, > > i have build a spark and flink k-means application. > my test ca

Re: Differents of loading data

2015-08-11 Thread Akhil Das
Load data to where? To spark? If you are referring to spark, then there are some differences in the way the connector is implemented. When you use spark, the most important thing that you get is the parallelism (depending on the number of partitions). If you compare it with a native java driver then y

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-08-11 Thread Rick Moritz
Consider the spark.cores.max configuration option -- it should do what you require. On Tue, Aug 11, 2015 at 8:26 AM, Haripriya Ayyalasomayajula < aharipriy...@gmail.com> wrote: > Hello all, > > As a quick follow up for this, I have been using Spark on Yarn till now > and am currently exploring Me
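For reference, a minimal sketch of setting that option programmatically (the app name and core count are purely illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("capped-cores-example")   // hypothetical application name
      .set("spark.cores.max", "16")         // upper bound on total cores this app may acquire
    val sc = new SparkContext(conf)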

Re: spark vs flink low memory available

2015-08-11 Thread Pa Rö
my first post is here and a log too: http://mail-archives.us.apache.org/mod_mbox/spark-user/201508.mbox/%3ccah2_pykqhfr4tbvpbt2tdhgm+zrkcbzfnk7uedkjpdhe472...@mail.gmail.com%3E i use cloudera live, i think i can not use spark 1.5. i will try to run it again and post the current logfile here. 201

Re: SparkR -Graphx Connected components

2015-08-11 Thread Robineast
To be part of a strongly connected component every vertex must be reachable from every other vertex. Vertex 6 is not reachable from the other components of scc 0. Same goes for 7. So both 6 and 7 form their own strongly connected components. 6 and 7 are part of the connected components of 0 and 3 r

Fwd: How to minimize shuffling on Spark dataframe Join?

2015-08-11 Thread Abdullah Anwar
I have two dataframes like this: student_rdf = (studentid, name, ...) student_result_rdf = (studentid, gpa, ...) we need to join these two dataframes. we are now doing it like this, student_rdf.join(student_result_rdf, student_result_rdf["studentid"] == student_rdf["studentid"]) So it is simple.

Re: Spark with GCS Connector - Rate limit error

2015-08-11 Thread Akhil Das
There's a daily quota and a minutely quota, you could be hitting those. You can ask google to increase the quota for the particular service. Now, to reduce the limit from the spark side, you can actually do a re-partition to a smaller number before doing the save. Another way to use the local file

Re: Spark Streaming dealing with broken files without dying

2015-08-11 Thread Akhil Das
You can do something like this: val fStream = ssc.textFileStream("/sigmoid/data/") .map(x => { try{ //Move all the transformations within a try..catch }catch{ case e: Exception => { logError("Whoops!! "); null } } }) Thanks Best Regards On Mon, Aug 10, 2015 at 7:44 PM, Mario Pastorelli < mari
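A slightly fuller sketch of the same pattern, assuming an existing StreamingContext ssc and a hypothetical parseRecord transformation; records that throw are mapped to null and filtered out afterwards:

    val fStream = ssc.textFileStream("/sigmoid/data/")
      .map { line =>
        try {
          parseRecord(line)                       // hypothetical per-record transformation
        } catch {
          case e: Exception =>
            System.err.println("Whoops!! bad record: " + e.getMessage)
            null
        }
      }
      .filter(_ != null)                          // drop the records that failed to parse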

Python3 Spark execution problems

2015-08-11 Thread Javier Domingo Cansino
Hi, I have been trying to use spark for the processing I need to do in some logs, and I have found several difficulties during the process. Most of them I could overcome, but I am really stuck on the last one. I would really like to know how spark is supposed to be deployed. For now, I have

Re: Java Streaming Context - File Stream use

2015-08-11 Thread Akhil Das
Like this: (Including the filter function) JavaPairInputDStream inputStream = ssc.fileStream( testDir.toString(), LongWritable.class, Text.class, TextInputFormat.class, new Function() { @Override public Boolean call(Path v1) throws Exception {

Re: Inquery about contributing codes

2015-08-11 Thread Akhil Das
You can create a new Issue and send a pull request for the same i think. + dev list Thanks Best Regards On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon wrote: > Dear Sir / Madam, > > I have a plan to contribute some codes about passing filters to a > datasource as physical planning. > > In more

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-08-11 Thread rene.pfitzner
Hi – I'd like to follow up on this, as I am running into very similar issues (with a much bigger data set, though – 10^5 nodes, 10^7 edges). So let me repost the question: Any ideas on how to estimate graphx memory requirements? Cheers! From: Roman Sokolov [mailto:ole...@gmail.com] Sent: S

dse spark-submit multiple jars issue

2015-08-11 Thread satish chandra j
HI, Please let me know if i am missing anything in the command below Command: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/dse.jar ///home/missingmerch/spark-cassandra-connector-java_2.1

Re: dse spark-submit multiple jars issue

2015-08-11 Thread Javier Domingo Cansino
I have no real idea (not java user), but have you tried with the --jars option? http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management AFAIK, you are currently submitting the jar names as arguments to the called Class instead of the jars themselves [image

How to specify column type when saving DataFrame as parquet file?

2015-08-11 Thread Jyun-Fan Tsai
Hi all, I'm using Spark 1.4.1. I create a DataFrame from a json file. There is a column C whose values are all null in the json file. I found that the datatype of column C in the created DataFrame is string. However, I would like to specify the column as Long when saving it as a parquet file. What
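A minimal sketch of one way to do this with the DataFrame API (the column and path names are hypothetical): cast the all-null column to LongType before writing.

    import org.apache.spark.sql.types.LongType

    val df = sqlContext.read.json("/path/to/input.json")                     // hypothetical path
    val fixed = df.select(df("A"), df("B"), df("C").cast(LongType).as("C"))  // A, B are the other (hypothetical) columns
    fixed.write.parquet("/path/to/output.parquet")                           // column C is now stored as a 64-bit integer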

Do you have any other method to get cpu elapsed time of a spark application

2015-08-11 Thread JoneZhang
Is there more information about the spark eventlog? For example, why did the SparkListenerExecutorRemoved event not appear in the eventlog while I use dynamic executors? I want to calculate the cpu elapsed time of an application based on the eventlog. By the way, Do you have any other method to get cpu elapsed time

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Jerry Lam
My experience with Mesos + Spark is not great. I saw one executor with 30 CPU and the other executor with 6. So I don't think you can easily configure it without some tweaking at the source code. Sent from my iPad On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula wrote: > Hi Tim, > > Spark

mllib on (key, Iterable[Vector])

2015-08-11 Thread Fabian Böhnlein
Hi everyone, I am trying to use mllib.clustering.GaussianMixture, but am blocked by the fact that the API only accepts RDD[Vector]. Broadly speaking I need to run the clustering on an RDD[(key, Iterable[Vector])], e.g. (fabricated): val WebsiteUserAgeRDD : RDD[url, userAgeVector] val ageClusterB

Re: dse spark-submit multiple jars issue

2015-08-11 Thread satish chandra j
HI, I have used the --jars option as well, please find the command below Command: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/dse.jar ///home/missingmerch/spark-cassandra-connector-java_2.1

Re: dse spark-submit multiple jars issue

2015-08-11 Thread Javier Domingo Cansino
use --verbose, it might give you some insights on what's happening. [image: Fon] Javier Domingo Cansino, Research & Development Engineer, +34 946545847, Skype: javier.domingo.fon. All information in this email is confidential On Tue, A

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Haripriya Ayyalasomayajula
Spark evolved as an example framework for Mesos - that's how I know it. It is surprising to see that the options provided by mesos in this case are fewer. Tweaking the source code - I haven't done it yet, but I would love to see what options could be there! On Tue, Aug 11, 2015 at 5:42 AM, Jerry Lam wr

PySpark order-only window function issue

2015-08-11 Thread Maciej Szymkiewicz
Hello everyone, I am trying to use PySpark API with window functions without specifying partition clause. I mean something equivalent to this SELECT v, row_number() OVER (ORDER BY v) AS rn FROM df in SQL. I am not sure if I am doing something wrong or it is a bug but results are far from what I

Re: dse spark-submit multiple jars issue

2015-08-11 Thread satish chandra j
HI, Please find the log details below: dse spark-submit --verbose --master local --class HelloWorld etl-0.0.1-SNAPSHOT.jar --jars file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar file:/home/missingmerch/dse.jar file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar Using properties file:

Re: Questions about SparkSQL join on not equality conditions

2015-08-11 Thread gen tang
Hi, After taking a look at the code, I found out the problem: spark will use broadcastNestedLoopJoin to treat non-equality conditions. And one of my dataframes (df1) is created from an existing RDD (logicalRDD), so it uses defaultSizeInBytes * length to estimate the size. The other dataframe (df2) th

Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-11 Thread Akhil Das
Hi, My Spark job (running in local[*] with spark 1.4.1) reads data from a thrift server (I created an RDD which computes the partitions in the getPartitions() call, and in compute() its hasNext returns records from these partitions). count() and foreach() are working fine and return the correct number of reco

RE: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-11 Thread Benjamin Ross
Jerry, I was able to use window functions without the hive thrift server. HiveContext does not imply that you need the hive thrift server running. Here’s what I used to test this out: var conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1") val sc = new SparkC
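For reference, a minimal sketch along the same lines (HiveContext only, no thrift server), using lag over a window; the table and column names are hypothetical:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    val sqlContext = new HiveContext(sc)                 // window functions need HiveContext in 1.4
    val df = sqlContext.table("events")                  // hypothetical table with columns id, ts, value
    val w = Window.partitionBy("id").orderBy("ts")
    df.select(df("id"), df("ts"),
              lag(df("value"), 1).over(w).as("prev_value")).show()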

RE: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-11 Thread Benjamin Ross
I forgot to mention, my setup was: - Spark 1.4.1 running in standalone mode - Datastax spark cassandra connector 1.4.0-M1 - Cassandra DB - Scala version 2.10.4 From: Benjamin Ross Sent: Tuesday, August 11, 2015 10:16 AM To: Jerry; Michael Armbrust Cc: user

Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
Hi all, I don't have any hadoop fs installed on my environment, but I would like to store dataframes in parquet files. I am failing to do so, if possible, anyone have any pointers? Thank you, Saif

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Dean Wampler
It should work fine. I have an example script here: https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala (Spark 1.4.X) What does "I am failing to do so" mean? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
I am launching my spark-shell spark-1.4.1-bin-hadoop2.6/bin/spark-shell 15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF scala> data.write.parquet("/var/data/Saif/pq")

Unsupported major.minor version 51.0

2015-08-11 Thread Yakubovich, Alexey
I found some discussions online, but it all comes down to advice to use JDK 1.7 (or 1.8). Well, I use JDK 1.7 on OS X Yosemite. Both java –version // java version "1.7.0_80" Java(TM) SE Runtime Environment (build 1.7.0_80-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode) and ech

Re: Unsupported major.minor version 51.0

2015-08-11 Thread Ted Yu
What does the following command say ? mvn -version Maybe you are using an old maven ? Cheers On Tue, Aug 11, 2015 at 7:55 AM, Yakubovich, Alexey < alexey.yakubov...@searshc.com> wrote: > I found some discussions online, but it all cpome to advice to use JDF 1.7 > (or 1.8). > Well, I use JDK 1.7

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
Sorry, I provided bad information. This example worked fine with reduced parallelism. It seems my problem has to do with something specific to the real data frame at reading point. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Tuesday, August 11, 2015

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
I confirm that it works, I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 Saif From: Ellafi, Saif A. Sent: Tuesday, August 11, 2015 12:01 PM To: Ellafi, Saif A.; deanwamp...@gmail.com Cc: user@spark.apache.org Subject: RE: Parquet without hadoop: Possible? Sorry, I

Re: Spark Cassandra Connector issue

2015-08-11 Thread satish chandra j
HI, Can we apply the saveToCassandra method to a JdbcRDD? Code: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.rdd.JdbcRDD import com.datastax.spark.connector._ import com.dat

Spark DataFrames uses too many partition

2015-08-11 Thread Al M
I am using DataFrames with Spark 1.4.1. I really like DataFrames but the partitioning makes no sense to me. I am loading lots of very small files and joining them together. Every file is loaded by Spark with just one partition. Each time I join two small files the partition count increases to 2

Unsupported major.minor version 51.0

2015-08-11 Thread alexeyy3
I found some discussions online, but it all comes down to advice to use JDK 1.7 (or 1.8). Well, I use JDK 1.7 on OS X Yosemite. Both java –version // java version "1.7.0_80" Java(TM) SE Runtime Environment (build 1.7.0_80-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode) and echo $

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Jerry Lam
Just out of curiosity, what is the advantage of using parquet without hadoop? Sent from my iPhone > On 11 Aug, 2015, at 11:12 am, wrote: > > I confirm that it works, > > I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 > > Saif > > From: Ellafi, Saif A. > S

Re: stopping spark stream app

2015-08-11 Thread Shushant Arora
Does stopping the streaming context in the onBatchCompleted event of StreamingListener not kill the app? I have the below code in a streaming listener public void onBatchCompleted(StreamingListenerBatchCompleted arg0) { //check stop condition System.out.println("stopping gracefully"); jssc.stop(false

Re: avoid duplicate due to executor failure in spark stream

2015-08-11 Thread Shushant Arora
What if processing is neither idempotent nor in a transaction, say I am posting events to some external server after processing. Is it possible to get the accumulator of a failed task in the retry task? Is there any way to detect whether this task is a retried task or the original task? I was trying to achie

Re: Spark DataFrames uses too many partition

2015-08-11 Thread Silvio Fiorito
You need to configure the spark.sql.shuffle.partitions parameter to a different value. It defaults to 200. On 8/11/15, 11:31 AM, "Al M" wrote: >I am using DataFrames with Spark 1.4.1. I really like DataFrames but the >partitioning makes no sense to me. > >I am loading lots of very small fil
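A minimal sketch of changing it at runtime (the value 8 is just an illustration):

    sqlContext.setConf("spark.sql.shuffle.partitions", "8")
    // joins/aggregations on DataFrames after this point will produce 8 partitions;
    // it can also be set at submit time with --conf spark.sql.shuffle.partitions=8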

unsubscribe

2015-08-11 Thread Michel Robert
Michel Robert Almaden Research Center EDA - IBM Systems and Technology Group Phone: (408) 927-2117 T/L 8-457-2117 E-mail: m...@us.ibm.com

Re: Accessing S3 files with s3n://

2015-08-11 Thread Steve Loughran
On 10 Aug 2015, at 20:17, Akshat Aranya <aara...@gmail.com> wrote: Hi Jerry, Akhil, Thanks for your help. With s3n, the entire file is downloaded even while just creating the RDD with sqlContext.read.parquet(). It seems like even just opening and closing the InputStream causes the en

Application failed error

2015-08-11 Thread Anubhav Agarwal
I am running Spark 1.3 on CDH 5.4 stack. I am getting the following error when I spark-submit my application:- 15/08/11 16:03:49 INFO Remoting: Starting remoting 15/08/11 16:03:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkdri...@cdh54-22a4101a-14d7-4f06-b3f8-079c6f

Re: unsubscribe

2015-08-11 Thread Ted Yu
See first section of http://spark.apache.org/community.html On Tue, Aug 11, 2015 at 9:47 AM, Michel Robert wrote: > Michel Robert > Almaden Research Center > EDA - IBM Systems and Technology Group > Phone: (408) 927-2117 T/L 8-457-2117 > E-mail: m...@us.ibm.com >

Re: Spark job workflow engine recommendations

2015-08-11 Thread Hien Luu
We are in the middle of figuring that out. At the high level, we want to combine the best parts of existing workflow solutions. On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone wrote: > Hien, > Is Azkaban being phased out at linkedin as rumored? If so, what's linkedin > going to use for workflow sch

Re: Spark job workflow engine recommendations

2015-08-11 Thread Ruslan Dautkhanov
We use Talend, but not for Spark workflows, although it does have Spark components. https://www.talend.com/download/talend-open-studio It is free (commercial support available) and easy to design and deploy workflows with. Talend for BigData 6.0 was released a month ago. Is anybody using Talend for Spa

Does print/event logging affect performance?

2015-08-11 Thread Saif.A.Ellafi
Hi all, silly question. Does logging info messages, both print or to file, or event logging, cause any impact to general performance of spark? Saif

Re: Does print/event logging affect performance?

2015-08-11 Thread Ted Yu
What level of logging are you looking at ? At INFO level, there shouldn't be noticeable difference. On Tue, Aug 11, 2015 at 12:24 PM, wrote: > Hi all, > > silly question. Does logging info messages, both print or to file, or > event logging, cause any impact to general performance of spark? > >

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-11 Thread Philip Weaver
Do you think it might be faster to put all the files in one directory but still partitioned the same way? I don't actually need to filter on the values of the partition keys, but I need to rely on there being no overlap in the values of the keys between any two parquet files. On Fri, Aug 7, 2015 at 8:

Re: Unsupported major.minor version 51.0

2015-08-11 Thread Ritesh Kumar Singh
Can you please mention the output for the following : java -version javac -version

Spark Job Hangs on our production cluster

2015-08-11 Thread java8964
Currently we have an IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42 data/task nodes, which runs BigInsight V3.0.0.2, corresponding to Hadoop 2.2.0 with MR1. Since IBM BigInsight doesn't come with Spark, we built Spark 1.2.2 with Hadoop 2.2.0 + Hive 0.12 ourselves, and dep

Re: Spark Job Hangs on our production cluster

2015-08-11 Thread Igor Berman
how do you want to process 1T of data when you set your executor memory to 2g? Look at the spark ui, metrics of tasks... if any. Look at the spark logs on the executor machine under the work dir (unless you configured log4j). I think your executors are thrashing or spilling to disk. Check memory metrics/swapping. O

ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
I am seeing following error. I think it's not able to find some other associated classes as I see "$2" in the exception, but not sure what I am missing. 15/08/11 16:00:15 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 50, ip-10-241-251-141.us-west-2.compute.internal): java.lang.Cl

Re: mllib on (key, Iterable[Vector])

2015-08-11 Thread Feynman Liang
You could try flatMapping i.e. if you have data : RDD[(key, Iterable[Vector])] then data.flatMap(_._2) : RDD[Vector], which can be GMMed. If you want to first partition by url, I would first create multiple RDDs using `filter`, then running GMM on each of the filtered rdds. On Tue, Aug 11, 2015
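A rough sketch of the flatMap approach, with a hypothetical toy input of one Iterable[Vector] per url (k and the values are illustrative):

    import org.apache.spark.mllib.clustering.GaussianMixture
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      ("urlA", Iterable(Vectors.dense(1.0, 2.0), Vectors.dense(1.1, 2.1))),
      ("urlB", Iterable(Vectors.dense(5.0, 6.0), Vectors.dense(5.2, 6.1)))))

    val vectors = data.flatMap(_._2)                      // RDD[Vector], keys dropped
    val model = new GaussianMixture().setK(2).run(vectors)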

RE: Spark Job Hangs on our production cluster

2015-08-11 Thread java8964
The executor's memory is reset by "--executor-memory 24G" for spark-shell. The one from the spark-env.sh is just for default setting. I can confirm from the Spark UI the executor heap is set as 24G. Thanks Yong From: igor.ber...@gmail.com Date: Tue, 11 Aug 2015 23:31:59 +0300 Subject: Re: Spark Jo

Boosting spark.yarn.executor.memoryOverhead

2015-08-11 Thread Eric Bless
Previously I was getting a failure which included the message Container killed by YARN for exceeding memory limits. 2.1 GB of 2 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. So I attempted the following - spark-submit --jars examples.jar latest_msmtdt_by

Sporadic "Input validation failed" error when executing LogisticRegressionWithLBFGS.train

2015-08-11 Thread Francis Lau
Has anyone seen this issue? I am calling the LogisticRegressionWithLBFGS.train API and about 7 out of 10 times, I get an "Input validation failed" error. The exact same code and dataset works sometimes but fails at other times. It is odd. I can't seem to find any info on this. Below is the pyspar

Re: ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
I see the following line in the log "15/08/11 17:59:12 ERROR spark.SparkContext: Jar not found at file:/home/ec2-user/./spark-streaming-test-0.0.1-SNAPSHOT-jar-with-dependencies.jar", however I do see that this file exists on all the node in that path. Not sure what's happening here. Please note I

Re: Spark job workflow engine recommendations

2015-08-11 Thread Vikram Kone
Hi Lars, Thanks for the brain dump. All the points you made about target audience, degree of high availability and time based scheduling instead of event based scheduling are all valid and make sense. In our case, most of our Devs are .net based and so xml or web based scheduling is preferred over

grouping by a partitioned key

2015-08-11 Thread Philip Weaver
If I have an RDD that happens to already be partitioned by a key, how efficient can I expect a groupBy operation to be? I would expect that Spark shouldn't have to move data around between nodes, and simply will have a small amount of work just checking the partitions to discover that it doesn't ne

adding a custom Scala RDD for use in PySpark

2015-08-11 Thread Eric Walker
Hi, I'm new to Scala, Spark and PySpark and have a question about what approach to take in the problem I'm trying to solve. I have noticed that working with HBase tables read in using `newAPIHadoopRDD` can be quite slow with large data sets when one is interested in only a small subset of the key

Re: grouping by a partitioned key

2015-08-11 Thread Eugene Morozov
Philip, If all data per key are inside just one partition, then Spark will figure that out. Can you guarantee that’s the case? What is it you're trying to achieve? There might be another way to do it, where you might be 100% sure what’s happening. You can print debugString or explain (for DataFrame) to
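For reference, both inspection calls mentioned above are one-liners (rdd and df stand for whatever you already have):

    println(rdd.toDebugString)   // lineage of the RDD, including partition counts
    df.explain(true)             // logical and physical plans of a DataFrame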

Re: 答复: 答复: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Yu
Yan: Where can I find performance numbers for Astro (it's close to middle of August) ? Cheers On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc wrote: > Finally I can take a look at HBASE-14181 now. Unfortunately there is no > design doc mentioned. Superficially it is very similar to Astro with a >

Re: Sporadic "Input validation failed" error when executing LogisticRegressionWithLBFGS.train

2015-08-11 Thread ai he
Hi Francis, From my observation when using spark sql, dataframe.limit(n) does not necessarily return the same result each time when running Apps. To be more precise, in one App, the result should be same for the same n, however, changing n might not result in the same prefix (the result for n = 1

Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
How does partitioning in spark work when it comes to streaming? What's the best way to partition a time series data grouped by a certain tag like categories of product video, music etc.

Re: ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
After changing the '--deploy_mode client' the program seems to work however it looks like there is a bug in spark when using --deploy_mode as 'yarn'. Should I open a bug? On Tue, Aug 11, 2015 at 3:02 PM, Mohit Anchlia wrote: > I see the following line in the log "15/08/11 17:59:12 ERROR > spark

Re: Partitioning in spark streaming

2015-08-11 Thread ayan guha
partitioning - by itself - is a property of the RDD, so essentially it is no different in the case of streaming, where each batch is one RDD. You can use partitionBy on the RDD and pass your custom partitioner function to it; see the sketch below. One thing you should consider is how balanced your partitions are, i.e. your partitio
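A rough sketch of that idea for a keyed stream (the key type, event type and partition count are all hypothetical):

    import org.apache.spark.Partitioner

    class CategoryPartitioner(parts: Int) extends Partitioner {
      override def numPartitions: Int = parts
      override def getPartition(key: Any): Int =
        (key.hashCode % parts + parts) % parts            // non-negative modulo of e.g. "video", "music", ...
    }

    // stream: DStream[(String, String)] of (category, event) pairs -- hypothetical
    val repartitioned = stream.transform(rdd => rdd.partitionBy(new CategoryPartitioner(8)))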

Re: grouping by a partitioned key

2015-08-11 Thread Philip Weaver
Thanks. In my particular case, I am calculating a distinct count on a key that is unique to each partition, so I want to calculate the distinct count within each partition, and then sum those. This approach will avoid moving the sets of that key around between nodes, which would be very expensive.
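A rough sketch of that approach, assuming an already-partitioned rdd: RDD[(K, V)] where every key lives in exactly one partition:

    // count distinct keys inside each partition, then add the per-partition counts;
    // no shuffle is needed because a key never spans two partitions
    val distinctKeyCount = rdd
      .mapPartitions(iter => Iterator(iter.map(_._1).toSet.size.toLong))
      .reduce(_ + _)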

Spark 1.4.0 Docker Slave GPU Access

2015-08-11 Thread Nastooh Avessta (navesta)
Hi Trying to access GPU from a Spark 1.4.0 Docker slave, without much luck. In my Spark program, I make a system call to a script, which performs various calculations using GPU. I am able to run this script as standalone, or via Mesos Marathon; however, calling the script through Spark fails due

Re: Boosting spark.yarn.executor.memoryOverhead

2015-08-11 Thread Sandy Ryza
Hi Eric, This is likely because you are putting the parameter after the primary resource (latest_msmtdt_by_gridid_and_source.py), which makes it a parameter to your application instead of a parameter to Spark. -Sandy On Wed, Aug 12, 2015 at 4:40 AM, Eric Bless wrote: > Previously I was getting
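In other words, everything after the primary resource is passed to the application itself; a sketch of the corrected ordering (the jar name and overhead value are illustrative):

    spark-submit --jars examples.jar \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      latest_msmtdt_by_gridid_and_source.py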

Re: Job is Failing automatically

2015-08-11 Thread Jeff Zhang
> 15/08/11 12:59:34 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, sdldalplhdw02.suddenlink.cequel3.com): java.lang.NullPointerException at com.suddenlink.pnm.process.HBaseStoreHelper.flush(HBaseStoreHelper.java:313) It's your app error. NPE from HBaseStoreHelper

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
I am also trying to understand how files are named when writing to hadoop? For e.g.: how does the "saveAs" method ensure that each executor is generating unique files? On Tue, Aug 11, 2015 at 4:21 PM, ayan guha wrote: > partitioning - by itself - is a property of RDD. so essentially it is no > differ

What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread swetha
Hi, What is the optimal approach to do Secondary sort in Spark? I have to first Sort by an Id in the key and further sort it by timeStamp which is present in the value. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-optimal-appr

RE: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-11 Thread Cheng, Hao
Definitely worth a try. And you can sort the records before writing out, and then you will get the parquet files without overlapping keys. Let us know if that helps. Hao From: Philip Weaver [mailto:philip.wea...@gmail.com] Sent: Wednesday, August 12, 2015 4:05 AM To: Cheng Lian Cc: user Subject:
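A minimal sketch of the suggested write path (the column and output path are hypothetical): a global sort range-partitions the data, so each resulting parquet part file covers a disjoint key range.

    df.sort("key").write.parquet("/path/to/output")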

Re: What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread Kevin Jung
You should create the key as a tuple type. In your case, RDD[((id, timeStamp), value)] is the proper way to do it. Kevin --- Original Message --- Sender : swetha Date : 2015-08-12 09:37 (GMT+09:00) Title : What is the optimal approach to do Secondary Sort in Spark? Hi, What is the optimal appr
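A rough sketch of that approach using repartitionAndSortWithinPartitions, so rows are grouped by id but ordered by (id, timeStamp) inside each partition (the types and partition count are illustrative):

    import org.apache.spark.Partitioner

    // input: RDD[(String, (Long, String))] of (id, (timeStamp, value)) -- hypothetical
    val keyed = input.map { case (id, (ts, value)) => ((id, ts), value) }

    class IdPartitioner(parts: Int) extends Partitioner {
      override def numPartitions: Int = parts
      override def getPartition(key: Any): Int = key match {
        case (id: String, _) => (id.hashCode % parts + parts) % parts   // partition by id only
      }
    }

    // within each partition the data arrives sorted by the full (id, timeStamp) key
    val sorted = keyed.repartitionAndSortWithinPartitions(new IdPartitioner(8))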

RE: Spark DataFrames uses too many partition

2015-08-11 Thread Cheng, Hao
That's a good question; we don't support reading small files in a single partition yet, but it's definitely an issue we need to optimize. Do you mind creating a jira issue for this? Hopefully we can merge that in the 1.6 release. 200 is the default partition number for parallel tasks after the data

Re: Spark Job Hangs on our production cluster

2015-08-11 Thread Jeff Zhang
Logs would be helpful to diagnose. Could you attach the logs ? On Wed, Aug 12, 2015 at 5:19 AM, java8964 wrote: > The executor's memory is reset by "--executor-memory 24G" for spark-shell. > > The one from the spark-env.sh is just for default setting. > > I can confirm from the Spark UI the ex

pregel graphx job not finishing

2015-08-11 Thread dizzy5112
Hi, I'm currently using a pregel message passing function for my graph in spark and graphx. The problem I have is that the code runs perfectly on spark 1.0 and finishes in a couple of minutes, but as we have upgraded, I'm now trying to run the same code on 1.3 and it doesn't finish (left it overnight and

Exception in spark

2015-08-11 Thread Ravisankar Mani
Hi all, We got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object” when using some where condition queries. I am using 1.4.0 version spark. Is this exception resolved in latest spark? Regards, Ravi

Re: Exception in spark

2015-08-11 Thread Josh Rosen
Can you share a query or stack trace? More information would make this question easier to answer. On Tue, Aug 11, 2015 at 8:50 PM, Ravisankar Mani wrote: > Hi all, > > We got an exception like > “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call > to dataType on unresolv

Error when running SparkPi in Intellij

2015-08-11 Thread canan chen
I import the spark project into intellij, and try to run SparkPi in intellij, but failed with compilation error: Error:scalac: while compiling: /Users/werere/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala during phase: jvm library version: ver

Re: Partitioning in spark streaming

2015-08-11 Thread Hemant Bhanawat
Posting a comment from my previous mail post: When data is received from a stream source, the receiver creates blocks of data. A new block of data is generated every blockInterval milliseconds. N blocks of data are created during the batchInterval where N = batchInterval/blockInterval. An RDD is creat
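A minimal sketch of tuning those two intervals (values are illustrative): with a 10 second batchInterval and a 200 ms blockInterval, each batch is split into roughly 50 blocks, i.e. 50 tasks.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("block-interval-example")              // hypothetical
      .setMaster("local[2]")
      .set("spark.streaming.blockInterval", "200ms")     // 200ms happens to be the default
    val ssc = new StreamingContext(conf, Seconds(10))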

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
Thanks for the info. When data is written to hdfs, how does spark keep the filenames written by multiple executors unique? On Tue, Aug 11, 2015 at 9:35 PM, Hemant Bhanawat wrote: > Posting a comment from my previous mail post: > > When data is received from a stream source, receiver creates block

Not seeing Log messages

2015-08-11 Thread Spark Enthusiast
I wrote a small python program : def parseLogs(self): """ Read and parse log file """ self._logger.debug("Parselogs() start") self.parsed_logs = (self._sc .textFile(self._logFile) .map(self._parseApacheLogLine) .cac

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-11 Thread Hemant Bhanawat
Is the source of your dataframe partitioned on key? As per your mail, it looks like it is not. If that is the case, for partitioning the data, you will have to shuffle the data anyway. Another part of your question is - how to co-group data from two dataframes based on a key? I think for RDD's co

Re: Spark job workflow engine recommendations

2015-08-11 Thread Nick Pentreath
I also tend to agree that Azkaban is somewhat easier to get set up. Though I haven't used the new UI for Oozie that is part of CDH, so perhaps that is another good option. It's a pity Azkaban is a little rough in terms of documenting its API, and the scalability is an issue. However it would

Re: Not seeing Log messages

2015-08-11 Thread Spark Enthusiast
Forgot to mention. Here is how I run the program :  ./bin/spark-submit --conf "spark.app.master"="local[1]" ~/workspace/spark-python/ApacheLogWebServerAnalysis.py On Wednesday, 12 August 2015 10:28 AM, Spark Enthusiast wrote: I wrote a small python program : def parseLogs(self):

Re: Exception in spark

2015-08-11 Thread Ravisankar Mani
Hi Rosen, Thanks for your response. Kindly refer to the following query and stack trace. I have checked the same query in hive, and it works perfectly. In case I remove "in" from the where clause, it works in spark SELECT If(ISNOTNULL(SUM(`Sheet1`.`Runs`)),SUM(`Sheet1`.`Runs`),0) AS `Sheet1_Runs` ,`Sheet1`.

Re: Exception in spark

2015-08-11 Thread Ravisankar Mani
Hi Josh Please ignore the last mail stack trace. Kindly refer the exception details. {"org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'Sheet1.Teams"} Regards, Ravi On Wed, Aug 12, 2015 at 1:34 AM, Ravisankar Mani wrote: > Hi

Re: grouping by a partitioned key

2015-08-11 Thread Hemant Bhanawat
As far as I know, Spark SQL cannot process data on a per-partition-basis. DataFrame.foreachPartition is the way. I haven't tried it, but, following looks like a not-so-sophisticated way of making spark sql partition aware. http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-d