Spark Streaming failing on YARN Cluster

2015-08-13 Thread Ramkumar V
Hi, I have a cluster of 1 master and 2 slaves. I'm running Spark Streaming on the master and I want to utilize all nodes in my cluster. I had specified some parameters like driver memory and executor memory in my code. When I give --deploy-mode cluster --master yarn-cluster in my spark-submit, it gi

UnknownHostNameException looking up host name with > 64 characters

2015-08-13 Thread Jeff Jones
I've got a Spark application running on a host with > 64 character FQDN. When running with Spark master "local[*]" I get the following error. Note, the host name should be ip-10-248-0-177.us-west-2.compute.internaldna.corp.adaptivebiotech.com but the last 6 characters are missing. The same ap

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-13 Thread Akhil Das
I figured that out, and these are my findings: -> It just enters an infinite loop when there's a duplicate partition id. -> It enters an infinite loop when the partition id starts from 1 rather than 0. Something like this piece of code can reproduce it (in getPartitions()): val total_part

Re: Error writing to cassandra table using spark application

2015-08-13 Thread Akhil Das
Can you look in the worker logs and see what's going wrong? Thanks Best Regards On Wed, Aug 12, 2015 at 9:53 PM, Nupur Kumar (BLOOMBERG/ 731 LEX) < nkumar...@bloomberg.net> wrote: > Hello, > > I am doing this for the first time so feel free to let me know/forward > this to where it needs to be if

Re: Spark 1.3 + Parquet: "Skipping data using statistics"

2015-08-13 Thread Cheng Lian
On 8/13/15 6:11 AM, YaoPau wrote: I've seen this function referenced in a couple places, first this forum post and this talk by Michael Armbrust during the 42nd minute.

Re: ClassNotFound spark streaming

2015-08-13 Thread Akhil Das
You need to add that jar to the classpath. While submitting the job, you can use the --jars, --driver-class-path etc. configurations to add the jar. Apart from that, if you are running the job as a standalone application, then you can use the sc.addJar option to add the jar (which will ship this jar into
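
For reference, a minimal sketch of both approaches Akhil mentions; the jar path, class name, and app name below are placeholders, not from the thread:

    // On submit:
    //   spark-submit --jars /path/to/dependency.jar --class my.Main app.jar
    //
    // Or from inside a standalone application:
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("addjar-example"))
    sc.addJar("/path/to/dependency.jar") // ships this jar to every executor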

serialization issue

2015-08-13 Thread 周千昊
Hi, I am using Spark 1.4 and an issue has occurred. I am trying to use the aggregate function: JavaRDD rdd = some rdd; HashMap zeroValue = new HashMap(); // add initial key-value pair for zeroValue rdd.aggregate(zeroValue, new Function2,
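
For context, a hedged Scala sketch (not the poster's Java code) of the aggregate signature involved; note that the zero value and both functions are shipped to executors and must be serializable, which is where issues like this usually originate. An existing SparkContext sc is assumed:

    import scala.collection.mutable

    val rdd = sc.parallelize(Seq("a", "b", "a"))
    val counts = rdd.aggregate(mutable.HashMap.empty[String, Int])(
      // seqOp: fold one record into the per-partition accumulator
      (acc, word) => { acc(word) = acc.getOrElse(word, 0) + 1; acc },
      // combOp: merge two partial accumulators
      (m1, m2) => { m2.foreach { case (k, v) => m1(k) = m1.getOrElse(k, 0) + v }; m1 }
    )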

Re: dse spark-submit multiple jars issue

2015-08-13 Thread Javier Domingo Cansino
Please notice the 'jars: null'. I don't know why you put '///', but I would propose you just use normal absolute paths. dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar /home/missingmerch/dse.jar /home/missingmerch/s

Spark Streaming: Change Kafka topics on runtime

2015-08-13 Thread Nisrina Luthfiyati
Hi all, I want to write a Spark Streaming program that listens to Kafka for a list of topics. The list of topics that I want to consume is stored in a DB and might change dynamically. I plan to periodically refresh this list of topics in the Spark Streaming app. My question is: is it possible to a

Re: make-distribution.sh failing at spark/R/lib/sparkr.zip

2015-08-13 Thread MEETHU MATHEW
Hi, It worked after removing that line. Thank you for the response and fix. Thanks & Regards, Meethu M On Thursday, 13 August 2015 4:12 AM, Burak Yavuz wrote: For the record: https://github.com/apache/spark/pull/8147 https://issues.apache.org/jira/browse/SPARK-9916 On Wed, Aug 12,

Re: Does Spark optimization might miss to run transformation?

2015-08-13 Thread Michael Armbrust
-dev If you want to guarantee the side effects happen you should use foreach or foreachPartition. A `take`, for example, might only evaluate a subset of the partitions until it finds enough results. On Wed, Aug 12, 2015 at 7:06 AM, Eugene Morozov wrote: > Hi! > > I’d like to complete action (s
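
A small sketch of the distinction (assuming an existing SparkContext sc):

    val rdd = sc.parallelize(1 to 1000000)
    rdd.take(5)            // may compute only the first partition(s)
    rdd.foreach(println)   // guaranteed to evaluate every partition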

Re: serialization issue

2015-08-13 Thread Anish Haldiya
While submitting the job, you can use the --jars, --driver-class-path etc. configurations to add the jar. Apart from that, if you are running the job as a standalone application, then you can use the sc.addJar option to add the jar (which will ship this jar into all the executors) Regards, Anish On 8/

Re: Spark Streaming failing on YARN Cluster

2015-08-13 Thread Ramkumar V
Yes, this file is available in this path on the same machine where I'm running Spark. Later I moved the spark-1.4.1 folder to all other machines in my cluster, but I'm still facing the same issue. *Thanks*, On Thu, Aug 13, 2015 at 1:17 PM, Akhil Das wro

Re: Controlling number of executors on Mesos vs YARN

2015-08-13 Thread Ajay Singal
Hi Tim, An option like spark.mesos.executor.max to cap the number of executors per node/application would be very useful. However, having an option like spark.mesos.executor.num to specify the desired number of executors per node would provide even better control. Thanks, Ajay On Wed, Aug 12

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Hemant Bhanawat
A chain of map and flatMap does not cause any serialization-deserialization. On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann wrote: > Hello everyone, > > I am wondering what the effect of serialization is within a stage. > > My understanding of Spark as an execution engine is that the data flow

Re: Unit Testing

2015-08-13 Thread jay vyas
Yes, there certainly is, so long as Eclipse has the right plugins and so on to run Scala programs. You're really asking two questions: (1) Can I use a modern IDE to develop Spark apps, and (2) can we easily unit test Spark streaming apps? The answer is yes to both... Regarding your IDE: I like t

Re: using Spark or pig group by efficient in my use case?

2015-08-13 Thread Eugene Morozov
I’d say spark will be faster in this case, because it avoids storing intermediate data to disk after map and before reduce tasks. It’ll be faster even if you use Combiner (I’d assume Pig is able to figure that out). Hard to say how much faster as it’ll depend on disks available (ssd vs sshd vs

Re: Controlling number of executors on Mesos vs YARN

2015-08-13 Thread Ajay Singal
Tim, The ability to specify fine-grained configuration could be useful for many reasons. Let's take an example of a node with 32 cores. First of all, as per my understanding, having 5 executors each with 6 cores will almost always perform better than having a single executor with 30 cores. Also,

Eviction of RDD persisted on disk

2015-08-13 Thread Eugene Morozov
Hello! I know there is an LRU eviction policy for the in-memory cached partitions, but unfortunately I cannot find anything regarding the eviction policy for RDDs persisted on disk. Is there any? Does Spark assume that it has infinite disk available and never evicts anything? Even then, if it’s true,

About Databricks's spark-sql-perf

2015-08-13 Thread Todd
Hi, I got a question about the spark-sql-perf project by Databricks at https://github.com/databricks/spark-sql-perf/ The Tables.scala (https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/bigdata/Tables.scala) and BigData (https://github.com/dat

Re: Spark DataFrames uses too many partitions

2015-08-13 Thread prosp4300
Hi, I want to know how you coalesce the partitions to one to improve the performance. Thanks. On 2015-08-11 23:31, Al M wrote: I am using DataFrames with Spark 1.4.1. I really like DataFrames but the partitioning makes no sense to me. I am loading lots of very small files and joining them together.

Create column in nested structure?

2015-08-13 Thread Ewan Leith
Has anyone used withColumn (or another method) to add a column to an existing nested dataframe? If I call: df.withColumn("nested.newcolumn", df("oldcolumn")) then it just creates the new column with a "." in its name, not under the "nested" structure. Thanks, Ewan

Re: DataFrame column structure change

2015-08-13 Thread Eugene Morozov
I have a pretty complex nested structure with several levels. So in order to create it I use the SQLContext.createDataFrame method and provide specific Rows with specific StructTypes, both of which I build myself. To build a Row I iterate over my values and literally build a Row. List row = n

Re: saveToCassandra not working in Spark Job but works in Spark Shell

2015-08-13 Thread satish chandra j
Hi, Please let me know if I am missing anything in the mail below, to get the issue fixed. Regards, Satish Chandra On Wed, Aug 12, 2015 at 6:59 PM, satish chandra j wrote: > HI, > > The below mentioned code is working fine in Spark Shell but when > the same is placed in a Spark Applicati

spark.streaming.maxRatePerPartition parameter: what are the benefits?

2015-08-13 Thread allonsy
Hello everyone, in the new Kafka Direct API, what are the benefits of setting a value for *spark.streaming.maxRatePerPartition*? In my case, I have 2 seconds batches consuming ~15k tuples from a topic split into 48 partitions (4 workers, 16 total cores). Is there any particular value I should be

Re: Spark Streaming: Change Kafka topics on runtime

2015-08-13 Thread Cody Koeninger
The current kafka stream implementation assumes the set of topics doesn't change during operation. You could either take a crack at writing a subclass that does what you need; stop/start; or if your batch duration isn't too small, you could run it as a series of RDDs (using the existing KafkaUtils
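
For the series-of-RDDs route, a hedged sketch using the existing KafkaUtils.createRDD from the 1.x spark-streaming-kafka artifact; the topic name, offsets, and kafkaParams map are assumptions for illustration, and in practice the topic list and offset ranges would be re-read from the DB before each run:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // one bounded batch of Kafka data as a plain RDD[(key, message)]
    val ranges = Array(OffsetRange("sometopic", 0, fromOffset = 0L, untilOffset = 1000L))
    val batch = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)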

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Mark Heimann
Thanks a lot guys, that's exactly what I hoped for :-). Cheers, Mark 2015-08-13 6:35 GMT+02:00 Hemant Bhanawat : > A chain of map and flatMap does not cause any > serialization-deserialization. > > > > On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann > wrote: > >> Hello everyone, >> >> I am wonder

RE: Create column in nested structure?

2015-08-13 Thread Ewan Leith
Never mind me, I've found an email to this list from Raghavendra Pandey which got me what I needed val nestedCol = struct(df("nested2.column1"), df("nested2.column2"), df("flatcolumn")) val df2 = df.select(df("nested1"), nestedCol as "nested2") Thanks, Ewan From: Ewan Leith Sent: 13 August 201
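
For anyone following along, struct lives in org.apache.spark.sql.functions, so a slightly fuller sketch of the same fix (column names as in the thread) is:

    import org.apache.spark.sql.functions.struct

    // rebuild the nested struct with the extra column, then re-alias it
    val nestedCol = struct(df("nested2.column1"), df("nested2.column2"), df("flatcolumn"))
    val df2 = df.select(df("nested1"), nestedCol as "nested2")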

Re: Question regarding join with multiple columns with pyspark

2015-08-13 Thread Dan LaBar
The DataFrame issue has been fixed in Spark 1.5. Refer to SPARK-7990 and Stack Overflow: Spark specify multiple column conditions for dataframe join. On Tue, Apr 28, 2015 at 12:55 PM, Ali Bajwa wrote:

Re: Spark - Standalone Vs YARN Vs Mesos

2015-08-13 Thread ๏̯͡๏
I am looking to decide what is best for my production grade spark application(s). YARN = 1. YARN supports security. When Spark is run over YARN the communication between processes can use secure authentication through Kerberos. 2. Spark standalone cluster can only run Spark jobs and

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Zoltán Zvara
Serialization only occurs intra-stage when you are using Python, and as far as I know, only in the first stage, when reading the data and passing it to the Python interpreter the first time. Multiple operations are just chains of simple *map* and *flatMap* operators at task level on simple Scala
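
As an illustration of the stage pipelining being described (a sketch; the input path is a placeholder):

    val counts = sc.textFile("hdfs:///some/input")
      .flatMap(_.split(" "))   // these operators run pipelined in one stage,
      .map((_, 1))             // with no serialization between them
      .reduceByKey(_ + _)      // the shuffle here starts a new stage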

Spark RuntimeException hadoop output format

2015-08-13 Thread Mohit Anchlia
I have this call trying to save to hdfs 2.6 wordCounts.saveAsNewAPIHadoopFiles("prefix", "txt"); but I am getting the following: java.lang.RuntimeException: class scala.runtime.Nothing$ not org.apache.hadoop.mapreduce.OutputFormat
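
The scala.runtime.Nothing$ in that message usually means the output format type parameter was never inferred; a hedged sketch of passing the classes explicitly instead (the Writable and format choices are illustrative, assuming (Text, IntWritable)-style pairs):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    wordCounts.saveAsNewAPIHadoopFiles(
      "prefix", "txt",
      classOf[Text], classOf[IntWritable],
      classOf[TextOutputFormat[Text, IntWritable]])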

Spark 1.3.0: ExecutorLostFailure depending on input file size

2015-08-13 Thread Wyss Michael (wysm)
Hi, I've been at this problem for a few days now and haven't been able to solve it. I'm hoping that I'm missing something that you don't! I'm trying to run a simple Python application on a 2-node cluster I set up in standalone mode: a master and a worker, whereas the master also takes on the role of a wo

Streaming on Exponential Data

2015-08-13 Thread UMESH CHAUDHARY
Hi, I was working with the non-reliable receiver version of Spark-Kafka streaming, i.e. KafkaUtils.createStream..., where for testing purposes I was getting data at a constant rate from Kafka and it was acting as expected. But when there was exponential data in Kafka, my program started crashing, saying "Can

Re: Unit Testing

2015-08-13 Thread Burak Yavuz
I would recommend this spark package for your unit testing needs ( http://spark-packages.org/package/holdenk/spark-testing-base). Best, Burak On Thu, Aug 13, 2015 at 5:51 AM, jay vyas wrote: > yes there certainly is, so long as eclipse has the right plugins and so on > to run scala programs. Y

Write to cassandra...each individual statement

2015-08-13 Thread Priya Ch
Hi All, I have a question about writing an rdd to cassandra. Instead of writing the entire rdd to cassandra, I want to write each individual statement into cassandra because there is a need to perform ETL on each message (which requires checking with the DB). How could I insert statements individually? Usin

Re: About Databricks's spark-sql-perf

2015-08-13 Thread Yin Huai
Hi Todd, We have not got a chance to update it. We will update it after 1.5 release. Thanks, Yin On Thu, Aug 13, 2015 at 6:49 AM, Todd wrote: > Hi, > I got a question about the spark-sql-perf project by Databricks at > https://github.com/databricks/spark-sql-perf/ > > > The Tables.scala ( >

Re: Write to cassandra...each individual statement

2015-08-13 Thread Philip Weaver
All you'd need to do is *transform* the rdd before writing it, e.g. using the .map function. On Thu, Aug 13, 2015 at 11:30 AM, Priya Ch wrote: > Hi All, > > I have a question in writing rdd to cassandra. Instead of writing entire > rdd to cassandra, i want to write individual statement into ca

Fwd: - Spark 1.4.1 - run-example SparkPi - Failure ...

2015-08-13 Thread Naga Vij
Hello, Any idea on why this is happening? Thanks Naga -- Forwarded message -- From: Naga Vij Date: Wed, Aug 12, 2015 at 5:47 PM Subject: - Spark 1.4.1 - run-example SparkPi - Failure ... To: user@spark.apache.org Hi, I am evaluating Spark 1.4.1 Any idea on why run-example Sp

Re: Write to cassandra...each individual statement

2015-08-13 Thread Priya Ch
Hi Philip, I have the following requirement - I read streams of data from various partitions of a kafka topic. Then I union the dstreams and apply a hash partitioner so messages with the same key go into a single partition of an rdd, which is of course handled by a single thread. This way we try

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-13 Thread Imran Rashid
Oh I see, you are defining your own RDD & Partition types, and you had a bug where partition.index did not line up with the partition's slot in rdd.getPartitions. Is that correct? On Thu, Aug 13, 2015 at 2:40 AM, Akhil Das wrote: > I figured that out, and these are my findings: > > -> It just en
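
A minimal sketch of the invariant in question for a custom RDD: the i-th slot of the array returned by getPartitions must hold a Partition whose index is i, starting at 0 (this is a generic illustration, not the poster's code):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class MyRDD(sc: SparkContext, numParts: Int) extends RDD[Int](sc, Nil) {
      override def getPartitions: Array[Partition] =
        (0 until numParts).map { i =>
          new Partition { override def index: Int = i } // index matches slot i
        }.toArray
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator(split.index)
    }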

Re: About Databricks's spark-sql-perf

2015-08-13 Thread Michael Armbrust
Hey sorry, I've been doing a bunch of refactoring on this project. Most of the data generation was a huge hack (it was done before we supported partitioning natively) and used some private APIs that don't exist anymore. As a result, while doing the regression tests for 1.5 I deleted a bunch of br

MatrixFactorizationModel.save got StackOverflowError

2015-08-13 Thread Benyi Wang
I'm using spark-1.4.1 and compile it against CDH5.3.2. When I use ALS.trainImplicit to build a model, I got this error when rank=40 and iterations=30. It worked for (rank=10, iteration=10) and (rank=20, iteration=20). What was wrong with (rank=40, iterations=30)? 15/08/13 01:16:40 INFO sched

Re: Write to cassandra...each individual statement

2015-08-13 Thread Philip Weaver
So you need some state between messages in a partition. You can use mapPartitions or foreachPartition, which allow you to write code to process an entire partition. On Thu, Aug 13, 2015 at 11:48 AM, Priya Ch wrote: > Hi Philip, > > I have the following requirement - > I read the streams of data
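
A rough sketch of that pattern; the per-partition state and the write step are placeholders, not a specific Cassandra connector API, and `dstream` stands for the partitioned stream described above:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { messages =>
        // open one DB connection / any per-partition state here
        var previous: Option[String] = None
        messages.foreach { msg =>
          // check msg against `previous` (or the DB), then write it individually
          previous = Some(msg)
        }
        // close the connection here
      }
    }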

RDD.join vs spark SQL join

2015-08-13 Thread Xiao JIANG
Hi, May I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one to use? What are the factors to consider here? Thanks!

New Spark User - GBM iterations and Spark benchmarks

2015-08-13 Thread Sereday, Scott
I am exploring utilizing Spark as my dataset is becoming more and more difficult to manage and analyze. I'd appreciate it if anyone could provide feedback on the following questions: * I am especially interested in training large datasets using machine learning algorithms. o Doe

Re: Spark - Standalone Vs YARN Vs Mesos

2015-08-13 Thread ๏̯͡๏
What are the ideas around a Spark cluster for streaming purposes? Which is better: standalone / Mesos / YARN? Please share cluster details, the size of the data, and the type of processing. (multiple processing points) (architecture or similar) I see folks using YARN clusters for streaming purposes. Regards, Dee

Input size increasing every iteration of gradient boosted trees [1.4]

2015-08-13 Thread Matt Forbes
I am training a boosted trees model on a couple million input samples (with around 300 features) and am noticing that the input size of each stage is increasing each iteration. For each new tree, the first step seems to be building the decision tree metadata, which does a .count() on the input data

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-08-13 Thread Sean Owen
Not that I have any answer at this point, but I was discussing this exact same problem with Johannes today. An input size of ~20K records was growing each iteration by ~15M records. I could not see why on a first look. @jkbradley I know it's not much info but does that ring any bells? I think Joha

Retrieving offsets from previous spark streaming checkpoint

2015-08-13 Thread Stephen Durfey
When deploying a spark streaming application I want to be able to retrieve the latest kafka offsets that were processed by the pipeline, and create my kafka direct streams from those offsets. Because the checkpoint directory isn't guaranteed to be compatible between job deployments, I don't want t

Re: Re: About Databricks's spark-sql-perf

2015-08-13 Thread Todd
Thanks Michael for the answer. I will watch the project and hope the update will be coming soon, :-) At 2015-08-14 02:13:32, "Michael Armbrust" wrote: Hey sorry, I've been doing a bunch of refactoring on this project. Most of the data generation was a huge hack (it was done before we suppo

Re: spark.streaming.maxRatePerPartition parameter: what are the benefits?

2015-08-13 Thread Cody Koeninger
It's just to limit the maximum number of records a given executor needs to deal with in a given batch. Typical usage would be if you're starting a stream from the beginning of a kafka log, or after a long downtime, and don't want ALL of the messages in the first batch. On Thu, Aug 13, 2015 at 8

Re: Retrieving offsets from previous spark streaming checkpoint

2015-08-13 Thread Cody Koeninger
Access the offsets using HasOffsetRanges, save them in your datastore, provide them as the fromOffsets argument when starting the stream. See https://github.com/koeninger/kafka-exactly-once On Thu, Aug 13, 2015 at 3:53 PM, Stephen Durfey wrote: > When deploying a spark streaming application I w
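
A condensed sketch of that flow for the 1.x direct stream; saveOffsets and loadOffsetsFromStore are hypothetical helpers for whatever datastore is used, and stream/ssc/kafkaParams are assumed from the surrounding application:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // each batch: capture and persist the ranges that were just processed
    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      saveOffsets(ranges) // hypothetical: store topic/partition/untilOffset
    }

    // on redeploy: rebuild the starting point from the datastore
    val fromOffsets: Map[TopicAndPartition, Long] = loadOffsetsFromStore() // hypothetical
    val resumed = KafkaUtils.createDirectStream[String, String, StringDecoder,
      StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))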

Custom serialization and checkpointing

2015-08-13 Thread Tech Meme
Hi Guys, We need to do some state checkpointing (an RDD that's updated using updateStateByKey). We would like finer control over the serialization. Also, this would allow us to do schema evolution in the deserialization code when we need to modify the structure of the classes associated with the

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-08-13 Thread Matt Forbes
Is this an artifact of a recent change? Does this not show up in any of the tests or benchmarks? On Thu, Aug 13, 2015 at 2:33 PM, Sean Owen wrote: > Not that I have any answer at this point, but I was discussing this > exact same problem with Johannes today. An input size of ~20K records > was g

graphx class not found error

2015-08-13 Thread dizzy5112
the code below works perfectly on both cluster and local modes but when i try to create a graph in cluster mode (it works in local mode) I get the following error: any help appreciated -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/graphx-class-not

Re: graphx class not found error

2015-08-13 Thread dizzy5112
Oh forgot to note using the Scala REPL for this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/graphx-class-not-found-error-tp24253p24254.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: graphx class not found error

2015-08-13 Thread Ted Yu
The code and error didn't go through. Mind sending again ? Which Spark release are you using ? On Thu, Aug 13, 2015 at 6:17 PM, dizzy5112 wrote: > the code below works perfectly on both cluster and local modes > > > > but when i try to create a graph in cluster mode (it works in local mode) >

Re: ERROR Executor java.lang.NoClassDefFoundError

2015-08-13 Thread nsalian
If --jars doesn't work, try --conf "spark.executor.extraClassPath=" -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-Executor-java-lang-NoClassDefFoundError-tp24244p24256.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Reduce number of partitions before saving to file. coalesce or repartition?

2015-08-13 Thread Alexander Pivovarov
Hi Everyone, Which one should work faster (coalesce or repartition) if I need to reduce the number of partitions from 5000 to 3 before saving the RDD with saveAsTextFile? Total data size is about 400MB on disk in text format. Thank you

Always two tasks slower than others, and then job fails

2015-08-13 Thread randylu
It is strange that there are always two tasks slower than the others, and the corresponding partitions' data is larger, no matter how many partitions there are. Executor ID Address Task Time Shuffle Read Size / Records 1 slave129.vsvs.com:56691 16 s1 99.5 MB / 1886

Spark Streaming: using an external variable in map causes a problem

2015-08-13 Thread kale
[The message body consisted of three PNG image attachments (binary data); no text content survived in the archive.]

Materials for deep insight into Spark SQL

2015-08-13 Thread Todd
Hi, I would like to ask whether there are slides, blogs, or videos on how Spark SQL is implemented: the process, or the whole picture of what happens when Spark SQL executes code. Thanks!

matrix inverse and multiplication

2015-08-13 Thread go canal
Hello, I am new to Spark. I am looking for a matrix inverse and multiplication solution. I did a quick search and found a couple of solutions, but my requirements are: large matrix (up to 2 million x 2 million); need to support a complex double data type; preferably in Java. There is one post http://da

Re: Materials for deep insight into Spark SQL

2015-08-13 Thread Ted Yu
You can look under Developer Track: https://spark-summit.org/2015/#day-1 http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=1 (slightly old) Catalyst design: https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit FYI On Thu, Aug 13

worker and executor memory

2015-08-13 Thread James Pirz
Hi, I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables), a

how do you convert directstream into data frames

2015-08-13 Thread Mohit Durgapal
Hi All, After creating a direct stream like below: val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, topicsSet) I would like to convert the above stream into data frames, so that I could run hive queries over it. Could anyone p
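
One common pattern, sketched here under the assumption of String message bodies (this is not from the thread): convert each micro-batch RDD inside foreachRDD and register it as a temp table. For actual Hive queries, a HiveContext can be substituted for the SQLContext:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.implicits._

    events.foreachRDD { rdd =>
      val df = rdd.map(_._2).toDF("value") // drop the key, keep the message body
      df.registerTempTable("events")       // now queryable via sqlContext.sql(...)
    }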

Driver staggering task launch times

2015-08-13 Thread Ara Vartanian
I’m observing an unusual situation where my step duration increases as I add further executors to my cluster. My algorithm is fully data-parallelizable into a map phase, followed by a reduce step at the end that amounts to matrix addition. So I’ve kicked off a cluster of, say, 100 executors with 4 c

Re: how do you convert directstream into data frames

2015-08-13 Thread Mohit Durgapal
Any idea anyone? On Fri, Aug 14, 2015 at 10:11 AM, Mohit Durgapal wrote: > Hi All, > > After creating a direct stream like below: > > val events = KafkaUtils.createDirectStream[String, String, > StringDecoder, StringDecoder]( > ssc, kafkaParams, topicsSet) > > > I would like to convert

Re: Spark RuntimeException hadoop output format

2015-08-13 Thread Mohit Anchlia
I was able to get this working by using an alternative method however I only see 0 bytes files in hadoop. I've verified that the output does exist in the logs however it's missing from hdfs. On Thu, Aug 13, 2015 at 10:49 AM, Mohit Anchlia wrote: > I have this call trying to save to hdfs 2.6 > >

Re: Driver staggering task launch times

2015-08-13 Thread Philip Weaver
Are you running on Mesos, YARN or standalone? If you're on Mesos, are you using coarse-grained or fine-grained mode? On Thu, Aug 13, 2015 at 10:13 PM, Ara Vartanian wrote: > I’m observing an unusual situation where my step duration increases as I > add further executors to my cluster. My algorithm

Re: Streaming on Exponential Data

2015-08-13 Thread Hemant Bhanawat
What does exponential data mean? Does this mean that the amount of data being received from the stream in a batch interval is increasing exponentially as time progresses? Does your process have enough memory to handle the data for a batch interval? You may want to share Spark task

Re: Driver staggering task launch times

2015-08-13 Thread Ara Vartanian
I’m running on Yarn. > On Aug 13, 2015, at 10:58 PM, Philip Weaver wrote: > > Are you running on mesos, yarn or standalone? If you're on mesos, are you > using coarse grain or fine grained mode? > > On Thu, Aug 13, 2015 at 10:13 PM, Ara Vartanian > wrote: > I’m obs

Re: Driver staggering task launch times

2015-08-13 Thread Philip Weaver
Ah, nevermind, I don't know anything about scheduling tasks in YARN. On Thu, Aug 13, 2015 at 11:03 PM, Ara Vartanian wrote: > I’m running on Yarn. > > On Aug 13, 2015, at 10:58 PM, Philip Weaver > wrote: > > Are you running on mesos, yarn or standalone? If you're on mesos, are you > using coars

Exception in spark

2015-08-13 Thread Ravisankar Mani
Hi all, I got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object” when using some WHERE condition queries. I am using Spark version 1.4.0. But it works perfectly in Hive. Please refer to the following query. I have appli

Re: Reduce number of partitions before saving to file. coalesce or repartition?

2015-08-13 Thread Anish Haldiya
Hi, If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle. However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in th
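
A short sketch of the two options for the 5000 -> 3 case described above (output paths are placeholders):

    // coalesce without shuffle: upstream computation collapses onto 3 tasks
    rdd.coalesce(3).saveAsTextFile("hdfs:///out")
    // coalesce(shuffle = true) is equivalent to repartition: upstream keeps
    // all 5000 tasks and a full shuffle brings the result down to 3 partitions
    rdd.repartition(3).saveAsTextFile("hdfs:///out2")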

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-13 Thread Akhil Das
Yep, and it works fine for operations which do not involve any shuffle (like foreach, count, etc.), but those which involve shuffle operations end up in an infinite loop. Spark should somehow indicate this instead of going into an infinite loop. Thanks Best Regards On Thu, Aug 13, 2015 at 11:37 P