Hello,
when I set spark.executor.cores to e.g. 8 and spark.executor.memory
to 8GB, it can allocate more executors with fewer cores each for my app, but
each executor still gets 8GB RAM.
That is a problem because I can allocate more memory across the cluster than
expected; the worst case is 8x 1-core executors,
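To make the arithmetic concrete (a minimal sketch with the numbers from above;
nothing here is a fix, it just spells out the settings in question):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.cores", "8")    // what I ask for per executor
      .set("spark.executor.memory", "8g")  // granted per executor, however many cores it actually gets
    // Expected: 1 executor x 8 cores x 8g = 8g of cluster memory.
    // Worst case observed: 8 executors x 1 core x 8g = 64g of cluster memory.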
Hi,
I understand it simply as: they will provide some lower-latency interface,
probably using JDBC, so that 3rd-party BI tools can integrate and query
streams as if they were static datasets. If the BI tool repeats the query, it
will get updated results. I don't know if BI tools are already heading towards
I have to ask my colleague whether there is any specific error, but I think
it just doesn't see the files.
Petr
On Thu, Apr 21, 2016 at 11:54 AM, Petr Novak wrote:
> Hello,
> Impala (v2.1.0, Spark 1.6.0) can't read partitioned Parquet files saved
> from DF.partitionBy (using Python).
Hello,
Impala (v2.1.0, Spark 1.6.0) can't read partitioned Parquet files saved
from DF.partitionBy (using Python). Is there any known reason, some config?
Or should it generally work, meaning it is likely something wrong solely on
our side?
Many thanks,
Petr
Hi all,
I believe the documentation used to say that Standalone mode is not for
production. Either I'm wrong or it has already been removed.
For a small cluster of 5-10 nodes, is Standalone recommended for production?
I would like to go with Mesos, but the question is whether there is any
real add-o
How does Arrow interact with Tungsten and its binary in-memory format? Spark
will still have to convert between them. I assume they use similar
concepts/layouts, hence the conversion can likely be quite efficient.
Or is there a chance that the current Tungsten in-memory format would be
replaced by Arrow
Hi all,
based on the documentation:
"Spark 1.6.0 is designed for use with Mesos 0.21.0 and does not require any
special patches of Mesos."
We are considering Mesos for our use case, but this concerns me a lot. Mesos
is currently at v0.27, which we need for its Volumes feature, but Spark locks
us to 0.21 only
You can have offsetRanges on workers, e.g.:
object Something {
  var offsetRanges = Array[OffsetRange]()

  def create[F: ClassTag](stream: InputDStream[Array[Byte]])
                         (implicit codec: Codec[F]): DStream[F] = {
    stream transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
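For completeness, a sketch of how the captured ranges can then be consumed on
the driver in foreachRDD, following the pattern from the Kafka direct stream
docs (MyEvent, rawStream and the save step are just placeholders):

    val decoded = Something.create[MyEvent](rawStream)

    decoded.foreachRDD { rdd =>
      // The transform above runs on the driver for each batch, so by the time
      // this body executes, Something.offsetRanges holds the ranges of the
      // current batch.
      Something.offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
      // e.g. derive a filename from the ranges and save rdd
    }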
I'm sorry. Both approaches actually work. It was something else wrong with
my cluster. Petr
On Fri, Sep 25, 2015 at 4:53 PM, Petr Novak wrote:
> Setting it programmatically doesn't work either:
> sparkConf.setIfMissing("class", "...Main")
>
> In my curr
Setting it programmatically doesn't work either:
sparkConf.setIfMissing("class", "...Main")
In my current setup, moving main to another package requires propagating the
change to deploy scripts. It doesn't matter, I will find some other way. Petr
On Fri, Sep 25, 2015
Otherwise it seems it tries to load from a checkpoint which I have deleted
and which cannot be found. Or it should work and I have something else wrong.
The documentation doesn't mention the option with a jar manifest, so I assume
it doesn't work this way.
Many thanks,
Petr
, 2015 at 12:14 PM, Petr Novak wrote:
> Many thanks Cody, it explains quite a bit.
>
> I had couple of problems with checkpointing and graceful shutdown moving
> from working code in Spark 1.3.0 to 1.5.0. Having InterruptedExceptions,
> KafkaDirectStream couldn't initialize,
Hi,
I have 2 streams and checkpointing, with code based on the documentation. One
stream transforms data from Kafka and saves it to a Parquet file. The other
uses the same stream and does updateStateByKey to compute some aggregations.
There is no gracefulShutdown.
Both use roughly this code t
If you need to understand what the magic Product is, then google Algebraic
Data Types and learn it together with what a Sum type is. One option is
http://www.stephanboyer.com/post/18/algebraic-data-types
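A tiny Scala illustration of the idea (my own example, not from the article):
a sum type enumerates the alternatives, and each alternative is a product of
its fields:

    // Sum type: a Shape is either a Circle or a Rectangle.
    sealed trait Shape
    // Product types: each case class is the product of its fields.
    case class Circle(radius: Double) extends Shape
    case class Rectangle(width: Double, height: Double) extends Shape

    // Pattern matching covers exactly the alternatives of the sum.
    def area(s: Shape): Double = s match {
      case Circle(r)       => math.Pi * r * r
      case Rectangle(w, h) => w * h
    }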
Enjoy,
Petr
On Wed, Sep 23, 2015 at 9:07 AM, Petr Novak wrote:
> I'm unsu
I'm unsure if it is completely equivalent to a case class, whether it has
some limitations compared to a case class, or whether it needs some more
methods implemented.
Petr
On Wed, Sep 23, 2015 at 9:04 AM, Petr Novak wrote:
> You can implement your own case class supporting more than 22 fields.
You can implement your own case class supporting more than 22 fields. It is
something like:
class MyRecord(val val1: String, val val2: String, ... more than 22 fields,
    in this case e.g. 26)
  extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[MyRecord]
  def prod
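In case the snippet gets cut off, here is a minimal self-contained sketch with
only 3 fields, just to show the shape of it; Product only needs canEqual,
productArity and productElement:

    class MyRecord(val val1: String, val val2: String, val val3: String)
      extends Product with Serializable {

      def canEqual(that: Any): Boolean = that.isInstanceOf[MyRecord]

      // Number of fields.
      def productArity: Int = 3

      // Field access by index, in declaration order.
      def productElement(n: Int): Any = n match {
        case 0 => val1
        case 1 => val2
        case 2 => val3
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }
    }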
AM, Petr Novak wrote:
> Ahh the problem probably is async ingestion to Spark receiver buffers,
> hence WAL is required I would say.
>
> Petr
>
> On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak wrote:
>
>> If MQTT can be configured with long enough timeout for ACK and can bu
Ahh, the problem probably is the async ingestion into Spark receiver buffers,
hence a WAL is required, I would say.
Petr
On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak wrote:
> If MQTT can be configured with long enough timeout for ACK and can buffer
> enough events while waiting for Spark Job r
restart Spark job.
On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak wrote:
> Ahh the problem probably is async ingestion to Spark receiver buffers,
> hence WAL is required I would say.
>
> On Tue, Sep 22, 2015 at 10:52 AM, Petr Novak wrote:
>
>> If MQTT can be configured with lo
And probably the original source code
https://gist.github.com/koen-dejonghe/39c10357607c698c0b04
On Tue, Sep 22, 2015 at 10:37 AM, Petr Novak wrote:
> To complete design pattern:
>
> http://stackoverflow.com/questions/30450763/spark-streaming-and-connection-pool-implementation
>
&
u wrote:
>>>>
>>>>> You can use broadcast variable for passing connection information.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sep 21, 2015, at 4:27 AM, Priya Ch
>>>>> wrote:
>>>>>
>
We have tried it on another cluster installation with the same effect.
Petr
On Mon, Sep 21, 2015 at 10:45 AM, Petr Novak wrote:
> It might be connected with my problems with gracefulShutdown in Spark
>> 1.5.0 2.11
>> https://mail.google.com/mail/#search/petr/14fb6bd5166f9395
>
Nice, thanks.
So the note in the build instructions for 2.11 is obsolete? Or are there
still some limitations?
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
On Fri, Sep 11, 2015 at 2:19 PM, Petr Novak wrote:
> Nice, thanks.
>
> So the note in build instru
Great work.
On Fri, Sep 18, 2015 at 6:51 PM, Harish Butani
wrote:
> Hi,
>
> I have just posted a Blog on this:
> https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani
>
> regards,
> Harish Butani.
>
> On Tue, Sep 1, 2015 at 11:46 PM, Paolo Platter
> wr
add @transient?
On Mon, Sep 21, 2015 at 11:36 AM, Petr Novak wrote:
> add @transient?
>
> On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch
> wrote:
>
>> Hello All,
>>
>> How can i pass sparkContext as a parameter to a method in an object.
>> Be
on, Sep 21, 2015 at 11:26 AM, Petr Novak wrote:
> I should read my posts at least once to avoid so many typos. Hopefully you
> are brave enough to read through.
>
> Petr
>
> On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak wrote:
>
>> I think you would have to persist events
I should read my posts at least once to avoid so many typos. Hopefully you
are brave enough to read through.
Petr
On Mon, Sep 21, 2015 at 11:23 AM, Petr Novak wrote:
> I think you would have to persist events somehow if you don't want to miss
> them. I don't see any other opti
I think you would have to persist events somehow if you don't want to miss
them. I don't see any other option there. Either in MQTT, if it is supported
there, or by routing them through Kafka.
There is the WriteAheadLog in Spark, but you would have to decouple MQTT
stream reading and processing into 2 separate
g on Spark
provided version is fine for us for now.
Regards,
Petr
On Mon, Sep 21, 2015 at 11:06 AM, Petr Novak wrote:
> Internally Spark is using json4s and jackson parser v3.2.10, AFAIK. So if
> you are using Scala they should be available without adding dependencies.
> There is v3.2.11 al
Internally Spark uses json4s with the jackson parser, v3.2.10, AFAIK. So if
you are using Scala, they should be available without adding dependencies.
There is v3.2.11 already available, but adding it to my app was causing a
NoSuchMethod exception, so I would have to shade it. I'm simply staying on
v3.2.10 f
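For reference, a minimal usage sketch against the bundled json4s-jackson
3.2.10 (the Event case class and the JSON string are just made-up examples):

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    implicit val formats = DefaultFormats

    case class Event(id: Int, name: String)

    val json  = parse("""{"id": 1, "name": "test"}""")
    val event = json.extract[Event]        // Event(1, "test")
    val id    = (json \ "id").extract[Int] // 1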
n?
I would just need confirmation from the community that checkpointing and
graceful shutdown actually work with KafkaDirectStream on 1.5.0, so that I
can look for the problem on my side.
Many thanks,
Petr
On Sun, Sep 20, 2015 at 12:58 PM, Petr Novak wrote:
> Hi Michal,
> yes, it is there
val topics="first"
shouldn't it be val topics = Set("first") ?
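i.e. something like this (a sketch with a placeholder broker and an existing
ssc), since createDirectStream expects a Set of topic names:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder broker
    val topics = Set("first")                                         // a Set, not a String

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)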
On Sun, Sep 20, 2015 at 1:01 PM, Petr Novak wrote:
> val topics="first"
>
> shouldn't it be val topics = Set("first") ?
>
> On Sat, Sep 19, 2015 at 10:07 PM,
shutdown hook
Thanks,
Petr
On Sat, Sep 19, 2015 at 4:01 AM, Michal Čizmazia wrote:
> Hi Petr, after Ctrl+C can you see the following message in the logs?
>
> Invoking stop(stopGracefully=false)
>
> Details:
> https://github.com/apache/spark/pull/6307
>
>
> On 18 Septembe
It might be connected with my problems with gracefulShutdown in Spark 1.5.0
2.11
https://mail.google.com/mail/#search/petr/14fb6bd5166f9395
Maybe Ctrl+C corrupts checkpoints and breaks gracefulShutdown?
Petr
On Fri, Sep 18, 2015 at 4:10 PM, Petr Novak wrote:
> ...to ensure it is not someth
.spark.SparkContext)
[2015-09-11 22:33:05,899] INFO Shutdown hook called
(org.apache.spark.util.ShutdownHookManager)
[2015-09-11 22:33:05,899] INFO Deleting directory
/dfs/spark/tmp/spark-b466fc2e-9ab8-4783-87c2-485bac5c3cd6
(org.apache.spark.util.ShutdownHookManager)
Thanks,
Petr
On Mon, Sep
...to ensure it is not something wrong on my cluster.
On Fri, Sep 18, 2015 at 4:09 PM, Petr Novak wrote:
> I have tried it on Spark 1.3.0 2.10 and it works. The same code doesn't on
> Spark 1.5.0 2.11. It would be nice if anybody could try on another
> installation to ensure i
I have tried it on Spark 1.3.0 2.10 and it works. The same code doesn't work
on Spark 1.5.0 2.11. It would be nice if anybody could try it on another
installation to ensure it is something wrong on my cluster.
Many thanks,
Petr
On Fri, Sep 18, 2015 at 4:07 PM, Petr Novak wrote:
> This one is g
This one is generated, I suppose, after Ctrl+C
15/09/18 14:38:25 INFO Worker: Asked to kill executor
app-20150918143823-0001/0
15/09/18 14:38:25 INFO Worker: Asked to kill executor
app-20150918143823-0001/0
15/09/18 14:38:25 DEBUG
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handle
Hi all,
it throws FileBasedWriteAheadLogReader: Error reading next item, EOF reached
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
org.apache.spark.streaming.util.FileBasedWriteAheadLogReader.hasNext(FileBasedWriteAheadLogReader.scala:47)
The WAL is not enabled
Does it still apply for 1.5.0?
What actual limitations does it imply when I switch to 2.11? No JDBC
Thriftserver? No JDBC DataSource? No JdbcRDD (which is already obsolete, I
believe)? Some more?
Which library is the blocker for upgrading the JDBC component to 2.11?
Is there any estimate when it could be a
Hello,
my Spark streaming v1.3.0 code uses
sys.ShutdownHookThread {
ssc.stop(stopSparkContext = true, stopGracefully = true)
}
to allow stopping it with Ctrl+C on the command line. It returned to the
command line after it finished the batch, but it doesn't with v1.4.0-v1.5.0.
Was the behaviour or required cod
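In case it helps: if I understand the 1.4+ behaviour correctly (not verified,
so take it as a sketch), there is also a config-based route where Spark
installs its own shutdown hook and stops gracefully on Ctrl+C:

    // Instead of registering your own sys.ShutdownHookThread:
    sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")
    // Spark's shutdown hook then calls ssc.stop(stopGracefully = true) itself.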
Hello,
sqlContext.parquetFile(dir)
throws the exception "Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient".
The strange thing is that on the second attempt to open the file it is
successful:
try {
  sqlContext.parquetFile(dir)
} catch {
  case e: Exception => sqlCont
The same as
https://mail.google.com/mail/#label/Spark%2Fuser/14f64c75c15f5ccd
Please follow the discussion there.
On Tue, Aug 25, 2015 at 5:02 PM, Petr Novak wrote:
> Hi all,
> when I read parquet files with "required" fields aka nullable=false they
> are read correctl
Hi all,
when I read Parquet files with "required" fields, aka nullable=false, they
are read correctly. Then I save them (df.write.parquet), and when I read them
again, all my fields are saved and read as optional, aka nullable=true. This
means I suddenly have files with incompatible schemas. This happens on 1.3.0
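A quick way to see the effect (a sketch; inputDir/outputDir are placeholders,
and it assumes the 1.4+ read/write API) is to print the nullability flags
before and after the round trip:

    val df = sqlContext.read.parquet(inputDir)       // original files: nullable = false
    df.schema.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))

    df.write.parquet(outputDir)

    val reread = sqlContext.read.parquet(outputDir)  // after the round trip: nullable = true
    reread.schema.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))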
ogger.info(...)
Iterator.empty
} foreach {
(_: Nothing) =>
}
}
}
Many thanks for any advice, I'm sure it's a noob question.
Petr
On Mon, Aug 17, 2015 at 1:12 PM, Petr Novak wrote:
> Or can I generally create new RDD from transformation and enrich its
> partitio
Or can I generally create a new RDD from a transformation and enrich its
partitions with some metadata, so that I could copy OffsetRanges into my new
RDD in the DStream?
On Mon, Aug 17, 2015 at 1:08 PM, Petr Novak wrote:
> Hi all,
> I need to transform KafkaRDD into a new stream of deserialize
Hi all,
I need to transform a KafkaRDD into a new stream of deserialized case
classes. I want to use the new stream both to save it to a file and to
perform additional transformations on it.
To save it, I want to use offsets in the filenames, hence I need OffsetRanges
in the transformed RDD. But KafkaRDD is private
Hello,
I would like to switch from Scala 2.10 to 2.11 for Spark app development.
It seems that the only thing blocking me is the missing
spark-streaming-kafka_2.11 Maven package. Is there any plan to add it, or am
I just blind?
Many thanks,
Vladimir
Thank you. HADOOP_CONF_DIR was missing.
On Wed, Sep 24, 2014 at 4:48 PM, Matt Narrell
wrote:
> Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark
> machines
>
> mn
>
> On Sep 24, 2014, at 5:35 AM, Petr Novak wrote:
>
> Hello,
> if our Hadoop
Hello,
if our Hadoop cluster is configured with HA and "fs.defaultFS" points to a
nameservice instead of a namenode hostname - hdfs:/// - then our Spark job
fails with an exception. Is there anything to configure, or is it not
implemented?
Exception in thread "main" org.apache.spark.SparkException: Job
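For reference, these are the HA properties a correct HADOOP_CONF_DIR
(core-site.xml/hdfs-site.xml) would carry; setting them programmatically as
below is only a sketch, with "mycluster" and the namenode hosts as
placeholders:

    val hc = sc.hadoopConfiguration
    hc.set("fs.defaultFS", "hdfs://mycluster")
    hc.set("dfs.nameservices", "mycluster")
    hc.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
    hc.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020")
    hc.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020")
    hc.set("dfs.client.failover.proxy.provider.mycluster",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")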