spark.streaming.receiver.maxRate

2017-09-13 Thread Margus Roo
Hi. Using Spark 2.1.1.2.6-1.0-129 (from the Hortonworks distro), Scala 2.11.8, and Java 1.8.0_60. I have a NiFi flow that produces more records than the Spark stream can process within the batch time. To avoid overflowing the Spark queue I wanted to try Spark Streaming backpressure (it did not work for me), so back to the more
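For reference, a minimal sketch of the standard rate-control settings (the rate cap and batch interval below are assumed values, not from the thread; whether backpressure takes effect for a given receiver is exactly what the poster is probing):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Backpressure adapts the ingest rate per batch based on scheduling delay;
// maxRate is a hard ceiling in records per second per receiver.
val conf = new SparkConf()
  .setAppName("rate-limit-sketch")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.receiver.maxRate", "10000")  // assumed cap

val ssc = new StreamingContext(conf, Seconds(5))     // assumed batch interval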

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think it is one of the conceptual differences in Spark compared to other languages: there is no indexing in plain RDDs. This was the thread with Ankit: Yes. So order preservation cannot be guaranteed in the case of failure. Also, I am not sure if partitions are ordered. Can you get the same sequence of

Re: compile error: No classtag available while calling RDD.zip()

2017-09-13 Thread bluejoe
Thanks for your reply! Actually, it is OK when I use RDD.zip() like this:

def zipDatasets(m: Dataset[String], n: Dataset[Int]) = {
  m.sparkSession.createDataset(m.rdd.zip(n.rdd))
}

But in my project, the type of the Dataset is designated by the caller, so I introduce X, Y:

def zipDatasets[X
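A sketch of the usual fix for this kind of "No ClassTag available" error (assuming the goal is a generic zip over two Datasets; this is not the poster's final code): add ClassTag context bounds alongside the Encoder bounds, since RDD.zip needs a ClassTag for the zipped element type, and build the tuple encoder explicitly because generic X, Y carry no TypeTag for implicit derivation.

import scala.reflect.ClassTag
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

def zipDatasets[X: Encoder : ClassTag, Y: Encoder : ClassTag](
    spark: SparkSession, m: Dataset[X], n: Dataset[Y]): Dataset[(X, Y)] = {
  // Generic X, Y have no TypeTag, so spark.implicits cannot derive
  // Encoder[(X, Y)]; Encoders.tuple builds it from the two bounds instead.
  implicit val pairEnc: Encoder[(X, Y)] =
    Encoders.tuple(implicitly[Encoder[X]], implicitly[Encoder[Y]])
  spark.createDataset(m.rdd.zip(n.rdd))
}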

Re-sharded kinesis stream starts generating warnings after kinesis shard numbers were doubled

2017-09-13 Thread Mikhailau, Alex
Has anyone seen the following warnings in the log after a Kinesis stream has been re-sharded? com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask WARN Cannot get the shard for this ProcessTask, so duplicate KPL user records in the event of resharding will not be dropped during d

how sequence of chained jars in spark.(driver/executor).extraClassPath matters

2017-09-13 Thread Richard Xin
So let's say I have a chained path in spark.driver.extraClassPath/spark.executor.extraClassPath such as /path1/*:/path2/*, and I have different versions of the same jar under those two directories. How does Spark pick which version of the jar to use? From /path1/*? Thanks.
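A quick way to check empirically (a sketch, not from the thread; the class name is hypothetical): the JVM resolves classes in classpath order, so a class present in both directories should load from the /path1/* entry, and you can ask the classloader which jar a class actually came from.

// Replace com.example.SomeClass with a class that exists in both jars.
val loadedFrom = Class.forName("com.example.SomeClass")
  .getProtectionDomain.getCodeSource.getLocation
println(loadedFrom)  // prints the jar URL the class was loaded from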

Should I use Dstream or Structured Stream to transfer data from source to sink and then back from sink to source?

2017-09-13 Thread kant kodali
Hi All, I am trying to read data from Kafka, insert it into Mongo, then read from Mongo and insert back into Kafka. I went with the Structured Streaming approach first, however I believe I am making some naive error because my map operations are not getting invoked. The pseudo code looks like this: DataSet r
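For reference, a minimal Structured Streaming read from Kafka (a sketch with assumed broker and topic names, not the poster's code). One pitfall consistent with "map not invoked": transformations are lazy, so nothing executes until a query is started with writeStream ... .start().

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-sketch").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
  .option("subscribe", "input-topic")                   // assumed topic
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]

// map runs only once the query below is started.
val query = lines.map(_.toUpperCase)
  .writeStream
  .format("console")
  .start()
query.awaitTermination()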

Re: compile error: No classtag available while calling RDD.zip()

2017-09-13 Thread Anastasios Zouzias
Hi there, if it is OK with you to work with DataFrames, you can do https://gist.github.com/zouzias/44723de11222535223fe59b4b0bc228c

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, IntegerType, LongType}

val df = sc.parallelize(Seq( (1.0, 2.0), (0.0, -1.0

Re: RDD order preservation through transformations

2017-09-13 Thread lucas.g...@gmail.com
I'm wondering why you need order preserved. We've had situations where keeping the source order as an artificial field in the dataset was important, and I had to run contortions to inject that (in this case the datasource had no unique key). Is this similar? On 13 September 2017 at 10:46, Suzen, Mehmet
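For the artificial-key contortion, a common trick (a sketch, not the poster's actual code) is zipWithUniqueId, which tags each element with a unique Long without triggering a shuffle:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("key-sketch").setMaster("local[2]"))

// zipWithUniqueId assigns a unique Long per element with no extra job;
// zipWithIndex gives consecutive 0..n-1 indices but runs one extra job
// to count partition sizes first.
val records = sc.parallelize(Seq("a", "b", "c", "d"))
val keyed = records.zipWithUniqueId().map { case (rec, id) => (id, rec) }

keyed.collect().foreach(println)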

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
But what happens if one of the partitions fails? How does fault tolerance recover the elements in the other partitions? On 13 Sep 2017 18:39, "Ankit Maloo" wrote: > AFAIK, the order of an RDD is maintained within a partition for map > operations. There is no way a map operation can change the sequence within a >

Re: RDD order preservation through transformations

2017-09-13 Thread Ankit Maloo
AFAIK, the order of an RDD is maintained within a partition for map operations. There is no way a map operation can change the sequence within a partition, as the partition is local and computation happens one record at a time. On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" wrote: I think the order has no meani

Re: Chaining Spark Streaming Jobs

2017-09-13 Thread Sunita Arvind
Thanks for your suggestion, Vincent. I do not have much experience with Akka as such. I will explore this option. On Tue, Sep 12, 2017 at 11:01 PM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote: > What about chaining with Akka or Akka Streams and the fair scheduler? > > On 13 Sept. 2017

Re: RDD order preservation through transformations

2017-09-13 Thread Suzen, Mehmet
I think the order has no meaning in RDDs; see this post, especially the zip methods: https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method

RDD order preservation through transformations

2017-09-13 Thread johan.grande.ext
Hi, I'm a beginner using Spark with Scala and I'm having trouble understanding ordering in RDDs. I understand that RDDs are ordered (as they can be sorted) but that some transformations don't preserve order. How can I know which transformations preserve order and which don't? Regarding map, fo
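A small sketch of the usual rule of thumb (an illustration, not from the thread): narrow transformations like map, filter, and flatMap preserve the order of elements within each partition, while anything that shuffles (repartition, groupByKey, distinct) generally does not.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("order-sketch").setMaster("local[2]"))
val sorted = sc.parallelize(1 to 10).sortBy(identity)

// map preserves within-partition order, so this keeps the sorted sequence:
println(sorted.map(_ * 2).collect().mkString(","))   // 2,4,...,20

// repartition shuffles, so the collected order is no longer guaranteed:
println(sorted.repartition(4).collect().mkString(","))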

Re: Minimum cost flow problem solving in Spark

2017-09-13 Thread Michael Malak
You might be interested in "Maximum Flow implementation on Spark GraphX" done by a Colorado School of Mines grad student a couple of years ago. http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx

Minimum cost flow problem solving in Spark

2017-09-13 Thread Swapnil Shinde
Hello, has anyone used Spark to solve minimum cost flow problems? I am quite new to combinatorial optimization algorithms, so any help, suggestions, or libraries would be very much appreciated. Thanks, Swapnil

HiveThriftserver does not seem to respect partitions

2017-09-13 Thread Yana Kadiyska
Hi folks, I have created a table in the following manner:

CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition (
  list of columns
)
COMMENT 'User Information'
PARTITIONED BY (account_id String, product String, group_id String,
  year String, month String, day String)
STORED AS

compile error: No classtag available while calling RDD.zip()

2017-09-13 Thread 沈志宏
Hello, since Dataset has no zip(..) method, I wrote the following code to zip two datasets:

def zipDatasets[X: Encoder, Y: Encoder](spark: SparkSession, m: Dataset[X], n: Dataset[Y]) = {
  val rdd = m.rdd.zip(n.rdd);
  import spark.implicits._

[Structured Streaming] Multiple sources best practice/recommendation

2017-09-13 Thread JG Perrin
Hi, I have different files being dumped on S3; I want to ingest them and join them. What sounds better to you: one "directory" for all, or one per file format? If I have one directory for all, can you get some metadata about the file, like its name? If multiple directories, how can I h
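On the metadata question, a small sketch (the S3 path and CSV format are assumptions for illustration): Spark SQL exposes input_file_name(), which attaches the source file's name to each row at read time.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder.appName("file-name-sketch").getOrCreate()

// Hypothetical bucket/layout; input_file_name() records which file
// each row came from, which also helps when joining mixed dumps later.
val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/dumps/*")
  .withColumn("source_file", input_file_name())

df.select("source_file").distinct().show(false)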

[Spark Dataframe] How can I write a correct filter so the Hive table partitions are pruned correctly

2017-09-13 Thread Patrick Duin
Hi Spark users, I've got an issue where I wrote a filter on a Hive table using DataFrames, and despite setting spark.sql.hive.metastorePartitionPruning=true, no partitions are being pruned. In short, doing this: table.filter("partition=x or partition=y") will result in Spark fetching all partitio
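A sketch of how one might narrow this down (illustrative only; the column name "partition" and the table name are placeholders): express the predicate with Column expressions instead of a string, and inspect the physical plan to see whether a partition filter is actually pushed down.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
import spark.implicits._

val table = spark.table("some_db.some_table")   // hypothetical table

// Equivalent predicate written with Column expressions:
val filtered = table.filter($"partition".isin("x", "y"))

// explain(true) prints the analyzed/optimized/physical plans; a pruned
// scan shows the partition predicate in the plan rather than a full scan.
filtered.explain(true)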