Hi
Using Spark 2.1.1.2.6-1.0-129 (from Hortonworks distro) and Scala 2.11.8
and Java 1.8.0_60
I have a NiFi flow that produces more records than the Spark stream can
process within the batch time. To avoid overflowing the Spark queue I wanted
to try Spark Streaming backpressure (it did not work for me), so back to the more
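(The message is cut off above. For reference only, a minimal sketch of the Spark Streaming backpressure settings it mentions; the app name, rate cap, and batch interval below are illustrative placeholders, not taken from the original mail.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: enable backpressure and cap the receiver rate as a fallback.
val conf = new SparkConf()
  .setAppName("nifi-to-spark")                          // placeholder app name
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.receiver.maxRate", "1000")      // records/sec, illustrative
val ssc = new StreamingContext(conf, Seconds(10))       // illustrative batch interval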
I think it is one of the conceptual differences in Spark compared to other
languages: there is no indexing in plain RDDs. This was the thread with
Ankit:
Yes. So order preservation cannot be guaranteed in the case of failure. Also
I am not sure if partitions are ordered. Can you get the same
sequence of
Thanks for your reply!
Actually, it is OK when I use RDD.zip() like this:
def zipDatasets(m: Dataset[String], n: Dataset[Int]) = {
  val spark = m.sparkSession
  import spark.implicits._  // brings the (String, Int) tuple Encoder into scope
  spark.createDataset(m.rdd.zip(n.rdd))
}
But in my project the element types of the Datasets are designated by the
caller, so I introduce X, Y:
def zipDatasets[X
Has anyone seen the following warnings in the log after a Kinesis stream has
been re-sharded?
com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask
WARN Cannot get the shard for this ProcessTask, so duplicate KPL user records
in the event of resharding will not be dropped during d
So let's say I have chained paths in
spark.driver.extraClassPath/spark.executor.extraClassPath, such as
/path1/*:/path2/*, and I have different versions of the same jar under those two
directories. How does Spark pick which version of the jar to use? The one from /path1/*?
Thanks.
Hi All,
I am trying to read data from Kafka, insert into Mongo, and then read from Mongo
and insert back into Kafka. I went with the structured streaming approach first,
however I believe I am making some naive error because my map operations
are not getting invoked.
The pseudocode looks like this:
DataSet r
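(The preview is cut off above. Not the original poster's code, but for reference a minimal Structured Streaming read from Kafka looks roughly like this; the broker address, topic, and checkpoint path are placeholders. Note that transformations such as map only run once a streaming query is started.)

import org.apache.spark.sql.SparkSession

// Requires the spark-sql-kafka-0-10 artifact on the classpath.
val spark = SparkSession.builder().appName("kafka-roundtrip").getOrCreate()

// Placeholder broker and topic names.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Per-record logic goes into transformations on `input`; they execute only
// after a query is started against the stream.
val query = input.writeStream
  .format("console")                                // stand-in sink for debugging
  .option("checkpointLocation", "/tmp/checkpoint")  // placeholder path
  .start()

query.awaitTermination()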
Hi there,
If it is OK with you to work with DataFrames, you can do
https://gist.github.com/zouzias/44723de11222535223fe59b4b0bc228c
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, IntegerType, LongType}
val df = sc.parallelize(Seq(
  (1.0, 2.0), (0.0, -1.0
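(The snippet and the gist are cut off above; the following is not the gist's content, just a hedged sketch of the general technique the imports suggest: attaching an explicit row index and rebuilding the DataFrame with a Row/StructType schema. Column names and types are assumptions.)

// In spark-shell. Sketch only: pin an ordering onto the data by attaching a
// row index via zipWithIndex, then rebuild the DataFrame with an extended schema.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, LongType, StructField, StructType}

val data = sc.parallelize(Seq((1.0, 2.0), (0.0, -1.0)))

val schema = StructType(Seq(
  StructField("a", DoubleType, nullable = false),
  StructField("b", DoubleType, nullable = false),
  StructField("idx", LongType, nullable = false)))

val rows = data.zipWithIndex.map { case ((a, b), idx) => Row(a, b, idx) }
val withIndex = spark.createDataFrame(rows, schema)
withIndex.show()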
I'm wondering why you need order preserved; we've had situations where
keeping the source as an artificial field in the dataset was important and
I had to go through contortions to inject that (in this case the data source had no
unique key).
Is this similar?
On 13 September 2017 at 10:46, Suzen, Mehmet
But what happens if one of the partitions fails? How does fault tolerance
recover the elements in the other partitions?
On 13 Sep 2017 18:39, "Ankit Maloo" wrote:
> AFAIK, the order of an RDD is maintained within a partition for map
> operations. There is no way a map operation can change the sequence within a
>
AFAIK, the order of an RDD is maintained within a partition for map
operations. There is no way a map operation can change the sequence within a
partition, as a partition is local and computation happens one record at a
time.
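(A small illustration of that point, not from the original thread; spark-shell, toy data.)

// Tag each element with its partition index before and after a map, and check
// that the relative order inside each partition is unchanged.
val rdd = sc.parallelize(1 to 10, 2)

val before = rdd.mapPartitionsWithIndex((i, it) => it.map(x => (i, x))).collect()
val after  = rdd.map(_ * 10).mapPartitionsWithIndex((i, it) => it.map(x => (i, x))).collect()

println(before.mkString(", "))  // (0,1) ... (0,5), (1,6) ... (1,10)
println(after.mkString(", "))   // same per-partition order, values multiplied by 10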
On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" wrote:
I think the order has no meani
Thanks for your suggestion, Vincent. I do not have much experience with Akka
as such. I will explore this option.
On Tue, Sep 12, 2017 at 11:01 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:
> What about chaining with Akka or Akka Streams and the fair scheduler?
>
> On 13 Sept 2017
I think the order has no meaning in RDDs; see this post, especially the zip methods:
https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
Hi,
I'm a beginner using Spark with Scala and I'm having trouble understanding
ordering in RDDs. I understand that RDDs are ordered (as they can be sorted)
but that some transformations don't preserve order.
How can I know which transformations preserve order and which don't? Regarding
map, fo
You might be interested in "Maximum Flow implementation on Spark GraphX" done
by a Colorado School of Mines grad student a couple of years ago.
http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx
From: Swapnil Shinde
To: user@spark.ap
Hello
Has anyone used Spark to solve minimum cost flow problems? I am quite new
to combinatorial optimization algorithms, so any help, suggestions, or
libraries would be much appreciated.
Thanks
Swapnil
Hi folks, I have created a table in the following manner:
CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition (
list of columns
)
COMMENT 'User Information'
PARTITIONED BY (account_id String,
product String,
group_id String,
year String,
month String,
day String)
STORED AS
Hello, since Dataset has no zip(..) method, I wrote the following code to zip
two Datasets:
def zipDatasets[X: Encoder, Y: Encoder](spark: SparkSession, m: Dataset[X], n: Dataset[Y]) = {
  val rdd = m.rdd.zip(n.rdd)
  import spark.implicits._
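(The preview is cut off above. A completed version along these lines is sketched below; it is not the original poster's code. The main assumption is that an Encoder for the pair can be built with Encoders.tuple from the two implicit encoders.)

import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Sketch: zip two Datasets element-wise by dropping to the RDD level.
// As with RDD.zip, both inputs must have the same number of partitions
// and the same number of elements per partition.
def zipDatasets[X: Encoder, Y: Encoder](spark: SparkSession,
                                        m: Dataset[X],
                                        n: Dataset[Y]): Dataset[(X, Y)] = {
  val zipped = m.rdd.zip(n.rdd)
  implicit val pairEncoder: Encoder[(X, Y)] =
    Encoders.tuple(implicitly[Encoder[X]], implicitly[Encoder[Y]])
  spark.createDataset(zipped)
}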
Hi,
I have different files being dumped on S3; I want to ingest them and join them.
What sounds better to you: one "directory" for all, or one per file
format?
If I have one directory for all, can I get some metadata about the file, such as
its name?
If multiple directories, how can I h
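(The question is cut off above. Not from the original thread, but on the file-name point: Spark exposes the source file of each record via the input_file_name() function. A quick sketch, with a placeholder path and format:)

import org.apache.spark.sql.functions.input_file_name

// Tag every record with the S3 object it was read from.
val tagged = spark.read.json("s3a://bucket/ingest/")
  .withColumn("source_file", input_file_name())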
Hi Spark users,
I've got an issue where I wrote a filter on a Hive table using DataFrames
and, despite setting
spark.sql.hive.metastorePartitionPruning=true, no partitions are being
pruned.
In short:
Doing this: table.filter("partition=x or partition=y") will result in Spark
fetching all partitio
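(Cut off above. Not from the original mail, but a minimal way to check whether pruning kicks in; the table name, column name, and values below are placeholders.)

// In spark-shell, with Hive support enabled.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

val t = spark.table("some_partitioned_table")
val filtered = t.filter("part = 'x' OR part = 'y'")

// Pushed-down partition predicates (if any) show up in the physical plan.
filtered.explain(true)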