Hi, I have a Spark job that produces duplicates when one or more tasks from the repartition stage fail.
Here is the simplified code:
sparkContext.setCheckpointDir("hdfs://path-to-checkpoint-dir")
val inputRDDs: List[RDD[String]] = List.empty // an RDD per input dir
val updatedRDDs = inputRDDs.map{ in
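(A minimal sketch of where this appears to be going, since the snippet is cut off; the update step and partition count are my assumptions, not the original code. Checkpointing the repartitioned RDD and materializing it means a retried task re-reads the checkpointed blocks instead of recomputing a non-deterministic shuffle.)

// hypothetical continuation, for illustration only
val updatedRDDs = inputRDDs.map { rdd =>
  val repartitioned = rdd.repartition(100)   // assumed partition count
  repartitioned.checkpoint()                 // cut the lineage at the shuffle boundary
  repartitioned.count()                      // first action after checkpoint() writes the checkpoint
  repartitioned
}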
Shubham,
DataSourceV2 passes Spark's internal representation to your source and
expects Spark's internal representation back from the source. That's why
you consume and produce InternalRow: "internal" indicates that Spark
doesn't need to convert the values.
Spark's internal representation for a d
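(For illustration only, not from the original message: a sketch of building a row in Spark's internal form for a two-column (String, Date) schema. Strings are carried as UTF8String and dates as an Int of days since the Unix epoch, so Spark never has to convert the values. The column values here are made up.)

import java.sql.Date
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.unsafe.types.UTF8String

val row: InternalRow = InternalRow(
  UTF8String.fromString("some-name"),                      // StringType -> UTF8String
  DateTimeUtils.fromJavaDate(Date.valueOf("2019-02-05"))   // DateType   -> Int days since epoch
)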
To be more explicit, the easiest thing to do in the short term is to use
your own instance of KafkaConsumer to get the offsets for the
timestamps you're interested in, using offsetsForTimes, and use those
for the start / end offsets. See
https://kafka.apache.org/10/javadoc/?org/apache/kafka/clients/c
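(A rough sketch of that approach; the bootstrap servers, topic, partition, and timestamp below are placeholders, and the group id is arbitrary.)

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("group.id", "offset-lookup")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val tp = new TopicPartition("my-topic", 0)
val ts: java.lang.Long = 1546387200000L   // epoch millis you care about

// earliest offset whose timestamp is >= ts (null if no such record)
val startOffset = consumer.offsetsForTimes(Collections.singletonMap(tp, ts)).get(tp).offset()
consumer.close()

// then feed startOffset in as the source's start offset, e.g. the JSON form
// {"my-topic":{"0":<startOffset>}} for the Kafka source's startingOffsets option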
That article is pretty old. If you click through to the JIRA mentioned
in it, https://issues.apache.org/jira/browse/SPARK-18580, you'll see
it's been resolved.
On Wed, Jan 2, 2019 at 12:42 AM JF Chen wrote:
>
> yes, 10 is a very low value for testing initial rate.
> And from this article
> https:
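(If it helps, these are the rate-limit settings the thread appears to be discussing; the values below are placeholders, not recommendations.)

val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "10")      // rate cap for the first batch(es) before backpressure kicks in
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")   // max records/sec read from each Kafka partition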
Hi All,
I am using a custom DataSourceV2 implementation (Spark version 2.3.2).
Here is how I am trying to pass in a date type from the spark shell.
scala> val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype").withColumn("datetype", col("datetype").cast("date"))
scala> df.write.format("com
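(On the source side the date column already arrives in internal form, an Int of days since 1970-01-01. A sketch of a writer converting it back, assuming the date is column 0 of the row; this is illustrative, not your implementation.)

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.DateTimeUtils

def write(record: InternalRow): Unit = {
  val days = record.getInt(0)                     // DateType is stored as Int days since epoch
  val javaDate = DateTimeUtils.toJavaDate(days)   // back to java.sql.Date if the sink needs it
  // ... hand javaDate to the underlying sink ...
}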