Hello,
I'm using Spark for my project these days, and I noticed that when loading data
stored in Hadoop HDFS, there seems to be a huge difference in JVM memory
size between using a DataFrame and using an RDD. Below is my shell script when
using spark-shell, my original
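For context, a minimal sketch of the two load paths being compared; the HDFS path is a placeholder, not from the original post:

// Load the same HDFS text data both ways and materialize it, then compare the
// sizes reported on the web UI's Storage tab (path below is hypothetical).
val df = spark.read.textFile("hdfs:///path/to/mydata")            // DataFrame/Dataset API
val rdd = spark.sparkContext.textFile("hdfs:///path/to/mydata")   // RDD API
df.cache().count()
rdd.cache().count()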
Hi all,
I get a Dataset[Row] through the following code:
val df: Dataset[Row] =
  spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
After that I want to collect it to the driver:
val df_rows: Array[Row] = df.collect()
The Spark web UI shows that all tasks have run successfully
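If the collect itself overruns driver memory, a minimal sketch of pulling rows to the driver incrementally instead; toLocalIterator is a standard Dataset method, and simply raising --driver-memory on spark-shell is the other obvious option:

// Iterate over the rows partition by partition on the driver instead of
// materializing the whole Array[Row] at once.
import scala.collection.JavaConverters._
val rows = df.toLocalIterator().asScala
rows.take(10).foreach(println)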
It’s not intermittent; it seems to happen every time: Spark fails when it starts
up from the last checkpoint and complains that the offset is old. I checked the
offset, and it is indeed true that the offset has expired on the Kafka side. My
version of Spark is 2.4.4, using Kafka 0.10.
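For reference, the Kafka source's failOnDataLoss option is the switch that governs this failure; a minimal sketch, with broker address and topic name as placeholders:

// With failOnDataLoss=false the query logs a warning and moves on instead of
// failing when the checkpointed offsets have already expired on the Kafka side.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "mytopic")
  .option("failOnDataLoss", "false")
  .load()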
On Sun, Apr 19, 2020 at 3:38 PM Jungtaek
Deleting the latest .compact file would lose the ability to guarantee exactly-once
and lead Spark to fail to read from the output directory. If you're reading the
output directory from non-Spark applications, then the metadata on the output
directory doesn't matter, but there's no exactly-once guarantee (exactly-once is
achieved by leveraging the metadata).
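To illustrate what that metadata is: the file sink keeps its commit log under _spark_metadata inside the output directory, and a Spark batch read of that directory consults it; a minimal sketch, the path is a placeholder:

// Spark only sees files recorded in <outputDir>/_spark_metadata (i.e. files
// committed by completed batches); a non-Spark reader just lists the directory
// and sees every file, committed or not.
val committed = spark.read.parquet("hdfs:///streaming/output")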
Did you provide more records to the topic "after" you started the query? That's
the only cause I can imagine based on this information.
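If the records were already in the topic before the query started, the startingOffsets option is the relevant knob; a rough sketch, broker and topic are placeholders:

// A new query defaults to startingOffsets=latest and only picks up records
// produced after it starts; "earliest" reads the pre-existing backlog as well.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "mytopic")
  .option("startingOffsets", "earliest")
  .load()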
On Fri, Apr 17, 2020 at 9:13 AM Ruijing Li wrote:
> Hi all,
>
> Apologies if this has been asked before, but I could not find the answer
> to this question. We have a
You may want to check "where" the job is stuck by taking a thread dump - it
could be in the Kafka consumer, in the Spark codebase, etc. Without that
information it's hard to say.
On Thu, Apr 16, 2020 at 4:22 PM Ruijing Li wrote:
> Thanks Jungtaek, that makes sense.
>
> I tried Burak’s solution of just tur
That sounds odd. Is it intermittent, or always reproducible if you start with
the same checkpoint? What's the version of Spark?
On Fri, Apr 17, 2020 at 6:17 AM Ruijing Li wrote:
> Hi all,
>
> I have a question on how structured streaming does checkpointing. I’m
> noticing that Spark is not reading
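For readers following the thread, checkpointing is configured per query through the checkpointLocation option; a minimal sketch with a built-in test source, all paths are placeholders:

// The checkpoint directory holds the offset and commit logs that Spark reads
// back on restart to decide where to resume.
val input = spark.readStream.format("rate").load()
val query = input.writeStream
  .format("parquet")
  .option("path", "hdfs:///streaming/output")
  .option("checkpointLocation", "hdfs:///streaming/checkpoint")
  .start()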
Many thanks, Ayan.
I tried that as well, as follows:
val broadcastValue = "123456789" // I assume this will be sent as a
constant for the batch
val df = spark.read.
format("com.databricks.spark.xml").
option("rootTag", "hierarchy").
option("rowTag",