Hi
@Marco, the multiple rows written are not dupes, as the current-timestamp field
is different in each of them.
@Ayan, I checked and found that my whole code is rerun twice. Although there
seems to be no error, is the cluster manager's re-run behaviour configurable?
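For what it's worth, if the cluster manager is YARN (an assumption; the thread does not say which one is in use), the number of application attempts can be capped so that a failed driver is not silently rerun. A minimal sketch, with a hypothetical app name:

import org.apache.spark.sql.SparkSession

// Sketch only: assumes the job runs on YARN in cluster mode. Capping attempts at 1
// stops YARN from retrying the ApplicationMaster, and with it the whole driver program.
val spark = SparkSession.builder()
  .appName("parquet-append-job")              // hypothetical app name
  .config("spark.yarn.maxAppAttempts", "1")   // otherwise yarn.resourcemanager.am.max-attempts applies (commonly 2)
  .getOrCreate()

The same setting can also be passed at submit time via --conf spark.yarn.maxAppAttempts=1.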
On Tue, Oct 17, 2017 at 6:45 PM, ayan g wrote:
It should not be parallel exec, as the logging code is called in the driver.
Have you checked whether your driver is rerun by the cluster manager due to
any failure or error situation?
On Tue, Oct 17, 2017 at 11:52 PM, Marco Mistroni wrote:
Hi
If the problem is really with parallel execution, you can try to call
repartition(1) before you save.
Alternatively, try to store the data in a CSV file and see if you get the same
behaviour, to exclude DynamoDB issues.
Also, are the multiple rows being written dupes (do they all have the same
fields/values)?
Hth
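A minimal sketch of the repartition(1) suggestion, assuming data_df is the DataFrame from the code shared in this thread and using a hypothetical output path:

// Sketch of the suggestion above, not the poster's actual code: collapsing to a
// single partition means only one task performs the save.
data_df
  .repartition(1)
  .write
  .mode("append")
  .parquet("hdfs:///data/output_parquet")   // hypothetical output path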
This is the code -
hdfs_path=
// choose the reader based on the file extension in the path
if(hdfs_path.contains(".avro")){
  data_df = spark.read.format("com.databricks.spark.avro").load(hdfs_path)
}else if(hdfs_path.contains(".tsv")){
  data_df = spark.read.option("delimiter","\t").option("header","true").csv(hdfs_path)
}else if(hdfs_path.c
Can you share your code?
On Tue, 17 Oct 2017 at 10:22 pm, Harsh Choudhary wrote:
Hi
I'm running a Spark job in which I am appending new data into a Parquet file.
At the end, I make a log entry in my DynamoDB table stating the number of
records appended, the time, etc. Instead of one single entry in the database,
multiple entries are being made to it. Is it because of parallel execution?
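For context, a minimal sketch of what such a driver-side DynamoDB log write could look like; the table name, attribute names, and use of the AWS SDK v1 client are illustrative assumptions, not the poster's actual code:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._

// Runs once in the driver after the Parquet append has finished.
val recordsAppended = data_df.count()                     // records written in this run
val dynamo = AmazonDynamoDBClientBuilder.defaultClient()  // credentials/region from the environment
val item = Map(
  "job_name"     -> new AttributeValue().withS("parquet-append"),                // hypothetical partition key
  "records"      -> new AttributeValue().withN(recordsAppended.toString),
  "logged_at_ms" -> new AttributeValue().withN(System.currentTimeMillis.toString)
).asJava
dynamo.putItem("job_log", item)                           // hypothetical table name

If a block like this is only called once in the driver, seeing it executed twice usually points to the whole application being re-attempted rather than to parallel execution of tasks.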