Hi Shiao-An Yuan,
I also ran into this correctness problem in my production environment.
My Spark version is 2.3.1, so I previously assumed it was caused by SPARK-23243.
But you said you also see this problem in your environment,
and your version is 2.4.4, which already contains the fix for SPARK-23243. So maybe this problem
Hi,
I am using Spark 2.4.4 in standalone mode.
On Mon, Jan 18, 2021 at 4:26 AM Sean Owen wrote:
> Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using?
>
> On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan
> wrote:
>
>> Hi folks,
>>
>> I finally found the root cause of this issue
Hi Shiao-An,
With regard to your set-up below, which I quote:
"The input/output files are parquet on GCS. The Spark version is 2.4.4
with standalone deployment. Workers running on GCP preemptible instances
and they being preempted very frequently."
Am I correct that you have foregone deploying Data
Hi,
I may be wrong, but this looks like a massively complicated solution for
what could have been a simple SQL query.
It always seems better to me to first reduce the complexity and then solve
it, rather than solve a problem which should not even exist in the first
place.
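For illustration only, the merge could be expressed as plain SQL over the
union; every table, column, and path name below is invented for the sketch,
not taken from this thread:
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Invented names throughout: snapshot/new_data tables, pkey/ts/payload columns.
spark.read.parquet("gs://bucket/snapshot").createOrReplaceTempView("snapshot")
spark.read.parquet("gs://bucket/new_data").createOrReplaceTempView("new_data")

// Keep the most recent row per pkey across the old snapshot and the new data.
val merged = spark.sql("""
  SELECT pkey, payload
  FROM (
    SELECT u.*, row_number() OVER (PARTITION BY pkey ORDER BY ts DESC) AS rn
    FROM (SELECT * FROM snapshot UNION ALL SELECT * FROM new_data) u
  ) ranked
  WHERE rn = 1
""")
```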
Regards,
Gourav
On Sun, Jan
Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using?
On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan
wrote:
> Hi folks,
>
> I finally found the root cause of this issue.
> It can be easily reproduced by the following code.
> We ran it in a standalone-mode environment with 4 cores * 4 instances
Hi folks,
I finally found the root cause of this issue.
It can be easily reproduced by the following code.
We ran it in a standalone-mode environment with 4 cores * 4 instances
(16 cores in total).
```
import org.apache.spark.TaskContext
import scala.sys.process._
import org.apache.spark.sql.functions._
```
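The archive preview cuts the snippet off after the imports. A hedged
reconstruction of the kind of reproduction described here (reduce by key,
round-robin repartition, then kill an executor so the stage retries) could
look like the following; the key function, partition counts, and the pkill
trick are assumptions, not the verbatim original:
```
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession
import scala.sys.process._

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// 100000 inputs mapped onto 50000 distinct keys (i / 2).
val ds = spark.range(0, 100000, 1, 130).as[Long].map(i => (i, i / 2))

val count = ds
  .groupByKey(_._2)            // group on the "pkey"-like column
  .reduceGroups((a, b) => a)   // keep one row per key
  .map(_._2)
  .repartition(200)            // round-robin repartition between stages
  .map { x =>
    // Simulate a preempted worker: kill this executor's JVM once, on the
    // first attempt of the first stage attempt, forcing a stage retry.
    if (TaskContext.get.stageAttemptNumber == 0 &&
        TaskContext.get.attemptNumber == 0 &&
        TaskContext.get.partitionId < 2) {
      throw new Exception("pkill -f -n java".!!)
    }
    x
  }
  .map(_._2)
  .distinct()
  .count()

// The correct result is 50000; a different count after the induced retry
// indicates duplicated or lost rows.
println(count)
```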
I don't think this addresses my comment at all. Please try correctly
implementing equals and hashCode for your key class first.
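For reference, a minimal sketch of such a fix for a key class with an Array
member; the class and field names are invented, not taken from this thread:
```
// The generated case-class equals/hashCode compare the Array member by
// reference, so both are overridden here to compare contents instead.
case class Key(id: Long, tags: Array[String]) {
  override def equals(other: Any): Boolean = other match {
    case that: Key => id == that.id && tags.sameElements(that.tags)
    case _         => false
  }
  // Hash the array through a Seq view so the hash is content-based.
  override def hashCode(): Int = (id, tags.toSeq).hashCode()
}
```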
On Tue, Dec 29, 2020 at 8:31 PM Shiao-An Yuan
wrote:
> Hi Sean,
>
> Sorry, I didn't describe it clearly. The column "pkey" is like a "Primary
> Key" and I do "reduce by
Hi Sean,
Sorry, I didn't describe it clearly. The column "pkey" is like a "Primary
Key" and I do "reduce by key" on this column, so the number of rows
should always equal the cardinality of "pkey".
When I said data gets duplicated & lost, I mean duplicated "pkey" values exist in
the output file (aft
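As a hedged illustration of that invariant (the column name "pkey" comes
from the thread, but the output path and check are invented):
```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder.getOrCreate()

// After a reduce-by-key pass, each pkey should appear exactly once, so the
// row count must equal the distinct-pkey count.
val out  = spark.read.parquet("gs://bucket/snapshot-v2")  // invented path
val rows = out.count()
val keys = out.agg(countDistinct("pkey")).head().getLong(0)
assert(rows == keys, s"pkey duplicated or lost: $rows rows vs $keys distinct keys")
```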
Total guess here, but your key is a case class. It does define hashCode and
equals for you, but you have an array as one of the members. Array
equality is by reference, so two arrays of the same elements are not
equal. You may have to define hashCode and equals manually to make them
correct.
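To make the pitfall concrete, a tiny sketch (names invented):
```
// Arrays compare by reference, and a generated case-class equals/hashCode
// inherits that behavior for Array members.
case class Key(id: Long, tags: Array[String])

val a = Key(1L, Array("x"))
val b = Key(1L, Array("x"))

println(Array("x") == Array("x")) // false: reference equality
println(a == b)                   // false, even though the contents match
println(a.hashCode == b.hashCode) // usually false too, so shuffles can misgroup keys
```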
On
Hi folks,
We recently identified a data correctness issue in our pipeline.
The data processing flow is as follows (a condensed sketch in code follows
the list):
1. read the current snapshot (provide empty if it doesn't exist yet)
2. read unprocessed new data
3. union them and do a `reduceByKey` operation
4. output a new version of the snapshot
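A condensed, hedged sketch of that flow, using Dataset
groupByKey/reduceGroups as a stand-in for `reduceByKey`; paths, the record
schema, and the merge function are all assumptions:
```
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Assumed record shape: a primary key plus a payload to merge.
case class Record(pkey: String, payload: String)

// 1. read the current snapshot (empty if it doesn't exist yet)
val snapshot = Try(spark.read.parquet("gs://bucket/snapshot").as[Record])
  .getOrElse(spark.emptyDataset[Record])

// 2. read unprocessed new data
val newData = spark.read.parquet("gs://bucket/new").as[Record]

// 3. union them and reduce by key, keeping one row per pkey
val merged = snapshot.union(newData)
  .groupByKey(_.pkey)
  .reduceGroups((a, b) => b)   // assumed merge: the later record wins
  .map(_._2)

// 4. output a new version of the snapshot
merged.write.parquet("gs://bucket/snapshot-v2")
```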