Re: Delta with intelligent upsert

2019-10-31 Thread Roland Johann
If the dataset contains a column like changed_at/created_at, you can use it as a watermark and filter out rows whose changed_at/created_at is before the watermark. Best Regards Roland Johann Software Developer/Data Engineer phenetic GmbH Lütticher Straße 10, 50674 Köln, Germany Mobil: +49 17
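A minimal sketch of that filter, assuming a Delta source with a changed_at timestamp column and a watermark carried over from the previous run (path, column name, and value are illustrative, not from the thread):

    import org.apache.spark.sql.functions.{col, lit}

    // Watermark carried over from the previous processing run (hypothetical value).
    val watermark = java.sql.Timestamp.valueOf("2019-10-01 00:00:00")

    // Keep only rows changed on or after the watermark.
    val changed = spark.read.format("delta").load("/data/events")
      .filter(col("changed_at") >= lit(watermark))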

Re: Delta with intelligent upsert

2019-10-31 Thread Gourav Sengupta
Should not a where clause on the partition field help with that? I am obviously missing something in the question. Regards, Gourav On Thu, Oct 31, 2019 at 9:15 PM ayan guha wrote: > Hi > we have a scenario where we have a large table, i.e. 5-6B records. The table > is a repository of data from
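For illustration, assuming the table is partitioned by a year column (the actual partition column is not shown in the thread), a where clause on it lets Spark prune old partitions so only the last M years are scanned:

    import org.apache.spark.sql.functions.col

    // Only the partitions for the last M years are read; older ones are pruned.
    val recent = spark.read.format("delta").load("/data/big_table")
      .where(col("year") >= 2017)   // hypothetical partition column and cutoff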

[Spark SQL]: Dataframe group by potential bug (Scala)

2019-10-31 Thread ludwiggj
This is using Spark 2.4.4 with Scala. I'm getting some very strange behaviour after reading a dataframe from a JSON file using sparkSession.read in permissive mode. I've included the error column when reading the data, as I want to log details of any errors in the input JSON file. My suspicion i
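For reference, a sketch of that kind of read, i.e. permissive mode with an explicit corrupt-record column (the schema and path are illustrative; the poster's actual schema is not shown in the preview):

    import org.apache.spark.sql.types._

    // The corrupt-record column must be declared in the schema for it to be populated.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType),
      StructField("_corrupt_record", StringType)
    ))

    val df = spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .json("/data/input.json")

    // Rows that failed to parse carry the raw text in _corrupt_record.

Note that since Spark 2.3 a query referencing only the corrupt-record column from a raw JSON source is rejected, so the error rows are usually cached or selected together with the data columns before filtering.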

Fwd: Delta with intelligent upsert

2019-10-31 Thread ayan guha
Hi, we have a scenario where we have a large table, i.e. 5-6B records. The table is a repository of data from the past N years. It is possible that some updates take place on the data, and thus we are using a Delta table. As part of the business process we know updates can happen only within M years of the past
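A hedged sketch of how such a constrained upsert could look with the Delta Lake Scala API, assuming `updates` is a DataFrame of changed rows and that the table has an id key and a year partition column (key, column, and cutoff are hypothetical):

    import io.delta.tables.DeltaTable

    val target = DeltaTable.forPath(spark, "/data/repository")

    // Restricting the match to the last M years keeps older partitions untouched.
    target.as("t")
      .merge(updates.as("u"), "t.id = u.id AND t.year >= 2017")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()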

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-31 Thread Nicolas Paris
Have you deactivated the spark.ui? I have read several threads explaining that the UI can lead to OOM because it stores 1000 DAGs by default. On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote: > Dear List, > > I've observed some sort of memory leak when using pyspark to run ~100 > jobs in loca
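For context, the UI and its retention limits are ordinary Spark configs; a minimal sketch of disabling the UI or shrinking what it retains (values are illustrative, and the same keys can be passed from PySpark or with spark-submit --conf):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("many-jobs")
      .config("spark.ui.enabled", "false")       // drop the UI entirely, or instead:
      .config("spark.ui.retainedJobs", "50")     // default is 1000
      .config("spark.ui.retainedStages", "100")  // default is 1000
      .getOrCreate()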

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-31 Thread Paul Wais
Well, dumb question: Given the workflow outlined above, should Local Mode keep running? Or is the leak a known issue? I just wanted to check because I can't recall seeing this issue with a non-local master, though it's possible there were task failures that hid the issue. If this issue looks ne

[Spark Streaming] Apply multiple ML pipelines(Models) to the same stream

2019-10-31 Thread Spico Florin
Hello! I have a use case where I have to apply multiple already trained models (e.g. M1, M2, ..., Mn) to the same Spark stream (fetched from Kafka). The models were trained using the isolation forest algorithm from here: https://github.com/titicaca/spark-iforest I have found something similar
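A minimal sketch of one common pattern, assuming each trained model was saved as a Spark ML PipelineModel whose stages support streaming transform (paths, topic, and sink are illustrative, not from the thread):

    import org.apache.spark.ml.PipelineModel

    // Load the already trained models M1..Mn (hypothetical paths).
    val models = Seq("/models/m1", "/models/m2", "/models/m3").map(PipelineModel.load)

    // One Kafka source, parsed into the feature columns the models expect.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      // ... build the feature vector column here ...

    // Each model gets its own streaming query over the same source.
    models.zipWithIndex.foreach { case (model, i) =>
      model.transform(events)
        .writeStream
        .queryName(s"model_$i")
        .format("console")   // illustrative sink
        .option("checkpointLocation", s"/tmp/checkpoints/model_$i")
        .start()
    }

    spark.streams.awaitAnyTermination()

Each query consumes the Kafka source independently; if a single pass over the data is required, the transforms can instead be chained onto one DataFrame before a single writeStream.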