Re: Missing / Duplicate Data when Spark retries

2020-09-10 Thread Ruijing Li
our code or data, but hard to > > say without knowing more. The lineage is fine and deterministic, but > > your data or operations might not be. > > > > On Thu, Sep 10, 2020 at 12:03 AM Ruijing Li wrote: > > > > > > Hi all, > > > > > >

Missing / Duplicate Data when Spark retries

2020-09-09 Thread Ruijing Li
this would happen, I don't have indeterministic data though. Anyone have encountered something similar or an inkling? Thanks! -- Cheers, Ruijing Li
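
A minimal Scala sketch (Spark 2.4) of the kind of non-determinism the reply points at: a column derived from rand() can change if a partition is recomputed after a task failure, while a value derived from the row's own data cannot. Column names and paths here are hypothetical, not taken from the thread.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object RetrySafety {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("retry-safety").getOrCreate()
    val df = spark.read.parquet("/data/input") // hypothetical path

    // Risky: rand() is re-evaluated when a partition is recomputed after a
    // failure, so downstream filters/joins keyed on it can drop or duplicate rows.
    val risky = df.withColumn("bucket", (rand() * 10).cast("int"))

    // Safer: derive the value deterministically from the row's own columns,
    // so recomputation always produces the same result.
    val deterministic = df.withColumn("bucket", pmod(hash(col("id")), lit(10)))

    deterministic.write.mode("overwrite").parquet("/data/output") // hypothetical path
    spark.stop()
  }
}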

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-05-06 Thread Ruijing Li
d, where XYZ is some integer value > representing the task ID that was launched on that executor. In case you're > running > this in local mode that thread would be located in the same Java thread > dump that you have already collected. > > > On Tue, Apr 21, 2020 at 9:51 PM

Re: Good idea to do multi-threading in spark job?

2020-05-06 Thread Ruijing Li
executing your code in one > JVM, and whatever synchronization that implies. > > On Sun, May 3, 2020 at 11:32 AM Ruijing Li wrote: > > > > Hi all, > > > > We have a spark job (spark 2.4.4, hadoop 2.7, scala 2.11.12) where we > use semaphores / parallel collections wit

Good idea to do multi-threading in spark job?

2020-05-03 Thread Ruijing Li
about any deadlocks and if it could mess with the fixes for issues such as this https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961 We do run with multiple cores. Thanks! -- Cheers, Ruijing Li
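
A minimal Scala sketch of one common alternative to semaphores / parallel collections for this use case: submitting independent Spark actions from the driver through a bounded thread pool of Futures, letting Spark's own scheduler run the resulting jobs concurrently. Paths and the pool size are hypothetical.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

object ConcurrentActions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("concurrent-actions").getOrCreate()
    // Bound concurrency explicitly so the driver does not launch unlimited jobs.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

    val inputs = Seq("/data/a", "/data/b", "/data/c") // hypothetical paths
    val jobs = inputs.map { path =>
      Future {
        // Each action is an independent Spark job; submitting jobs from
        // multiple driver threads is supported by the Spark scheduler.
        spark.read.parquet(path).count()
      }
    }
    val counts = Await.result(Future.sequence(jobs), 1.hour)
    println(counts)
    spark.stop()
  }
}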

Re: Using startingOffsets latest - no data from structured streaming kafka query

2020-04-22 Thread Ruijing Li
For some reason, after restarting the app and trying again, latest now works as expected. Not sure why it didn’t work before. On Tue, Apr 21, 2020 at 1:46 PM Ruijing Li wrote: > Yes, we did. But for some reason latest does not show them. The count is > always 0. > > On Sun, Apr 19,

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
aries in the dump then why not share the > thread dump? (I mean, the output of jstack) > > stack trace would be more helpful to find which thing acquired lock and > which other things are waiting for acquiring lock, if we suspect deadlock. > > On Wed, Apr 22, 2020 at 2:38 AM Rui

Re: Using startingOffsets latest - no data from structured streaming kafka query

2020-04-21 Thread Ruijing Li
; > On Fri, Apr 17, 2020 at 9:13 AM Ruijing Li wrote: > >> Hi all, >> >> Apologies if this has been asked before, but I could not find the answer >> to this question. We have a structured streaming job, but for some reason, >> if we use startingOffsets = l

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
waiting. Thanks On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li wrote: > Strangely enough I found an old issue that is the exact same issue as mine > https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343 > > However I’m using spark 2.4.4 so the issue should have been s

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
After refreshing a couple of times, I notice the lock is being swapped between these 3. The other 2 will be blocked by whoever gets this lock, in a cycle of 160 has lock -> 161 -> 159 -> 160 On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li wrote: > In thread dump, I do see this >

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Ruijing Li
> so maybe doing manually would be the only option. Not sure Spark UI will > > provide the same, haven't used at all.) > > > > It will tell which thread is being blocked (even if it's shown as running) > and > > which point to look at. > > > > On Th

Re: Understanding spark structured streaming checkpointing system

2020-04-19 Thread Ruijing Li
Jungtaek Lim wrote: > That sounds odd. Is it intermittent, or always reproducible if you starts > with same checkpoint? What's the version of Spark? > > On Fri, Apr 17, 2020 at 6:17 AM Ruijing Li wrote: > >> Hi all, >> >> I have a question on how structured streami

Using startingOffsets latest - no data from structured streaming kafka query

2020-04-16 Thread Ruijing Li
“ Fetcher [Consumer] Resetting offset for partition to offset” over and over again.. However with startingOffsets=earliest, we don’t get this issue. I’m wondering then how we can use startingOffsets=latest as I wish to start from the latest offset available. -- Cheers, Ruijing Li
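
A minimal Scala sketch of a structured streaming Kafka source with startingOffsets, which is what this thread is about. One documented behaviour worth noting: startingOffsets only applies to a brand-new query; once a checkpoint exists, the offsets stored there take precedence on restart. Broker, topic, and paths below are hypothetical.

import org.apache.spark.sql.SparkSession

object KafkaLatest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-latest").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical
      .option("subscribe", "events")                     // hypothetical topic
      .option("startingOffsets", "latest") // "earliest" replays retained data
      .load()

    val query = stream
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("parquet")
      .option("path", "/data/out")                 // hypothetical
      .option("checkpointLocation", "/chk/events") // hypothetical
      .start()

    query.awaitTermination()
  }
}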

Understanding spark structured streaming checkpointing system

2020-04-16 Thread Ruijing Li
restarting it, I see it instead reads from offset file 9 which contains {1:1000} Can someone explain why spark doesn’t take the max offset? Thanks. -- Cheers, Ruijing Li

Re: Spark structured streaming - Fallback to earliest offset

2020-04-16 Thread Ruijing Li
gh most probably you'll need to do > former), but if you can't make sure and if you understand the risk then yes > you can turn off the option and take the risk. > > > On Wed, Apr 15, 2020 at 9:24 AM Ruijing Li wrote: > >> I see, I wasn’t sure if that would wor

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-16 Thread Ruijing Li
' ' END "INFO" >> FROM >> v$process p >> ,v$session a >> ,v$sess_io b >> WHERE >> a.paddr = p.addr >> AND p.background IS NULL >> --AND a.sid NOT IN (select sid from v$mystat where rownum = 1) >> AND a.

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
eam? > > On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote: > >> Hi all, >> >> I have a spark structured streaming app that is consuming from a kafka >> topic with retention set up. Sometimes I face an issue where my query has >> not finished processing a mess

Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Ruijing Li
cannot set that. How do I do this for structured streaming? Thanks! -- Cheers, Ruijing Li
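
A minimal Scala sketch of the option this thread discusses: when Kafka retention has already deleted offsets the query expects, failOnDataLoss=false lets the query continue instead of failing, at the cost of silently losing those records. Broker, topic, and paths are hypothetical.

import org.apache.spark.sql.SparkSession

object FailOnDataLossExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fail-on-data-loss").getOrCreate()
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical
      .option("subscribe", "events")                     // hypothetical topic
      .option("failOnDataLoss", "false") // accept the risk of missed records
      .load()

    stream.writeStream
      .format("console")
      .option("checkpointLocation", "/chk/events")       // hypothetical
      .start()
      .awaitTermination()
  }
}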

Spark hangs while reading from jdbc - does nothing

2020-04-10 Thread Ruijing Li
sometimes it stops at 29 completed stages and doesn’t start the last stage. The spark job is idling and there is no pending or active task. What could be the problem? Thanks. -- Cheers, Ruijing Li
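
One frequent cause of a "hanging" last stage on JDBC sources is an unpartitioned read, where the whole table is pulled through a single task. Below is a minimal Scala sketch of a partitioned JDBC read; the URL, table, column, bounds, and credentials are hypothetical, not taken from the thread.

import org.apache.spark.sql.SparkSession

object JdbcRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE") // hypothetical
      .option("dbtable", "SCHEMA.BIG_TABLE")                    // hypothetical
      .option("user", "app_user")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .option("partitionColumn", "id") // numeric/date column, hypothetical
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "20")   // read runs as 20 parallel tasks
      .option("fetchsize", "10000")    // rows fetched per round trip
      .load()

    df.write.mode("overwrite").parquet("/data/big_table") // hypothetical
    spark.stop()
  }
}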

Re: Can you view thread dumps on spark UI if job finished

2020-04-08 Thread Ruijing Li
information on how to use this tool in the spark > documentation https://spark.apache.org/docs/latest/monitoring.html > > > > > > On Wed, 8 Apr 2020, 23:47 Ruijing Li, wrote: > >> Hi all, >> >> As stated in title, currently when I view the spark UI of a comp

Can you view thread dumps on spark UI if job finished

2020-04-08 Thread Ruijing Li
Hi all, As stated in title, currently when I view the spark UI of a completed spark job, I see there are thread dump links in the executor tab, but clicking on them does nothing. Is it possible to see the thread dumps somehow even if the job finishes? On spark 2.4.5. Thanks. -- Cheers, Ruijing

Re: can we all help use our expertise to create an IT solution for Covid-19

2020-03-26 Thread Ruijing Li
> On March 26, 2020 3:41 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > Hi all, > > Do you think we can create a global solution in the cloud using volunteers like us and third party employees. What I have in mind is to create a comprehensive real time solution to get data from various countries, universities pushed into a fast database through Kafka and Spark and used downstream for greater analytics. I am sure likes of Google etc. will provide free storage and likely many vendors will grab the opportunity. > > We can then donate this to WHO or others and we can make it very modular through microservices etc. > > I hope this does not sound futuristic. > > Regards, > > Dr Mich Talebzadeh -- Cheers, Ruijing Li

ForEachBatch collecting batch to driver

2020-03-10 Thread Ruijing Li
into the driver? -- Cheers, Ruijing Li
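
A minimal Scala sketch of foreachBatch with Trigger.Once, as used in this thread. The batchDF passed to the function is a regular distributed DataFrame: the write below runs on the executors, and nothing is pulled to the driver unless collect() is called explicitly. Broker, topic, and paths are hypothetical.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object ForEachBatchExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreach-batch").getOrCreate()
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical
      .option("subscribe", "events")                     // hypothetical topic
      .load()

    val query = stream.writeStream
      .trigger(Trigger.Once())
      .option("checkpointLocation", "/chk/events")       // hypothetical
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // Distributed write; only batchId (a Long) lives on the driver.
        batchDF.write.mode("append").parquet(s"/data/out/batch=$batchId")
      }
      .start()

    query.awaitTermination()
  }
}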

Re: Schema store for Parquet

2020-03-09 Thread Ruijing Li
Thanks Magnus, I’ll explore Atlas and see what I can find. On Wed, Mar 4, 2020 at 11:10 AM Magnus Nilsson wrote: > Apache Atlas is the apache data catalog. Maybe want to look into that. It > depends on what your use case is. > > On Wed, Mar 4, 2020 at 8:01 PM Ruijing Li wrote:

Re: Schema store for Parquet

2020-03-04 Thread Ruijing Li
10:35, Magnus Nilsson wrote: > >> Google hive metastore. >> >> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote: >> >>> Hi all, >>> >>> Has anyone explored efforts to have a centralized storage of schemas of >>> different parquet files

Schema store for Parquet

2020-03-04 Thread Ruijing Li
Hi all, Has anyone explored efforts to have a centralized storage of schemas of different parquet files? I know there is schema management for Avro, but couldn’t find solutions for parquet schema management. Thanks! -- Cheers, Ruijing Li
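
A minimal Scala sketch of the Hive-metastore approach suggested in the replies: register each Parquet dataset as an external table so its schema is centrally discoverable without touching the files. Database, table, and path names are hypothetical, and a configured Hive metastore is assumed.

import org.apache.spark.sql.SparkSession

object ParquetSchemaCatalog {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-schema-catalog")
      .enableHiveSupport() // requires a configured Hive metastore
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS datasets")
    spark.sql(
      """CREATE TABLE IF NOT EXISTS datasets.events
        |USING PARQUET
        |LOCATION '/data/events'""".stripMargin) // hypothetical path

    // The schema is now queryable centrally, without reading the files:
    spark.sql("DESCRIBE TABLE datasets.events").show(truncate = false)
    spark.catalog.listColumns("datasets", "events").show(truncate = false)
    spark.stop()
  }
}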

Re: Integration testing Framework Spark SQL Scala

2020-02-25 Thread Ruijing Li
Just wanted to follow up on this. If anyone has any advice, I’d be interested in learning more! On Thu, Feb 20, 2020 at 6:09 PM Ruijing Li wrote: > Hi all, > > I’m interested in hearing the community’s thoughts on best practices to do > integration testing for spark sql jobs. We

Integration testing Framework Spark SQL Scala

2020-02-20 Thread Ruijing Li
sparksession locally or testing with spark-shell. Ideally, we’d like some sort of docker container emulating hdfs and spark cluster mode, that you can run locally. Any test framework, tips, or examples people can share? Thanks! -- Cheers, Ruijing Li
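
A minimal Scala sketch of the local-SparkSession style of test mentioned above, assuming ScalaTest 3.0.x on the test classpath; the addRevenue transformation is a hypothetical example. A Docker-based HDFS/cluster harness can be layered on top of the same structure later.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.scalatest.{BeforeAndAfterAll, FunSuite}

object Transformations {
  // Hypothetical logic under test: a pure DataFrame => DataFrame function,
  // so it can be exercised without a real cluster or HDFS.
  def addRevenue(df: DataFrame): DataFrame =
    df.withColumn("revenue", col("price") * col("quantity"))
}

class TransformationsSpec extends FunSuite with BeforeAndAfterAll {
  private lazy val spark = SparkSession.builder()
    .master("local[2]")
    .appName("integration-test")
    .getOrCreate()

  override def afterAll(): Unit = spark.stop()

  test("addRevenue multiplies price by quantity") {
    import spark.implicits._
    val input = Seq(("a", 2.0, 3L), ("b", 1.5, 2L)).toDF("id", "price", "quantity")
    val result = Transformations.addRevenue(input)
      .collect().map(_.getAs[Double]("revenue"))
    assert(result.toSeq == Seq(6.0, 3.0))
  }
}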

Re: Better way to debug serializable issues

2020-02-20 Thread Ruijing Li
erialization.extendedDebugInfo=true > > Maxim Gekk > > Software Engineer > > Databricks, Inc. > > > On Tue, Feb 18, 2020 at 1:02 PM Ruijing Li wrote: > >> Hi all, >> >> When working with spark jobs, I sometimes have to tackle with >> serializa

Better way to debug serializable issues

2020-02-18 Thread Ruijing Li
generic classes or the class Spark is running itself). Thanks! -- Cheers, Ruijing Li
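
A minimal Scala sketch of wiring up the -Dsun.io.serialization.extendedDebugInfo=true flag mentioned in the reply above, so Task-not-serializable traces show which object graph dragged the non-serializable reference in. For the driver-side JVM the flag typically has to be passed at launch (e.g. via spark-submit --driver-java-options) rather than set in the builder.

import org.apache.spark.sql.SparkSession

object SerializationDebug {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serialization-debug")
      .config("spark.executor.extraJavaOptions",
        "-Dsun.io.serialization.extendedDebugInfo=true")
      .getOrCreate()
    // ... job code whose closures you are debugging ...
    spark.stop()
  }
}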

Re: Best way to read batch from Kafka and Offsets

2020-02-15 Thread Ruijing Li
for your help! On Wed, Feb 5, 2020 at 7:07 PM Ruijing Li wrote: > Looks like I’m wrong, since I tried that exact snippet and it worked > > So to be clear, in the part where I do batchDF.write.parquet, that is not > the exact code I’m using. > > I’m using a custom write function

Spark 2.4.4 has bigger memory impact than 2.3?

2020-02-15 Thread Ruijing Li
memory than previous versions of spark? I’d be interested to know if anyone else has this issue. We are on scala 2.11.12 on java 8 -- Cheers, Ruijing Li

Re: Best way to read batch from Kafka and Offsets

2020-02-05 Thread Ruijing Li
function isn’t working correctly Is batchDF a static dataframe though? Thanks On Wed, Feb 5, 2020 at 6:13 PM Ruijing Li wrote: > Hi all, > > I tried with forEachBatch but got an error. Is this expected? > > Code is > > df.writeStream.trigger(Trigger.Once).forEachBatc

Re: Best way to read batch from Kafka and Offsets

2020-02-05 Thread Ruijing Li
cy. What if your job fails as >> you're committing the offsets in the end, but the data was already stored? >> Will your getOffsets method return the same offsets? >> >> I'd rather not solve problems that other people have solved for me, but >> ultimately the d

Re: Best way to read batch from Kafka and Offsets

2020-02-04 Thread Ruijing Li
cessing the data) >> >> Currently to make it work in batch mode, you need to maintain the state >> information of the offsets externally. >> >> >> Thanks >> Anil >> >> -Sent from my mobile >> http://anilkulkarni.com/ >> >> On Mon, F

Re: Best way to read batch from Kafka and Offsets

2020-02-03 Thread Ruijing Li
a duplicate message with two offsets. > > The alternative is you can reprocess the offsets back from where you > thought the message was last seen. > > Kind regards > Chris > > On Mon, 3 Feb 2020, 7:39 pm Ruijing Li, wrote: > >> Hi all, >> >> My use case i

Best way to read batch from Kafka and Offsets

2020-02-03 Thread Ruijing Li
without missing data? Any help would be appreciated. -- Cheers, Ruijing Li
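
A minimal Scala sketch of the batch (non-streaming) Kafka read discussed in this thread: explicit startingOffsets/endingOffsets per partition, with the caller responsible for persisting the end offsets externally for the next run. Broker, topic, offset values, and paths are hypothetical.

import org.apache.spark.sql.SparkSession

object KafkaBatchRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-batch").getOrCreate()

    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")            // hypothetical
      .option("subscribe", "events")                                 // hypothetical
      .option("startingOffsets", """{"events":{"0":500,"1":750}}""") // last committed
      .option("endingOffsets", """{"events":{"0":-1,"1":-1}}""")     // -1 = latest
      .load()

    df.selectExpr("CAST(value AS STRING)", "partition", "offset")
      .write.mode("append").parquet("/data/kafka_batch")             // hypothetical
    // After a successful write, persist max(offset) + 1 per partition somewhere
    // durable (HDFS, a database) and feed it into startingOffsets next run.
    spark.stop()
  }
}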

Re: Out of memory HDFS Read and Write

2019-12-22 Thread Ruijing Li
k.size. One solution is to reduce the spark.executor.cores in > such a job (note the approx heap calculation noted in the ticket). Other > solution is increased executor heap. Or use off-heap configuration with > Spark 2.4 which will remove the pressure for reads but not for writes. > > regards

Re: Out of memory HDFS Read and Write

2019-12-22 Thread Ruijing Li
artition may > reduce the number of connections? You may have to look at what the > executors do when they reach out to the remote cluster. > > On Sun, 22 Dec 2019, 8:07 am Ruijing Li, wrote: > >> I managed to make the failing stage work by increasing memoryOverhead to >> s

Out of memory HDFS Multiple Cluster Write

2019-12-21 Thread Ruijing Li
ing stage of multiple cluster write) to prevent spark’s small files problem. We reduce from 4000 partitions to 20. On Sat, Dec 21, 2019 at 11:28 AM Ruijing Li wrote: > Not for the stage that fails, all it does is read and write - the number > of tasks is # of cores * # of executor instance
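
A minimal Scala sketch of the two mitigations this thread mentions: coalescing before the cross-cluster write to avoid the small-files problem, and raising executor memory overhead. Paths, cluster URIs, and sizes are hypothetical, and memoryOverhead is normally passed via spark-submit --conf; it is shown in the builder here only for brevity.

import org.apache.spark.sql.SparkSession

object CrossClusterWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cross-cluster-write")
      .config("spark.executor.memoryOverhead", "4g") // hypothetical value
      .getOrCreate()

    val df = spark.read.parquet("hdfs://clusterA/data/events") // hypothetical

    df.coalesce(20) // collapse ~4000 input partitions into 20 output files, no shuffle
      .write
      .mode("overwrite")
      .parquet("hdfs://clusterB/data/events")                  // hypothetical
    spark.stop()
  }
}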

Re: Out of memory HDFS Multiple Cluster Write

2019-12-21 Thread Ruijing Li
indicates > a shuffle? I don't expect a shuffle if it is a straight write. What's the > input partition size? > > On Sat, 21 Dec 2019, 10:24 am Ruijing Li, wrote: > >> Could you explain why shuffle partitions might be a good starting point? >> >> Some more

Re: Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Ruijing Li
art. > > Is there a difference in the number of partitions when the parquet is read > to spark.sql.shuffle.partitions? Is it much higher than > spark.sql.shuffle.partitions? > > On Fri, 20 Dec 2019, 7:34 pm Ruijing Li, wrote: > >> Hi all, >> >> I have encounte

Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Ruijing Li
looking at. -- Cheers, Ruijing Li