> ... our code or data, but hard to say without knowing more. The
> lineage is fine and deterministic, but your data or operations might
> not be.
>
> On Thu, Sep 10, 2020 at 12:03 AM Ruijing Li wrote:
>
> > Hi all,
> >
... this would
happen; I don't have nondeterministic data, though. Has anyone
encountered something similar, or have an inkling as to the cause?
Thanks!
--
Cheers,
Ruijing Li
> ... d, where XYZ is some integer value
> representing the task ID that was launched on that executor. In case
> you're running this in local mode, that thread would be located in the
> same Java thread dump that you have already collected.
>
> On Tue, Apr 21, 2020 at 9:51 PM
> ... executing your code in one
> JVM, and whatever synchronization that implies.
>
> On Sun, May 3, 2020 at 11:32 AM Ruijing Li wrote:
> >
> > Hi all,
> >
> > We have a spark job (spark 2.4.4, hadoop 2.7, scala 2.11.12) where we
> > use semaphores / parallel collections with ...
... about any deadlocks, and if it could mess with the fixes for issues
such as this one:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961
We do run with multiple cores.
Thanks!
--
Cheers,
Ruijing Li
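For context, a minimal sketch of the driver-side pattern described above
(a Scala parallel collection plus a semaphore capping concurrent Spark
actions); the paths and the permit count of 4 are hypothetical:

    import java.util.concurrent.Semaphore
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parallel-actions").getOrCreate()

    // Cap how many Spark actions run concurrently from the driver.
    val permits = new Semaphore(4)
    val paths = Seq("/data/a", "/data/b", "/data/c").par

    paths.foreach { path =>
      permits.acquire()
      try {
        // Each action runs as a separate job on the shared SparkContext;
        // the semaphore only limits driver-side concurrency.
        spark.read.parquet(path).count()
      } finally {
        permits.release()
      }
    }

Whether this interacts badly with the scheduling fixes in SPARK-26961 is
exactly the open question in the thread.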
For some reason, after restarting the app and trying again, latest now
works as expected. Not sure why it didn’t work before.
On Tue, Apr 21, 2020 at 1:46 PM Ruijing Li wrote:
> Yes, we did. But for some reason latest does not show them. The count is
> always 0.
>
> On Sun, Apr 19,
> ... aries in the dump, then why not share the
> thread dump? (I mean, the output of jstack.)
>
> A stack trace would be more helpful for finding which thread acquired a
> lock and which other threads are waiting to acquire it, if we suspect a
> deadlock.
>
> On Wed, Apr 22, 2020 at 2:38 AM Ruijing Li wrote:
>
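For reference, such a dump is captured with the JDK's jstack tool against
the driver or executor JVM's process id, e.g.:

    jstack -l <pid> > threads.txt

The -l flag includes ownable-synchronizer (lock) information, which is
what makes a lock cycle visible in the output.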
> On Fri, Apr 17, 2020 at 9:13 AM Ruijing Li wrote:
>
>> Hi all,
>>
>> Apologies if this has been asked before, but I could not find the answer
>> to this question. We have a structured streaming job, but for some reason,
>> if we use startingOffsets = latest ...
waiting.
Thanks
On Tue, Apr 21, 2020 at 9:58 AM Ruijing Li wrote:
> Strangely enough, I found an old issue that is the exact same as mine:
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-18343
>
> However, I’m using spark 2.4.4, so the issue should have been solved ...
After refreshing a couple of times, I notice the lock is being swapped
between these 3 threads. The other 2 will be blocked by whichever gets
the lock, in a cycle: 160 holds the lock -> 161 -> 159 -> 160.
On Tue, Apr 21, 2020 at 10:33 AM Ruijing Li wrote:
> In the thread dump, I do see this:
>
> > ... so maybe doing it manually would be the only option. (Not sure
> > Spark UI will provide the same; I haven't used it at all.)
> >
> > It will tell you which thread is being blocked (even if it's shown as
> > running) and which point to look at.
> >
> > On Th
... Jungtaek Lim wrote:
> That sounds odd. Is it intermittent, or always reproducible if you start
> with the same checkpoint? What's the version of Spark?
>
> On Fri, Apr 17, 2020 at 6:17 AM Ruijing Li wrote:
>
>> Hi all,
>>
>> I have a question on how structured streaming ...
“Fetcher [Consumer] Resetting offset for partition to offset” over and
over again. However, with startingOffsets=earliest, we don’t get this
issue. I’m wondering then how we can use startingOffsets=latest, as I
wish to start from the latest offset available.
--
Cheers,
Ruijing Li
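For context, a minimal sketch of the Kafka source configuration under
discussion (broker and topic names are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-latest").getOrCreate()

    // startingOffsets only applies to the very first run of the query;
    // once a checkpoint exists, the checkpointed offsets take precedence.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")
      .load()

    df.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-latest")
      .start()
      .awaitTermination()

That first-run-only semantic may explain why restarting the app changed
the behavior described above.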
... restarting it, I see it instead
reads from offset file 9, which contains {1:1000}.
Can someone explain why spark doesn’t take the max offset?
Thanks.
--
Cheers,
Ruijing Li
> ... though most probably you'll need to do the
> former), but if you can't make sure, and you understand the risk, then
> yes, you can turn off the option and take the risk.
>
>
> On Wed, Apr 15, 2020 at 9:24 AM Ruijing Li wrote:
>
>> I see, I wasn’t sure if that would work ...
>> ... ' ' END "INFO"
>> FROM
>> v$process p
>> ,v$session a
>> ,v$sess_io b
>> WHERE
>> a.paddr = p.addr
>> AND p.background IS NULL
>> --AND a.sid NOT IN (select sid from v$mystat where rownum = 1)
>> AND a.
> ... stream?
>
> On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote:
>
>> Hi all,
>>
>> I have a spark structured streaming app that is consuming from a kafka
>> topic with retention set up. Sometimes I face an issue where my query has
>> not finished processing a message ...
cannot set that. How do I do this for
structured streaming?
Thanks!
--
Cheers,
Ruijing Li
sometimes it stops at 29 completed stages and doesn’t start the last stage.
The spark job is idling and there is no pending or active task. What could
be the problem? Thanks.
--
Cheers,
Ruijing Li
> ... information on how to use this tool in the spark
> documentation: https://spark.apache.org/docs/latest/monitoring.html
>
> On Wed, 8 Apr 2020, 23:47 Ruijing Li, wrote:
>
>> Hi all,
>>
>> As stated in the title, currently when I view the spark UI of a completed ...
Hi all,
As stated in the title, currently when I view the spark UI of a completed
spark job, I see there are thread dump links in the executor tab, but
clicking on them does nothing. Is it possible to see the thread dumps
somehow even if the job finishes? We are on spark 2.4.5.
Thanks.
--
Cheers,
Ruijing
t;>>>>> On March 26, 2020 3:41 PM Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Do you think we can create a global solution in the cloud using
>>>>>>>>> volunteers like us and third party employees? What I have in
>>>>>>>>> mind is to create a comprehensive real time solution to get data
>>>>>>>>> from various countries and universities, pushed into a fast
>>>>>>>>> database through Kafka and Spark, and used downstream for greater
>>>>>>>>> analytics. I am sure the likes of Google etc. will provide free
>>>>>>>>> storage, and likely many vendors will grab the opportunity.
>>>>>>>>>
>>>>>>>>> We can then donate this to WHO or others, and we can make it very
>>>>>>>>> modular through microservices etc.
>>>>>>>>>
>>>>>>>>> I hope this does not sound futuristic.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> LinkedIn:
>>>>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>>>> for any loss, damage or destruction of data or any other property
>>>>>>>>> which may
>>>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>>>> disclaimed. The author will in no case be liable for any monetary
>>>>>>>>> damages
>>>>>>>>> arising from such loss, damage or destruction.
>>>>>>>>>
>>>>>>>>> .. spend time to analyse ..
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Abraham R. Wilcox
>>>>>>>>> Sales Director (African Region)
>>>>>>>>> 8x8 Hosted VoIP - Communications & Collaboration Solutions
>>>>>>>>> 7257 NW 4TH BLVD SUITE 305
>>>>>>>>> GAINESVILLE, FL 32607
>>>>>>>>> US
>>>>>>>>> Direct: +1 510 646 1484
>>>>>>>>> US Voice: +1 641 715 3900 ext. 755489#
>>>>>>>>> US Fax: +1 855 661 4166
>>>>>>>>> Alt. email: awilco...@gmail.com
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Chenguang He
>>>>>>>
--
Cheers,
Ruijing Li
into the driver?
--
Cheers,
Ruijing Li
Thanks Magnus,
I’ll explore Atlas and see what I can find.
On Wed, Mar 4, 2020 at 11:10 AM Magnus Nilsson wrote:
> Apache Atlas is the Apache data catalog. Maybe you want to look into
> that. It depends on what your use case is.
>
> On Wed, Mar 4, 2020 at 8:01 PM Ruijing Li wrote:
10:35, Magnus Nilsson wrote:
>
>> Google hive metastore.
>>
>> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote:
>>
>>> Hi all,
>>>
>>> Has anyone explored efforts to have a centralized storage of schemas of
>>> different parquet files
Hi all,
Has anyone explored efforts to have a centralized storage of schemas of
different parquet files? I know there is schema management for Avro, but
couldn’t find solutions for parquet schema management. Thanks!
--
Cheers,
Ruijing Li
Just wanted to follow up on this. If anyone has any advice, I’d be
interested in learning more!
On Thu, Feb 20, 2020 at 6:09 PM Ruijing Li wrote:
> Hi all,
>
> I’m interested in hearing the community’s thoughts on best practices to do
> integration testing for spark sql jobs. We
sparksession locally or testing with spark-shell. Ideally, we’d like some
sort of docker container emulating hdfs and spark cluster mode, that you
can run locally.
Any test framework, tips, or examples people can share? Thanks!
--
Cheers,
Ruijing Li
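For what it's worth, most integration-test setups boil down to the local
SparkSession pattern mentioned above; a minimal sketch (the view name and
assertion are made up):

    import org.apache.spark.sql.SparkSession

    object SqlJobSpec {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]") // local mode stands in for the cluster
          .appName("sql-job-test")
          .getOrCreate()
        import spark.implicits._

        // Arrange: build a tiny input instead of reading from HDFS.
        Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
          .createOrReplaceTempView("events")

        // Act: run the SQL under test.
        val out = spark.sql(
          "SELECT key, SUM(value) AS total FROM events GROUP BY key")

        // Assert: two distinct keys expected.
        assert(out.count() == 2)
        spark.stop()
      }
    }

Test frameworks such as scalatest or spark-testing-base wrap this same
pattern; a dockerized HDFS would replace the Seq(...) arrangement step.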
> ... -Dsun.io.serialization.extendedDebugInfo=true
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
>
> On Tue, Feb 18, 2020 at 1:02 PM Ruijing Li wrote:
>
>> Hi all,
>>
>> When working with spark jobs, I sometimes have to tackle
>> serialization ...
generic classes or the class Spark is running itself).
Thanks!
--
Cheers,
Ruijing Li
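For reference, a sketch of how that JVM flag is typically passed through
spark-submit for both driver and executors (class and jar names are
placeholders):

    spark-submit \
      --class com.example.MyJob \
      --conf "spark.driver.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true" \
      --conf "spark.executor.extraJavaOptions=-Dsun.io.serialization.extendedDebugInfo=true" \
      my-job.jar

With it enabled, a Task-not-serializable error also prints the chain of
object fields that dragged the non-serializable instance in.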
for your help!
On Wed, Feb 5, 2020 at 7:07 PM Ruijing Li wrote:
> Looks like I’m wrong, since I tried that exact snippet and it worked.
>
> So to be clear, in the part where I do batchDF.write.parquet, that is not
> the exact code I’m using.
>
> I’m using a custom write function
... memory
than previous versions of spark? I’d be interested to know if anyone else
has this issue. We are on scala 2.11.12 and java 8.
--
Cheers,
Ruijing Li
... function isn’t
working correctly.
Is batchDF a static dataframe, though?
Thanks
On Wed, Feb 5, 2020 at 6:13 PM Ruijing Li wrote:
> Hi all,
>
> I tried with forEachBatch but got an error. Is this expected?
>
> Code is
>
> df.writeStream.trigger(Trigger.Once).foreachBatch ...
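For context, a minimal runnable sketch of the pattern being discussed
(note the method is spelled foreachBatch; broker, topic, and paths are
hypothetical):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("foreach-batch").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()

    // Each micro-batch arrives as a static DataFrame, so batch-only
    // sinks such as parquet work inside the function.
    df.writeStream
      .trigger(Trigger.Once())
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF.write.mode("append").parquet("/data/out")
      }
      .option("checkpointLocation", "/data/checkpoints/job")
      .start()
      .awaitTermination()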
>> ... cy. What if your job fails as
>> you're committing the offsets at the end, but the data was already stored?
>> Will your getOffsets method return the same offsets?
>>
>> I'd rather not solve problems that other people have solved for me, but
>> ultimately the d...
>> ... processing the data)
>>
>> Currently to make it work in batch mode, you need to maintain the state
>> information of the offsets externally.
>>
>>
>> Thanks
>> Anil
>>
>> -Sent from my mobile
>> http://anilkulkarni.com/
>>
>> On Mon, F
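For context, a sketch of what "maintain the offsets externally" can look
like with the Kafka batch source (broker, topic, offsets, and paths are
hypothetical; the JSON layout is {"topic":{"partition":offset}}):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-batch").getOrCreate()

    // Read the last committed offsets back from your external store.
    val start = """{"events":{"0":1000}}"""
    val end   = """{"events":{"0":2000}}"""

    val batch = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("startingOffsets", start)
      .option("endingOffsets", end)
      .load()

    batch.write.mode("append").parquet("/data/out")
    // Persist `end` to the external store only after the write succeeds;
    // if the job dies in between, the same range is simply reprocessed.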
> ... a duplicate message with two offsets.
>
> The alternative is you can reprocess the offsets back from where you
> thought the message was last seen.
>
> Kind regards
> Chris
>
> On Mon, 3 Feb 2020, 7:39 pm Ruijing Li, wrote:
>
>> Hi all,
>>
>> My use case is ...
without missing data? Any help would be
appreciated.
--
Cheers,
Ruijing Li
> ... k.size. One solution is to reduce spark.executor.cores in
> such a job (note the approximate heap calculation noted in the ticket).
> Another solution is to increase the executor heap. Or use the off-heap
> configuration with Spark 2.4, which will remove the pressure for reads
> but not for writes.
>
> regards
> ... partition may
> reduce the number of connections? You may have to look at what the
> executors do when they reach out to the remote cluster.
>
> On Sun, 22 Dec 2019, 8:07 am Ruijing Li, wrote:
>
>> I managed to make the failing stage work by increasing memoryOverhead to
>> s
... ing stage of the multiple cluster write) to prevent spark’s small
files problem. We reduce from 4000 partitions to 20.
On Sat, Dec 21, 2019 at 11:28 AM Ruijing Li wrote:
> Not for the stage that fails, all it does is read and write - the number
> of tasks is # of cores * # of executor instances ...
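For context, a sketch of the coalesce-before-write step being described
(paths are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("compact-write").getOrCreate()

    // coalesce(20) narrows 4000 upstream partitions to 20 output files
    // without triggering a full shuffle (unlike repartition).
    spark.read.parquet("/data/in")
      .coalesce(20)
      .write.mode("overwrite").parquet("/data/out")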
> ... indicates
> a shuffle? I don't expect a shuffle if it is a straight write. What's the
> input partition size?
>
> On Sat, 21 Dec 2019, 10:24 am Ruijing Li, wrote:
>
>> Could you explain why shuffle partitions might be a good starting point?
>>
>> Some more
art.
>
> Is there a difference between the number of partitions when the parquet
> is read and spark.sql.shuffle.partitions? Is it much higher than
> spark.sql.shuffle.partitions?
>
> On Fri, 20 Dec 2019, 7:34 pm Ruijing Li, wrote:
>
>> Hi all,
>>
>> I have encountered ...
looking at.
--
Cheers,
Ruijing Li