Hi Dongjoon,
Thanks for the proposal! I like the idea. Maybe we could extend it to
components too, and to some JIRA labels, such as correctness, which may be
worth highlighting in PRs as well. My only concern is that in many cases JIRAs
are not created very carefully, so they may be incorrect at the moment of
Seems like a good idea. Can we test this with a component first?
On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun wrote:
> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we consequently have lots of JIRAs and PRs. One specific
> thing I've been long
Yea, I think we can automate this process via, for instance,
https://github.com/apache/spark/blob/master/dev/github_jira_sync.py
+1 for this sort of automatic categorization and matching of metadata between
JIRA and GitHub.
Adding Josh and Sean as well.
On Thu, 13 Jun 2019, 13:17 Dongjoon Hyun wrote:
Hi, All.
Since we use both Apache JIRA and GitHub actively for Apache Spark
contributions, we consequently have lots of JIRAs and PRs. One specific
thing I've been longing to see is the `Jira Issue Type` in GitHub.
How about exposing JIRA issue types on GitHub PRs as GitHub `Labels`? There
are two ma
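For concreteness, a rough sketch of the GitHub REST call such a label sync
would end up making (labels on a PR go through the issues endpoint). This is
only an illustration, not the actual sync tooling: the token variable, PR
number, and label value below are placeholders.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object AddPrLabel {
  def main(args: Array[String]): Unit = {
    val token = sys.env("GITHUB_TOKEN")     // hypothetical token variable
    val prNumber = 12345                    // placeholder PR number
    val body = """{"labels": ["BUG"]}"""    // e.g. the JIRA issue type as a label

    // GitHub REST API: POST /repos/{owner}/{repo}/issues/{number}/labels
    val url = new URL(s"https://api.github.com/repos/apache/spark/issues/$prNumber/labels")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Authorization", s"token $token")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
    println(s"GitHub responded with HTTP ${conn.getResponseCode}")
  }
}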
Nicholas, thank you for your explanation.
I am also interested in the example that Rishi is asking for. I am sure
mapPartitions may work, but as Vladimir suggests it may not be the best
option in terms of performance.
@Vladimir Prus, are you aware of any example of writing a "custom
phy
Hi,
I am running the Spark DataFrame NTILE function over a large dataset; it
spills a lot of data while sorting and eventually fails.
The data is roughly 80 million records, about 4 GB in size (not sure
whether that is serialized or deserialized). I am calculating NTILE(10) for
all these records orde
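For reference, a minimal sketch of that kind of query (the column name and
input path are placeholders). A window with an orderBy but no partitionBy
moves every row into a single partition before sorting, which would explain
the heavy spill:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// No partitionBy: the entire 80M-row sort happens in one task.
val w = Window.orderBy($"amount")                      // "amount" is a placeholder column
val withDeciles = spark.read.parquet("/path/to/data")  // placeholder input
  .withColumn("decile", ntile(10).over(w))

If per-group deciles are acceptable, adding a partitionBy on some grouping key
to the window spreads the sort across tasks and usually avoids the spill.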
Dear Apache Enthusiast,
(You’re receiving this message because you’re subscribed to one or more
Apache Software Foundation project user mailing lists.)
We’re thrilled to announce the schedule for our upcoming conference,
ApacheCon North America 2019, in Las Vegas, Nevada. See it now at
https
Hey Jean-Michel,
Looks like it's specific to YARN. As I mentioned, I am running on a
standalone cluster.
Thanks
On Tue 11 Jun 2019 at 10:50, Lourier, Jean-Michel (FIX1) <
jean-michel.lour...@porsche.de> wrote:
> Hi Patrick,
>
> I guess the easiest way is to use log aggregation:
> https://spark.ap
Nice finding!
Given that you already pointed out a previous issue which fixed a similar
problem, it should also be easy for you to craft a patch and verify whether
the fix resolves your issue. Looking forward to seeing your patch.
Thanks,
Jungtaek Lim (HeartSaVioR)
On Wed, Jun 12, 2019 at 8:23 PM Gerard Maas
Oops, I linked the wrong JIRA ticket (that other one is related):
https://issues.apache.org/jira/browse/SPARK-28025
On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas wrote:
> Hi!
> I would like to socialize this issue we are currently facing:
> The Structured Streaming default CheckpointFileManager
Hi!
I would like to socialize this issue we are currently facing:
The Structured Streaming default CheckpointFileManager leaks .crc files by
leaving them behind after users of this class (like
HDFSBackedStateStoreProvider) apply their cleanup methods.
This results in an unbounded creation of tiny
FYI, I am already using QueryExecutionListener, which satisfies the
requirements, but it only works for DataFrame APIs. If someone does
df.rdd().someAction(), QueryExecutionListener is never invoked. I want
something like QueryExecutionListener that also works in the case of
df.rdd().someAction().
I explor
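For context, a minimal registration sketch of the listener being discussed
(org.apache.spark.sql.util.QueryExecutionListener, signatures as of Spark 2.4);
the last two lines show the gap described above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder.getOrCreate()

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"action $funcName finished in ${durationNs / 1e6} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"action $funcName failed: ${exception.getMessage}")
})

spark.range(10).count()      // triggers the listener
spark.range(10).rdd.count()  // does not: the action runs on the RDD, outside the SQL layer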
Hi,
If your data frame is partitioned by column A, and you want deduplication
by columns A, B and C, then a faster way might be to sort each partition by
A, B and C and then do a linear scan - it is often faster than a group-by on
all columns, which requires a shuffle. Sadly, there's no standard way to
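A rough sketch of that sort-then-scan approach, assuming a hypothetical case
class with key columns a, b, c (the repartition keeps identical keys in the
same partition, so the scan only has to compare adjacent rows):

import org.apache.spark.sql.SparkSession

case class Rec(a: String, b: String, c: String, payload: String)  // hypothetical schema

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val ds = spark.read.parquet("/path/to/data").as[Rec]  // placeholder input

val deduped = ds
  .repartition($"a")                       // keep equal A values together (skip if the layout already guarantees it)
  .sortWithinPartitions($"a", $"b", $"c")  // local sort, no shuffle keyed on all three columns
  .mapPartitions { rows =>
    // Linear scan: keep a row only when its (a, b, c) key differs from the previous row's.
    var prev: (String, String, String) = null
    rows.filter { r =>
      val key = (r.a, r.b, r.c)
      val keep = key != prev
      prev = key
      keep
    }
  }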
Hi all,
As we know, Parquet is stored in a columnar format, so filtering on a column
requires reading only that column instead of the complete record.
But when we create a Dataset[Class] and do a group-by on a column, it performs
differently from the same steps on a DataFrame. Operations on Datase
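A small illustration of that difference, assuming a hypothetical case class:
the untyped group-by lets Catalyst prune the Parquet scan down to one column,
while the typed groupByKey has to deserialize whole objects because the lambda
is opaque to the optimizer.

import org.apache.spark.sql.SparkSession

case class Event(userId: String, country: String, payload: String)  // hypothetical schema

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val ds = spark.read.parquet("/path/to/events").as[Event]  // placeholder input

// Untyped: only the `country` column is read from Parquet.
val byCountryDf = ds.groupBy($"country").count()

// Typed: every Event is fully deserialized before grouping.
val byCountryDs = ds.groupByKey(_.country).count()

Comparing the two physical plans with .explain() shows the narrower ReadSchema
in the untyped case.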
Hi All,
Is there any way to receive an event when a DataSourceReader is finished?
I want to do some cleanup after all the DataReaders are finished reading,
and hence need some kind of cleanUp() mechanism at the DataSourceReader
(driver) level.
How to achieve this?
For instance, in DataSourceWriter
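For contrast, and going from memory of the Spark 2.4 DataSource V2 write path
(so treat the exact signatures as approximate), the writer side does get
driver-level end-of-job callbacks, which is the kind of hook missing on the
read side:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}

// Sketch only: commit/abort run on the driver once all DataWriters are done,
// so write-side cleanup naturally lives there.
class MyWriter extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[InternalRow] = ???
  override def commit(messages: Array[WriterCommitMessage]): Unit = {
    // all writers succeeded -> clean up temporary state here
  }
  override def abort(messages: Array[WriterCommitMessage]): Unit = {
    // the job failed -> clean up partial output here
  }
}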
Hi, Sonu.
You can send an email to user-unsubscr...@spark.apache.org with the subject
"(send this email to unsubscribe)" to unsubscribe from this mailing
list[1].
Regards.
[1] https://spark.apache.org/community.html
2019-05-27 2:01 GMT+07.00, Sonu Jyotshna :
--
Warm regards,
B2B Administrator
Hi,
My employer (IBM) is interested in hiring people in Hyderabad if they are
committers on any Apache project and are interested in Spark and its
ecosystem.
Thanks,
Prashant.