Hi Dongjoon,
Thanks for the proposal! I like the idea. Maybe we could extend it to
components too, and to some JIRA labels, such as correctness, which may be
worth highlighting in PRs as well. My only concern is that in many cases JIRAs
are not created very carefully, so they may be incorrect at the moment of
Seems like a good idea. Can we test this with a component first?
On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun wrote:
> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we consequently have lots of JIRAs and PRs. One specific
> thing I've been long
Yea, I think we can automate this process via, for instance,
https://github.com/apache/spark/blob/master/dev/github_jira_sync.py
+1 for this sort of automatic categorization and matching of metadata between
JIRA and GitHub.
Adding Josh and Sean as well.
On Thu, 13 Jun 2019, 13:17 Dongjoon Hyun wrote:
Hi, All.
Since we use both Apache JIRA and GitHub actively for Apache Spark
contributions, we consequently have lots of JIRAs and PRs. One specific
thing I've been longing to see is the `Jira Issue Type` in GitHub.
How about exposing JIRA issue types on GitHub PRs as GitHub `Labels`? There
are two ma
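For concreteness, a rough sketch of the GitHub REST call such a label sync
would end up making (labels on a PR go through the issues endpoint). This is
only an illustration, not the actual sync tooling: the token variable, PR
number, and label value below are placeholders.

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object AddPrLabel {
  def main(args: Array[String]): Unit = {
    val token = sys.env("GITHUB_TOKEN")     // hypothetical token variable
    val prNumber = 12345                    // placeholder PR number
    val body = """{"labels": ["BUG"]}"""    // e.g. the JIRA issue type as a label

    // GitHub REST API: POST /repos/{owner}/{repo}/issues/{number}/labels
    val url = new URL(s"https://api.github.com/repos/apache/spark/issues/$prNumber/labels")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Authorization", s"token $token")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
    println(s"GitHub responded with HTTP ${conn.getResponseCode}")
  }
}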
Nicholas, thank you for your explanation.
I am also interested in the example that Rishi is asking for. I am sure
mapPartitions may work, but as Vladimir suggests it may not be the best
option in terms of performance.
@Vladimir Prus, are you aware of any example of writing a "custom
phy
Hi,
I am running the Spark DataFrame NTILE function over a large dataset; it
spills a lot of data while sorting and eventually fails.
The data is roughly 80 million records, about 4 GB in size (not sure
whether that is serialized or deserialized). I am calculating NTILE(10) for
all these records orde
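For reference, a minimal sketch of that kind of query (the column name and
input path are placeholders). A window with an orderBy but no partitionBy
moves every row into a single partition before sorting, which would explain
the heavy spill:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// No partitionBy: the entire 80M-row sort happens in one task.
val w = Window.orderBy($"amount")                      // "amount" is a placeholder column
val withDeciles = spark.read.parquet("/path/to/data")  // placeholder input
  .withColumn("decile", ntile(10).over(w))

If per-group deciles are acceptable, adding a partitionBy on some grouping key
to the window spreads the sort across tasks and usually avoids the spill.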
Dear Apache Enthusiast,
(You’re receiving this message because you’re subscribed to one or more
Apache Software Foundation project user mailing lists.)
We’re thrilled to announce the schedule for our upcoming conference,
ApacheCon North America 2019, in Las Vegas, Nevada. See it now at
https
Hey Jean-Michel,
Looks like it's specific to YARN. As I mentioned, I am running on a
standalone cluster.
Thanks
On Tue 11 Jun 2019 at 10:50, Lourier, Jean-Michel (FIX1) <
jean-michel.lour...@porsche.de> wrote:
> Hi Patrick,
>
> I guess the easiest way is to use log aggregation:
> https://spark.ap
Nice finding!
Given that you already pointed out a previous issue which fixed a similar
problem, it should also be easy for you to craft a patch and verify whether
the fix resolves your issue. Looking forward to seeing your patch.
Thanks,
Jungtaek Lim (HeartSaVioR)
On Wed, Jun 12, 2019 at 8:23 PM Gerard Maas
Oops, I linked the wrong JIRA ticket (that other one is related):
https://issues.apache.org/jira/browse/SPARK-28025
On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas wrote:
> Hi!
> I would like to socialize this issue we are currently facing:
> The Structured Streaming default CheckpointFileManager
Hi!
I would like to socialize this issue we are currently facing:
The Structured Streaming default CheckpointFileManager leaks .crc files by
leaving them behind after users of this class (like
HDFSBackedStateStoreProvider) apply their cleanup methods.
This results in an unbounded creation of tiny
FYI, I am already using QueryExecutionListener, which satisfies the
requirements, but it only works for DataFrame APIs. If someone does
df.rdd().someAction(), QueryExecutionListener is never invoked. I want
something like QueryExecutionListener that also works in the case of
df.rdd().someAction().
I explor
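For context, a minimal registration sketch of the listener being discussed
(org.apache.spark.sql.util.QueryExecutionListener, signatures as of Spark 2.4);
the last two lines show the gap described above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder.getOrCreate()

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"action $funcName finished in ${durationNs / 1e6} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"action $funcName failed: ${exception.getMessage}")
})

spark.range(10).count()      // triggers the listener
spark.range(10).rdd.count()  // does not: the action runs on the RDD, outside the SQL layer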
Hi,
If your data frame is partitioned by column A, and you want deduplication
by columns A, B and C, then a faster way might be to sort each partition by
A, B and C and then do a linear scan - it is often faster than a group-by on
all columns, which requires a shuffle. Sadly, there's no standard way to
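A rough sketch of that sort-then-scan approach, assuming a hypothetical case
class with key columns a, b, c (the repartition keeps identical keys in the
same partition, so the scan only has to compare adjacent rows):

import org.apache.spark.sql.SparkSession

case class Rec(a: String, b: String, c: String, payload: String)  // hypothetical schema

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val ds = spark.read.parquet("/path/to/data").as[Rec]  // placeholder input

val deduped = ds
  .repartition($"a")                       // keep equal A values together (skip if the layout already guarantees it)
  .sortWithinPartitions($"a", $"b", $"c")  // local sort, no shuffle keyed on all three columns
  .mapPartitions { rows =>
    // Linear scan: keep a row only when its (a, b, c) key differs from the previous row's.
    var prev: (String, String, String) = null
    rows.filter { r =>
      val key = (r.a, r.b, r.c)
      val keep = key != prev
      prev = key
      keep
    }
  }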
Hi all,
As we know, Parquet is stored in a columnar format, so filtering on a column
requires reading only that column instead of the complete record.
But when we create a Dataset[Class] and do a group-by on a column, it performs
differently from the same steps on a DataFrame. Operations on Datase
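A small illustration of that difference, assuming a hypothetical case class:
the untyped group-by lets Catalyst prune the Parquet scan down to one column,
while the typed groupByKey has to deserialize whole objects because the lambda
is opaque to the optimizer.

import org.apache.spark.sql.SparkSession

case class Event(userId: String, country: String, payload: String)  // hypothetical schema

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val ds = spark.read.parquet("/path/to/events").as[Event]  // placeholder input

// Untyped: only the `country` column is read from Parquet.
val byCountryDf = ds.groupBy($"country").count()

// Typed: every Event is fully deserialized before grouping.
val byCountryDs = ds.groupByKey(_.country).count()

Comparing the two physical plans with .explain() shows the narrower ReadSchema
in the untyped case.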
Hi All,
Is there any way to receive an event when a DataSourceReader is finished?
I want to do some cleanup after all the DataReaders are finished reading,
and hence need some kind of cleanUp() mechanism at the DataSourceReader
(driver) level.
How to achieve this?
For instance, in DataSourceWriter
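For contrast, and going from memory of the Spark 2.4 DataSource V2 write path
(so treat the exact signatures as approximate), the writer side does get
driver-level end-of-job callbacks, which is the kind of hook missing on the
read side:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}

// Sketch only: commit/abort run on the driver once all DataWriters are done,
// so write-side cleanup naturally lives there.
class MyWriter extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[InternalRow] = ???
  override def commit(messages: Array[WriterCommitMessage]): Unit = {
    // all writers succeeded -> clean up temporary state here
  }
  override def abort(messages: Array[WriterCommitMessage]): Unit = {
    // the job failed -> clean up partial output here
  }
}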
Hi, Sonu.
You can send an email to user-unsubscr...@spark.apache.org with the subject
"(send this email to unsubscribe)" to unsubscribe from this mailing
list[1].
Regards.
[1] https://spark.apache.org/community.html
2019-05-27 2:01 GMT+07.00, Sonu Jyotshna :
--
Warm regards,
B2B Administrator
Hi,
My employer (IBM) is interested in hiring people in Hyderabad if they are
committers on any Apache project and are interested in Spark and its
ecosystem.
Thanks,
Prashant.