Write to same hdfs dir from multiple spark jobs

2020-07-29 Thread Deepak Sharma
Hi
Is there a design pattern for writing to the same HDFS directory from
multiple Spark jobs?

-- 
Thanks
Deepak
www.bigdatabig.com
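
A minimal sketch (not from this thread) of one commonly used pattern: give each
job its own Hive-style partition subdirectory under the shared root, so that
concurrent jobs never touch the same files or each other's _temporary staging
directories, while readers can still load the parent directory as a single
dataset. The path and job id below are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

object SharedDirWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shared-dir-writer").getOrCreate()
    import spark.implicits._

    val jobId = "job-A"                                   // unique per job/run (hypothetical)
    val df    = Seq((1, "one"), (2, "two")).toDF("id", "name")

    // Each job owns its own subdirectory; a reader can still do
    // spark.read.parquet("hdfs:///warehouse/shared_table") and see every job's
    // output as one partitioned dataset with a "job_id" column.
    df.write
      .mode("overwrite")
      .parquet(s"hdfs:///warehouse/shared_table/job_id=$jobId")

    spark.stop()
  }
}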


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-29 Thread Jason Moore
Hi all,

Discussion around 3.0.1 seems to have trickled away.  What was blocking the
release process from kicking off?  I can see some unresolved bugs raised against
3.0.0, but on the other hand there were quite a few critical correctness fixes
waiting to be released.

Cheers,
Jason.

From: Takeshi Yamamuro 
Date: Wednesday, 15 July 2020 at 9:00 am
To: Shivaram Venkataraman 
Cc: "dev@spark.apache.org" 
Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release

> Just wanted to check if there are any blockers that we are still waiting for 
> to start the new release process.
I don't see any ongoing blockers in my area.
Thanks for the notification.

Bests,
Takeshi

On Wed, Jul 15, 2020 at 4:03 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, Yi.

Could you explain why you think that is a blocker? For the given example from 
the JIRA description,


spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))

Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")

checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)

Apache Spark 3.0.0 seems to work as follows.

scala> spark.version
res0: String = 3.0.0

scala> spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
res1: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$1958/948653928@5d6bed7b,IntegerType,List(Some(class[value[0]: map])),None,false,true)

scala> Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")

scala> sql("SELECT key(a) AS k FROM t GROUP BY key(a)").collect
res3: Array[org.apache.spark.sql.Row] = Array([1])

Could you provide a reproducible example?

Bests,
Dongjoon.


On Tue, Jul 14, 2020 at 10:04 AM Yi Wu <yi...@databricks.com> wrote:
This could probably be a blocker: https://issues.apache.org/jira/browse/SPARK-32307

On Tue, Jul 14, 2020 at 11:13 PM Sean Owen <sro...@gmail.com> wrote:
https://issues.apache.org/jira/browse/SPARK-32234 ?

On Tue, Jul 14, 2020 at 9:57 AM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>
> Hi all
>
> Just wanted to check if there are any blockers that we are still waiting for 
> to start the new release process.
>
> Thanks
> Shivaram
>


--
---
Takeshi Yamamuro


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-29 Thread Mridul Muralidharan
I agree, that would be a new feature and, unless there is a compelling reason
(like security concerns), it would not qualify.

Regards,
Mridul

On Wed, Jul 15, 2020 at 11:46 AM Wenchen Fan  wrote:

> Supporting Python 3.8.0 sounds like a new feature, and doesn't qualify for a
> backport. But I'm open to other opinions.
>
> On Wed, Jul 15, 2020 at 11:24 PM Ismaël Mejía  wrote:
>
>> Any chance that SPARK-29536 (PySpark does not work with Python 3.8.0)
>> can be backported to 2.4.7?
>> This was not done for Spark 2.4.6 because it came too late in the vote
>> process, but it makes perfect sense to have it in 2.4.7.
>>
>> On Wed, Jul 15, 2020 at 9:07 AM Wenchen Fan  wrote:
>> >
>> > Yea I think 2.4.7 is good to go. Let's start!
>> >
>> > On Wed, Jul 15, 2020 at 1:50 PM Prashant Sharma 
>> wrote:
>> >>
>> >> Hi Folks,
>> >>
>> >> So, I am back, and I searched the JIRAs with target version "2.4.7"
>> and status Resolved, and found only 2 JIRAs. So, are we good to go with just a
>> couple of JIRAs fixed? Shall I proceed with making an RC?
>> >>
>> >> Thanks,
>> >> Prashant
>> >>
>> >> On Thu, Jul 2, 2020 at 5:23 PM Prashant Sharma 
>> wrote:
>> >>>
>> >>> Thank you, Holden.
>> >>>
>> >>> Folks, my health has gone down a bit, so I will start working on
>> this in a few days. If this needs to be published sooner, then maybe
>> someone else will have to help out.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Jul 2, 2020 at 10:11 AM Holden Karau 
>> wrote:
>> 
>>  I’m happy to have Prashant do 2.4.7 :)
>> 
>>  On Wed, Jul 1, 2020 at 9:40 PM Xiao Li 
>> wrote:
>> >
>> > +1 on releasing both 3.0.1 and 2.4.7
>> >
>> > Great! Three committers have volunteered to be release managers: Ruifeng,
>> Prashant, and Holden. Holden just helped release Spark 2.4.6, so this time
>> maybe Ruifeng and Prashant can be the release managers of 3.0.1 and 2.4.7,
>> respectively.
>> >
>> > Xiao
>> >
>> > On Wed, Jul 1, 2020 at 2:24 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>
>> >> https://issues.apache.org/jira/browse/SPARK-32148 was reported
>> yesterday, and if the report is valid it looks to be a blocker. I'll try to
>> take a look soon.
>> >>
>> >> On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>> >>>
>> >>> Thanks Holden -- it would be great to also get 2.4.7 started
>> >>>
>> >>> Thanks
>> >>> Shivaram
>> >>>
>> >>> On Tue, Jun 30, 2020 at 10:31 PM Holden Karau <
>> hol...@pigscanfly.ca> wrote:
>> >>> >
>> >>> > I can take care of 2.4.7 unless someone else wants to do it.
>> >>> >
>> >>> > On Tue, Jun 30, 2020 at 8:29 PM Jason Moore <
>> jason.mo...@quantium.com.au> wrote:
>> >>> >>
>> >>> >> Hi all,
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> Could I get some input on the severity of this one that I
>> found yesterday?  If that’s a correctness issue, should it block this
>> patch?  Let me know under the ticket if there’s more info that I can
>> provide to help.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> https://issues.apache.org/jira/browse/SPARK-32136
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> Thanks,
>> >>> >>
>> >>> >> Jason.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> From: Jungtaek Lim 
>> >>> >> Date: Wednesday, 1 July 2020 at 10:20 am
>> >>> >> To: Shivaram Venkataraman 
>> >>> >> Cc: Prashant Sharma , 郑瑞峰 <
>> ruife...@foxmail.com>, Gengliang Wang ,
>> gurwls223 , Dongjoon Hyun ,
>> Jules Damji , Holden Karau ,
>> Reynold Xin , Yuanjian Li ,
>> "dev@spark.apache.org" , Takeshi Yamamuro <
>> linguin@gmail.com>
>> >>> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> SPARK-32130 [1] looks to be a performance regression
>> introduced in Spark 3.0.0, which would be ideal to look into before releasing
>> another bugfix version.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> 1. https://issues.apache.org/jira/browse/SPARK-32130
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>> >>> >>
>> >>> >> Hi all
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> I just wanted to ping this thread to see if all the
>> outstanding blockers for 3.0.1 have been fixed. If so, it would be great if
>> we could get the release going. The CRAN team sent us a note that the version
>> of SparkR available on CRAN for the current R version (4.0.2) is broken, and
>> hence we need to update the package soon -- it would be great to do it with
>> 3.0.1.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> Thanks
>> >>> >>
>> >>> >> Shivaram
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma <
>> scrapco...@gmail.com> wrote:
>> >>> >>
>> >>> >> +1 for 3.0.1 release.
>> >>> >>
>> >>> >> I too can he

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-29 Thread Jungtaek Lim
Bump, is there any interest in this topic?

On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim 
wrote:

> (Just to add some rationale, you can refer to the original mail thread on
> the dev@ list to see the efforts on addressing problems in the file stream
> source / sink -
> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
> )
>
> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim 
> wrote:
>
>> Hi devs,
>>
>> As I have been going through the various issues on metadata log growth,
>> it's not only an issue for the sink, but also for the source.
>> Unlike the sink metadata log, whose entries must be available to
>> readers, the source metadata log is only used by the streaming query restarting
>> from the checkpoint, hence in theory it should only need to memorize the
>> minimal set of entries that prevents processing the same file multiple times.
>>
>> This is not the case for the file stream source, and I think it's because
>> of the existence of the "latestFirst" option, which I haven't seen in any
>> other sources. The option reads files in "backward" order, which means
>> Spark can read the oldest file and the latest file together in a micro-batch,
>> so it ends up having to memorize all files previously read. The option can
>> also be changed across query restarts, so even if the query is started with
>> "latestFirst" set to false, it's not safe to apply the logic of minimizing
>> the entries to memorize, as the option can be changed to true and then we'd
>> read old files again.
>>
>> I'm seeing two approaches here:
>>
>> 1) Apply "retention" - unlike "maxFileAge", the option would apply to
>> latestFirst as well. That is, if the retention is set to 7 days, files
>> older than 7 days would never be read in any way. With this approach
>> we can at least get rid of entries which are older than the retention. The
>> issue is how to play nicely with the existing "maxFileAge", as it behaves
>> similarly to retention, though it's ignored when latestFirst is
>> turned on. (Change the semantics of "maxFileAge", or leave it as a "soft
>> retention" and introduce another option.)
>>
>> (This approach is being proposed under SPARK-17604, and PR is available -
>> https://github.com/apache/spark/pull/28422)
>>
>> 2) Replace the "latestFirst" option with alternatives which no longer read
>> in "backward" order - this doesn't mean we have to read all files to move
>> forward. As we do with Kafka, a start offset could be provided, ideally as a
>> timestamp, from which Spark would read in forward order.
>> This doesn't cover all use cases of "latestFirst", but "latestFirst"
>> doesn't seem natural with the concept of Structured Streaming (think about
>> watermarks), so I'd prefer to support alternatives instead of struggling with
>> "latestFirst". (A rough sketch of the current options and such an alternative
>> follows after this thread.)
>>
>> Would like to hear your opinions.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>
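
For context, a minimal sketch of how the file stream source options discussed
above are set today, and what a timestamp-based start (approach 2) might look
like. "latestFirst" and "maxFileAge" are existing options; the
"startingTimestamp" option in the commented-out block is purely hypothetical
and only illustrates the proposal, and the paths are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-source-options").getOrCreate()

// Today: "latestFirst" reads the newest files first ("backward" order), which
// is why the source ends up memorizing every file it has ever read. Per the
// discussion above, "maxFileAge" is effectively ignored when latestFirst is on.
val backward = spark.readStream
  .format("text")
  .option("latestFirst", "true")
  .option("maxFileAge", "7d")
  .load("hdfs:///logs/incoming")          // placeholder path

// Hypothetical alternative (approach 2): start from a timestamp and read in
// forward order, similar to Kafka's timestamp-based starting offsets.
// NOTE: "startingTimestamp" is NOT an existing Spark option; it is only a
// sketch of what the proposed alternative might look like.
// val forward = spark.readStream
//   .format("text")
//   .option("startingTimestamp", "2020-07-01T00:00:00Z")
//   .load("hdfs:///logs/incoming")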