Write to same HDFS dir from multiple Spark jobs
Hi,

Is there any design pattern around writing to the same HDFS directory from multiple Spark jobs?

--
Thanks,
Deepak
www.bigdatabig.com
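[One commonly used pattern -- a sketch, not an official recipe -- is to let each job write to a private staging directory and then move the finished part files into the shared directory, so concurrent jobs never race on each other's _temporary output. A minimal Scala sketch, assuming a vanilla Spark-on-HDFS setup; all paths and the sample data are hypothetical:

    import java.util.UUID
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object SharedDirWriter {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("shared-dir-writer").getOrCreate()
        import spark.implicits._

        // Stand-in for whatever this job actually computes.
        val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

        // 1) Write to a staging directory owned by this job alone, so concurrent
        //    jobs never touch each other's temporary directories.
        val staging = new Path(s"hdfs:///tmp/staging/${UUID.randomUUID}")
        df.write.parquet(staging.toString)

        // 2) Move the committed part files into the shared directory. On HDFS a
        //    rename is a cheap metadata operation, and Spark's part file names
        //    embed a per-write UUID, so name collisions across jobs are unlikely.
        val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
        val shared = new Path("hdfs:///shared/output")
        fs.mkdirs(shared)
        fs.listStatus(staging)
          .filter(_.getPath.getName.startsWith("part-"))
          .foreach(f => fs.rename(f.getPath, new Path(shared, f.getPath.getName)))
        fs.delete(staging, true) // drop the staging dir (_SUCCESS marker etc.)
        spark.stop()
      }
    }

If the jobs naturally own disjoint partitions of the data, writing with partitionBy under a shared root can serve the same purpose; the staging-plus-rename approach is the more general one.]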
Re: [DISCUSS] Apache Spark 3.0.1 Release
Hi all,

Discussion around 3.0.1 seems to have trickled away. What was blocking the release process from kicking off? I can see some unresolved bugs raised against 3.0.0, but conversely there were quite a few critical correctness fixes waiting to be released.

Cheers,
Jason.

From: Takeshi Yamamuro
Date: Wednesday, 15 July 2020 at 9:00 am
To: Shivaram Venkataraman
Cc: "dev@spark.apache.org"
Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release

> Just wanted to check if there are any blockers that we are still waiting for
> to start the new release process.

I don't see any on-going blocker in my area. Thanks for the notification.

Bests,
Takeshi

On Wed, Jul 15, 2020 at 4:03 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, Yi. Could you explain why you think that is a blocker? For the given
> example from the JIRA description,
>
>     spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
>     Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
>     checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)
>
> Apache Spark 3.0.0 seems to work like the following:
>
>     scala> spark.version
>     res0: String = 3.0.0
>
>     scala> spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
>     res1: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$1958/948653928@5d6bed7b,IntegerType,List(Some(class[value[0]: map])),None,false,true)
>
>     scala> Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
>
>     scala> sql("SELECT key(a) AS k FROM t GROUP BY key(a)").collect
>     res3: Array[org.apache.spark.sql.Row] = Array([1])
>
> Could you provide a reproducible example?
>
> Bests,
> Dongjoon.

On Tue, Jul 14, 2020 at 10:04 AM Yi Wu <yi...@databricks.com> wrote:

> This could probably be a blocker: https://issues.apache.org/jira/browse/SPARK-32307

On Tue, Jul 14, 2020 at 11:13 PM Sean Owen <sro...@gmail.com> wrote:

> https://issues.apache.org/jira/browse/SPARK-32234 ?

On Tue, Jul 14, 2020 at 9:57 AM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

> Hi all
>
> Just wanted to check if there are any blockers that we are still waiting for
> to start the new release process.
>
> Thanks
> Shivaram

--
---
Takeshi Yamamuro
Re: [DISCUSS] Apache Spark 3.0.1 Release
I agree, that would be a new feature; and unless there is a compelling reason (like security concerns), it would not qualify.

Regards,
Mridul

On Wed, Jul 15, 2020 at 11:46 AM Wenchen Fan wrote:

> Supporting Python 3.8.0 sounds like a new feature, and doesn't qualify for a
> backport. But I'm open to other opinions.

On Wed, Jul 15, 2020 at 11:24 PM Ismaël Mejía wrote:

> Any chance that SPARK-29536 (PySpark does not work with Python 3.8.0) can be
> backported to 2.4.7? This was not done for Spark 2.4.6 because it was too
> late in the vote process, but it makes perfect sense to have this in 2.4.7.

On Wed, Jul 15, 2020 at 9:07 AM Wenchen Fan wrote:

> Yea I think 2.4.7 is good to go. Let's start!

On Wed, Jul 15, 2020 at 1:50 PM Prashant Sharma wrote:

> Hi Folks,
>
> So, I am back, and searched the JIRAs with target version "2.4.7" and
> Resolved, and found only 2 JIRAs. So, are we good to go, with just a couple
> of JIRAs fixed? Shall I proceed with making an RC?
>
> Thanks,
> Prashant

On Thu, Jul 2, 2020 at 5:23 PM Prashant Sharma wrote:

> Thank you, Holden.
>
> Folks, my health has gone down a bit, so I will start working on this in a
> few days. If this needs to be published sooner, then maybe someone else has
> to help out.

On Thu, Jul 2, 2020 at 10:11 AM Holden Karau wrote:

> I'm happy to have Prashant do 2.4.7 :)

On Wed, Jul 1, 2020 at 9:40 PM Xiao Li wrote:

> +1 on releasing both 3.0.1 and 2.4.7
>
> Great! Three committers volunteered to be release managers: Ruifeng,
> Prashant and Holden. Holden just helped release Spark 2.4.6. This time,
> maybe, Ruifeng and Prashant can be the release managers of 3.0.1 and 2.4.7
> respectively.
>
> Xiao

On Wed, Jul 1, 2020 at 2:24 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> https://issues.apache.org/jira/browse/SPARK-32148 was reported yesterday,
> and if the report is valid it looks to be a blocker. I'll try to take a look
> sooner.

On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

> Thanks Holden -- it would be great to also get 2.4.7 started
>
> Thanks
> Shivaram

On Tue, Jun 30, 2020 at 10:31 PM Holden Karau <hol...@pigscanfly.ca> wrote:

> I can take care of 2.4.7 unless someone else wants to do it.

On Tue, Jun 30, 2020 at 8:29 PM Jason Moore <jason.mo...@quantium.com.au> wrote:

> Hi all,
>
> Could I get some input on the severity of this one that I found yesterday?
> If that's a correctness issue, should it block this patch? Let me know under
> the ticket if there's more info that I can provide to help.
>
> https://issues.apache.org/jira/browse/SPARK-32136
>
> Thanks,
> Jason.
>
> From: Jungtaek Lim
> Date: Wednesday, 1 July 2020 at 10:20 am
> To: Shivaram Venkataraman
> Cc: Prashant Sharma, 郑瑞峰 <ruife...@foxmail.com>, Gengliang Wang, gurwls223,
> Dongjoon Hyun, Jules Damji, Holden Karau, Reynold Xin, Yuanjian Li,
> "dev@spark.apache.org", Takeshi Yamamuro <linguin@gmail.com>
> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>
> SPARK-32130 [1] looks to be a performance regression introduced in Spark
> 3.0.0, which is ideal to look into before releasing another bugfix version.
>
> 1. https://issues.apache.org/jira/browse/SPARK-32130

On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:

> Hi all
>
> I just wanted to ping this thread to see if all the outstanding blockers
> for 3.0.1 have been fixed. If so, it would be great if we can get the
> release going. The CRAN team sent us a note that the version of SparkR
> available on CRAN for the current R version (4.0.2) is broken, and hence we
> need to update the package soon -- it would be great to do it with 3.0.1.
>
> Thanks
> Shivaram

On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma <scrapco...@gmail.com> wrote:

> +1 for 3.0.1 release.
>
> I too can help…
Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source
Bump -- is there any interest in this topic?

On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim wrote:

> (Just to add rationale: you can refer to the original mail thread on the
> dev@ list to see the efforts on addressing problems in the file stream
> source/sink -
> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
> )
>
> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim wrote:
>
>> Hi devs,
>>
>> As I have been going through the various issues around metadata log growth,
>> it's not only an issue for the sink but also for the source. Unlike the
>> sink metadata log, whose entries should be available to readers, the source
>> metadata log only serves the streaming query restarting from the
>> checkpoint, so in theory it only needs to remember the minimal set of
>> entries that prevents processing the same file multiple times.
>>
>> This doesn't hold for the file stream source, and I think that's because of
>> the "latestFirst" option, which I haven't seen in any other source. The
>> option reads files in "backward" order, which means Spark can read the
>> oldest file and the latest file together in one micro-batch, so the source
>> ends up having to remember all files it has ever read. The option can also
>> be changed on query restart, so even if the query starts with "latestFirst"
>> set to false, it's not safe to minimize the remembered entries, as the
>> option can later be flipped to true and we would read old files again.
>>
>> I see two approaches here:
>>
>> 1) Apply "retention" - unlike "maxFileAge", this option would apply to
>> latestFirst as well. That is, if the retention is set to 7 days, files
>> older than 7 days are never read in any way, so we can at least drop
>> entries older than the retention. The open question is how to play nicely
>> with the existing "maxFileAge", which behaves similarly but is ignored when
>> latestFirst is turned on. (Change the semantics of "maxFileAge", or leave
>> it as a "soft retention" and introduce another option.)
>>
>> (This approach is proposed under SPARK-17604, and a PR is available -
>> https://github.com/apache/spark/pull/28422)
>>
>> 2) Replace the "latestFirst" option with alternatives that no longer read
>> in "backward" order - this doesn't mean we have to read all files to move
>> forward. As we do with Kafka, a start offset could be provided, ideally as
>> a timestamp, from which Spark would read in forward order. This doesn't
>> cover all use cases of "latestFirst", but "latestFirst" doesn't fit
>> naturally with the concepts of Structured Streaming (think about
>> watermarks), so I'd prefer to support alternatives instead of struggling
>> with "latestFirst".
>>
>> Would like to hear your opinions.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
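[For concreteness, a minimal sketch of the pieces discussed above. The file source options ("latestFirst", "maxFileAge") are existing Spark options; "startingOffsetsByTimestamp" on the Kafka source (added in Spark 3.0, if memory serves) is the "start from a timestamp and read forward" model that approach 2 points to. The schema, paths, broker, topic name, and timestamp below are hypothetical:

    import org.apache.spark.sql.SparkSession

    object LatestFirstSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("latest-first-sketch").getOrCreate()

        // Today's file stream source: "latestFirst" reads the newest files
        // first, which forces the source to remember every file it has ever
        // read; "maxFileAge" bounds which files are considered, but per the
        // note above it is ignored when latestFirst is turned on.
        val files = spark.readStream
          .format("json")
          .schema("id INT, value STRING")      // hypothetical schema
          .option("latestFirst", "true")
          .option("maxFileAge", "7d")
          .load("hdfs:///data/in")             // hypothetical path
        files.printSchema()

        // The Kafka-style alternative referenced in approach 2: give a
        // starting timestamp (epoch millis per partition) and read forward.
        val kafka = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical
          .option("subscribe", "events")                      // hypothetical
          .option("startingOffsetsByTimestamp",
            """{"events": {"0": 1595203200000}}""")           // 2020-07-20T00:00:00Z
          .load()
        kafka.printSchema()
      }
    }

A file-source equivalent of "startingOffsetsByTimestamp" would let a restarted query skip old files without the source having to remember every file ever read.]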