Spark 2.2.1 Dataframes multiple joins bug?

2020-03-23 Thread Dipl.-Inf. Rico Bergmann
Hi all! Is it possible that under certain circumstances Spark creates duplicate rows when doing multiple joins? What I did:

buse.count
res0: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4" === $"bdef._c4").count
res1: Long = 20554365

buse.alias("buse").join(bdef.alia
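
A minimal sketch of the effect that can look like duplication (illustrative data, not the poster's buse/bdef tables): a join emits one output row per matching pair of rows, so a key that occurs more than once on the other side legitimately multiplies the count.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// k=1 on the left pairs with two right rows, so the joined count
// exceeds the left count without any duplication bug.
val left  = Seq((1, "a"), (2, "b")).toDF("k", "l")
val right = Seq((1, "x"), (1, "y"), (2, "z")).toDF("k", "r")

left.count()                   // 2
left.join(right, "k").count()  // 3: k=1 matches twice, k=2 once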

Re: Issue with UDF Int Conversion - Str to Int

2020-03-23 Thread Jacob Lynn
You are overflowing the integer type, which goes up to a max value of 2147483647 (2^31 - 1). Change the return type of `sha2Int2` to `LongType()` and it works as expected.

On Mon, Mar 23, 2020 at 6:15 AM ayan guha wrote:
> Hi
>
> I am trying to implement simple hashing/checksum logic. The key lo
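
A runnable sketch of the fix; the UDF body is a hypothetical reconstruction, since the original sha2Int2 is truncated above. Eight hex digits can encode values up to 0xFFFFFFFF = 4294967295, which overflows a 32-bit Int, so the UDF must return Long (LongType in the schema).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import java.security.MessageDigest

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Hypothetical reconstruction: SHA-256 the input and keep the last 8
// hex digits as a number. Returning Long avoids the Int overflow.
val sha2Long = udf { s: String =>
  val hex = MessageDigest.getInstance("SHA-256")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_)).mkString
  java.lang.Long.parseLong(hex.substring(32, 40), 16)
}

Seq("a", "b").toDF("value_to_hash")
  .withColumn("checksum", sha2Long(col("value_to_hash")))
  .show()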

Re: Issue with UDF Int Conversion - Str to Int

2020-03-23 Thread ayan guha
Thanks a lot. Will try.

On Mon, Mar 23, 2020 at 8:16 PM Jacob Lynn wrote:
> You are overflowing the integer type, which goes up to a max value
> of 2147483647 (2^31 - 1). Change the return type of `sha2Int2` to
> `LongType()` and it works as expected.
>
> On Mon, Mar 23, 2020 at 6:15 AM ayan guh

Re: Issue with UDF Int Conversion - Str to Int

2020-03-23 Thread Enrico Minack
Ayan, no need for UDFs, the SQL API provides all you need (sha1, substring, conv):
https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html

>>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).show()
+----------+
|  sha2long|
+----------+
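
For reference, the same expression as a self-contained Scala snippet (the thread's version is PySpark; the example data is assumed): sha1 yields a 40-character hex digest, substring(..., 33, 8) keeps the last eight hex digits (1-based indexing), conv(..., 16, 10) converts them to decimal, and the cast to long makes the result numeric.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, conv, sha1, substring}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Pure SQL-function pipeline, no UDF: hash, slice the last 8 hex
// digits, convert base 16 -> base 10, then cast the string to long.
Seq("a", "b").toDF("value_to_hash")
  .select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10)
    .cast("long")
    .alias("sha2long"))
  .show()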

Re: Problem with Kafka group.id

2020-03-23 Thread Sethupathi T
I had the exact same issue. The temporary fix I applied was to take the open source code from GitHub, modify the group.id mandatory logic, and build a customized library. Thanks,

On Tue, Mar 17, 2020 at 7:34 AM Sjoerd van Leent <sjoerd.van.le...@alliander.com> wrote:
> Dear reader,
>
> I must force the gr
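
For readers on newer versions: Spark 3.0 added a kafka.group.id source option to the Structured Streaming Kafka connector, which avoids patching the library. A sketch, assuming Spark 3.0+ with the Kafka package on the classpath and hypothetical broker/topic names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// kafka.group.id (Spark 3.0+) forces a fixed consumer group instead of
// the randomly generated spark-kafka-source-* id.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical
  .option("subscribe", "some-topic")                // hypothetical
  .option("kafka.group.id", "my-fixed-group")
  .load()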

Re: Issue with UDF Int Conversion - Str to Int

2020-03-23 Thread ayan guha
Awesome! Did not know about the conv function, so thanks for that.

On Tue, 24 Mar 2020 at 1:23 am, Enrico Minack wrote:
> Ayan,
>
> no need for UDFs, the SQL API provides all you need (sha1, substring, conv):
> https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html
>
> >>> df.select(conv

Re: Issue with UDF Int Conversion - Str to Int

2020-03-23 Thread Vipul Rajan
Hi Ayan, You don't have to bother with conversion at all. All functions that should work on number columns would still work as long as all values in the column are numbers:

scala> df2.printSchema
root
 |-- id: string (nullable = false)
 |-- id2: string (nullable = false)

scala> df2.show
+---+---
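
A minimal illustration of that implicit coercion (illustrative data, not the thread's df2): arithmetic on string columns makes Spark cast them to double automatically.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Both columns are strings, yet numeric operators work: Spark's type
// coercion casts them to double before adding.
val df2 = Seq(("1", "10"), ("2", "20")).toDF("id", "id2")
df2.select((col("id") + col("id2")).alias("sum")).show()
// +----+
// | sum|
// +----+
// |11.0|
// |22.0|
// +----+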

Integration about submitting and monitoring spark tasks

2020-03-23 Thread jianl miao
I now need to integrate Spark into our own platform, built with Spring, to support task submission and task monitoring. Spark tasks run on YARN in cluster mode, and our current service may submit tasks to different YARN clusters. According to the current method provided by spar
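
One common approach for this kind of integration (a sketch only, with hypothetical paths and class names, not the poster's eventual solution) is the spark-launcher module, which submits an application programmatically and exposes a handle for monitoring its state:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Submit to YARN in cluster mode and watch state transitions. Passing a
// per-cluster HADOOP_CONF_DIR lets one service target different YARN
// clusters. All paths and names below are hypothetical.
object LaunchExample {
  def main(args: Array[String]): Unit = {
    val env = new java.util.HashMap[String, String]()
    env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf-cluster-a") // pick target cluster

    val listener = new SparkAppHandle.Listener {
      override def stateChanged(h: SparkAppHandle): Unit =
        println(s"state=${h.getState} appId=${h.getAppId}")
      override def infoChanged(h: SparkAppHandle): Unit = ()
    }

    val handle = new SparkLauncher(env)
      .setSparkHome("/opt/spark")
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppResource("/jobs/my-job.jar")
      .setMainClass("com.example.MyJob")
      .startApplication(listener)

    // Block until the application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
  }
}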