Hi all!
Is it possible that, under certain circumstances, Spark creates duplicate
rows when doing multiple joins?
What I did:
buse.count
res0: Long = 20554365
buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").count
res1: Long = 20554365
buse.alias("buse").join(bdef.alia
You are overflowing the integer type, which goes up to a max value
of 2147483647 (2^31 - 1). Change the return type of `sha2Int2` to
`LongType()` and it works as expected.
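(A minimal spark-shell sketch of the same point — this is a hypothetical
stand-in, not the thread's actual `sha2Int2`, whose code isn't shown:
eight hex characters can encode values up to 4294967295, which overflows a
32-bit Int, so the UDF has to return a Long:)

import java.security.MessageDigest
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for sha2Int2: hash a string, keep 8 hex characters,
// and parse them as a number. 8 hex digits reach 0xFFFFFFFF = 4294967295,
// which exceeds Int.MaxValue (2147483647), so the result must be a Long.
val sha2Long = udf { (s: String) =>
  val hex = MessageDigest.getInstance("SHA-256")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
  java.lang.Long.parseLong(hex.take(8), 16)
}

Seq("a", "b").toDF("value_to_hash")
  .select(sha2Long(col("value_to_hash")).alias("hash"))
  .show()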
On Mon, Mar 23, 2020 at 6:15 AM ayan guha wrote:
> Hi
>
> I am trying to implement simple hashing/checksum logic. The key lo
Thanks a lot. Will try.
On Mon, Mar 23, 2020 at 8:16 PM Jacob Lynn wrote:
> You are overflowing the integer type, which goes up to a max value
> of 2147483647 (2^31 - 1). Change the return type of `sha2Int2` to
> `LongType()` and it works as expected.
>
> On Mon, Mar 23, 2020 at 6:15 AM ayan guha wrote:
Ayan,
no need for UDFs, the SQL API provides all you need (sha1, substring, conv):
https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html
>>> from pyspark.sql.functions import col, conv, sha1, substring
>>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).show()
+---------+
| sha2long|
+---------+
I had the exact same issue. The temporary fix I used was to take the open
source code from GitHub, modify the group.id mandatory logic, and build a
customized library.
Thanks,
On Tue, Mar 17, 2020 at 7:34 AM Sjoerd van Leent <
sjoerd.van.le...@alliander.com> wrote:
> Dear reader,
>
>
>
> I must force the gr
Awesome! Did not know about the conv function, so thanks for that.
On Tue, 24 Mar 2020 at 1:23 am, Enrico Minack wrote:
> Ayan,
>
> no need for UDFs, the SQL API provides all you need (sha1, substring, conv
> ):
> https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html
>
> >>> df.select(conv
Hi Ayan,
You don't have to bother with the conversion at all. Functions that work on
numeric columns will still work on string columns, as long as every value in
the column is a number:
scala> df2.printSchema
root
|-- id: string (nullable = false)
|-- id2: string (nullable = false)
scala> df2.show
+---+---
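(Not in the original mail: a small spark-shell sketch with made-up data
illustrating the point above — arithmetic and aggregate functions cast
string columns to numbers implicitly when every value parses as a number:)

val df2 = Seq(("1", "10"), ("2", "20")).toDF("id", "id2")
df2.printSchema                                   // both columns are string
df2.select(($"id" + $"id2").alias("sum")).show()  // 11.0 and 22.0 (cast to double)
df2.selectExpr("sum(id2)").show()                 // 30.0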
I now need to integrate Spark into our own platform, which is built with
Spring, to support task submission and task monitoring. Spark tasks run on
YARN in cluster mode, and our current service may submit tasks to different
YARN clusters.
According to the current method provided by spar
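(Not from the original mail: one common way to submit and monitor jobs
programmatically from a JVM service is org.apache.spark.launcher.SparkLauncher.
A minimal sketch follows; the paths, main class, and HADOOP_CONF_DIR values
are placeholders to adapt to your clusters:)

import java.util.{HashMap => JHashMap}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Placeholder environment: HADOOP_CONF_DIR selects which YARN cluster
// the application is submitted to.
val env = new JHashMap[String, String]()
env.put("HADOOP_CONF_DIR", "/etc/hadoop/cluster-a/conf")

val handle: SparkAppHandle = new SparkLauncher(env)
  .setSparkHome("/opt/spark")                // placeholder
  .setAppResource("/path/to/your-app.jar")   // placeholder
  .setMainClass("com.example.YourSparkJob")  // placeholder
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setConf("spark.executor.memory", "4g")
  .startApplication()

// The handle reports the application's lifecycle; you can also poll
// handle.getState and handle.getAppId for monitoring.
handle.addListener(new SparkAppHandle.Listener {
  override def stateChanged(h: SparkAppHandle): Unit =
    println(s"state=${h.getState} appId=${h.getAppId}")
  override def infoChanged(h: SparkAppHandle): Unit = ()
})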