This is expected. You are not accessing the live Dataset dict when the UDF
countPositiveSimilarity is called. The dict Dataset, as it existed when the
UDF was created, is captured in the UDF's closure. If you change dict later,
the changes will not automatically be picked up by countPositiveSimilarity.
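
If you need the lookup to behave predictably (and to survive being shipped to
executors), a common pattern is to snapshot the dictionary into a plain Scala
collection on the driver and let the UDF capture that instead of the Dataset.
A minimal sketch, reusing dict and WordSimilarity from the code below;
simPairs and countPositiveSimilarityLocal are names introduced here purely for
illustration:

// Snapshot the dictionary on the driver. The UDF's closure then carries
// only a small, serializable local collection, not a Dataset reference.
val simPairs: Array[WordSimilarity] = dict.collect()

val countPositiveSimilarityLocal = udf[Long, Seq[String], Seq[String]]((a, b) =>
  simPairs.count(w =>
    ((a.contains(w.first) && b.contains(w.second)) ||
     (b.contains(w.first) && a.contains(w.second))) && w.similarity > 0.7
  ).toLong
)

This assumes dict is small enough to collect; for larger dictionaries a
broadcast variable or a plain join would be the usual alternatives.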


From: knowsnothing
Sent: Sunday, November 5, 2017 8:12 AM
To: dev@spark.apache.org
Subject: Accessing DataFrame inside UserDefinedFunction.

Hi Everyone,

I've been told that accessing a DataFrame object from a UserDefinedFunction is
not possible, yet the code shown below (taken from StackOverflow) works fine.
Why is that? Is it a bug, or is it expected?

Thanks in advance.

import org.apache.spark.sql.functions.udf
import spark.implicits._  // for $-syntax and createDataset encoders

case class Target(wordListOne: Seq[String], wordListTwo: Seq[String])
val targetData = Seq(
  Target(Seq("Spark", "Wrong", "Something"), Seq("Java", "Grape", "Banana")),
  Target(Seq("Java", "Scala"), Seq("Scala", "Banana")),
  Target(Seq(""), Seq("Grape", "Banana")),
  Target(Seq(""), Seq("")))
val targets = spark.createDataset(targetData)

case class WordSimilarity(first: String, second: String, similarity: Double)
val similarityData = Seq(
  WordSimilarity("Spark", "Java", 0.8),
  WordSimilarity("Scala", "Spark", 0.9),
  WordSimilarity("Java", "Scala", 0.9),
  WordSimilarity("Apple", "Grape", 0.66),
  WordSimilarity("Scala", "Apple", -0.1),
  WordSimilarity("Gine", "Spark", 0.1))
val dict = spark.createDataset(similarityData)

// The closure captures the Dataset dict and runs a filter-and-count on it
// for every input row.
val countPositiveSimilarity = udf[Long, Seq[String], Seq[String]]((a, b) =>
  dict.filter(
    (($"first".isin(a: _*) && $"second".isin(b: _*)) ||
     ($"first".isin(b: _*) && $"second".isin(a: _*))) && $"similarity" > 0.7
  ).count
)

val countDF = targets.withColumn("positive_count",
  countPositiveSimilarity($"wordListOne", $"wordListTwo"))
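
Note that countDF is lazily evaluated; the UDF (and hence the captured dict)
only runs once an action is triggered, e.g.:

countDF.show()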



