Hi,

I can easily do this in shell but wanted to see what I can do in Spark.

I am trying to create a simple table (10 rows, 2 columns) for now, then
register it as a temp table and, if feasible, store it in Hive.
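
For reference, this is roughly the end state I have in mind. Just a sketch,
assuming a Hive-enabled Spark 2.x session named spark, df being the DataFrame
I am after, and a hypothetical Hive database called test:

df.createOrReplaceTempView("randomData")                   // register as a temp table
spark.sql("SELECT * FROM randomData").show()
df.write.mode("overwrite").saveAsTable("test.randomdata")  // persist to Hive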

The first column, col1, is a monotonically increasing integer and the second
column, col2, is a string of 10 random characters.
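
As an aside, Spark does have a built-in monotonically_increasing_id(), but as
far as I know its values are only guaranteed to be unique and increasing, not
consecutive, hence I generate the integers myself. A sketch, with someDF
standing in for any DataFrame:

import org.apache.spark.sql.functions.monotonically_increasing_id
// values are unique and increasing, but not consecutive across partitions
val withId = someDF.withColumn("col1", monotonically_increasing_id())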

I use a UDF to create a random string of a given length (charlength):


import scala.util.Random

// Build a string of charlength characters drawn at random from chars
def random_string(chars: String, charlength: Int): String =
  (1 to charlength).map(_ => chars(Random.nextInt(chars.length))).mkString
spark.udf.register("random_string", random_string(_:String, _:Int))
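// Sanity check (my addition, assuming a Spark 2.x spark-shell session):
// once registered, the UDF is callable from SQL, e.g.
//   spark.sql("SELECT random_string('abcdefgh', 10)").show()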
// Case class describing the two columns
case class columns(col1: Int, col2: String)

val chars = ('a' to 'z') ++ ('A' to 'Z')
var text = "Array("
val comma = ","
val terminator = "))"
for (i <- 1 to 10) {
  val random_char = random_string(chars.mkString(""), 10)
  if (i < 10) {
    text = text + """(""" + i.toString + ""","""" + random_char + """")""" + comma
  } else {
    text = text + """(""" + i.toString + ""","""" + random_char + """")))"""
  }
}
println(text)
val df = sc.parallelize(text)

val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF

When I run it I get this:

Loading dynamic_ARRAY_generator.scala...
import scala.util.Random
random_string: (chars: String, charlength: Int)String
res0: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function2>,StringType,Some(List(StringType,
IntegerType)))
defined class columns
chars: scala.collection.immutable.IndexedSeq[Char] = Vector(a, b, c, d, e,
f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D,
E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z)
text: String = Array(
comma: String = ,
terminator: String = ))
Array((1,"yyzbPpXEoX"),(2,"bEnzvFCdXm"),(3,"dKXZbgaGTr"),(4,"hIHGkiWjcy"),(5,"HBnJmYlefk"),(6,"MKqfwWCmah"),(7,"CrKYmsbXKI"),(8,"iySnzSKtuH"),(9,"BbCRKqtkml"),(10,"nYdxrDneUm")))

df: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[0] at parallelize at <console>:27
<console>:29: error: value _1 is not a member of Char
       val df =sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
                                                       ^
<console>:29: error: value _2 is not a member of Char
       val df =sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
                                                                            ^
<console>:29: error: value toDF is not a member of org.apache.spark.rdd.RDD[U]
       val df =sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
                                                                                           ^

This works:

val df = sc.parallelize(text)

But this fails:

val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF

I gather it sees text as an RDD[Char]!
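
If that is right (sc.parallelize takes a Seq, and a String is implicitly a
Seq[Char], so each element ends up being a single character), the fix should be
to keep the rows as a real Scala collection rather than flattening them into a
String. A sketch of what I think should work, reusing the random_string, chars
and columns defined above (untested):

// Build the rows as a Seq of tuples; parallelize then gives an
// RDD[(Int, String)] instead of an RDD[Char]
val rows = for (i <- 1 to 10) yield (i, random_string(chars.mkString(""), 10))
val df = sc.parallelize(rows).map(p => columns(p._1, p._2)).toDF
df.createOrReplaceTempView("randomData")   // then query/persist as planned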


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
