OK, I have sorted out the basic problem.

I can dynamically create a text string with 2 columns and numbered rows:

scala> println(text)
(1,"VDNiqDKChu"),(2,"LApMjYGYkC"),(3,"HuVCyfizzD"),(4,"kUSzHWquGA"),(5,"OlJGGQQlUh"),(6,"POljdWgAIN"),(7,"wsRqqGZaqy"),(8,"HOgdjAFUln"),(9,"jYwvafOjDo"),(10,"QlvZGMBimd")

If I cut and paste it into the terminal, this works OK:

scala> val df = sc.parallelize(Array(
(1,"VDNiqDKChu"),(2,"LApMjYGYkC"),(3,"HuVCyfizzD"),(4,"kUSzHWquGA"),(5,"OlJGGQQlUh"),(6,"POljdWgAIN"),(7,"wsRqqGZaqy"),(8,"HOgdjAFUln"),(9,"jYwvafOjDo"),(10,"QlvZGMBimd")))

scala> df.count
res50: Long = 10

So I see 10 entries in the array.

But if I do the following, passing the text String to Array:

scala> val df = sc.parallelize(Array(text))
df: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[113] at parallelize at <console>:29

It is interpreted as one single String!

scala> df.count
res52: Long = 1

Is there any way I can force it NOT to be treated as one String and have the contents interpreted?
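
For what it's worth, a minimal sketch of two possible workarounds (both reuse random_string and chars from my earlier message below; the regex is a hypothetical pattern that happens to match the format printed above):

// 1) skip the String round-trip and build real (Int, String) tuples
val rows = for (i <- 1 to 10) yield (i, random_string(chars.mkString(""), 10))
sc.parallelize(rows).count  // Long = 10

// 2) or parse the generated String back into tuples with a regex
val pattern = """\((\d+),"([^"]+)"\)""".r
val parsed = pattern.findAllMatchIn(text).map(m => (m.group(1).toInt, m.group(2))).toArray
sc.parallelize(parsed).count  // Long = 10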

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 23 August 2016 at 12:51, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> I can easily do this in shell but wanted to see what I can do in Spark.
>
> I am trying to create a simple table (10 rows, 2 columns) for now, then
> register it as a tempTable and store it in Hive, if that is feasible.
>
> The first column, col1, is a monotonically incrementing integer and the
> second column is a string of 10 random chars.
>
> I use a UDF to create a random string of a given length (charlength):
>
>
> import scala.util.Random
> // returns a string of charlength characters drawn at random from chars
> def random_string(chars: String, charlength: Int): String = {
>   (1 to charlength).map { _ =>
>     chars(Random.nextInt(chars.length))
>   }.mkString
> }
> spark.udf.register("random_string", random_string(_: String, _: Int))
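>
> As a quick sanity check (a sketch, assuming the Spark 2.0 session is named
> spark, consistent with the spark.udf.register call above), the registered
> UDF can be called from SQL:
>
> spark.sql("SELECT random_string('abcdefghij', 10) AS col2").show(false)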
> //create class
> case class columns (col1: Int, col2: String)
>
> val chars = ('a' to 'z') ++ ('A' to 'Z')
> var text = "Array("
> val comma = ","
> val terminator = "))"
> for (i <- 1 to 10) {
>   val random_char = random_string(chars.mkString(""), 10)
>   if (i < 10) { text = text + """(""" + i.toString + ""","""" + random_char + """")""" + comma }
>   else { text = text + """(""" + i.toString + ""","""" + random_char + """")))""" }
> }
> println(text)
> val df = sc.parallelize(text)
>
> val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>
> When I run it, I get this:
>
> Loading dynamic_ARRAY_generator.scala...
> import scala.util.Random
> random_string: (chars: String, charlength: Int)String
> res0: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, IntegerType)))
> defined class columns
> chars: scala.collection.immutable.IndexedSeq[Char] = Vector(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z)
> text: String = Array(
> comma: String = ,
> terminator: String = ))
> Array((1,"yyzbPpXEoX"),(2,"bEnzvFCdXm"),(3,"dKXZbgaGTr"),(4,"hIHGkiWjcy"),(5,"HBnJmYlefk"),(6,"MKqfwWCmah"),(7,"CrKYmsbXKI"),(8,"iySnzSKtuH"),(9,"BbCRKqtkml"),(10,"nYdxrDneUm")))
>
> df: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[0] at parallelize at <console>:27
> <console>:29: error: value _1 is not a member of Char
>        val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>                                                         ^
> <console>:29: error: value _2 is not a member of Char
>        val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>                                                                            ^
> <console>:29: error: value toDF is not a member of org.apache.spark.rdd.RDD[U]
>        val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>                                            ^
>
> This works:
>
> val df = sc.parallelize(text)
>
> But this fails:
>
> val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>
> I gather it sees it as RDD[Char]!
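>
> That is indeed what happens: a Scala String is a Seq[Char], so sc.parallelize(text)
> distributes the characters one by one, and p is a Char with no _1 or _2. A minimal
> sketch of the direct route, reusing the columns case class, random_string and chars
> above (assuming Spark 2.x, where toDF on an RDD needs import spark.implicits._):
>
> import spark.implicits._  // brings toDF into scope for RDDs of case classes
> // build real rows rather than rendering them into a String
> val rows = for (i <- 1 to 10) yield columns(i, random_string(chars.mkString(""), 10))
> val df = sc.parallelize(rows).toDF
> df.count  // Long = 10
> df.createOrReplaceTempView("tmp")  // then register as a temp view if needed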