OK, I sorted out the basic problem. I can dynamically create a text string with 2 columns and numbered rows:
scala> println(text)
(1,"VDNiqDKChu"),(2,"LApMjYGYkC"),(3,"HuVCyfizzD"),(4,"kUSzHWquGA"),(5,"OlJGGQQlUh"),(6,"POljdWgAIN"),(7,"wsRqqGZaqy"),(8,"HOgdjAFUln"),(9,"jYwvafOjDo"),(10,"QlvZGMBimd")

If I cut and paste that output into the terminal, it works OK:

scala> val df = sc.parallelize(Array((1,"VDNiqDKChu"),(2,"LApMjYGYkC"),(3,"HuVCyfizzD"),(4,"kUSzHWquGA"),(5,"OlJGGQQlUh"),(6,"POljdWgAIN"),(7,"wsRqqGZaqy"),(8,"HOgdjAFUln"),(9,"jYwvafOjDo"),(10,"QlvZGMBimd")))
scala> df.count
res50: Long = 10

So I see 10 entries in the array. But if I pass the text String to the Array instead:

scala> val df = sc.parallelize(Array(text))
df: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[113] at parallelize at <console>:29

it is interpreted as a single String:

scala> df.count
res52: Long = 1

Is there any way I can force Spark to interpret the contents of the String, rather than treat it as one element?

Thanks,

Mich
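A minimal sketch of the usual way around this, assuming a Spark shell with sc in scope: keep the rows as Scala tuples and hand parallelize the collection itself rather than a String rendering of it. The randomString helper below is an illustrative stand-in for the random_string UDF in the quoted message, not part of the original code.

import scala.util.Random

val chars = (('a' to 'z') ++ ('A' to 'Z')).mkString

// Illustrative stand-in for the random_string UDF quoted below.
def randomString(charlength: Int): String =
  (1 to charlength).map(_ => chars(Random.nextInt(chars.length))).mkString

// Build the rows as a real Seq[(Int, String)]; no string concatenation,
// so there is nothing to parse back out later.
val rows = (1 to 10).map(i => (i, randomString(10)))

val df = sc.parallelize(rows)
df.count   // Long = 10, one element per tuple

With tuple elements instead of characters, the subsequent map into the columns case class and the .toDF call from the quoted message work as intended.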
On 23 August 2016 at 12:51, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> I can easily do this in shell but wanted to see what I can do in Spark.
>
> I am trying to create a simple table (10 rows, 2 columns) for now, then
> register it as a tempTable and store it in Hive, if that is feasible.
>
> The first column col1 is a monotonically increasing integer and the
> second column is a string of 10 random characters.
>
> I use a UDF to create a random string of a given length (charlength):
>
> import scala.util.Random
> def random_string(chars: String, charlength: Int) : String = {
>   val newKey = (1 to charlength).map( x => {
>     val index = Random.nextInt(chars.length)
>     chars(index)
>   }).mkString("")
>   return newKey
> }
> spark.udf.register("random_string", random_string(_:String, _:Int))
>
> // create class
> case class columns (col1: Int, col2: String)
>
> val chars = ('a' to 'z') ++ ('A' to 'Z')
> var text = "Array("
> val comma = ","
> val terminator = "))"
> for (i <- 1 to 10) {
>   var random_char = random_string(chars.mkString(""), 10)
>   if (i < 10) {text = text + """(""" + i.toString + ""","""" + random_char + """")""" + comma}
>   else {text = text + """(""" + i.toString + ""","""" + random_char + """")))"""}
> }
> println(text)
>
> val df = sc.parallelize(text)
>
> val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>
> When I run it I get this:
>
> Loading dynamic_ARRAY_generator.scala...
> import scala.util.Random
> random_string: (chars: String, charlength: Int)String
> res0: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, IntegerType)))
> defined class columns
> chars: scala.collection.immutable.IndexedSeq[Char] = Vector(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z)
> text: String = Array(
> comma: String = ,
> terminator: String = ))
> Array((1,"yyzbPpXEoX"),(2,"bEnzvFCdXm"),(3,"dKXZbgaGTr"),(4,"hIHGkiWjcy"),(5,"HBnJmYlefk"),(6,"MKqfwWCmah"),(7,"CrKYmsbXKI"),(8,"iySnzSKtuH"),(9,"BbCRKqtkml"),(10,"nYdxrDneUm")))
>
> df: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[0] at parallelize at <console>:27
> <console>:29: error: value _1 is not a member of Char
>        val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>                                                         ^
> <console>:29: error: value _2 is not a member of Char
>        val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>                                                                              ^
> <console>:29: error: value toDF is not a member of org.apache.spark.rdd.RDD[U]
>        val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>                                                                                             ^
>
> This works:
>
> val df = sc.parallelize(text)
>
> but this fails:
>
> val df = sc.parallelize(text).map(p => columns(p._1.toString.toInt, p._2.toString)).toDF
>
> I gather it sees text as an RDD[Char]!
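For the record, sc.parallelize(text) compiles because a Scala String is implicitly a Seq[Char], so the result is an RDD[Char]; that is exactly why _1 and _2 are "not a member of Char". Below is a sketch of the end-to-end goal, tempTable included, assuming a Spark 2.x shell where spark is in scope. Columns and the view name tmp are illustrative names, not taken from the original script.

import scala.util.Random
import spark.implicits._

// Mirrors the columns case class from the quoted message.
case class Columns(col1: Int, col2: String)

val chars = (('a' to 'z') ++ ('A' to 'Z')).mkString

// Illustrative stand-in for the random_string UDF.
def randomString(charlength: Int): String =
  (1 to charlength).map(_ => chars(Random.nextInt(chars.length))).mkString

// col1 is monotonically increasing, col2 a random 10-character string.
val df = (1 to 10).map(i => Columns(i, randomString(10))).toDF

df.createOrReplaceTempView("tmp")   // the tempTable step from the message
spark.sql("SELECT * FROM tmp").show(false)

Persisting to Hive from there is a separate step (for example df.write.saveAsTable), and depends on the session being Hive-enabled.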