Sorry if this email is a duplicate; I realised that I was not registered with
the mailing list ...


I am having a problem with a Spark SQL script which is running on a Spark 1.2
CentOS CDH 5.3 mini five-node cluster. The script processes some image CSV data,
each record/line of which has 28x28 integer elements ending with an integer
label value which describes the record.

I have checked the data file rows to ensure that they contain the correct
number of elements.



[hadoop@hc2r1m2 h2o_spark_1_2]$ cat chk_csv.bash
#!/bin/bash

MYFILE="$1"

echo ""
echo "check $MYFILE"
echo ""

while read -r line
do
  echo -n "Line records = "
  echo "$line" | tr "," "\n" | wc -l
done < "$MYFILE"



[hadoop@hc2r1m2 h2o_spark_1_2]$ ./chk_csv.bash mnist_train_1x.csv

check mnist_train_1x.csv

Line records = 785
Line records = 785
Line records = 785
Line records = 785
Line records = 785
Line records = 785
Line records = 785
Line records = 785
Line records = 785
Line records = 785
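As a quicker cross-check of the same thing, the per-line field counts can be
summarised in a single awk pass (just a sketch; it assumes mnist_train_1x.csv
is in the current directory):

```shell
# For each distinct field count NF, print how many lines have it.
# A well-formed file produces exactly one output line: "10 785".
awk -F',' '{ counts[NF]++ } END { for (n in counts) print counts[n], n }' mnist_train_1x.csv
```

Any extra output line would point at a row with a stray or missing comma.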



The basic script looks like this:



  // create a Spark conf and context

  import org.apache.spark.{SparkConf, SparkContext}

  val sparkMaster = "spark://hc2nn.semtech-solutions.co.nz:7077"
  val appName = "Spark ex3"
  val conf = new SparkConf()

  conf.setMaster(sparkMaster)
  conf.setAppName(appName)

  val sparkCxt = new SparkContext(conf)

  // initialise the SQL context

  import org.apache.spark.sql._
  implicit val sqlContext = new SQLContext(sparkCxt)

  // prep the HDFS-based data

  val server    = "hdfs://hc2nn.semtech-solutions.co.nz:8020"
  val path      = "/data/spark/h2o/"
  val train_csv = server + path + "mnist_train_1x.csv"

  // string representing the schema: 28 x 28 ints plus a label int
  val schemaString = getSchema()

  val schema = StructType( schemaString.split(" ")
         .map(fieldName => StructField(fieldName, IntegerType, false)))

  val rawTrainData = sparkCxt.textFile(train_csv)

  val trainRDD = rawTrainData.map( rawRow => Row( rawRow.split(",").map(_.toInt) ) )

  val trainSchemaRDD = sqlContext.applySchema(trainRDD, schema)

  trainSchemaRDD.registerTempTable("trainingTable")

  val resSchemaRddTrain = sqlContext.sql("SELECT P1, Label FROM trainingTable")

  println( " >>>>> sqlResult rows = " + resSchemaRddTrain.count() )

  val resArray = resSchemaRddTrain.collect()

  // collecting the results throws java.lang.ArrayIndexOutOfBoundsException



No matter what I try I get an ArrayIndexOutOfBoundsException when I try to
examine the result of the select, even though I know that there are 10 result
rows.

I don't think it's a data issue, as both the schema and all data rows have
28x28 + 1 = 785 elements.

Advice appreciated



                                          
