Spark MLlib Vector only supports data of double type, so it's reasonable to throw an exception when you create a Vector with an element of unicode type.
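
In practice the workaround is to cast the parsed string to float before building the vector, so DenseVector only ever sees numeric values. A minimal sketch (assuming the value arrives as a string read from text):

    from pyspark.mllib.linalg import Vectors

    raw_value = u"0.1"                            # value as read from a text/csv line, still a string
    dense_vec = Vectors.dense(float(raw_value))   # explicit cast -> DenseVector([0.1])

The same float(...) cast applied inside the map step of the csv example quoted below, i.e. Vectors.dense(float(line[1])), should also populate the "Y" column.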
2016-05-24 7:27 GMT-07:00 flyinggip <myflying...@hotmail.com>:

> Hi there,
>
> I notice that there might be a bug in pyspark.mllib.linalg.Vectors when
> dealing with a vector with a single element.
>
> Firstly, the 'dense' method says it can also take a numpy.array. However,
> the code uses 'if len(elements) == 1', and when a numpy.array holds only a
> single element (a zero-dimensional array) its length is undefined, so
> calling dense() on such an array currently crashes the program. Probably
> 'size' should be used instead of len() in that check.
>
> Secondly, after I managed to create a dense Vectors object with only one
> element from unicode, its behaviour seems unpredictable. For example,
>
>     Vectors.dense(unicode("0.1"))
>
> will report an error, whereas
>
>     dense_vec = Vectors.dense(unicode("0.1"))
>
> will NOT report any error until you run
>
>     dense_vec
>
> to check its value. And the following creates a DataFrame successfully:
>
>     mylist = [(0, Vectors.dense(unicode("0.1")))]
>     myrdd = sc.parallelize(mylist)
>     mydf = sqlContext.createDataFrame(myrdd, ["X", "Y"])
>
> However, if the same unicode value is read from a text file (e.g., a csv
> file with 2 columns), then the DataFrame column corresponding to "Y" will
> be EMPTY:
>
>     raw_data = sc.textFile(filename)
>     split_data = raw_data.map(lambda line: line.split(','))
>     parsed_data = split_data.map(lambda line: (int(line[0]),
>                                                Vectors.dense(line[1])))
>     mydf = sqlContext.createDataFrame(parsed_data, ["X", "Y"])
>
> It would be great if someone could share some ideas. Thanks a lot.
>
> f.
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Possible-bug-involving-Vectors-with-a-single-element-tp27013.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
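
For reference, the len() behaviour mentioned in the first point can be reproduced with numpy alone: a zero-dimensional array has no len(), while .size is always defined, which is consistent with the suggestion above to use size instead of len(). A small sketch outside of Spark:

    import numpy as np

    a = np.array(0.1)      # zero-dimensional array holding a single value
    print(a.size)          # 1
    try:
        len(a)             # raises TypeError: len() of unsized object
    except TypeError as e:
        print(e)

    b = np.array([0.1])    # one-dimensional array with one element
    print(len(b))          # 1 -- len() is defined here
    print(b.size)          # 1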