Thanks Davies, sure, I can share the code/data in a PM.

best
fahad

On Mon, Oct 19, 2015 at 10:52 AM, Davies Liu <dav...@databricks.com> wrote:
> Could you simplify the code a little bit so we can reproduce the failure?
> (You may also want to include a small sample dataset if it depends on one.)
>
> On Sun, Oct 18, 2015 at 10:42 PM, fahad shah <sfaha...@gmail.com> wrote:
>> Hi,
>>
>> I am trying to build pair RDDs: group by the key and assign an id based
>> on the key. I am using PySpark with Spark 1.3, and for some reason I am
>> getting an error that I am unable to figure out - any help is much
>> appreciated.
>>
>> Things I tried (to no effect):
>>
>> 1. made sure I am not doing any conversions on the strings
>> 2. made sure the fields used in the key are all present and not
>> empty strings (otherwise I toss the row out)
>>
>> My code is along the following lines (split uses StringIO to parse the
>> csv, isHeader detects the header row, and parse_train puts the 54
>> fields into a named tuple after whitespace/quote removal):
>>
>> # The "string argument" error is thrown on BB.take(1), where the
>> # groupByKey is evaluated
>>
>> A = (sc.textFile("train.csv")
>>        .filter(lambda x: not isHeader(x))
>>        .map(split)
>>        .map(parse_train)
>>        .filter(lambda x: x is not None))
>>
>> A.count()
>>
>> B = A.map(lambda k: ((k.srch_destination_id, k.srch_length_of_stay,
>>                       k.srch_booking_window, k.srch_adults_count,
>>                       k.srch_children_count, k.srch_room_count),
>>                      k[0:54]))
>> BB = B.groupByKey()
>> BB.take(1)
>>
>> best fahad
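[Editor's note: for anyone landing on this thread, below is a minimal,
self-contained sketch of the pipeline described above. The isHeader,
split, and parse_train helpers are reconstructed from the prose, not the
poster's actual code; only the six key column names appear in the
original message, so the schema here is a hypothetical stand-in for the
real 54-column file.]

import csv
from io import StringIO              # Python 3; on Python 2 use cStringIO
from collections import namedtuple
from pyspark import SparkContext

sc = SparkContext(appName="pair-rdd-sketch")

# Hypothetical schema: the real train.csv has 54 columns, but only these
# six key fields are named in the thread.
FIELDS = ["srch_destination_id", "srch_length_of_stay",
          "srch_booking_window", "srch_adults_count",
          "srch_children_count", "srch_room_count"]
Row = namedtuple("Row", FIELDS)

def is_header(line):
    # Hypothetical header check: assumes the header row starts with the
    # first column name.
    return line.startswith(FIELDS[0])

def split(line):
    # Parse one physical line as CSV (handles quoted fields), mirroring
    # the StringIO-based split described in the thread.
    return next(csv.reader(StringIO(line)))

def parse_train(fields):
    # Strip whitespace/quotes and drop rows with missing key fields,
    # as the poster describes.
    cleaned = [f.strip().strip('"') for f in fields]
    if len(cleaned) != len(FIELDS) or "" in cleaned:
        return None
    return Row(*cleaned)

A = (sc.textFile("train.csv")
       .filter(lambda x: not is_header(x))
       .map(split)
       .map(parse_train)
       .filter(lambda x: x is not None))

B = A.map(lambda r: ((r.srch_destination_id, r.srch_length_of_stay,
                      r.srch_booking_window, r.srch_adults_count,
                      r.srch_children_count, r.srch_room_count),
                     tuple(r)))      # plain tuple as the value
BB = B.groupByKey()
print(BB.take(1))

[The value side here uses a plain tuple rather than the namedtuple row;
whether that matters for the error depends on the actual traceback,
which the thread does not include.]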