Thanks Praveen.

With regard to the key/value pair: my Kafka producer takes the following rows as input:
cat ${IN_FILE} | ${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list rhes564:9092 --topic newtopic

That ${IN_FILE} is the source of prices (1,000 rows), as follows:

ID, TIMESTAMP, PRICE
40, 20160426-080924, 67.55738301621814598514

So the tuples would be like below?

(1, "ID")
(2, "TIMESTAMP")
(3, "PRICE")

For the values:

val words1 = lines.map(_.split(',').view(1))
val words2 = lines.map(_.split(',').view(2))
val words3 = lines.map(_.split(',').view(3))

So words1 will return the value of ID, words2 the value of TIMESTAMP and words3 the value of PRICE? I assume this is an array that can be handled as elements of an array as well?

Regards

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 26 April 2016 at 11:11, Praveen Devarao <praveen...@in.ibm.com> wrote:

> Hi Mich,
>
> >>
> val lines = dstream.map(_._2)
>
> This maps the record into components? Is that the correct understanding of it?
> <<
>
> Not sure what you refer to when you say "record into components". The
> above function is basically giving you the tuple (key/value pair) that you
> would have inserted into Kafka. Say my Kafka producer puts data as
>
> 1 => "abc"
> 2 => "def"
>
> Then the above map would give you the tuples below
>
> (1, "abc")
> (2, "def")
>
> >>
> The following splits the line into comma separated fields.
>
> val words = lines.map(_.split(',').view(2))
> <<
>
> Right, basically the value portion of your Kafka data is being
> handled here.
>
> >>
> val words = lines.map(_.split(',').view(2))
>
> I am interested in column three, so view(2) returns the value.
>
> I have also seen other ways like
>
> val words = lines.map(_.split(',').map(line => (line(0), (line(1), line(2) ...
> <<
>
> The split operation returns an array of String [an immutable StringLike
> collection]. Calling the view method creates an IndexedSeqView over the
> iterable, whereas in the second way you are iterating through it and
> accessing the elements directly via their index position [line(0), line(1)].
> You would have to decide what is best for your use case based on whether
> evaluation should be lazy or immediate [see references below].
>
> References:
> http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqLike
> http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqView
>
> Thanking You
>
> ---------------------------------------------------------------------------------
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
> ---------------------------------------------------------------------------------
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
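To make the lazy-versus-immediate distinction above concrete, here is a minimal sketch against one sample row from the feed (Scala 2.10-style collections assumed; the value names line, fields, record, lazyFields and price are illustrative, not from the thread):

// one Kafka message value, i.e. what comes out of dstream.map(_._2)
val line = "40, 20160426-080924, 67.55738301621814598514"

// String.split returns an Array[String]; Scala arrays are indexed from 0
val fields = line.split(',')                // fields(0) = ID, fields(1) = TIMESTAMP, fields(2) = PRICE

// immediate (eager) access by index, as in the second style quoted above
val record = (fields(0).trim, (fields(1).trim, fields(2).trim))

// lazy access through a view: nothing is evaluated until an element is requested
val lazyFields = fields.view.map(_.trim)    // no trim has run yet
val price      = lazyFields(2)              // evaluates only this element -> "67.55738301621814598514"

Either way the split itself produces an ordinary Array[String], so the fields can indeed be handled as array elements, with index 0 being ID, 1 being TIMESTAMP and 2 being PRICE.
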
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> To: "user @spark" <user@spark.apache.org>
> Date: 26/04/2016 12:58 pm
> Subject: Splitting spark dstream into separate fields
> ------------------------------
>
> Hi,
>
> Is there any optimum way of splitting a dstream into components?
>
> I am doing Spark streaming and this is the dstream I get:
>
> val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder,
> StringDecoder](ssc, kafkaParams, topics)
>
> Now that dstream consists of 1,000 price lines per second like below:
>
> ID, TIMESTAMP, PRICE
> 31,20160426-080924,93.53608929178084896656
>
> The columns are separated by commas.
>
> Now a couple of questions:
>
> val lines = dstream.map(_._2)
>
> This maps the record into components? Is that the correct understanding of it?
>
> The following splits the line into comma separated fields:
>
> val words = lines.map(_.split(',').view(2))
>
> I am interested in column three, so view(2) returns the value.
>
> I have also seen other ways like
>
> val words = lines.map(_.split(',').map(line => (line(0), (line(1), line(2) ...
>
> line(0), line(1) refer to the position of the fields?
>
> Which one is the adopted one or the correct one?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
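
For reference, a minimal end-to-end sketch of the pipeline discussed in this thread (assuming Spark 1.6.x with the Kafka 0.8 direct-stream connector; the object name PriceStream and the one-second batch interval are illustrative, not from the thread):

import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PriceStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PriceStream")
    val ssc  = new StreamingContext(conf, Seconds(1))

    val kafkaParams = Map("metadata.broker.list" -> "rhes564:9092")
    val topics      = Set("newtopic")

    // each record arrives as a (key, value) tuple, as produced into Kafka
    val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // keep only the value part, i.e. the "ID, TIMESTAMP, PRICE" line
    val lines = dstream.map(_._2)

    // split on commas; indices are 0-based, so (2) is the PRICE column
    val prices = lines.map(_.split(',')(2).trim)

    prices.print()

    ssc.start()
    ssc.awaitTermination()
  }
}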