Thanks Praveen.

With regard to the key/value pair: my Kafka producer takes the following rows as input:
cat ${IN_FILE} | ${KAFKA_HOME}/bin/kafka-console-producer.sh --broker-list rhes564:9092 --topic newtopic

That ${IN_FILE} is the source of prices (1,000 rows), as follows:

ID, TIMESTAMP, PRICE
40, 20160426-080924, 67.55738301621814598514

So the tuples would be like below?

(1, "ID")
(2, "TIMESTAMP")
(3, "PRICE")

For the values:

val words1 = lines.map(_.split(',').view(1))
val words2 = lines.map(_.split(',').view(2))
val words3 = lines.map(_.split(',').view(3))

So words1 will return the value of ID, words2 the value of TIMESTAMP and words3 the value of PRICE? I assume this is an array that can be handled as elements of an array as well?

Regards

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 26 April 2016 at 11:11, Praveen Devarao <praveen...@in.ibm.com> wrote:

> Hi Mich,
>
> >>
> val lines = dstream.map(_._2)
>
> This maps the record into components? Is that the correct understanding of it?
> <<
>
> Not sure what you refer to when you say "record into components". The
> above function is basically giving you the tuple (key/value pair) that you
> would have inserted into Kafka. Say my Kafka producer puts data as
>
> 1 => "abc"
> 2 => "def"
>
> Then the above map would give you the tuples below
>
> (1, "abc")
> (2, "def")
>
> >>
> The following splits the line into comma separated fields.
>
> val words = lines.map(_.split(',').view(2))
> <<
>
> Right, basically the value portion of your Kafka data is being
> handled here.
>
> >>
> val words = lines.map(_.split(',').view(2))
>
> I am interested in column three, so view(2) returns the value.
>
> I have also seen other ways like
>
> val words = lines.map(_.split(',').map(line => (line(0), (line(1), line(2) ...
> <<
>
> The split operation returns an array of String [an immutable StringLike
> collection]. Calling the view method creates an IndexedSeqView over the
> iterable, whereas in the second way you are iterating through it and
> accessing the elements directly via their index position [line(0), line(1)].
> You would have to decide what is best for your use case based on whether
> evaluation should be lazy or immediate [see references below].
>
> References:
> http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqLike
> http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqView
>
> Thanking You
>
> ---------------------------------------------------------------------------------
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
> ---------------------------------------------------------------------------------
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
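To make the lazy-versus-immediate distinction above concrete, here is a minimal sketch against one sample row from the feed (Scala 2.10-style collections assumed; the value names line, fields, record, lazyFields and price are illustrative, not from the thread):

// one Kafka message value, i.e. what comes out of dstream.map(_._2)
val line = "40, 20160426-080924, 67.55738301621814598514"

// String.split returns an Array[String]; Scala arrays are indexed from 0
val fields = line.split(',')                // fields(0) = ID, fields(1) = TIMESTAMP, fields(2) = PRICE

// immediate (eager) access by index, as in the second style quoted above
val record = (fields(0).trim, (fields(1).trim, fields(2).trim))

// lazy access through a view: nothing is evaluated until an element is requested
val lazyFields = fields.view.map(_.trim)    // no trim has run yet
val price      = lazyFields(2)              // evaluates only this element -> "67.55738301621814598514"

Either way the split itself produces an ordinary Array[String], so the fields can indeed be handled as array elements, with index 0 being ID, 1 being TIMESTAMP and 2 being PRICE.
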
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> To: "user @spark" <user@spark.apache.org>
> Date: 26/04/2016 12:58 pm
> Subject: Splitting spark dstream into separate fields
> ------------------------------
>
> Hi,
>
> Is there any optimum way of splitting a dstream into components?
>
> I am doing Spark streaming and this is the dstream I get:
>
> val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder,
> StringDecoder](ssc, kafkaParams, topics)
>
> Now that dstream consists of 1,000 price lines per second like below:
>
> ID, TIMESTAMP, PRICE
> 31,20160426-080924,93.53608929178084896656
>
> The columns are separated by commas.
>
> Now a couple of questions:
>
> val lines = dstream.map(_._2)
>
> This maps the record into components? Is that the correct understanding of it?
>
> The following splits the line into comma separated fields:
>
> val words = lines.map(_.split(',').view(2))
>
> I am interested in column three, so view(2) returns the value.
>
> I have also seen other ways like
>
> val words = lines.map(_.split(',').map(line => (line(0), (line(1), line(2) ...
>
> line(0), line(1) refer to the position of the fields?
>
> Which one is the adopted one or the correct one?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
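
For reference, a minimal end-to-end sketch of the pipeline discussed in this thread (assuming Spark 1.6.x with the Kafka 0.8 direct-stream connector; the object name PriceStream and the one-second batch interval are illustrative, not from the thread):

import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PriceStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PriceStream")
    val ssc  = new StreamingContext(conf, Seconds(1))

    val kafkaParams = Map("metadata.broker.list" -> "rhes564:9092")
    val topics      = Set("newtopic")

    // each record arrives as a (key, value) tuple, as produced into Kafka
    val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // keep only the value part, i.e. the "ID, TIMESTAMP, PRICE" line
    val lines = dstream.map(_._2)

    // split on commas; indices are 0-based, so (2) is the PRICE column
    val prices = lines.map(_.split(',')(2).trim)

    prices.print()

    ssc.start()
    ssc.awaitTermination()
  }
}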