I don't think you can avoid examining each element of the RDD, if
that's what you mean. Your approach is basically the best you can do
in general. You're not making a second RDD here, and even if you did
this in two steps, the second RDD is really more of a bookkeeping
construct than a second huge data structure, since transformations
are lazy and Spark pipelines the map and filter into a single stage.
You can simplify your example a bit, although I doubt it's noticeably faster:
bigRdd.flatMap { i =>
  val h = md5(i)
  if (h(0) == 'A') {
    Some(h)
  } else {
    None
  }
}
This is also fine, simpler still, and if it's slower, not by much:
bigRdd.map(md5).filter(_(0) == 'A')
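If you want something you can run end to end, here's a minimal
self-contained sketch. The md5 helper and the DeriveSmallerRdd
wrapper are my own assumptions (your example doesn't show them); this
version hex-encodes the MD5 digest of each value's string form with
java.security.MessageDigest:

import java.security.MessageDigest

import org.apache.spark.{SparkConf, SparkContext}

object DeriveSmallerRdd {
  // Assumed md5 helper: uppercase-hex-encode the MD5 digest of the
  // value's string form.
  def md5(value: Any): String =
    MessageDigest.getInstance("MD5")
      .digest(value.toString.getBytes("UTF-8"))
      .map("%02X".format(_))
      .mkString

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("derive-smaller-rdd"))

    val bigRdd = sc.parallelize(1 to 1000000)

    // map and filter are both lazy and get pipelined in one pass per
    // partition, so no second full-size data set is materialized.
    val keepers = bigRdd.map(md5).filter(_(0) == 'A')

    println(keepers.count())  // the action that actually triggers the work
    sc.stop()
  }
}

Either way you write it, flatMap with Option or map followed by
filter, nothing happens until an action like count() or
saveAsTextFile() runs.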
On Thu, Dec 18, 2014 at 10:18 PM, bethesda <[email protected]> wrote:
> We have a very large RDD and I need to create a new RDD whose values are
> derived from each record of the original RDD, and we only retain the few new
> records that meet a criterion. I want to avoid creating a second large RDD
> and then filtering it since I believe this could tax system resources
> unnecessarily (tell me if that assumption is wrong.)
>
> So for example, and this is just an example, say we have an RDD with 1 to
> 1,000,000 and we iterate through each value, compute its md5 hash, and
> we only keep the results that start with 'A'.
>
> What we've tried seems to work, but it seemed a bit ugly and perhaps not
> efficient; the pseudocode is below. Is this the best way to do this?
>
> Thanks
>
> bigRdd.flatMap( { i =>
>   val h = md5(i)
>   if (h.substring(0, 1) == "A") {
>     Array(h)
>   } else {
>     Array[String]()
>   }
> })
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]