…my 0.1 cent ☺
As a Spark and SO user, I would not find a separate SE a good thing.
* Part of SO's beauty is that you can easily filter and track different topics
from one dashboard.
* Being part of SO also gives good exposure, as it raises awareness of Spark
across a wider audience.
* High rep
My two cents (As a user/consumer)…
I have been following and using Spark in financial services since before
version 1, and before it migrated its questions from Google Groups to the
Apache mailing lists (which was a shame ☹).
SO:
There has been some momentum lately on SO, but as questions were not
“monitored/
I was aiming to show the operations with pseudo-code, but I apparently failed,
so Java it is ☺
Assume the following 3 datasets on HDFS:
1. RDD1: User (1 million rows – 2 GB). Columns: uid, locationId, (extra stuff)
2. RDD2: Actions (1 billion rows – 500 GB). Columns: uid_1, uid_2 (extra stuff)
One example pattern we have is doing joins or filters based on two datasets.
E.g.
1. Filter RddB, multiple times, for a given set extracted from RddA (the
keyword here is multiple times):
   a. RddA -> keyBy -> distinct -> collect() to Set A;
   b. RddB -> Filter using Set A;
2. “Join
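Step 1 above can be sketched in plain Java (no Spark dependency — the two in-memory lists and their values are made-up stand-ins for RDD1/RddA and RDD2/RddB, and the `long[]` rows stand in for the real columns):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class FilterBySet {
    // Step a: keyBy(uid) -> distinct -> collect() to Set A.
    static Set<Long> collectKeys(List<long[]> users) {
        return users.stream().map(u -> u[0]).collect(Collectors.toSet());
    }

    // Step b: filter the big dataset, keeping rows whose uid_1 appears in Set A.
    static List<long[]> filterActions(List<long[]> actions, Set<Long> setA) {
        return actions.stream()
                .filter(a -> setA.contains(a[0]))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up rows: users are (uid, locationId), actions are (uid_1, uid_2).
        List<long[]> users = Arrays.asList(
                new long[]{1, 10}, new long[]{2, 20}, new long[]{3, 10});
        List<long[]> actions = Arrays.asList(
                new long[]{1, 2}, new long[]{4, 5}, new long[]{3, 1});

        List<long[]> kept = filterActions(actions, collectKeys(users));
        kept.forEach(a -> System.out.println(a[0] + "," + a[1])); // prints 1,2 and 3,1
    }
}
```

In real Spark the Set would be collected to the driver and (ideally) broadcast before being used inside the filter closure, so it is shipped to each executor once rather than with every task.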
Hi,
It is a common pattern to process an RDD, collect (typically a subset) to the
driver and then broadcast back.
Adding an RDD method that can do that using the torrent broadcast mechanics
would be much more efficient. In addition, it would not require the Driver to
also utilize its Heap to hold
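The round trip being described can be sketched in plain Java (no Spark — the lists of lists stand in for an RDD's partitions on the executors, and the "values greater than 1" predicate is a made-up example of the subset being collected):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class CollectThenBroadcast {
    // collect(): pull a distinct subset of the data back to the driver.
    // The predicate (v > 1) is an arbitrary stand-in for the real subset logic.
    static Set<Long> collectSubset(List<List<Long>> partitions) {
        return partitions.stream()
                .flatMap(List::stream)
                .filter(v -> v > 1)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // Two lists stand in for the partitions of an RDD on the executors.
        List<List<Long>> partitions = Arrays.asList(
                Arrays.asList(1L, 2L, 2L),
                Arrays.asList(2L, 3L));

        // Today the collected subset must sit in the driver's heap before it
        // can be broadcast back; an RDD-level method built on the torrent
        // broadcast mechanics would ship it executor-to-executor instead.
        Set<Long> driverCopy = collectSubset(partitions);

        // "Broadcast back": every partition filters itself against the set.
        for (List<Long> part : partitions) {
            System.out.println(part.stream()
                    .filter(driverCopy::contains)
                    .collect(Collectors.toList()));
        }
    }
}
```

The sketch makes the cost visible: `driverCopy` is exactly the extra driver-heap copy that an RDD-side broadcast method would avoid.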
It would be nice to have a code-configurable fall-back plan for such cases. Any
generalized solution can cause problems elsewhere.
Simply replicating hot cached blocks would be complicated to maintain and could
cause OOME. In the case I described on the JIRA, the hot partition will be
changing