RE: Handling questions in the mailing lists

2016-11-24 Thread Ioannis.Deligiannis
…my 0.1 cent ☺ As a Spark and SO user, I would not find a separate SE a good thing. * Part of the SO beauty is that you can filter easily and track different topics from one dashboard. * Being part of SO also gets good exposure as it raises awareness of Spark across a wider audience. * High rep

RE: Handling questions in the mailing lists

2016-11-07 Thread Ioannis.Deligiannis
My two cents (as a user/consumer)… I have been following and using Spark in financial services since before version 1 and before it migrated questions from Google Groups to the Apache mailing lists (which was a shame ☹). SO: There has been some momentum lately on SO, but as questions were not “monitored/

RE: RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
I was aiming to show the operations with pseudo-code, but I apparently failed, so Java it is ☺ Assume the following 3 datasets on HDFS. 1. RDD1: User (1 million rows – 2 GB). Columns: uid, locationId, (extra stuff) 2. RDD2: Actions (1 billion rows – 500 GB). Columns: uid_1, uid_2 (extr

RE: RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
One example pattern we have is doing joins or filters based on two datasets. E.g. 1. Filter (multiple) RddB for a given set extracted from RddA (the keyword here is multiple times): a. RddA -> keyBy -> distinct -> collect() to Set A; b. RddB -> filter using Set A; 2. “Join
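
The filter pattern in item 1 above (collect a distinct key set from RddA, then reuse it to filter RddB several times) can be modelled without a cluster. The following is a minimal sketch only, not the code from the thread: plain Java lists stand in for the RDDs, and `KeySetFilterSketch`, `Row`, `collectKeys` and `filterByKeys` are hypothetical names introduced for illustration. In a real Spark job the collected set would presumably be wrapped in a broadcast variable before being used inside the filter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Local, Spark-free model of the "collect a key set, then filter" pattern.
// Plain lists stand in for RddA and RddB; every name here is hypothetical.
final class KeySetFilterSketch {

    // One record of either dataset: a key plus some payload.
    static final class Row {
        final long key;
        final String payload;

        Row(long key, String payload) {
            this.key = key;
            this.payload = payload;
        }
    }

    // Step a: RddA -> keyBy -> distinct -> collect() to Set A.
    // Adding to a HashSet performs the "distinct" part.
    static Set<Long> collectKeys(List<Row> rddA) {
        Set<Long> keys = new HashSet<>();
        for (Row r : rddA) {
            keys.add(r.key);
        }
        return keys;
    }

    // Step b: RddB -> filter using Set A.
    // Assumption: in Spark, setA would be read from a broadcast variable here.
    static List<Row> filterByKeys(List<Row> rddB, Set<Long> setA) {
        return rddB.stream()
                .filter(r -> setA.contains(r.key))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> rddA = Arrays.asList(new Row(1L, "a1"), new Row(1L, "a2"), new Row(3L, "a3"));
        List<Row> rddB1 = Arrays.asList(new Row(1L, "b1"), new Row(2L, "b2"));
        List<Row> rddB2 = Arrays.asList(new Row(3L, "c1"), new Row(4L, "c2"));

        // The key set is built once and reused for every RddB ("multiple times").
        Set<Long> setA = collectKeys(rddA);
        for (List<Row> rddB : Arrays.asList(rddB1, rddB2)) {
            System.out.println(filterByKeys(rddB, setA).size() + " row(s) kept");
        }
    }
}
```

The point of the sketch is that Set A is built once and reused across filters, which is why shipping it efficiently to the executors matters when the pattern is repeated.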

RDD.broadcast

2016-04-28 Thread Ioannis.Deligiannis
Hi, It is a common pattern to process an RDD, collect (typically a subset) to the driver and then broadcast back. Adding an RDD method that can do that using the torrent broadcast mechanics would be much more efficient. In addition, it would not require the Driver to also utilize its Heap hold

RE: Spark Scheduler creating Straggler Node

2016-03-09 Thread Ioannis.Deligiannis
It would be nice to have a code-configurable fall-back plan for such cases. Any generalized solution can cause problems elsewhere. Simply replicating hot cached blocks would be complicated to maintain and could cause OOME. In the case I described on the JIRA, the hot partition will be changing
