imho, you'll need to implement a custom RDD with your own locality settings (i.e. a custom implementation of discovering where each partition is located), plus tune spark.locality.wait. Something like the sketch below:
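A minimal sketch, assuming each organization's server is reachable by hostname and that you have some readLocal function for reading the data resident on a host (ServerLocalRDD, hosts and readLocal are hypothetical names, not anything from Spark itself). getPreferredLocations is how the scheduler learns where each partition lives:

import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per remote server.
case class ServerPartition(index: Int, host: String) extends Partition

// Custom RDD: one partition per organization's server, pinned to that
// server so the computation is scheduled where the data lives.
class ServerLocalRDD[T: ClassTag](
    sc: SparkContext,
    hosts: Seq[String],                // e.g. one hostname per organization
    readLocal: String => Iterator[T])  // reads the data resident on a host
  extends RDD[T](sc, Nil) {

  override def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => ServerPartition(i, h): Partition }.toArray

  // Tells the scheduler which host each partition prefers to run on.
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[ServerPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    readLocal(split.asInstanceOf[ServerPartition].host)
}

Note that preferred locations are only *preferences*: set spark.locality.wait very high (e.g. --conf spark.locality.wait=1h) so the scheduler waits for the right host rather than falling back and running a partition somewhere else.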
On 24 August 2016 at 03:48, Mohit Jaggi <mohitja...@gmail.com> wrote:

> It is a bit hacky but possible. A lot depends on what kind of queries etc.
> you want to run. You could write a data source that reads your data and
> keeps it partitioned the way you want, then use mapPartitions() to execute
> your code…
>
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
>
> On Aug 22, 2016, at 7:59 AM, Larry White <ljw1...@gmail.com> wrote:
>
> Hi,
>
> I have a bit of an unusual use case and would *greatly* *appreciate* some
> feedback as to whether it is a good fit for Spark.
>
> I have a network of compute/data servers configured as a tree as shown
> below:
>
> - controller
>   - server 1
>   - server 2
>   - server 3
>   - etc.
>
> There are ~20 servers, but the number is increasing to ~100.
>
> Each server contains a different dataset, all in the same format. Each is
> hosted by a different organization, and the data on every individual
> server is unique to that organization.
>
> Data *cannot* be replicated across servers using RDDs or any other means,
> for privacy/ownership reasons.
>
> Data *cannot* be retrieved to the controller, except in aggregate form,
> as the result of a query, for example.
>
> Because of this, there are currently no operations that treat the data as
> if it were a single dataset: we could run a classifier on each site
> individually, but cannot, for legal reasons, pull all the data into a
> single *physical* dataframe to run the classifier on all of it together.
>
> The servers are located across a wide geographic region (1,000s of miles).
>
> We would like to send jobs from the controller to be executed in parallel
> on all the servers, and retrieve the results to the controller. The jobs
> would consist of SQL-heavy Java code for 'production' queries, and Python
> or R code for ad hoc queries and predictive modeling.
>
> Spark seems to have the capability to meet many of the individual
> requirements, but is it a reasonable platform overall for building this
> application?
>
> Thank you very much for your assistance.
>
> Larry
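fwiw, Mohit's data-source + mapPartitions() suggestion would look roughly like this on top of the ServerLocalRDD sketch above (spark-shell style; the host names and stub reader are placeholders, and the count stands in for whatever per-site aggregate you'd actually compute):

// Assumes a SparkContext `sc` launched with a high spark.locality.wait,
// e.g. spark-shell --conf spark.locality.wait=1h
val hosts = Seq("server1.example.org", "server2.example.org")   // placeholders
def readLocal(host: String): Iterator[String] = Iterator.empty  // stub reader

val rdd = new ServerLocalRDD[String](sc, hosts, readLocal)

// mapPartitions runs once per server partition; the closure sees only
// that server's rows and emits a single aggregate, so only the summary
// (here a row count) travels back to the controller via collect().
val perSiteCounts: Array[Int] =
  rdd.mapPartitions(rows => Iterator(rows.size)).collect()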