IMHO, you'll need to implement a custom RDD with your own locality settings
(i.e. a custom implementation of discovering where each partition is
located), plus a suitable setting for spark.locality.wait.
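
A minimal sketch of such an RDD (Scala; ServerLocalRDD, ServerPartition and
readLocalRows are illustrative names I made up, not an existing API):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per remote server.
case class ServerPartition(index: Int, host: String) extends Partition

class ServerLocalRDD(sc: SparkContext, hosts: Seq[String])
    extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex
      .map { case (h, i) => ServerPartition(i, h): Partition }
      .toArray

  // This is the locality hook: tell the scheduler which host owns
  // each partition so the task for it is placed there.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[ServerPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    readLocalRows(split.asInstanceOf[ServerPartition].host)

  // Hypothetical: however a server exposes its own rows to a local executor.
  private def readLocalRows(host: String): Iterator[String] = ???
}

Pair it with something like --conf spark.locality.wait=30s so the scheduler
waits for the preferred host rather than quickly falling back to any executor.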

On 24 August 2016 at 03:48, Mohit Jaggi <mohitja...@gmail.com> wrote:

> It is a bit hacky but possible. A lot depends on what kinds of queries,
> etc., you want to run. You could write a data source that reads your data
> and keeps it partitioned the way you want, then use mapPartitions() to
> execute your code…
>
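A minimal sketch of that mapPartitions() idea, reusing the ServerLocalRDD
sketched above (runClassifier is a hypothetical function that reduces one
site's rows to an aggregate summary; only that summary leaves the server):

val rdd = new ServerLocalRDD(sc, hosts)
val perSiteSummaries = rdd.mapPartitions { rows =>
  // Runs on the executor co-located with this server's data; the raw rows
  // stay put, and only the aggregate below is shipped to the controller.
  Iterator(runClassifier(rows))
}.collect() // one summary per server, gathered at the driver

This also matches the "aggregate-only" constraint described below: collect()
moves only the per-site summaries, never the underlying rows.
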
>
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
>
>
>
>
> On Aug 22, 2016, at 7:59 AM, Larry White <ljw1...@gmail.com> wrote:
>
> Hi,
>
> I have a bit of an unusual use-case and would *greatly* *appreciate* some
> feedback as to whether it is a good fit for Spark.
>
> I have a network of compute/data servers configured as a tree as shown
> below
>
>    - controller
>    - server 1
>       - server 2
>       - server 3
>       - etc.
>
> There are ~20 servers, but the number is increasing to ~100.
>
> Each server contains a different dataset, all in the same format. Each is
> hosted by a different organization, and the data on every individual server
> is unique to that organization.
>
> Data *cannot* be replicated across servers using RDDs or any other means,
> for privacy/ownership reasons.
>
> Data *cannot* be retrieved to the controller, except in aggregate form,
> as the result of a query, for example.
>
> Because of this, there are currently no operations that treat the data as
> if it were a single data set: we could run a classifier on each site
> individually, but cannot, for legal reasons, pull all the data into a
> single *physical* dataframe to run the classifier on all of it together.
>
> The servers are located across a wide geographic region (1,000s of miles).
>
> We would like to send jobs from the controller to be executed in parallel
> on all the servers, and retrieve the results to the controller. The jobs
> would consist of SQL-heavy Java code for 'production' queries, and Python
> or R code for ad-hoc queries and predictive modeling.
>
> Spark seems to have the capability to meet many of the individual
> requirements, but is it a reasonable platform overall for building this
> application?
>
> Thank you very much for your assistance.
>
> Larry
>