Hi all,

Has anyone tried using the Arrow Dataset API in a distributed system? E.g. 
create scan tasks on machine 1, then send these tasks to machines 2, 3 and 4 
and execute them there.

So far I think a possible workaround is the following (a rough code sketch 
follows the list):

1. Create the Dataset on machine 1;
2. Call Scan() and collect all scan tasks from the scan task iterator;
3. Say we have 5 tasks numbered 1, 2, 3, 4, 5, and we decide to run tasks 1, 2 
on machine 2 and tasks 3, 4, 5 on machine 3;
4. Send the target task numbers to machines 2 and 3 respectively;
5. Create a Dataset with the same configuration on machines 2 and 3, and call 
Scan() to create the same 5 tasks on each machine;
6. On machine 2, run tasks 1, 2 and skip 3, 4, 5;
7. On machine 3, skip tasks 1, 2 and run 3, 4, 5.
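
To make steps 5-7 concrete, here is a rough sketch of what the selection loop 
could look like on each worker. This is only an illustration of the workaround 
above, not a proposal for the API: building the Dataset from the shared 
configuration is omitted, and the exact Scanner/ScanTask calls and iterator 
usage may differ from what the current headers expose.

```cpp
// Sketch only -- assumes the dataset was already built from the same
// configuration as on machine 1; actual Arrow dataset API calls may differ.
#include <memory>
#include <set>

#include <arrow/dataset/api.h>
#include <arrow/result.h>
#include <arrow/status.h>

namespace ds = arrow::dataset;

// Run only the scan tasks whose position in the task iterator is in
// `assigned`, skipping the rest (steps 5-7 of the workaround above).
arrow::Status RunAssignedTasks(const std::shared_ptr<ds::Dataset>& dataset,
                               const std::set<int>& assigned) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  ARROW_ASSIGN_OR_RAISE(auto tasks, scanner->Scan());

  int index = 0;
  for (auto maybe_task : tasks) {
    ARROW_ASSIGN_OR_RAISE(auto task, maybe_task);
    if (assigned.count(index++) == 0) {
      continue;  // this task number belongs to another machine
    }
    ARROW_ASSIGN_OR_RAISE(auto batches, task->Execute());
    for (auto maybe_batch : batches) {
      ARROW_ASSIGN_OR_RAISE(auto batch, maybe_batch);
      // ... hand the record batch to the local consumer ...
    }
  }
  return arrow::Status::OK();
}
```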

This works correctly only if we assume that `Dataset::Scan()` returns exactly 
the same task iterator on every machine. I'm also not sure how much unnecessary 
overhead this introduces; after all, we'll run the scan method N times for N 
machines.

A better solution I can think of is to make scan tasks serializable so we 
could distribute them to the machines directly. Currently they don't seem to be 
designed that way, since we allow contextual state to be used when creating the 
tasks, e.g. the open readers in ParquetScanTask [1]. A built-in ser/de mechanism 
would also be needed, so either way a fair amount of work has to be done.
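
One possible shape for that (purely a hypothetical sketch, nothing like this 
exists in Arrow today) would be to serialize a small, self-contained 
description of each task instead of the task object itself, e.g. the file path 
and row groups behind a Parquet scan task, and rebuild the readers on the 
receiving side:

```cpp
// Hypothetical task descriptor -- not an existing Arrow type. The idea is to
// ship only plain data (no open readers): the remote machine re-opens the file
// and reconstructs the equivalent scan task locally.
#include <string>
#include <vector>

struct SerializableScanTaskDesc {
  std::string file_path;             // data location, e.g. on shared storage
  std::vector<int> row_groups;       // Parquet row groups covered by this task
  std::vector<std::string> columns;  // projected columns
  std::string filter;                // serialized filter, format to be decided
};

// Machine 1 turns each scan task into a descriptor and sends it over the wire
// (the actual ser/de could be JSON, Protobuf, Flatbuffers, ...). Machines
// 2..N open `file_path`, restrict to `row_groups`, apply projection/filter and
// execute -- no need to re-run Scan() over the whole dataset or to assume
// identical task ordering across machines.
```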

So far I am not sure which approach is more reasonable, or whether there is a 
better one than both. If you have any thoughts, please let me know.

Best,
Hongze

[1] 
https://github.com/apache/arrow/blob/c09a82a388e79ddf4377f44ecfe515604f147270/cpp/src/arrow/dataset/file_parquet.cc#L52-L74
