Hi Jun, 

I was wondering if there was something out there already. GPFS appears to the 
OS as a local filesystem, so if there were a consumer that dumped to the local 
filesystem, we'd be golden. 
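For what it's worth, a consumer like that could be quite small. Below is a rough sketch, assuming the third-party kafka-python client; the topic name, broker address, and the GPFS mount path are all placeholders, not anything from our setup:

```python
# Hypothetical sketch: drain a Kafka topic to files on a locally
# mounted filesystem (GPFS or NFS). Assumes the kafka-python
# package; topic, broker, and output path are placeholders.
import os


def write_message(out_dir, offset, value):
    """Write one raw message payload to <out_dir>/<offset>.bin."""
    path = os.path.join(out_dir, "%d.bin" % offset)
    with open(path, "wb") as f:
        f.write(value)  # value is the raw bytes of the Kafka message
    return path


def consume_to_disk(out_dir, topic="microscope-frames",
                    brokers="broker:9092"):
    # Imported lazily so write_message is usable without a broker.
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(topic, bootstrap_servers=brokers,
                             group_id="disk-writer")
    for msg in consumer:  # blocks, polling the broker for new messages
        write_message(out_dir, msg.offset, msg.value)


if __name__ == "__main__":
    consume_to_disk("/gpfs/scratch/kafka-dump")  # hypothetical mount
```

Since GPFS (or an NFS mount) looks like a local path to the OS, the write side is just ordinary file I/O; running several copies with the same group_id would spread partitions across them for parallelism.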

Thanks,
--Ken

On May 16, 2014, at 7:04 PM, Jun Rao <jun...@gmail.com> wrote:

> You would probably have to write a consumer app to dump data in binary form
> to GPFS or NFS, since the HDFS API is quite specialized.
> 
> Thanks,
> 
> Jun
> 
> 
> On Fri, May 16, 2014 at 8:17 AM, Carlile, Ken 
> <carli...@janelia.hhmi.org>wrote:
> 
>> Hi all,
>> 
>> Sorry for the possible repost--hadn't seen this in the list after 18 hours
>> and figured I'd try again....
>> 
>> We are experimenting with using Kafka as a midpoint between microscopes and
>> a Spark cluster for data analysis. Our microscopes almost universally use
>> Windows machines for acquisition (as do most scientific instruments), and
>> our compute cluster (which runs Spark among many other things) runs Linux.
>> We use Isilon for file storage primarily, although we also have a GPFS
>> cluster for HPC.
>> 
>> We have a working http post system going into Kafka from the Windows
>> acquisition machine, which performs faster and more reliably than an
>> SMB connection to the Isilon or GPFS clusters. Unfortunately, the Spark
>> streaming consumer is much slower than reading from disk (Isilon or GPFS)
>> on the Spark cluster.
>> 
>> My proposal would be to not only improve the Spark streaming, but also to
>> have a consumer (or multiple consumers!) that writes to disk, either over
>> NFS or "locally" via a GPFS client.
>> 
>> As I am a systems engineer, I'm not equipped to write this, so I'm
>> wondering if anyone has done this sort of thing with Kafka before. I know
>> there are HDFS consumers out there, and our Isilons can do HDFS, but the
>> implementation on the Isilon is very limited at this time, and the ability
>> to write to local filesystem or NFS would give us much more flexibility.
>> 
>> Ideally, I would like to be able to use Kafka as a high speed transfer
>> point between acquisition instruments (usually running Windows) and several
>> kinds of storage, so that we could write virtually simultaneously to
>> archive storage for the raw data and to HPC scratch for data analysis,
>> thereby limiting the penalty incurred from data movement between storage
>> tiers.
>> 
>> Thanks for any input you have,
>> 
>> --Ken
