Hi all, 

We are experimenting with using Kafka as a midpoint between microscopes and a
Spark cluster for data analysis. Our microscopes almost universally use Windows 
machines for acquisition (as do most scientific instruments), and our compute 
cluster (which runs Spark among many other things) runs Linux. We use Isilon 
for file storage primarily, although we also have a GPFS cluster for HPC. 

We have a working HTTP POST pipeline going into Kafka from the Windows
acquisition machine, and it is performing faster and more reliably than an SMB
connection to the Isilon or GPFS clusters. Unfortunately, the Spark streaming
consumer is much slower than reading from disk (Isilon or GPFS) on the Spark
cluster.
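For anyone picturing the consuming side, it's more or less the standard
Spark-Kafka integration; a rough sketch of that kind of job is below (the
broker address, topic name, and paths are made up for illustration, and our
actual job may differ):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    // Requires the spark-sql-kafka integration package on the classpath.
    public class FrameStream {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("frame-stream").getOrCreate();

        // Stream raw bytes from the Kafka topic ("microscope-frames" is a
        // placeholder name, as is the broker address).
        Dataset<Row> frames = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka01:9092")
            .option("subscribe", "microscope-frames")
            .load();

        // Each row carries key/value as binary plus topic/partition/offset.
        StreamingQuery q = frames.selectExpr("CAST(key AS STRING)", "value")
            .writeStream()
            .format("parquet")
            .option("path", "/mnt/gpfs/scratch/frames")
            .option("checkpointLocation", "/mnt/gpfs/scratch/frames-ckpt")
            .start();

        q.awaitTermination();
      }
    }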

My proposal is not only to improve the Spark streaming consumer, but also to
add a consumer (or multiple consumers!) that writes to disk, either over NFS
or "locally" via a GPFS client.

As a systems engineer, I'm not really equipped to write this myself, so I'm
wondering if anyone has done this sort of thing with Kafka before. I know
there are HDFS consumers out there, and our Isilons can do HDFS, but the HDFS
implementation on the Isilon is very limited at this time, and the ability to
write to a local filesystem or NFS would give us much more flexibility.
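One thing I did notice along those lines: Kafka Connect ships with a
FileStreamSink connector that appends a topic to a path on the local
filesystem, which could just as well be an NFS or GPFS mount. It writes record
values out as lines of text, so it's probably not right for binary image data,
but as a proof of concept the standalone config would be roughly (topic and
path are placeholders):

    name=isilon-file-sink
    connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
    tasks.max=1
    topics=microscope-frames
    file=/mnt/isilon/raw/microscope-frames.out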

Ideally, I would like to be able to use Kafka as a high-speed transfer point
between acquisition instruments (usually running Windows) and several kinds of 
storage, so that we could write virtually simultaneously to archive storage for 
the raw data and to HPC scratch for data analysis, thereby limiting the penalty 
incurred from data movement between storage tiers. 
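As I understand it, Kafka's consumer groups give exactly that fan-out: each
group gets its own complete copy of the topic, so running the file-writing
sketch above twice with different group ids and mount points, e.g.

    java FileSinkConsumer archive-writers /mnt/isilon/raw
    java FileSinkConsumer scratch-writers /mnt/gpfs/scratch

would write the same stream to both tiers independently (again, the names and
paths are just illustrations).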

Thanks for any input you have,

--Ken
