This article <http://www.virdata.com/tuning-spark/> gives you a pretty good start on the Spark Streaming side. This one <https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines> covers the Kafka side; it has a nice explanation of how message size and partition count affect throughput. And this article <https://www.sigmoid.com/creating-sigview-a-real-time-analytics-dashboard/> describes a similar use case.
Thanks
Best Regards

On Tue, May 12, 2015 at 8:25 PM, dgoldenberg <dgoldenberg...@gmail.com> wrote:
> Hi,
>
> I'm looking at a data ingestion implementation which streams data out of
> Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
> process the data in each partition. Have folks looked at ways of speeding
> up this type of ingestion?
>
> Let's say the main part of the ingest process is fetching documents from
> somewhere and performing text extraction on them. Is this type of
> processing best done by expressing the pipelining with Spark RDD
> transformations, or by just kicking off a multi-threaded pipeline?
>
> Or is using a multi-threaded pipeliner per partition a decent strategy,
> with the performance coming from running in clustered mode?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-speed-up-data-ingestion-with-Spark-tp22859.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
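To make the per-partition idea concrete, here is a minimal sketch (plain Python, no Spark dependency) of a multi-threaded fetch-and-extract pipeline of the kind you could invoke from foreachPartition. The fetch_document and extract_text functions are hypothetical stand-ins for your own fetching and text-extraction logic, not anything from a real library:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_document(doc_id):
    # Hypothetical stand-in: fetch the raw document from somewhere
    # (HTTP, S3, a document store, ...). Real fetches are I/O-bound.
    return "raw-bytes-for-%s" % doc_id

def extract_text(raw):
    # Hypothetical stand-in for the text-extraction step.
    return raw.upper()

def process_partition(doc_ids, num_threads=4):
    """Run a multi-threaded fetch+extract pipeline over one partition.

    Because fetching is I/O-bound, a thread pool inside the partition
    keeps the executor core busy while requests are in flight, instead
    of processing documents one at a time.
    """
    def pipeline(doc_id):
        return extract_text(fetch_document(doc_id))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(pipeline, doc_ids))

results = process_partition(["a", "b", "c"])
print(results)
```

In a streaming job you would wire this up roughly as `stream.foreachRDD(lambda rdd: rdd.foreachPartition(process_partition))`, so cluster-level parallelism comes from the Kafka/RDD partitions while thread-level parallelism hides the per-document fetch latency inside each partition.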