This article <http://www.virdata.com/tuning-spark/> gives you a pretty good
start on the Spark Streaming side. This article
<https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines>
covers the Kafka side; it has a nice explanation of how message size and
partition count affect throughput. And this article
<https://www.sigmoid.com/creating-sigview-a-real-time-analytics-dashboard/>
walks through a real-time analytics use case.
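
To make the "multi-threaded pipeline per partition" idea from your question concrete, here is a minimal sketch. In real Spark the body of process_partition would run inside rdd.foreachPartition(...) (or mapPartitions) on each executor; plain Python stands in here so it runs anywhere, and fetch_document / extract_text are hypothetical placeholders for your actual fetch and text-extraction steps:

```python
# Sketch: thread-pooled fetch + extract for the documents in one partition.
# In Spark Streaming this function would be passed to rdd.foreachPartition
# so each executor runs its own pool; here it is plain Python for illustration.
from concurrent.futures import ThreadPoolExecutor


def fetch_document(doc_id):
    # Placeholder: would fetch the raw document from an external store.
    return "raw-bytes-of-" + doc_id


def extract_text(raw):
    # Placeholder: would run real text extraction (e.g. Tika) on the bytes.
    return raw.upper()


def process_partition(doc_ids, workers=4):
    """Fetch and extract every document in one partition.

    The fetches are I/O-bound, so overlapping them with a thread pool
    finishes the partition faster than a sequential loop would.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        raws = list(pool.map(fetch_document, doc_ids))
    return [extract_text(r) for r in raws]


if __name__ == "__main__":
    print(process_partition(["a", "b", "c"]))
```

The point of the design is that parallelism comes at two levels: Spark spreads partitions across the cluster, and the thread pool overlaps the slow I/O within each partition.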

Thanks
Best Regards

On Tue, May 12, 2015 at 8:25 PM, dgoldenberg <dgoldenberg...@gmail.com>
wrote:

> Hi,
>
> I'm looking at a data ingestion implementation which streams data out of
> Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to
> process the data in each partition.  Have folks looked at ways of speeding
> up this type of ingestion?
>
> Let's say the main part of the ingest process is fetching documents from
> somewhere and performing text extraction on them. Is this type of processing
> best done by expressing the pipelining with Spark RDD transformations or by
> just kicking off a multi-threaded pipeline?
>
> Or is using a multi-threaded pipeliner per partition a decent strategy, with
> the performance gains coming from running in clustered mode?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-speed-up-data-ingestion-with-Spark-tp22859.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
