Custom positioning/partitioning of DataFrames

2016-06-03 Thread Nilesh Chakraborty
Hi, I have a domain-specific schema (RDF data with vertical partitioning, i.e. one table per property) and I want to instruct SparkSQL to keep semantically closer property tables closer together - that is, to group DataFrames together onto nodes (or at least encourage it somehow) so that tabl
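The co-location idea in the question can be sketched outside Spark: partition on a shared semantic group key rather than on the table name, so all property tables in a group land together. This is a minimal plain-Python illustration; the group names and property tables below are hypothetical examples, not from the original thread.

```python
def assign_partitions(table_groups, num_nodes):
    """Map each property table to a node, hashing the semantic *group*
    (not the table name) so tables in one group share a node."""
    assignment = {}
    for group, tables in table_groups.items():
        node = hash(group) % num_nodes  # stable within one process
        for table in tables:
            assignment[table] = node
    return assignment

# Hypothetical RDF property tables grouped by the entity they describe.
table_groups = {
    "person": ["foaf_name", "foaf_knows", "foaf_mbox"],
    "document": ["dc_title", "dc_creator"],
}
assignment = assign_partitions(table_groups, num_nodes=4)

# All "person" property tables land on the same node:
assert assignment["foaf_name"] == assignment["foaf_knows"] == assignment["foaf_mbox"]
```

In Spark terms, the analogous move would be partitioning each table's rows by the same group-derived key so the scheduler tends to co-locate them, though Spark gives no hard placement guarantee.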

Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
Hello! Spark Streaming supports HDFS as an input source, as well as Akka actor receivers and TCP socket receivers. For my use case I think it's probably more convenient to read the data directly from Actors, because I already need to set up a multi-node Akka cluster (on the same nodes that Spark runs

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
Hey Michael, Thanks for the great reply! That clears things up a lot. The idea about Apache Kafka sounds very interesting; I'll look into it. The multiple consumers and fault tolerance sound awesome. That's probably what I need. Cheers, Nilesh

Accumulable with huge accumulated value?

2014-06-14 Thread Nilesh Chakraborty
Hey all! I have an iterative problem: I'm trying to find something similar to Hadoop's MultipleOutputs [1] in Spark 1.0. I need to build up a couple of large dense vectors (which may contain billions of elements - 2 billion doubles => at least 16 GB) by adding partial vector chunks to them. This can be
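The accumulation pattern described above can be sketched in plain Python: partial chunks, each an (offset, values) slice, are summed into one large dense vector. The `add_chunk` helper and the sample data are illustrative assumptions, not code from the thread; in Spark this merge function could back a custom Accumulable, though at billions of doubles an aggregation over chunk RDDs is likely more practical than a driver-side accumulator.

```python
def add_chunk(dense, chunk):
    """Add a partial chunk (offset, list of values) into the dense vector."""
    offset, values = chunk
    for i, v in enumerate(values):
        dense[offset + i] += v
    return dense

size = 10
dense = [0.0] * size
chunks = [(0, [1.0, 2.0]), (2, [3.0]), (0, [0.5, 0.5])]
for c in chunks:
    dense = add_chunk(dense, c)

assert dense[:3] == [1.5, 2.5, 3.0]
```

Because the merge is associative and commutative, the chunks can be combined in any order - the property an accumulator (or a `reduce`) needs.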

Alternative to checkpointing and materialization for truncating lineage in high-iteration jobs

2014-06-28 Thread Nilesh Chakraborty
Hello, In a thread about "java.lang.StackOverflowError when calling count()" [1] I saw Tathagata Das share an interesting approach to truncating RDD lineage - this helps prevent StackOverflowErrors in high-iteration jobs while avoiding the disk-writing performance penalty of checkpointing. Here's an excerpt from
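The excerpt is cut off above, so the specific approach isn't shown here; but the underlying problem and the general fix can be illustrated outside Spark. Each lazy transformation wraps the previous one, so evaluating a long chain recurses once per step - just as a long RDD lineage does at job time - and periodically materializing to concrete data restarts the chain from scratch. This is a plain-Python sketch of that idea, not Tathagata Das's actual code.

```python
import sys

def make_lineage(n, base):
    """Build a lazy chain of n '+1' transformations over base,
    each one a closure wrapping the previous (like RDD lineage)."""
    compute = lambda: base
    for _ in range(n):
        prev = compute
        compute = lambda prev=prev: prev() + 1  # one lineage step
    return compute

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(200)  # artificially low, to make the failure visible

# A 500-step chain recurses 500 deep at evaluation time and overflows:
try:
    make_lineage(500, 0)()
    overflowed = False
except RecursionError:
    overflowed = True
assert overflowed

# "Truncation": materialize every k steps, restarting from concrete data
# (analogous to checkpointing / writing the RDD out and re-reading it).
value, k = 0, 50
for _ in range(500 // k):
    value = make_lineage(k, value)()  # short chain, then materialize
assert value == 500

sys.setrecursionlimit(old_limit)
```

The trade-off mirrored here is the one the thread is about: materializing more often keeps the chain short but costs a full evaluation (in Spark, potentially a disk write) each time.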