Re: Spark random forest - string data

2015-01-16 Thread Andy Twigg
Hi Asaf,

featurestream [1] is an internal project I'm playing with that includes
support for some of this, in particular:

* 1-pass random forest construction
* schema inference
* native support for text fields

Would this be of interest? It's not open source, but if there's sufficient
demand I can...

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
> ...s a simpler (and perhaps more efficient) approach.
>
> Keith
>
> On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg wrote:
>> Could you modify your function so that it streams through the files
>> record by record and outputs them to hdfs, then read them all in as
>> RDDs a...
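A minimal sketch of the approach quoted above, assuming Spark 1.x;
readRecords is a hypothetical stand-in for the proprietary-format reader,
and the HDFS paths are invented for illustration. The point is that
flatMap over a lazy Iterator streams records through each task without
ever materialising a whole file in memory.

    import org.apache.spark.{SparkConf, SparkContext}

    object StreamFilesToRdd {
      // Hypothetical reader for the proprietary format: it must return a
      // lazy Iterator so only one record is materialised at a time.
      def readRecords(path: String): Iterator[String] = ???

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("stream-files"))
        val paths = Seq("hdfs:///in/part-0.bin", "hdfs:///in/part-1.bin")

        // One task per file; each task streams its file's records out.
        val records = sc.parallelize(paths, paths.size).flatMap(readRecords)

        // Write the converted records to HDFS once...
        records.saveAsTextFile("hdfs:///out/records")

        // ...then read them back whenever needed as an ordinary RDD.
        val rdd = sc.textFile("hdfs:///out/records")
        println(rdd.count())
        sc.stop()
      }
    }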

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
> ...y format. The API allows reading out a single record at a time, but
> I'm not sure how to get those records into Spark (without reading
> everything into memory from a single file at once).
>
> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg wrote:
>> file...

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
>> file => transform file into a bunch of records

What does this function do exactly? Does it load the file locally? Spark
supports RDDs exceeding global RAM (cf. the terasort example), but if your
function just loads each file locally, then this may cause problems.
Instead, you should load each fi...

Re: hdfs streaming context

2014-12-01 Thread Andy Twigg
Have you tried just passing a path to ssc.textFileStream()? It monitors
the path for new files by looking at mtime/atime; all new/touched files
in the time window appear as an RDD in the DStream.

On 1 December 2014 at 14:41, Benjamin Cuthbert wrote:
> All,
>
> Is it possible to stream on HDFS d...
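A minimal sketch of this, assuming Spark Streaming 1.x; the directory and
batch interval are illustrative. Any file that newly appears (or is
touched) under the monitored path during a batch window becomes part of
that batch's RDD.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object HdfsDirStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("hdfs-dir-stream")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Each 10s batch yields an RDD of the lines of new/touched files.
        val lines = ssc.textFileStream("hdfs:///data/incoming")
        lines.foreachRDD { rdd =>
          println(s"records in this batch: ${rdd.count()}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }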

Re: How to broadcast a textFile?

2014-11-17 Thread Andy Twigg
Broadcast copies arbitrary objects, so you could read it into an object
such as an array of lines, then broadcast that.

Andy

On Monday, 17 November 2014, YaoPau wrote:
> I have a 1 million row file that I'd like to read from my edge node, and
> then send a copy of it to each Hadoop machine's memo...
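A minimal sketch of that pattern, assuming the file fits comfortably in
driver memory; the path and the downstream use are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastFile {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-file"))

        // Pull the ~1M lines back to the driver as a local array...
        val lines: Array[String] =
          sc.textFile("hdfs:///lookup/table.txt").collect()

        // ...and broadcast it: each executor receives one read-only copy,
        // instead of the data being re-shipped inside every task closure.
        val bc = sc.broadcast(lines)

        val hits = sc.parallelize(1 to 1000000)
          .filter(i => bc.value(i % bc.value.length).nonEmpty)
        println(hits.count())
        sc.stop()
      }
    }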

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-01 Thread Andy Twigg
...wrote:
>> If the tree is too big, build it on GraphX, but it will need thorough
>> analysis so that the partitions are well balanced...
>>
>> On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg wrote:
>>> Hi Boromir,
>>>
>>> Assum...

Re: Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Andy Twigg
Hi Boromir,

Assuming the tree fits in memory, and what you want to do is parallelize
the computation, the 'obvious' way is the following (a sketch is given
below):

* broadcast the tree T to each worker (ok since it fits in memory)
* construct an RDD for the deepest level - each element in the RDD is
  (parent, data_at_node)...
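A toy sketch of those two steps plus the upward sweep they imply, assuming
Spark 1.x; the tree (a child-to-parent Map), the node ids, and the use of
+ as the combiner are all invented for illustration. The real reducer just
needs to be associative.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD ops on Spark 1.x

    object TreeLevelReduce {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("tree-reduce"))

        // Toy tree as child -> parent, with node 0 as the root.
        val parentOf = Map(3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2, 1 -> 0, 2 -> 0)
        val bcParent = sc.broadcast(parentOf)

        // RDD for the deepest level: elements are (parent, data_at_node).
        var level = sc.parallelize(Seq(1 -> 10L, 1 -> 20L, 2 -> 30L, 2 -> 40L))

        // Combine siblings under each parent, then re-key by grandparent,
        // repeating until all data has been folded into the root.
        var root: Option[Long] = None
        while (root.isEmpty) {
          val reduced = level.reduceByKey(_ + _)
          if (reduced.count() == 1 && reduced.first()._1 == 0)
            root = Some(reduced.first()._2)
          else
            level = reduced.map { case (node, v) => (bcParent.value(node), v) }
        }
        println(s"value at root: ${root.get}")
        sc.stop()
      }
    }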