Hi Asaf,
featurestream [1] is an internal project I'm playing with that includes
support for some of this, in particular:
* 1-pass random forest construction
* schema inference
* native support for text fields
Would this be of interest? It's not open source, but if there's sufficient
demand I can look into opening it up.
> …s a simpler (and perhaps more
> efficient) approach.
>
> Keith
>
> On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg wrote:
>>
>> Could you modify your function so that it streams through the files record
>> by record and outputs them to HDFS, then read them all in as RDDs
>> afterwards?
> …y format. The API allows reading out a
> single record at a time, but I'm not sure how to get those records into
> Spark (without reading everything into memory from a single file at once).
>
> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg wrote:
>
>> file
>
> file => transform file into a bunch of records
What does this function do exactly? Does it load the file locally?
Spark supports RDDs exceeding global RAM (cf. the terasort example), but if
your example just loads each file locally, then this may cause problems.
Instead, you should load each file directly into an RDD.
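Something along these lines (just a sketch; readRecords is a hypothetical
iterator-based reader for your format, and the paths are made up):

import org.apache.spark.{SparkConf, SparkContext}

// Distribute the *paths*, not the file contents; each task then streams
// its own file record by record, so no single file is loaded wholesale.
def readRecords(path: String): Iterator[String] = ???  // your format's reader

val sc = new SparkContext(new SparkConf().setAppName("record-stream"))
val paths = Seq("hdfs:///data/f1.bin", "hdfs:///data/f2.bin")  // illustrative
val records = sc.parallelize(paths).flatMap(readRecords)
records.saveAsTextFile("hdfs:///data/records")  // then read back with sc.textFile(...)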
Have you tried just passing a path to ssc.textFileStream()? It
monitors the path for new files by looking at mtime/atime; all
new/touched files in the time window appear as an RDD in the DStream.
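For example (a minimal sketch; the directory and batch interval are made up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("hdfs-file-stream")
val ssc = new StreamingContext(conf, Seconds(30))    // 30s batches (illustrative)
val lines = ssc.textFileStream("hdfs:///incoming/")  // new files show up as RDDs
lines.foreachRDD(rdd => println(s"saw ${rdd.count()} new lines"))
ssc.start()
ssc.awaitTermination()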
On 1 December 2014 at 14:41, Benjamin Cuthbert wrote:
> All,
>
> Is it possible to stream on HDFS directories?
Broadcast copies arbitrary objects, so you could read it into an object
such as an array of lines and then broadcast that.
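Roughly like this (a sketch; the local path and the RDD it's used against are
made up, and an existing SparkContext `sc` is assumed):

import scala.io.Source

// Read the ~1M-line file once on the driver (your edge node)...
val lines: Array[String] = Source.fromFile("/path/on/edge/node.txt").getLines().toArray
// ...then ship one read-only copy to each worker's memory.
val bcLines = sc.broadcast(lines)
// Hypothetical use: tasks access bcLines.value locally, no reshipping.
val matches = someRdd.filter(x => bcLines.value.contains(x))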
Andy
On Monday, 17 November 2014, YaoPau wrote:
> I have a 1 million row file that I'd like to read from my edge node, and
> then send a copy of it to each Hadoop machine's memory
>> If the tree is too big, build it on graphx, but it will need thorough
>> analysis so that the partitions are well balanced...
>>
>> On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg wrote:
Hi Boromir,
Assuming the tree fits in memory, and what you want to do is parallelize
the computation, the 'obvious' way is the following (see the sketch after
the list):
* broadcast the tree T to each worker (ok since it fits in memory)
* construct an RDD for the deepest level - each element in the RDD is
(parent,data_at_node)
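Concretely, something like this (a minimal sketch under my own assumptions
about the node layout; the combine function and the re-keying step at the end
are illustrative, not part of the recipe above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations

// Illustrative node type; the real payload depends on your tree.
case class Node(id: Int, parent: Int, data: Double, depth: Int)

val sc = new SparkContext(new SparkConf().setAppName("tree-par"))
val tree: Map[Int, Node] = Map()   // assumed to fit in driver memory
val bcTree = sc.broadcast(tree)    // broadcast the tree T to each worker

// construct an RDD for the deepest level: (parent, data_at_node)
val maxDepth = tree.values.map(_.depth).reduceOption(_ max _).getOrElse(0)
val deepest = sc.parallelize(
  tree.values.filter(_.depth == maxDepth).map(n => (n.parent, n.data)).toSeq)

// One natural next step: combine each parent's children in parallel, then
// re-key by grandparent to walk up the tree level by level.
val nextLevel = deepest
  .reduceByKey(_ + _)  // `+` stands in for your real combine function
  .map { case (id, agg) => (bcTree.value(id).parent, agg) }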