So I'm attempting to pre-compute my data such that I can pull an RDD from a
checkpoint. However, I'm finding that upon running the same job twice, the
system simply recreates the RDD from scratch.
Here is the code I'm implementing to create the checkpoint:
def checkpointTeam(checkpointDir:St
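(The definition above is cut off in the archive.) As far as I know, plain RDD
checkpoint files are not re-read by a fresh application through any public
API, so a second run rebuilds the lineage from scratch. A minimal sketch of
one workaround, using saveAsObjectFile/objectFile with a hypothetical
loadOrCompute helper:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Save the RDD the first time it is computed; later runs read the saved copy.
def loadOrCompute[T: ClassTag](sc: SparkContext, path: String)(compute: => RDD[T]): RDD[T] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(path))) {
    sc.objectFile[T](path)            // later runs: load the precomputed data
  } else {
    val rdd = compute
    rdd.saveAsObjectFile(path)        // first run: materialise and save
    rdd
  }
}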
Hi all,
So I have a question about persistence. Let's say I have an RDD that's
persisted MEMORY_AND_DISK, and I know that I now have enough memory space
cleared up that I can force the data on disk into memory. Is it possible to
tell Spark to re-evaluate the open RDD memory and move that information into
it?
Thank you
On Thu, Mar 24, 2016 at 1:21 AM Takeshi Yamamuro
wrote:
> just re-sent,
>
>
> -- Forwarded message --
> From: Takeshi Yamamuro
> Date: Thu, Mar 24, 2016 at 5:19 PM
> Subject: Re: Forcing data from disk to memory
> To: Daniel Imberman
>
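For what it's worth, Spark will not migrate blocks that have already spilled
to disk back into memory on its own. A minimal sketch of the usual workaround,
dropping the cached copy and re-caching at a memory-only level; promoteToMemory
is just an illustrative helper name, and note that the re-cache recomputes the
RDD from its lineage:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def promoteToMemory[T](rdd: RDD[T]): RDD[T] = {
  rdd.unpersist(blocking = true)           // drop the MEMORY_AND_DISK copy
  rdd.persist(StorageLevel.MEMORY_ONLY)    // re-register as memory-only
}

// Calling an action afterwards, e.g. promoteToMemory(myRdd).count(), forces
// the re-materialisation now that more memory is available.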
Hi All,
I've developed a spark module in scala that I would like to add a python
port for. I want to be able to allow users to create a pyspark RDD and send
it to my system. I've been looking into the pyspark source code as well as
py4J and was wondering if there has been anything like this implemented.
k! :)
>
> On Wed, Jun 22, 2016 at 7:07 PM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> Hi All,
>>
>> I've developed a spark module in scala that I would like to add a python
>> port for. I want to be able to allow users to create a pysp
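For anyone searching later: on the JVM side a PySpark RDD is backed by a
JavaRDD[Array[Byte]] of pickled elements, so one hedged sketch of a bridge is
a plain Scala object that Python reaches through the Py4J gateway. Everything
below, including the MyModulePythonBridge name, is illustrative rather than an
established API:

import org.apache.spark.api.java.JavaRDD

object MyModulePythonBridge {
  // Receives the JavaRDD[Array[Byte]] that backs a PySpark RDD (rdd._jrdd).
  // Real code would unpickle the bytes; here we only count them.
  def process(pickled: JavaRDD[Array[Byte]]): Long = pickled.rdd.count()
}

From Python the call would then look roughly like
sc._jvm.<package>.MyModulePythonBridge.process(py_rdd._jrdd), assuming the jar
is on the driver's classpath.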
Hi all,
So I've been attempting to reformat a project I'm working on to use the
Dataset API and have been having some issues with encoding errors. From
what I've read, I think that I should be able to store Arrays of primitive
values in a Dataset. However, the following class gives me encoding errors.
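(The class itself was cut off above.) For reference, a minimal sketch of the
shape that does encode cleanly, assuming Spark 2.x (on 1.6 the same idea goes
through sqlContext.implicits._); Record is just an illustration. A common
cause of these errors is defining the case class inside another class or
method rather than at the top level.

import org.apache.spark.sql.SparkSession

case class Record(id: Int, values: Array[Double])   // primitives and arrays of primitives

val spark = SparkSession.builder().master("local[*]").appName("encoder-check").getOrCreate()
import spark.implicits._

val ds = Seq(Record(1, Array(1.0, 2.0)), Record(2, Array(3.0))).toDS()
ds.show()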
There's no real way of doing nested for-loops with RDDs, because the whole
idea is that an RDD can hold far more data than would fit on a single worker.
There are, however, ways to handle what you're asking about.
I would personally use something like CoGroup or
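A minimal sketch of the cogroup idea, assuming two pair RDDs keyed by the
value you would otherwise loop over and an existing SparkContext sc:

val left  = sc.parallelize(Seq(1 -> "a", 1 -> "b", 2 -> "c"))
val right = sc.parallelize(Seq(1 -> "x", 2 -> "y", 2 -> "z"))

// cogroup puts the matching groups from both sides on the same worker, so the
// "inner loop" only runs over the values that share a key.
val pairwise = left.cogroup(right).flatMap { case (key, (ls, rs)) =>
  for (l <- ls; r <- rs) yield (key, (l, r))
}
pairwise.collect().foreach(println)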
Hi,
I’m looking for a way to compare subsets of an RDD intelligently.
Let's say I had an RDD with key/value pairs of type (Int->T). I eventually
need to say “compare all values of key 1 with all values of key 2 and
compare values of key 3 to the values of key 5 and key 7”, how would I go
about doing this?
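One hedged sketch of how this could look, assuming rdd: RDD[(Int, T)] with
T = String here and a user-supplied compare function (both are stand-ins):
each requested pair of keys becomes its own comparison group, and records are
re-keyed by group id and tagged with the side of the comparison they belong to.

val rdd = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 5 -> "d", 7 -> "e"))
def compare(x: String, y: String): Int = x.compareTo(y)     // stand-in comparison

val comparisons = Seq((1, 2), (3, 5), (3, 7)).zipWithIndex  // ((keyA, keyB), groupId)

val tagged = rdd.flatMap { case (k, v) =>
  comparisons.collect {
    case ((a, _), gid) if k == a => (gid, (true, v))    // "left" side of the group
    case ((_, b), gid) if k == b => (gid, (false, v))   // "right" side of the group
  }
}

val results = tagged.groupByKey().flatMap { case (gid, values) =>
  val (lefts, rights) = values.partition(_._1)
  for (l <- lefts; r <- rights) yield (gid, compare(l._2, r._2))
}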
Could you please post the associated code and output?
On Mon, Jan 4, 2016 at 3:55 PM Arun Luthra wrote:
> I tried groupByKey and noticed that it did not group all values into the
> same group.
>
> In my test dataset (a Pair rdd) I have 16 records, where there are only 4
> distinct keys, so I exp
Could you try simplifying the key and seeing if that makes any difference?
Make it just a string or an int so we can rule out any issues with object
equality.
On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra wrote:
> Spark 1.5.0
>
> data:
>
> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
> p1,lo2,8,0,4,0,5
Jan 4, 2016 at 5:05 PM Arun Luthra wrote:
> If I simplify the key to String column with values lo1, lo2, lo3, lo4, it
> works correctly.
>
> On Mon, Jan 4, 2016 at 4:49 PM, Daniel Imberman > wrote:
>
>> Could you try simplifying the key and seeing if that makes any
>> d
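For the archives, a small illustration of the object-equality point:
groupByKey relies entirely on the key's equals/hashCode, so a case-class key
groups as expected while a hand-written class without them (or an Array key)
produces one group per record. The GoodKey/BadKey names are just for
illustration.

case class GoodKey(site: String, code: Int)      // structural equality
class BadKey(val site: String, val code: Int)    // reference equality only

val good = sc.parallelize(Seq(GoodKey("lo1", 8) -> 1, GoodKey("lo1", 8) -> 2))
val bad  = sc.parallelize(Seq(new BadKey("lo1", 8) -> 1, new BadKey("lo1", 8) -> 2))

good.groupByKey().count()   // 1 group
bad.groupByKey().count()    // 2 groups: the two keys are never equal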
Hi all,
I'm looking for a way to efficiently partition an RDD, but allow the same
data to exist on multiple partitions.
Let's say I have a key-value RDD with keys {1,2,3,4}
I want to be able to repartition the RDD so that the partitions look
like
p1 = {1,2}
p2 = {2,3}
p3 = {3,4}
Localit
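(The locality note above is cut off.) A minimal sketch of one way to get
overlapping partitions, assuming a pair RDD keyed by Int: replicate each
record under every partition id that should hold it, then partition by that
id. ExplicitPartitioner is just an illustrative name; with the mapping below,
partition 0 holds keys {1,2}, partition 1 holds {2,3} and partition 2 holds
{3,4}, matching p1/p2/p3 above.

import org.apache.spark.Partitioner

val rdd = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
val targets = Map(1 -> Seq(0), 2 -> Seq(0, 1), 3 -> Seq(1, 2), 4 -> Seq(2))

class ExplicitPartitioner(n: Int) extends Partitioner {
  override def numPartitions: Int = n
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

val overlapping = rdd
  .flatMap { case (k, v) => targets(k).map(p => (p, (k, v))) }  // one copy per target partition
  .partitionBy(new ExplicitPartitioner(3))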
>
> On Mon, Jan 11, 2016 at 10:51 AM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm looking for a way to efficiently partition an RDD, but allow the same
>> data to exists on multiple partitions.
>>
>>
>> L
Hi Kira,
I'm having some trouble understanding your question. Could you please give
a code example?
From what I think you're asking, there are two issues with what you're
looking to do. (Please keep in mind I could be totally wrong on both of
these assumptions, but this is what I've been led to believe.)
I'm looking for a way to send structures to pre-determined partitions so that
they can be used by another RDD in a mapPartition.
Essentially I'm given an RDD of SparseVectors and an RDD of inverted
indexes. The inverted index objects are quite large.
My hope is to do a MapPartitions within the R
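(Cut off above.) One hedged sketch of how the two RDDs could be lined up
without broadcasting the inverted indexes, using Map[String, Long] and String
as stand-ins for the real index and vector types: co-partition both RDDs with
the same partitioner, so partition i of each holds the same keys, and then let
zipPartitions hand every vector partition its local slice of the index in a
single mapPartitions-style pass.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def withLocalIndex(vectors: RDD[(Int, String)],
                   indexes: RDD[(Int, Map[String, Long])],
                   numPartitions: Int): RDD[(Int, Long)] = {
  val part = new HashPartitioner(numPartitions)
  val v = vectors.partitionBy(part)
  val i = indexes.partitionBy(part)
  v.zipPartitions(i, preservesPartitioning = true) { (vecIter, idxIter) =>
    val localIndex = idxIter.toMap                 // only this partition's index shards
    vecIter.map { case (k, vec) =>
      (k, localIndex.getOrElse(k, Map.empty[String, Long]).getOrElse(vec, 0L))  // toy lookup
    }
  }
}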
Hi Darin,
You should read this article. TextFile is very inefficient in S3.
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
Cheers
On Wed, Jan 13, 2016 at 11:43 AM Darin McBeath
wrote:
> I'm looking for some suggestions based on other's experiences.
>
> I currently
very fast.
>
> Cheers
>
> On Wed, Jan 13, 2016 at 11:29 AM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> I'm looking for a way to send structures to pre-determined partitions so
>> that
>> they can be used by another RDD in a mapPartition.
>
> what I'm doing already. Was just thinking there might be a better way.
>
> Darin.
> ------
> *From:* Daniel Imberman
> *To:* Darin McBeath ; User
> *Sent:* Wednesday, January 13, 2016 2:48 PM
> *Subject:* Re: Best practice for retrievin
ially AWS
but you have to do your own virtualization scripts). Do you have any other
thoughts on how I could go about dealing with this purely using spark and
HDFS?
Thank you
On Wed, Jan 13, 2016 at 11:49 AM Daniel Imberman
wrote:
> Thank you Ted! That sounds like it would probably be
on this :-)
>
> On Thu, Jan 14, 2016 at 3:51 PM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> So unfortunately after looking into the cluster manager that I will be
>> using for my testing (I'm using a super-computer called
On Sat, Jan 16, 2016 at 9:38 AM Koert Kuipers wrote:
> Just doing a join is not an option? If you carefully manage your
> partitioning then this can be pretty efficient (meaning no extra shuffle,
> basically map-side join)
> On Jan 13, 2016 2:30 PM, "Daniel Imberman"
> wrote:
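A minimal sketch of the co-partitioned join Koert describes, assuming an
existing SparkContext sc: once both sides are hash-partitioned the same way,
and the partitioned RDDs are cached so that partitioning is reused, the join
itself adds no extra shuffle, which is effectively the map-side join mentioned
above.

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(8)

val vectors = sc.parallelize(Seq(1 -> "v1", 2 -> "v2")).partitionBy(part).cache()
val indexes = sc.parallelize(Seq(1 -> "i1", 2 -> "i2")).partitionBy(part).cache()

val joined = vectors.join(indexes)   // co-partitioned, so no additional shuffle here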
> On Sat, Jan 16, 2016 at 9:39 AM, Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> I think I might have figured something out!(Though I haven't tested it at
>> scale yet)
>>
>> My current thought is that I can do a groupB
Hi Richard,
If I understand the question correctly it sounds like you could probably do
this using mapValues (I'm assuming that you want two pieces of information
out of all rows, the states as individual items, and the number of states
in the row)
val separatedInputStrings = input:RDD[(Int, Str
edit: Mistake in the second code example
val numColumns = separatedInputStrings.filter{ case(id, (stateList,
numStates)) => numStates}.reduce(math.max)
On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman
wrote:
> Hi Richard,
>
> If I understand the question correctly it sounds li
edit 2: filter should be map
val numColumns = separatedInputStrings.map{ case(id, (stateList,
numStates)) => numStates}.reduce(math.max)
On Tue, Jan 19, 2016 at 8:19 AM Daniel Imberman
wrote:
> edit: Mistake in the second code example
>
> val numColumns = separatedInputStrings.fil
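Putting the two corrections together, a minimal end-to-end sketch, assuming
the input is an RDD[(Int, String)] of comma-separated state lists (the exact
input type was cut off in the archive):

val input = sc.parallelize(Seq(1 -> "CA,NV,OR", 2 -> "WA,ID"))

val separatedInputStrings = input.mapValues { row =>
  val states = row.split(",")
  (states.toList, states.length)          // (individual states, number of states)
}

val numColumns = separatedInputStrings
  .map { case (id, (stateList, numStates)) => numStates }
  .reduce(math.max)                       // widest row determines the column count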
Hi all,
I want to set up a series of spark steps on an EMR spark cluster, and
terminate the current step if it's taking too long. However, when I ssh into
the master node and run hadoop jobs -list, the master node seems to believe
that there are no jobs running. I don't want to terminate the cluster, just the current step.
jobs. For Spark jobs, which run on YARN,
> you instead want "yarn application -list".
>
> Hope this helps,
> Jonathan (from the EMR team)
>
> On Tue, Jan 26, 2016 at 10:05 AM Daniel Imberman <
> daniel.imber...@gmail.com> wrote:
>
>> Hi all,
>>
Hi all,
So I'm currently figuring out how best to update three separate accumulators:
val a:Accumulator
val b:Accumulator
val c:Accumulator
I have an r:RDD[thing] and the code currently reads
r.foreach { thing =>
  a += thing
  b += thing
  c += thing
}
Idea
Thank you Ted!
On Wed, Feb 17, 2016 at 2:12 PM Ted Yu wrote:
> If the Accumulators are updated at the same time, calling foreach() once
> seems to have better performance.
>
> > On Feb 17, 2016, at 4:30 PM, Daniel Imberman
> wrote:
> >
> > Hi all,
> >
>
Hi guys,
So I'm running into a speed issue where I have a dataset that needs to be
aggregated multiple times.
Initially my team had set up three accumulators and were running a single
foreach loop over the data. Something along the lines of
val accum1:Accumulable[a]
val accum2: Accumulable[b]
val accum3: Accumulable[c]
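(Cut off above.) One single-pass alternative worth sketching, assuming the
three aggregations can be written as fold functions; here an RDD[Int] with
count/sum/max stands in for the real RDD[thing] and accumulators. aggregate
visits the data once and carries all three partial results in one tuple:

val r = sc.parallelize(Seq(3, 1, 4, 1, 5))

val (count, sum, max) = r.aggregate((0L, 0L, Int.MinValue))(
  { case ((c, s, m), x) => (c + 1, s + x, math.max(m, x)) },                     // fold one element in
  { case ((c1, s1, m1), (c2, s2, m2)) => (c1 + c2, s1 + s2, math.max(m1, m2)) }  // merge partitions
)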
Hi all,
So over the past few days I've been attempting to create a function that
takes an RDD[U], and creates three MMaps. I've been attempting to aggregate
these values but I'm running into a major issue.
When I initially tried to use separate aggregators for each map, I noticed
a significant slowdown.
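A hedged sketch of building all three maps in one aggregate pass, with
RDD[String] and three toy key functions standing in for the real RDD[U] and
keying logic (which was cut off). Carrying the three mutable maps together in
a single accumulator value avoids aggregating three times:

import scala.collection.mutable

type M = mutable.Map[String, Int]
def add(m: M, k: String): M = { m(k) = m.getOrElse(k, 0) + 1; m }
def merge(a: M, b: M): M = { b.foreach { case (k, v) => a(k) = a.getOrElse(k, 0) + v }; a }

val r = sc.parallelize(Seq("a", "b", "a"))

val (m1, m2, m3) = r.aggregate((mutable.Map[String, Int](), mutable.Map[String, Int](), mutable.Map[String, Int]()))(
  { case ((x, y, z), u) => (add(x, u), add(y, u.toUpperCase), add(z, u.reverse)) },
  { case ((x1, y1, z1), (x2, y2, z2)) => (merge(x1, x2), merge(y1, y2), merge(z1, z2)) }
)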
Hi Pedro,
Can you please post your code?
Daniel
On Tue, Oct 18, 2016 at 12:27 PM pedroT wrote:
> Hi guys.
>
> I know this is a well known topic, but reading about (a lot) I'm not sure
> about the answer..
>
> I need to broadcast a complex structure with a lot of objects as fields,
> including
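(The rest of the message is cut off here.) For broadcasting a large, complex
object graph in general, a minimal sketch, with Config and Lookup as
hypothetical stand-ins for the real structure: the main requirements are that
the whole graph is serializable and, if it is big, that Kryo is enabled and
the classes registered so the broadcast blocks stay compact.

import org.apache.spark.{SparkConf, SparkContext}

case class Lookup(table: Map[String, Double])
case class Config(lookups: Seq[Lookup], threshold: Double)

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("broadcast-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Config], classOf[Lookup]))

val sc = new SparkContext(conf)
val bc = sc.broadcast(Config(Seq(Lookup(Map("a" -> 1.0))), threshold = 0.5))

sc.parallelize(1 to 10).map(i => i * bc.value.threshold).count()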