Making Unpersist Lazy

2015-07-01 Thread Jem Tucker
Hi, rdd.unpersist() does not appear to be executed lazily, and therefore must be placed after an action. Is there any way to emulate lazy execution of this function so it is added to the task queue? Thanks, Jem
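
A minimal sketch of the workaround implied above, assuming a plain Scala app (names and sizes are illustrative): since unpersist() is eager, it has to follow the action that consumes the cache, and unpersist(blocking = false) at least returns without waiting for the blocks to be dropped.

    import org.apache.spark.{SparkConf, SparkContext}

    object UnpersistAfterAction {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("unpersist-sketch").setMaster("local[*]"))
        val rdd = sc.parallelize(1 to 1000000).cache()
        val total = rdd.sum() // the action that actually materialises the cached blocks
        rdd.unpersist(blocking = false) // eager, hence placed after the action; non-blocking variant
        println(total)
        sc.stop()
      }
    }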

Re: import errors with Eclipse Scala

2015-07-01 Thread Jem Tucker
In Eclipse you can just add the Spark assembly jar to the build path: right-click the project > Build Path > Configure Build Path > Libraries > Add External JARs. On Wed, Jul 1, 2015 at 7:15 PM Stefan Panayotov wrote: > Hi Ted, > > How can I import the relevant Spark projects into Eclipse? > Do I n

Re: Making Unpersist Lazy

2015-07-02 Thread Jem Tucker
be used the persist/unpersist does not work effectively Thanks Jem On Thu, 2 Jul 2015 at 08:18, Akhil Das wrote: > rdd's which are no longer required will be removed from memory by spark > itself (which you can consider as lazy?). > > Thanks > Best Regards > > On Wed, J

Accessing the console from spark

2015-07-03 Thread Jem Tucker
Hi, We have an application that requires a username/password to be entered from the command line. To mask a password in Java you need to use System.console().readPassword(); however, when running with Spark, System.console() returns null. Any ideas on how to get the console from Spark? Thanks, Jem
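
One possible fallback, not taken from the thread: check for a null console and read from stdin instead, accepting that the input is then not masked. A hedged sketch:

    import java.io.{BufferedReader, InputStreamReader}

    def readPassword(prompt: String): Array[Char] = {
      val console = System.console()
      if (console != null) {
        console.readPassword(prompt) // masks input when a real console is attached
      } else {
        // spark-submit typically redirects stdin/stdout, leaving System.console() null;
        // this branch reads the password unmasked as a last resort
        print(prompt)
        new BufferedReader(new InputStreamReader(System.in)).readLine().toCharArray
      }
    }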

Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
ards > > On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker wrote: > >> Hi, >> >> We have an application that requires a username/password to be entered >> from the command line. To mask a password in Java you need to use >> System.console().readPassword() however when r

Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
val console = System.console() <- null pointer exception here val pass = console.readPassword("password: ") thanks, Jem On Fri, Jul 3, 2015 at 11:04 AM Akhil Das wrote: > Can you paste the code? Something is missing > > Thanks > Best Regards > > On Fri, Jul 3, 2015 at 3:14 PM, Jem

Spark Parallelism

2015-07-13 Thread Jem Tucker
Hi All, We have recently begun performance testing our Spark application and have found that changing the default parallelism has a much larger effect on performance than expected, meaning there seems to be an elusive sweet spot that depends on the input size. Does anyone have any idea of a
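
For context, a sketch of where that knob lives; the value is purely illustrative, and the point of the thread is that the right number depends on the input size.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parallelism-test")
      // default partition count for shuffle operations such as reduceByKey and join
      .set("spark.default.parallelism", "200")
    val sc = new SparkContext(conf)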

Re: creating a distributed index

2015-07-15 Thread Jem Tucker
With regard to indexed structures in Spark, are there any alternatives to IndexedRDD for more generic keys, including Strings? Thanks Jem On Wed, Jul 15, 2015 at 7:41 AM Burak Yavuz wrote: > Hi Swetha, > > IndexedRDD is available as a package on Spark Packages >

Re: creating a distributed index

2015-07-15 Thread Jem Tucker
//github.com/amplab/spark-indexedrdd/blob/master/src/main/scala/edu/berkeley/cs/amplab/spark/indexedrdd/KeySerializer.scala>, > including Strings. It's not released yet, but you can use it from the > master branch if you're interested. > > Ankur <http://www.ankurdave.com/>
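
A sketch of what String-keyed usage might look like, assuming the constructor and point-lookup API shown in the spark-indexedrdd README plus the master-branch String KeySerializer mentioned above:

    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD
    import edu.berkeley.cs.amplab.spark.indexedrdd.IndexedRDD._

    val pairs = sc.parallelize(Seq(("alice", 1), ("bob", 2)))
    val indexed = IndexedRDD(pairs).cache() // should pick up the String KeySerializer implicitly
    indexed.get("alice")                    // point lookup against the index => Some(1)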

Indexed Store for lookup table

2015-07-16 Thread Jem Tucker
Hello, I have been using IndexedRDD as a large lookup table (1 billion records) to join with small tables (1 million rows). The performance of IndexedRDD is great until it has to be persisted on disk. Are there any alternatives to IndexedRDD or any changes to how I use it to improve performance with big
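
Not from the replies, but when one side is only ~1 million rows a standard alternative is to broadcast the small side and do a map-side lookup, avoiding both the shuffle and the on-disk index. A hedged sketch with illustrative names and types:

    import org.apache.spark.rdd.RDD

    def broadcastJoin(bigRdd: RDD[(Long, String)],
                      smallRdd: RDD[(Long, String)]): RDD[(Long, (String, String))] = {
      // ~1M rows: small enough to collect and ship to every executor
      val bsmall = bigRdd.sparkContext.broadcast(smallRdd.collect().toMap)
      bigRdd.flatMap { case (k, v) =>
        bsmall.value.get(k).map(w => (k, (v, w))) // emit only keys present in the small table
      }
    }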

Re: Indexed Store for lookup table

2015-07-16 Thread Jem Tucker
t will take some > time in any case. > > Regards, > Vetle > > > On Thu, Jul 16, 2015 at 10:02 AM Jem Tucker wrote: > >> Hello, >> >> I have been using IndexedRDD as a large lookup (1 billion records) to >> join with small tables (1 million rows). Th

Re: Indexed Store for lookup table

2015-07-16 Thread Jem Tucker
ably have to install it separately. >> >> On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker wrote: >> >>> Hi Vetle, >>> >>> IndexedRDD is persisted in the same way RDDs are as far as I am aware. >>> Are you aware if Cassandra can be built into my applicati

Unread block data error

2015-07-17 Thread Jem Tucker
Hi, I have been running a batch of data through my application for the last couple of days and this morning discovered it had fallen over with the following error. java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStre

Re: java.lang.NegativeArraySizeException? as iterating a big RDD

2015-10-23 Thread Jem Tucker
Hi Yifan, I think this is a result of Kryo trying to serialize something too large. Have you tried to increase your partitioning? Cheers, Jem On Fri, Oct 23, 2015 at 11:24 AM Yifan LI wrote: > Hi, > > I have a big sorted RDD sRdd (~962 million elements), and need to scan its > elements in orde
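
A sketch of that suggestion, with an illustrative multiplier: more partitions means smaller per-task blocks for Kryo to serialize.

    // sRdd is the ~962 million element RDD from the question
    val finer = sRdd.repartition(sRdd.partitions.length * 4)
    // if finer partitioning alone does not help, spark.kryoserializer.buffer.max is another common knob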

Spark on YARN

2015-08-07 Thread Jem Tucker
Hi, I am running Spark on YARN on the CDH 5.3.2 stack. I have created a new user to own and run a testing environment; however, applications I submit to YARN as this user never begin to run, even when they are exactly the same applications that succeed under another user. Has anyone seen an

Re: Spark on YARN

2015-08-08 Thread Jem Tucker
Ryza wrote: > Hi Jem, > > Do they fail with any particular exception? Does YARN just never end up > giving them resources? Does an application master start? If so, what are > in its logs? If not, anything suspicious in the YARN ResourceManager logs? > > -Sandy > > On

Re: Spark on YARN

2015-08-08 Thread Jem Tucker
the RM web UI, do you see any available resources to spawn > the application master container? > > > On Sat, Aug 8, 2015 at 4:37 AM, Jem Tucker wrote: > >> Hi Sandy, >> >> The application doesn't fail, it gets accepted by yarn but the >> application mast

Re: Spark on YARN

2015-08-10 Thread Jem Tucker
n since another user's max > vcore limit is not reached. > > On Sat, Aug 8, 2015 at 10:07 PM, Jem Tucker wrote: > >> Hi dustin, >> >> Yes there are enough resources available, the same application run with a >> different user works fine so I think it is

Re: Relation between threads and executor core

2015-08-26 Thread Jem Tucker
Hi Samya, When submitting an application with spark-submit, the cores per executor can be set with --executor-cores, meaning you can run that many tasks concurrently per executor. The page below has some more details on submitting applications: https://spark.apache.org/docs/latest/submitting-appli
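
For reference, the same knobs the spark-submit flags control, expressed as configuration; the values are illustrative.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("cores-example")
      .set("spark.executor.cores", "4")      // equivalent of --executor-cores: concurrent tasks per executor
      .set("spark.executor.instances", "10") // equivalent of --num-executors on YARN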

Re: Relation between threads and executor core

2015-08-26 Thread Jem Tucker
ontrol? * > > > > Regards, > > Sam > > > > *From:* Jem Tucker [mailto:jem.tuc...@gmail.com] > *Sent:* Wednesday, August 26, 2015 2:26 PM > *To:* Samya MAITI ; user@spark.apache.org > *Subject:* Re: Relation between threads and executor core > > > > Hi S

RDD from partitions

2015-08-28 Thread Jem Tucker
Hi, I am trying to create an RDD from a selected subset of its parent's partitions. My current approach is to create my own SelectedPartitionRDD and implement compute and numPartitions myself; the problem is that the compute method is marked @DeveloperApi, and hence unsuitable for me to be using in my ap
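
One public-API alternative (not necessarily what the thread settled on): drop the unwanted partitions with mapPartitionsWithIndex, at the cost of still scheduling a trivial task for every parent partition. A hedged sketch:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // 'selected' holds the partition ids to keep; names are illustrative
    def selectPartitions[T: ClassTag](parent: RDD[T], selected: Set[Int]): RDD[T] =
      parent.mapPartitionsWithIndex(
        (idx, iter) => if (selected.contains(idx)) iter else Iterator.empty,
        preservesPartitioning = true)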

Re: RDD from partitions

2015-08-28 Thread Jem Tucker
.contains(TaskContext.get().partitionId)) { > false > } else { > iter.hasNext > } > } > > override def next(): Int = iter.next() > } > > } > }.collect().foreach(println) > > > > > On Fri, Aug 28, 2015 at 12:33 PM, Jem Tucker

Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
Hi, You just need to extend Partitioner and override the numPartitions and getPartition methods, see below: class MyPartitioner extends Partitioner { def numPartitions: Int = // return the number of partitions def getPartition(key: Any): Int = // return the partition for a given key } On Tue,
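
A filled-in, runnable version of the sketch above; the non-negative-modulo scheme is illustrative.

    import org.apache.spark.Partitioner

    class MyPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int =
        // non-negative modulo so negative hashCodes still land in [0, partitions)
        ((key.hashCode % partitions) + partitions) % partitions
    }

    // usage: pairRdd.partitionBy(new MyPartitioner(8))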

Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
e > range partitioner. > > On Tue, Sep 1, 2015 at 3:22 PM, Jem Tucker wrote: > >> Hi, >> >> You just need to extend Partitioner and override the numPartitions and >> getPartition methods, see below >> >> class MyPartitioner extends Partitioner { >>

Re: Custom Partitioner

2015-09-01 Thread Jem Tucker
t; > I think range partitioner is not available in pyspark, so if we want > create one. how should we create that. my question is that. > > On Tue, Sep 1, 2015 at 3:57 PM, Jem Tucker wrote: > >> Ah sorry I miss read your question. In pyspark it looks like you just >> ne

Re: Custom Partitioner

2015-09-02 Thread Jem Tucker
:42 PM, Davies Liu wrote: > >> You can take the sortByKey as example: >> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L642 >> >> On Tue, Sep 1, 2015 at 3:48 AM, Jem Tucker wrote: >> > something like... >> > >> > clas

IndexedRDD

2015-01-13 Thread Jem Tucker
Hi, I have been playing around with IndexedRDD ( https://issues.apache.org/jira/browse/SPARK-2365, https://github.com/amplab/spark-indexedrdd) and have been very impressed with its performance. Some performance testing has revealed worse than expected scaling of the join performance*, and I wa

Re: IndexedRDD

2015-01-13 Thread Jem Tucker
Andrew Ash wrote: > >> Hi Jem, >> >> Linear time in scaling on the big table doesn't seem that surprising to >> me. What were you expecting? >> >> I assume you're doing normalRDD.join(indexedRDD). If you were to replace >> the indexedRDD with

FileInputDStream missing files

2015-01-14 Thread Jem Tucker
Hi all, A small number of the files being moved into my landing directory are not being "seen" by my fileStream receiver. After looking at the code it seems that, in the case of long batches (> 1 minute), if files are created before a batch finishes, but only become visible after that batch finishe
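
For reference, a minimal sketch of the receiver in question, assuming an existing SparkContext sc (path, types and batch interval are illustrative). Files whose modification time falls outside the batch window being processed are the ones at risk of being missed.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60)) // long batches make the race more likely
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///landing")
      .map(_._2.toString)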