Re: writing to local files on a worker

2018-11-15 Thread Steve Lewis
ws IOException{ LineNumberReader rdr = new LineNumberReader(new FileReader(f)); StringBuilder sb = new StringBuilder(); String line = rdr.readLine(); while(line != null) { sb.append(line); sb.append("\n"); line = rdr.readLine();
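For reference, a complete version of the quoted file-reading helper might look like the sketch below (the class and method names are illustrative):

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;

public class LocalFileUtil {
    /** Read an entire local file into a String, one line at a time. */
    public static String readFile(File f) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (LineNumberReader rdr = new LineNumberReader(new FileReader(f))) {
            String line = rdr.readLine();
            while (line != null) {
                sb.append(line);
                sb.append("\n");
                line = rdr.readLine();
            }
        }
        return sb.toString();
    }
}
```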

Re: writing to local files on a worker

2018-11-12 Thread Steve Lewis
are within a map step. > > Generally you should not call external applications from Spark. > > > On 11.11.2018 at 23:13 Steve Lewis wrote: > > > > I have a problem where a critical step needs to be performed by a third > party c++ application. I can send or install

writing to local files on a worker

2018-11-11 Thread Steve Lewis
I have a problem where a critical step needs to be performed by a third party c++ application. I can send or install this program on the worker nodes. I can construct a function holding all the data this program needs to process. The problem is that the program is designed to read and write from
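A hedged sketch of one common pattern for this situation: write each element to a temporary file on the worker's local disk, run the installed executable with ProcessBuilder from inside mapPartitions, and read its output back. The executable path and the file-based protocol are assumptions, and the code targets the Spark 2.x Java API, where FlatMapFunction returns an Iterator:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ExternalToolRunner {

    /**
     * Run a locally installed executable over each element of an RDD from inside
     * mapPartitions.  The temp-file protocol is an assumption; adjust it to however
     * the real program reads and writes its data.
     */
    public static JavaRDD<String> runTool(JavaRDD<String> input, final String exePath) {
        return input.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
            @Override
            public Iterator<String> call(Iterator<String> it) throws Exception {
                List<String> results = new ArrayList<String>();
                while (it.hasNext()) {
                    // write the element to a temporary input file on the worker's local disk
                    File in = File.createTempFile("tool-in-", ".txt");
                    File out = File.createTempFile("tool-out-", ".txt");
                    Files.write(in.toPath(), it.next().getBytes("UTF-8"));

                    // launch the external program and wait for it to finish
                    Process p = new ProcessBuilder(exePath, in.getAbsolutePath(), out.getAbsolutePath())
                            .redirectErrorStream(true)
                            .start();
                    if (p.waitFor() != 0) {
                        throw new IOException("external tool failed for " + in);
                    }

                    // collect the program's output and clean up
                    results.add(new String(Files.readAllBytes(out.toPath()), "UTF-8"));
                    in.delete();
                    out.delete();
                }
                return results.iterator();
            }
        });
    }
}
```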

No space left on device

2018-08-20 Thread Steve Lewis
We are trying to run a job that has previously run on Spark 1.3 on a different cluster. The job was converted to Spark 2.3 and this is a new cluster. The job dies after completing about a half dozen stages with java.io.IOException: No space left on device. It appears that the nodes are us

How do i get a spark instance to use my log4j properties

2016-04-12 Thread Steve Lewis
OK, I am stymied. I have tried everything I can think of to get Spark to use my own version of log4j.properties. In the launcher code - I launch a local instance from a Java application - I pass -Dlog4j.configuration=conf/log4j.properties, where conf/log4j.properties is relative to user.dir - no luck. Spark alwa

I need some help making datasets with known columns from a JavaBean

2016-03-01 Thread Steve Lewis
I asked a similar question a day or so ago, but this is a much more concrete example showing the difficulty I am running into. I am trying to use DataSets. I have an object which I want to encode with its fields as columns. The object is a well-behaved Java Bean. However one field is an object (or

DataSet Evidence

2016-02-29 Thread Steve Lewis
I have a relatively complex Java object that I would like to use in a dataset if I say Encoder evidence = Encoders.kryo(MyType.class); JavaRDD rddMyType= generateRDD(); // some code Dataset datasetMyType= sqlCtx.createDataset( rddMyType.rdd(), evidence); I get one column - the whole object
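The usual answer to the one-column problem is Encoders.bean, which inspects the bean's getters and produces one column per property, whereas Encoders.kryo stores the whole object as a single binary column. A minimal sketch, with a placeholder bean standing in for MyType:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

import java.io.Serializable;

public class BeanEncoderExample {

    /** A stand-in for MyType: a well-behaved Java bean with getters and setters. */
    public static class MyType implements Serializable {
        private String id;
        private double score;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public double getScore() { return score; }
        public void setScore(double score) { this.score = score; }
    }

    public static Dataset<MyType> toDataset(SQLContext sqlCtx, JavaRDD<MyType> rddMyType) {
        // Encoders.kryo(MyType.class) serializes the whole object into one binary column;
        // Encoders.bean(MyType.class) inspects the bean properties and gives one column per field.
        Encoder<MyType> evidence = Encoders.bean(MyType.class);
        Dataset<MyType> datasetMyType = sqlCtx.createDataset(rddMyType.rdd(), evidence);
        datasetMyType.printSchema();   // should show id and score as separate columns
        return datasetMyType;
    }
}
```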

Re: Datasets and columns

2016-01-25 Thread Steve Lewis
Type.printSchema() > > On Mon, Jan 25, 2016 at 1:16 PM, Steve Lewis > wrote: > >> assume I have the following code >> >> SparkConf sparkConf = new SparkConf(); >> >> JavaSparkContext sqlCtx= new JavaSparkContext(sparkConf); >> >> JavaRDD rddMyTy

Datasets and columns

2016-01-25 Thread Steve Lewis
assume I have the following code SparkConf sparkConf = new SparkConf(); JavaSparkContext sqlCtx= new JavaSparkContext(sparkConf); JavaRDD rddMyType= generateRDD(); // some code Encoder evidence = Encoders.kryo(MyType.class); Dataset datasetMyType= sqlCtx.createDataset( rddMyType.rdd(), evidence

Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Steve Lewis
ambda functions if you want though: > > ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2) > => > ... > } > > > On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis > wrote: > >> We have been working a large search problem which we have

I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Steve Lewis
We have been working on a large search problem which we have been solving in the following ways. We have two sets of objects, say children and schools. The object is to find the closest school to each child. There is a distance measure but it is relatively expensive and would be very costly to apply

Strange Set of errors

2015-12-14 Thread Steve Lewis
I am running on a Spark 1.5.1 cluster managed by Mesos - I have an application that handles a chemistry problem whose size can be increased by increasing the number of atoms, which increases the number of Spark stages. I do a repartition at each stage - Stage 9 is the last stage. At each stage the size and

Has the format of a spark jar file changed in 1.5

2015-12-12 Thread Steve Lewis
I have been using my own code to build the jar file I use for spark submit. In 1.4 I could simply add all class and resource files I find in the class path to the jar and add all jars in the classpath into a directory called lib in the jar file. In 1.5 I see that resources and classes in jars in t

Looking for a few Spark Benchmarks

2015-07-19 Thread Steve Lewis
I was in a discussion with someone who works for a cloud provider which offers Spark/Hadoop services. We got into a discussion of performance and the bewildering array of machine types and the problem of selecting a cluster with 20 "Large" instances VS 10 "Jumbo" instances or the trade offs betwee

Re: How can I force operations to complete and spool to disk

2015-05-07 Thread Steve Lewis
ake a look in the algorithm once more. "Tasks typically > preform both operations several hundred thousand times." why it can not be > done distributed way? > > On Thu, May 7, 2015 at 3:16 PM, Steve Lewis wrote: > >> I am performing a job where I perform a number of st

How can I force operations to complete and spool to disk

2015-05-06 Thread Steve Lewis
I am performing a job where I perform a number of steps in succession. One step is a map on a JavaRDD which generates objects taking up significant memory. This is followed by a join and an aggregateByKey. The problem is that the system is getting OutOfMemoryErrors - most tasks work but

Re: Can a map function return null

2015-04-19 Thread Steve Lewis
t from Samsung Mobile > > > Original message > From: Olivier Girardot > Date:2015/04/18 22:04 (GMT+00:00) > To: Steve Lewis ,user@spark.apache.org > Subject: Re: Can a map function return null > > You can return an RDD with null values inside, and afterward

Can a map function return null

2015-04-18 Thread Steve Lewis
I find a number of cases where I have a JavaRDD and I wish to transform the data and, depending on a test, return 0 or one item (don't suggest a filter - the real case is more complex). So I currently do something like the following - perform a flatmap returning a list with 0 or 1 entry depending on
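A minimal sketch of that flatmap idiom, written against the Spark 1.x Java API (where FlatMapFunction.call returns an Iterable; newer releases return an Iterator). The test and transform methods are placeholders:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

import java.util.Collections;

public class ZeroOrOneExample {
    /**
     * Instead of returning null from a map and filtering it out later, use flatMap
     * and emit an empty collection when the test fails and a single item when it passes.
     */
    public static JavaRDD<String> keepInteresting(JavaRDD<String> input) {
        return input.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String value) {
                if (!passesTest(value)) {
                    return Collections.<String>emptyList();              // emit nothing
                }
                return Collections.singletonList(transform(value));      // emit exactly one item
            }
        });
    }

    private static boolean passesTest(String s) { return s != null && !s.isEmpty(); }
    private static String transform(String s) { return s.trim(); }
}
```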

Fwd: Numbering RDD members Sequentially

2015-03-11 Thread Steve Lewis
-- Forwarded message -- From: Steve Lewis Date: Wed, Mar 11, 2015 at 9:13 AM Subject: Re: Numbering RDD members Sequentially To: "Daniel, Ronald (ELS-SDG)" perfect - exactly what I was looking for, not quite sure why it is called zipWithIndex since zipping is not i

Numbering RDD members Sequentially

2015-03-10 Thread Steve Lewis
I have a Hadoop InputFormat which reads records and produces JavaPairRDD locatedData where _1() is a formatted version of the file location - like "12690", "24386", "27523" ... _2() is the data to be processed. For historical reasons I want to convert _1() into an integer representing the
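As the reply above notes, zipWithIndex is the standard answer; a minimal sketch of numbering an RDD sequentially:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

public class SequentialNumbering {
    /**
     * Assign a sequential Long index to each element of an RDD.
     * zipWithIndex numbers elements in partition order, so the indices are
     * stable as long as the RDD's ordering is stable.
     */
    public static <T> JavaPairRDD<T, Long> number(JavaRDD<T> rdd) {
        return rdd.zipWithIndex();
    }
}
```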

Distributing Computation across slaves

2015-01-15 Thread Steve Lewis
I have a job involving two sets of data indexed with the same type of key. I have an expensive operation that I want to run on pairs sharing the same key. The following code works BUT all of the work is being done on 3 of 16 processors - How do I go about diagnosing and fixing the behavior. A

Is there a way to read a parquet database without generating an RDD

2015-01-06 Thread Steve Lewis
I have an application where a function needs access to the results of a select from a parquet database. Creating a JavaSQLContext and from it a JavaSchemaRDD as shown below works but the parallelism is not needed - a simple JDBC call would work - Are there alternative non-parallel ways to achieve

Developing with Pycharm

2015-01-05 Thread Steve Lewis
I am trying to use PyCharm for Spark development on a Windows 8.1 machine - I have installed py4j, added Spark python as a content root, and have Cygwin in my path. Also, using IntelliJ works for Spark Java code. When I run a simple word count below I get errors in launching a Spark local cluster - an

Is there a way (in Java) to turn Java Iterable into a JavaRDD?

2014-12-19 Thread Steve Lewis
I notice new methods such as JavaSparkContext makeRDD (with few useful examples) - It takes a Seq but while there are ways to turn a list into a Seq I see nothing that uses an Iterable
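Absent a direct Iterable overload, the workaround is the one described in the "Is there any way (in Java) to make a JavaRDD from an iterable" entry further down: copy the Iterable into a List and parallelize it, accepting that everything is held on the driver. A minimal sketch:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;

public class IterableToRDD {
    /**
     * Copy an Iterable into a List and parallelize it.  As noted in the thread,
     * this holds everything in driver memory, so it only works for bounded
     * iterables of modest size.
     */
    public static <T> JavaRDD<T> fromIterable(JavaSparkContext sc, Iterable<T> items) {
        List<T> asList = new ArrayList<T>();
        for (T item : items) {
            asList.add(item);
        }
        return sc.parallelize(asList);
    }
}
```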

Who is using Spark and related technologies for bioinformatics applications?

2014-12-17 Thread Steve Lewis
I am aware of the ADAM project in Berkeley and I am working on Proteomic searches - anyone else working in this space

Re: how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
rs if it has to. > > On Fri, Dec 12, 2014 at 2:39 PM, Steve Lewis > wrote: >> >> The objective is to let the Spark application generate a file in a format >> which can be consumed by other programs - as I said I am willing to give up >> parallelism at this stage

Re: how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
e final output file. > > > - SF > > On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis > wrote: >> >> >> I have an RDD which is potentially too large to store in memory with >> collect. I want a single task to write the contents as a file to hdfs. Time >> is not a

how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
I have an RDD which is potentially too large to store in memory with collect. I want a single task to write the contents as a file to hdfs. Time is not a large issue but memory is. I say the following converting my RDD (scans) to a local Iterator. This works but hasNext shows up as a separate task
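A sketch of one way to do this without collect: toLocalIterator pulls the RDD back one partition at a time, so the driver can stream it into a single HDFS file. The path and line format are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.util.Iterator;

public class SingleFileWriter {
    /**
     * Stream the contents of an RDD into one HDFS file from the driver.
     * toLocalIterator fetches the data one partition at a time, so the whole RDD
     * never has to fit in driver memory the way collect() would require.
     */
    public static void writeAsSingleFile(JavaRDD<String> rdd, String hdfsPath, Configuration conf)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(fs.create(new Path(hdfsPath))))) {
            Iterator<String> it = rdd.toLocalIterator();
            while (it.hasNext()) {
                out.println(it.next());   // write each element as one line
            }
        }
    }
}
```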

In Java how can I create an RDD with a large number of elements

2014-12-08 Thread Steve Lewis
assume I don't care about values which may be created in a later map - in Scala I can say val rdd = sc.parallelize(1 to 10, numSlices = 1000) but in Java JavaSparkContext can only parallelize a List - limited to Integer.MAX_VALUE elements and required to exist in memory - the best I can do
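One workaround is to parallelize only a small list of seed integers and generate the bulk of the data inside flatMap on the executors. A sketch against the Spark 1.x Java API; the counts and generated payload are illustrative:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

import java.util.ArrayList;
import java.util.List;

public class BigSyntheticRDD {
    /**
     * Parallelize only a small list of "seed" integers, then expand each seed into
     * many generated elements inside flatMap.  Only the seeds live on the driver;
     * the bulk of the data is created on the executors.
     */
    public static JavaRDD<String> generate(JavaSparkContext sc, int numSeeds, final long elementsPerSeed) {
        List<Integer> seeds = new ArrayList<Integer>();
        for (int i = 0; i < numSeeds; i++) {
            seeds.add(i);
        }
        return sc.parallelize(seeds, numSeeds).flatMap(new FlatMapFunction<Integer, String>() {
            @Override
            public Iterable<String> call(Integer seed) {
                List<String> generated = new ArrayList<String>();
                for (long i = 0; i < elementsPerSeed; i++) {
                    generated.add("element-" + seed + "-" + i);   // stand-in for the real object
                }
                return generated;
            }
        });
    }
}
```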

Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Steve Lewis
will be generated at a time > per executor thread. > > On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis wrote: > >> I have a function which generates a Java object and I want to explore >> failures which only happen when processing large numbers of these object. >> the r

How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Steve Lewis
I have a function which generates a Java object and I want to explore failures which only happen when processing large numbers of these objects. The real code is reading a many-gigabyte file but in the test code I can generate similar objects programmatically. I could create a small list, paralleli

Problems creating and reading a large test file

2014-12-05 Thread Steve Lewis
I am trying to look at problems reading a data file over 4G. In my testing I am trying to create such a file. My plan is to create a fasta file (a simple format used in biology) looking like >1 TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTG >2 G

I am having problems reading files in the 4GB range

2014-12-05 Thread Steve Lewis
I am using a custom hadoop input format which works well on smaller files but fails with a file at about 4GB size - the format is generating about 800 splits and all variables in my code are longs - Any suggestions? Is anyone reading files of this size? Exception in thread "main" org.apache.spark

How can a function get a TaskContext

2014-12-04 Thread Steve Lewis
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/TaskContext.java has a Java implementation of TaskContext with a very useful method /** * Return the currently active TaskContext. This can be called inside of * user functions to access contextual information about run
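Building on the TaskContext accessor quoted here, a sketch of reading the active context from inside a map function (available in recent Spark releases; only valid when running on an executor):

```java
import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

public class TaskContextExample {
    /**
     * Tag each element with the stage and partition that processed it, using the
     * static TaskContext.get() accessor described above.
     */
    public static JavaRDD<String> tagWithTask(JavaRDD<String> input) {
        return input.map(new Function<String, String>() {
            @Override
            public String call(String value) {
                TaskContext ctx = TaskContext.get();   // the currently active task
                return "stage=" + ctx.stageId()
                        + " partition=" + ctx.partitionId()
                        + " value=" + value;
            }
        });
    }
}
```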

Re: Any ideas why a few tasks would stall

2014-12-04 Thread Steve Lewis
n contains and try to achieve a roughly even >> distribution for best performance. In particular, if the RDDs are PairRDDs, >> partitions are assigned based on the hash of the key, so an even >> distribution of values among keys is required for even split of data across >> pa

Failed to read chunk exception

2014-12-04 Thread Steve Lewis
I am running a large job using 4000 partitions - after running for four hours on a 16 node cluster it fails with the following message. The errors are in Spark code and seem to point to unreliability at the level of the disk - has anyone seen this, and does anyone know what is going on and how to fix it? Exception in

How can a function running on a slave access the Executor

2014-12-03 Thread Steve Lewis
I have been working on balancing work across a number of partitions and find it would be useful to access information about the current execution environment much of which (like Executor ID) are available if there was a way to get the current executor or the Hadoop TaskAttempt context - does any o

Re: Any ideas why a few tasks would stall

2014-12-02 Thread Steve Lewis
ou sample the executor several times in a short time period, you can > identify 'hot spots' or expensive sections in the user code. > > On Tue, Dec 2, 2014 at 3:03 PM, Steve Lewis wrote: > >> I am working on a problem which will eventually involve many millions of >>

Any ideas why a few tasks would stall

2014-12-02 Thread Steve Lewis
I am working on a problem which will eventually involve many millions of function calls. I have a small sample with several thousand calls working, but when I try to scale up the amount of data things stall. I use 120 partitions and 116 finish in very little time. The remaining 4 seem to do all the

How can a function access Executor ID, Function ID and other parameters

2014-11-30 Thread Steve Lewis
I am running on a 15 node cluster and am trying to set partitioning to balance the work across all nodes. I am using an Accumulator to track work by Mac Address but would prefer to use data known to the Spark environment - Executor ID, and Function ID show up in the Spark UI and Task ID and Attem

How can a function access Executor ID, Function ID and other parameters known to the Spark Environment

2014-11-26 Thread Steve Lewis
I am running on a 15 node cluster and am trying to set partitioning to balance the work across all nodes. I am using an Accumulator to track work by Mac Address but would prefer to use data known to the Spark environment - Executor ID, and Function ID show up in the Spark UI and Task ID and Attem

Re: Why is this operation so expensive

2014-11-25 Thread Steve Lewis
e not in the first half of the > PairRDD. An alternative is to make the .equals() and .hashcode() of > KeyType delegate to the .getId() method you use in the anonymous function. > > Cheers, > Andrew > > On Tue, Nov 25, 2014 at 10:06 AM, Steve Lewis > wrote: > >> I have an

Why is this operation so expensive

2014-11-25 Thread Steve Lewis
I have an JavaPairRDD> originalPairs. There are on the order of 100 million elements I call a function to rearrange the tuples JavaPairRDD> newPairs = originalPairs.values().mapToPair(new PairFunction, String, Tuple2> { @Override public Tuple2> doCall(final Tuple2 t) {

How do I get the executor ID from running Java code

2014-11-17 Thread Steve Lewis
The spark UI lists a number of Executor IDS on the cluster. I would like to access both executor ID and Task/Attempt IDs from the code inside a function running on a slave machine. Currently my motivation is to examine parallelism and locality but in Hadoop this aids in allowing code to write non

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Steve Lewis
a stage). If you want to force it to have more > partitions, you can call RDD.repartition(numPartitions). Note that this > will introduce a shuffle you wouldn't otherwise have. > > Also make sure your job is allocated more than one core in your cluster > (you can see this on the

How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Steve Lewis
I have instrumented word count to track how many machines the code runs on. I use an accumulator to maintain a Set of MacAddresses. I find that everything is done on a single machine. This is probably optimal for word count but not for the larger problems I am working on. How do I force processing to
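As the reply above notes, the task count follows the partition count, so repartitioning the input (at the cost of a shuffle) is the usual fix. A trivial sketch; the partition count is illustrative and should reflect the cores actually allocated to the job:

```java
import org.apache.spark.api.java.JavaRDD;

public class ForceParallelism {
    /**
     * The number of tasks in a stage equals the number of partitions of the RDD,
     * so spreading a small input across more partitions engages more executors.
     */
    public static JavaRDD<String> spreadOut(JavaRDD<String> lines) {
        return lines.repartition(64);   // introduces a shuffle
    }
}
```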

Re: How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
st/api/scala/index.html#org.apache.spark.AccumulatorParam > > JavaSparkContext has helper methods for int and double but not long. You > can just make your own little implementation of AccumulatorParam > right? ... which would be nice to add to JavaSparkContext. > > On Wed, Nov 12, 2014 a

How (in Java) do I create an Accumulator of type Long

2014-11-12 Thread Steve Lewis
JavaSparkContext currentContext = ...; Accumulator accumulator = currentContext.accumulator(0, "MyAccumulator"); will create an Accumulator of Integers. For many large Data problems Integer is too small and Long is a better type. I see a call like the following AccumulatorPar
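A sketch of the AccumulatorParam suggested in the reply above, written against the old Spark 1.x accumulator API (newer releases ship a built-in long accumulator):

```java
import org.apache.spark.Accumulator;
import org.apache.spark.AccumulatorParam;
import org.apache.spark.api.java.JavaSparkContext;

public class LongAccumulatorExample {

    /** A minimal AccumulatorParam for Long values. */
    public static class LongAccumulatorParam implements AccumulatorParam<Long> {
        @Override
        public Long addInPlace(Long r1, Long r2) { return r1 + r2; }
        @Override
        public Long addAccumulator(Long r, Long t) { return r + t; }
        @Override
        public Long zero(Long initialValue) { return 0L; }
    }

    /** Create a Long accumulator the same way the int version is created above. */
    public static Accumulator<Long> createLongAccumulator(JavaSparkContext sc) {
        return sc.accumulator(0L, new LongAccumulatorParam());
    }
}
```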

How can my java code executing on a slave find the task id?

2014-11-12 Thread Steve Lewis
I am trying to determine how effective partitioning is at parallelizing my tasks. So far I suspect it that all work is done in one task. My plan is to create a number of accumulators - one for each task and have functions increment the accumulator for the appropriate task (or slave) the values cou

Is there a way to clone a JavaRDD without persisting it

2014-11-11 Thread Steve Lewis
In my problem I have a number of intermediate JavaRDDs and would like to be able to look at their sizes without destroying the RDD for sibsequent processing. persist will do this but these are big and perisist seems expensive and I am unsure of which StorageLevel is needed, Is there a way to clone

How do I kill av job submitted with spark-submit

2014-11-02 Thread Steve Lewis
I see the job in the web interface but don't know how to kill it

Re: A Spark Design Problem

2014-11-01 Thread Steve Lewis
vaPairRDD > > If you partition both RDDs by the bin id, I think you should be able to > get what you want. > > Best Regards, > Sonal > Nube Technologies <http://www.nubetech.co> > > <http://in.linkedin.com/in/sonalgoyal> > > >> >> On Fr

A Spark Design Problem

2014-10-31 Thread Steve Lewis
The original problem is in biology but the following captures the CS issues, Assume I have a large number of locks and a large number of keys. There is a scoring function between keys and locks and a key that fits a lock will have a high score. There may be many keys fitting one lock and a key m

Questions about serialization and SparkConf

2014-10-29 Thread Steve Lewis
Assume in my executor I say SparkConf sparkConf = new SparkConf(); sparkConf.set("spark.kryo.registrator", "com.lordjoe.distributed.hydra.HydraKryoSerializer"); sparkConf.set("mysparc.data", "Some user Data"); sparkConf.setAppName("Some App"); Now 1) Are there defa

Re: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994

2014-10-28 Thread Steve Lewis
objects you're trying to serialize and see if > those work. > > > -Original Message- > *From: *Steve Lewis [lordjoe2...@gmail.com] > *Sent: *Tuesday, October 28, 2014 10:46 PM Eastern Standard Time > *To: *user@spark.apache.org > *Subject: *com.esotericsoftware.k

com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994

2014-10-28 Thread Steve Lewis
A cluster I am running on keeps getting KryoException. Unlike the Java serializer, the Kryo exception gives no clue as to what class is giving the error. The application runs properly locally but not on the cluster, and I have my own custom KryoRegistrator and register several dozen classes - essentially
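For context, a minimal sketch of a custom KryoRegistrator of the kind described here; the registered classes are placeholders, and the real class would be referenced by its fully qualified name via spark.kryo.registrator (as in the "Questions about serialization and SparkConf" entry above):

```java
import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.serializer.KryoRegistrator;

/**
 * A minimal KryoRegistrator.  Every class that travels through a shuffle or
 * broadcast should be registered consistently on the driver and all executors;
 * the classes listed here are placeholders for the application's own types.
 */
public class MyKryoRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        kryo.register(java.util.ArrayList.class);
        kryo.register(scala.Tuple2.class);
        // kryo.register(SomeDomainClass.class);   // application classes go here
    }
}
```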

Re: How do you write a JavaRDD into a single file

2014-10-21 Thread Steve Lewis
Collect will store the entire output in a List in memory. This solution is acceptable for "Little Data" problems although if the entire problem fits in the memory of a single machine there is less motivation to use Spark. Most problems which benefit from Spark are large enough that even the data a

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Steve Lewis
; RDD and JavaRDD already expose a method to iterate over the data, > called toLocalIterator. It does not require that the RDD fit entirely > in memory. > > On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis > wrote: > > At the end of a set of computation I have a JavaRDD . I want

How do you write a JavaRDD into a single file

2014-10-20 Thread Steve Lewis
At the end of a set of computation I have a JavaRDD . I want a single file where each string is printed in order. The data is small enough that it is acceptable to handle the printout on a single processor. It may be large enough that using collect to generate a list might be unacceptable. the sa

How to I get at a SparkContext or better a JavaSparkContext from the middle of a function

2014-10-14 Thread Steve Lewis
I am running a couple of functions on an RDD which require access to data on the file system known to the context. If I create a class with a context as a member variable I get a serialization error. So I am running my function on some slave and I want to read in data from a Path defined by a stri

Re: Broadcast Torrent fail - then the job dies

2014-10-08 Thread Steve Lewis
t by setting spark.broadcast.factory > to org.apache.spark.broadcast.HttpBroadcastFactory in spark conf. > > Thanks, > Liquan > > On Wed, Oct 8, 2014 at 11:21 AM, Steve Lewis > wrote: > >> I am running on Windows 8 using Spark 1.1.0 in local mode with Hadoop 2.2 >> - I repeatedly

Broadcast Torrent fail - then the job dies

2014-10-08 Thread Steve Lewis
I am running on Windows 8 using Spark 1.1.0 in local mode with Hadoop 2.2 - I repeatedly see the following in my logs. I believe this happens in combineByKey 14/10/08 09:36:30 INFO executor.Executor: Running task 3.0 in stage 0.0 (TID 3) 14/10/08 09:36:30 INFO broadcast.TorrentBroadcast: Started
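The workaround quoted in the reply above, expressed as configuration code (spark.broadcast.factory applies to the 1.x releases discussed in this thread):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class BroadcastWorkaround {
    public static JavaSparkContext createContext() {
        // Fall back from TorrentBroadcast to HttpBroadcast, as suggested in the reply above.
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("BroadcastWorkaround")
                .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory");
        return new JavaSparkContext(conf);
    }
}
```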

anyone else seeing something like https://issues.apache.org/jira/browse/SPARK-3637

2014-10-07 Thread Steve Lewis
java.lang.NullPointerException at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurren

Stupid Spark question

2014-10-07 Thread Steve Lewis
I am porting a Hadoop job to Spark - One issue is that the workers need to read files from hdfs reading a different file based on the key or in some cases reading an object that is expensive to serialize. This is easy if the worker has access to the JavaSparkContext (I am working in Java) but thi

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-06 Thread Steve Lewis
Try a custom Hadoop InputFormat - I can give you some samples. While I have not tried this, an input split has only a length (which could be ignored if the format treats the input as non-splittable) and a String for a location. If the location is a URL into Wikipedia the whole thing should work. Hadoop InputFormat

What can be done if a FlatMapFunctions generated more data that can be held in memory

2014-10-01 Thread Steve Lewis
A number of the problems I want to work with generate datasets which are too large to hold in memory. This becomes an issue when building a FlatMapFunction and also when the data used in combineByKey cannot be held in memory. The following is a simple, if a little silly, example of a FlatMapF
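One way to keep a FlatMapFunction from materializing everything at once is to return a lazy Iterable whose iterator produces elements on demand. A sketch against the Spark 1.x Java API (where call() returns an Iterable; later releases return an Iterator, which makes this even more direct); the expansion logic is illustrative:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

import java.util.Iterator;

public class LazyFlatMapExample {
    /**
     * Instead of building a huge List inside the FlatMapFunction, return an Iterable
     * whose iterator generates elements on demand, so only one generated element is
     * in memory at a time.
     */
    public static JavaRDD<String> expand(JavaRDD<String> lines, final int copiesPerLine) {
        return lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(final String line) {
                return new Iterable<String>() {
                    @Override
                    public Iterator<String> iterator() {
                        return new Iterator<String>() {
                            private int emitted = 0;
                            @Override
                            public boolean hasNext() { return emitted < copiesPerLine; }
                            @Override
                            public String next() { return line + "#" + emitted++; }
                            @Override
                            public void remove() { throw new UnsupportedOperationException(); }
                        };
                    }
                };
            }
        });
    }
}
```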

A sample for generating big data - and some design questions

2014-09-30 Thread Steve Lewis
The sample below is essentially word count modified to be big data by turning lines into groups of upper case letters and then generating all case variants - it is modeled after some real problems in biology. The issue is that I know how to do this in Hadoop, but in Spark the use of a List in my flatmap

Has anyone seen java.nio.ByteBuffer.wrap(ByteBuffer.java:392)

2014-09-24 Thread Steve Lewis
java.lang.NullPointerException at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurren

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Hmmm - I have only tested in local mode but I got an java.io.NotSerializableException: org.apache.hadoop.io.Text at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.w

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Do your custom Writable classes implement Serializable - I think that is the only real issue - my code uses vanilla Text

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
the SparkContext variable and path denote the path to file >>> of CustomizedInputFormat, we use >>> >>> val rdd;RDD[(K,V)] = sc.hadoopFile[K,V,CustomizedInputFormat](path, >>> ClassOf[CustomizedInputFormat], ClassOf[K], ClassOf[V]) >>> >>> to create an RDD of (K,V)

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Steve Lewis
[V]) > > to create an RDD of (K,V) with CustomizedInputFormat. > > Hope this helps, > Liquan > > On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis > wrote: > >> When I experimented with using an InputFormat I had used in Hadoop for a >> long time in Hadoop I found &g

Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Steve Lewis
When I experimented with using an InputFormat I had used in Hadoop for a long time, I found 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat) 2) initialize needs to be called in the constructor 3) The
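For comparison with the Scala snippet quoted in the replies, a hedged Java sketch using newAPIHadoopFile with a new-API (org.apache.hadoop.mapreduce) InputFormat; TextInputFormat stands in for a custom format, and the Writable-to-String mapping sidesteps the NotSerializableException mentioned above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class HadoopInputFormatExample {
    /**
     * Read a file with a "new API" InputFormat.  Substitute your own key, value and
     * format classes.  Writable keys and values are not Serializable, so convert them
     * to plain Java types before shuffling or collecting.
     */
    public static JavaPairRDD<String, String> read(JavaSparkContext sc, String path) {
        JavaPairRDD<LongWritable, Text> raw = sc.newAPIHadoopFile(
                path, TextInputFormat.class, LongWritable.class, Text.class, new Configuration());
        return raw.mapToPair(new PairFunction<Tuple2<LongWritable, Text>, String, String>() {
            @Override
            public Tuple2<String, String> call(Tuple2<LongWritable, Text> t) {
                return new Tuple2<String, String>(t._1().toString(), t._2().toString());
            }
        });
    }
}
```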

Re: Is there any way (in Java) to make a JavaRDD from an iterable

2014-09-22 Thread Steve Lewis
Sep 22, 2014 at 9:22 AM, Steve Lewis > wrote: > >>The only way I find is to turn it into a list - in effect holding >> everything in memory (see code below). Surely Spark has a better way. >> >> Also what about unterminated iterables like a Fibonacci series - (u

Is there any way (in Java) to make a JavaRDD from an iterable

2014-09-22 Thread Steve Lewis
The only way I find is to turn it into a list - in effect holding everything in memory (see code below). Surely Spark has a better way. Also what about unterminated iterables like a Fibonacci series - (useful only if limited in some other way ) /** * make an RDD from an iterable * @

Re: Reproducing the function of a Hadoop Reducer

2014-09-20 Thread Steve Lewis
mbineByKey[Int](zero, reduce, >> merge) >> reducedUnsorted.sortByKey() >> >> On Fri, Sep 19, 2014 at 1:37 PM, Steve Lewis >> wrote: >> >>> I am struggling to reproduce the functionality of a Hadoop reducer on >>> Spark (in Java) >>> &

Reproducing the function of a Hadoop Reducer

2014-09-19 Thread Steve Lewis
I am struggling to reproduce the functionality of a Hadoop reducer on Spark (in Java). In Hadoop I have a function public void doReduce(K key, Iterator values); there is also a consumer (context.write) which can be seen as consume(key, value). In my code 1) knowing the key is important to
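A simple (memory-hungry) way to approximate doReduce(key, values): group the values for each key and walk the resulting Iterable, much as a Hadoop reducer would. The combineByKey/sortByKey approach quoted in the reply above scales better when a single key has many values; the String types and the concatenation "reduce" here are illustrative:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

public class ReducerStyleExample {
    /**
     * Group the values for each key and process them together, mimicking
     * doReduce(key, values).  groupByKey keeps all values for one key in memory
     * on one executor, which is the main caveat.
     */
    public static JavaPairRDD<String, String> reduceLikeHadoop(JavaPairRDD<String, String> input) {
        return input.groupByKey().mapValues(new Function<Iterable<String>, String>() {
            @Override
            public String call(Iterable<String> values) {
                StringBuilder sb = new StringBuilder();
                for (String v : values) {          // iterate the values, as in doReduce
                    sb.append(v).append(",");
                }
                return sb.toString();
            }
        });
    }
}
```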

storage.DiskBlockManager: Exception while deleting local spark dir

2014-09-17 Thread Steve Lewis
14/09/17 14:21:21 ERROR storage.DiskBlockManager: Exception while deleting local spark dir: C:\Users\Steve\AppData\Local\Temp\spark-local-20140917142115-6a81 java.io.IOException: Failed to delete: C:\Users\Steve\AppData\Local\Temp\spark-local-20140917142115-6a81\03\shuffle_0_0_10 I run a job with

Looking for a good sample of Using Spark to do things Hadoop can do

2014-09-12 Thread Steve Lewis
Assume I have a large book with many Chapters and many lines of text. Assume I have a function that tells me the similarity of two lines of text. The objective is to find the most similar line in the same chapter within 200 lines of the line found. The real problem involves biology and is beyond t

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Steve Lewis
In modern projects there are a bazillion dependencies - when I use Hadoop I just put them in a lib directory in the jar - If I have a project that depends on 50 jars I need a way to deliver them to Spark - maybe wordcount can be written without dependencies but real projects need to deliver depende

Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-08 Thread Steve Lewis
In a Hadoop jar there is a directory called lib and all non-provided third party jars go there and are included in the class path of the code. Do jars for Spark have the same structure - another way to ask the question is if I have code to execute Spark and a jar build for Hadoop can I simply use

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Steve Lewis
ey-value pairs without worrying about this. > > Matei > > On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) > wrote: > > When programming in Hadoop it is possible to guarantee > 1) All keys sent to a specific partition will be handled by the same > mac

I am looking for a Java sample of a Partitioner

2014-09-02 Thread Steve Lewis
Assume say JavaWord count I call the equivalent of a Mapper JavaPairRDD ones = words.mapToPair(,,, Now right here I want to guarantee that each word starting with a particular letter is processed in a specific partition - (Don't tell me this is a dumb idea - I know that but in a Hadoop code a cus
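A sketch of the kind of Partitioner being asked for: route every word starting with the same letter to the same partition. The letter-to-partition mapping is illustrative, and keys are assumed to be non-empty Strings:

```java
import org.apache.spark.Partitioner;

/**
 * Sends every key starting with the same letter to the same partition.
 */
public class FirstLetterPartitioner extends Partitioner {
    private final int numPartitions;

    public FirstLetterPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public int numPartitions() {
        return numPartitions;
    }

    @Override
    public int getPartition(Object key) {
        char first = Character.toLowerCase(((String) key).charAt(0));
        // map 'a'..'z' (and anything else) onto the available partitions
        return Math.abs(first - 'a') % numPartitions;
    }
}
```

It would then be handed to a shuffle, e.g. ones.partitionBy(new FirstLetterPartitioner(26)).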

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Steve Lewis
could have data on disk across keys), and this is still true for > groupByKey, cogroup and join. Those restrictions will hopefully go away in > a later release. But sortByKey + mapPartitions lets you just iterate > through the key-value pairs without worrying about this. > > Mat

Mapping Hadoop Reduce to Spark

2014-08-30 Thread Steve Lewis
When programming in Hadoop it is possible to guarantee 1) All keys sent to a specific partition will be handled by the same machine (thread) 2) All keys received by a specific machine (thread) will be received in sorted order 3) These conditions will hold even if the values associated with a specif

What happens if I have a function like a PairFunction but which might return multiple values

2014-08-28 Thread Steve Lewis
In many cases when I work with Map Reduce my mapper or my reducer might take a single value and map it to multiple keys - The reducer might also take a single key and emit multiple values I don't think that functions like flatMap and reduceByKey will work or are there tricks I am not aware of
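The usual Spark analogue of a mapper that emits several pairs per input is flatMapToPair with a PairFlatMapFunction. A sketch against the 1.x Java API (where call() returns an Iterable of tuples; later releases return an Iterator); the word/line pairs are just an illustration:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFlatMapFunction;

import scala.Tuple2;

import java.util.ArrayList;
import java.util.List;

public class MultiEmitExample {
    /** One input element, zero or more emitted key/value pairs. */
    public static JavaPairRDD<String, String> emitMany(JavaRDD<String> lines) {
        return lines.flatMapToPair(new PairFlatMapFunction<String, String, String>() {
            @Override
            public Iterable<Tuple2<String, String>> call(String line) {
                List<Tuple2<String, String>> out = new ArrayList<Tuple2<String, String>>();
                for (String word : line.split("\\s+")) {
                    out.add(new Tuple2<String, String>(word, line));   // emit one pair per word
                }
                return out;
            }
        });
    }
}
```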

Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Steve Lewis
e an action, like > count(), on the words RDD. > > On Mon, Aug 25, 2014 at 6:32 PM, Steve Lewis > wrote: > > I was able to get JavaWordCount running with a local instance under > > IntelliJ. > > > > In order to do so I needed to use maven to package my code and >

How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-25 Thread Steve Lewis
I was able to get JavaWordCount running with a local instance under IntelliJ. In order to do so I needed to use maven to package my code and call String[] jars = { "/SparkExamples/target/word-count-examples_2.10-1.0.0.jar" }; sparkConf.setJars(jars); After that the sample ran properly and

How to start master and workers on Windows

2014-08-22 Thread Steve Lewis
Thank you for your advice - I really need this help and promise to post a blog entry once it works. I ran >bin\spark-class.cmd org.apache.spark.deploy.master.Master and this ran successfully, and I got a web page at http://localhost:8080 which says Spark Master at spark://192.168.1.4:7077. My machi

I am struggling to run Spark Examples on my local machine

2014-08-21 Thread Steve Lewis
I download the binaries for spark-1.0.2-hadoop1 and unpack it on my Windows 8 box. I can execute spark-shell.cmd and get a command window which does the proper things. I open a browser to http://localhost:4040 and a window comes up describing the spark master. Then using IntelliJ I create a project wi

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-20 Thread Steve Lewis
> http://ml-nlp-ir.blogspot.com/2014/04/building-spark-on-windows-and-cloudera.html > on how to build from source and run examples in spark shell. > > > Regards, > Manu > > > On Sat, Aug 16, 2014 at 12:14 PM, Steve Lewis > wrote: > >> I want to look at porting

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-18 Thread Steve Lewis
p-ir.blogspot.com/2014/04/building-spark-on-windows-and-cloudera.html > on how to build from source and run examples in spark shell. > > > Regards, > Manu > > > On Sat, Aug 16, 2014 at 12:14 PM, Steve Lewis > wrote: > >> I want to look at porting a Hadoop problem to Spark

Does anyone have a stand alone spark instance running on Windows

2014-08-16 Thread Steve Lewis
I want to look at porting a Hadoop problem to Spark - eventually I want to run on a Hadoop 2.0 cluster but while I am learning and porting I want to run small problems in my windows box. I installed scala and sbt. I download Spark and in the spark directory can say mvn -Phadoop-0.23 -Dhadoop.versio