ws IOException{
    LineNumberReader rdr = new LineNumberReader(new FileReader(f));
    StringBuilder sb = new StringBuilder();
    String line = rdr.readLine();
    while (line != null) {
        sb.append(line);
        sb.append("\n");
        line = rdr.readLine();
    }
    rdr.close();
are within a map step.
>
> Generally you should not call external applications from Spark.
>
> > On 11.11.2018 at 23:13 Steve Lewis wrote:
> >
> > I have a problem where a critical step needs to be performed by a third
> party c++ application. I can send or install
I have a problem where a critical step needs to be performed by a third
party c++ application. I can send or install this program on the worker
nodes. I can construct a function holding all the data this program needs
to process. The problem is that the program is designed to read and write
from
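If shelling out is unavoidable, one workable pattern is to launch the process inside mapPartitions, writing the program's input to a worker-local temp file and reading its output back. A minimal sketch only: the binary path /opt/tools/mytool, the one-file-per-record layout, and the JavaRDD<String> called inputData are all hypothetical placeholders, and the Spark 2.x Java API is assumed (flatMap-style functions return an Iterator).
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> results = inputData.mapPartitions(records -> {
    List<String> out = new ArrayList<>();
    while (records.hasNext()) {
        String record = records.next();
        // write the input the C++ program expects to a worker-local temp file
        File in = File.createTempFile("work", ".in");
        File result = File.createTempFile("work", ".out");
        Files.write(in.toPath(), record.getBytes(StandardCharsets.UTF_8));
        // run the pre-installed binary and wait for it to finish
        Process p = new ProcessBuilder("/opt/tools/mytool",
                in.getAbsolutePath(), result.getAbsolutePath())
                .redirectErrorStream(true)
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("external tool failed");
        }
        // read the program's output back and emit it as the result of this step
        out.add(new String(Files.readAllBytes(result.toPath()), StandardCharsets.UTF_8));
        in.delete();
        result.delete();
    }
    return out.iterator();
});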
We are trying to run a job that has previously run on Spark 1.3 on a
different cluster. The job was converted to Spark 2.3 and this is a
new cluster.
The job dies after completing about a half dozen stages with
java.io.IOException: No space left on device
It appears that the nodes are us
Ok I am stymied. I have tried everything I can think of to get spark to use
my own version of
log4j.properties
In the launcher code - I launch a local instance from a Java application
I pass -Dlog4j.configuration=conf/log4j.properties
where conf/log4j.properties is relative to user.dir - no luck
Spark alwa
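Two things usually sort this out (assuming the stock log4j 1.x that Spark bundles): -Dlog4j.configuration wants a URL rather than a relative path, e.g. -Dlog4j.configuration=file:/full/path/to/conf/log4j.properties, and when launching a local context from your own Java code you can sidestep the property entirely by configuring log4j before the SparkContext is created. A sketch, with the relative path shown only as an example resolved against user.dir:
import java.io.File;
import org.apache.log4j.PropertyConfigurator;

// point log4j at the file explicitly before any Spark class logs anything
PropertyConfigurator.configure(new File("conf/log4j.properties").getAbsolutePath());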
I asked a similar question a day or so ago but this is a much more concrete
example showing the difficulty I am running into
I am trying to use DataSets. I have an object which I want to encode with
its fields as columns. The object is a well behaved Java Bean.
However one field is an object (or
I have a relatively complex Java object that I would like to use in a
dataset
if I say
Encoder<MyType> evidence = Encoders.kryo(MyType.class);
JavaRDD<MyType> rddMyType = generateRDD(); // some code
Dataset<MyType> datasetMyType = sqlCtx.createDataset(rddMyType.rdd(),
evidence);
I get one column - the whole object
Type.printSchema()
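For what it is worth, the single binary column is what Encoders.kryo is documented to produce. If the goal is one column per field, a bean encoder is the usual alternative - a sketch using the same hypothetical MyType, rddMyType and sqlCtx as above, with the caveat that every field (including nested ones) must itself be a bean or an otherwise supported type:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

Encoder<MyType> beanEvidence = Encoders.bean(MyType.class);
Dataset<MyType> datasetMyType = sqlCtx.createDataset(rddMyType.rdd(), beanEvidence);
// one column per getter/setter pair instead of a single binary column
datasetMyType.printSchema();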
>
> On Mon, Jan 25, 2016 at 1:16 PM, Steve Lewis
> wrote:
>
>> assume I have the following code
>>
>> SparkConf sparkConf = new SparkConf();
>>
>> JavaSparkContext sqlCtx= new JavaSparkContext(sparkConf);
>>
>> JavaRDD rddMyTy
assume I have the following code
SparkConf sparkConf = new SparkConf();
JavaSparkContext sqlCtx = new JavaSparkContext(sparkConf);
JavaRDD<MyType> rddMyType = generateRDD(); // some code
Encoder<MyType> evidence = Encoders.kryo(MyType.class);
Dataset<MyType> datasetMyType = sqlCtx.createDataset(rddMyType.rdd(), evidence
ambda functions if you want though:
>
> ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region)) { (key, iter1, iter2)
> =>
> ...
> }
>
>
> On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis
> wrote:
>
>> We have been working a large search problem which we have
We have been working a large search problem which we have been solving in
the following ways.
We have two sets of objects, say children and schools. The object is to
find the closest school to each child. There is a distance measure but it
is relatively expensive and would be very costly to apply
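A rough sketch of the bucketing idea behind the cogroup suggestion quoted above: assign each child and each school a coarse region key so the expensive distance function only runs within a region. Child, School, regionOf() and distance() are hypothetical stand-ins, the Spark 2.x Java API is assumed (in 1.x the flatMap function returns an Iterable rather than an Iterator), and children near a region boundary would also need to be checked against neighbouring regions.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

JavaPairRDD<String, Child> childrenByRegion =
        children.mapToPair(c -> new Tuple2<>(regionOf(c), c));
JavaPairRDD<String, School> schoolsByRegion =
        schools.mapToPair(s -> new Tuple2<>(regionOf(s), s));

// cogroup by region, then run the expensive distance only within a region
JavaRDD<Tuple2<Child, School>> closest =
        childrenByRegion.cogroup(schoolsByRegion).flatMap(entry -> {
            List<Tuple2<Child, School>> out = new ArrayList<>();
            for (Child c : entry._2()._1()) {
                School best = null;
                double bestDistance = Double.MAX_VALUE;
                for (School s : entry._2()._2()) {
                    double d = distance(c, s); // the expensive call
                    if (d < bestDistance) { bestDistance = d; best = s; }
                }
                if (best != null) out.add(new Tuple2<>(c, best));
            }
            return out.iterator();
        });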
I am running on a spark 1.5.1 cluster managed by Mesos - I have an
application that handles a chemistry problem whose size can be increased by
increasing the number of atoms - increasing the number of Spark stages. I
do a repartition at each stage - Stage 9 is the last stage. At each stage
the size and
I have been using my own code to build the jar file I use for spark
submit. In 1.4 I could simply add all class and resource files I find in
the class path to the jar and add all jars in the classpath into a
directory called lib in the jar file.
In 1.5 I see that resources and classes in jars in t
I was in a discussion with someone who works for a cloud provider which
offers Spark/Hadoop services. We got into a discussion of performance and
the bewildering array of machine types and the problem of selecting a
cluster with 20 "Large" instances VS 10 "Jumbo" instances or the trade offs
betwee
ake a look in the algorithm once more. "Tasks typically
> preform both operations several hundred thousand times." why it can not be
> done distributed way?
>
> On Thu, May 7, 2015 at 3:16 PM, Steve Lewis wrote:
>
>> I am performing a job where I perform a number of st
I am performing a job where I perform a number of steps in succession.
One step is a map on a JavaRDD which generates objects taking up
significant memory.
This is then followed by a join and an aggregateByKey.
The problem is that the system is getting OutOfMemoryErrors -
Most tasks work but
t from Samsung Mobile
>
>
> Original message
> From: Olivier Girardot
> Date:2015/04/18 22:04 (GMT+00:00)
> To: Steve Lewis ,user@spark.apache.org
> Subject: Re: Can a map function return null
>
> You can return an RDD with null values inside, and afterward
I find a number of cases where I have a JavaRDD and I wish to transform
the data and, depending on a test, return zero or one items (don't suggest a
filter - the real case is more complex). So I currently do something like
the following - perform a flatMap returning a list with 0 or 1 entries
depending on
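A minimal sketch of that pattern (input, Result, passesTest and transform are hypothetical names; Spark 2.x Java API, where the flatMap function returns an Iterator - in 1.x it returns an Iterable):
import java.util.Collections;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<Result> results = input.flatMap(item -> {
    if (!passesTest(item)) {
        return Collections.<Result>emptyIterator();                  // emit nothing
    }
    return Collections.singletonList(transform(item)).iterator();    // emit exactly one
});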
-- Forwarded message --
From: Steve Lewis
Date: Wed, Mar 11, 2015 at 9:13 AM
Subject: Re: Numbering RDD members Sequentially
To: "Daniel, Ronald (ELS-SDG)"
perfect - exactly what I was looking for, not quite sure why it is
called zipWithIndex
since zipping is not i
I have a Hadoop InputFormat which reads records and produces a
JavaPairRDD locatedData where
_1() is a formatted version of the file location - like
"12690", "24386", "27523" ...
_2() is the data to be processed
For historical reasons I want to convert _1() into an integer representing
the
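The zipWithIndex mentioned above turned out to be the answer: it pairs every element with a sequential Long, ordered by partition and then by position within the partition. A sketch, with byte[] standing in for whatever the real value type is:
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<Long, byte[]> numberSequentially(JavaPairRDD<String, byte[]> locatedData) {
    // zipWithIndex pairs each (location, data) tuple with its sequential index
    return locatedData.zipWithIndex()
            .mapToPair(t -> new Tuple2<>(t._2(), t._1()._2()));
}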
I have a job involving two sets of data indexed with the same type of key.
I have an expensive operation that I want to run on pairs sharing the same
key. The following code works BUT all of the work is being done on 3 of 16
processors -
How do I go about diagnosing and fixing the behavior? A
I have an application where a function needs access to the results of a
select from a parquet database. Creating a JavaSQLContext and from it
a JavaSchemaRDD
as shown below works but the parallelism is not needed - a simple JDBC call
would work -
Are there alternative non-parallel ways to achieve
I am trying to use PyCharm for Spark development on a windows 8.1 Machine -
I have installed py4j, added Spark python as a content root and have Cygwin
in my path
Also Using intelliJ works for Spark Java code.
When I run a simple word count below I get errors in launching a Spark
local cluster - an
I notice new methods such as JavaSparkContext makeRDD (with few useful
examples) - It takes a Seq but while there are ways to turn a list into a
Seq I see nothing that uses an Iterable
I am aware of the ADAM project in Berkeley and I am working on Proteomic
searches -
anyone else working in this space
rs if it has to.
>
> On Fri, Dec 12, 2014 at 2:39 PM, Steve Lewis
> wrote:
>>
>> The objective is to let the Spark application generate a file in a format
>> which can be consumed by other programs - as I said I am willing to give up
>> parallelism at this stage
e final output file.
>
>
> - SF
>
> On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis
> wrote:
>>
>>
>> I have an RDD which is potentially too large to store in memory with
>> collect. I want a single task to write the contents as a file to hdfs. Time
>> is not a
I have an RDD which is potentially too large to store in memory with
collect. I want a single task to write the contents as a file to hdfs. Time
is not a large issue but memory is.
I use the following to convert my RDD (scans) to a local Iterator. This
works but hasNext shows up as a separate task
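For reference, a minimal sketch of the toLocalIterator approach (the output path and Configuration are whatever the job already has; the iterator pulls one partition at a time to the driver, which is why memory stays bounded):
import java.io.PrintWriter;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;

void writeAsSingleFile(JavaRDD<String> scans, String hdfsPath, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    try (PrintWriter out = new PrintWriter(fs.create(new Path(hdfsPath)))) {
        Iterator<String> itr = scans.toLocalIterator();
        while (itr.hasNext()) {
            out.println(itr.next());   // one driver-side writer, RDD order preserved
        }
    }
}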
assume I don't care about values which may be created in a later map - in
scala I can say
val rdd = sc.parallelize(1 to 10, numSlices = 1000)
but in Java JavaSparkContext can only parallelize a List - limited to
Integer.MAX_VALUE elements and required to exist in memory - the best I can
do
will be generated at a time
> per executor thread.
>
> On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis wrote:
>
>> I have a function which generates a Java object and I want to explore
>> failures which only happen when processing large numbers of these object.
>> the r
I have a function which generates a Java object and I want to explore
failures which only happen when processing large numbers of these objects.
the real code is reading a many gigabyte file but in the test code I can
generate similar objects programmatically. I could create a small list,
paralleli
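A sketch of that generation idea, in line with the reply quoted above: parallelize only small seed values and let each task build its share of the objects, so the full set never exists on the driver. sc, MyObject, numSeeds, objectsPerSeed and makeObject are hypothetical; Spark 2.x Java API (in 1.x the flatMap function returns an Iterable).
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

List<Integer> seeds = new ArrayList<>();
for (int i = 0; i < numSeeds; i++) {
    seeds.add(i);
}
JavaRDD<MyObject> generated = sc.parallelize(seeds, numSeeds)
        .flatMap(seed -> {
            List<MyObject> items = new ArrayList<>();
            for (int j = 0; j < objectsPerSeed; j++) {
                items.add(makeObject(seed, j));   // build the test objects on the worker
            }
            return items.iterator();
        });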
I am trying to look at problems reading a data file over 4G. In my testing
I am trying to create such a file.
My plan is to create a fasta file (a simple format used in biology)
looking like
>1
TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTG
>2
G
I am using a custom hadoop input format which works well on smaller files
but fails with a file at about 4GB size - the format is generating about
800 splits and all variables in my code are longs -
Any suggestions? Is anyone reading files of this size?
Exception in thread "main" org.apache.spark
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/TaskContext.java
has a Java implementation of TaskContext with a very useful method
/**
 * Return the currently active TaskContext. This can be called inside of
 * user functions to access contextual information about run
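A minimal sketch of using it from the Java API (TaskContext.get() has been available since roughly Spark 1.2; rdd is any JavaRDD):
import org.apache.spark.TaskContext;

rdd.foreachPartition(itr -> {
    TaskContext tc = TaskContext.get();   // context of the task running this code
    System.out.println("stage " + tc.stageId()
            + " partition " + tc.partitionId()
            + " attempt " + tc.attemptNumber()
            + " task attempt id " + tc.taskAttemptId());
});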
n contains and try to achieve a roughly even
>> distribution for best performance. In particular, if the RDDs are PairRDDs,
>> partitions are assigned based on the hash of the key, so an even
>> distribution of values among keys is required for even split of data across
>> pa
I am running a large job using 4000 partitions - after running for four
hours on a 16 node cluster it fails with the following message.
The errors are in Spark code and seem to address unreliability at the level of
the disk -
Anyone seen this and know what is going on and how to fix it?
Exception in
I have been working on balancing work across a number of partitions and
find it would be useful to access information about the current execution
environment, much of which (like Executor ID) would be available if there were a
way to get the current executor or the Hadoop TaskAttempt context -
does any o
ou sample the executor several times in a short time period, you can
> identify 'hot spots' or expensive sections in the user code.
>
> On Tue, Dec 2, 2014 at 3:03 PM, Steve Lewis wrote:
>
>> I am working on a problem which will eventually involve many millions of
>>
I am working on a problem which will eventually involve many millions of
function calls. A have a small sample with several thousand calls working
but when I try to scale up the amount of data things stall. I use 120
partitions and 116 finish in very little time. The remaining 4 seem to do
all the
I am running on a 15 node cluster and am trying to set partitioning to
balance the work across all nodes. I am using an Accumulator to track work
by Mac Address but would prefer to use data known to the Spark environment
- Executor ID, and Function ID show up in the Spark UI and Task ID and
Attem
e not in the first half of the
> PairRDD. An alternative is to make the .equals() and .hashcode() of
> KeyType delegate to the .getId() method you use in the anonymous function.
>
> Cheers,
> Andrew
>
> On Tue, Nov 25, 2014 at 10:06 AM, Steve Lewis
> wrote:
>
>> I have an
I have a JavaPairRDD<KeyType, Tuple2<KeyType, ValueType>> originalPairs. There are
on the order of 100 million elements
I call a function to rearrange the tuples
JavaPairRDD<String, Tuple2<KeyType, ValueType>> newPairs =
    originalPairs.values().mapToPair(new PairFunction<Tuple2<KeyType, ValueType>,
        String, Tuple2<KeyType, ValueType>>() {
        @Override
        public Tuple2<String, Tuple2<KeyType, ValueType>> doCall(final
            Tuple2<KeyType, ValueType> t) {
The Spark UI lists a number of Executor IDs on the cluster. I would like
to access both executor ID and Task/Attempt IDs from the code inside a
function running on a slave machine.
Currently my motivation is to examine parallelism and locality but in
Hadoop this aids in allowing code to write non
a stage). If you want to force it to have more
> partitions, you can call RDD.repartition(numPartitions). Note that this
> will introduce a shuffle you wouldn't otherwise have.
>
> Also make sure your job is allocated more than one core in your cluster
> (you can see this on the
I have instrumented word count to track how many machines the code runs
on. I use an accumulator to maintain a Set of MacAddresses. I find that
everything is done on a single machine. This is probably optimal for word
count but not the larger problems I am working on.
How do I force processing to
st/api/scala/index.html#org.apache.spark.AccumulatorParam
>
> JavaSparkContext has helper methods for int and double but not long. You
> can just make your own little implementation of AccumulatorParam
> right? ... which would be nice to add to JavaSparkContext.
>
> On Wed, Nov 12, 2014 a
JavaSparkContext currentContext = ...;
Accumulator<Integer> accumulator = currentContext.accumulator(0,
"MyAccumulator");
will create an Accumulator of Integers. For many large data problems
Integer is too small and Long is a better type.
I see a call like the following
AccumulatorPar
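The "little implementation" suggested above is short. A sketch against the Spark 1.x accumulator API - LongAccumulatorParam is just a name chosen here, not a Spark class:
import org.apache.spark.Accumulator;
import org.apache.spark.AccumulatorParam;

public class LongAccumulatorParam implements AccumulatorParam<Long> {
    @Override
    public Long zero(Long initialValue) { return 0L; }

    @Override
    public Long addInPlace(Long r1, Long r2) { return r1 + r2; }

    @Override
    public Long addAccumulator(Long current, Long added) { return current + added; }
}

// usage:
// Accumulator<Long> total = currentContext.accumulator(0L, new LongAccumulatorParam());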
I am trying to determine how effective partitioning is at parallelizing my
tasks. So far I suspect it that all work is done in one task. My plan is to
create a number of accumulators - one for each task and have functions
increment the accumulator for the appropriate task (or slave) the values
cou
In my problem I have a number of intermediate JavaRDDs and would like to
be able to look at their sizes without destroying the RDD for subsequent
processing. persist will do this but these are big and persist seems
expensive and I am unsure of which StorageLevel is needed. Is there a way
to clone
I see the job in the web interface but don't know how to kill it
vaPairRDD
>
> If you partition both RDDs by the bin id, I think you should be able to
> get what you want.
>
> Best Regards,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>>
>> On Fr
The original problem is in biology but the following captures the CS
issues. Assume I have a large number of locks and a large number of keys.
There is a scoring function between keys and locks and a key that fits a
lock will have a high score. There may be many keys fitting one lock and a
key m
Assume in my executor I say
SparkConf sparkConf = new SparkConf();
sparkConf.set("spark.kryo.registrator",
"com.lordjoe.distributed.hydra.HydraKryoSerializer");
sparkConf.set("mysparc.data", "Some user Data");
sparkConf.setAppName("Some App");
Now
1) Are there defa
objects you're trying to serialize and see if
> those work.
>
>
> -Original Message-
> *From: *Steve Lewis [lordjoe2...@gmail.com]
> *Sent: *Tuesday, October 28, 2014 10:46 PM Eastern Standard Time
> *To: *user@spark.apache.org
> *Subject: *com.esotericsoftware.k
A cluster I am running on keeps getting KryoException. Unlike the Java
serializer the Kryo Exception gives no clue as to what class is giving the
error
The application runs properly locally but not on the cluster, and I have my own
custom KryoRegistrator and register several dozen classes - essentially
Collect will store the entire output in a List in memory. This solution is
acceptable for "Little Data" problems although if the entire problem fits
in the memory of a single machine there is less motivation to use Spark.
Most problems which benefit from Spark are large enough that even the data
a
; RDD and JavaRDD already expose a method to iterate over the data,
> called toLocalIterator. It does not require that the RDD fit entirely
> in memory.
>
> On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis
> wrote:
> > At the end of a set of computation I have a JavaRDD . I want
At the end of a set of computations I have a JavaRDD<String>. I want a
single file where each string is printed in order. The data is small enough
that it is acceptable to handle the printout on a single processor. It may
be large enough that using collect to generate a list might be unacceptable.
the sa
I am running a couple of functions on an RDD which require access to data
on the file system known to the context. If I create a class with a context
as a member variable I get a serialization error.
So I am running my function on some slave and I want to read in data from a
Path defined by a stri
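One common workaround, as a sketch only: capture just the path String (which serializes fine) and build the Hadoop Configuration and FileSystem inside the function on the worker. records and applyLookup are hypothetical, the path is illustrative, and a bare new Configuration() only picks up cluster settings if the Hadoop conf is on the executor classpath. For anything non-trivial, mapPartitions is better so the file is opened once per partition rather than once per record.
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;

String pathString = "hdfs:///data/lookup.bin";      // illustrative path only
JavaRDD<String> processed = records.map(record -> {
    Configuration conf = new Configuration();       // built on the worker, never serialized
    Path p = new Path(pathString);
    FileSystem fs = p.getFileSystem(conf);
    try (InputStream in = fs.open(p)) {
        return applyLookup(record, in);
    }
});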
t by setting spark.broadcast.factory
> to org.apache.spark.broadcast.HttpBroadcastFactory in spark conf.
>
> Thanks,
> Liquan
>
> On Wed, Oct 8, 2014 at 11:21 AM, Steve Lewis
> wrote:
>
>> I am running on Windows 8 using Spark 1.1.0 in local mode with Hadoop 2.2
>> - I repeatedly
I am running on Windows 8 using Spark 1.1.0 in local mode with Hadoop 2.2 -
I repeatedly see
the following in my logs.
I believe this happens in combineByKey
14/10/08 09:36:30 INFO executor.Executor: Running task 3.0 in stage 0.0
(TID 3)
14/10/08 09:36:30 INFO broadcast.TorrentBroadcast: Started
java.lang.NullPointerException
at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at
java.util.concurren
I am porting a Hadoop job to Spark - One issue is that the workers need to
read files from hdfs reading a different file based on the key or in some
cases reading an object that is expensive to serialize.
This is easy if the worker has access to the JavaSparkContext (I am
working in Java) but thi
Try a Hadoop Custom InputFormat - I can give you some samples -
While I have not tried this, an input split has only a length (which could be
ignored if the format treats it as non-splittable) and a String for a location.
If the location is a URL into Wikipedia the whole thing should work.
Hadoop InputFormat
A number of the problems I want to work with generate datasets which are
too large to hold in memory. This becomes an issue when building a
FlatMapFunction and also when the data used in combineByKey cannot be held
in memory.
The following is a simple, if a little silly, example of a
FlatMapF
This sample below is essentially word count modified to be big data by
turning lines into groups of
upper case letters and then generating all case variants - it is modeled
after some real problems in biology
The issue is I know how to do this in Hadoop but in Spark the use of a List
in my flatmap
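One way around materializing the List in a Spark 1.x FlatMapFunction is to hand back an Iterable whose iterator produces the variants lazily. A sketch, with lines as the input JavaRDD<String> and generateVariants(...) as a hypothetical generator that returns an Iterator<String>:
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

JavaRDD<String> variants = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(final String line) {
        // nothing is built up front; each variant is produced on demand
        return new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                return generateVariants(line);
            }
        };
    }
});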
java.lang.NullPointerException
at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at
java.util.concurren
Hmmm - I have only tested in local mode but I got an
java.io.NotSerializableException: org.apache.hadoop.io.Text
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.w
Do your custom Writable classes implement Serializable - I think that is
the only real issue - my code uses vanilla Text
the SparkContext variable and path denote the path to file
>>> of CustomizedInputFormat, we use
>>>
>>> val rdd: RDD[(K,V)] = sc.hadoopFile[K,V,CustomizedInputFormat](path,
>>>   classOf[CustomizedInputFormat], classOf[K], classOf[V])
>>>
>>> to create an RDD of (K,V)
[V])
>
> to create an RDD of (K,V) with CustomizedInputFormat.
>
> Hope this helps,
> Liquan
>
> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis
> wrote:
>
>> When I experimented with using an InputFormat I had used in Hadoop for a
>> long time in Hadoop I found
&g
When I experimented with using an InputFormat I had used in Hadoop for a
long time, I found
1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat)
2) initialize needs to be called in the constructor
3) The
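For what it is worth, the deprecated-base-class requirement seems to apply to sc.hadoopFile; a format written against the new org.apache.hadoop.mapreduce API can go through newAPIHadoopFile instead. A sketch with a hypothetical format class and the usual Writable key/value types:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<LongWritable, Text> rdd = sc.newAPIHadoopFile(
        path,
        MyNewApiInputFormat.class,   // extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable, Text>
        LongWritable.class,
        Text.class,
        new Configuration());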
Sep 22, 2014 at 9:22 AM, Steve Lewis
> wrote:
>
>>The only way I find is to turn it into a list - in effect holding
>> everything in memory (see code below). Surely Spark has a better way.
>>
>> Also what about unterminated iterables like a Fibonacci series - (u
The only way I find is to turn it into a list - in effect holding
everything in memory (see code below). Surely Spark has a better way.
Also what about unterminated iterables like a Fibonacci series - (useful
only if limited in some other way )
/**
* make an RDD from an iterable
* @
mbineByKey[Int](zero, reduce,
>> merge)
>> reducedUnsorted.sortByKey()
>>
>> On Fri, Sep 19, 2014 at 1:37 PM, Steve Lewis
>> wrote:
>>
>>> I am struggling to reproduce the functionality of a Hadoop reducer on
>>> Spark (in Java)
>>>
&
I am struggling to reproduce the functionality of a Hadoop reducer on
Spark (in Java)
in Hadoop I have a function
public void doReduce(K key, Iterator<V> values)
in Hadoop there is also a consumer (context write) which can be seen as
consume(key,value)
In my code
1) knowing the key is important to
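One rough equivalent in the Java API, as a sketch with String keys and values standing in for the real K and V and mapped assumed to be a JavaPairRDD<String, String>: group the values per key, then walk the Iterable and add outputs where Hadoop would call context.write. Spark 2.x signatures (in 1.x flatMap returns an Iterable); for the sorted-key behaviour, the combineByKey + sortByKey route quoted just above is the usual companion.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> reduced = mapped.groupByKey()      // JavaPairRDD<String, Iterable<String>>
        .flatMap(kv -> {
            String key = kv._1();
            List<String> outputs = new ArrayList<>();
            for (String value : kv._2()) {
                outputs.add(key + "\t" + value);   // stands in for the real doReduce body
            }
            return outputs.iterator();
        });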
14/09/17 14:21:21 ERROR storage.DiskBlockManager: Exception while deleting
local spark dir:
C:\Users\Steve\AppData\Local\Temp\spark-local-20140917142115-6a81
java.io.IOException: Failed to delete:
C:\Users\Steve\AppData\Local\Temp\spark-local-20140917142115-6a81\03\shuffle_0_0_10
I run a job with
Assume I have a large book with many Chapters and many lines of text.
Assume I have a function that tells me the similarity of two lines of
text. The objective is to find the most similar line in the same chapter
within 200 lines of the line found.
The real problem involves biology and is beyond t
In modern projects there are a bazillion dependencies - when I use Hadoop I
just put them in a lib directory in the jar - If I have a project that
depends on 50 jars I need a way to deliver them to Spark - maybe wordcount
can be written without dependencies but real projects need to deliver
depende
In a Hadoop jar there is a directory called lib and all non-provided third
party jars go there and are included in the class path of the code. Do jars
for Spark have the same structure - another way to ask the question is if I
have code to execute Spark and a jar built for Hadoop, can I simply use
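As far as I can tell Spark does not unpack a Hadoop-style lib/ directory inside the application jar; the usual routes are either listing each dependency jar explicitly (setJars in code, or --jars on spark-submit) or building one shaded/uber jar. A sketch of the first, with made-up paths:
import org.apache.spark.SparkConf;

SparkConf sparkConf = new SparkConf().setAppName("MyApp");
sparkConf.setJars(new String[] {
        "/path/to/myjob.jar",          // the application itself
        "/path/to/dependency1.jar",    // every third-party jar, spelled out
        "/path/to/dependency2.jar"
});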
ey-value pairs without worrying about this.
>
> Matei
>
> On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com)
> wrote:
>
> When programming in Hadoop it is possible to guarantee
> 1) All keys sent to a specific partition will be handled by the same
> mac
Assume say JavaWord count
I call the equivalent of a Mapper
JavaPairRDD<String, Integer> ones = words.mapToPair(...
Now right here I want to guarantee that each word starting with a
particular letter is processed in a specific partition - (Don't tell me
this is a dumb idea - I know that but in a Hadoop code a cus
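A sketch of the Hadoop-style custom partitioner in Spark terms: subclass org.apache.spark.Partitioner and apply it with partitionBy on the pair RDD. The 26 partitions and the first-letter rule are just illustrative.
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class FirstLetterPartitioner extends Partitioner {
    private final int numPartitions;

    public FirstLetterPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public int numPartitions() {
        return numPartitions;
    }

    @Override
    public int getPartition(Object key) {
        String word = (String) key;
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        return Math.abs(first % numPartitions);   // same first letter -> same partition
    }
}

// usage:
// JavaPairRDD<String, Integer> partitioned = ones.partitionBy(new FirstLetterPartitioner(26));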
could have data on disk across keys), and this is still true for
> groupByKey, cogroup and join. Those restrictions will hopefully go away in
> a later release. But sortByKey + mapPartitions lets you just iterate
> through the key-value pairs without worrying about this.
>
> Mat
When programming in Hadoop it is possible to guarantee
1) All keys sent to a specific partition will be handled by the same
machine (thread)
2) All keys received by a specific machine (thread) will be received in
sorted order
3) These conditions will hold even if the values associated with a specif
In many cases when I work with Map Reduce my mapper or my reducer might
take a single value and map it to multiple keys -
The reducer might also take a single key and emit multiple values
I don't think that functions like flatMap and reduceByKey will work - or are
there tricks I am not aware of?
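A sketch of the flatMap route for the one-to-many cases (input is a JavaRDD<String> and keysFor is hypothetical; Spark 2.x signatures, where the function returns an Iterator): flatMapToPair lets one input emit any number of (key, value) pairs, and after groupByKey a flatMap can emit any number of values per key.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, String> manyKeysPerValue = input.flatMapToPair(value -> {
    List<Tuple2<String, String>> pairs = new ArrayList<>();
    for (String key : keysFor(value)) {      // one input value -> several keys
        pairs.add(new Tuple2<>(key, value));
    }
    return pairs.iterator();
});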
e an action, like
> count(), on the words RDD.
>
> On Mon, Aug 25, 2014 at 6:32 PM, Steve Lewis
> wrote:
> > I was able to get JavaWordCount running with a local instance under
> > IntelliJ.
> >
> > In order to do so I needed to use maven to package my code and
>
I was able to get JavaWordCount running with a local instance under
IntelliJ.
In order to do so I needed to use maven to package my code and
call
String[] jars = {
"/SparkExamples/target/word-count-examples_2.10-1.0.0.jar" };
sparkConf.setJars(jars);
After that the sample ran properly and
Thank you for your advice - I really need this help and promise to post a
blog entry once it works
I ran
>bin\spark-class.cmd org.apache.spark.deploy.master.Master
and this ran successfully and I got a web page at http://localhost:8080
this says
Spark Master at spark://192.168.1.4:7077
My machi
I download the binaries for spark-1.0.2-hadoop1 and unpack it on my Windows
8 box.
I can execute spark-shell.cmd and get a command window which does the
proper things
I open a browser to http://localhost:4040 and a window comes up describing
the spark-master
Then using IntelliJ I create a project wi
> http://ml-nlp-ir.blogspot.com/2014/04/building-spark-on-windows-and-cloudera.html
> on how to build from source and run examples in spark shell.
>
>
> Regards,
> Manu
>
>
> On Sat, Aug 16, 2014 at 12:14 PM, Steve Lewis
> wrote:
>
>> I want to look at porting
I want to look at porting a Hadoop problem to Spark - eventually I want to
run on a Hadoop 2.0 cluster but while I am learning and porting I want to
run small problems in my windows box.
I installed scala and sbt.
I download Spark and in the Spark directory I can say
mvn -Phadoop-0.23 -Dhadoop.versio