Hi, Spark friends
This is Yifan. I am a software developer at Workday. I am not very familiar
with Spark, and I have a question about the Tag in TreeNode. We have a use case
where we add some information to a Tag, and we hope the tag will be persisted
by Spark. But I noticed that the tag is
smaller in size.
>
> HTH,
> Deng
>
> On Thu, Oct 29, 2015 at 8:40 PM, Yifan LI wrote:
>
>> I have a guess that before scanning that RDD, I sorted it and set
>> partitioning, so the result is not balanced:
>>
>> sortBy[S](f: Function…
try to repartition it to see if it helps.
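A hedged sketch of what that could look like (sizes and partition counts are hypothetical, assuming a spark-shell sc):

val rdd = sc.parallelize(1 to 1000000, 100)

// sortBy produces range-partitioned output; if the key distribution is
// skewed, some partitions come out much larger than others.
val sorted = rdd.sortBy(x => x, numPartitions = 400)

// repartition() redistributes the records evenly, but it reshuffles and so
// loses the global sort order; if the order matters, raise numPartitions in
// sortBy instead of repartitioning afterwards.
val balanced = sorted.repartition(400)

// toLocalIterator pulls one partition at a time to the driver.
println(balanced.toLocalIterator.take(5).toList)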
Best,
Yifan LI
> On 29 Oct 2015, at 12:52, Yifan LI wrote:
>
> Hey,
>
> I was just trying to scan a large RDD, sortedRdd (~1 billion elements), using
> the toLocalIterator API, but an exception was thrown when the scan was almost finished:
>
…
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Do you have any idea what is going wrong?
I have already set the number of partitions quite high, like 4
Best,
Yifan LI
Thanks for your advice, Jem. :)
I will increase the number of partitions and see if it helps.
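In case the failure turns out to be Kryo running out of serialization buffer (as Jem suggests in the message quoted below), the buffer limit can also be raised; a hedged sketch, assuming Spark 1.4+ property names (older releases used the *.mb variants) and hypothetical values:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64m")       // initial per-task buffer
  .set("spark.kryoserializer.buffer.max", "512m")  // upper bound before buffer-overflow errors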
Best,
Yifan LI
> On 23 Oct 2015, at 12:48, Jem Tucker wrote:
>
> Hi Yifan,
>
> I think this is a result of Kryo trying to serialize something too large.
> Have you tried
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Best,
Yifan LI
Shiwei, yes, you might be right. Thanks. :)
Best,
Yifan LI
> On 12 Oct 2015, at 12:55, 郭士伟 wrote:
>
> I think this is not a problem Spark can solve effectively, because RDDs are
> immutable. Every time you want to change an RDD, you create a new one, and
> sort again. Mayb
that “sort again”? It is too
costly… :(
Anyway thank you again!
Best,
Yifan LI
> On 12 Oct 2015, at 12:19, Adrian Tanase wrote:
>
> I think you’re looking for the flatMap (or flatMapValues) operator – you can
> do something like
>
> sortedRdd.flatMapValues( v => …
…
#2:
60 is matched! 60/2 = 30, so the collection should now look like this:
(3, (53.5, "ccc"))
(4, (48, "ddd"))
(2, (30, "bbb")) <— inserted back here
(5, (29, "eee"))
…
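A minimal flatMapValues sketch along the lines Adrian suggests, with hypothetical data in the same (key, (score, label)) shape and assuming a spark-shell sc; note it only rewrites values, it does not re-establish the sorted order:

val sortedRdd = sc.parallelize(Seq(
  (1, (60.0, "aaa")),
  (3, (53.5, "ccc")),
  (4, (48.0, "ddd")),
  (5, (29.0, "eee"))))

// flatMapValues keeps the key and may emit zero, one or many values per
// record; here the "matched" value 60 is replaced by 60/2 = 30.
val updated = sortedRdd.flatMapValues { case (score, label) =>
  if (score == 60.0) Seq((score / 2, label)) else Seq((score, label))
}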
Best,
Yifan LI
Hi,
I just encountered the same problem when running a PageRank program that has
lots of stages (iterations)…
The master was lost after my program finished.
And the issue still remains even after I increased the driver memory.
Any idea? E.g. how can I increase the master's memory?
Thanks.
Best,
Yifan
It might be because there was a lot of shuffling…
Does anyone have an idea how to fix it? Thanks in advance!
Best,
Yifan LI
Best,
Yifan LI
scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Best,
Yifan LI
Yes, you are right. For now I have to say the workload is distributed evenly
across executors… so, like you said, it is difficult to improve the situation.
However, do you have any idea how to create a *skewed* data/executor distribution?
Best,
Yifan LI
> On 06 May 2015, at 15:13, Saisai Shao wrote:
Thanks, Shao. :-)
I am wondering whether Spark will rebalance the storage overhead at
runtime… since there is still some available space on the other nodes.
Best,
Yifan LI
> On 06 May 2015, at 14:57, Saisai Shao wrote:
>
> I think you could configure multiple disks through spark.
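(Presumably this refers to spark.local.dir, which takes a comma-separated list so that shuffle and spill files are spread over several disks; a hedged sketch with hypothetical paths. On a standalone cluster the workers' SPARK_LOCAL_DIRS environment variable takes precedence over it.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")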
computation…, so maybe at some point the space requested on that node was
bigger than what was available.
But is there any way to avoid this kind of error? I am sure that the overall
disk space across all nodes is enough for my application.
Thanks in advance!
Best,
Yifan LI
Thanks, Olivier and Franz. :)
Best,
Yifan LI
> On 02 May 2015, at 23:23, Olivier Girardot wrote:
>
> I guess :
>
> val srdd_s1 = srdd.filter(_.startsWith("s1_")).sortBy(_)
> val srdd_s2 = srdd.filter(_.startsWith("s2_")).sortBy(_)
> val srdd_s3
…
…
Have any idea? Thanks in advance! :)
Best,
Yifan LI
repartitioning will be inevitable??
Thanks in advance!
Best,
Yifan LI
Hi Kannan,
I am not sure I have understood your question exactly, but maybe reduceByKey
or reduceByKeyLocally is a better fit for your need.
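For instance, a minimal sketch with hypothetical pair data (assuming a spark-shell sc):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey combines values per key on each partition before shuffling,
// so only one record per key per partition crosses the network.
val summed = pairs.reduceByKey(_ + _)          // RDD[(String, Int)]

// reduceByKeyLocally does the same but returns the result as a Map on the driver.
val local = pairs.reduceByKeyLocally(_ + _)    // scala.collection.Map[String, Int]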
Best,
Yifan LI
> On 17 Feb 2015, at 17:37, Vijayasarathy Kannan wrote:
>
> Hi,
>
> I am working on a Spark a
Thanks, Kelvin :)
The error seems to disappear after I decreased both
spark.storage.memoryFraction and spark.shuffle.memoryFraction to 0.2,
and increased the driver memory a bit.
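For reference, a hedged sketch of those settings under Spark 1.x static memory management (the defaults are 0.6 and 0.2; the values below are just what happened to work here, and the driver memory figure is hypothetical):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.2")   // cache fraction, default 0.6
  .set("spark.shuffle.memoryFraction", "0.2")   // shuffle fraction, default 0.2
// Driver memory must be set before the driver JVM starts, e.g.
//   spark-submit --driver-memory 8g ...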
Best,
Yifan LI
> On 10 Feb 2015, at 18:58, Kelvin Chu <2dot7kel...@gmail.com> wrote:
>
> Since
Yes, I have read it, and am trying to find some way to do that… Thanks :)
Best,
Yifan LI
> On 10 Feb 2015, at 12:06, Akhil Das wrote:
>
> Did you have a chance to look at this doc
> http://spark.apache.org/docs/1.2.0/tuning.html
during Pregel supersteps, so it seems to suffer from
heavy GC?
Best,
Yifan LI
> On 10 Feb 2015, at 10:26, Akhil Das wrote:
>
> You could try increasing the driver memory. Also, can you be more specific
> about the data volume?
>
> Thanks
> Best Regards
>
> On
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:128)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
Best,
Yifan LI
Hi Ankur,
Thanks very much for your help, but I am using v1.2, so it is SORT…
Let me know if you have any other advice. :)
Best,
Yifan LI
> On 05 Feb 2015, at 17:56, Ankur Srivastava wrote:
>
> Li, I cannot tell you the reason for this exception but have seen these kind
Does anyone have an idea where I can find the detailed log of that lost
executor (i.e. why it was lost)?
Thanks in advance!
> On 05 Feb 2015, at 16:14, Yifan LI wrote:
>
> Hi,
>
> I am running a heavy memory/cpu overhead graphx application, I think the
> memory is suffi
Where can I get more details on this issue?
Best,
Yifan LI
executor 13
15/02/02 11:48:49 ERROR SparkDeploySchedulerBackend: Asked to remove
non-existent executor 13
Does anyone have any pointers on this?
Best,
Yifan LI
> On 02 Feb 2015, at 11:47, Yifan LI wrote:
>
> Thanks, Sonal.
>
> But it seems to be an error happened when “cleaning broadc
Thanks, Sonal.
But it seems to be an error that happened while “cleaning broadcast”?
BTW, what is the “[30 seconds]” timeout? Can I increase it?
Best,
Yifan LI
> On 02 Feb 2015, at 11:12, Sonal Goyal wrote:
>
> That may be the cause of your issue. Take a look at the tuning gui
Yes, I think so, especially for a Pregel application… Do you have any suggestions?
Best,
Yifan LI
> On 30 Jan 2015, at 22:25, Sonal Goyal wrote:
>
> Is your code hitting frequent garbage collection?
>
> Best Regards,
> Sonal
> Founder, Nube Technologies
> => (9 + 2) / 11]
> 15/01/29 23:57:30 ERROR TaskSchedulerImpl: Lost executor 8 on
> small10-tap1.common.lip6.fr: remote Akka client disassociated
> 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 8
> 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 8
>
> Best,
> Yifan LI
>
15/01/29 23:57:30 ERROR TaskSchedulerImpl: Lost executor 8 on
small10-tap1.common.lip6.fr: remote Akka client disassociated
15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove
non-existent executor 8
15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove
non-existent executor 8
Best,
Yifan LI
Hi,
I just saw an edge RDD reported as "300% Fraction Cached" in the Storage web UI;
what does that mean? I could understand it if the value were under 100%…
Thanks.
Best,
Yifan LI
they communicate
between partitions (and machines)?
Does anyone have pointers on this, or on communication between RDDs?
Thanks. :)
Best,
Yifan LI
Pregel API, one superstep is enough
- by using basic Spark operations (groupByKey, leftJoin, etc.) on the vertices RDD
and its intermediate results.
W.r.t. the communication among machines and the high cost of
groupByKey/leftJoin, I guess the 1st option is better?
What do you think?
Best,
Yifan LI
Thanks, Paolo and Mark. :)
> On 04 Dec 2014, at 11:58, Paolo Platter wrote:
>
> Hi,
>
> rdd.flatMap( e => e._2.map( i => ( i, e._1)))
>
> Should work, but I didn't test it so maybe I'm missing something.
>
> Paolo
>
> Sent from
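For reference, Paolo's flatMap produces exactly the pairs asked for in the original message below; a self-contained sketch, assuming the values are collections (e.g. Seq) and a spark-shell sc:

val rdd = sc.parallelize(Seq(
  (1, Seq(10, 20)),
  (2, Seq(30, 40, 10)),
  (3, Seq(30))))

// invert each (key, values) record into one (value, key) pair per element
val inverted = rdd.flatMap(e => e._2.map(i => (i, e._1)))

inverted.collect().foreach(println)
// (10,1) (20,1) (30,2) (40,2) (10,2) (30,3)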
Hi,
I have a RDD like below:
(1, (10, 20))
(2, (30, 40, 10))
(3, (30))
…
Is there any way to map it to this:
(10,1)
(20,1)
(30,2)
(40,2)
(10,2)
(30,3)
…
Generally, each element might be mapped to multiple output pairs.
Thanks in advance!
Best,
Yifan LI
(SparkSubmit.scala)
Does anyone have any pointers on this?
Best,
Yifan LI
you chose, w.r.t. the vertex replication factor
- the distribution of partitions across the cluster
...
Best,
Yifan LI
LIP6, UPMC, Paris
> On 17 Nov 2014, at 11:59, Hlib Mykhailenko wrote:
>
> Hello,
>
> I use Spark Standalone Cluster and I want to measure somehow internode
> comm
I am not sure whether it works on Spark 1.0, but give it a try.
Or maybe you can try the following (see the sketch below):
1) construct the edge and vertex RDDs yourself, with the desired storage
level;
2) then obtain a graph by using Graph(verticesRDD, edgesRDD).
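A minimal sketch of what I mean, with a hypothetical tiny graph and assuming a spark-shell sc (whether GraphX keeps that storage level for its internal RDDs depends on the version, so treat it only as a sketch):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.storage.StorageLevel

val vertices = sc.parallelize(Seq[(VertexId, String)]((1L, "a"), (2L, "b")))
  .persist(StorageLevel.MEMORY_AND_DISK)
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0)))
  .persist(StorageLevel.MEMORY_AND_DISK)

val graph = Graph(vertices, edges)

If I remember correctly, on Spark 1.1+ you can also pass edgeStorageLevel and vertexStorageLevel directly to Graph(...), as with GraphLoader.edgeListFile.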
Best,
Yifan LI
On 28 Oct 2014, at 12:10, Arpit Kumar wrote:
Hi Arpit,
Try this:
val graph = GraphLoader.edgeListFile(sc, edgesFile,
  minEdgePartitions = numPartitions,
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
Best,
Yifan LI
On 28 Oct 2014, at 11:17, Arpit Kumar wrote:
> Any h
using basic Spark table operations (join, etc.), for
instance as in [1])
[1] http://event.cwi.nl/grades2014/03-salihoglu.pdf
Best,
Yifan LI
  Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
} else {
  Iterator.empty
}
}
So, in this case, is a message still sent even when it is empty, or not?
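For reference, the full shape of the send-message function being discussed, reconstructed under assumptions (the attribute types and the threshold are hypothetical):

import org.apache.spark.graphx.{EdgeTriplet, VertexId}

def sendMessage(edge: EdgeTriplet[(Double, Double), Double])
    : Iterator[(VertexId, Double)] =
  if (edge.srcAttr._2 > 0.001)
    Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  else
    Iterator.empty  // nothing is emitted along this edge in this case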
Best,
Yifan
On 16 Sep 2014, at 11:48, Ankur Dave wrote:
> At 2014-09-16 10:55:37 +0200, Yifan LI wrote:
>> - from
already been
introduced in the GraphX Pregel API?)
Best,
Yifan LI
On 15 Sep 2014, at 23:07, Ankur Dave wrote:
> At 2014-09-15 16:25:04 +0200, Yifan LI wrote:
>> I am wondering if the vertex active/inactive (corresponding to the change of its
>> value between two supersteps) feature i
]) =
Iterator((edge.dstId, hmCal(edge.srcAttr)))
Or should I do that with a customised measure function, e.g. by keeping the
change in the vertex attribute after each iteration?
I noticed that there is an optional parameter “skipStale" in the mrTriplets
operator.
Best,
Yifan LI
:00 Ankur Dave :
> At 2014-09-03 17:58:09 +0200, Yifan LI wrote:
> > val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions =
> numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK)
> >
> > Error: java.lang.Unsupporte
)
Error: java.lang.UnsupportedOperationException: Cannot change storage level
of an RDD after it was already assigned a level
Could anyone help me with this?
Best,
Yifan
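One hedged workaround for the UnsupportedOperationException above: pass the storage levels to GraphLoader.edgeListFile itself (Spark 1.1+) instead of calling persist after partitionBy. Whether the level survives partitionBy may depend on the GraphX version, so this is only a sketch; edgesFile and numPartitions are hypothetical placeholders:

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
import org.apache.spark.storage.StorageLevel

val edgesFile = "hdfs:///path/to/edges.txt"   // hypothetical path
val numPartitions = 192                       // hypothetical value

val graph = GraphLoader
  .edgeListFile(sc, edgesFile,
    minEdgePartitions = numPartitions,
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
  .partitionBy(PartitionStrategy.EdgePartition2D)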
2014-08-18 23:52 GMT+02:00 Ankur Dave :
> On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI wrote:
>
> I am te
Hi,
I am testing our application (similar to "personalised page rank" using Pregel;
note that each vertex property needs considerably more space to store after
each new iteration). It works correctly on a small graph (we have a single
machine: 8 cores, 16 GB memory).
But when we ran it on a larger
ally, only hubs receive messages and compute results in
the LAST iteration, since we don't need the final result of non-hub vertices?
Best,
Yifan LI
Try this:
./bin/run-example graphx.LiveJournalPageRank <…>
On Aug 2, 2014, at 5:55 PM, Deep Pradhan wrote:
> Hi,
> I am running Spark in a single node cluster. I am able to run the codes in
> Spark like SparkPageRank.scala, SparkKMeans.scala by the following command,
> bin/run-examples org.ap
Hi Bin,
Maybe you could get the vertex whose id is, for instance, 80 by using:
graph.vertices.filter { case (id, _) => id == 80 }.collect
but I am not sure this is the most efficient way (will it scan the whole
table if it cannot benefit from the VertexRDD's index?).
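Another option I am not fully sure about: since graph.vertices is keyed by vertex id, RDD.lookup should avoid the full scan when the VertexRDD has a partitioner set (which I believe it does); a sketch with a hypothetical toy graph, assuming a spark-shell sc:

import org.apache.spark.graphx.{Edge, Graph, VertexId}

val vertices = sc.parallelize(Seq[(VertexId, String)]((80L, "bin"), (81L, "other")))
val edges = sc.parallelize(Seq(Edge(80L, 81L, 1)))
val graph = Graph(vertices, edges)

// lookup() on a keyed RDD with a partitioner only runs a task on the
// partition that owns the key, instead of filtering every partition.
val attrOf80: Seq[String] = graph.vertices.lookup(80L)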
@Ankur, is there any
"Task Time" is approximated to "Duration" multiplied with "Total Tasks"?
2) what are the exact meanings of "Shuffle Read/Shuffle Write"?
Best,
Yifan LI
Thanks, Abel.
Best,
Yifan LI
On Jul 21, 2014, at 4:16 PM, Abel Coronado Iruegas wrote:
> Hi Yifan
>
> This works for me:
>
> export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g"
> export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar
> expor
somewhere (spark-1.0.1)?
Best,
Yifan LI
communication cost of my application given the chosen
partition strategy?
On Jul 18, 2014, at 9:18 PM, Ankur Dave wrote:
> Sorry, I didn't read your vertex replication example carefully, so my
> previous answer is wrong. Here's the correct one:
>
> On Fri, Jul 18, 2014 at 9:
core is restricted to accessing only its own partition in memory?
How do they communicate when the required data (edges?) is in another partition?
On Jul 15, 2014, at 9:30 PM, Ankur Dave wrote:
> On Jul 15, 2014, at 12:06 PM, Yifan LI wrote:
> Btw, is there any possibility to customise the partition
partitions being scheduled?
Best,
Yifan
On Jul 15, 2014, at 12:06 PM, Yifan LI wrote:
> Dear Ankur,
>
> Thanks so much!
>
> Btw, is there any possibility to customise the partition strategy as we
> expect?
>
>
> Best,
> Yifan
> On Jul 11, 2014, at 10:20 PM,
> GraphX questions to the
> Spark user list if possible. I'll probably still be the one answering them,
> but that way others can benefit as well.
>
> Ankur
>
>
> On Fri, Jul 11, 2014 at 3:05 AM, Yifan LI wrote:
> Hi Ankur,
>
> I am doing graph computation using Gr
Hi,
I am doing graph computation using GraphX, but there seems to be an error with
the graph partition strategy specification.
As stated in the "GraphX programming guide":
"The Graph.partitionBy operator allows users to choose the graph partitioning
strategy, but due to SPARK-1931, this method is broken in Spark 1.0"
Hi,
I am planning to do graph (social network) computation on a cluster (Hadoop has
been installed), but it seems there is a "pre-built" package for Hadoop, and I
am NOT sure whether GraphX is included in it.
Or should I install another released version (in which GraphX is obviously
included)?