Question about Spark Tag in TreeNode

2025-07-07 Thread Yifan Li
Hi, Spark friends. This is Yifan. I am a software developer from Workday. I am not very familiar with Spark and I have a question about the Tag in TreeNode. We have a use case where we will add some information to a Tag and we hope the tag will be persisted in Spark. But I noticed that the tag is

unsubscribe

2023-08-11 Thread Yifan LI
unsubscribe

Re: [Spark] java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-30 Thread Yifan LI
smaller in size. > > HTH, > Deng > > On Thu, Oct 29, 2015 at 8:40 PM, Yifan LI wrote: > >> I have a guess that before scanning that RDD, I sorted it and set >> partitioning, so the result is not balanced: >> >> sortBy[S](f: Function

Re: [Spark] java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-29 Thread Yifan LI
try to repartition it to see if it helps. Best, Yifan LI > On 29 Oct 2015, at 12:52, Yifan LI wrote: > > Hey, > > I was just trying to scan a large RDD sortedRdd, ~1 billion elements, using > the toLocalIterator api, but an exception was returned as it was almost finished:
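
A minimal sketch of the repartition idea, assuming the root cause is the 2 GB block limit: each partition that toLocalIterator pulls to the driver must fit in a single block, and a block is backed by a Java byte array capped at Integer.MAX_VALUE bytes. Raising the partition count inside sortBy itself (rather than a later repartition(), which would shuffle away the ordering) keeps every block small. The names and sizes below are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("local-iterator-sketch"))

    // scaled-down stand-in for the ~1-billion-element RDD in the thread
    val bigRdd = sc.parallelize(1L to 100000000L)

    // sort with a high partition count so each partition stays well under
    // Integer.MAX_VALUE bytes when fetched by the driver
    val sortedRdd = bigRdd.sortBy(identity, ascending = true, numPartitions = 4000)

    sortedRdd.toLocalIterator.take(10).foreach(println)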

[Spark] java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-29 Thread Yifan LI
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Do you have any idea? I have set partitioning quite big, like 4 Best, Yifan LI

Re: java.lang.NegativeArraySizeException? as iterating a big RDD

2015-10-23 Thread Yifan LI
Thanks for your advice, Jem. :) I will increase the partitioning and see if it helps. Best, Yifan LI > On 23 Oct 2015, at 12:48, Jem Tucker wrote: > > Hi Yifan, > > I think this is a result of Kryo trying to serialize something too large. > Have you trie
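
For what it's worth, a sketch of the two knobs in play, assuming (as Jem suggests) that a single serialized chunk is overflowing Kryo's buffer. The buffer property and its value are my assumption, not something from the thread (the name is spark.kryoserializer.buffer.max from Spark 1.4 on, spark.kryoserializer.buffer.max.mb before that):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("kryo-buffer-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // cap for any single Kryo-serialized record; overflowing it can surface
      // as NegativeArraySizeException
      .set("spark.kryoserializer.buffer.max", "512m")

    val sc = new SparkContext(conf)

    // Jem's actual advice: more partitions mean smaller serialized chunks
    val finer = sc.parallelize(1 to 100000000).repartition(2000)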

java.lang.NegativeArraySizeException? as iterating a big RDD

2015-10-23 Thread Yifan LI
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Best, Yifan LI

Re: "dynamically" sort a large collection?

2015-10-12 Thread Yifan LI
Shiwei, yes, you might be right. Thanks. :) Best, Yifan LI > On 12 Oct 2015, at 12:55, 郭士伟 wrote: > > I think this is not a problem Spark can solve effectively, because RDDs are > immutable. Every time you want to change an RDD, you create a new one, and > sort again. Mayb

Re: "dynamically" sort a large collection?

2015-10-12 Thread Yifan LI
that “sort again”? it is too costly… :( Anyway thank you again! Best, Yifan LI > On 12 Oct 2015, at 12:19, Adrian Tanase wrote: > > I think you’re looking for the flatMap (or flatMapValues) operator – you can > do something like > > sortedRdd.flatMapValues( v =
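A sketch of what Adrian's flatMapValues suggestion might look like against the (id, (score, label)) collection from this thread; the SparkContext sc, the threshold 60, and the halving rule are taken from (or assumed around) the quoted example:

    // assumes an existing SparkContext `sc`
    val sortedRdd = sc.parallelize(Seq(
      (1, (60.0, "aaa")), (3, (53.5, "ccc")), (4, (48.0, "ddd")), (5, (29.0, "eee"))
    )).sortBy(_._2._1, ascending = false)

    // rewrite matched values in place instead of deleting and re-inserting
    val updated = sortedRdd.flatMapValues { case (score, label) =>
      if (score == 60.0) Seq((score / 2, label)) // 60 becomes 30
      else Seq((score, label))
    }

    // one sort at the end replaces the repeated "sort again" the poster fears
    val resorted = updated.sortBy(_._2._1, ascending = false)

Whether this beats re-sorting after every change depends on how many updates can be batched between sorts.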

"dynamically" sort a large collection?

2015-10-12 Thread Yifan LI
”)) … #2: 60 is matched! 60/2 = 30, the collection right now should be as: (3, (53.5, “ccc”)) (4, (48, “ddd”)) (2, (30, “bbb”)) <— inserted back here (5, (29, “eee”)) … Best, Yifan LI

Re: Master dies after program finishes normally

2015-06-26 Thread Yifan LI
Hi, I just encountered the same problem, when I ran a PageRank program which has lots of stages (iterations)… The master was lost after my program was done. And the issue still remains even after I increased the driver memory. Any idea? e.g. how to increase the master memory? Thanks. Best, Yifan

large shuffling => executor lost?

2015-06-04 Thread Yifan LI
it might be because there was a large shuffle… Does anyone have an idea how to fix it? Thanks in advance! Best, Yifan LI

applications are still in progress?

2015-05-13 Thread Yifan LI
, Yifan LI

com.esotericsoftware.kryo.KryoException: java.io.IOException: Stream is corrupted

2015-05-13 Thread Yifan LI
scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Best, Yifan LI

Re: No space left on device??

2015-05-06 Thread Yifan LI
Yes, you are right. For now I have to say the workload/executor is distributed evenly… so, like you said, it is difficult to improve the situation. However, do you have any idea of how to produce a *skewed* data/executor distribution? Best, Yifan LI > On 06 May 2015, at 15:13, Saisai Shao wr

Re: No space left on device??

2015-05-06 Thread Yifan LI
Thanks, Shao. :-) I am wondering if Spark will rebalance the storage overhead at runtime… since there is still some available space on other nodes. Best, Yifan LI > On 06 May 2015, at 14:57, Saisai Shao wrote: > > I think you could configure multiple disks through spark.
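
A sketch of where Shao's truncated suggestion is presumably going; the property name spark.local.dir and the example paths are my reconstruction:

    import org.apache.spark.SparkConf

    // spread shuffle/spill files across several physical disks; Spark picks
    // among the listed directories, so one full device no longer kills the job
    val conf = new SparkConf()
      .setAppName("local-dir-sketch")
      .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")

On a standalone cluster, the SPARK_LOCAL_DIRS environment variable on each worker takes precedence over this property.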

No space left on device??

2015-05-06 Thread Yifan LI
computation…, so maybe sometimes the request on that node was bigger than the available space. But is there any way to avoid this kind of error? I am sure that the overall disk space of all nodes is enough for my application. Thanks in advance! Best, Yifan LI

Re: to split an RDD to multiple ones?

2015-05-02 Thread Yifan LI
Thanks, Olivier and Franz. :) Best, Yifan LI > On 02 May 2015, at 23:23, Olivier Girardot wrote: > > I guess : > > val srdd_s1 = srdd.filter(_.startsWith("s1_")).sortBy(_) > val srdd_s2 = srdd.filter(_.startsWith("s2_")).sortBy(_) > val srdd_s3
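A self-contained sketch of Olivier's filter-based split. Note the quoted sortBy(_) will not compile as written; sortBy(identity) is used below, and the sample data is made up:

    // assumes an existing SparkContext `sc`
    val srdd = sc.parallelize(Seq("s2_x", "s1_a", "s3_m", "s1_b", "s2_y"))

    // cache the parent so the three filter passes don't recompute it
    srdd.cache()

    val srdd_s1 = srdd.filter(_.startsWith("s1_")).sortBy(identity)
    val srdd_s2 = srdd.filter(_.startsWith("s2_")).sortBy(identity)
    val srdd_s3 = srdd.filter(_.startsWith("s3_")).sortBy(identity)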

to split an RDD to multiple ones?

2015-05-02 Thread Yifan LI
… … Have any idea? Thanks in advance! :) Best, Yifan LI

How to avoid the repartitioning in graph construction

2015-03-27 Thread Yifan LI
repartitioning will be inevitable?? Thanks in advance! Best, Yifan LI

Re: Processing graphs

2015-02-17 Thread Yifan LI
Hi Kannan, I am not sure I have understood what your question is exactly, but maybe the reduceByKey or reduceByKeyLocally functionality is better suited to your need. Best, Yifan LI > On 17 Feb 2015, at 17:37, Vijayasarathy Kannan wrote: > > Hi, > > I am working on a Spark a

Re: OutofMemoryError: Java heap space

2015-02-12 Thread Yifan LI
Thanks, Kelvin :) The error seems to have disappeared after I decreased both spark.storage.memoryFraction and spark.shuffle.memoryFraction to 0.2, plus some increase in driver memory. Best, Yifan LI > On 10 Feb 2015, at 18:58, Kelvin Chu <2dot7kel...@gmail.com> wrote: > > Since
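
A sketch of the configuration being described, under the Spark 1.x memory model (both fractions were folded into unified memory management in Spark 1.6; the driver-memory value is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.2")  // cache's share of the heap
      .set("spark.shuffle.memoryFraction", "0.2")  // shuffle aggregation's share

    // driver memory must be set before the driver JVM starts, so it is
    // usually passed on the command line: spark-submit --driver-memory 8g ...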

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Yifan LI
Yes, I have read it, and am trying to find some way to do that… Thanks :) Best, Yifan LI > On 10 Feb 2015, at 12:06, Akhil Das wrote: > > Did you have a chance to look at this doc > http://spark.apache.org/docs/1.2.0/tuning.html

Re: OutofMemoryError: Java heap space

2015-02-10 Thread Yifan LI
during Pregel supersteps. So, it seems to suffer from high GC? Best, Yifan LI > On 10 Feb 2015, at 10:26, Akhil Das wrote: > > You could try increasing the driver memory. Also, can you be more specific > about the data volume? > > Thanks > Best Regards > > On

OutofMemoryError: Java heap space

2015-02-09 Thread Yifan LI
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:128) at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110) Best, Yifan LI

Re: how to debug this kind of error, e.g. "lost executor"?

2015-02-06 Thread Yifan LI
Hi Ankur, Thanks very much for your help, but I am using v1.2, so it is SORT… Let me know if you have any other advice. :) Best, Yifan LI > On 05 Feb 2015, at 17:56, Ankur Srivastava wrote: > > Li, I cannot tell you the reason for this exception but have seen these kind

Re: how to debug this kind of error, e.g. "lost executor"?

2015-02-05 Thread Yifan LI
Does anyone have an idea of where I can find the detailed log of that lost executor (why it was lost)? Thanks in advance! > On 05 Feb 2015, at 16:14, Yifan LI wrote: > > Hi, > > I am running a heavy memory/cpu overhead graphx application, I think the > memory is suffi

how to debug this kind of error, e.g. "lost executor"?

2015-02-05 Thread Yifan LI
? Where can I get more details on this issue? Best, Yifan LI

Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-02-02 Thread Yifan LI
executor 13 15/02/02 11:48:49 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 13 Does anyone have pointers on this? Best, Yifan LI > On 02 Feb 2015, at 11:47, Yifan LI wrote: > > Thanks, Sonal. > > But it seems to be an error that happened when “cleaning broadc

Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-02-02 Thread Yifan LI
Thanks, Sonal. But it seems to be an error that happened when “cleaning broadcast”? BTW, what is the timeout of “[30 seconds]”? Can I increase it? Best, Yifan LI > On 02 Feb 2015, at 11:12, Sonal Goyal wrote: > > That may be the cause of your issue. Take a look at the tuning gui

Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-01-30 Thread Yifan LI
Yes, I think so, esp. for a Pregel application… Any suggestions? Best, Yifan LI > On 30 Jan 2015, at 22:25, Sonal Goyal wrote: > > Is your code hitting frequent garbage collection? > > Best Regards, > Sonal > Founder, Nube Technologies <http://www.nu

[Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-01-30 Thread Yifan LI
=> (9 + 2) / 11] 15/01/29 23:57:30 ERROR TaskSchedulerImpl: Lost executor 8 on small10-tap1.common.lip6.fr: remote Akka client disassociated 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 8 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 8 Best, Yifan LI

[Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-01-30 Thread Yifan LI
ll10-tap1.common.lip6.fr: remote Akka client disassociated 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 8 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 8 Best, Yifan LI

300% "Fraction Cached"?

2014-12-19 Thread Yifan LI
Hi, I just saw an Edge RDD at “300% Fraction Cached” in the Storage WebUI, what does that mean? I could understand it if the value were under 100%… Thanks. Best, Yifan LI

[Graphx] the communication cost of leftJoin

2014-12-12 Thread Yifan LI
they communicate between partitions (and machines)? Does anyone have some pointers on this, or on communication between RDDs? Thanks, :) Best, Yifan LI

[Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Yifan LI
regel api, one superstep is enough - by using spark basic operations (groupByKey, leftJoin, etc) on the vertices RDD and its intermediate results. w.r.t. the communication among machines, and the high cost of groupByKey/leftJoin, I guess the 1st option is better? What’s your idea? Best, Yifan LI

Re: map function

2014-12-04 Thread Yifan LI
Thanks, Paolo and Mark. :) > On 04 Dec 2014, at 11:58, Paolo Platter wrote: > > Hi, > > rdd.flatMap( e => e._2.map( i => ( i, e._1))) > > Should work, but I didn't test it so maybe I'm missing something. > > Paolo > > Sent from my
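
Spelled out as a runnable sketch (the sample data mirrors the question quoted in the next entry; sc is an assumed SparkContext):

    // assumes an existing SparkContext `sc`
    val rdd = sc.parallelize(Seq(
      (1, Seq(10, 20)),
      (2, Seq(30, 40, 10)),
      (3, Seq(30))
    ))

    // Paolo's one-liner: each (key, values) pair fans out to one (value, key)
    // pair per element of the value list
    val inverted = rdd.flatMap { case (k, vs) => vs.map(v => (v, k)) }

    inverted.collect().foreach(println)
    // (10,1) (20,1) (30,2) (40,2) (10,2) (30,3)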

map function

2014-12-04 Thread Yifan LI
Hi, I have an RDD like below: (1, (10, 20)) (2, (30, 40, 10)) (3, (30)) … Is there any way to map it to this: (10,1) (20,1) (30,2) (40,2) (10,2) (30,3) … generally, each element might be mapped to multiple pairs. Thanks in advance! Best, Yifan LI

[graphx] failed to submit an application with java.lang.ClassNotFoundException

2014-11-27 Thread Yifan LI
(SparkSubmit.scala) Does anyone have pointers on this? Best, Yifan LI

Re: How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Yifan LI
you chose, w.r.t. the vertex replication factor - the distribution of partitions on the cluster ... Best, Yifan LI LIP6, UPMC, Paris > On 17 Nov 2014, at 11:59, Hlib Mykhailenko wrote: > > Hello, > > I use Spark Standalone Cluster and I want to measure somehow internode > comm

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Yifan LI
I am not sure if it can work on Spark 1.0, but give it a try. Or, maybe you can try: 1) constructing the edge and vertex RDDs with the desired storage level, and 2) then obtaining a graph by using Graph(verticesRDD, edgesRDD). Best, Yifan LI On 28 Oct 2014, at 12:10, Arpit Kumar
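
A sketch of that two-step construction, assuming an existing SparkContext sc; whether the chosen level survives graph construction depends on the Spark version, which is exactly the uncertainty voiced above:

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // 1) build the vertex and edge RDDs at the desired storage level
    val vertices: RDD[(VertexId, String)] = sc.parallelize(
      Seq((1L, "a"), (2L, "b"))).persist(StorageLevel.MEMORY_AND_DISK)
    val edges: RDD[Edge[Int]] = sc.parallelize(
      Seq(Edge(1L, 2L, 1))).persist(StorageLevel.MEMORY_AND_DISK)

    // 2) obtain the graph from them
    val graph = Graph(vertices, edges)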

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Yifan LI
Hi Arpit, To try this: val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions, edgeStorageLevel = StorageLevel.MEMORY_AND_DISK, vertexStorageLevel = StorageLevel.MEMORY_AND_DISK) Best, Yifan LI On 28 Oct 2014, at 11:17, Arpit Kumar wrote: > Any h

how to send message to specific vertex by Pregel api

2014-10-02 Thread Yifan LI
using basic spark table operations(join, etc), for instance, in [1]) [1] http://event.cwi.nl/grades2014/03-salihoglu.pdf Best, Yifan LI

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Yifan LI
…Iterator((edge.dstId, edge.srcAttr._2 * edge.attr)) } else { Iterator.empty } } so, in this case, is a message, even an empty one, still sent? Or not? Best, Yifan On 16 Sep 2014, at 11:48, Ankur Dave wrote: > At 2014-09-16 10:55:37 +0200, Yifan LI wrote: >> - from
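
The quoted sendMsg, reconstructed in full; the vertex attribute types and the guard are guesses, since the snippet is cut off before the if. Returning Iterator.empty sends nothing at all, so a vertex that receives no message from any edge is simply skipped in the next superstep:

    import org.apache.spark.graphx._

    // hypothetical attribute types: (rank, delta) on vertices, weight on edges
    def sendMsg(edge: EdgeTriplet[(Double, Double), Double]): Iterator[(VertexId, Double)] = {
      if (edge.srcAttr._2 > 0.0) { // placeholder guard; the real one is truncated
        Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
      } else {
        Iterator.empty // no message object is created or sent
      }
    }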

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Yifan LI
already been introduced in graphx pregel api? ) Best, Yifan LI On 15 Sep 2014, at 23:07, Ankur Dave wrote: > At 2014-09-15 16:25:04 +0200, Yifan LI wrote: >> I am wondering if the vertex active/inactive(corresponding the change of its >> value between two supersteps) feature i

vertex active/inactive feature in Pregel API ?

2014-09-15 Thread Yifan LI
]) = Iterator((edge.dstId, hmCal(edge.srcAttr))) or, should I do that with a customised measure function, e.g. by keeping its change in the vertex attribute after each iteration? I noticed that there is an optional parameter “skipStale” in the mrTriplets operator. Best, Yifan LI

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-05 Thread Yifan LI
:00 Ankur Dave : > At 2014-09-03 17:58:09 +0200, Yifan LI wrote: > > val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = > numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK) > > > > Error: java.lang.Unsupporte

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-03 Thread Yifan LI
) Error: java.lang.UnsupportedOperationException: Cannot change storage level of an RDD after it was already assigned a level. Could anyone give me some help? Best, Yifan 2014-08-18 23:52 GMT+02:00 Ankur Dave : > On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI wrote: > > I am te

[GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-08-18 Thread Yifan LI
Hi, I am testing our application (similar to "personalised page rank" using Pregel; note that each vertex property needs much more space to store after each new iteration). It works correctly on a small graph (we have one single machine, 8 cores, 16G memory), but when we ran it on a larger

[GraphX] how to compute only a subset of vertices in the whole graph?

2014-08-02 Thread Yifan LI
ally, only hubs can receive messages and compute results in the LAST iteration? since we don't need the final result of non-hub vertices. Best, Yifan LI

Re: GraphX

2014-08-02 Thread Yifan LI
Try this: ./bin/run-example graphx.LiveJournalPageRank <…> On Aug 2, 2014, at 5:55 PM, Deep Pradhan wrote: > Hi, > I am running Spark in a single node cluster. I am able to run the codes in > Spark like SparkPageRank.scala, SparkKMeans.scala by the following command, > bin/run-examples org.ap

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Yifan LI
Hi Bin, Maybe you could get the vertex, for instance, the one whose id is 80, by using: graph.vertices.filter{case(id, _) => id==80}.collect but I am not sure this is the most efficient way (it will scan the whole table? it cannot benefit from the index of the VertexRDD table). @Ankur, is there any
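
Two hedged variants on a tiny made-up graph (sc is an assumed SparkContext). Since VertexRDD is an RDD of (VertexId, VD) pairs with a partitioner set, the generic lookup can route to the single partition owning the key instead of scanning everything:

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // minimal graph containing vertex 80
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(79L, 80L, 1)))
    val graph: Graph[Int, Int] = Graph.fromEdges(edges, defaultValue = 0)

    // full scan of all vertex partitions
    val byFilter = graph.vertices.filter { case (id, _) => id == 80L }.collect()

    // lookup comes from PairRDDFunctions; with VertexRDD's partitioner it
    // only visits the one partition that can contain id 80
    val byLookup = graph.vertices.lookup(80L)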

the implications of some items in webUI

2014-07-22 Thread Yifan LI
t; "Task Time" is approximated to "Duration" multiplied with "Total Tasks"? 2) what are the exact meanings of "Shuffle Read/Shuffle Write"? Best, Yifan LI

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-07-21 Thread Yifan LI
Thanks, Abel. Best, Yifan LI On Jul 21, 2014, at 4:16 PM, Abel Coronado Iruegas wrote: > Hi Yifan > > This works for me: > > export SPARK_JAVA_OPTS="-Xms10g -Xmx40g -XX:MaxPermSize=10g" > export ADD_JARS=/home/abel/spark/MLI/target/MLI-assembly-1.0.jar > expor

java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-07-21 Thread Yifan LI
omewhere(spark-1.0.1)? Best, Yifan LI

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-21 Thread Yifan LI
communication cost of my application upon the chosen partition strategy? On Jul 18, 2014, at 9:18 PM, Ankur Dave wrote: > Sorry, I didn't read your vertex replication example carefully, so my > previous answer is wrong. Here's the correct one: > > On Fri, Jul 18, 2014 at 9:

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-18 Thread Yifan LI
core is restricted to access only its own partition in memory? how do they communicate when the required data (edges?) is in another partition? On Jul 15, 2014, at 9:30 PM, Ankur Dave wrote: > On Jul 15, 2014, at 12:06 PM, Yifan LI wrote: > Btw, is there any possibility to customise the partition

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-15 Thread Yifan LI
partitions being scheduled? Best, Yifan On Jul 15, 2014, at 12:06 PM, Yifan LI wrote: > Dear Ankur, > > Thanks so much! > > Btw, is there any possibility to customise the partition strategy as we > expect? > > > Best, > Yifan > On Jul 11, 2014, at 10:20 PM,

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-15 Thread Yifan LI
aphX questions to the > Spark user list if possible. I'll probably still be the one answering them, > but that way others can benefit as well. > > Ankur > > > On Fri, Jul 11, 2014 at 3:05 AM, Yifan LI wrote: > Hi Ankur, > > I am doing graph computation using Gr

GraphX: how to specify partition strategy?

2014-07-10 Thread Yifan LI
Hi, I am doing graph computation using GraphX, but there seems to be an error in the graph partition strategy specification. As in the "GraphX programming guide": The Graph.partitionBy operator allows users to choose the graph partitioning strategy, but due to SPARK-1931, this method is broken in Spark 1.0
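
For reference, a sketch of how the call looks on releases where SPARK-1931 has been fixed; the input path is a placeholder and the strategy choice is arbitrary:

    import org.apache.spark.graphx._

    // `sc` is an assumed SparkContext
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt")

    // available strategies: RandomVertexCut, CanonicalRandomVertexCut,
    // EdgePartition1D, EdgePartition2D
    val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)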

which Spark package(wrt. graphX) I should install to do graph computation on cluster?

2014-07-07 Thread Yifan LI
Hi, I am planning to do graph (social network) computation on a cluster (Hadoop has been installed), but it seems there is a "Pre-built" package for Hadoop which I am not sure includes GraphX. Or should I install another released version (in which GraphX has obviously been included)