Re: GraphX seems to be broken while creating a large graph (6B nodes in my case)

2014-08-22 Thread Jeffrey Picard
I’m seeing this issue also. I have a graph with 5828339535 vertices and 7398447992 edges; graph.numVertices returns 1533266498, while graph.numEdges is correct and returns 7398447992. I am also having an issue that I’m beginning to suspect is caused by the same underlying problem where connected
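A minimal way to confirm the mismatch, sketched here under the assumption of a spark-shell session where the graph is bound to a value named graph (the name is only a placeholder), is to compare numVertices against an explicitly Long-typed count of the vertex RDD:

    // Compare the reported vertex count with an explicit Long-typed count.
    val reported = graph.numVertices
    val counted  = graph.vertices.map(_ => 1L).reduce(_ + _)
    println(s"numVertices = $reported, explicit count = $counted")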

Re: Spark Contribution

2014-08-22 Thread Maisnam Ns
Thanks, all, for adding this link. On Sat, Aug 23, 2014 at 5:38 AM, Reynold Xin wrote: > Great idea. Added the link > https://github.com/apache/spark/blob/master/README.md > > > > On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> We should add this li

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Tathagata Das
The real fix is that the Spark sink suite does not really need to use the spark-streaming test jars. Removing that dependency altogether and submitting a PR. TD On Fri, Aug 22, 2014 at 6:34 PM, Tathagata Das wrote: > Figured it out. Fixing this ASAP. > > TD > > > On Fri, Aug 22, 2014 at 5:

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Tathagata Das
Figured it out. Fixing this ASAP. TD On Fri, Aug 22, 2014 at 5:51 PM, Patrick Wendell wrote: > Hey All, > > We can sort this out ASAP. Many of the Spark committers were at a company > offsite for the last 72 hours, so sorry that it is broken. > > - Patrick > > > On Fri, Aug 22, 2014 at 4:07 PM

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Patrick Wendell
Hey All, We can sort this out ASAP. Many of the Spark committers were at a company offsite for the last 72 hours, so sorry that it is broken. - Patrick On Fri, Aug 22, 2014 at 4:07 PM, Hari Shreedharan wrote: > Sean - I think only the ones in 1726 are enough. It is weird that any > class that

GraphX seems to be broken while creating a large graph (6B nodes in my case)

2014-08-22 Thread npanj
While creating a graph with 6B nodes and 12B edges, I noticed that *the 'numVertices' API returns an incorrect result*; 'numEdges' reports the correct number. A few times (with different datasets of > 2.5B nodes) I have also noticed that numVertices is returned as a negative number, so I suspect that there is some ove
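The negative values are consistent with a count overflowing a 32-bit integer somewhere along the way. A minimal, GraphX-independent sketch of the wrap-around (plain Scala, for illustration only):

    // A Long count forced through an Int wraps modulo 2^32 and can turn negative.
    Seq(2500000000L, 6000000000L).foreach { c =>
      println(s"$c as Int = ${c.toInt}")   // prints -1794967296 and 1705032704
    }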

Re: Spark Contribution

2014-08-22 Thread Reynold Xin
Great idea. Added the link https://github.com/apache/spark/blob/master/README.md On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > We should add this link to the readme on GitHub btw. > > On Thursday, August 21, 2014, Henry Saputra wrote: > > > The Apache Spark wi

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Hari Shreedharan
Sean - I think only the ones in 1726 are enough. It is weird that any class that uses the test-jar actually requires the streaming jar to be added explicitly. Shouldn't Maven take care of this? I posted some comments on the PR. -- Thanks, Hari Sean Owen August 2

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Sean Owen
Yes, master hasn't compiled for me for a few days. It's fixed in: https://github.com/apache/spark/pull/1726 https://github.com/apache/spark/pull/2075 Could a committer sort this out? Sean On Fri, Aug 22, 2014 at 9:55 PM, Ted Yu wrote: > Hi, > Using the following command on (refreshed) master

[Spark SQL] off-heap columnar store

2014-08-22 Thread Evan Chan
Hey guys, what is the plan for getting Tachyon/off-heap support for the columnar compressed store? It's not in 1.1, is it? In particular:
- being able to set TACHYON as the caching mode
- loading of hot columns or all columns
- write-through of columnar store data to HDFS or a backing store
- b
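For reference, this is roughly what off-heap (Tachyon-backed in the 1.x line) caching looks like for a plain RDD in spark-shell; whether the SQL columnar cache can use the same storage level is exactly the open question:

    import org.apache.spark.storage.StorageLevel

    // Persist a plain RDD off-heap; sc is the spark-shell SparkContext.
    val nums = sc.parallelize(1 to 1000)
    nums.persist(StorageLevel.OFF_HEAP)
    nums.count()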

reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Ted Yu
Hi, Using the following command on (refreshed) master branch:

mvn clean package -DskipTests

I got:

constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
---
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAcce

GraphX GraphLoader Coalesce Shuffle

2014-08-22 Thread Jeffrey Picard
Hey all, I’ve often found that my Spark programs run much more stably with a higher number of partitions, and a lot of the graphs I deal with will have a few hundred large part files. I was wondering if having a parameter in GraphLoader, defaulting to false, to set the shuffle parameter in coal
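For context, the coalesce flag in question looks like this when used by hand in spark-shell, reading the edge list directly rather than through GraphLoader (the path and partition count are placeholders):

    // coalesce with shuffle = true forces a full shuffle, which is what the
    // proposed GraphLoader parameter would expose.
    val lines = sc.textFile("hdfs:///path/to/edges")          // placeholder path
    val repartitioned = lines.coalesce(400, shuffle = true)   // placeholder partition count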

Re: take() reads every partition if the first one is empty

2014-08-22 Thread Andrew Ash
Yep, anyone can create a bug at https://issues.apache.org/jira/browse/SPARK Then, if you make a pull request on GitHub and have the bug number in the header, like "[SPARK-1234] Make take() less OOM-prone", the PR gets linked to the Jira ticket. I think that's the best way to get feedback on a

Re: take() reads every partition if the first one is empty

2014-08-22 Thread pnepywoda
What's the process at this point? Does someone make a bug? Should I make a bug? (do I even have permission to?)

Re: take() reads every partition if the first one is empty

2014-08-22 Thread Andrew Ash
Hi Paul, I agree that jumping straight from reading N rows from 1 partition to N rows from ALL partitions is pretty aggressive. The exponential growth strategy of doubling the partition count every time seems better -- 1, 2, 4, 8, 16, ... will be much more likely to prevent OOMs than the 1 -> ALL
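A minimal sketch of the doubling strategy described above, written against plain in-memory collections rather than the actual RDD.take implementation:

    // Scan partitions in exponentially growing batches: 1, 2, 4, 8, ...
    def takeSketch[T](partitions: Seq[Seq[T]], n: Int): Seq[T] = {
      val buf = scala.collection.mutable.ArrayBuffer.empty[T]
      var scanned = 0
      var batch = 1
      while (buf.size < n && scanned < partitions.size) {
        val upTo = math.min(scanned + batch, partitions.size)
        partitions.slice(scanned, upTo).foreach(p => buf ++= p.take(n - buf.size))
        scanned = upTo
        batch *= 2
      }
      buf.take(n).toSeq
    }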

take() reads every partition if the first one is empty

2014-08-22 Thread pnepywoda
On line 777 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771 the logic for take() reads ALL partitions if the first one (or the first k) is empty. This has actually led to OOMs when we had many partitions (thousands) and unfortu

RE: Spark SQL Query and join different data sources.

2014-08-22 Thread chutium
Oops, thanks Yan, you are right. I got:

scala> sqlContext.sql("select * from a join b").take(10)
java.lang.RuntimeException: Table Not Found: b
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
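The usual fix for "Table Not Found" is to register both sides of the join in the same SQLContext before running the statement. A sketch against Spark 1.1-era APIs, with placeholder paths and formats:

    // Register both tables in the same SQLContext, then join.
    val a = sqlContext.parquetFile("hdfs:///tables/a")   // placeholder
    val b = sqlContext.jsonFile("hdfs:///tables/b")      // placeholder
    a.registerTempTable("a")
    b.registerTempTable("b")
    sqlContext.sql("select * from a join b").take(10)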

Adding support for a new object store

2014-08-22 Thread Rajendran Appavu
I am new to the Spark source code and am looking to see if I can add push-down support of Spark filters to the storage layer (in my case, an object store). I am willing to consider
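As a purely conceptual sketch (made-up types, no specific Spark or object-store API implied), push-down means the predicate is evaluated inside the store so that only matching records ever reach Spark:

    // Hypothetical store-side interface: the store applies the predicate itself,
    // instead of Spark reading everything and filtering afterwards.
    case class Record(key: String, value: Array[Byte])

    trait ObjectStoreScan {
      def scan(predicate: Record => Boolean): Iterator[Record]
    }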