VertexId type in GraphX

2015-01-13 Thread Madhu
Are there any plans to generalize the type of VertexId in GraphX? Our keys are particularly long. We could use the hashCode() trick, but the chance of collisions is not acceptable. Given our data volume, we have encountered hashCode() collisions more than once. I see this Jira, but it is specific
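The collision risk Madhu describes is easy to quantify with the birthday bound. A minimal sketch (pure Python, no Spark; the function name and the example key counts are illustrative, not from the thread) showing why a 32-bit hashCode() is risky at scale while even a 64-bit id space is not collision-free:

```python
import math

def collision_probability(n_keys: int, bits: int) -> float:
    """Birthday-bound approximation of the probability of at least one
    collision when hashing n_keys uniformly into a space of 2**bits ids."""
    space = 2.0 ** bits
    # P(collision) ~= 1 - exp(-n(n-1) / (2 * 2**bits))
    return 1.0 - math.exp(-n_keys * (n_keys - 1) / (2.0 * space))

# Java's String.hashCode() is 32-bit: even 100k distinct keys make a
# collision more likely than not (~0.69).
p32 = collision_probability(100_000, 32)

# A 64-bit Long VertexId: a billion keys still carry a small but
# non-negligible collision chance (~0.027), matching Madhu's concern.
p64 = collision_probability(1_000_000_000, 64)
```

This is why hashing arbitrary keys down to a fixed-width VertexId is unsafe at large data volumes; assigning ids explicitly (e.g. via a zip-with-unique-id pass) avoids the problem entirely.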

Re: DBSCAN for MLlib

2015-01-13 Thread Muhammad Ali A'råby
I have to say, I have created a Jira task for it: [SPARK-5226] Add DBSCAN Clustering Algorithm to MLlib - ASF JIRA. MLlib is all k-means now, and I think we should add some new clustering algorithms to it

DBSCAN for MLlib

2015-01-13 Thread Muhammad Ali A'råby
Dear all, I think MLlib needs more clustering algorithms and DBSCAN is my first candidate. I am starting to implement it. Any advice? Muhammad-Ali
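For reference, the algorithm being proposed can be sketched in a few lines. This is a minimal single-machine, quadratic-time version (pure Python, names and structure my own, not MLlib code); a real MLlib implementation would need a distributed neighbor search:

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    min_pts counts the point itself, since a point is in its own
    eps-neighborhood.
    """
    def neighbors(i):
        # Brute-force range query; the O(n^2) hot spot of naive DBSCAN.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)   # None = unvisited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # noise, may be claimed as border later
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:   # only core points expand the cluster
                queue.extend(js)
    return labels
```

For example, two tight blobs plus an outlier yield two clusters and one noise point: `dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)], eps=2, min_pts=3)` returns `[0, 0, 0, 1, 1, 1, -1]`.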

Re: create a SchemaRDD from a custom datasource

2015-01-13 Thread Reynold Xin
If it is a small collection of them on the driver, you can just use sc.parallelize to create an RDD. On Tue, Jan 13, 2015 at 7:56 AM, Malith Dhanushka wrote: > Hi Reynold, > > Thanks for the response. I am just wondering, let's say we have a set of Row > objects. Isn't there a straightforward way

Re: Use of MapConverter, ListConverter in python to java object conversion

2015-01-13 Thread Davies Liu
It's not necessary, I will create a PR to remove them. For larger dicts/lists/tuples, the pickle approach may make fewer RPC calls and give better performance. Davies On Tue, Jan 13, 2015 at 4:53 AM, Meethu Mathew wrote: > Hi all, > > In the python object to java conversion done in the method _py2java in >
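Davies's point about RPC calls can be illustrated without Spark at all. A sketch (the nested structure below is invented for illustration; `SerDe.loads` on the JVM side is taken from the thread):

```python
import pickle

# A nested structure of the kind _py2java has to ship to the JVM.
obj = {"weights": [0.2, 0.8], "means": [(0.0, 1.0), (5.0, 2.0)]}

# Per-element MapConverter/ListConverter conversion costs roughly one
# Py4J round trip per entry; pickling the whole object sends a single
# byte payload instead, which the JVM side decodes in one call
# (sc._jvm.SerDe.loads in the thread above).
payload = bytearray(pickle.dumps(obj))
restored = pickle.loads(bytes(payload))
assert restored == obj
```

The trade-off is one serialization pass and one RPC versus many small RPCs, which is why the pickle path wins for larger collections.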

Re: Python to Java object conversion of numpy array

2015-01-13 Thread Davies Liu
On Mon, Jan 12, 2015 at 8:14 PM, Meethu Mathew wrote: > Hi, > This is the function defined in PythonMLLibAPI.scala: > def findPredict(data: JavaRDD[Vector], wt: Object, mu: Array[Object], si: Array[Object]): RDD[Array[Double]] = { } > So the parameter mu sho

Fwd: [ NOTICE ] Service Downtime Notification - R/W git repos

2015-01-13 Thread Patrick Wendell
FYI our git repo may be down for a few hours today. -- Forwarded message -- From: "Tony Stevenson" Date: Jan 13, 2015 6:49 AM Subject: [ NOTICE ] Service Downtime Notification - R/W git repos To: Cc: Folks, Please note that on Thursday 15th at 20:00 UTC the Infrastructure team wi

Unable to find configuration file at location scalastyle-config.xml

2015-01-13 Thread Zhiwei Chan
Hi everyone, I am new to Spark, and am trying to package spark-core with some modifications. I used IDEA to package spark-core_2.10 of Spark 1.1.1. When I encountered the following error, I checked the website http://www.scalastyle.org/maven.html, and its suggested configuration is to modify the spark
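That error usually means the scalastyle-maven-plugin cannot resolve the path given in its configLocation. A hedged sketch of the relevant pom.xml fragment, in the style shown at scalastyle.org/maven.html (the version number and paths below are illustrative, not Spark's actual build config):

```xml
<plugin>
  <groupId>org.scalastyle</groupId>
  <artifactId>scalastyle-maven-plugin</artifactId>
  <version>0.4.0</version>
  <configuration>
    <failOnViolation>true</failOnViolation>
    <includeTestSourceDirectory>false</includeTestSourceDirectory>
    <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
    <!-- "Unable to find configuration file" means this lookup failed:
         point it at a path that exists relative to the module being built,
         e.g. the parent directory when building a submodule like core. -->
    <configLocation>${basedir}/scalastyle-config.xml</configLocation>
  </configuration>
</plugin>
```

When packaging a single submodule (as with spark-core in IDEA), `${basedir}` resolves to the submodule directory, not the project root, which is a common cause of this lookup failing.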

Use of MapConverter, ListConverter in python to java object conversion

2015-01-13 Thread Meethu Mathew
Hi all, In the python object to java conversion done in the method _py2java in spark/python/pyspark/mllib/common.py, why are we doing individual conversions using MapConverter and ListConverter? The same can be achieved using bytearray(PickleSerializer().dumps(obj)) obj = sc._jvm.SerDe.loads(by

Re: create a SchemaRDD from a custom datasource

2015-01-13 Thread Reynold Xin
Depends on what the other side is doing. You can create your own RDD implementation by subclassing RDD, or it might work if you use sc.parallelize(1 to n, n).mapPartitionsWithIndex( /* code to read the data and return an iterator */ ) where n is the number of partitions. On Tue, Jan 13, 2015 at 12
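Reynold's pattern can be simulated locally to see the shape of it: `sc.parallelize(1 to n, n)` yields one element per partition, and `mapPartitionsWithIndex` maps each partition index to a chunk read from the external source. A Spark-free sketch (the `read_chunk` reader and its row format are hypothetical stand-ins for whatever the custom datasource exposes):

```python
def read_chunk(partition_index, rows_per_chunk=3):
    # Hypothetical reader: returns the rows belonging to one partition
    # of the external source. In the real pattern this body would call
    # the custom datasource API.
    start = partition_index * rows_per_chunk
    return [{"id": i} for i in range(start, start + rows_per_chunk)]

def simulate_parallel_read(n_partitions):
    # Locally mimics mapPartitionsWithIndex over n single-element
    # partitions: each index produces its own iterator of rows, and the
    # union of all partitions is the full dataset.
    rows = []
    for index in range(n_partitions):
        rows.extend(read_chunk(index))
    return rows
```

In real Spark code the loop disappears: each partition's `read_chunk(index)` runs on an executor, so the reads happen in parallel and no single machine has to hold the whole dataset.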

create a SchemaRDD from a custom datasource

2015-01-13 Thread Niranda Perera
Hi, We have a custom datasources API, which connects to various data sources and exposes them out as a common API. We are now trying to implement the Spark datasources API released in 1.2.0 to connect Spark for analytics. Looking at the sources API, we figured out that we should extend a scan cla