Strongly Connected Components

2016-11-08 Thread Shreya Agarwal
Hi, I am running this on a graph with >5B edges and >3B vertices and have 2 questions: 1. What is the optimal number of iterations? 2. I am running it for 1 iteration right now on a beefy 100-node cluster with 300 executors, each having 30GB RAM and 5 cores. I have persisted the graph to M
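A minimal sketch of the two levers in play here, assuming the graph is already loaded (names are hypothetical); in Spark 2.x, connectedComponents also accepts an iteration cap:

import org.apache.spark.graphx.Graph

// graph: Graph[VD, ED], already loaded; cache it before iterating
val g = graph.cache()

// Cap the Pregel supersteps instead of running to convergence;
// without a cap, iterations continue until no component label
// changes, which is bounded by the graph's diameter.
val cc = g.connectedComponents(10).vertices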

Splines or Smoothing Kernels for Linear Regression

2016-11-08 Thread Tobi Bosede
Hi fellow users, Has anyone ever used splines or smoothing kernels for linear regression in Spark? If not, does anyone have ideas on how this can be done, or what suitable alternatives exist? I am on Spark 1.6.1 with Python. Thanks, Tobi
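No spline transformer ships with Spark 1.6, so one workable alternative (sketched here in Scala; the same expansion is straightforward in Python) is to expand each raw feature into a truncated power basis by hand and fit an ordinary linear model on the expanded features. The helper below is hypothetical, not a Spark API:

// Truncated power basis of the given degree with interior knots.
def splineBasis(x: Double, knots: Seq[Double], degree: Int = 3): Array[Double] = {
  val poly  = (1 to degree).map(d => math.pow(x, d))
  val trunc = knots.map(k => math.pow(math.max(0.0, x - k), degree))
  (poly ++ trunc).toArray
}

// e.g. expand one raw feature before fitting a plain linear model:
val expanded = data.map { case (y, x) => (y, splineBasis(x, Seq(0.25, 0.5, 0.75))) }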

Re: Any Dynamic Compilation of Scala Query

2016-11-08 Thread Mahender Sarangam
Hi Kiran, Thanks for responding. We would like to know how the industry deals with scenarios like UPDATE in Spark. Here is our scenario, Manjunath: We are in the process of migrating our SQL Server data to Spark. We have our logic in stored procedures, where we dynamically create a SQL string and execute t

spark ml - ngram - how to preserve single word (1-gram)

2016-11-08 Thread Nirav Patel
Is it possible to preserve single tokens while using the n-gram feature transformer? E.g. Array("Hi", "I", "heard", "about", "Spark") becomes Array("Hi", "I", "heard", "about", "Spark", "Hi I", "I heard", "heard about", "about Spark"). Currently, if I want to do this I have to manually transform c
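One workaround, sketched under the assumption that df has a "tokens" array column from a Tokenizer: run NGram for n = 2 only, then concatenate the original unigrams back on with a small UDF.

import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, udf}

val bigram = new NGram().setN(2).setInputCol("tokens").setOutputCol("bigrams")

// Append the bigrams to the original unigrams.
val concatArrays = udf((a: Seq[String], b: Seq[String]) => a ++ b)

val withAllGrams = bigram.transform(df)
  .withColumn("allGrams", concatArrays(col("tokens"), col("bigrams")))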

Re: Save a spark RDD to disk

2016-11-08 Thread Andrew Holway
That's around 750MB/s, which seems quite respectable even in this day and age! How many and what kind of disks do you have attached to your nodes? What are you expecting? On Tue, Nov 8, 2016 at 11:08 PM, Elf Of Lothlorein wrote: > Hi > I am trying to save an RDD to disk and I am using the > saveAs

Save a spark RDD to disk

2016-11-08 Thread Elf Of Lothlorein
Hi, I am trying to save an RDD to disk using saveAsNewAPIHadoopFile. I am seeing that it takes almost 20 mins for about 900 GB of data. Is there any parameter that I can tune to make this saving faster? I am running about 45 executors with 5 cores each on 5 Spark worker nodes an
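A sketch of two common knobs, with hypothetical paths and assuming rdd is the pair RDD being written: raise the write parallelism, and compress the output (saveAsSequenceFile is used here for its codec overload; the same repartitioning applies before saveAsNewAPIHadoopFile).

import org.apache.hadoop.io.compress.SnappyCodec

rdd
  .repartition(450) // e.g. ~2 write tasks per core for 45 executors x 5 cores
  .saveAsSequenceFile("hdfs:///out/path", Some(classOf[SnappyCodec]))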

How Spark determines Parquet partition size

2016-11-08 Thread Selvam Raman
Hi, Can you please tell me how Parquet partitions the data while saving the dataframe? I have a dataframe which contains 10 values like below: field_num = 139, 140, 40, 41, 148, 149, 151
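On the write side the rule is simple: each partition of the DataFrame becomes one Parquet part file, so the partition count controls the output layout. A minimal sketch, assuming df is the dataframe in question and the paths are hypothetical:

df.repartition(4).write.parquet("hdfs:///out/four_part_files")
df.coalesce(1).write.parquet("hdfs:///out/single_part_file")

On the read side, Spark 2.0 splits Parquet files into read partitions of at most spark.sql.files.maxPartitionBytes (128 MB by default).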

Convert RDD of numpy matrices to Dataframes

2016-11-08 Thread aditya1702
Hello, I am trying out the MultilayerPerceptronClassifier and it takes only a dataframe in its train method. The problem is that I have a training RDD of (x, y) pairs, with x and y being matrices. x has dimensions (1,401) while y has dimensions (1,10). I need to convert the train RDD to datafram

GraphX and Public Transport Shortest Paths

2016-11-08 Thread Gerard Casey
Hi all, I’m doing a quick lit review. Suppose I have a graph whose link weights depend on time, i.e., a bus on a given road yields a journey time (link weight) of x at time y. This is a classic public transport shortest path problem: a weighted directed graph that is time dependent
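For the fixed-weight case, the standard GraphX starting point is the Pregel single-source shortest-path skeleton below (essentially the example from the GraphX guide, assuming graph: Graph[_, Double]); the time-dependent variant would replace the fixed Double edge weight with a travel-time function evaluated at the arrival time carried in the message.

import org.apache.spark.graphx._

val sourceId: VertexId = 1L
val init = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = init.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // vertex program
  triplet =>                                      // send message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b)                        // merge messages
)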

Re: GraphX Connected Components

2016-11-08 Thread Robineast
Have you tried this? https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.graphx.GraphLoader$ - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action -- View this message in context
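For reference, a minimal usage sketch with a hypothetical path; edgeListFile parses lines of "srcId dstId" (lines starting with # are comments):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edge_list.txt")
val cc = graph.connectedComponents().vertices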

Re: Anomalous Spark RDD persistence behavior

2016-11-08 Thread Dave Jaffe
No, I am not using serialization with either memory or disk. Dave Jaffe VMware dja...@vmware.com From: Shreya Agarwal Date: Monday, November 7, 2016 at 3:29 PM To: Dave Jaffe, "user@spark.apache.org" Subject: RE: Anomalous Spark RDD persistence behavior I don’t think this is correct. Unless yo

Running

2016-11-08 Thread rurbanow
Hi, I have a small Hadoop, Hive, Spark, Hue setup. I am using Hue to try to run the sample Spark notebook. I get the following error message: The Spark session could not be created in the cluster: timeout or "Session '-1' not found." (error 404). Spark and Livy are up and running. Searching t

Re: mapWithState with Datasets

2016-11-08 Thread Daniel Haviv
Scratch that, it's working fine. Thank you. On Tue, Nov 8, 2016 at 8:19 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I should have used transform instead of map > > val x: DStream[(String, Record)] = > kafkaStream.transform(x=>{x.map(_._2)}).transform(x=>{sqlContext.read.j

Re: mapWithState with Datasets

2016-11-08 Thread Daniel Haviv
Hi, I should have used transform instead of map: val x: DStream[(String, Record)] = kafkaStream.transform(x=>{x.map(_._2)}).transform(x=>{sqlContext.read.json(x).as[Record].map(r=>(r.iid,r)).rdd}) but I'm still unable to call mapWithState on x. Any idea why? Thank you, Daniel On Tue, Nov 8,

mapWithState with Datasets

2016-11-08 Thread Daniel Haviv
Hi, I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]]: val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet) val stateStream: DStream[RDD[(String, Record)]] = kafkaStream.map(x=> { sqlContext.read.json(x._2
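For reference, mapWithState is defined on key-value DStreams rather than on streams of Datasets, which is why the pairing matters here. A minimal sketch, assuming recordStream is the DStream[(String, Record)] from the follow-up above (there called x), with a running count as the state:

import org.apache.spark.streaming.{State, StateSpec}

val updateFunc = (key: String, value: Option[Record], state: State[Long]) => {
  val count = state.getOption.getOrElse(0L) + 1L
  state.update(count)
  (key, count)
}

val stateStream = recordStream.mapWithState(StateSpec.function(updateFunc))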

read large number of files on s3

2016-11-08 Thread Xiaomeng Wan
Hi, We have 30 million small (100k each) files on s3 to process. I am thinking about something like below to load them in parallel: val data = sc.union(sc.wholeTextFiles("s3a://.../*.json").map(...) .toDF().createOrReplaceTempView("data") How do I estimate the driver memory it should be given? Is th
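One way to avoid unioning millions of tiny RDDs, sketched with hypothetical paths and assuming the files share one schema: read once with a partition hint, then parse. The driver still has to hold the file listing, so its memory needs grow roughly with the number of path strings.

val raw = sc.wholeTextFiles("s3a://bucket/prefix/*.json", minPartitions = 10000)
val df  = sqlContext.read.json(raw.values) // contents only; paths dropped
df.createOrReplaceTempView("data")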

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-08 Thread Masood Krohy
No, you do not scale back the predicted value. The output values (labels) were never scaled; only input features were scaled. For prediction on new samples, you scale the new sample first using the avg/std that you calculated for each feature when you trained your model, then feed it to the tra
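A minimal sketch of this with MLlib's StandardScaler, assuming trainingData: RDD[LabeledPoint], a trained model, and a new raw sample vector (all names hypothetical):

import org.apache.spark.mllib.feature.StandardScaler

// Fit the scaler ONCE on the training features...
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(trainingData.map(_.features))

val scaledTraining = trainingData.map(p => p.copy(features = scaler.transform(p.features)))

// ...then reuse the same training statistics at prediction time.
val prediction = model.predict(scaler.transform(newSampleFeatures))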

use case reading files split per id

2016-11-08 Thread ruben
Hey, We have files organized on HDFS in this manner:

base_folder
  |- <ID>
     |- file1
     |- file2
     |- ...
  |- <ID>
     |- file1
     |- file2
     |- ...
  |- ...

We want to be able to do the following operation on our data: - for each ID we want to parse
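A sketch of one way to do that, assuming the ID is the directory name directly under base_folder: wholeTextFiles yields (path, content) pairs, so the ID can be recovered from the path and used as the grouping key.

val perId = sc.wholeTextFiles("hdfs:///base_folder/*/*")
  .map { case (path, content) =>
    val id = path.split("/").dropRight(1).last // parent directory = ID
    (id, content)
  }
  .groupByKey() // or reduceByKey/aggregateByKey to parse incrementally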

Re: Live data visualisations with Spark

2016-11-08 Thread Masood Krohy
+1 for Zeppelin. See https://community.hortonworks.com/articles/10365/apache-zeppelin-and-sparkr.html -- Masood Krohy, Ph.D. Data Scientist, Intact Lab-R&D Intact Financial Corporation http://ca.linkedin.com/in/masoodkh From: Vadim Semenov To: Andrew Ho

Re: Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-08 Thread Arijit
Thanks TD. Is "hdfs.append.support" a standard configuration? I see a seemingly equivalent configuration, "dfs.support.append", that is used in our version of HDFS. In case we want to use a pseudo file-system (like S3) which does not support append, what are our options? I am not familiar with
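For stores without append/flush support, Spark's documented knobs close the WAL file after every write instead of flushing it; a sketch:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
  .set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")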

Re: sanboxing spark executors

2016-11-08 Thread Michael Segel
Not that easy of a problem to solve… Can you impersonate the user who provided the code? I mean, if Joe provides the lambda function, then it runs as Joe, so it has Joe’s permissions. Steve is right: you’d have to get down to your cluster’s security and authenticate the user before accepting

Re: Live data visualisations with Spark

2016-11-08 Thread Vadim Semenov
Take a look at https://zeppelin.apache.org On Tue, Nov 8, 2016 at 11:13 AM, Andrew Holway < andrew.hol...@otternetworks.de> wrote: > Hello, > > A colleague and I are trying to work out the best way to provide live data > visualisations based on Spark. Is it possible to explore a dataset in spark

Live data visualisations with Spark

2016-11-08 Thread Andrew Holway
Hello, A colleague and I are trying to work out the best way to provide live data visualisations based on Spark. Is it possible to explore a dataset in Spark from a web browser? Set up pre-defined functions that the user can click on which return datasets? We are using a lot of R here. Is this som

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-08 Thread Carlo . Allocca
Hi Masood, Thank you again for your suggestion. I have got a question about the following: For prediction on new samples, you need to scale each sample first before making predictions using your trained model. When applying the ML linear model as suggested above, it means that the predicted v

Re: TallSkinnyQR

2016-11-08 Thread Iman Mohtashemi
Thanks Sean! Let me take a look! Iman On Nov 8, 2016 7:29 AM, "Sean Owen" wrote: > I think the problem here is that IndexedRowMatrix.toRowMatrix does *not* > result in a RowMatrix with rows in order of their indices, necessarily: > > // Drop its row indices. > RowMatrix rowMat = indexedRowMatrix

Re: TallSkinnyQR

2016-11-08 Thread Sean Owen
I think the problem here is that IndexedRowMatrix.toRowMatrix does *not* result in a RowMatrix with rows in order of their indices, necessarily: // Drop its row indices. RowMatrix rowMat = indexedRowMatrix.toRowMatrix(); What you get is a matrix where the rows are arranged in whatever order they
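A sketch of the workaround this implies, assuming indexedRowMatrix from the thread: rebuild the RowMatrix from rows explicitly sorted by index, so the row order is deterministic.

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val orderedRows = indexedRowMatrix.rows
  .sortBy(_.index)
  .map(_.vector)

val rowMat = new RowMatrix(orderedRows)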

Re: TallSkinnyQR

2016-11-08 Thread Iman Mohtashemi
So b = [0.89, 0.42, 0.0, 0.88, 0.97]. The solution at the bottom is the solution to Ax = b solved using Gaussian elimination. I guess another question is: is there another way to solve this problem? I'm trying to solve the least squares fit with a huge A (5MM x 1MM): x = inverse(A-transpose*A)*A-transpose

Re: Kafka stream offset management question

2016-11-08 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html specifically http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets Have you set enable.auto.commit to false? The new consumer stores offsets in kafka, so the idea of specifically deleti
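The pattern from the linked page, for completeness, assuming stream is the direct stream and enable.auto.commit is false: commit each batch's offset ranges back to Kafka yourself after processing.

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}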

Re: TallSkinnyQR

2016-11-08 Thread Iman Mohtashemi
Hi Sean, Here you go: sparsematrix.txt (row,col,val):

0,0,.42
0,1,.28
0,2,.89
1,0,.83
1,1,.34
1,2,.42
2,0,.23
3,0,.42
3,1,.98
3,2,.88
4,0,.23
4,1,.36
4,2,.97

The vector is just the third column of the matrix, which should give the trivial solution of [0,0,1]. This translates to this, which is cor
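A sketch of loading those triplets into a distributed matrix (assumes the header line has been stripped first):

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries = sc.textFile("sparsematrix.txt").map { line =>
  val Array(i, j, v) = line.split(",")
  MatrixEntry(i.trim.toLong, j.trim.toLong, v.trim.toDouble)
}
val mat = new CoordinateMatrix(entries)
val indexed = mat.toIndexedRowMatrix() // go via toRowMatrix from here for tallSkinnyQR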

Spark streaming uses lesser number of executors

2016-11-08 Thread Aravindh
Hi, I am using Spark Streaming to process some events. It is deployed in standalone mode with 1 master and 3 workers. I have set the number of cores per executor to 4 and the total number of cores to 24. This means 6 executors in total will be spawned. I have set spread-out to true, so each worker machine get
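In standalone mode the executor count is derived rather than set directly: roughly spark.cores.max divided by spark.executor.cores per application. A sketch of the relevant settings, with the thread's numbers:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cores.max", "24")      // total cores for the application
  .set("spark.executor.cores", "4")  // cores per executor => ~6 executors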

Re: DataSet toJson

2016-11-08 Thread Andrés Ivaldi
Ok, digging through the code, I found the following method in the class JacksonGenerator:

private def writeFields(
    row: InternalRow,
    schema: StructType,
    fieldWriters: Seq[ValueWriter]): Unit = {
  var i = 0
  while (i < row.numFields) {
    val field = schema(i)
    if (!row.isNullAt(i)) {
      gen.writeF

how to write a substring search efficiently?

2016-11-08 Thread Haig Didizian
Hello! I have two datasets: one of short strings, one of longer strings. Some of the longer strings contain the short strings, and I need to identify which. What I've written is taking forever to run (pushing 11 hours on my quad-core i5 with 12 GB RAM) and appears to be CPU-bound. The way I've w
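When one side is small enough to collect, the usual fix is to broadcast it and scan the long strings once instead of comparing pairs through a join; a sketch with hypothetical names:

// shortDS fits in driver memory; longRDD holds the long strings.
val shortStrings = shortDS.collect().toSeq
val bShort = sc.broadcast(shortStrings)

val matches = longRDD.flatMap { long =>
  bShort.value.filter(long.contains).map(short => (short, long))
}

If it stays CPU-bound after that, an Aho-Corasick automaton built over the short strings turns each long string into a single linear scan.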

DataSet toJson

2016-11-08 Thread Andrés Ivaldi
Hello, I'm using Spark 2.0 and the toJSON method. I've seen that null values are omitted in the JSON record, which is valid, but I need the field present with null as its value. Is it possible to configure that? Thanks.
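One workaround, sketched here, is to substitute sentinels before serializing so the keys stay present; note this changes the values themselves, so truly null-preserving output needs custom serialization:

// fill null strings with "" and null numerics with 0, then serialize
val jsonWithKeys = df.na.fill("").na.fill(0).toJSON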

Kafka stream offset management question

2016-11-08 Thread Haopu Wang
I'm using the Kafka direct stream (auto.offset.reset = earliest) with Spark Streaming's checkpointing enabled. The application starts and consumes messages correctly. Then I stop the application and clean the checkpoint folder. I restart the application and expect it to consume old messages. But it

RE: InvalidClassException when load KafkaDirectStream from checkpoint (Spark 2.0.0)

2016-11-08 Thread Haopu Wang
It turns out to be a bug in application code. Thank you! From: Haopu Wang Sent: November 4, 2016 17:23 To: user@spark.apache.org; Cody Koeninger Subject: InvalidClassException when load KafkaDirectStream from checkpoint (Spark 2.0.0) When I load spark checkpoin