Hi,
I am running this on a graph with >5B edges and >3B vertices and have 2 questions -
1. What is the optimal number of iterations?
2. I am running it for 1 iteration right now on a beefy 100 node cluster,
with 300 executors each having 30GB RAM and 5 cores. I have persisted the graph
to M
Hi fellow users,
Has anyone ever used splines or smoothing kernels for linear regression in
Spark? If not, does anyone have ideas on how this can be done or what
suitable alternatives exist? I am on Spark 1.6.1 with python.
Thanks,
Tobi
Hi Kiran,
Thanks for responding. We would like to know how the industry is dealing with
scenarios like UPDATE in Spark. Here is our scenario, Manjunath: we are in the
process of migrating our SQL Server data to Spark. We have our logic in stored
procedures, where we dynamically create SQL strings and execute t
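Not from the thread, but as a hedged illustration of one pattern often used to emulate UPDATE in Spark: read the current data, join it with the incoming changes, prefer the new values, and rewrite the output. The paths and column names below are invented.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}
object UpdateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("update-sketch").getOrCreate()
    // Hypothetical inputs: "id" is the key, "amount" is the column being updated.
    val current = spark.read.parquet("/data/accounts")            // existing data
    val updates = spark.read.parquet("/staging/account_updates")  // changed rows only
    // Left join and prefer the updated value where one exists.
    val merged = current.as("c")
      .join(updates.as("u"), col("c.id") === col("u.id"), "left_outer")
      .select(
        col("c.id").as("id"),
        coalesce(col("u.amount"), col("c.amount")).as("amount"))
    // Write to a new location (or overwrite), since files on HDFS/S3 are immutable.
    merged.write.mode("overwrite").parquet("/data/accounts_v2")
  }
}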
Is it possible to preserve the single tokens while using the n-gram feature
transformer?
e.g.
Array("Hi", "I", "heard", "about", "Spark")
Becomes
Array("Hi", "i", "heard", "about", "Spark", "Hi i", "I heard", "heard
about", "about Spark")
Currently if I want to do it I will have to manually transform c
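Not part of the original mail, but a minimal sketch of one way to keep the unigrams alongside the bigrams: run NGram on the token column and concatenate the two arrays with a small UDF (column names here are made up).
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
object UnigramPlusBigram {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ngram-sketch").getOrCreate()
    import spark.implicits._
    val df = Seq(Tuple1(Array("Hi", "I", "heard", "about", "Spark"))).toDF("tokens")
    // The standard NGram transformer produces only the bigrams.
    val ngram = new NGram().setN(2).setInputCol("tokens").setOutputCol("bigrams")
    val withBigrams = ngram.transform(df)
    // Concatenate the original tokens with the generated bigrams.
    val concat = udf((a: Seq[String], b: Seq[String]) => a ++ b)
    val result = withBigrams.withColumn("features", concat($"tokens", $"bigrams"))
    result.select("features").show(truncate = false)
  }
}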
That's around 750 MB/s, which seems quite respectable even in this day and age!
How many and what kind of disks do you have attached to your nodes? What
are you expecting?
On Tue, Nov 8, 2016 at 11:08 PM, Elf Of Lothlorein
wrote:
> Hi
> I am trying to save a RDD to disk and I am using the
> saveAs
Hi
I am trying to save an RDD to disk and I am using saveAsNewAPIHadoopFile
for that. I am seeing that it takes almost 20 mins for about 900 GB of
data. Is there any parameter that I can tune to make this saving faster?
I am running about 45 executors with 5 cores each on 5 Spark worker nodes
an
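Not from the thread, but two knobs often worth checking when a bulk save looks slow: the number of output partitions (one file and one write task per partition) and output compression passed in via a Hadoop Configuration. A rough sketch, with made-up paths and key/value types:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
object SaveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-sketch"))
    // Repartition before converting to Writables; one output file per partition,
    // so matching this to the total executor cores (e.g. 45 * 5) keeps all cores busy.
    val lines = sc.textFile("/input").repartition(225)
    val data = lines.map(line => (NullWritable.get(), new Text(line)))
    // Enable output compression so fewer bytes hit the disks.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true")
    hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec",
      "org.apache.hadoop.io.compress.SnappyCodec")
    data.saveAsNewAPIHadoopFile(
      "/output",
      classOf[NullWritable],
      classOf[Text],
      classOf[TextOutputFormat[NullWritable, Text]],
      hadoopConf)
  }
}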
Hi,
Can you please tell me how Parquet partitions the data while saving a
DataFrame?
I have a dataframe which contains 10 values like below
+---------+
|field_num|
+---------+
|      139|
|      140|
|       40|
|       41|
|      148|
|      149|
|      151
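Not part of the question, but for context: by default the Parquet writer creates one part-file per partition of the DataFrame (one per write task), while partitionBy(column) instead creates one sub-directory per distinct value of that column. A small hedged sketch using made-up paths and the values shown above:
import org.apache.spark.sql.SparkSession
object ParquetPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-sketch").getOrCreate()
    import spark.implicits._
    val df = Seq(139, 140, 40, 41, 148, 149, 151).toDF("field_num")
    // One part-file per partition of the DataFrame:
    df.repartition(2).write.parquet("/tmp/by_partition")   // -> 2 part files
    // One sub-directory per distinct field_num value:
    df.write.partitionBy("field_num").parquet("/tmp/by_value")
    // -> /tmp/by_value/field_num=139/..., /tmp/by_value/field_num=140/..., etc.
  }
}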
Hello,
I am trying out the MultilayerPerceptronClassifier and it takes only a
dataframe in its train method. Now the problem is that I have a training RDD
of (x, y) pairs, with x and y being matrices. x has dimensions (1, 401) while
y has dimensions (1, 10). I need to convert the training RDD to a datafram
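Not part of the question, but a rough sketch of the conversion that is usually needed: MultilayerPerceptronClassifier expects a DataFrame with a features Vector column and a numeric label column, so each 1x401 x matrix can be flattened into a dense vector and each 1x10 y matrix collapsed to its argmax (assuming y is a one-hot label; the RDD and layer sizes below are placeholders).
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object MlpInputSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mlp-sketch").getOrCreate()
    import spark.implicits._
    // Hypothetical training RDD of (x, y) matrix pairs, as described in the mail.
    val trainRDD: RDD[(Matrix, Matrix)] = ???
    // Flatten x into a feature vector; take the argmax of the one-hot y as the label.
    val trainDF = trainRDD.map { case (x, y) =>
      val features = Vectors.dense(x.toArray)                    // 401 values
      val label = y.toArray.zipWithIndex.maxBy(_._1)._2.toDouble // class index 0..9
      (label, features)
    }.toDF("label", "features")
    val layers = Array(401, 25, 10)   // hidden layer size is an arbitrary example
    val mlp = new MultilayerPerceptronClassifier().setLayers(layers)
    val model = mlp.fit(trainDF)
  }
}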
Hi all,
I’m doing a quick lit review.
Consider I have a graph that has link weights dependent on time. I.e., a bus on
this road gives a journey time (link weight) of x at time y. This is a classic
public transport shortest path problem.
This is a weighted directed graph that is time dependent
Have you tried this?
https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.graphx.GraphLoader$
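For reference, a minimal hedged sketch of how GraphLoader.edgeListFile is typically used (the path is made up):
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.{SparkConf, SparkContext}
object GraphLoaderSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphloader-sketch"))
    // Expects a text file of "srcId dstId" pairs, one edge per line.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")
  }
}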
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action
No, I am not using serialization with either memory or disk.
Dave Jaffe
VMware
dja...@vmware.com
From: Shreya Agarwal
Date: Monday, November 7, 2016 at 3:29 PM
To: Dave Jaffe , "user@spark.apache.org"
Subject: RE: Anomalous Spark RDD persistence behavior
I don’t think this is correct. Unless yo
Hi,
I have a little Hadoop, Hive, Spark, Hue setup. I am using Hue to try to run
the sample Spark notebook. I get the following error messages:
The Spark session could not be created in the cluster: timeout
or
"Session '-1' not found." (error 404)
Spark and Livy are up and running. Searching t
Scratch that, it's working fine.
Thank you.
On Tue, Nov 8, 2016 at 8:19 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> I should have used transform instead of map
>
> val x: DStream[(String, Record)] =
> kafkaStream.transform(x=>{x.map(_._2)}).transform(x=>{sqlContext.read.j
Hi,
I should have used transform instead of map
val x: DStream[(String, Record)] =
  kafkaStream.transform(x => x.map(_._2))
    .transform(x => sqlContext.read.json(x).as[Record].map(r => (r.iid, r)).rdd)
but I'm still unable to call mapWithState on x.
Any idea why?
Thank you,
Daniel
On Tue, Nov 8,
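Not from the thread, but for context, mapWithState normally hangs off a key/value DStream via a StateSpec; a minimal hedged sketch (the Record type and the state logic here are invented, and checkpointing must be enabled on the StreamingContext):
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream
// Hypothetical Record type standing in for the one in the thread.
case class Record(iid: String, value: Long)
object MapWithStateSketch {
  // x is a DStream[(String, Record)] like the one built in the mail above.
  // Note: mapWithState requires ssc.checkpoint(...) to be set.
  def addState(x: DStream[(String, Record)]): DStream[(String, Long)] = {
    val spec = StateSpec.function(
      (key: String, record: Option[Record], state: State[Long]) => {
        // Invented logic: keep a running count of records per key.
        val count = state.getOption.getOrElse(0L) + (if (record.isDefined) 1L else 0L)
        state.update(count)
        (key, count)
      })
    x.mapWithState(spec)
  }
}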
Hi,
I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] :
val kafkaStream = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val stateStream: DStream[RDD[(String, Record)]] = kafkaStream.map(x=>
{ sqlContext.read.json(x._2
Hi,
We have 30 million small (100 KB each) files on S3 to process. I am thinking
about something like the below to load them in parallel:
val data = sc.union(sc.wholeTextFiles("s3a://.../*.json").map(...))
  .toDF().createOrReplaceTempView("data")
How do I estimate the driver memory it should be given? Is th
No, you do not scale back the predicted value. The output values (labels)
were never scaled; only input features were scaled.
For prediction on new samples, you scale the new sample first using the
avg/std that you calculated for each feature when you trained your model,
then feed it to the tra
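A hedged sketch of the pattern being described, using StandardScaler so that the same mean/std computed on the training features is applied to new samples (the data and column names are invented):
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
object ScalingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scaling-sketch").getOrCreate()
    import spark.implicits._
    // Hypothetical training and new-sample data.
    val train = Seq((1.0, Vectors.dense(10.0, 200.0)),
                    (2.0, Vectors.dense(12.0, 180.0))).toDF("label", "features")
    val newSamples = Seq(Tuple1(Vectors.dense(11.0, 190.0))).toDF("features")
    // Fit the scaler on the training features only; labels stay untouched.
    val scaler = new StandardScaler()
      .setInputCol("features").setOutputCol("scaledFeatures")
      .setWithMean(true).setWithStd(true)
    val scalerModel = scaler.fit(train)
    // Apply the *training* statistics to both sets before training / predicting.
    val scaledTrain = scalerModel.transform(train)
    val scaledNew   = scalerModel.transform(newSamples)
    scaledNew.show(truncate = false)
  }
}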
Hey,
We have files organized on HDFS in this manner:
base_folder
|- <id_1>
|  |- file1
|  |- file2
|  |- ...
|- <id_2>
|  |- file1
|  |- file2
|  |- ...
|- ...
We want to be able to do the following operation on our data:
- for each ID we want to parse
+1 for Zeppelin.
See
https://community.hortonworks.com/articles/10365/apache-zeppelin-and-sparkr.html
--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh
From: Vadim Semenov
To: Andrew Ho
Thanks TD.
Is "hdfs.append.support" a standard configuration? I see a seemingly equivalent
configuration "dfs.support.append" that is used in our version of HDFS.
In case we want to use a pseudo file-system (like S3) which does not support
append what are our options? I am not familiar with
Not that easy of a problem to solve…
Can you impersonate the user who provided the code?
I mean, if Joe provides the lambda function, then it runs as Joe, so it has Joe's
permissions.
Steve is right, you’d have to get down to your cluster’s security and
authenticate the user before accepting
Take a look at https://zeppelin.apache.org
On Tue, Nov 8, 2016 at 11:13 AM, Andrew Holway <
andrew.hol...@otternetworks.de> wrote:
> Hello,
>
> A colleague and I are trying to work out the best way to provide live data
> visualisations based on Spark. Is it possible to explore a dataset in spark
Hello,
A colleague and I are trying to work out the best way to provide live data
visualisations based on Spark. Is it possible to explore a dataset in spark
from a web browser? Set up predefined functions that the user can click on
which return datasets?
We are using a lot of R here. Is this som
Hi Masood,
Thank you again for your suggestion.
I have got a question about the following:
For prediction on new samples, you need to scale each sample first before
making predictions using your trained model.
When applying the ML linear model as suggested above, it means that the
predicted v
Thanks Sean! Let me take a look!
Iman
On Nov 8, 2016 7:29 AM, "Sean Owen" wrote:
> I think the problem here is that IndexedRowMatrix.toRowMatrix does *not*
> result in a RowMatrix with rows in order of their indices, necessarily:
>
> // Drop its row indices.
> RowMatrix rowMat = indexedRowMatrix
I think the problem here is that IndexedRowMatrix.toRowMatrix does *not*
result in a RowMatrix with rows in order of their indices, necessarily:
// Drop its row indices.
RowMatrix rowMat = indexedRowMatrix.toRowMatrix();
What you get is a matrix where the rows are arranged in whatever order they
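Not from the thread, but one hedged workaround if row order matters: sort the indexed rows by index and build the RowMatrix yourself rather than relying on toRowMatrix.
import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, RowMatrix}
object OrderedRowMatrixSketch {
  // Rebuild a RowMatrix whose rows follow the original indices.
  def toOrderedRowMatrix(m: IndexedRowMatrix): RowMatrix = {
    val sortedRows = m.rows.sortBy(_.index).map(_.vector)
    new RowMatrix(sortedRows)
  }
}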
So
b =
0.89
0.42
0.0
0.88
0.97
The solution at the bottom is the solution to Ax = b, solved using Gaussian
elimination. I guess another question is: is there another way to solve
this problem? I'm trying to solve the least-squares fit with a huge A (5MM
x 1MM):
x = inverse(A-transpose * A) * A-transpose * b
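Not part of the mail, but at 5MM x 1MM forming inverse(A-transpose * A) explicitly is usually impractical; a hedged alternative is to phrase it as an unregularised, intercept-free linear regression and let Spark ML minimise ||Ax - b||. The tiny DataFrame below just reuses the 5x3 matrix and b vector quoted elsewhere in this thread.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
object LeastSquaresSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lsq-sketch").getOrCreate()
    import spark.implicits._
    // Stand-in for A and b: label = b_i, features = row i of A.
    val data = Seq(
      (0.89, Vectors.dense(0.42, 0.28, 0.89)),
      (0.42, Vectors.dense(0.83, 0.34, 0.42)),
      (0.0,  Vectors.dense(0.23, 0.0,  0.0)),
      (0.88, Vectors.dense(0.42, 0.98, 0.88)),
      (0.97, Vectors.dense(0.23, 0.36, 0.97))
    ).toDF("label", "features")
    // No intercept and no regularisation => plain least squares on Ax = b.
    val lr = new LinearRegression().setFitIntercept(false).setRegParam(0.0)
    val model = lr.fit(data)
    println(model.coefficients)   // expected to be close to [0, 0, 1] for this data
  }
}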
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
specifically
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets
Have you set enable.auto.commit to false?
The new consumer stores offsets in kafka, so the idea of specifically
deleti
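For reference, the commit-the-offsets-yourself pattern from that page looks roughly like this (the stream and the per-batch processing are placeholders):
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
object OffsetCommitSketch {
  // stream is the direct stream created with KafkaUtils.createDirectStream
  // and "enable.auto.commit" -> "false" in the Kafka params.
  def commitAfterProcessing(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch here ...
      // Commit to Kafka only after the batch has been handled.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  }
}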
Hi Sean,
Here you go:
sparsematrix.txt =
row, col, val
0,0,.42
0,1,.28
0,2,.89
1,0,.83
1,1,.34
1,2,.42
2,0,.23
3,0,.42
3,1,.98
3,2,.88
4,0,.23
4,1,.36
4,2,.97
The vector is just the third column of the matrix which should give the
trivial solution of [0,0,1]
This translates to this which is cor
Hi, I am using Spark Streaming to process some events. It is deployed in
standalone mode with 1 master and 3 workers. I have set the number of cores per
executor to 4 and the total number of cores to 24. This means 6 executors in
total will be spawned. I have set spread-out to true, so each worker
machine get
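Not from the mail, but the setup described above roughly corresponds to these settings; a hedged sketch, with spark.deploy.spreadOut noted as a master-side option rather than an application one:
import org.apache.spark.{SparkConf, SparkContext}
object StandaloneConfSketch {
  def main(args: Array[String]): Unit = {
    // Mirrors the description: 4 cores per executor, 24 cores in total
    // => up to 6 executors across the 3 workers.
    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.executor.cores", "4")
      .set("spark.cores.max", "24")
    // spark.deploy.spreadOut=true is a *master* setting (spark-defaults.conf on
    // the master), so executors are spread across workers rather than packed.
    val sc = new SparkContext(conf)
  }
}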
OK, digging through the code, I found the following method in the
JacksonGenerator class:
private def writeFields(
    row: InternalRow, schema: StructType, fieldWriters: Seq[ValueWriter]): Unit = {
  var i = 0
  while (i < row.numFields) {
    val field = schema(i)
    if (!row.isNullAt(i)) {
      gen.writeF
Hello!
I have two datasets -- one of short strings, one of longer strings. Some of
the longer strings contain the short strings, and I need to identify which.
What I've written is taking forever to run (pushing 11 hours on my quad
core i5 with 12 GB RAM), appearing to be CPU bound. The way I've w
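Not from the mail, but one hedged way to structure this kind of matching when the set of short strings fits in memory: broadcast it once and scan each long string against it on the executors, rather than driving all comparisons from one place. The datasets and column names below are invented.
import org.apache.spark.sql.SparkSession
object ContainsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("contains-sketch").getOrCreate()
    import spark.implicits._
    // Hypothetical inputs.
    val shortStrings = Seq("spark", "kafka", "parquet").toDS()
    val longStrings = Seq("we use spark and parquet", "plain text").toDS()
    // Broadcast the (small) short-string set to every executor.
    val shortBc = spark.sparkContext.broadcast(shortStrings.collect())
    // For each long string, emit the short strings it contains.
    val matches = longStrings.flatMap { long =>
      shortBc.value.collect { case short if long.contains(short) => (long, short) }
    }.toDF("long_string", "short_string")
    matches.show(truncate = false)
  }
}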
Hello, I'm using Spark 2.0 and the toJSON method. I've seen that null
values are omitted in the JSON record, which is valid, but I need the field
with null as its value. Is it possible to configure that?
thanks.
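Not from the thread, and not an official switch I know of in 2.0, but a hedged workaround for flat schemas of primitive columns: build the JSON from each row's value map with json4s (which Spark already bundles), on the assumption that json4s writes null map values as JSON null rather than dropping them. Nested structs would need more care.
import org.apache.spark.sql.{DataFrame, Dataset}
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization
object ToJsonWithNulls {
  // Intended for flat schemas only; column order follows the schema's field names.
  def toJsonKeepingNulls(df: DataFrame): Dataset[String] = {
    import df.sparkSession.implicits._
    val fieldNames = df.schema.fieldNames
    df.map { row =>
      implicit val formats: DefaultFormats.type = DefaultFormats
      Serialization.write(row.getValuesMap[Any](fieldNames))
    }
  }
}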
I'm using the Kafka direct stream (auto.offset.reset = earliest) and have enabled
Spark Streaming's checkpointing.
The application starts and consumes messages correctly. Then I stop the
application and clean the checkpoint folder.
I restart the application and expect it to consume old messages. But it
It turned out to be a bug in the application code. Thank you!
From: Haopu Wang
Sent: November 4, 2016 17:23
To: user@spark.apache.org; Cody Koeninger
Subject: InvalidClassException when load KafkaDirectStream from checkpoint
(Spark 2.0.0)
When I load spark checkpoin