Hi,
I intend to use the Spark Thrift Server as a service to support concurrent SQL
queries. But in our situation we need a way to kill an arbitrary query job; is
there an API we can use for this?
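To illustrate what we are after, here is a rough sketch using the job-group API on a
plain SparkContext (run_query/kill_query and query_id are hypothetical names); we are
looking for something equivalent exposed by the Thrift Server:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-service").getOrCreate()
sc = spark.sparkContext

def run_query(query_id, sql_text):
    # all jobs launched from this thread get tagged with query_id
    sc.setJobGroup(query_id, sql_text, interruptOnCancel=True)
    return spark.sql(sql_text).collect()

def kill_query(query_id):
    # cancels every running job carrying this group id
    sc.cancelJobGroup(query_id)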
It would be great if you can elaborate on the bulk provisioning use case.
Regards,
Sourav
On Sun, Nov 26, 2017 at 11:53 PM, shankar.roy wrote:
> This would be a useful feature.
> We can leverage it while doing bulk provisioning.
@sathich
Here are my thoughts on your points -
1. Yes, this should be able to handle any complex JSON structure returned by
the target REST API. Essentially, what it would return is Rows of that
complex structure. One can then use Spark SQL to flatten it further using
functions like inline
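A minimal sketch of that flattening step (the schema and column names below are
invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [({"id": 1, "items": [{"name": "a", "qty": 2}, {"name": "b", "qty": 5}]},)],
    "response struct<id:int, items:array<struct<name:string, qty:int>>>")

# explode() gives one row per array element ...
flat = df.select(col("response.id"), explode("response.items").alias("item")) \
         .select("id", "item.name", "item.qty")

# ... and the SQL inline() generator expands an array of structs into columns
flat_sql = df.selectExpr("response.id", "inline(response.items)")
flat_sql.show()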
Hi,
Scanning 10 TB in Athena would cost about $50. If your data is in Parquet, then it will
cost even less because of its columnar layout. So I am genuinely not quite
sure what you are referring to. Also, what do you mean by "I currently
need"? Are you already processing the data?
Since you mentioned that you ar
I know that the algorithm itself is not able to extract features for a user
it was not trained on; however, I'm trying to find a way to compare
users for similarity, so that when I find a user who is really similar to
another user, I can just use the similar user's recommendations until the
oth
I'm trying to use the MatrixFactorizationModel to, for instance, determine
the latent factors of a user or item that were not used in the training
data of the model. I'm not as concerned about the rating as I am with the
latent factors for the user/item.
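For what it's worth, the factors for users that were in the training set can be pulled
out and compared; a sketch (model, user_a, user_b are assumed to already exist, and this
does not cover users missing from training):

import numpy as np

def user_factors(model, user_id):
    # userFeatures() is an RDD of (userId, latent_factor_array)
    return np.array(model.userFeatures().lookup(user_id)[0])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

similarity = cosine(user_factors(model, user_a), user_factors(model, user_b))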
Thanks!
Hi
Simply because you have to pay an EMR fee on top of every instance hour. I currently
need about 4800 hours of r3.2xlarge; EMR takes $0.18 per instance hour, so it would be
$864 just in EMR costs (spot prices are around $0.12/h).
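Spelled out, the $864 is just the surcharge itself:

hours = 4800
emr_fee = 0.18   # USD per r3.2xlarge instance-hour on EMR, as quoted above
spot = 0.12      # USD per instance-hour spot price, as quoted above
print(hours * emr_fee)            # 864.0  -> EMR fee alone
print(hours * (emr_fee + spot))   # 1440.0 -> EMR fee plus spot compute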
Just to stay on topic, I thought about getting 40 i2.xlarge instances, which
have about 1T
What's the number of executors and/or the number of partitions you are working with?
I'm afraid most of the problem is the serialization/deserialization
overhead between the JVM and R...
From: Kunft, Andreas
Sent: Monday, November 27, 2017 10:27:33 AM
To: user@spark
Hello,
I tried to execute some user-defined functions with R using the airline arrival
performance dataset.
While the examples from the documentation for the `<-` apply operator work
perfectly fine on data of ~9 GB,
the `dapply` operator fails to finish even after ~4 hours.
I'm using a functi
Hi,
I think that I have mentioned all the required alternatives. However, I am
quite curious as to how you concluded that processing using EMR is going
to be more expensive than using any other stack. I have been using EMR
for the last 6 years (almost from the time it came out), and have always
f
I don't use EMR; I spin my clusters up using flintrock (being a student, my
budget is slim), my code is written in PySpark, and my data is in the
us-east-1 region (N. Virginia). I will do my best to explain it with tables:
My input, around 10 TB in size, sits in multiple (~150) Parquet files on S3
+--
You are essentially doing document clustering. K-means will do it, but you do have to
specify the number of clusters up front.
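If it helps, a minimal Spark ML sketch of TF-IDF plus K-means (assuming a DataFrame
docs_df with a text column; adapt the featurization to however you already build your
vectors, and k is the up-front guess):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.clustering import KMeans

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),
    KMeans(k=20, seed=1),     # number of clusters must be chosen up front
])

model = pipeline.fit(docs_df)
clustered = model.transform(docs_df)   # adds a `prediction` column with cluster ids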
From: "Donni Khan"
mailto:prince.don...@googlemail.com>>
Date: Monday, November 27, 2017 at 7:27:33
I have a Spark job to compute the similarity between text documents:
RowMatrix rowMatrix = new RowMatrix(vectorsRDD.rdd());
CoordinateMatrix rowSimilarity = rowMatrix.columnSimilarities(0.5);
JavaRDD<MatrixEntry> entries = rowSimilarity.entries().toJavaRDD();
List<MatrixEntry> list = entries.collect();
for (MatrixEntry s : list)
Hi,
it would be much simpler if you just provided two tables with samples of the
input and output. Going through the verbose text and trying to
read and figure out what is happening is a bit daunting.
Personally, given that you have your entire data in Parquet, I do not think
that you will ne
I have a temporary result file (the 10 TB one) that looks like this:
I have around 3 billion rows of (url, url_list, language, vector, text). The
bulk of the data is in url_list, and at the moment I can only guess how large
url_list is. I want to give an ID to every url and then assign this ID to every
url in url_list
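One way this could be sketched in PySpark (assuming a DataFrame df with columns
url, url_list, language, vector, text, where url_list is an array of strings):

from pyspark.sql import functions as F

# one id per distinct url, including urls that only occur inside url_list
urls = (df.select(F.explode("url_list").alias("url"))
          .union(df.select("url"))
          .distinct()
          .withColumn("url_id", F.monotonically_increasing_id())
          .cache())   # cache so both joins below see the same ids

# attach the id to each row's own url
with_id = df.join(urls, on="url", how="left")

# resolve every url inside url_list to its id via explode + join + regroup
exploded = with_id.select("url_id", F.explode("url_list").alias("list_url"))
lookup = urls.withColumnRenamed("url", "list_url").withColumnRenamed("url_id", "list_url_id")
resolved = (exploded.join(lookup, on="list_url", how="left")
                    .groupBy("url_id")
                    .agg(F.collect_list("list_url_id").alias("url_id_list")))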
How many columns do you need from the big file?
Also, how CPU/memory intensive are the computations you want to perform?
Alexander Czech wrote on Mon., Nov. 27, 2017 at 10:57:
> I want to load a 10TB parquet File from S3 and I'm trying to decide what
> EC2 instances to use.
>
> Should I go for
I want to load a 10 TB Parquet file from S3, and I'm trying to decide which
EC2 instances to use.
Should I go for instances that in total have more memory than 10 TB? Or is it
enough that they have enough SSD storage in total so that everything can be
spilled to disk?
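To make the disk option concrete, this is roughly what I have in mind (bucket path and
column names are placeholders, and I would only read the columns I actually need):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3a://my-bucket/input/")   # lazy; nothing is pulled yet
df = df.select("url", "url_list")                   # column pruning cuts what gets read
df.persist(StorageLevel.MEMORY_AND_DISK)            # partitions that don't fit in RAM go to local disk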
thanks