Hi,
If you are using the pipeline API, you do not need to map features back to
documents.
Your input (which is the document text) won't change after you apply
HashingTF.
If you want to do information retrieval with Spark, I suggest using RDDs
rather than the pipeline...
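For illustration, here is a minimal sketch (the column names and toy documents below are made up, not from the original thread) showing that the text column is still present next to the features after Tokenizer + HashingTF:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

object HashingTfDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tf-demo").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Two toy documents; "id" and "text" are placeholder column names.
    val docs = sqlContext.createDataFrame(Seq(
      (0L, "spark is fast"),
      (1L, "hashing tf keeps the text column")
    )).toDF("id", "text")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)

    // The transformed DataFrame still carries "id" and "text" alongside "features",
    // so each feature vector stays attached to the document it came from.
    val featurized = tf.transform(tokenizer.transform(docs))
    featurized.select("id", "text", "features").show()

    sc.stop()
  }
}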
On Fri, Jan 1, 2016 at 2:20 AM
Hi
I am working on a proof of concept. I am trying to use Spark to classify some
documents. I am using Tokenizer and HashingTF to convert the documents into
vectors. Is there any easy way to map features back to words, or do I need to
maintain the reverse index myself? I realize there is a chance so
Dear All,
I'm trying to implement a procedure that iteratively updates an RDD
using results from GaussianMixtureModel.predictSoft. In order to avoid
problems with the local variable (the obtained GMM) being overwritten in
each pass of the loop, I'm doing the following:
##
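A minimal sketch of the pattern being described (the data, k, and the update rule below are placeholders; the point is that each pass binds the freshly trained model and its soft predictions to local vals before they are used):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

object IterativeGmmSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-gmm").setMaster("local[2]"))
    var data = sc.parallelize(Seq.fill(1000)(Vectors.dense(math.random, math.random))).cache()

    for (i <- 1 to 3) {
      // A val keeps this pass's model in its own binding, so the closure below
      // never sees a reference that a later pass overwrites.
      val gmm = new GaussianMixture().setK(2).run(data)
      val soft = gmm.predictSoft(data)                 // RDD[Array[Double]]
      val updated = data.zip(soft).map { case (v, probs) =>
        Vectors.dense(v.toArray.map(_ * probs.max))    // placeholder update rule
      }.cache()
      updated.count()                                  // materialize before dropping the old RDD
      data.unpersist()
      data = updated
    }
    sc.stop()
  }
}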
I want to add AWS credentials to hdfs-site.xml and pass a different xml file
for different users.
Thank you,
Konstantin Kudryavtsev
On Thu, Dec 31, 2015 at 2:19 PM, Ted Yu wrote:
> Check out --conf option for spark-submit
>
> bq. to configure different hdfs-site.xml
>
> What config parameters do you
Check out --conf option for spark-submit
bq. to configure different hdfs-site.xml
What config parameters do you plan to change in hdfs-site.xml?
If the parameter only affects the HDFS NameNode / DataNode, passing
hdfs-site.xml wouldn't take effect, right?
Cheers
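For example (a sketch only; the property keys depend on which S3 connector is in use, and the class/jar names are placeholders), client-side Hadoop settings can be passed per submission with the spark.hadoop.* prefix, which Spark copies into the job's Hadoop Configuration:

spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=USER_A_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=USER_A_SECRET_KEY \
  --class com.example.MyJob \
  my-job.jar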
On Thu, Dec 31, 2015 at 10:48 AM, KOSTIANTYN K
Hi Jerry,
what you suggested seems to be working (I put hdfs-site.xml into the
$SPARK_HOME/conf folder), but could you shed some light on how it can be
federated per user?
Thanks in advance!
Thank you,
Konstantin Kudryavtsev
On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam wrote:
> Hi Kostiantyn,
>
> I
Hi all,
I'm trying to use a different spark-defaults.conf per user, i.e. I want to
have spark-user1.conf, etc. Is there a way to pass a path to the appropriate
conf file when I'm using a standalone Spark installation?
Also, is it possible to configure a different hdfs-site.xml and pass it as
well with spark-
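One possible approach (a sketch, not necessarily what was suggested later in the thread): spark-submit accepts --properties-file, so each user can keep their own copy of the defaults; the path, class and jar below are placeholders:

spark-submit \
  --properties-file /home/user1/spark-user1.conf \
  --class com.example.MyJob \
  my-job.jar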
For your second question,
bq. Class is not registered: scala.Tuple3[]
The IllegalArgumentException above states the class Kryo was expecting to be
registered, meaning the types of the components in the tuple are insignificant.
BTW what Spark release are you using ?
Cheers
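For reference, a sketch of the registration that message points at (assuming spark.kryo.registrationRequired is set, which is what produces the error; "scala.Tuple3[]" denotes an array of Tuple3, so it is the array class that needs registering, and the element types are erased anyway):

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

object Tuple3Registration {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.kryo.registrationRequired", "true")
      // Register the array-of-Tuple3 class; Tuple3 itself is normally covered
      // by the Scala registrations Spark installs by default.
      .registerKryoClasses(Array(classOf[Array[Tuple3[_, _, _]]]))

    val ser = new KryoSerializer(conf).newInstance()
    val original = Array((1, "a", 2.0), (2, "b", 3.0))
    val copy = ser.deserialize[Array[(Int, String, Double)]](ser.serialize(original))
    println(copy.mkString(", "))
  }
}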
On Thu, Dec 31, 2015 at 9:
The ScalaTest code that is enclosed at the end of this email message
demonstrates what appears to be a bug in the KryoSerializer. This code was
executed from IntelliJ IDEA (Community Edition) under Mac OS X 10.11.2.
The KryoSerializer is enabled by updating the original SparkContext (that is
su
The key to efficient lookups is having a partitioner in place.
If you don't have a partitioner in place, essentially the best you can do
is:
def contains[T](rdd: RDD[T], value: T): Boolean =
  !rdd.filter(x => x == value).isEmpty()
If you are going to do this sort of operation frequently, it might
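A sketch of the partitioner-based alternative hinted at above (assuming the values are keyed once up front and the keyed RDD is cached and reused for many lookups):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionedContains {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lookup-demo").setMaster("local[2]"))
    val values = sc.parallelize(1L to 10000000L)

    // Key the values once and hash-partition them; lookup() then only scans
    // the single partition the key hashes to instead of the whole RDD.
    val keyed = values.map(v => (v, ())).partitionBy(new HashPartitioner(64)).cache()
    keyed.count()  // materialize before querying

    def contains(value: Long): Boolean = keyed.lookup(value).nonEmpty

    println(contains(42L))   // true
    println(contains(-1L))   // false
    sc.stop()
  }
}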
Hello,
Spark collects HDFS read/write metrics per application/job; see details at
http://spark.apache.org/docs/latest/monitoring.html.
I have connected the Spark metrics to Graphite and display nice graphs in
Grafana.
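For reference, a sketch of the $SPARK_HOME/conf/metrics.properties entries for the Graphite sink (host, port and prefix below are placeholders):

# Send all instances' metrics to a Graphite server
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark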
BR,
Arek
On Thu, Dec 31, 2015 at 2:00 PM, Steve Loughran wrote:
>
>> On
thanks a lot.
It is very interesting.
Unfortunately it does not solve my very simple problem:
efficiently finding whether a value is in a huge RDD.
thanks again
Dominique
On 31/12/2015 01:26, madaan.amanmadaan [via Apache Spark User List] wrote:
> Hi,
>
> Check out https://github.com/amplab/s
Hi Jerry,
thanks for the hint. Could you please be more specific about how I can pass a
different spark-{usr}.conf per user during job submit, and which property I
can use to specify a custom hdfs-site.xml? I tried to google it, but didn't
find anything.
Thank you,
Konstantin Kudryavtsev
On Wed, Dec 30, 2015 at 2:3
Thanks, Yanbo.
The results became much more reasonable after I set the driver memory to 5GB
and increased the worker memory to 25GB.
So, my question is: for the following code snippet, extracted from the main
method of JavaKMeans.java in the examples, what will the driver do, and what
will the workers do?
I didn't unde
Since you're running in standalone mode, can you try it using Spark 1.5.1
please?
On Thu, Dec 31, 2015 at 9:09 AM Steve Loughran
wrote:
>
> > On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
> >
> > Hi Jerry,
> >
> > I want to run different jobs on dif
> On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev
> wrote:
>
> Hi Jerry,
>
> I want to run different jobs on different S3 buckets - different AWS creds -
> on the same instances. Could you shed some light on whether it's possible to
> achieve this with hdfs-site?
>
> Thank you,
> Konstantin Kudryavtsev
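One way this is sometimes handled (a sketch with s3a property names and placeholder environment variables, not necessarily what was used in this thread) is to set the credentials on each job's own Hadoop configuration rather than in a shared hdfs-site.xml:

import org.apache.spark.{SparkConf, SparkContext}

object PerJobS3Creds {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-for-bucket-a"))
    // Credentials apply only to this application's Hadoop configuration,
    // so another job on the same instances can use different keys.
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("BUCKET_A_ACCESS_KEY"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("BUCKET_A_SECRET_KEY"))

    val lines = sc.textFile("s3a://bucket-a/input/*.txt")
    println(lines.count())
    sc.stop()
  }
}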
> On 30 Dec 2015, at 13:19, alvarobrandon wrote:
>
> Hello:
>
> Is there any way of monitoring the number of bytes or blocks read and written
> by a Spark application? I'm running Spark with YARN and I want to measure
> how I/O-intensive a set of applications is. The closest thing I have seen is
Yeah, it's awkward; the transforms being done are fairly time sensitive, so I
don't want them to wait 60 seconds or more.
I might have to move the code from a transform into a custom receiver instead,
so they'll be processed outside the window length. A buffered writer is a good
idea too, thanks.
Hi Ewan,
Transforms are definitions of what needs to be done - they don't execute until
an action is triggered. For what you want, I think you might need to have an
action that writes out RDDs to some sort of buffered writer (a sketch follows
below).
-Ashic.
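A sketch of that idea (assuming 1-second batches, results small enough to collect to the driver, and a flush only every 60 batches so the disk sees larger writes):

import java.io.{BufferedWriter, FileWriter}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BufferedSinkSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("buffered-sink").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Driver-side writer; foreachRDD's body runs on the driver each batch.
    val writer = new BufferedWriter(new FileWriter("/tmp/output.txt", true))
    var batches = 0

    // foreachRDD is an action, so the time-sensitive work runs every second;
    // only the flush to disk is deferred to a coarser interval.
    lines.map(_.toUpperCase).foreachRDD { rdd =>
      rdd.collect().foreach(line => writer.write(line + "\n"))
      batches += 1
      if (batches % 60 == 0) writer.flush()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}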
From: ewan.le...@realitymine.com
To: user@spark.apache.org
Su
Hi all,
I'm sure this must have been solved already, but I can't see anything obvious.
Using Spark Streaming, I'm trying to execute a transform function on a DStream
at short batch intervals (e.g. 1 second), but only write the resulting data to
disk using saveAsTextFiles in a larger batch after
In order to make a job run faster, some parameters can be specified on the
command line, such as --executor-cores, --executor-memory and --num-executors
...
However, as tested, it seems that those numbers cannot be set arbitrarily, or
some trouble will be caused for the cluster. What is mor
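For context, a typical invocation looks something like the following (the numbers are illustrative only and have to fit within the resources the cluster actually has; class and jar names are placeholders):

spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --class com.example.MyJob \
  my-job.jar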
Are you running on YARN or standalone?
On Thu, Dec 31, 2015, 3:35 PM LinChen wrote:
> *Screenshot 1 (normal WebUI)*
>
> *Screenshot 2 (corrupted WebUI)*
>
> As screenshot2 shows, the format of my Spark WebUI looks strange and I
> cannot click the description of active jobs. It seems there is
Hi Anjali,
The main output of KMeansModel is clusterCenters, which is an Array[Vector]. It
has k elements, where k is the number of clusters, and each element is the
center of the corresponding cluster.
Yanbo
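A small sketch (synthetic data, k = 2) showing where clusterCenters fits in:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusterCentersDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-centers").setMaster("local[2]"))
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ))
    val model = KMeans.train(points, 2, 20)
    // clusterCenters is an Array[Vector] with k elements, one center per cluster.
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"cluster $i center: $center")
    }
    sc.stop()
  }
}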
2015-12-31 12:52 GMT+08:00 :
> Hi,
>
> I am trying to use kmeans for clustering in spark using
Screenshot 1 (normal WebUI)
Screenshot 2 (corrupted WebUI)
As screenshot 2 shows, the format of my Spark WebUI looks strange and I cannot
click the description of active jobs. It seems there is something missing in my
operating system. I googled it but found nothing. Could anybody help me?
Hey Lin,
This is a good question. The root cause of this issue lies in the
analyzer. Currently, Spark SQL can only resolve a name to a top-level
column. (Hive suffers from the same issue.) Taking the SQL query and struct you
provided as an example, col_b.col_d.col_g is resolved as two nested
GetStru
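A sketch of the shape being discussed (the schema below is assumed from the column names in the thread): a struct nested two levels deep, where col_b.col_d.col_g is resolved as the top-level column col_b wrapped in nested struct-field extractions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NestedStructSketch {
  case class Inner(col_g: Int)
  case class Middle(col_d: Inner)
  case class Record(col_a: Int, col_b: Middle)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("nested-struct").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Record(1, Middle(Inner(42))))).toDF()
    // The analyzer resolves "col_b" as a top-level column and the rest of the
    // path as struct-field extractions on it.
    df.select("col_b.col_d.col_g").show()
    sc.stop()
  }
}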