Hi,
If you are using the pipeline API, you do not need to map features back to
documents.
Your input (which is the document text) won't change after you apply
HashingTF.
If you want to do information retrieval with Spark, I suggest using RDDs
rather than the pipeline...
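For illustration, here is a minimal sketch (the column names and toy documents below are made up, not from the original thread) showing that the text column is still present next to the features after Tokenizer + HashingTF:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

object HashingTfDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tf-demo").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Two toy documents; "id" and "text" are placeholder column names.
    val docs = sqlContext.createDataFrame(Seq(
      (0L, "spark is fast"),
      (1L, "hashing tf keeps the text column")
    )).toDF("id", "text")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)

    // The transformed DataFrame still carries "id" and "text" alongside "features",
    // so each feature vector stays attached to the document it came from.
    val featurized = tf.transform(tokenizer.transform(docs))
    featurized.select("id", "text", "features").show()

    sc.stop()
  }
}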
On Fri, Jan 1, 2016 at 2:20 AM
Hi
I am working on a proof of concept. I am trying to use Spark to classify some
documents. I am using Tokenizer and HashingTF to convert the documents into
vectors. Is there any easy way to map features back to words, or do I need to
maintain the reverse index myself? I realize there is a chance so
Dear All,
I'm trying to implement a procedure that iteratively updates an RDD
using results from GaussianMixtureModel.predictSoft. In order to avoid
problems with the local variable (the obtained GMM) being overwritten in
each pass of the loop, I'm doing the following:
##
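A minimal sketch of the pattern being described (the data, k, and the update rule below are placeholders; the point is that each pass binds the freshly trained model and its soft predictions to local vals before they are used):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

object IterativeGmmSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-gmm").setMaster("local[2]"))
    var data = sc.parallelize(Seq.fill(1000)(Vectors.dense(math.random, math.random))).cache()

    for (i <- 1 to 3) {
      // A val keeps this pass's model in its own binding, so the closure below
      // never sees a reference that a later pass overwrites.
      val gmm = new GaussianMixture().setK(2).run(data)
      val soft = gmm.predictSoft(data)                 // RDD[Array[Double]]
      val updated = data.zip(soft).map { case (v, probs) =>
        Vectors.dense(v.toArray.map(_ * probs.max))    // placeholder update rule
      }.cache()
      updated.count()                                  // materialize before dropping the old RDD
      data.unpersist()
      data = updated
    }
    sc.stop()
  }
}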
I want to add AWS credentials to hdfs-site.xml and pass a different xml file
for different users.
Thank you,
Konstantin Kudryavtsev
On Thu, Dec 31, 2015 at 2:19 PM, Ted Yu wrote:
> Check out --conf option for spark-submit
>
> bq. to configure different hdfs-site.xml
>
> What config parameters do you
Check out --conf option for spark-submit
bq. to configure different hdfs-site.xml
What config parameters do you plan to change in hdfs-site.xml?
If the parameter only affects the HDFS NameNode / DataNode, passing
hdfs-site.xml wouldn't take effect, right?
Cheers
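For example (a sketch only; the property keys depend on which S3 connector is in use, and the class/jar names are placeholders), client-side Hadoop settings can be passed per submission with the spark.hadoop.* prefix, which Spark copies into the job's Hadoop Configuration:

spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=USER_A_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=USER_A_SECRET_KEY \
  --class com.example.MyJob \
  my-job.jar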
On Thu, Dec 31, 2015 at 10:48 AM, KOSTIANTYN K
Hi Jerry,
what you suggested seems to be working (I put hdfs-site.xml into the
$SPARK_HOME/conf folder), but could you shed some light on how it can be
federated per user?
Thanks in advance!
Thank you,
Konstantin Kudryavtsev
On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam wrote:
> Hi Kostiantyn,
>
> I
Hi all,
I'm trying to use a different spark-defaults.conf per user, i.e. I want to
have spark-user1.conf, etc. Is there a way to pass a path to the appropriate
conf file when I'm using a standalone Spark installation?
Also, is it possible to configure a different hdfs-site.xml and pass it as
well with spark-
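One possible approach (a sketch, not necessarily what was suggested later in the thread): spark-submit accepts --properties-file, so each user can keep their own copy of the defaults; the path, class and jar below are placeholders:

spark-submit \
  --properties-file /home/user1/spark-user1.conf \
  --class com.example.MyJob \
  my-job.jar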
For your second question,
bq. Class is not registered: scala.Tuple3[]
The IllegalArgumentException above states the class Kryo was expecting to be
registered, meaning the types of the components in the tuple are insignificant.
BTW what Spark release are you using ?
Cheers
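For reference, a sketch of the registration that message points at (assuming spark.kryo.registrationRequired is set, which is what produces the error; "scala.Tuple3[]" denotes an array of Tuple3, so it is the array class that needs registering, and the element types are erased anyway):

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

object Tuple3Registration {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.kryo.registrationRequired", "true")
      // Register the array-of-Tuple3 class; Tuple3 itself is normally covered
      // by the Scala registrations Spark installs by default.
      .registerKryoClasses(Array(classOf[Array[Tuple3[_, _, _]]]))

    val ser = new KryoSerializer(conf).newInstance()
    val original = Array((1, "a", 2.0), (2, "b", 3.0))
    val copy = ser.deserialize[Array[(Int, String, Double)]](ser.serialize(original))
    println(copy.mkString(", "))
  }
}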
On Thu, Dec 31, 2015 at 9:
The ScalaTest code that is enclosed at the end of this email message
demonstrates what appears to be a bug in the KryoSerializer. This code was
executed from IntelliJ IDEA (Community Edition) under Mac OS X 10.11.2.
The KryoSerializer is enabled by updating the original SparkContext (that is
su
The key to efficient lookups is having a partitioner in place.
If you don't have a partitioner in place, essentially the best you can do
is:
def contains[T](rdd: RDD[T], value: T): Boolean =
  !rdd.filter(x => x == value).isEmpty()
If you are going to do this sort of operation frequently, it might
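A sketch of the partitioner-based alternative hinted at above (assuming the values are keyed once up front and the keyed RDD is cached and reused for many lookups):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionedContains {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lookup-demo").setMaster("local[2]"))
    val values = sc.parallelize(1L to 10000000L)

    // Key the values once and hash-partition them; lookup() then only scans
    // the single partition the key hashes to instead of the whole RDD.
    val keyed = values.map(v => (v, ())).partitionBy(new HashPartitioner(64)).cache()
    keyed.count()  // materialize before querying

    def contains(value: Long): Boolean = keyed.lookup(value).nonEmpty

    println(contains(42L))   // true
    println(contains(-1L))   // false
    sc.stop()
  }
}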
Hello,
Spark collects HDFS read/write metrics per application/job; see details at
http://spark.apache.org/docs/latest/monitoring.html.
I have connected the Spark metrics to Graphite and display nice graphs in
Grafana.
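For reference, a sketch of the $SPARK_HOME/conf/metrics.properties entries for the Graphite sink (host, port and prefix below are placeholders):

# Send all instances' metrics to a Graphite server
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark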
BR,
Arek
On Thu, Dec 31, 2015 at 2:00 PM, Steve Loughran wrote:
>
>> On
thanks a lot.
It is very interesting.
Unfortunately it does not solve my very simple problem:
efficiently finding whether a value is in a huge RDD.
thanks again
Dominique
On 31/12/2015 01:26, madaan.amanmadaan [via Apache Spark User List] wrote:
> Hi,
>
> Check out https://github.com/amplab/s
Hi Jerry,
thanks for the hint. Could you please be more specific about how I can pass a
different spark-{usr}.conf per user during job submit, and which property I
can use to specify a custom hdfs-site.xml? I tried to google it, but didn't
find anything.
Thank you,
Konstantin Kudryavtsev
On Wed, Dec 30, 2015 at 2:3
Thanks, Yanbo.
The results became much more reasonable after I set the driver memory to 5GB
and increased the worker memory to 25GB.
So, my question is: for the following code snippet, extracted from the main
method of JavaKMeans.java in the examples, what will the driver do, and what
will the workers do?
I didn't unde
Since you're running in standalone mode, can you try it using Spark 1.5.1
please?
On Thu, Dec 31, 2015 at 9:09 AM Steve Loughran
wrote:
>
> > On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
> >
> > Hi Jerry,
> >
> > I want to run different jobs on dif
> On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev
> wrote:
>
> Hi Jerry,
>
> I want to run different jobs on different S3 buckets - different AWS creds -
> on the same instances. Could you shed some light on whether it's possible to
> achieve this with hdfs-site?
>
> Thank you,
> Konstantin Kudryavtsev
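One way this is sometimes handled (a sketch with s3a property names and placeholder environment variables, not necessarily what was used in this thread) is to set the credentials on each job's own Hadoop configuration rather than in a shared hdfs-site.xml:

import org.apache.spark.{SparkConf, SparkContext}

object PerJobS3Creds {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-for-bucket-a"))
    // Credentials apply only to this application's Hadoop configuration,
    // so another job on the same instances can use different keys.
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("BUCKET_A_ACCESS_KEY"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("BUCKET_A_SECRET_KEY"))

    val lines = sc.textFile("s3a://bucket-a/input/*.txt")
    println(lines.count())
    sc.stop()
  }
}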
> On 30 Dec 2015, at 13:19, alvarobrandon wrote:
>
> Hello:
>
> Is there any way of monitoring the number of bytes or blocks read and written
> by a Spark application? I'm running Spark with YARN and I want to measure
> how I/O-intensive a set of applications is. The closest thing I have seen is
Yeah, it's awkward; the transforms being done are fairly time sensitive, so I
don't want them to wait 60 seconds or more.
I might have to move the code from a transform into a custom receiver instead,
so they'll be processed outside the window length. A buffered writer is a good
idea too, thanks.
Hi Ewan,
Transforms are definitions of what needs to be done - they don't execute until
an action is triggered. For what you want, I think you might need to have an
action that writes out RDDs to some sort of buffered writer (a sketch follows
below).
-Ashic.
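A sketch of that idea (assuming 1-second batches, results small enough to collect to the driver, and a flush only every 60 batches so the disk sees larger writes):

import java.io.{BufferedWriter, FileWriter}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BufferedSinkSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("buffered-sink").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Driver-side writer; foreachRDD's body runs on the driver each batch.
    val writer = new BufferedWriter(new FileWriter("/tmp/output.txt", true))
    var batches = 0

    // foreachRDD is an action, so the time-sensitive work runs every second;
    // only the flush to disk is deferred to a coarser interval.
    lines.map(_.toUpperCase).foreachRDD { rdd =>
      rdd.collect().foreach(line => writer.write(line + "\n"))
      batches += 1
      if (batches % 60 == 0) writer.flush()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}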
From: ewan.le...@realitymine.com
To: user@spark.apache.org
Su
Hi all,
I'm sure this must have been solved already, but I can't see anything obvious.
Using Spark Streaming, I'm trying to execute a transform function on a DStream
at short batch intervals (e.g. 1 second), but only write the resulting data to
disk using saveAsTextFiles in a larger batch after
In order to make a job run faster, some parameters can be specified on the
command line, such as --executor-cores, --executor-memory and --num-executors
...
However, as tested, it seems that those numbers cannot be set arbitrarily, or
some trouble will be caused for the cluster. What is mor
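For context, a typical invocation looks something like the following (the numbers are illustrative only and have to fit within the resources the cluster actually has; class and jar names are placeholders):

spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --class com.example.MyJob \
  my-job.jar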
Are you running on YARN or standalone?
On Thu, Dec 31, 2015, 3:35 PM LinChen wrote:
> *Screenshot 1 (normal WebUI)*
>
> *Screenshot 2 (corrupted WebUI)*
>
> As screenshot2 shows, the format of my Spark WebUI looks strange and I
> cannot click the description of active jobs. It seems there is
Hi Anjali,
The main output of KMeansModel is clusterCenters, which is an Array[Vector]. It
has k elements, where k is the number of clusters, and each element is the
center of the corresponding cluster.
Yanbo
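A small sketch (synthetic data, k = 2) showing where clusterCenters fits in:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusterCentersDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-centers").setMaster("local[2]"))
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ))
    val model = KMeans.train(points, 2, 20)
    // clusterCenters is an Array[Vector] with k elements, one center per cluster.
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"cluster $i center: $center")
    }
    sc.stop()
  }
}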
2015-12-31 12:52 GMT+08:00 :
> Hi,
>
> I am trying to use kmeans for clustering in spark using
Screenshot 1 (normal WebUI)
Screenshot 2 (corrupted WebUI)
As screenshot 2 shows, the format of my Spark WebUI looks strange and I cannot
click the description of active jobs. It seems there is something missing in my
operating system. I googled it but found nothing. Could anybody help me?
Hey Lin,
This is a good question. The root cause of this issue lies in the
analyzer. Currently, Spark SQL can only resolve a name to a top-level
column. (Hive suffers from the same issue.) Taking the SQL query and struct you
provided as an example, col_b.col_d.col_g is resolved as two nested
GetStru
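A sketch of the shape being discussed (the schema below is assumed from the column names in the thread): a struct nested two levels deep, where col_b.col_d.col_g is resolved as the top-level column col_b wrapped in nested struct-field extractions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NestedStructSketch {
  case class Inner(col_g: Int)
  case class Middle(col_d: Inner)
  case class Record(col_a: Int, col_b: Middle)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("nested-struct").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Record(1, Middle(Inner(42))))).toDF()
    // The analyzer resolves "col_b" as a top-level column and the rest of the
    // path as struct-field extractions on it.
    df.select("col_b.col_d.col_g").show()
    sc.stop()
  }
}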