Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
Hi Hao, HashingTF directly applies a hash function (MurmurHash3) to the terms to determine their column index. It takes no account of the term frequency or the length of the document. It does similar work to sklearn's FeatureHasher. The result is increased speed and reduced memory
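
A minimal sketch of the feature-hashing idea (not Spark's exact implementation; the dimension and hashing details are simplified for illustration):

    import scala.util.hashing.MurmurHash3

    // Each term is hashed and mapped to a column index; the cell value is the
    // raw occurrence count, with no normalization by document length.
    def hashedTermCounts(doc: Seq[String], numFeatures: Int = 1 << 18): Map[Int, Double] =
      doc.foldLeft(Map.empty[Int, Double]) { (acc, term) =>
        val idx = ((MurmurHash3.stringHash(term) % numFeatures) + numFeatures) % numFeatures
        acc.updated(idx, acc.getOrElse(idx, 0.0) + 1.0)
      }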

Re: tpcds for spark2.0

2016-08-01 Thread kevin
Finally, I used spark-sql-perf-0.4.3: ./bin/spark-shell --jars /home/dcos/spark-sql-perf-0.4.3/target/scala-2.11/spark-sql-perf_2.11-0.4.3.jar --executor-cores 4 --executor-memory 10G --master spark://master1:7077 If I don't use "--jars" I get the error I mentioned.

What happens in Dataset limit followed by rdd

2016-08-01 Thread Maciej Szymkiewicz
Hi everyone, This doesn't look like expected behavior, does it? http://stackoverflow.com/q/38710018/1560062 A quick glance at the UI suggests that there is a shuffle involved and that the input for first is a ShuffledRowRDD. -- Best regards, Maciej Szymkiewicz
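
A minimal reproduction sketch of the pattern from the linked question (assumes a running SparkSession named spark; not the exact code from the post):

    val ds = spark.range(0, 1000000L)
    // Converting to an RDD after limit() is where the question reports the
    // extra shuffle and a ShuffledRowRDD input for first().
    val viaRdd = ds.limit(1000).rdd.first()
    // Compare with staying in the Dataset API:
    val viaDataset = ds.limit(1000).first()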

[MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Hao Ren
When computing term frequency, we can use either the HashingTF or the CountVectorizer feature extractor. However, both of them just use the number of times that a term appears in a document. It is not a true frequency. Actually, it should be divided by the length of the document. Is this a wanted feature?
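
A hypothetical helper (not part of MLlib) illustrating the normalization being asked about: raw counts divided by document length.

    // Relative term frequency: occurrences of each term divided by the
    // total number of terms in the document.
    def relativeTermFrequency(doc: Seq[String]): Map[String, Double] = {
      val docLength = doc.size.toDouble
      doc.groupBy(identity).map { case (term, occurrences) =>
        term -> occurrences.size / docLength
      }
    }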

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
Put the queue in a static variable that is first referenced on the workers (inside an RDD closure). That way it will be created on each of the workers, not on the driver. The easiest way to do that is with a lazy val in a companion object; see the sketch below.
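
A sketch of that pattern, shown here with a standalone object (the object and value names are made up for illustration):

    import scala.collection.mutable

    object PerExecutorQueue {
      // Initialized on first access; when that access happens inside an RDD
      // closure, the queue is created on each executor JVM, not on the driver.
      lazy val queue: mutable.Queue[String] = new mutable.Queue[String]()
    }

    // Example use inside a streaming job:
    // dstream.foreachRDD { rdd =>
    //   rdd.foreachPartition { iter =>
    //     iter.foreach(item => PerExecutorQueue.queue.enqueue(item))
    //   }
    // }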

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
How do I do that? If I put the queue inside the .transform operation, it doesn't work.

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
Can you keep a queue per executor in memory?

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
Hi Cody and all, Thank you for your answer. I implemented simple random sampling (SRS) for DStream using the transform method, and it works fine. However, I have a problem when I implement reservoir sampling (RS). In RS, I need to maintain a reservoir (a queue) to store selected data items (RDDs). If I
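
For reference, a generic sketch of the standard reservoir-sampling update (Algorithm R) over an iterator of items; this is not the DStream-specific implementation discussed in this thread:

    import scala.util.Random

    // Keeps a uniform random sample of size k from a stream of unknown length.
    def reservoirSample[T](items: Iterator[T], k: Int, rng: Random = new Random): Seq[T] = {
      val reservoir = new scala.collection.mutable.ArrayBuffer[T](k)
      var seen = 0L
      items.foreach { item =>
        seen += 1
        if (reservoir.size < k) {
          reservoir += item
        } else {
          // Replace a random slot with probability k / seen.
          val j = (rng.nextDouble() * seen).toLong
          if (j < k) reservoir(j.toInt) = item
        }
      }
      reservoir.toList
    }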

Default date formats in CSV and JSON: see SPARK-16216

2016-08-01 Thread Sean Owen
https://issues.apache.org/jira/browse/SPARK-16216 https://github.com/apache/spark/pull/14279 This concerns the default representation of times and dates in CSV and JSON. CSV has UNIX timestamps; JSON has formatted strings, but unfortunately they lack timezones. The question here is which to change to
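
A quick way to observe the current defaults under discussion (assumes a SparkSession named spark; the output paths are placeholders):

    import java.sql.Timestamp
    import spark.implicits._

    val df = Seq((1, Timestamp.valueOf("2016-08-01 12:34:56"))).toDF("id", "ts")
    df.write.csv("/tmp/ts_defaults_csv")    // inspect how the timestamp is rendered here...
    df.write.json("/tmp/ts_defaults_json")  // ...versus here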