Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
Hi Hao, HashingTF directly applies a hash function (MurmurHash3) to the terms to determine their column index. It takes no account of the term frequency or the length of the document. It does similar work to sklearn's FeatureHasher. The result is increased speed and reduced memory
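
A minimal sketch of the feature-hashing idea (not Spark's exact implementation; the dimension and hashing details are simplified for illustration):

    import scala.util.hashing.MurmurHash3

    // Each term is hashed and mapped to a column index; the cell value is the
    // raw occurrence count, with no normalization by document length.
    def hashedTermCounts(doc: Seq[String], numFeatures: Int = 1 << 18): Map[Int, Double] =
      doc.foldLeft(Map.empty[Int, Double]) { (acc, term) =>
        val idx = ((MurmurHash3.stringHash(term) % numFeatures) + numFeatures) % numFeatures
        acc.updated(idx, acc.getOrElse(idx, 0.0) + 1.0)
      }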

Re: tpcds for spark2.0

2016-08-01 Thread kevin
Finally, I used spark-sql-perf-0.4.3: ./bin/spark-shell --jars /home/dcos/spark-sql-perf-0.4.3/target/scala-2.11/spark-sql-perf_2.11-0.4.3.jar --executor-cores 4 --executor-memory 10G --master spark://master1:7077 If I don't use "--jars" I get the error I mentioned.

What happens in Dataset limit followed by rdd

2016-08-01 Thread Maciej Szymkiewicz
Hi everyone, This doesn't look like expected behavior, does it? http://stackoverflow.com/q/38710018/1560062 A quick glance at the UI suggests that there is a shuffle involved and that the input for first is a ShuffledRowRDD. -- Best regards, Maciej Szymkiewicz
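
A minimal reproduction sketch of the pattern from the linked question (assumes a running SparkSession named spark; not the exact code from the post):

    val ds = spark.range(0, 1000000L)
    // Converting to an RDD after limit() is where the question reports the
    // extra shuffle and a ShuffledRowRDD input for first().
    val viaRdd = ds.limit(1000).rdd.first()
    // Compare with staying in the Dataset API:
    val viaDataset = ds.limit(1000).first()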

[MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Hao Ren
When computing term frequency, we can use either the HashingTF or the CountVectorizer feature extractor. However, both of them just use the number of times that a term appears in a document. It is not a true frequency. Actually, it should be divided by the length of the document. Is this a wanted feature?
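
A hypothetical helper (not part of MLlib) illustrating the normalization being asked about: raw counts divided by document length.

    // Relative term frequency: occurrences of each term divided by the
    // total number of terms in the document.
    def relativeTermFrequency(doc: Seq[String]): Map[String, Double] = {
      val docLength = doc.size.toDouble
      doc.groupBy(identity).map { case (term, occurrences) =>
        term -> occurrences.size / docLength
      }
    }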

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
Put the queue in a static variable that is first referenced on the workers (inside an RDD closure). That way it will be created on each of the workers, not on the driver. The easiest way to do that is with a lazy val in a companion object; see the sketch below.
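
A sketch of that pattern, shown here with a standalone object (the object and value names are made up for illustration):

    import scala.collection.mutable

    object PerExecutorQueue {
      // Initialized on first access; when that access happens inside an RDD
      // closure, the queue is created on each executor JVM, not on the driver.
      lazy val queue: mutable.Queue[String] = new mutable.Queue[String]()
    }

    // Example use inside a streaming job:
    // dstream.foreachRDD { rdd =>
    //   rdd.foreachPartition { iter =>
    //     iter.foreach(item => PerExecutorQueue.queue.enqueue(item))
    //   }
    // }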

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
How do I do that? If I put the queue inside the .transform operation, it doesn't work.

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
Can you keep a queue per executor in memory?

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
Hi Cody and all, Thank you for your answer. I implemented simple random sampling (SRS) for DStream using the transform method, and it works fine. However, I have a problem when I implement reservoir sampling (RS). In RS, I need to maintain a reservoir (a queue) to store selected data items (RDDs). If I
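
For reference, a generic sketch of the standard reservoir-sampling update (Algorithm R) over an iterator of items; this is not the DStream-specific implementation discussed in this thread:

    import scala.util.Random

    // Keeps a uniform random sample of size k from a stream of unknown length.
    def reservoirSample[T](items: Iterator[T], k: Int, rng: Random = new Random): Seq[T] = {
      val reservoir = new scala.collection.mutable.ArrayBuffer[T](k)
      var seen = 0L
      items.foreach { item =>
        seen += 1
        if (reservoir.size < k) {
          reservoir += item
        } else {
          // Replace a random slot with probability k / seen.
          val j = (rng.nextDouble() * seen).toLong
          if (j < k) reservoir(j.toInt) = item
        }
      }
      reservoir.toList
    }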

Default date formats in CSV and JSON: see SPARK-16216

2016-08-01 Thread Sean Owen
https://issues.apache.org/jira/browse/SPARK-16216 https://github.com/apache/spark/pull/14279 This concerns the default representation of times and dates in CSV and JSON. CSV has UNIX timestamps; JSON has formatted strings, but unfortunately they lack timezones. The question here is which to change to
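
A quick way to observe the current defaults under discussion (assumes a SparkSession named spark; the output paths are placeholders):

    import java.sql.Timestamp
    import spark.implicits._

    val df = Seq((1, Timestamp.valueOf("2016-08-01 12:34:56"))).toDF("id", "ts")
    df.write.csv("/tmp/ts_defaults_csv")    // inspect how the timestamp is rendered here...
    df.write.json("/tmp/ts_defaults_json")  // ...versus here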