Hi Hao,
HashingTF directly applies a hash function (MurmurHash3) to the features
to determine their column index. It takes no account of the term frequency
relative to the length of the document. It does much the same work as
sklearn's FeatureHasher. The result is increased speed and reduced memory
use.
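For reference, a minimal sketch of that behavior with the Spark ML API
(assuming a spark-shell session where spark is in scope; the toy documents
are made up for illustration):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val docs = spark.createDataFrame(Seq(
  (0, "spark spark streaming"),
  (1, "hashing trick example")
)).toDF("id", "text")

val words = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .transform(docs)

// Each term is hashed and the hash modulo numFeatures becomes the
// column index; the cell value is the raw count in that document.
val tf = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)  // 2^18 buckets, the default

tf.transform(words).select("features").show(truncate = false)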
In the end I ran spark-sql-perf-0.4.3 like this:

./bin/spark-shell \
  --jars /home/dcos/spark-sql-perf-0.4.3/target/scala-2.11/spark-sql-perf_2.11-0.4.3.jar \
  --executor-cores 4 --executor-memory 10G \
  --master spark://master1:7077

If I don't pass "--jars", I get the error I mentioned.
Hi everyone,
This doesn't look like something expected, does it?
http://stackoverflow.com/q/38710018/1560062
A quick glance at the UI suggests that there is a shuffle involved and
that the input for first is a ShuffledRowRDD.
--
Best regards,
Maciej Szymkiewicz
When computing term frequency, we can use either the HashingTF or the
CountVectorizer feature extractor.
However, both of them just use the number of times a term appears in a
document.
That is not a true frequency. Actually, it should be divided by the length
of the document.
Is this a desired feature?
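(A sketch of the division being described, assuming an upstream HashingTF
stage producing a "features" column of raw counts: for a non-negative
count vector the L1 norm is the document length in tokens, so ml's
Normalizer with p = 1 performs exactly this division.)

import org.apache.spark.ml.feature.Normalizer

// For a non-negative count vector, the L1 norm equals the total token
// count, so p = 1 normalization divides each count by document length.
val relTf = new Normalizer()
  .setInputCol("features")     // raw counts
  .setOutputCol("relFeatures") // counts / document length
  .setP(1.0)

// val normalized = relTf.transform(tfOutput)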
Put the queue in a static variable that is first referenced on the
workers (inside an RDD closure). That way it will be created on each
of the workers, not on the driver.
The easiest way to do that is with a lazy val in a companion object.
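A rough sketch of what that looks like (the names here are illustrative,
not an existing API):

import scala.collection.mutable
import org.apache.spark.streaming.dstream.DStream

// Illustrative holder: the lazy val is initialized on first reference,
// which happens inside a task, so each executor JVM gets its own queue.
object PerExecutorQueue {
  lazy val queue: mutable.Queue[String] = mutable.Queue.empty[String]
}

def recordThrough(stream: DStream[String]): DStream[String] =
  stream.transform { rdd =>
    rdd.map { item =>
      // First touch on a worker creates the queue there, not on the driver.
      PerExecutorQueue.queue.synchronized {
        PerExecutorQueue.queue.enqueue(item)
      }
      item
    }
  }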
How to do that? If I put the queue inside the .transform operation, it
doesn't work.
Can you keep a queue per executor in memory?
Hi Cody and all,
Thank you for your answer. I implemented simple random sampling (SRS) for
DStreams using the transform method, and it works fine.
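(For context, the SRS part looks roughly like this; the sampling fraction
is an assumed parameter:)

import scala.reflect.ClassTag
import org.apache.spark.streaming.dstream.DStream

// Bernoulli sampling of each batch: every element is kept
// independently with probability `fraction`.
def srs[T: ClassTag](stream: DStream[T], fraction: Double): DStream[T] =
  stream.transform(rdd => rdd.sample(withReplacement = false, fraction))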
However, I have a problem when implementing reservoir sampling (RS). In
RS, I need to maintain a reservoir (a queue) to store the selected data
items (RDDs). If I
https://issues.apache.org/jira/browse/SPARK-16216
https://github.com/apache/spark/pull/14279
This concerns the default representation of times and dates in CSV and
JSON. CSV has UNIX timestamps; JSON has formatted strings, but
unfortunately they lack timezones.
The question here is which to change to
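(For anyone reproducing this, a quick way to see the two defaults side by
side; the output paths are illustrative and a spark-shell session is
assumed:)

import java.sql.Timestamp
import spark.implicits._

val df = Seq(Timestamp.valueOf("2016-07-29 21:17:00")).toDF("ts")

df.write.csv("/tmp/ts-csv")    // written as a UNIX timestamp
df.write.json("/tmp/ts-json")  // formatted string, but with no timezone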