Can someone also provide input on why my code may not be working? Below, I
have pasted part of my previous reply which describes the issue I am having
here. I am really more perplexed about the first set of code (in bold). I
know why the second set of code doesn't work; it is just something I
initially...
I should be clearer, since the outcome of the discussion above was
not that obvious, actually.
- I agree a change should be made to StandardScaler, and not VectorAssembler.
- However, I do think withMean should still be false by default and be
explicitly enabled.
- The 'offset' idea is orthogonal to this change.
Opening this follow-up question to the entire mailing list. Anyone
have thoughts
on how I can add a column of dense vectors (created by converting a column
of sparse features) to a data frame? My efforts are below.
Although I know this is not the best approach for something I plan to put
in production...
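(A minimal sketch of one way to do that conversion with a UDF, assuming the
Spark 2.x Scala API; df and the column names here are placeholders, not the
poster's actual code.)

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.udf

    // Densify each ml Vector; declaring the return type as Vector keeps the
    // column typed as a vector column.
    def densify(v: Vector): Vector = v.toDense
    val toDense = udf(densify _)

    // "features" is the existing sparse column, "denseFeatures" the new one.
    val withDense = df.withColumn("denseFeatures", toDense(df("features")))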
No, that doesn't describe the change being discussed, since you've
copied the discussion about adding an 'offset'. That's orthogonal.
You're also suggesting making withMean=True the default, which we
don't want. The point is that if this is *explicitly* requested, the
scaler shouldn't refuse to subtract the mean.
Sean,
I have created a jira; I hope you don't mind that I borrowed your
explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
So what did you do to standardize your data, if you didn't use
StandardScaler? Did you write a UDF to subtract the mean and divide by the
standard deviation?
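(A rough sketch of the kind of workaround being asked about: fit
StandardScaler with withMean=false just to obtain the per-feature mean and
std from the fitted model, then apply a UDF. Column names and the zero-std
guard are placeholders, not code from this thread.)

    import org.apache.spark.ml.feature.StandardScaler
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.functions.udf

    // Fit only to obtain per-feature statistics; withMean stays false here.
    val model = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("unused")
      .setWithMean(false)
      .setWithStd(true)
      .fit(df)

    val mean = model.mean.toArray
    val std  = model.std.toArray

    // Densify each row and compute (x - mean) / std, skipping zero-variance features.
    val standardize = udf { v: Vector =>
      val x = v.toArray
      Vectors.dense(Array.tabulate(x.length) { i =>
        if (std(i) == 0.0) 0.0 else (x(i) - mean(i)) / std(i)
      })
    }

    val scaled = df.withColumn("scaledFeatures", standardize(df("features")))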
Ah right, got it. As you say for storage it helps significantly, but for
operations I suspect it puts one back in a "dense-like" position. Still,
for online / mini-batch algorithms it may still be feasible I guess.
On Wed, 10 Aug 2016 at 19:50, Sean Owen wrote:
All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
represents 0 3 0 7. Imagine it also has an offset stored which applies to
all elements. If it is -2 then it now represents -2 1 -2 5, but this
requires just one extra value to store. It only helps with storage of a
shifted sparse vector.
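(A toy illustration of that idea; this is not an existing Spark class, just
the arithmetic from the example above.)

    import org.apache.spark.ml.linalg.SparseVector

    // Hypothetical wrapper: a sparse vector plus one offset applied to every element.
    case class OffsetSparseVector(sv: SparseVector, offset: Double) {
      def apply(i: Int): Double = sv(i) + offset
      def toDenseArray: Array[Double] = sv.toArray.map(_ + offset)
    }

    // Indices 1 and 3 of a length-4 vector hold 3 and 7, i.e. [0, 3, 0, 7].
    val v = OffsetSparseVector(new SparseVector(4, Array(1, 3), Array(3.0, 7.0)), -2.0)
    v.toDenseArray  // Array(-2.0, 1.0, -2.0, 5.0), stored as two values plus one offset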
Sean, by 'offset' do you mean basically subtracting the mean but only from
the non-zero elements in each row?
On Wed, 10 Aug 2016 at 19:02, Sean Owen wrote:
Yeah I had thought the same, that perhaps it's fine to let the
StandardScaler proceed, if it's explicitly asked to center, rather
than refuse to. It's not really much more rope to let a user hang
herself with, and it blocks legitimate usages (we ran into this last
week and couldn't use StandardScaler).
Thanks Sean, I agree with 100% that the math is math and dense vs sparse is
just a matter of representation. I was trying to convince a co-worker of
this to no avail. Sending this email was mainly a sanity check.
I think having an offset would be a great idea, although I am not sure how
to implement it.
Dense vs sparse is just a question of representation, so doesn't make
an operation on a vector more or less important as a result. You've
identified the reason that subtracting the mean can be undesirable: a
notionally billion-element sparse vector becomes too big to fit in
memory at once.
I know...
Hi everyone,
I am doing some standardization using StandardScaler on data from
VectorAssembler, which is represented as sparse vectors. I plan to fit a
regularized model. However, StandardScaler does not allow the mean to be
subtracted from sparse vectors. It will only divide by the standard
deviation.
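(For reference, a minimal sketch of the setup being described, assuming the
Spark 2.x ML API; raw and the column names are placeholders. The
VectorAssembler output is typically sparse, and setWithMean(true) is the step
the scaler refuses for sparse input, as described above.)

    import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

    // Assemble raw numeric columns into a single, typically sparse, feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2", "x3"))
      .setOutputCol("features")
    val assembled = assembler.transform(raw)

    // Dividing by the standard deviation works on sparse input; asking for
    // mean subtraction (setWithMean(true)) is what this thread is about.
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(true)
      .setWithStd(true)
    val scaled = scaler.fit(assembled).transform(assembled)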