Scaling here means multiplying factors up or down so that they're all on a comparable scale. It certainly changes the sum of squared errors, but you can't compare this metric across scaled and unscaled data, precisely because the two are on entirely different scales and will have quite different absolute values. If comparing those two numbers is the motivation here, then no, that comparison is misleading.

You probably do want to scale the factors, because the underlying distance metric (Euclidean) treats all dimensions equally. If they're on very different scales, dimensions that happen to have larger units will dominate the distance.
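Concretely, something like this untested sketch shows both points (the file name, k, and iteration count are placeholders, not anything from your setup; I'm assuming a CSV with one row of 112 comma-separated factors per PItem):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

// `sc` is an existing SparkContext; "pitems.csv" is a placeholder path.
val rawFeatures = sc.textFile("pitems.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Standardize each of the 112 dimensions to zero mean and unit variance,
// so no dimension dominates the Euclidean distance just by having big units.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(rawFeatures)
val scaledFeatures = scaler.transform(rawFeatures).cache()

val k = 10  // placeholder; choose k by your own evaluation
val modelRaw = KMeans.train(rawFeatures, k, 20)
val modelScaled = KMeans.train(scaledFeatures, k, 20)

// computeCost is the within-cluster sum of squared errors (WSSSE). The two
// numbers below live on different scales: compare WSSSE only between models
// fit to the SAME (scaled or unscaled) data, never across the two.
println(s"WSSSE on raw data:    ${modelRaw.computeCost(rawFeatures)}")
println(s"WSSSE on scaled data: ${modelScaled.computeCost(scaledFeatures)}")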
On Wed, Aug 10, 2016 at 12:46 PM, Rohit Chaddha <rohitchaddha1...@gmail.com> wrote:
> Hi Sean,
>
> So basically I am trying to cluster a number of elements (it's a domain
> object called PItem) based on the quality factors of these items.
> These elements have 112 quality factors each.
>
> Now the issue is that when I scale the factors using StandardScaler I
> get a sum of squared errors = 13300.
> When I don't use scaling, the sum of squared errors = 5.
>
> I was always of the opinion that factors on different scales should
> always be normalized, but I am confused by the results above, and I am
> wondering which factors should be removed to get a meaningful result
> (maybe with 5% less accuracy).
>
> Will appreciate any help here.
>
> -Rohit
>
> On Tue, Aug 9, 2016 at 12:55 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Fewer features doesn't necessarily mean better predictions, because
>> indeed you are subtracting data. It might, because when done well you
>> subtract more noise than signal. It is usually done to make data sets
>> smaller or more tractable, or to improve explainability.
>>
>> But you have an unsupervised clustering problem, where talking about
>> feature importance doesn't make as much sense. Important to what?
>> There is no target variable.
>>
>> PCA will not 'improve' clustering per se, but can make it faster.
>> You may want to specify what you are actually trying to optimize.
>>
>>
>> On Tue, Aug 9, 2016, 03:23 Rohit Chaddha <rohitchaddha1...@gmail.com>
>> wrote:
>>>
>>> I would rather have fewer features, to make better inferences on the
>>> data based on the smaller number of factors.
>>> Any suggestions, Sean?
>>>
>>> On Mon, Aug 8, 2016 at 11:37 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> Yes, that's exactly what PCA is for, as Sivakumaran noted. Do you
>>>> really want to select features, or just obtain a lower-dimensional
>>>> representation of them, with less redundancy?
>>>>
>>>> On Mon, Aug 8, 2016 at 4:10 PM, Tony Lane <tonylane....@gmail.com>
>>>> wrote:
>>>> > There must be an algorithmic way to figure out which of these
>>>> > factors contribute the least and remove them from the analysis.
>>>> > I am hoping someone can throw some insight on this.
>>>> >
>>>> > On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kuma...@me.com>
>>>> > wrote:
>>>> >>
>>>> >> Not an expert here, but the first step would be to devote some
>>>> >> time and identify which of these 112 factors are actually
>>>> >> causative. Some domain knowledge of the data may be required.
>>>> >> Then you can start off with PCA.
>>>> >>
>>>> >> HTH,
>>>> >>
>>>> >> Regards,
>>>> >>
>>>> >> Sivakumaran S
>>>> >>
>>>> >> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane....@gmail.com> wrote:
>>>> >>
>>>> >> Great question, Rohit. I am in my early days of ML as well, and it
>>>> >> would be great if we got some ideas on this from other experts in
>>>> >> this group.
>>>> >>
>>>> >> I know we can reduce dimensions by using PCA, but I think that
>>>> >> does not tell us which of the original factors we end up using.
>>>> >>
>>>> >> - Tony L.
>>>> >>
>>>> >> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha
>>>> >> <rohitchaddha1...@gmail.com> wrote:
>>>> >>>
>>>> >>> I have a data set where each data point has 112 factors.
>>>> >>>
>>>> >>> I want to remove the factors which are not relevant, say reducing
>>>> >>> the 112 down to 20 factors, and then cluster the data points
>>>> >>> using those 20 factors.
>>>> >>>
>>>> >>> How do I do this, and how do I figure out which 20 factors are
>>>> >>> useful for the analysis?
>>>> >>>
>>>> >>> I see SVD and PCA implementations, but I am not sure whether they
>>>> >>> tell you which elements are removed and which remain.
>>>> >>>
>>>> >>> Can someone please help me understand what to do here?
>>>> >>>
>>>> >>> thanks,
>>>> >>> -Rohit
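P.S. On the PCA question quoted above: PCA will not tell you which of the 112 factors were "removed", because each principal component is a linear combination of all of them. What you can inspect are the loadings, which show how strongly each original factor contributes to each component. A minimal, untested sketch, reusing the hypothetical rawFeatures RDD from above (20 components is just the number mentioned in the thread):

import org.apache.spark.mllib.feature.PCA

// Project the 112 original factors onto the top 20 principal components.
val pca = new PCA(20).fit(rawFeatures)
val projected = rawFeatures.map(v => pca.transform(v))

// pca.pc is a 112 x 20 matrix of loadings: entry (i, j) is the weight of
// original factor i in component j. No factor is dropped outright; large
// absolute weights show which factors drive each component.
println(pca.pc)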