Does it deserve to be a JIRA in Spark / Spark MLlib?
How do you guys normally determine data types?

Frameworks like H2O automatically determine data types by scanning a
sample of the data, or the whole dataset.
One can then decide, e.g., whether a variable should be treated as
categorical or numerical.
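
For illustration, here is a rough sketch of that scanning approach in
Scala (all the names here, like guessType and widen, are mine, not an
existing Spark or Hive API); the widen step is essentially what the
merge phase of such a UDAF would do:

object TypeGuesser {
  sealed trait Guess
  case object IntT    extends Guess
  case object DoubleT extends Guess
  case object StringT extends Guess

  // Narrowest type a single raw value fits into.
  private def typeOf(v: String): Guess =
    if (v.matches("-?\\d+")) IntT
    else if (scala.util.Try(v.toDouble).isSuccess) DoubleT
    else StringT

  // Widen two guesses over the lattice Int < Double < String;
  // this is the merge step an aggregate function would perform.
  private def widen(a: Guess, b: Guess): Guess = (a, b) match {
    case (StringT, _) | (_, StringT) => StringT
    case (IntT, IntT)                => IntT
    case _                           => DoubleT
  }

  // Guess a column's type from a sample of its values.
  def guessType(sample: Seq[String]): Guess = {
    val guesses = sample.filter(_.nonEmpty).map(typeOf)
    if (guesses.isEmpty) StringT else guesses.reduce(widen)
  }

  // Crude heuristic for the categorical-vs-numerical decision:
  // few distinct values in the sample suggests a categorical variable
  // (the 32-level cutoff is an arbitrary placeholder).
  def looksCategorical(sample: Seq[String], maxLevels: Int = 32): Boolean =
    sample.distinct.size <= maxLevels
}

So guessType(Seq("1", "2", "3.5")) would come back as DoubleT, while a
single non-numeric value anywhere in the sample widens it to StringT.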

Another use case is when you get an arbitrary data set (we get them
quite often) and want to save it as a Parquet table.
Providing correct data types makes Parquet more space-efficient (and
probably more query-time performant, e.g. better Parquet filters than
just storing everything as string/varchar).
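
On that point, I believe the Databricks spark-csv package can already
do a limited version of this via its inferSchema option (it takes an
extra pass over the data). Something like the following, assuming
spark-csv is on the classpath, a spark-shell style sqlContext, and
placeholder paths:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")  // extra scan to pick Int/Double/... per column
  .load("/path/to/arbitrary.csv") // placeholder path

df.printSchema()  // inspect what was guessed before committing to it
df.write.parquet("/path/to/table.parquet")  // typed columns instead of all-string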



-- 
Ruslan Dautkhanov

On Thu, Sep 17, 2015 at 12:32 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> Wanted to take something like this
>
> https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
> and turn it into a Hive UDAF, i.e. an aggregate function that returns
> a data type guess.
> Am I reinventing the wheel?
> Does Spark have something like this already built in?
> It would be very useful for exploring new, wide datasets. It would be
> helpful for ML too, e.g. to decide between categorical and numerical
> variables.
>
>
> Ruslan
>
>
