Does it deserve to be a JIRA in Spark / Spark MLlib? How do you normally determine data types?
Frameworks like h2o automatically determine the data type by scanning a
sample of the data, or the whole dataset. One can then decide, e.g.,
whether a variable should be categorical or numerical.

Another use case is when you get an arbitrary data set (we get them quite
often) and want to save it as a Parquet table. Providing correct data
types makes Parquet more space-efficient (and probably more query-time
performant, e.g. better Parquet bloom filters than just storing everything
as string/varchar).

--
Ruslan Dautkhanov

On Thu, Sep 17, 2015 at 12:32 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:

> Wanted to take something like this
> https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
> and create a Hive UDAF: an aggregate function that returns a data type
> guess.
> Am I reinventing the wheel?
> Does Spark have something like this already built-in?
> It would be very useful for exploring new wide datasets, and helpful for
> ML too, e.g. to decide between categorical and numerical variables.
>
> Ruslan
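To make concrete what I mean by a "data type guess", here is a minimal,
self-contained Scala sketch, without any UDAF plumbing. The parse order
and the 5% distinct-ratio cutoff for flagging a column as categorical are
illustrative assumptions on my part, not anything Spark, Hive, or h2o
ships today:

object TypeGuesser {

  sealed trait GuessedType
  case object IntGuess    extends GuessedType
  case object DoubleGuess extends GuessedType
  case object BoolGuess   extends GuessedType
  case object StringGuess extends GuessedType
  // Wraps the base type when the column also looks categorical.
  case class Categorical(base: GuessedType) extends GuessedType

  // True if f can parse the trimmed value without throwing.
  private def parses[A](v: String)(f: String => A): Boolean =
    try { f(v.trim); true } catch { case _: Exception => false }

  // Guess the narrowest type that every non-empty sample value fits;
  // a low distinct-count ratio additionally flags the column as
  // categorical (the 0.05 default is an arbitrary illustrative cutoff).
  def guess(sample: Seq[String], categoricalRatio: Double = 0.05): GuessedType = {
    val values = sample.filter(v => v != null && v.trim.nonEmpty)
    if (values.isEmpty) return StringGuess

    val base =
      if (values.forall(v => parses(v)(_.toLong))) IntGuess
      else if (values.forall(v => parses(v)(_.toDouble))) DoubleGuess
      else if (values.forall(v => parses(v)(_.toBoolean))) BoolGuess
      else StringGuess

    val distinctRatio = values.distinct.size.toDouble / values.size
    if (distinctRatio <= categoricalRatio) Categorical(base) else base
  }
}

For example, TypeGuesser.guess(Seq("1", "2", "2", "3")) returns IntGuess,
and a long low-cardinality sample would come back wrapped in Categorical.
In practice you'd collect a per-column sample of raw strings first and
feed it to guess(); an aggregate (UDAF-style) version would fold the same
parse checks and a distinct-count estimate over the rows instead of
collecting a sample.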