Another +1. I have already run into this case several times.
On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> +1 for this idea, since text parsing in CSV/JSON is quite common.
>
> One thing to consider is schema inference, as with the JSON functionality.
> In the case of JSON, we added schema_of_json for it, and the same should be
> applicable to CSV too.
> If we see more need for it, we can consider a function like schema_of_csv
> as well.
>
>
> On Sun, Sep 16, 2018 at 4:41 PM, Maxim Gekk <maxim.g...@databricks.com> wrote:
>
>> Hi Reynold,
>>
>> > i'd make this as consistent as to_json / from_json as possible
>>
>> Sure, the new function from_csv() has the same signature as from_json().
>>
>> > how would this work in sql? i.e. how would passing options in work?
>>
>> The options are passed to the function via a map, for example:
>> select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat',
>> 'dd/MM/yyyy'))
>>
>> On Sun, Sep 16, 2018 at 7:01 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> makes sense - i'd make this as consistent as to_json / from_json as
>>> possible.
>>>
>>> how would this work in sql? i.e. how would passing options in work?
>>>
>>> --
>>> excuse the brevity and lower case due to wrist injury
>>>
>>>
>>> On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk <maxim.g...@databricks.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I would like to propose a new function, from_csv(), for parsing columns
>>>> containing strings in CSV format. Here is my PR:
>>>> https://github.com/apache/spark/pull/22379
>>>>
>>>> A use case is loading a dataset from external storage, a DBMS, or a
>>>> system like Kafka, where CSV content was dumped as one of the
>>>> columns/fields. Other columns may contain related information such as
>>>> timestamps, ids, and data sources. The column with CSV strings can be
>>>> parsed with the existing csv() method of DataFrameReader, but in that
>>>> case we have to "clean up" the dataset and remove the other columns,
>>>> since the csv() method requires a Dataset[String]. Joining the parsed
>>>> result back to the original dataset by position is expensive and
>>>> inconvenient. Instead, users parse CSV columns with string functions,
>>>> an approach that is usually error prone, especially for quoted values
>>>> and other special cases.
>>>>
>>>> The methods proposed in the PR should provide a better user experience
>>>> for parsing CSV-like columns. Please share your thoughts.
>>>>
>>>> --
>>>> Maxim Gekk
>>>> Technical Solutions Lead
>>>> Databricks Inc.
>>>> maxim.g...@databricks.com
>>>> databricks.com
>>>>
>>>
>>

--
Dongjin Lee

A hitchhiker in the mathematical world.

github: github.com/dongjinleekr
linkedin: kr.linkedin.com/in/dongjinleekr
slideshare: www.slideshare.net/dongjinleekr
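
For reference, a minimal sketch of how the proposed from_csv() might be used
from the Scala API, assuming its signature mirrors from_json() as discussed
in the thread (a Column, a StructType, and an options map). The object name,
column names, and sample rows are hypothetical, and the timestampFormat
option reuses the SQL example above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types.StructType

object FromCsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("from_csv-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // One column carries the CSV payload; the others hold related metadata
    // such as ids (the sample rows here are made up), e.g. records read from Kafka.
    val df = Seq(
      (1L, "26/08/2015,stationA,42"),
      (2L, "27/08/2015,stationB,17")
    ).toDF("id", "value")

    // Schema of the embedded CSV, given in DDL form as in the SQL example above.
    val csvSchema = StructType.fromDDL("time TIMESTAMP, station STRING, reading INT")

    // Parse the CSV column in place; no need to strip the other columns
    // and join the parsed result back by position.
    val parsed = df.withColumn(
      "csv",
      from_csv($"value", csvSchema, Map("timestampFormat" -> "dd/MM/yyyy"))
    )

    parsed.select($"id", $"csv.time", $"csv.station", $"csv.reading")
      .show(truncate = false)

    spark.stop()
  }
}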