Thanks for the flags, Reynold.

1. The 4+ languages only matter on the consumption side, right? That is, you can't write a data source in Python or SQL. If that's correct and data sources can only be written in the JVM languages, then this story gets a lot easier. On the DataSource side we just require that the configuration object is JSON deserializable.
Then on the consumption side (i.e. from sqlContext.read):

- From Java/Scala, these objects can be passed through to the DataSource natively, since it's in the same JVM and people have access to the concrete parameter classes.
- On the Python side, this object can be passed over as JSON, which is then deserialized and could be made to fail explicitly when insufficient options are provided. The data source provider could even (optionally) provide a Python object which performs validation on the Python side to make this easier for consumers.
- In the SQL case, since these objects are JSON serializable, we can alter the OPTIONS keyword to allow nested maps to create the JSON object.

In all of these cases the proposed solution still degrades, in the worst case, to something equivalent to the Map[String, String] (except that it has nesting support), but in the best cases we have POJOs and optionally provided Python objects which support this in a first-class fashion.

2. Yeah, agreed, this is a big problem, which is why I flagged it in the initial email. I'll put some more thought into how this can be done in a reasonable fashion (although any suggestions would be greatly appreciated).

With the above answer to #1, and contingent on finding a solution to the API stability part of it, would you be supportive of a change to do this? If so, I'll submit a JIRA first and solicit/brainstorm some ideas on how to do #2 in a more sane way.

On Fri, Feb 26, 2016 at 5:02 PM Reynold Xin <r...@databricks.com> wrote:

> Thanks for the email. This sounds great in theory, but might run into two
> major problems:
>
> 1. Need to support 4+ programming languages (SQL, Python, Java, Scala)
>
> 2. API stability (both backward and forward)
>
> On Fri, Feb 26, 2016 at 8:44 AM, Hamel Kothari <hamelkoth...@gmail.com>
> wrote:
>
>> Hi devs,
>>
>> Has there been any discussion around changing the DataSource parameters
>> argument to be something more sophisticated than Map[String, String]? As
>> you write more complex DataSources there are likely to be a variety of
>> parameters of varying formats which are needed, and having to coerce them
>> to be strings becomes suboptimal pretty fast.
>>
>> Quite often I see this combated by people specifying parameters which
>> take in JSON strings and then parsing them into the parameter objects that
>> they actually need. Unfortunately, having people write JSON strings can be
>> a really error-prone process, so to ensure compile-time safety people write
>> convenience functions which take in actual POJOs as parameters,
>> serialize them to JSON so they can be passed into the data source API, and
>> then deserialize them in the constructors of their data sources. There's
>> also no real story around discoverability of options with the current
>> Map[String, String] setup other than looking at the source code of the
>> data source and hoping that they specified constants somewhere.
>>
>> Rather than doing all of the above, we could adapt the DataSource API to
>> have RelationProviders be templated on a parameter class which could be
>> provided to the createRelation call. On the user's side, they could just
>> create the appropriate configuration object and provide that object to the
>> DataFrameReader.parameters call, and it would be possible to guarantee that
>> enough parameters were provided to construct a DataFrame in that case.
>>
>> The key challenge I see with this approach is that I'm not sure how to
>> make the above changes in a backwards-compatible way that doesn't involve
>> duplicating a bunch of methods.
>>
>> Do people have thoughts regarding this approach? I'm happy to file a JIRA
>> and have the discussion there if it makes sense.
>>
>> Best,
>> Hamel
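
For illustration only, here is a minimal sketch of what the optional Python-side options object from the reply could look like: it validates on the Python side and produces the nested JSON that a JVM parameter class would deserialize. Everything in it is an assumption made up for the example (the class name `ExampleSourceOptions`, its fields, and the `to_json` helper); none of it is an existing Spark API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ExampleSourceOptions:
    """Hypothetical typed options for an imaginary data source (not a Spark API)."""

    url: str
    table: str
    fetch_size: int = 1000
    # Nested values are allowed, unlike with a flat Map[String, String].
    properties: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Fail early on the Python side rather than deep inside the JVM.
        if not self.url:
            raise ValueError("url is required")
        if not self.table:
            raise ValueError("table is required")
        if self.fetch_size <= 0:
            raise ValueError("fetch_size must be positive")

    def to_json(self) -> str:
        # Nested JSON that a JVM-side parameter class could deserialize.
        return json.dumps(asdict(self), sort_keys=True)


opts = ExampleSourceOptions(
    url="jdbc:h2:mem:test",
    table="events",
    properties={"user": "sa"},
)
payload = opts.to_json()
```

How `payload` would actually reach the data source (for example, through some new reader method) is exactly the open API question in this thread, so the sketch stops at producing the JSON.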