Eugene, the JdbcIO output should be updated to support Beam's schema format, which would allow "rows" to cross language boundaries.
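To illustrate the idea of schema rows: named fields of simple, portably-encodable types need no user callback to encode, which is what lets them cross an SDK boundary. In Beam itself this role is played by `beam.Row` with a row coder; the sketch below is a stdlib stand-in (a `NamedTuple` over `sqlite3`, which is a DB-API driver) so it runs without apache_beam installed.

```python
import sqlite3
from typing import NamedTuple

# A schema-aware "row": named fields of simple types. In Beam this role is
# played by beam.Row / a row coder; NamedTuple is a stand-in so the sketch
# runs with only the standard library.
class UserRow(NamedTuple):
    id: int
    name: str

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])

# DB-API cursors already yield plain tuples of simple types (PEP 249),
# so mapping them onto a fixed schema needs no user-supplied callback --
# unlike a JDBC ResultSet, whose translation runs user code.
rows = [UserRow(*t) for t in conn.execute("SELECT id, name FROM users ORDER BY id")]
print(rows)
```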
If the connector is easy to write and maintain, then a native implementation makes sense. The Python version may even have an easier time supporting splitting, and hence could overtake the Java implementation in useful features.

On Mon, Jan 6, 2020 at 3:55 PM <[email protected]> wrote:

> Apache Airflow went for the DB API approach as well, and it seems to have
> worked well for them. We will likely need to add extras_require for each
> database engine's Python package, which adds some complexity, but not a lot.
>
> On Jan 6, 2020, at 6:12 PM, Eugene Kirpichov <[email protected]> wrote:
>
> Agreed with the above; it seems prudent to develop a pure-Python connector
> for something as common as interacting with a database. It's likely easier
> to achieve an idiomatic API, familiar to non-Beam Python SQL users, within
> pure Python.
>
> Developing a cross-language connector here might be plain impossible,
> because rows read from a database are (at least in JDBC) not encodable:
> they require a user's callback to translate them to an encodable user type,
> and that callback can't be in Python, because then you would have to encode
> its input before handing it to Python. The same holds for the write
> transform.
>
> Not sure about sqlalchemy, though; maybe use plain DB-API
> https://www.python.org/dev/peps/pep-0249/ instead? The Python API seems
> friendlier than JDBC in the sense that it actually returns rows as tuples
> of simple data types.
>
> On Mon, Jan 6, 2020 at 1:42 PM Robert Bradshaw <[email protected]> wrote:
>
>> On Mon, Jan 6, 2020 at 1:39 PM Chamikara Jayalath <[email protected]> wrote:
>>
>>> Regarding cross-language transforms, we need to add better
>>> documentation, but for now you'll have to go with existing examples and
>>> tests. For example:
>>>
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/gcp/pubsub.py
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py
>>>
>>> Note that the cross-language transforms feature is currently only
>>> available for the Flink runner. Dataflow support is in development.
>>
>> I think it works with all non-Dataflow runners, with the exception of the
>> Java and Go direct runners. (It does work with the Python direct runner.)
>>
>>> I'm fine with developing this natively for Python as well. AFAIK the
>>> Java JDBC IO connector is not a super-complicated connector, and it
>>> should be fine to make relatively easy-to-maintain and widely usable
>>> connectors available in multiple SDKs.
>>
>> Yes, a case can certainly be made for having native connectors for
>> particular common/simple sources. (We certainly don't call cross-language
>> to read text files, for example.)
>>
>>> Thanks,
>>> Cham
>>>
>>> On Mon, Jan 6, 2020 at 10:56 AM Luke Cwik <[email protected]> wrote:
>>>
>>>> +Chamikara Jayalath <[email protected]> +Heejong Lee <[email protected]>
>>>>
>>>> On Mon, Jan 6, 2020 at 10:20 AM <[email protected]> wrote:
>>>>
>>>>> How do I go about doing that? From the docs, it appears cross-language
>>>>> transforms are currently undocumented.
>>>>> https://beam.apache.org/roadmap/connectors-multi-sdk/
>>>>>
>>>>> On Jan 6, 2020, at 12:55 PM, Luke Cwik <[email protected]> wrote:
>>>>>
>>>>> What about using a cross-language transform between Python and the
>>>>> already existing Java JdbcIO transform?
>>>>>
>>>>> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann <[email protected]> wrote:
>>>>>
>>>>>> I'd like to develop the Python SDK's SQL IO connector. I was thinking
>>>>>> it would be easiest to use sqlalchemy to achieve maximum database
>>>>>> engine support, but I suppose I could also create an ABC for databases
>>>>>> that follow the DB API and create subclasses for each database engine
>>>>>> that override a connect method. What are your thoughts on the best way
>>>>>> to do this?
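The ABC approach Peter describes could be sketched roughly as follows. This is a hypothetical shape, not an actual Beam API: the class and method names are invented for illustration, and `sqlite3` (a stdlib DB-API driver) stands in for a real engine package so the sketch is runnable.

```python
import abc
import sqlite3

class DatabaseSource(abc.ABC):
    """Hypothetical base class for databases that follow DB-API (PEP 249).

    Only connect() is engine-specific; everything else can be shared,
    because PEP 249 standardizes cursors, execute(), and fetchall().
    """

    @abc.abstractmethod
    def connect(self):
        """Return a DB-API connection for this engine."""

    def read_query(self, query):
        conn = self.connect()
        try:
            cur = conn.cursor()
            cur.execute(query)
            return cur.fetchall()  # DB-API rows: plain tuples of simple types
        finally:
            conn.close()

class SqliteSource(DatabaseSource):
    """One subclass per engine; only connect() is overridden."""

    def __init__(self, path):
        self.path = path

    def connect(self):
        return sqlite3.connect(self.path)

# A hypothetical PostgresSource would differ only in connect(),
# e.g. returning psycopg2.connect(...) -- hence the extras_require
# per engine mentioned earlier in the thread.
src = SqliteSource(":memory:")
result = src.read_query("SELECT 1 + 1")
print(result)
```

The trade-off versus sqlalchemy is that this keeps the dependency surface tiny (one driver package per engine, no ORM layer), at the cost of writing one small subclass per supported engine.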
