Regarding cross-language and Beam rows (and SQL!) - I have a PR up [1] that adds an example script for using Beam's SqlTransform in Python by leveraging the portable row coder. Unfortunately I got stalled figuring out how to build/stage the Java artifacts for the SQL extensions so it hasn't been merged yet.
I think a cross-language JdbcIO would be quite similar, except it's in core so there's no issue with additional jars. JdbcIO already has a ReadRows transform that can produce a PCollection<Row>; we would just need to add an ExternalTransformBuilder and ExternalTransformRegistrar implementation for that transform. PubsubIO [2] has a good example of this.

[1] https://github.com/apache/beam/pull/10055
[2] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L720

On Tue, Jan 7, 2020 at 4:49 AM Lucas Magalhães <[email protected]> wrote:

> Hi Peter.
>
> Why don't you use this external library?
> https://pypi.org/project/beam-nuggets/
> They already use SQLAlchemy and it is pretty easy to use.
>
> On Mon, Jan 6, 2020 at 10:17 PM Luke Cwik <[email protected]> wrote:
>
>> Eugene, the JdbcIO output should be updated to support Beam's schema
>> format, which would allow "rows" to cross the language boundaries.
>>
>> If the connector is easy to write and maintain then it makes sense to
>> go native. Maybe the Python version will have an easier time supporting
>> splitting and hence could overtake the Java implementation in useful
>> features.
>>
>> On Mon, Jan 6, 2020 at 3:55 PM <[email protected]> wrote:
>>
>>> Apache Airflow went with the DB-API approach as well and it seems to
>>> have worked well for them. We will likely need to add an extras_require
>>> entry for each database engine's Python package, though, which adds
>>> some complexity, but not a lot.
>>>
>>> On Jan 6, 2020, at 6:12 PM, Eugene Kirpichov <[email protected]> wrote:
>>>
>>> Agreed with the above; it seems prudent to develop a pure-Python
>>> connector for something as common as interacting with a database. It's
>>> likely easier to achieve an idiomatic API, familiar to non-Beam Python
>>> SQL users, within pure Python.
>>>
>>> Developing a cross-language connector here might be plain impossible,
>>> because rows read from a database are (at least in JDBC) not encodable:
>>> they require a user's callback to translate them to an encodable user
>>> type, and the callback can't be in Python because then you would have
>>> to encode its input before giving it to Python. The same holds for the
>>> write transform.
>>>
>>> Not sure about sqlalchemy though; maybe use plain DB-API
>>> https://www.python.org/dev/peps/pep-0249/ instead? The Python API seems
>>> friendlier than JDBC in the sense that it actually returns rows as
>>> tuples of simple data types.
>>>
>>> On Mon, Jan 6, 2020 at 1:42 PM Robert Bradshaw <[email protected]> wrote:
>>>
>>>> On Mon, Jan 6, 2020 at 1:39 PM Chamikara Jayalath <[email protected]> wrote:
>>>>
>>>>> Regarding cross-language transforms, we need to add better
>>>>> documentation, but for now you'll have to go with existing examples
>>>>> and tests. For example:
>>>>>
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/gcp/pubsub.py
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py
>>>>>
>>>>> Note that the cross-language transforms feature is currently only
>>>>> available for the Flink runner. Dataflow support is in development.
>>>>
>>>> I think it works with all non-Dataflow runners, with the exception of
>>>> the Java and Go direct runners. (It does work with the Python direct
>>>> runner.)
>>>>
>>>>> I'm fine with developing this natively for Python as well. AFAIK the
>>>>> Java JDBC IO connector is not a super-complicated connector, and it
>>>>> should be fine to make relatively easy-to-maintain and widely usable
>>>>> connectors available in multiple SDKs.
>>>>
>>>> Yes, a case can certainly be made for having native connectors for
>>>> particular common/simple sources. (We certainly don't call
>>>> cross-language to read text files, for example.)
>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> On Mon, Jan 6, 2020 at 10:56 AM Luke Cwik <[email protected]> wrote:
>>>>>
>>>>>> +Chamikara Jayalath <[email protected]> +Heejong Lee <[email protected]>
>>>>>>
>>>>>> On Mon, Jan 6, 2020 at 10:20 AM <[email protected]> wrote:
>>>>>>
>>>>>>> How do I go about doing that? From the docs, it appears
>>>>>>> cross-language transforms are currently undocumented.
>>>>>>> https://beam.apache.org/roadmap/connectors-multi-sdk/
>>>>>>>
>>>>>>> On Jan 6, 2020, at 12:55 PM, Luke Cwik <[email protected]> wrote:
>>>>>>>
>>>>>>> What about using a cross-language transform between Python and the
>>>>>>> already existing Java JdbcIO transform?
>>>>>>>
>>>>>>> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'd like to develop the Python SDK's SQL IO connector. I was
>>>>>>>> thinking it would be easiest to use sqlalchemy to achieve maximum
>>>>>>>> database engine support, but I suppose I could also create an ABC
>>>>>>>> for databases that follow the DB API and create subclasses for
>>>>>>>> each database engine that override a connect method. What are your
>>>>>>>> thoughts on the best way to do this?

--
Lucas Magalhães,
CTO

Paralelo CS - Consultoria e Serviços
Tel: +55 (11) 3090-5557
Cel: +55 (11) 99420-4667
[email protected]
www.paralelocs.com.br
