Regarding cross-language and Beam rows (and SQL!) - I have a PR up [1] that adds an example script for using Beam's SqlTransform in Python by leveraging the portable row coder. Unfortunately I got stalled figuring out how to build/stage the Java artifacts for the SQL extensions so it hasn't been merged yet.
I think a cross-language JdbcIO would be quite similar, except it's in core so there's no issue with additional jars. JdbcIO already has a ReadRows transform that can produce a PCollection<Row>; we would just need to add an ExternalTransformBuilder and ExternalTransformRegistrar implementation for that transform. PubsubIO [2] has a good example of this.

[1] https://github.com/apache/beam/pull/10055
[2] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L720

On Tue, Jan 7, 2020 at 4:49 AM Lucas Magalhães <[email protected]> wrote:

> Hi Peter.
>
> Why don't you use this external library?
> https://pypi.org/project/beam-nuggets/
> They already use SQLAlchemy and it is pretty easy to use.
>
> On Mon, Jan 6, 2020 at 10:17 PM Luke Cwik <[email protected]> wrote:
>
>> Eugene, the JdbcIO output should be updated to support Beam's schema
>> format, which would allow "rows" to cross the language boundaries.
>>
>> If the connector is easy to write and maintain then it makes sense to
>> go native. Maybe the Python version will have an easier time supporting
>> splitting and hence could overtake the Java implementation in useful
>> features.
>>
>> On Mon, Jan 6, 2020 at 3:55 PM <[email protected]> wrote:
>>
>>> Apache Airflow went with the DB-API approach as well and it seems to
>>> have worked well for them. We will likely need to add an extras_require
>>> entry for each database engine's Python package, though, which adds
>>> some complexity, but not a lot.
>>>
>>> On Jan 6, 2020, at 6:12 PM, Eugene Kirpichov <[email protected]> wrote:
>>>
>>> Agreed with the above; it seems prudent to develop a pure-Python
>>> connector for something as common as interacting with a database. It's
>>> likely easier to achieve an idiomatic API, familiar to non-Beam Python
>>> SQL users, within pure Python.
>>>
>>> Developing a cross-language connector here might be plain impossible,
>>> because rows read from a database are (at least in JDBC) not encodable:
>>> they require a user's callback to translate them to an encodable user
>>> type, and the callback can't be in Python because then you would have
>>> to encode its input before giving it to Python. The same holds for the
>>> write transform.
>>>
>>> Not sure about sqlalchemy though; maybe use plain DB-API
>>> https://www.python.org/dev/peps/pep-0249/ instead? The Python API seems
>>> friendlier than JDBC in the sense that it actually returns rows as
>>> tuples of simple data types.
>>>
>>> On Mon, Jan 6, 2020 at 1:42 PM Robert Bradshaw <[email protected]> wrote:
>>>
>>>> On Mon, Jan 6, 2020 at 1:39 PM Chamikara Jayalath <[email protected]> wrote:
>>>>
>>>>> Regarding cross-language transforms, we need to add better
>>>>> documentation, but for now you'll have to go with existing examples
>>>>> and tests. For example:
>>>>>
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/gcp/pubsub.py
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py
>>>>>
>>>>> Note that the cross-language transforms feature is currently only
>>>>> available for the Flink runner. Dataflow support is in development.
>>>>
>>>> I think it works with all non-Dataflow runners, with the exception of
>>>> the Java and Go direct runners. (It does work with the Python direct
>>>> runner.)
>>>>
>>>>> I'm fine with developing this natively for Python as well. AFAIK the
>>>>> Java JDBC IO connector is not a super-complicated connector, and it
>>>>> should be fine to make relatively easy-to-maintain and widely usable
>>>>> connectors available in multiple SDKs.
>>>>
>>>> Yes, a case can certainly be made for having native connectors for
>>>> particular common/simple sources. (We certainly don't call
>>>> cross-language to read text files, for example.)
>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> On Mon, Jan 6, 2020 at 10:56 AM Luke Cwik <[email protected]> wrote:
>>>>>
>>>>>> +Chamikara Jayalath <[email protected]> +Heejong Lee <[email protected]>
>>>>>>
>>>>>> On Mon, Jan 6, 2020 at 10:20 AM <[email protected]> wrote:
>>>>>>
>>>>>>> How do I go about doing that? From the docs, it appears
>>>>>>> cross-language transforms are currently undocumented.
>>>>>>> https://beam.apache.org/roadmap/connectors-multi-sdk/
>>>>>>>
>>>>>>> On Jan 6, 2020, at 12:55 PM, Luke Cwik <[email protected]> wrote:
>>>>>>>
>>>>>>> What about using a cross-language transform between Python and the
>>>>>>> already existing Java JdbcIO transform?
>>>>>>>
>>>>>>> On Sun, Jan 5, 2020 at 5:18 AM Peter Dannemann <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'd like to develop the Python SDK's SQL IO connector. I was
>>>>>>>> thinking it would be easiest to use sqlalchemy to achieve maximum
>>>>>>>> database engine support, but I suppose I could also create an ABC
>>>>>>>> for databases that follow the DB API and create subclasses for
>>>>>>>> each database engine that override a connect method. What are your
>>>>>>>> thoughts on the best way to do this?

--
Lucas Magalhães,
CTO

Paralelo CS - Consultoria e Serviços
Tel: +55 (11) 3090-5557
Cel: +55 (11) 99420-4667
[email protected]
www.paralelocs.com.br
