By default, Beam Java only uploads artifacts that have changed, but it looks
like this is not the case for Beam Python: there you need to explicitly opt in
with the --enable_artifact_caching flag [1].
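For anyone following along, opting in from Python looks roughly like this
(untested sketch; the flag itself is real per [1] below, the rest of the
options are just an example):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        # Opt in to reusing previously staged artifacts that haven't changed.
        '--enable_artifact_caching',
    ])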
It looks like this feature was added a year ago [2]; should we turn it on by
default?

1: https://github.com/apache/beam/blob/3070160203c6734da0eb04b440e08b43f9fd33f3/sdks/python/apache_beam/options/pipeline_options.py#L794
2: https://github.com/apache/beam/pull/16229

On Thu, Jan 5, 2023 at 11:43 AM Lina Mårtensson <lina@camus.energy> wrote:

> Thanks! I have now successfully written a beautiful string of protobuf
> bytes into a file via Python. 🎉
>
> Two issues though:
>
> 1. Robert said the Python direct runner would just work with this - but
> it's not working. After about half an hour of these messages repeated over
> and over again, I interrupted the job:
>
> E0105 07:25:48.170601677 58210 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'2023/01/05 06:57:10 Failed to obtain provisioning information: failed to dial server at localhost:41087\n\tcaused by:\ncontext deadline exceeded\n'
>
> 2. I (unsurprisingly) get back to the issue I had when I tested out the
> Spanner x-lang transform on Dataflow - the overhead for starting a job is
> unbearably slow, with the time mainly spent transferring the expansion
> service jar (115 MB) plus my own jar (105 MB) with my new code and its
> dependencies:
>
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/beam-sdks-java-io-google-cloud-platform-expansion-service-2.39.0-uBMB6BRMpxmYFg1PPu1yUxeoyeyX_lYX1NX0LVL7ZcM.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/beam-sdks-java-io-google-cloud-platform-expansion-service-2.39.0-uBMB6BRMpxmYFg1PPu1yUxeoyeyX_lYX1NX0LVL7ZcM.jar in 321 seconds.
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/java_bigtable_deploy-Ed1r7YOeLKLTmg2RGNktkym9sVYciCiielpk61r6CJ4.jar...
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/java_bigtable_deploy-Ed1r7YOeLKLTmg2RGNktkym9sVYciCiielpk61r6CJ4.jar in 295 seconds.
>
> It takes a total of 13 minutes before any workers have started on
> Dataflow, then another 4.5 minutes before the job actually does anything
> (which eventually is to read a whopping 3 cells from Bigtable ;).
>
> How could this be improved?
>
> For one, it seems to me like uploading
> sdks:java:io:google-cloud-platform:expansion-service:shadowJar from my
> computer shouldn't be necessary - shouldn't Dataflow have that jar
> already, or couldn't it be fetched by Dataflow rather than uploaded over
> a slow internet connection?
>
> And what about my own jar - it's not bound to change very often, so would
> it be possible to upload it somewhere once and then fetch it from there?
>
> Thanks!
> -Lina
>
> On Tue, Jan 3, 2023 at 1:23 PM Luke Cwik <lc...@google.com> wrote:
>
>> I would suggest using BigtableIO, which also returns a protobuf,
>> com.google.bigtable.v2.Row. This should allow you to replicate what
>> SpannerIO is doing.
>>
>> Alternatively, you could provide a way to convert the HBase result into
>> a Beam row by specifying a converter and a schema for it, and then you
>> could use the already well-known Beam schema ("row") type:
>> https://github.com/apache/beam/blob/0b8f0b4db7a0de4977e30bcfeb50b5c14c7c1572/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L1068
>>
>> Otherwise, you'll have to register the HBase result coder under a
>> well-known name so that the runner API coder URN is something that you
>> know, and then on the Python side you would need a coder for that URN as
>> well, to let you understand the bytes being sent across from the Java
>> portion of the pipeline.
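>>
>> Roughly, the Python half of that pairing might look like this (an
>> untested sketch from memory; the URN is made up and just has to match
>> whatever the Java side registers):
>>
>> from apache_beam.coders import coders
>>
>> class HBaseResultCoder(coders.Coder):
>>     # Must read/write exactly the bytes the Java coder produces.
>>     def encode(self, value):
>>         raise NotImplementedError('mirror the Java coder wire format')
>>
>>     def decode(self, encoded_bytes):
>>         raise NotImplementedError('mirror the Java coder wire format')
>>
>>     def is_deterministic(self):
>>         return True
>>
>> # Hypothetical URN; registering a constructor for it is what keeps
>> # Coder.from_runner_api from raising the KeyError you hit (quoted below).
>> coders.Coder.register_urn(
>>     'beam:coder:hbase_result:v1',
>>     None,  # no payload type
>>     lambda _payload, _components, _context: HBaseResultCoder())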
>>
>> On Fri, Dec 30, 2022 at 12:59 AM Lina Mårtensson <lina@camus.energy> wrote:
>>
>>> And the next issue... I'm getting KeyError: 'beam:coders:javasdk:0.1',
>>> which I learned
>>> <https://cwiki.apache.org/confluence/display/BEAM/Multi-language+Pipelines+Tips>
>>> is because the transform is trying to return something for which there
>>> is no standard Beam coder
>>> <https://github.com/apache/beam/blob/05428866cdbf1ea8e4c1789dd40327673fd39451/model/pipeline/src/main/proto/beam_runner_api.proto#L784>.
>>> Makes sense, but... how do I fix this? The documentation talks about how
>>> to do this for the input, but not for the output.
>>>
>>> Comparing to Spanner, it looks like Spanner returns a protobuf, which
>>> I'm guessing somehow gets converted to bytes... But CloudBigtableIO
>>> <https://github.com/googleapis/java-bigtable-hbase/blob/main/bigtable-dataflow-parent/bigtable-hbase-beam/src/main/java/com/google/cloud/bigtable/beam/CloudBigtableIO.java>
>>> returns org.apache.hadoop.hbase.client.Result.
>>>
>>> My buildExternal method looks as follows:
>>>
>>> @Override
>>> public PTransform<PBegin, PCollection<Result>> buildExternal(
>>>     BigtableReadBuilder.Configuration configuration) {
>>>   return Read.from(CloudBigtableIO.read(
>>>       new CloudBigtableScanConfiguration.Builder()
>>>           .withProjectId(configuration.projectId)
>>>           .withInstanceId(configuration.instanceId)
>>>           .withTableId(configuration.tableId)
>>>           .build()));
>>> }
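>>>
>>> On the Python side I'm invoking it roughly like this (simplified; the
>>> URN, IDs, and jar path here are placeholders for the real ones from my
>>> registrar and build):
>>>
>>> import apache_beam as beam
>>> from apache_beam.transforms.external import (
>>>     ExternalTransform, ImplicitSchemaPayloadBuilder, JavaJarExpansionService)
>>>
>>> with beam.Pipeline() as p:
>>>     results = p | ExternalTransform(
>>>         'beam:external:energy:camus:bigtable:read:v1',  # placeholder URN
>>>         # Field names must match the setters on the Java Configuration.
>>>         ImplicitSchemaPayloadBuilder({
>>>             'projectId': 'my-project',
>>>             'instanceId': 'my-instance',
>>>             'tableId': 'my-table',
>>>         }),
>>>         JavaJarExpansionService('path/to/my_deploy.jar'))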
>>>
>>> I also got a warning, which I *believe* is unrelated (but also an issue):
>>>
>>> INFO:apache_beam.utils.subprocess_server:b"WARNING: Configuration class 'energy.camus.beam.BigtableRegistrar$BigtableReadBuilder$Configuration' has no schema registered. Attempting to construct with setter approach."
>>> INFO:apache_beam.utils.subprocess_server:b'Dec 30, 2022 7:46:14 AM org.apache.beam.sdk.expansion.service.ExpansionService$ExternalTransformRegistrarLoader payloadToConfig'
>>>
>>> What is this schema, and what should it look like?
>>>
>>> Thanks!
>>> -Lina
>>>
>>> On Fri, Dec 30, 2022 at 12:28 AM Lina Mårtensson <lina@camus.energy> wrote:
>>>
>>>> Thanks! This was really helpful. It took a while to figure out the
>>>> details - a section in the docs on what's required of these jars for
>>>> non-Java users would be a great addition.
>>>>
>>>> But once I did, the Bazel config was actually quite straightforward and
>>>> makes sense. I pasted the first section from here
>>>> <https://github.com/bazelbuild/rules_jvm_external/blob/master/README.md#usage>
>>>> into my WORKSPACE file and changed the artifacts to the ones I needed.
>>>> (How to find the right ones remains confusing.)
>>>>
>>>> After that I updated my BUILD rules, and Bazel had easy and
>>>> straightforward configs for it; all I needed was this:
>>>>
>>>> # From https://github.com/google/bazel-common/blob/master/third_party/java/auto/BUILD.
>>>> # The AutoService processor is what registers our Registrar class; it
>>>> # needs to be a plugin so that it runs at compile time.
>>>> java_plugin(
>>>>     name = "auto_service_processor",
>>>>     processor_class = "com.google.auto.service.processor.AutoServiceProcessor",
>>>>     deps = [
>>>>         "@maven//:com_google_auto_service_auto_service",
>>>>         "@maven//:com_google_auto_service_auto_service_annotations",
>>>>         "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
>>>>     ],
>>>> )
>>>>
>>>> java_binary(
>>>>     name = "java_hbase",
>>>>     main_class = "energy.camus.beam.BigtableRegistrar",
>>>>     plugins = [":auto_service_processor"],
>>>>     srcs = ["src/main/java/energy/camus/beam/BigtableRegistrar.java"],
>>>>     deps = [
>>>>         "@maven//:com_google_auto_service_auto_service",
>>>>         "@maven//:com_google_auto_service_auto_service_annotations",
>>>>         "@maven//:com_google_cloud_bigtable_bigtable_hbase_beam",
>>>>         "@maven//:org_apache_beam_beam_sdks_java_core",
>>>>         "@maven//:org_apache_beam_beam_vendor_guava_26_0_jre",
>>>>         "@maven//:org_apache_hbase_hbase_shaded_client",
>>>>     ],
>>>> )
>>>>
>>>> On Thu, Dec 29, 2022 at 2:43 PM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> AutoService relies on Java's compiler annotation processor.
>>>>> https://github.com/google/auto/tree/main/service#getting-started
>>>>> shows that you need to configure Java's compiler to use the annotation
>>>>> processors within AutoService.
>>>>>
>>>>> I saw this public gist that seemed to enable using the AutoService
>>>>> annotation processor with Bazel:
>>>>> https://gist.github.com/jart/5333824b94cd706499a7bfa1e086ee00
>>>>>
>>>>> On Thu, Dec 29, 2022 at 2:27 PM Lina Mårtensson via dev <dev@beam.apache.org> wrote:
>>>>>
>>>>>> That's good news about the direct runner, thanks!
>>>>>>
>>>>>> On Thu, Dec 29, 2022 at 2:02 PM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>
>>>>>>> On Thu, Jul 28, 2022 at 5:37 PM Chamikara Jayalath via dev
>>>>>>> <dev@beam.apache.org> wrote:
>>>>>>> >
>>>>>>> > On Thu, Jul 28, 2022 at 4:51 PM Lina Mårtensson <lina@camus.energy> wrote:
>>>>>>> >>
>>>>>>> >> Thanks for the detailed answers!
>>>>>>> >>
>>>>>>> >> I totally get the points about development & maintenance cost,
>>>>>>> >> and, from a user perspective, about getting the performance right.
>>>>>>> >>
>>>>>>> >> I decided to try out the Spanner connector to get a sense of how
>>>>>>> >> well the x-language approach works in our world, since that's an
>>>>>>> >> existing x-language connector.
>>>>>>> >> Overall it works, and with minimal intervention as you say - it
>>>>>>> >> is very slow, though.
>>>>>>> >> I'm a little confused about "portable runners" - if I understand
>>>>>>> >> this correctly, it means we couldn't run with the DirectRunner
>>>>>>> >> anymore if using an x-language connector? (At least it didn't work
>>>>>>> >> when I tried it.)
>>>>>>> >
>>>>>>> > You'll have to use the portable DirectRunner -
>>>>>>> > https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/portability
>>>>>>> >
>>>>>>> > A job service for this can be started using the following command:
>>>>>>> > python apache_beam/runners/portability/local_job_service_main.py -p <port>
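>>>>>>> >
>>>>>>> > Then point your pipeline at that endpoint, e.g. (the port here is
>>>>>>> > a placeholder for whatever you passed to -p):
>>>>>>> >
>>>>>>> > from apache_beam.options.pipeline_options import PipelineOptions
>>>>>>> >
>>>>>>> > options = PipelineOptions([
>>>>>>> >     '--runner=PortableRunner',
>>>>>>> >     '--job_endpoint=localhost:8099',
>>>>>>> > ])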
>>>>>>>
>>>>>>> Note that the Python direct runner is already a portable runner, so
>>>>>>> you shouldn't have to do anything special (like start up a separate
>>>>>>> job service and pass extra options) to run locally. Just use the
>>>>>>> cross-language transforms as you would any normal Python transform.
>>>>>>>
>>>>>>> The goal is to make this as smooth and transparent as possible;
>>>>>>> please keep coming back to us if you find rough edges.
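>>>>>>>
>>>>>>> Concretely, something along these lines should just run locally (an
>>>>>>> untested sketch in the spirit of the Spanner module's docs example;
>>>>>>> all names and IDs are made up):
>>>>>>>
>>>>>>> from typing import NamedTuple
>>>>>>>
>>>>>>> import apache_beam as beam
>>>>>>> from apache_beam import coders
>>>>>>> from apache_beam.io.gcp.spanner import ReadFromSpanner
>>>>>>>
>>>>>>> class ExampleRow(NamedTuple):
>>>>>>>     id: int
>>>>>>>     name: str
>>>>>>>
>>>>>>> # Cross-language transforms exchange schema'd rows, so register a
>>>>>>> # RowCoder for the output type.
>>>>>>> coders.registry.register_coder(ExampleRow, coders.RowCoder)
>>>>>>>
>>>>>>> # A plain Pipeline() uses the direct runner - no job service needed.
>>>>>>> with beam.Pipeline() as p:
>>>>>>>     _ = (
>>>>>>>         p
>>>>>>>         | ReadFromSpanner(
>>>>>>>             project_id='my-project',
>>>>>>>             instance_id='my-instance',
>>>>>>>             database_id='my-db',
>>>>>>>             row_type=ExampleRow,
>>>>>>>             sql='SELECT id, name FROM example_table',
>>>>>>>         ).with_output_types(ExampleRow)
>>>>>>>         | beam.Map(print))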