Hi,

I have not tried this (and don't have a chance to test it at the moment),
so apologies if it's incorrect, but could you use something like
the DataFileReader within a DoFn to get access to your key? It looks like
it has seek / sync methods that might work for this, assuming of course
that the data for the key is small enough that the read does not need to
be parallelized.
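
To make the idea concrete, here is a minimal sketch of the index-then-seek
pattern. It uses only the Python standard library and a made-up
length-prefixed record format as a stand-in for Avro, so write_indexed,
read_at and lookup are hypothetical helpers, not Avro or Beam APIs.

```python
import io
import json
import struct


def write_indexed(stream, records):
    """Write (key, value) records and return a {key: byte offset} index.

    The on-disk layout (4-byte big-endian length + JSON payload) is an
    assumption made for illustration; it is NOT the Avro container format.
    """
    index = {}
    for key, value in records:
        payload = json.dumps({"key": key, "value": value}).encode("utf-8")
        index[key] = stream.tell()  # remember where this record starts
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)
    return index


def read_at(stream, offset):
    """Seek to a known offset and decode exactly one record."""
    stream.seek(offset)
    (length,) = struct.unpack(">I", stream.read(4))
    return json.loads(stream.read(length).decode("utf-8"))


def lookup(stream, index, key):
    """Return the value for key, reading only that record, or None."""
    offset = index.get(key)
    if offset is None:
        return None
    return read_at(stream, offset)["value"]


# Small in-memory demo; a real file opened in binary mode works the same way.
buf = io.BytesIO()
index = write_indexed(buf, [("a", 1), ("b", 2), ("c", 3)])
print(lookup(buf, index, "b"))
```

Within a DoFn, the index would be the small in-memory map and the lookup
would be done per element. With the Avro Java API, my understanding is that
the position returned by DataFileWriter.sync() is what DataFileReader.seek()
expects, so the index would map each key to the sync position of its block
rather than to an exact record offset.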

Cheers
Reza



On Tue, 9 Jul 2019 at 23:52, Lukasz Cwik <[email protected]> wrote:

> Typically this would be done by reading in the contents of the entire file
> into a map side input and then consuming that side input within a DoFn.
>
> Unfortunately, only Dataflow supports really large side inputs with an
> efficient access pattern and only when using Beam Java for bounded
> pipelines. Support for really large side inputs for Beam Python bounded
> pipelines on Dataflow is coming but not yet available.
>
> Otherwise, you could still read the Avro files and still create a map and
> store the index as a side input and as long as the index fits in memory,
> this would work well across all runners.
>
> The programming guide[1] has a basic example on how to get started using
> side inputs.
>
> 1: https://beam.apache.org/documentation/programming-guide/#side-inputs
>
>
> On Tue, Jul 9, 2019 at 2:21 PM Shannon Duncan <[email protected]>
> wrote:
>
>> Being pretty new to Beam and big data, I have been working on
>> standardizing some input/output items for different
>> Hadoop/Beam/Spark/BigQuery jobs and processes.
>>
>> What I'm working on is having them all read/write Avro files, which is
>> actually pretty straightforward, so I have basic read/write down.
>>
>> What I'm looking for, and hoping someone on this list knows, is how to
>> index an Avro file and search that index quickly, so that Beam only has
>> to open part of the Avro file.
>>
>> For example, our pipeline can currently do this with Hadoop and
>> Sequence Files, since they store <K,V> pairs with byte offsets.
>>
>> So, given a key, I'd like to pull only that key from the Avro file,
>> reducing IO / network costs.
>>
>> Any ideas, thoughts, suggestions?
>>
>> Thanks!
>> Shannon
>>
>
