Hello! To clarify, you want to do something like this?
PubSubIO.read() -> extract mongodb collection and range ->
MongoDbIO.read(collection, range) -> ...
If I'm not mistaken, it isn't possible with the implementation of
MongoDbIO (based on the BoundedSource interface, requiring the
collection to be known at pipeline construction time).
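Just to make the limitation concrete, here's a minimal sketch of how
MongoDbIO is configured today (the uri, database and collection names
below are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.values.PCollection;
import org.bson.Document;

public class MongoReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // The collection (and uri/database) are plain construction-time
    // constants here -- there's no way to feed them in from an element
    // read off Pub/Sub at runtime.
    PCollection<Document> docs =
        pipeline.apply(
            MongoDbIO.read()
                .withUri("mongodb://localhost:27017") // placeholder
                .withDatabase("myDatabase")           // placeholder
                .withCollection("myCollection"));     // fixed when built

    pipeline.run().waitUntilFinish();
  }
}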
Hello! These are the "fun" problems to track down.
I believe the GoogleCredentials class (0.12.0 in Beam, if that's where
it's coming from) brings in an unvendored/unshaded dependency on
guava-20.x, while BaseEncoding was only introduced in guava-14.x.
Someplace in your job, there's probably an older version of guava on
the classpath taking precedence.
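One quick way to track down which jar is actually supplying the class
at runtime (just a debugging sketch, nothing Beam-specific):

import com.google.common.io.BaseEncoding;

public class WhichJar {
  public static void main(String[] args) {
    // Prints the jar (or directory) the classloader used for
    // BaseEncoding; comparing it to the guava version you expect
    // usually exposes the conflicting dependency.
    System.out.println(
        BaseEncoding.class.getProtectionDomain().getCodeSource().getLocation());
  }
}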
Hello Pablo! Just to clarify -- the Row schemas aren't known at
pipeline construction time, but can be discovered from the instance of
MyData?
Once discovered, is the schema "homogeneous" for all instances of
MyData? (i.e. someRow will always have the same schema for all
instances afterwards, and it won't change for the lifetime of the
pipeline?)
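If the schema really is stable once discovered, I'd expect something
along these lines to work (just a sketch -- the field names are
invented for illustration):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

public class RowSketch {
  public static void main(String[] args) {
    // Build the Beam Schema at runtime, e.g. from the first MyData
    // instance seen.
    Schema schema =
        Schema.builder()
            .addStringField("name")   // invented field
            .addInt64Field("count")   // invented field
            .build();

    // Every subsequent Row would be expected to use the same schema.
    Row row = Row.withSchema(schema).addValues("example", 42L).build();
    System.out.println(row);
  }
}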
Wow -- super neat for multi-HDFS configs! This has always been a weak
spot for the Hadoop filesystem (including Hadoop S3).
Any idea if this can support reading with one kerberos keytab and
writing with another? I'll give it a try! I suspect the answer is
"maybe".
Ryan
Hello! Talend has a big data ETL product in the cloud called Pipeline
Designer, entirely powered by Beam. There was a talk at Beam Summit
2018 (https://www.youtube.com/watch?v=1AlEGUtiQek), but unfortunately
the live demo wasn't captured in the video. You can find other videos
of Pipeline Designer.
Hello!
Generally, for big data, a line number (or similar "sequence id") is an
antipattern.
Knowing any line number implies knowing every line that came before
it... and even this simple dependency breaks the ability to process
elements in parallel.
It *should* be possible to efficiently know the starting byte offset of
an element, though.
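If line numbers are really required, one workaround (sketched below --
this isn't part of any built-in IO) is to parallelize per file and
number lines sequentially inside a DoFn, accepting that each file is
then read by a single worker:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Emits (lineNumber, line) pairs, one file at a time.
class NumberLinesFn extends DoFn<FileIO.ReadableFile, KV<Long, String>> {
  @ProcessElement
  public void processElement(@Element FileIO.ReadableFile file,
      OutputReceiver<KV<Long, String>> out) throws Exception {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        Channels.newInputStream(file.open()), StandardCharsets.UTF_8))) {
      long lineNumber = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        out.output(KV.of(++lineNumber, line));
      }
    }
  }
}

It would typically be applied after FileIO.match() and
FileIO.readMatches().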
Hello!
If I understand correctly, in Spark the common pattern for multiple
outputs is to collect them all into a single _persisted_ RDD
internally, then filter into the separate RDDs (one per output) on
demand.
Persisting is usually the right thing to do. Otherwise, Spark could
risk fusing the branches together and recomputing all of the upstream
work once per output.
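A rough illustration of that pattern in plain Spark (Java API, with
invented data -- this isn't the Beam runner code itself):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class MultiOutputSketch {
  public static void main(String[] args) {
    try (JavaSparkContext sc =
        new JavaSparkContext("local[*]", "multi-output")) {
      // The single RDD holding all the tagged outputs together.
      JavaRDD<String> all = sc.parallelize(Arrays.asList("a:1", "b:2", "a:3"));

      // Persist before branching so the upstream work is done only once.
      all.persist(StorageLevel.MEMORY_AND_DISK());

      // Each logical output is just a filter over the persisted RDD.
      JavaRDD<String> outputA = all.filter(s -> s.startsWith("a:"));
      JavaRDD<String> outputB = all.filter(s -> s.startsWith("b:"));

      System.out.println(outputA.count() + " / " + outputB.count());
    }
  }
}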
Hello Augusto!
I just took a look. The behaviour that you're seeing looks like it
comes from Avro ReflectData -- to avoid doing expensive reflection
calls for each serialization/deserialization, it uses a cache per-class
AND access to the cache is synchronized [1]. Only one thread in your
executor JVM is accessing it at a time.
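If that contention turns out to be the bottleneck, one thing that might
be worth trying (a sketch, assuming the synchronization really is on
the shared singleton's cache) is giving each thread its own ReflectData
instance with its own cache instead of going through ReflectData.get():

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class ReflectDataPerThread {
  // Shared singleton: all threads contend on the same synchronized cache.
  static Schema sharedLookup(Class<?> clazz) {
    return ReflectData.get().getSchema(clazz);
  }

  // Per-thread ReflectData: each thread gets its own instance and
  // cache, trading some memory and duplicated reflection work for less
  // lock contention.
  private static final ThreadLocal<ReflectData> REFLECT_DATA =
      ThreadLocal.withInitial(ReflectData::new);

  static Schema perThreadLookup(Class<?> clazz) {
    return REFLECT_DATA.get().getSchema(clazz);
  }
}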