Re: pubsub -> IO

2019-07-17 Thread Ryan Skraba
Hello! To clarify, you want to do something like this? PubSubIO.read() -> extract mongodb collection and range -> MongoDbIO.read(collection, range) -> ... If I'm not mistaken, it isn't possible with the implementation of MongoDbIO (based on BoundedSource interface, requiring the collection to be

Re: Caused by: java.lang.Exception: The user defined 'open()' method caused an exception: java.lang.NoClassDefFoundError: Could not initialize class com.google.common.io.BaseEncoding

2019-07-19 Thread Ryan Skraba
Hello! These are the "fun" problems to track down. I believe the GoogleCredentials class (0.12.0 in Beam, if that's where it's coming from) brings in an unvendored/unshaded dependency on guava-20.x. BaseEncoding was introduced in guava-14.x Someplace in your job, there's probably an older vers

Re: Choosing a coder for a class that contains a Row?

2019-07-23 Thread Ryan Skraba
Hello Pablo! Just to clarify -- the Row schemas aren't known at pipeline construction time, but can be discovered from the instance of MyData? Once discovered, is the schema "homogeneous" for all instance of MyData? (i.e. someRow will always have the same schema for all instances afterwards, and

Re: Choosing a coder for a class that contains a Row?

2019-07-24 Thread Ryan Skraba
guring this > out. But I reckon I'll have to come back to this... > > Best > -P. > > On Tue, Jul 23, 2019 at 1:07 AM Ryan Skraba wrote: >> >> Hello Pablo! Just to clarify -- the Row schemas aren't known at >> pipeline construction time, but can be d

Re: Multiple file systems configuration

2019-08-23 Thread Ryan Skraba
Wow -- super neat for multi-HDFS configs! This has always been a weak spot for the hadoop filesystem (including Hadoop S3). Any idea if this can support reading with one kerberos keytab and writing with another? I'll give it a try! I suspect the answer is "maybe". Ryan On Wed, Aug 21, 2019 at

Re: ETL with Beam?

2019-10-11 Thread Ryan Skraba
Hello! Talend has a big data ETL product in the cloud called Pipeline Designer, entirely powered by Beam. There was a talk at Beam Summit 2018 (https://www.youtube.com/watch?v=1AlEGUtiQek), but unfortunately the live demo wasn't captured in the video. You can find other videos of Pipeline Design

Re: Reading file with line number

2019-11-07 Thread Ryan Skraba
Hello! Generally, for big data, line number (or similar "sequence id") is an antipattern. Knowing any line number implies knowing the previous line... and even this simple relationship breaks the ability to process elements in parallel. It *should* be possible to efficiently know the starting by

Re: RDD Caching in SparkRunner

2020-02-26 Thread Ryan Skraba
Hello! If I understand correctly in Spark, the common pattern for multiple outputs is to collect them all into a single _persisted_ RDD internally, then filter into the separate RDDs (one per output) on demand. Persisting is usually the right thing to do. Otherwise, spark could risk fusioning th

Re: Is AvroCoder the right coder for me?

2019-04-04 Thread Ryan Skraba
Hello Augusto! I just took a look. The behaviour that you're seeing looks like it's set in Avro ReflectData -- to avoid doing expensive reflection calls for each serialization/deserialization, it uses a cache per-class AND access is synchronized [1]. Only one thread in your executor JVM is acces