LGTM

> On Jul 18, 2019, at 8:24 AM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> 
> Thank you Ryan, I have a few comments on Github. Looks good to me.
> 
> Cheers, Fokko
> 
> Op do 18 jul. 2019 om 11:58 schreef Ryan Skraba <r...@skraba.com 
> <mailto:r...@skraba.com>>:
> Hello!  I'm motivated to see this happen :D
> 
> +Zoltan, the original author.  I created a PR against apache/avro master 
> here: https://github.com/apache/avro/pull/589 
> <https://github.com/apache/avro/pull/589>
> 
> I cherry-picked the commit from your fork, and reapplied spotless/checkstyle. 
>  I hope this is the correct way to preserve authorship and that I'm not 
> stepping on any toes!
> 
> Can someone take a look at the above PR?  
> 
> Best regards, 
> 
> Ryan
> 
> On Tue, Jul 16, 2019 at 11:58 AM Ismaël Mejía <ieme...@gmail.com 
> <mailto:ieme...@gmail.com>> wrote:
> Yes probably it is overkill to warn given the examples you mention.
> Also your argument towards reusing the mature (and battle tested)
> combination of Schema.Parser + String serialization makes sense.
> 
> Adding this to 1.9.1 will be an extra selling point for projects
> wanting to migrate to the latest version of Avro so it sounds good to
> me but you should add it to master and then we can cherry pick it from
> there.
> 
> 
> On Tue, Jul 16, 2019 at 11:16 AM Ryan Skraba <r...@skraba.com 
> <mailto:r...@skraba.com>> wrote:
> >
> > Hello!  Thanks to the reference to AVRO-1852. It's exactly what I was 
> > looking for.
> >
> > I agree that Java serialization shouldn't be used for anything 
> > cross-platform, or (in my opinion) used for any data persistence at all.  
> > Especially not for an Avro container file or sending binary data through a 
> > messaging system...
> >
> > But Java serialization is definitely useful and used for sending instances 
> > of "distributed work" implemented in Java from node to node in a cluster.  
> > I'm not too worried about existing connectors -- we can see that each 
> > framework has "solved" the problem one at a time.  In addition to Flink, 
> > there's 
> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
> >  
> > <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29>
> >  and 
> > https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
> >  
> > <https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35>.
> >
> > Specifically, I see the advantage for user-defined distributed functions 
> > that happen to carry along an Avro Schema -- and I can personally say that 
> > I've encountered this a lot in our code!
> >
> > That being said, I think it's probably overkill to warn the user about the 
> > perils of Java serialization (not being cross-language and requiring 
> > consistent JDKs and libraries across JVMs).  If an error occurs for one of 
> > those reasons, there's a larger problem for the dev to address, and it's 
> > just as likely to occur for any Java library in the job if the environment 
> > is bad.  Related, we've encountered similar issues with logical types 
> > existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster... 
> > the solution is "make sure you don't do that".  (Looking at you, guava and 
> > jackson!)
> >
> > The patch in question delegates serialization to the string form of the 
> > schema, so it's basically doing what all of the above Avro "holders" are 
> > doing -- I wouldn't object to having a sample schema available that fully 
> > exercises what a schema can hold, but I also think that Schema.Parser (used 
> > underneath) is currently pretty well tested and mature!
> >
> > Do you think this could be a candidate for 1.9.1 as a minor improvement?  I 
> > can't think of any reason that this wouldn't be backwards compatible.
> >
> > Ryan
> >
> > side note: I wrote java.lang.Serializable earlier, which probably didn't 
> > help my search for prior discussion... :/
> >
> > On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía <ieme...@gmail.com 
> > <mailto:ieme...@gmail.com>> wrote:
> >>
> >> This is a good idea even if it may have some issues that we should
> >> probably document and warn users about:
> >>
> >> 1. Java based serialization is really practical for JVM based systems,
> >> but we should probably add a warning or documentation because Java
> >> serialization is not deterministic between JVMs so this could be a
> >> source for issues (usually companies use the same version of the JVM
> >> so this is less critical, but this still can happen specially now with
> >> all the different versions of Java and OpenJDK based flavors).
> >>
> >> 2. This is not cross language compatible, the String based
> >> representation (or even an Avro based representation of Schema) can be
> >> used in every language.
> >>
> >> Even with these I think just for ease of use it is worth to make
> >> Schema Serializable. Is the plan to fully serialize it, or just to
> >> make it a String and serialize the String as done in the issue Doug
> >> mentioned?
> >> If we take the first approach we need to properly test with a Schema
> >> that has elements of the full specification that (de)-serialization
> >> works correctly. Does anyone know if we have already a test schema
> >> that covers the full ‘schema’ specification to reuse it if so?
> >>
> >> On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko <fo...@driesprong.frl> 
> >> wrote:
> >> >
> >> > Correct me if I'm wrong here. But as far as I understood the way of
> >> > serializing the schema is using Avro, as it is part of the file. To avoid
> >> > confusion there should be one way of serializing.
> >> >
> >> > However, I'm not sure if this is worth the hassle of not simply
> >> > implementing serializable. Also Flink there is a rather far from optimal
> >> > implementation:
> >> > https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
> >> >  
> >> > <https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72>
> >> > This converts it to JSON and back while distributing the schema to the
> >> > executors.
> >> >
> >> > Cheers, Fokko
> >> >
> >> > Op ma 15 jul. 2019 om 23:03 schreef Doug Cutting <cutt...@gmail.com 
> >> > <mailto:cutt...@gmail.com>>:
> >> >
> >> > > I can't think of a reason Schema should not implement Serializable.
> >> > >
> >> > > There's actually already an issue & patch for this:
> >> > >
> >> > > https://issues.apache.org/jira/browse/AVRO-1852 
> >> > > <https://issues.apache.org/jira/browse/AVRO-1852>
> >> > >
> >> > > Doug
> >> > >
> >> > > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía <ieme...@gmail.com 
> >> > > <mailto:ieme...@gmail.com>> wrote:
> >> > >
> >> > > > +d...@avro.apache.org <mailto:d...@avro.apache.org>
> >> > > >
> >> > > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <r...@skraba.com 
> >> > > > <mailto:r...@skraba.com>> wrote:
> >> > > > >
> >> > > > > Hello!
> >> > > > >
> >> > > > > I'm looking for any discussion or reference why the Schema object 
> >> > > > > isn't
> >> > > > serializable -- I'm pretty sure this must have already been discussed
> >> > > (but
> >> > > > the keywords +avro +serializable +schema have MANY results in all the
> >> > > > searches I did: JIRA, stack overflow, mailing list, web)
> >> > > > >
> >> > > > > In particular, I was at a demo today where we were asked why 
> >> > > > > Schemas
> >> > > > needed to be passed as strings to run in distributed tasks.  I 
> >> > > > remember
> >> > > > running into this problem years ago with MapReduce, and again in 
> >> > > > Spark,
> >> > > and
> >> > > > again in Beam...
> >> > > > >
> >> > > > > Is there any downside to making a Schema implement
> >> > > > java.lang.Serializable?  The only thing I can think of is that the 
> >> > > > schema
> >> > > > _should not_ be serialized with the data, and making it 
> >> > > > non-serializable
> >> > > > loosely enforces this (at the cost of continually writing different
> >> > > > flavours of "Avro holders" for when you really do want to serialize 
> >> > > > it).
> >> > > > >
> >> > > > > Willing to create a JIRA and work on the implementation, of course!
> >> > > > >
> >> > > > > All my best, Ryan
> >> > > >
> >> > >

Reply via email to