Thanks again Ryan, I've merged to master and the 1.9 branch, so it should be in Avro 1.9.1.
Cheers, Fokko Driesprong

On Thu, Jul 18, 2019 at 15:18, Zoltan Farkas <zolyfar...@yahoo.com> wrote:

> LGTM
>
> On Jul 18, 2019, at 8:24 AM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>
> Thank you Ryan, I have a few comments on GitHub. Looks good to me.
>
> Cheers, Fokko
>
> On Thu, Jul 18, 2019 at 11:58, Ryan Skraba <r...@skraba.com> wrote:
>
>> Hello! I'm motivated to see this happen :D
>>
>> +Zoltan, the original author. I created a PR against apache/avro master
>> here: https://github.com/apache/avro/pull/589
>>
>> I cherry-picked the commit from your fork and reapplied
>> spotless/checkstyle. I hope this is the correct way to preserve
>> authorship and that I'm not stepping on any toes!
>>
>> Can someone take a look at the above PR?
>>
>> Best regards,
>>
>> Ryan
>>
>> On Tue, Jul 16, 2019 at 11:58 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>>> Yes, it is probably overkill to warn, given the examples you mention.
>>> Your argument for reusing the mature (and battle-tested) combination
>>> of Schema.Parser + String serialization also makes sense.
>>>
>>> Adding this to 1.9.1 will be an extra selling point for projects
>>> wanting to migrate to the latest version of Avro, so it sounds good
>>> to me, but you should add it to master and then we can cherry-pick it
>>> from there.
>>>
>>> On Tue, Jul 16, 2019 at 11:16 AM Ryan Skraba <r...@skraba.com> wrote:
>>> >
>>> > Hello! Thanks for the reference to AVRO-1852. It's exactly what I
>>> > was looking for.
>>> >
>>> > I agree that Java serialization shouldn't be used for anything
>>> > cross-platform, or (in my opinion) for any data persistence at all,
>>> > especially not for an Avro container file or for sending binary
>>> > data through a messaging system...
>>> >
>>> > But Java serialization is definitely useful, and used, for sending
>>> > instances of "distributed work" implemented in Java from node to
>>> > node in a cluster. I'm not too worried about existing connectors --
>>> > we can see that each framework has "solved" the problem one at a
>>> > time. In addition to Flink, there's
>>> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
>>> > and
>>> > https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
>>> >
>>> > Specifically, I see the advantage for user-defined distributed
>>> > functions that happen to carry along an Avro Schema -- and I can
>>> > personally say that I've encountered this a lot in our code!
>>> >
>>> > That being said, I think it's probably overkill to warn the user
>>> > about the perils of Java serialization (not being cross-language,
>>> > and requiring consistent JDKs and libraries across JVMs). If an
>>> > error occurs for one of those reasons, there's a larger problem for
>>> > the dev to address, and it's just as likely to occur for any Java
>>> > library in the job if the environment is bad. Relatedly, we've
>>> > encountered similar issues with logical types existing in Avro 1.8
>>> > in the driver but not in Avro 1.7 on the cluster... the solution is
>>> > "make sure you don't do that". (Looking at you, Guava and Jackson!)
>>> >
>>> > The patch in question delegates serialization to the string form of
>>> > the schema, so it's basically doing what all of the above Avro
>>> > "holders" are doing -- I wouldn't object to having a sample schema
>>> > available that fully exercises what a schema can hold, but I also
>>> > think that Schema.Parser (used underneath) is already well tested
>>> > and mature!
>>> >
>>> > Do you think this could be a candidate for 1.9.1 as a minor
>>> > improvement? I can't think of any reason it wouldn't be backwards
>>> > compatible.
>>> >
>>> > Ryan
>>> >
>>> > Side note: I wrote java.lang.Serializable earlier, which probably
>>> > didn't help my search for prior discussion... :/
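For reference, a minimal sketch of the string-delegating idea Ryan
describes above, shown on a stand-in wrapper class (the class names and
the writeReplace/readResolve proxy here are illustrative assumptions,
not the actual AVRO-1852 patch, which applies the idea to Schema
itself): only the schema's JSON text goes on the wire, and it is
re-parsed with Schema.Parser on the other side.

import java.io.ObjectStreamException;
import java.io.Serializable;

import org.apache.avro.Schema;

/**
 * Stand-in for the string-delegating technique: a serialization proxy
 * that writes only the schema's JSON text. Illustrative sketch only.
 */
public final class ParsedSchema implements Serializable {
  private static final long serialVersionUID = 1L;

  // Schema is not serializable directly (pre-patch), so keep it transient.
  private final transient Schema schema;

  public ParsedSchema(Schema schema) {
    this.schema = schema;
  }

  public Schema schema() {
    return schema;
  }

  // On serialization, replace this object with a String-only proxy.
  private Object writeReplace() throws ObjectStreamException {
    return new Proxy(schema.toString());
  }

  private static final class Proxy implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String json;

    Proxy(String json) {
      this.json = json;
    }

    // On deserialization, re-parse the JSON text back into a Schema.
    private Object readResolve() throws ObjectStreamException {
      return new ParsedSchema(new Schema.Parser().parse(json));
    }
  }
}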
>>> > On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>> >>
>>> >> This is a good idea, even if it has some issues that we should
>>> >> probably document and warn users about:
>>> >>
>>> >> 1. Java-based serialization is really practical for JVM-based
>>> >> systems, but we should probably add a warning or documentation,
>>> >> because Java serialization is not deterministic between JVMs, so
>>> >> this could be a source of issues. (Usually companies use the same
>>> >> version of the JVM, so this is less critical, but it can still
>>> >> happen, especially now with all the different versions of Java and
>>> >> OpenJDK-based flavors.)
>>> >>
>>> >> 2. This is not cross-language compatible; the String-based
>>> >> representation (or even an Avro-based representation of Schema) can
>>> >> be used in every language.
>>> >>
>>> >> Even with these, I think that just for ease of use it is worth
>>> >> making Schema Serializable. Is the plan to fully serialize it, or
>>> >> just to convert it to a String and serialize the String, as done in
>>> >> the issue Doug mentioned? If we take the first approach, we need to
>>> >> properly test, with a Schema that uses elements of the full
>>> >> specification, that (de)serialization works correctly. Does anyone
>>> >> know if we already have a test schema that covers the full 'schema'
>>> >> specification, so we could reuse it?
>>> >>
>>> >> On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko <fo...@driesprong.frl> wrote:
>>> >> >
>>> >> > Correct me if I'm wrong here, but as far as I understood, the way
>>> >> > of serializing the schema is using Avro, as it is part of the
>>> >> > file. To avoid confusion, there should be one way of serializing.
>>> >> >
>>> >> > However, I'm not sure this is worth the hassle of not simply
>>> >> > implementing Serializable. Flink also has a rather far-from-optimal
>>> >> > implementation:
>>> >> > https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
>>> >> > It converts the schema to JSON and back while distributing it to
>>> >> > the executors.
>>> >> >
>>> >> > Cheers, Fokko
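The workaround Fokko points at (and that the Beam and Spark links
earlier in the thread re-implement) has roughly this shape -- a sketch
with illustrative names, not code taken from any of those projects:
carry the JSON text across the wire and re-parse it lazily on the
executor.

import java.io.Serializable;

import org.apache.avro.Schema;

/**
 * Rough shape of the per-framework workaround: ship the schema as its
 * JSON text and re-parse it lazily after deserialization. Illustrative
 * only; each framework has its own variant.
 */
public final class SerializableSchemaSupplier implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String schemaJson;
  private transient volatile Schema cached; // null again after deserialization

  public SerializableSchemaSupplier(Schema schema) {
    this.schemaJson = schema.toString();
    this.cached = schema;
  }

  public Schema get() {
    Schema s = cached;
    if (s == null) {
      // Benign race: two threads parsing concurrently produce equal Schemas.
      s = new Schema.Parser().parse(schemaJson);
      cached = s;
    }
    return s;
  }
}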
>>> >> > On Mon, Jul 15, 2019 at 23:03, Doug Cutting <cutt...@gmail.com> wrote:
>>> >> >
>>> >> > > I can't think of a reason Schema should not implement
>>> >> > > Serializable.
>>> >> > >
>>> >> > > There's actually already an issue & patch for this:
>>> >> > >
>>> >> > > https://issues.apache.org/jira/browse/AVRO-1852
>>> >> > >
>>> >> > > Doug
>>> >> > >
>>> >> > > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>> >> > >
>>> >> > > > +d...@avro.apache.org
>>> >> > > >
>>> >> > > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <r...@skraba.com> wrote:
>>> >> > > > >
>>> >> > > > > Hello!
>>> >> > > > >
>>> >> > > > > I'm looking for any discussion of, or reference to, why the
>>> >> > > > > Schema object isn't serializable -- I'm pretty sure this
>>> >> > > > > must have been discussed already (but the keywords +avro
>>> >> > > > > +serializable +schema have MANY results in all the searches
>>> >> > > > > I did: JIRA, Stack Overflow, the mailing list, the web).
>>> >> > > > >
>>> >> > > > > In particular, I was at a demo today where we were asked
>>> >> > > > > why Schemas needed to be passed as strings to run in
>>> >> > > > > distributed tasks. I remember running into this problem
>>> >> > > > > years ago with MapReduce, and again in Spark, and again in
>>> >> > > > > Beam...
>>> >> > > > >
>>> >> > > > > Is there any downside to making Schema implement
>>> >> > > > > java.lang.Serializable? The only thing I can think of is
>>> >> > > > > that the schema _should not_ be serialized with the data,
>>> >> > > > > and making it non-serializable loosely enforces this (at
>>> >> > > > > the cost of continually writing different flavours of "Avro
>>> >> > > > > holders" for when you really do want to serialize it).
>>> >> > > > >
>>> >> > > > > Willing to create a JIRA and work on the implementation, of
>>> >> > > > > course!
>>> >> > > > >
>>> >> > > > > All my best, Ryan
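On Ismaël's testing question, a sketch of the round-trip test, using a
hand-written schema that touches most of the specification (the schema
and test names are made up for illustration; no existing full-coverage
test schema is assumed). With the merged patch, a Schema can be written
straight to an ObjectOutputStream:

import static org.junit.Assert.assertEquals;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

import org.apache.avro.Schema;
import org.junit.Test;

public class TestSchemaSerialization {

  // Record, enum, array, map, union, fixed, and a logical type:
  // broad but not exhaustive coverage of the schema specification.
  private static final String FULL_SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Everything\",\"namespace\":\"test\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"color\",\"type\":{\"type\":\"enum\",\"name\":\"Color\",\"symbols\":[\"RED\",\"GREEN\"]}},"
          + "{\"name\":\"tags\",\"type\":{\"type\":\"array\",\"items\":\"string\"}},"
          + "{\"name\":\"props\",\"type\":{\"type\":\"map\",\"values\":\"bytes\"}},"
          + "{\"name\":\"maybe\",\"type\":[\"null\",\"double\"],\"default\":null},"
          + "{\"name\":\"hash\",\"type\":{\"type\":\"fixed\",\"name\":\"MD5\",\"size\":16}},"
          + "{\"name\":\"ts\",\"type\":{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}}]}";

  @Test
  public void schemaSurvivesJavaSerialization() throws Exception {
    Schema before = new Schema.Parser().parse(FULL_SCHEMA_JSON);

    // Write the schema through plain Java serialization...
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(before);
    }

    // ...and read it back, then compare with the original.
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      assertEquals(before, (Schema) in.readObject());
    }
  }
}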