Thanks, Fokko. With some very kind help from Ryan Skraba, I managed to fix the issue: the writer in the code needed to be created with the same schema (the actual schemas I was using were fine). The resulting PR is here: https://github.com/apache/avro/pull/785
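In case it's useful to anyone else reading the archive, the fix boils down to threading the reader schema through all three stages of the tojson pipeline. A minimal sketch of the idea (the class and method names here are illustrative, not the exact PR code):

    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.io.JsonEncoder;

    /** Sketch: dump an Avro data file as JSON via a caller-supplied reader schema. */
    class ToJsonSketch {
      static void dump(InputStream in, File readerSchemaFile, OutputStream out) throws IOException {
        Schema readerSchema = new Schema.Parser().parse(readerSchemaFile);
        GenericDatumReader<Object> datumReader = new GenericDatumReader<>();
        try (DataFileStream<Object> stream = new DataFileStream<>(in, datumReader)) {
          // The DataFileStream takes the writer schema from the file header;
          // tell the datum reader to resolve records into the reader schema.
          datumReader.setExpected(readerSchema);
          // The crucial part: the JSON writer and encoder must also be built
          // from the *reader* schema, so they match the shape of the datums
          // the reader actually produces.
          DatumWriter<Object> datumWriter = new GenericDatumWriter<>(readerSchema);
          JsonEncoder encoder = EncoderFactory.get().jsonEncoder(readerSchema, out, false);
          for (Object datum : stream) {
            datumWriter.write(datum, encoder);
          }
          encoder.flush();
        }
      }
    }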
cheers,
rog.

On Tue, 21 Jan 2020 at 07:52, Driesprong, Fokko <fo...@driesprong.frl> wrote:

> Sorry for the late reply Rog, been kinda busy lately.
>
> Please look into the schema evolution of Avro. Confluent has an excellent article on this: https://docs.confluent.io/current/schema-registry/avro.html
>
> Could you try again with optional fields? e.g. "type": ["null", "array"].
>
> Since the names are different, I would expect the default value (or even an exception). If you do a cat on the Avro file, you can see that the original schema is in the header of the file. The B field is not there in the record, so the reader field is not compatible, so it won't work. I'll check if we can come up with a more meaningful exception.
>
> Cheers, Fokko
>
> Op vr 17 jan. 2020 om 17:02 schreef roger peppe <rogpe...@gmail.com>:
>
>> On Fri, 17 Jan 2020 at 13:35, Ryan Skraba <r...@skraba.com> wrote:
>>
>>> Hello! I just created a JIRA for this as an improvement :D
>>> https://issues.apache.org/jira/browse/AVRO-2689
>>>
>>> To check evolution, we'd probably want to specify the reader schema in the GenericDatumReader created here:
>>> https://github.com/apache/avro/blob/master/lang/java/tools/src/main/java/org/apache/avro/tool/DataFileReadTool.java#L75
>>>
>>> The writer schema is automatically set when the DataFileStream is created. If we want to set a different reader schema (than the one found in the file), it should be set by calling reader.setExpected(readerSchema) just after the DataFileStream is created.
>>
>> Ah, that's a good pointer, thanks! I was looking for an appropriate constructor, but there didn't seem to be one.
>>
>>> I think it's a pretty good idea -- it feels like we're seeing more questions about schema evolution these days, so that would be a neat way for a user to test (or to create reproducible scenarios for bug reports). If you're interested, feel free to take the JIRA! I'd be happy to help out.
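A quick note on the `"type": ["null", "array"]` shorthand in Fokko's reply above: in Avro schema JSON, a union member that is a complex type has to be spelled out in full. An optional version of the A field from the schemas later in this thread would look something like this (a sketch; the null default lets records without the field still resolve):

    {
      "name": "A",
      "type": ["null", {"type": "array", "items": "int"}],
      "default": null
    }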
>>
>> So, I've had a go at it... see
>> https://github.com/rogpeppe-contrib/avro/commit/1236e9d33207a11d557c1eb2a171972e085dfcf2
>>
>> I did the following to see if it was working ("avro" is my shell script wrapper around the avro-tools jar):
>>
>> % cat schema.avsc
>> {
>>     "name": "R",
>>     "type": "record",
>>     "fields": [
>>         {
>>             "name": "A",
>>             "type": {
>>                 "type": "array",
>>                 "items": "int"
>>             }
>>         }
>>     ]
>> }
>> % cat schema1.avsc
>> {
>>     "name": "R",
>>     "type": "record",
>>     "fields": [
>>         {
>>             "name": "B",
>>             "type": "string",
>>             "default": "hello"
>>         }
>>     ]
>> }
>> % AVRO_TOOLS_JAR=/home/rog/other/avro/lang/java/tools/target/avro-tools-1.10.0-SNAPSHOT.ja
>> % avro random --count 1 --schema-file schema.avsc x.out
>> % avro tojson x.out
>> {"A":[-890831012,1123049230,302974832]}
>> % cp schema.avsc schema1.avsc
>> % avro tojson --reader-schema-file schema1.avsc x.out
>> Exception in thread "main" java.lang.ClassCastException: class org.apache.avro.util.Utf8 cannot be cast to class java.util.Collection (org.apache.avro.util.Utf8 is in unnamed module of loader 'app'; java.util.Collection is in module java.base of loader 'bootstrap')
>>     at org.apache.avro.generic.GenericDatumWriter.getArraySize(GenericDatumWriter.java:258)
>>     at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:228)
>>     at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:136)
>>     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
>>     at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:206)
>>     at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:195)
>>     at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
>>     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
>>     at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
>>     at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:99)
>>     at org.apache.avro.tool.Main.run(Main.java:66)
>>     at org.apache.avro.tool.Main.main(Main.java:55)
>> %
>>
>> I am a bit clueless when it comes to interpreting that exception... sorry for the ignorance - this is the first Java code I've ever written! Any idea what's going on? This is maybe getting a bit too noisy for the list - feel free to reply directly.
>>
>> cheers,
>> rog.
>>
>>> Ryan
>>>
>>> On Fri, Jan 17, 2020 at 2:22 PM roger peppe <rogpe...@gmail.com> wrote:
>>> >
>>> > On Thu, 16 Jan 2020 at 17:21, Ryan Skraba <r...@skraba.com> wrote:
>>> >>
>>> >> didn't find anything currently in the avro-tools that uses both reader and writer schemas while deserializing data... It should be a pretty easy feature to add as an option to the DataFileReadTool (a.k.a. tojson)!
>>> >
>>> > Thanks for that suggestion. I've been delving into that code a bit and trying to understand what's going on.
>>> >
>>> > At the heart of it is this code:
>>> >
>>> >     GenericDatumReader<Object> reader = new GenericDatumReader<>();
>>> >     try (DataFileStream<Object> streamReader = new DataFileStream<>(inStream, reader)) {
>>> >       Schema schema = streamReader.getSchema();
>>> >       DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
>>> >       JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out, pretty);
>>> >
>>> > I'm trying to work out where the best place to put the specific reader schema (taken from a command line flag) might be.
>>> >
>>> > Would it be best to do it when creating the DatumReader (it looks like there might be a way to create that with a generic writer schema and a specific reader schema, although I can't quite see how to do that atm), or when creating the DatumWriter? Or perhaps there's a better way?
>>> >
>>> > Thanks for any guidance.
>>> >
>>> > cheers,
>>> > rog.
>>> >
>>> >> You are correct about running ./build.sh dist in the java directory -- it fails with JDK 11 (likely fixable: https://issues.apache.org/jira/browse/MJAVADOC-562).
>>> >>
>>> >> You should probably do a simple mvn clean install instead and find the jar in lang/java/tools/target/avro-tools-1.10.0-SNAPSHOT.jar. That should work with JDK11 without any problem (well-tested in the build).
>>> >>
>>> >> Best regards, Ryan
>>> >>
>>> >> On Thu, Jan 16, 2020 at 5:49 PM roger peppe <rogpe...@gmail.com> wrote:
>>> >> >
>>> >> > Update: I tried running `build.sh dist` in `lang/java` and it failed (at least, it looks like a failure message) after downloading a load of Maven deps with the following errors: https://gist.github.com/rogpeppe/df05d993254dc5082253a5ef5027e965
>>> >> >
>>> >> > Any hints on what I should do to build the avro-tools jar?
>>> >> >
>>> >> > cheers,
>>> >> > rog.
>>> >> >
>>> >> > On Thu, 16 Jan 2020 at 16:45, roger peppe <rogpe...@gmail.com> wrote:
>>> >> >>
>>> >> >> On Thu, 16 Jan 2020 at 13:57, Ryan Skraba <r...@skraba.com> wrote:
>>> >> >>>
>>> >> >>> Hello! Is it because you are using brew to install avro-tools? I'm not entirely familiar with how it packages the command, but using a direct bash-like solution instead might solve this problem of mixing stdout and stderr. This could be the simplest (and right) solution for piping.
>>> >> >>
>>> >> >> No, I downloaded the jar and am directly running it with "java -jar ~/other/avro-tools-1.9.1.jar". I'm using Ubuntu Linux 18.04 FWIW - the binary comes from Debian package openjdk-11-jre-headless.
>>> >> >>
>>> >> >> I'm going to try compiling avro-tools myself to investigate but I'm a total Java ignoramus - wish me luck!
>>> >> >>
>>> >> >>> alias avrotoolx='java -jar ~/.m2/repository/org/apache/avro/avro-tools/1.9.1/avro-tools-1.9.1.jar'
>>> >> >>> avrotoolx tojson x.out 2> /dev/null
>>> >> >>>
>>> >> >>> (As Fokko mentioned, the 2> /dev/null isn't even necessary -- the warnings and logs should not be piped along with the normal content.)
>>> >> >>>
>>> >> >>> Otherwise, IIRC, there is no way to disable the first illegal reflective access warning when running in Java 9+, but you can "fix" these module errors, and deactivate the NativeCodeLoader logs with an explicit log4j.properties:
>>> >> >>>
>>> >> >>> java -Dlog4j.configuration=file:///tmp/log4j.properties --add-opens java.security.jgss/sun.security.krb5=ALL-UNNAMED -jar ~/.m2/repository/org/apache/avro/avro-tools/1.9.1/avro-tools-1.9.1.jar tojson x.out
>>> >> >>
>>> >> >> Thanks for that suggestion! I'm afraid I'm not familiar with log4j properties files though. What do I need to put in /tmp/log4j.properties to make this work?
>>> >> >>
>>> >> >>> None of that is particularly satisfactory, but it could be a workaround for your immediate use.
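To answer the /tmp/log4j.properties question above: avro-tools 1.9.1 logs through log4j 1.x, so a configuration along the following lines should keep all logging on stderr and raise the threshold above WARN, silencing the NativeCodeLoader message. A sketch, untested against this exact jar (the appender name "stderr" is arbitrary):

    # /tmp/log4j.properties: log to stderr only, and only at ERROR and above.
    log4j.rootLogger=ERROR, stderr
    log4j.appender.stderr=org.apache.log4j.ConsoleAppender
    log4j.appender.stderr.Target=System.err
    log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
    log4j.appender.stderr.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n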
>>> >> >>
>>> >> >> Yeah, not ideal, because if something goes wrong, stdout will be corrupted, but at least some noise should go away :)
>>> >> >>
>>> >> >>> I'd also like to see a more unified experience with the CLI tool for documentation and usage. The current state requires a bit of Avro expertise to use, but it has some functions that would be pretty useful for a user working with Avro data. I raised https://issues.apache.org/jira/browse/AVRO-2688 as an improvement.
>>> >> >>>
>>> >> >>> In my opinion, a schema compatibility tool would be a useful and welcome feature!
>>> >> >>
>>> >> >> That would indeed be nice, but in the meantime, is there really nothing in the avro-tools commands that uses a chosen schema to read a data file written with some other schema? That would give me what I'm after currently.
>>> >> >>
>>> >> >> Thanks again for the helpful response.
>>> >> >>
>>> >> >> cheers,
>>> >> >> rog.
>>> >> >>
>>> >> >>> Best regards, Ryan
>>> >> >>>
>>> >> >>> On Thu, Jan 16, 2020 at 12:25 PM roger peppe <rogpe...@gmail.com> wrote:
>>> >> >>> >
>>> >> >>> > Hi Fokko,
>>> >> >>> >
>>> >> >>> > Thanks for your swift response!
>>> >> >>> >
>>> >> >>> > Stdout and stderr definitely seem to be merged on this platform at least. Here's a sample:
>>> >> >>> >
>>> >> >>> > % avrotool random --count 1 --schema '"int"' x.out
>>> >> >>> > % avrotool tojson x.out > x.json
>>> >> >>> > % cat x.json
>>> >> >>> > 125140891
>>> >> >>> > WARNING: An illegal reflective access operation has occurred
>>> >> >>> > WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/rog/other/avro-tools-1.9.1.jar) to method sun.security.krb5.Config.getInstance()
>>> >> >>> > WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
>>> >> >>> > WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
>>> >> >>> > WARNING: All illegal access operations will be denied in a future release
>>> >> >>> > 20/01/16 11:00:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> >> >>> > %
>>> >> >>> >
>>> >> >>> > I've just verified that it's not a problem with the java executable itself (I ran a program that printed to System.err and the text correctly goes to the standard error).
>>> >> >>> >
>>> >> >>> > > Regarding the documentation, the CLI itself contains info on all the available commands. Also, there are excellent online resources: https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/ Is there anything specific that you're missing?
>>> >> >>> >
>>> >> >>> > There's the single-line summary produced for each command by running "avro-tools" with no arguments, but that's not as much info as I'd ideally like. For example, it often doesn't say what file format is being written or read. For some commands, the purpose is not very clear.
>>> >> >>> >
>>> >> >>> > For example, the description of the recodec command is "Alters the codec of a data file". It doesn't describe how it alters it or how one might configure the alteration parameters.
>>> >> >>> > I managed to get some usage help by passing it more than two parameters (specifying "--help" gives an exception), but that doesn't provide much more info:
>>> >> >>> >
>>> >> >>> > % avro-tools recodec a b c
>>> >> >>> > Expected at most an input file and output file.
>>> >> >>> > Option             Description
>>> >> >>> > ------             -----------
>>> >> >>> > --codec <String>   Compression codec (default: null)
>>> >> >>> > --level <Integer>  Compression level (only applies to deflate and xz) (default: -1)
>>> >> >>> >
>>> >> >>> > For the record, I'm wondering if it might be possible to get avrotool to tell me if one schema is compatible with another, so that I can check hypotheses about schema-checking in practice without having to write Java code.
>>> >> >>> >
>>> >> >>> > cheers,
>>> >> >>> > rog.
>>> >> >>> >
>>> >> >>> > On Thu, 16 Jan 2020 at 10:30, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>>> >> >>> >>
>>> >> >>> >> Hi Rog,
>>> >> >>> >>
>>> >> >>> >> This is actually a warning produced by the Hadoop library that we're using. Please note that this isn't part of the stdout:
>>> >> >>> >>
>>> >> >>> >> $ find /tmp/tmp
>>> >> >>> >> /tmp/tmp
>>> >> >>> >> /tmp/tmp/._SUCCESS.crc
>>> >> >>> >> /tmp/tmp/part-00000-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro
>>> >> >>> >> /tmp/tmp/.part-00000-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro.crc
>>> >> >>> >> /tmp/tmp/_SUCCESS
>>> >> >>> >>
>>> >> >>> >> $ avro-tools tojson /tmp/tmp/part-00000-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro
>>> >> >>> >> 20/01/16 11:26:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> >> >>> >> {"line_of_text":{"string":"Hello"}}
>>> >> >>> >> {"line_of_text":{"string":"World"}}
>>> >> >>> >>
>>> >> >>> >> $ avro-tools tojson /tmp/tmp/part-00000-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro > /tmp/tmp/data.json
>>> >> >>> >> 20/01/16 11:26:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> >> >>> >>
>>> >> >>> >> $ cat /tmp/tmp/data.json
>>> >> >>> >> {"line_of_text":{"string":"Hello"}}
>>> >> >>> >> {"line_of_text":{"string":"World"}}
>>> >> >>> >>
>>> >> >>> >> So when you pipe the data, it doesn't include the warnings.
>>> >> >>> >>
>>> >> >>> >> Regarding the documentation, the CLI itself contains info on all the available commands. Also, there are excellent online resources: https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/ Is there anything specific that you're missing?
>>> >> >>> >>
>>> >> >>> >> Hope this helps.
>>> >> >>> >>
>>> >> >>> >> Cheers, Fokko
>>> >> >>> >>
>>> >> >>> >> Op do 16 jan. 2020 om 09:30 schreef roger peppe <rogpe...@gmail.com>:
>>> >> >>> >>>
>>> >> >>> >>> Hi,
>>> >> >>> >>>
>>> >> >>> >>> I've been trying to use avro-tools to verify Avro implementations, and I've come across an issue. Perhaps someone here might be able to help?
>>> >> >>> >>>
>>> >> >>> >>> When I run avro-tools with some subcommands, it prints a bunch of warnings (see below) to the standard output. Does anyone know a way to disable this? I'm using openjdk 11.0.5 under Ubuntu 18.04 and avro-tools 1.9.1.
>>> >> >>> >>>
>>> >> >>> >>> The warnings are somewhat annoying because they can corrupt the output of tools that print to the standard output, such as recodec.
>>> >> >>> >>>
>>> >> >>> >>> Aside: is there any documentation for the commands in avro-tools? Some seem to have some command-line help (though unfortunately there doesn't seem to be a standard way of showing it), but often that help doesn't describe what the command actually does.
>>> >> >>> >>>
>>> >> >>> >>> Here's the output that I see:
>>> >> >>> >>>
>>> >> >>> >>> WARNING: An illegal reflective access operation has occurred
>>> >> >>> >>> WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/rog/other/avro-tools-1.9.1.jar) to method sun.security.krb5.Config.getInstance()
>>> >> >>> >>> WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
>>> >> >>> >>> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
>>> >> >>> >>> WARNING: All illegal access operations will be denied in a future release
>>> >> >>> >>> 20/01/16 08:12:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> >> >>> >>>
>>> >> >>> >>> cheers,
>>> >> >>> >>> rog.
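On the schema-compatibility wish further up the thread: until there's a CLI command for it, the check is already exposed in the Java API. A minimal sketch (the CompatCheck wrapper and its argument handling are ours; SchemaCompatibility is the real class in Avro 1.9):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaCompatibility;

    /** Sketch: report whether a reader schema can read data written with a writer schema. */
    class CompatCheck {
      public static void main(String[] args) throws Exception {
        Schema reader = new Schema.Parser().parse(new File(args[0]));
        Schema writer = new Schema.Parser().parse(new File(args[1]));
        // COMPATIBLE means data written with the writer schema can be read
        // using the reader schema under Avro's schema-resolution rules.
        System.out.println(
            SchemaCompatibility.checkReaderWriterCompatibility(reader, writer).getType());
      }
    }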