Thanks Fokko. Thanks to some very kind help from Ryan Skraba, I managed to
fix the issue. The problem was in the code: the writer needed to be created
with the same schema that the reader resolves to (the actual schemas that I
was using were fine). The resulting PR is here:
https://github.com/apache/avro/pull/785
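
For the record, the gist of the fix, as a sketch rather than the exact PR
diff: the datum that comes back from the datum reader has the shape of the
reader schema, so the JSON writer and encoder must be constructed with that
same schema, not with the schema found in the file header:

    GenericDatumReader<Object> reader = new GenericDatumReader<>();
    try (DataFileStream<Object> streamReader = new DataFileStream<>(inStream, reader)) {
      // readerSchema parsed from the new --reader-schema-file flag
      reader.setExpected(readerSchema);
      DatumWriter<Object> writer = new GenericDatumWriter<>(readerSchema);
      JsonEncoder encoder = EncoderFactory.get().jsonEncoder(readerSchema, out, pretty);
      for (Object datum : streamReader) {
        writer.write(datum, encoder);
      }
      encoder.flush();
    }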

  cheers,
    rog.


On Tue, 21 Jan 2020 at 07:52, Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Sorry for the late reply Rog, been kinda busy lately.
>
> Please look into the schema evolution of Avro. Confluent has an excellent
> article on this:
> https://docs.confluent.io/current/schema-registry/avro.html
>
> Could you try again with optional fields? e.g. "type": ["null", "array"].
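>
> Spelled out, an optional version of your A field would look something like
> this (note that the union needs the full array schema, not just the name
> "array", and a null default makes it properly optional):
>
>     {
>       "name": "A",
>       "type": ["null", {"type": "array", "items": "int"}],
>       "default": null
>     }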
>
> Since the names are different, I would expect the default value (or even
> an exception). If you do a cat on the Avro file, you can see that the
> original schema is in the header of the file. The B field is not in the
> record, so the reader field is not compatible and it won't work. I'll check
> if we can come up with a more meaningful exception.
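>
> By the way, a convenient way to see that header schema without a raw cat is
> the getschema command:
>
>     java -jar avro-tools-1.9.1.jar getschema x.out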
>
> Cheers, Fokko
>
>
>
> Op vr 17 jan. 2020 om 17:02 schreef roger peppe <rogpe...@gmail.com>:
>
>>
>>
>> On Fri, 17 Jan 2020 at 13:35, Ryan Skraba <r...@skraba.com> wrote:
>>
>>> Hello!  I just created a JIRA for this as an improvement :D
>>> https://issues.apache.org/jira/browse/AVRO-2689
>>>
>>> To check evolution, we'd probably want to specify the reader schema in
>>> the GenericDatumReader created here:
>>>
>>> https://github.com/apache/avro/blob/master/lang/java/tools/src/main/java/org/apache/avro/tool/DataFileReadTool.java#L75
>>>
>>> The writer schema is automatically set when the DataFileStream is
>>> created.  If we want to set a different reader schema (than the one
>>> found in the file), it should be set by calling
>>> reader.setExpected(readerSchema) just after the DataFileStream is
>>> created.
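>>>
>>> In code form, roughly (readerSchema standing in for a schema parsed from
>>> the proposed command-line flag):
>>>
>>>     GenericDatumReader<Object> reader = new GenericDatumReader<>();
>>>     DataFileStream<Object> streamReader = new DataFileStream<>(inStream, reader);
>>>     // the file header has now set the writer schema; override the reader side
>>>     reader.setExpected(readerSchema);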
>>>
>>
>> Ah, that's a good pointer, thanks! I was looking for an appropriate
>> constructor, but there didn't seem to be one.
>>
>>
>>>
>>> I think it's a pretty good idea -- it feels like we're seeing more
>>> questions about schema evolution these days, so that would be a neat
>>> way for a user to test (or to create reproducible scenarios for bug
>>> reports).  If you're interested, feel free to take the JIRA!  I'd be
>>> happy to help out.
>>>
>>
>> So, I've had a go at it... see
>> https://github.com/rogpeppe-contrib/avro/commit/1236e9d33207a11d557c1eb2a171972e085dfcf2
>>
>> I did the following to see if it was working ("avro" is my shell script
>> wrapper around the avro-tools jar):
>>
>> % cat schema.avsc
>> {
>>   "name": "R",
>>   "type": "record",
>>   "fields": [
>>     {
>>       "name": "A",
>>       "type": {
>>         "type": "array",
>>         "items": "int"
>>       }
>>     }
>>   ]
>> }
>> % cat schema1.avsc
>> {
>>   "name": "R",
>>   "type": "record",
>>   "fields": [
>>     {
>>       "name": "B",
>>       "type": "string",
>>       "default": "hello"
>>     }
>>   ]
>> }
>> % AVRO_TOOLS_JAR=/home/rog/other/avro/lang/java/tools/target/avro-tools-1.10.0-SNAPSHOT.jar
>> % avro random --count 1 --schema-file schema.avsc x.out
>> % avro tojson x.out
>> {"A":[-890831012,1123049230,302974832]}
>> % cp schema.avsc schema1.avsc
>> % avro tojson --reader-schema-file schema1.avsc x.out
>> Exception in thread "main" java.lang.ClassCastException: class
>> org.apache.avro.util.Utf8 cannot be cast to class java.util.Collection
>> (org.apache.avro.util.Utf8 is in unnamed module of loader 'app';
>> java.util.Collection is in module java.base of loader 'bootstrap')
>>   at org.apache.avro.generic.GenericDatumWriter.getArraySize(GenericDatumWriter.java:258)
>>   at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:228)
>>   at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:136)
>>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
>>   at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:206)
>>   at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:195)
>>   at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:130)
>>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82)
>>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
>>   at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:99)
>>   at org.apache.avro.tool.Main.run(Main.java:66)
>>   at org.apache.avro.tool.Main.main(Main.java:55)
>> %
>>
>> I am a bit clueless when it comes to interpreting that exception... sorry
>> for the ignorance - this is the first Java code I've ever written!
>> Any idea what's going on? This is maybe getting a bit too noisy for the
>> list - feel free to reply directly.
>>
>>   cheers,
>>     rog.
>>
>>
>>> Ryan
>>>
>>>
>>> On Fri, Jan 17, 2020 at 2:22 PM roger peppe <rogpe...@gmail.com> wrote:
>>> >
>>> > On Thu, 16 Jan 2020 at 17:21, Ryan Skraba <r...@skraba.com> wrote:
>>> >>
>>> >> I didn't find anything currently in the avro-tools that uses both
>>> >> reader and writer schemas while deserializing data...  It should be a
>>> >> pretty easy feature to add as an option to the DataFileReadTool
>>> >> (a.k.a. tojson)!
>>> >
>>> >
>>> > Thanks for that suggestion. I've been delving into that code a bit and
>>> trying to understand what's going on.
>>> >
>>> > At the heart of it is this code:
>>> >
>>> >     GenericDatumReader<Object> reader = new GenericDatumReader<>();
>>> >     try (DataFileStream<Object> streamReader = new DataFileStream<>(inStream, reader)) {
>>> >       Schema schema = streamReader.getSchema();
>>> >       DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
>>> >       JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out, pretty);
>>> >
>>> > I'm trying to work out where the best place to put the specific reader
>>> schema (taken from a command line flag) might be.
>>> >
>>> > Would it be best to do it when creating the DatumReader (it looks like
>>> there might be a way to create that with a generic writer schema and a
>>> specific reader schema, although I can't quite see how to do that atm), or
>>> when creating the DatumWriter?
>>> > Or perhaps there's a better way?
>>> >
>>> > Thanks for any guidance.
>>> >
>>> >    cheers,
>>> >     rog.
>>> >>
>>> >>
>>> >> You are correct about running ./build.sh dist in the java directory --
>>> >> it fails with JDK 11 (likely fixable:
>>> >> https://issues.apache.org/jira/browse/MJAVADOC-562).
>>> >>
>>> >> You should probably do a simple mvn clean install instead and find the
>>> >> jar in lang/java/tools/target/avro-tools-1.10.0-SNAPSHOT.jar.  That
>>> >> should work with JDK11 without any problem (well-tested in the build).
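>>> >>
>>> >> That is, something like this (from a checkout; -DskipTests is optional
>>> >> but speeds the build up considerably):
>>> >>
>>> >>     cd lang/java
>>> >>     mvn clean install -DskipTests
>>> >>     java -jar tools/target/avro-tools-1.10.0-SNAPSHOT.jar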
>>> >>
>>> >> Best regards, Ryan
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Jan 16, 2020 at 5:49 PM roger peppe <rogpe...@gmail.com>
>>> wrote:
>>> >> >
>>> >> > Update: I tried running `build.sh dist` in `lang/java` and it
>>> failed (at least, it looks like a failure message) after downloading a load
>>> of Maven deps with the following errors:
>>> https://gist.github.com/rogpeppe/df05d993254dc5082253a5ef5027e965
>>> >> >
>>> >> > Any hints on what I should do to build the avro-tools jar?
>>> >> >
>>> >> >   cheers,
>>> >> >     rog.
>>> >> >
>>> >> > On Thu, 16 Jan 2020 at 16:45, roger peppe <rogpe...@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >>
>>> >> >> On Thu, 16 Jan 2020 at 13:57, Ryan Skraba <r...@skraba.com> wrote:
>>> >> >>>
>>> >> >>> Hello!  Is it because you are using brew to install avro-tools?
>>> I'm
>>> >> >>> not entirely familiar with how it packages the command, but using
>>> a
>>> >> >>> direct bash-like solution instead might solve this problem of
>>> mixing
>>> >> >>> stdout and stderr.  This could be the simplest (and right)
>>> solution
>>> >> >>> for piping.
>>> >> >>
>>> >> >>
>>> >> >> No, I downloaded the jar and am directly running it with "java
>>> -jar ~/other/avro-tools-1.9.1.jar".
>>> >> >> I'm using Ubuntu Linux 18.04 FWIW - the binary comes from Debian
>>> package openjdk-11-jre-headless.
>>> >> >>
>>> >> >> I'm going to try compiling avro-tools myself to investigate but
>>> I'm a total Java ignoramus - wish me luck!
>>> >> >>
>>> >> >>>
>>> >> >>> alias avrotoolx='java -jar
>>> >> >>>
>>> ~/.m2/repository/org/apache/avro/avro-tools/1.9.1/avro-tools-1.9.1.jar'
>>> >> >>> avrotoolx tojson x.out 2> /dev/null
>>> >> >>>
>>> >> >>> (As Fokko mentioned, the 2> /dev/null isn't even necessary -- the
>>> >> >>> warnings and logs should not be piped along with the normal
>>> content.)
>>> >> >>>
>>> >> >>> Otherwise, IIRC, there is no way to disable the first illegal
>>> >> >>> reflective access warning when running in Java 9+, but you can
>>> "fix"
>>> >> >>> these module errors, and deactivate the NativeCodeLoader logs
>>> with an
>>> >> >>> explicit log4j.properties:
>>> >> >>>
>>> >> >>> java -Dlog4j.configuration=file:///tmp/log4j.properties
>>> --add-opens
>>> >> >>> java.security.jgss/sun.security.krb5=ALL-UNNAMED -jar
>>> >> >>>
>>> ~/.m2/repository/org/apache/avro/avro-tools/1.9.1/avro-tools-1.9.1.jar
>>> >> >>> tojson x.out
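>>> >> >>>
>>> >> >>> where /tmp/log4j.properties raises the log threshold - something
>>> >> >>> like this should do (untested, assuming the log4j 1.x that the
>>> >> >>> Hadoop libs use):
>>> >> >>>
>>> >> >>>     log4j.rootLogger=ERROR, stderr
>>> >> >>>     log4j.appender.stderr=org.apache.log4j.ConsoleAppender
>>> >> >>>     log4j.appender.stderr.Target=System.err
>>> >> >>>     log4j.appender.stderr.layout=org.apache.log4j.SimpleLayout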
>>> >> >>
>>> >> >>
>>> >> >> Thanks for that suggestion! I'm afraid I'm not familiar with log4j
>>> properties files, so that example is a useful starting point - I'll give
>>> it a try.
>>> >> >>
>>> >> >>> None of that is particularly satisfactory, but it could be a
>>> >> >>> workaround for your immediate use.
>>> >> >>
>>> >> >>
>>> >> >> Yeah, not ideal, because if something goes wrong, stdout will be
>>> corrupted, but at least some noise should go away :)
>>> >> >>
>>> >> >>> I'd also like to see a more unified experience with the CLI tool
>>> for
>>> >> >>> documentation and usage.  The current state requires a bit of Avro
>>> >> >>> expertise to use, but it has some functions that would be pretty
>>> >> >>> useful for a user working with Avro data.  I raised
>>> >> >>> https://issues.apache.org/jira/browse/AVRO-2688 as an
>>> improvement.
>>> >> >>>
>>> >> >>> In my opinion, a schema compatibility tool would be a useful and
>>> >> >>> welcome feature!
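>>> >> >>>
>>> >> >>> Under the hood it could probably just wrap the existing
>>> >> >>> SchemaCompatibility class - roughly (an untested sketch):
>>> >> >>>
>>> >> >>>     Schema reader = new Schema.Parser().parse(new File("reader.avsc"));
>>> >> >>>     Schema writer = new Schema.Parser().parse(new File("writer.avsc"));
>>> >> >>>     SchemaCompatibility.SchemaPairCompatibility result =
>>> >> >>>         SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
>>> >> >>>     System.out.println(result.getType()); // COMPATIBLE or INCOMPATIBLE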
>>> >> >>
>>> >> >>
>>> >> >> That would indeed be nice, but in the meantime, is there really
>>> nothing in the avro-tools commands that uses a chosen schema to read a data
>>> file written with some other schema? That would give me what I'm after
>>> currently.
>>> >> >>
>>> >> >> Thanks again for the helpful response.
>>> >> >>
>>> >> >>    cheers,
>>> >> >>      rog.
>>> >> >>
>>> >> >>>
>>> >> >>> Best regards, Ryan
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> On Thu, Jan 16, 2020 at 12:25 PM roger peppe <rogpe...@gmail.com>
>>> wrote:
>>> >> >>> >
>>> >> >>> > Hi Fokko,
>>> >> >>> >
>>> >> >>> > Thanks for your swift response!
>>> >> >>> >
>>> >> >>> > Stdout and stderr definitely seem to be merged on this platform
>>> at least. Here's a sample:
>>> >> >>> >
>>> >> >>> > % avrotool random --count 1 --schema '"int"'  x.out
>>> >> >>> > % avrotool tojson x.out > x.json
>>> >> >>> > % cat x.json
>>> >> >>> > 125140891
>>> >> >>> > WARNING: An illegal reflective access operation has occurred
>>> >> >>> > WARNING: Illegal reflective access by
>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>> (file:/home/rog/other/avro-tools-1.9.1.jar) to method
>>> sun.security.krb5.Config.getInstance()
>>> >> >>> > WARNING: Please consider reporting this to the maintainers of
>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>> >> >>> > WARNING: Use --illegal-access=warn to enable warnings of
>>> further illegal reflective access operations
>>> >> >>> > WARNING: All illegal access operations will be denied in a
>>> future release
>>> >> >>> > 20/01/16 11:00:37 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes where
>>> applicable
>>> >> >>> > %
>>> >> >>> >
>>> >> >>> > I've just verified that it's not a problem with the java
>>> executable itself (I ran a program that printed to System.err and the text
>>> correctly goes to the standard error).
>>> >> >>> >
>>> >> >>> > > Regarding the documentation, the CLI itself contains info on
>>> all the available commands. Also, there are excellent online resources:
>>> https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/
>>> Is there anything specific that you're missing?
>>> >> >>> >
>>> >> >>> > There's the single-line summary produced for each command by
>>> running "avro-tools" with no arguments, but that's not as much info as I'd
>>> ideally like. For example, it often doesn't say what file format is being
>>> written or read. For some commands, the purpose is not very clear.
>>> >> >>> >
>>> >> >>> > For example, the description of the recodec command is "Alters
>>> the codec of a data file". It doesn't describe how it alters it or how one
>>> might configure the alteration parameters. I managed to get some usage help
>>> by passing it more than two parameters (specifying "--help" gives an
>>> exception), but that doesn't provide much more info:
>>> >> >>> >
>>> >> >>> > % avro-tools recodec a b c
>>> >> >>> > Expected at most an input file and output file.
>>> >> >>> > Option             Description
>>> >> >>> > ------             -----------
>>> >> >>> > --codec <String>   Compression codec (default: null)
>>> >> >>> > --level <Integer>  Compression level (only applies to deflate
>>> and xz) (default:
>>> >> >>> >                      -1)
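>>> >> >>> >
>>> >> >>> > Guessing from that help output, the intended usage is presumably
>>> >> >>> > something like "avro-tools recodec --codec deflate in.avro out.avro",
>>> >> >>> > but the description doesn't actually say.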
>>> >> >>> >
>>> >> >>> > For the record, I'm wondering if it might be possible to get
>>> avrotool to tell me if one schema is compatible with another, so that I can
>>> check hypotheses about schema-checking in practice without having to write
>>> Java code.
>>> >> >>> >
>>> >> >>> >   cheers,
>>> >> >>> >     rog.
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > On Thu, 16 Jan 2020 at 10:30, Driesprong, Fokko
>>> <fo...@driesprong.frl> wrote:
>>> >> >>> >>
>>> >> >>> >> Hi Rog,
>>> >> >>> >>
>>> >> >>> >> This is actually a warning produced by the Hadoop library
>>> that we're using. Please note that this isn't part of the stdout:
>>> >> >>> >>
>>> >> >>> >> $ find /tmp/tmp
>>> >> >>> >> /tmp/tmp
>>> >> >>> >> /tmp/tmp/._SUCCESS.crc
>>> >> >>> >>
>>> /tmp/tmp/part-00000-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro
>>> >> >>> >>
>>> /tmp/tmp/.part-00000-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro.crc
>>> >> >>> >> /tmp/tmp/_SUCCESS
>>> >> >>> >>
>>> >> >>> >> $ avro-tools tojson
>>> /tmp/tmp/part-00000-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro
>>> >> >>> >> 20/01/16 11:26:10 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes where
>>> applicable
>>> >> >>> >> {"line_of_text":{"string":"Hello"}}
>>> >> >>> >> {"line_of_text":{"string":"World"}}
>>> >> >>> >>
>>> >> >>> >> $ avro-tools tojson
>>> /tmp/tmp/part-00000-9300fba6-ccdd-4ecc-97cb-0c3ae3631be5-c000.avro >
>>> /tmp/tmp/data.json
>>> >> >>> >> 20/01/16 11:26:20 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes where
>>> applicable
>>> >> >>> >>
>>> >> >>> >> $ cat /tmp/tmp/data.json
>>> >> >>> >> {"line_of_text":{"string":"Hello"}}
>>> >> >>> >> {"line_of_text":{"string":"World"}}
>>> >> >>> >>
>>> >> >>> >> So when you pipe the data, it doesn't include the warnings.
>>> >> >>> >>
>>> >> >>> >> Regarding the documentation, the CLI itself contains info on
>>> all the available commands. Also, there are excellent online resources:
>>> https://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/
>>> Is there anything specific that you're missing?
>>> >> >>> >>
>>> >> >>> >> Hope this helps.
>>> >> >>> >>
>>> >> >>> >> Cheers, Fokko
>>> >> >>> >>
>>> >> >>> >> Op do 16 jan. 2020 om 09:30 schreef roger peppe <
>>> rogpe...@gmail.com>:
>>> >> >>> >>>
>>> >> >>> >>> Hi,
>>> >> >>> >>>
>>> >> >>> >>> I've been trying to use avro-tools to verify Avro
>>> implementations, and I've come across an issue. Perhaps someone here might
>>> be able to help?
>>> >> >>> >>>
>>> >> >>> >>> When I run avro-tools with some subcommands, it prints a
>>> bunch of warnings (see below) to the standard output. Does anyone know a
>>> way to disable this? I'm using openjdk 11.0.5 under Ubuntu 18.04 and
>>> avro-tools 1.9.1.
>>> >> >>> >>>
>>> >> >>> >>> The warnings are somewhat annoying because they can corrupt
>>> the output of tools that print to the standard output, such as recodec.
>>> >> >>> >>>
>>> >> >>> >>> Aside: is there any documentation for the commands in
>>> avro-tools? Some seem to have some command-line help (though unfortunately
>>> there doesn't seem to be a standard way of showing it), but that help
>>> often doesn't describe what the command actually does.
>>> >> >>> >>>
>>> >> >>> >>> Here's the output that I see:
>>> >> >>> >>>
>>> >> >>> >>> WARNING: An illegal reflective access operation has occurred
>>> >> >>> >>> WARNING: Illegal reflective access by
>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>> (file:/home/rog/other/avro-tools-1.9.1.jar) to method
>>> sun.security.krb5.Config.getInstance()
>>> >> >>> >>> WARNING: Please consider reporting this to the maintainers of
>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>> >> >>> >>> WARNING: Use --illegal-access=warn to enable warnings of
>>> further illegal reflective access operations
>>> >> >>> >>> WARNING: All illegal access operations will be denied in a
>>> future release
>>> >> >>> >>> 20/01/16 08:12:39 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes where
>>> applicable
>>> >> >>> >>>
>>> >> >>> >>>   cheers,
>>> >> >>> >>>     rog.
>>> >> >>> >>>
>>>
>>
