[
https://issues.apache.org/jira/browse/AVRO-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710008#comment-14710008
]
Janosch Woschitz commented on AVRO-1720:
----------------------------------------
This exception is thrown within
[DataFileStream:nextBlock|https://github.com/apache/avro/blob/eb31746cd5efd5e2c9c57780a651afaccd5cfe06/lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java#L243].
It is the actual termination condition for the while loop.
The loop could equivalently be written like this:
{noformat}
try {
  while (true) {
    streamReader.nextBlock();
    count += streamReader.getBlockCount();
  }
} catch (NoSuchElementException e) {
  // no op: thrown when the end of the stream is reached
}
{noformat}
The exception is always thrown when the file stream reaches the end of the file.
The DataFileStream class also contains a method "hasNextBlock" which does not
throw an exception (it returns a boolean instead), but unfortunately this method
is only exposed at package level (in org.apache.avro.file).
I did not want to change the visibility of DataFileStream methods for this tool,
so I used this workaround.
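For illustration, the same exception-as-termination pattern can be shown with a plain java.util.Iterator. This is a standalone sketch of the control flow that DataFileStream forces on callers outside its package, not the tool's actual code; the class and method names here are made up for the example:

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ExhaustByException {
    // Counts the elements of an iterator by calling next() until it throws
    // NoSuchElementException, mirroring the loop shown above: the exception,
    // not a boolean check, terminates the loop.
    static long count(Iterator<?> it) {
        long count = 0;
        try {
            while (true) {
                it.next();
                count++;
            }
        } catch (NoSuchElementException e) {
            // no op: the exception is the loop's termination condition
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("a", "b", "c").iterator()));
        System.out.println(count(Collections.emptyIterator()));
    }
}
```

With a package-visible hasNextBlock() this would collapse to an ordinary `while (hasNextBlock())` loop, which is exactly the cleaner form that the visibility restriction rules out.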
> Add an avro-tool to count records in an avro file
> -------------------------------------------------
>
> Key: AVRO-1720
> URL: https://issues.apache.org/jira/browse/AVRO-1720
> Project: Avro
> Issue Type: New Feature
> Components: java
> Reporter: Janosch Woschitz
> Priority: Minor
> Attachments: AVRO-1720.patch
>
>
> If you're dealing with larger Avro files (>100MB) it would be nice to have a
> way to quickly count the number of records contained within that file.
> With the current state of avro-tools the only way to achieve this (to my
> current knowledge) is to dump the data to JSON and count the number of
> records. For larger files this might take a while due to the serialization
> overhead and since every record needs to be looked at.
> I added a new tool which is optimized for counting records; it does not
> serialize the records and reads only the block count of each block.
> {panel:title=Naive benchmark}
> {noformat}
> # the input file had a size of ~300MB
> $ du -sh sample.avro
> 323M sample.avro
> # using the new count tool
> $ time java -jar avro-tools.jar count sample.avro
> 331439
> real 0m4.670s
> user 0m6.167s
> sys 0m0.513s
> # the current way of counting records
> $ time java -jar avro-tools.jar tojson sample.avro | wc
> 331439 54904484 1838231743
> real 0m52.760s
> user 1m42.317s
> sys 0m3.209s
> # the overhead of wc is rather minor
> $ time java -jar avro-tools.jar tojson sample.avro > /dev/null
> real 0m47.834s
> user 0m53.317s
> sys 0m1.194s
> {noformat}
> {panel}
> This tool uses the HDFS API to handle files from any supported filesystem. I
> added the unit tests to the existing TestDataFileTools since it provides
> convenient utility functions which I could reuse for my test scenarios.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)