[ https://issues.apache.org/jira/browse/AVRO-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710990#comment-14710990 ]

Niels Basjes commented on AVRO-1720:
------------------------------------

I had a quick look and the 'nextBlock' starts with 
{code}public ByteBuffer nextBlock() throws IOException {
    if (!hasNext())
      throw new NoSuchElementException();
{code}

So effectively your code already goes through {{hasNext}}, which is public, and you 
could rewrite this as a simple {{while (hasNext())}} loop.
However, I also see that {{hasNext}} calls {{block.decompressUsing(codec);}}, 
which (judging from the rest of the surrounding code) looks like a needless 
operation when all you want is the number of records.
At this point (without having tried it), making {{hasNextBlock()}} public seems 
logical to me (since {{nextBlock()}} is already public) and might yield some 
additional performance (no more decompressing) when only counting the records.
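For reference, a minimal sketch of the counting loop being discussed, using only the public {{DataFileStream}} API ({{hasNext}}, {{getBlockCount}}, {{nextBlock}}). Note that {{getBlockCount()}} is an existing "expert" accessor for the record count of the current block; whether this path avoids decompression entirely depends on the {{hasNext}} implementation discussed above, so treat this as an illustration rather than the patch itself:
{code}
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class CountRecords {
  public static void main(String[] args) throws IOException {
    long count = 0;
    try (DataFileStream<GenericRecord> stream = new DataFileStream<>(
        new BufferedInputStream(new FileInputStream(args[0])),
        new GenericDatumReader<>())) {
      while (stream.hasNext()) {
        // getBlockCount() reports the number of records in the current
        // block without deserializing the individual records.
        count += stream.getBlockCount();
        // Skip past the block's payload to reach the next block header.
        stream.nextBlock();
      }
    }
    System.out.println(count);
  }
}
{code}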


> Add an avro-tool to count records in an avro file
> -------------------------------------------------
>
>                 Key: AVRO-1720
>                 URL: https://issues.apache.org/jira/browse/AVRO-1720
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Janosch Woschitz
>            Priority: Minor
>         Attachments: AVRO-1720.patch
>
>
> If you're dealing with bigger Avro files (>100MB), it would be nice to have a 
> way to quickly count the number of records contained in the file.
> With the current state of avro-tools, the only way to achieve this (to my 
> knowledge) is to dump the data to JSON and count the records. For bigger 
> files this can take a while because of the serialization overhead and 
> because every record needs to be examined.
> I added a new tool that is optimized for counting records: it does not 
> serialize the records and reads only the block count of each block.
> {panel:title=Naive benchmark}
> {noformat}
> # the input file had a size of ~300MB
> $ du -sh sample.avro 
> 323M    sample.avro
> # using the new count tool
> $ time java -jar avro-tools.jar count sample.avro
> 331439
> real    0m4.670s
> user    0m6.167s
> sys 0m0.513s
> # the current way of counting records
> $ time java -jar avro-tools.jar tojson sample.avro | wc
> 331439 54904484 1838231743
> real    0m52.760s
> user    1m42.317s
> sys 0m3.209s
> # the overhead of wc is rather minor
> $ time java -jar avro-tools.jar tojson sample.avro > /dev/null
> real    0m47.834s
> user    0m53.317s
> sys 0m1.194s
> {noformat}
> {panel}
> This tool uses the HDFS API, so it can handle files from any supported 
> filesystem. I added the unit tests to the existing TestDataFileTools since 
> it provided convenient utility functions that I could reuse for my test 
> scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
