Your code builds a new builder and instance each time through the loop:

  for (int i = 0; i < 1000000; i++) {
      user = User.newBuilder().build();
      ...

How does it perform if you move that second line outside the loop?
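A self-contained sketch of that comparison (note: the tiny `User` stand-in below is hypothetical, just to contrast the two allocation patterns; the real class is generated by the Avro compiler, and its build() does far more work per call, including deep-copying field defaults):

```java
// Hypothetical stand-in for the Avro-generated User class, only to
// contrast the two patterns; the real generated build() deep-copies
// every unset field's default value.
class User {
    String firstName;
    static Builder newBuilder() { return new Builder(); }
    static class Builder {
        User build() { return new User(); }
    }
}

public class ReuseDemo {
    static long[] runBoth(int n) {
        // Pattern A: new builder + new instance every iteration.
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            User u = User.newBuilder().build();
            u.firstName = "testName" + i;
        }
        long perIteration = System.nanoTime() - t0;

        // Pattern B: build once outside the loop, mutate and reuse.
        t0 = System.nanoTime();
        User u = User.newBuilder().build();
        for (int i = 0; i < n; i++) {
            u.firstName = "testName" + i;
        }
        long reused = System.nanoTime() - t0;
        return new long[] { perIteration, reused };
    }

    public static void main(String[] args) {
        long[] t = runBoth(1_000_000);
        System.out.println("per-iteration build ns: " + t[0]);
        System.out.println("single-instance ns: " + t[1]);
    }
}
```

With the real generated class, the gap between the two patterns should be far larger than this stand-in shows, since each real build() pays the deep-copy cost.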

Thanks,

Doug


On Fri, Feb 2, 2018 at 3:50 PM, Nishanth S <[email protected]> wrote:

> Thanks Doug. Here is a comparison.
>
> Load Avro record size: roughly 15 KB.
>
> I used the same payload with a schema that has around 2k fields and also
> with another schema that has 5 fields. I reused the Avro object in both
> cases, using a builder once. The test was run for 1M records, writing the
> same amount of data (1 GB) to a local drive, single threaded, a few times.
> Average TPS with the smaller schema is 40K, whereas with the bigger schema
> it drops to 10K, even though both write the same amount of data. Since I am
> only creating the Avro object once in both cases, it looks like there is
> overhead in the DataFileWriter too in the case of bigger schemas.
>
>
>
> public class LoadGenerator {
>
>     public static void main(String[] args) {
>         try {
>             new LoadGenerator().load();
>         } catch (IOException e) {
>             e.printStackTrace();
>         }
>     }
>
>     DataFileWriter<User> dataFileWriter;
>     DatumWriter<User> datumWriter;
>     FileSystem hdfsFileSystem;
>     Configuration conf;
>     Path path;
>     OutputStream outStream;
>     User user;
>     com.google.common.base.Stopwatch stopwatch =
>             new com.google.common.base.Stopwatch().start();
>
>     public void load() throws IOException {
>         conf = new Configuration();
>         hdfsFileSystem = FileSystem.get(conf);
>         datumWriter = new SpecificDatumWriter<User>(User.class);
>         dataFileWriter = new DataFileWriter<User>(datumWriter);
>         dataFileWriter.setCodec(CodecFactory.snappyCodec());
>         path = new Path("/projects/tmp/load.avro");
>         outStream = hdfsFileSystem.create(path, true);
>         dataFileWriter.create(User.getClassSchema(), outStream);
>         dataFileWriter.setFlushOnEveryBlock(false);
>
>         // Create and load User records
>         Random random = new Random();
>         int numRecords = 1000000;
>         for (int i = 0; i < numRecords; i++) {
>             user = User.newBuilder().build();
>             user.setFirstName("testName" + random.nextLong());
>             user.setFavoriteNumber(random.nextInt());
>             user.setFavoriteColor("blue" + random.nextFloat());
>             user.setData(ByteBuffer.wrap(new byte[15000]));
>             dataFileWriter.append(user);
>         }
>         dataFileWriter.close();
>
>         stopwatch.stop();
>         long elapsedTime = stopwatch.elapsedTime(TimeUnit.SECONDS);
>         System.out.println("Time elapsed for load() is " + elapsedTime);
>     }
> }
>
> On Mon, Jan 29, 2018 at 11:01 AM, Doug Cutting <[email protected]> wrote:
>
>> Builders have some inherent overheads.  Things could be optimized to
>> better minimize this, but it will likely always be faster to reuse a single
>> instance when writing.
>>
>> The deepCopy's are probably of the default values of each field you're
>> not setting.  If you're only setting a few fields then you might use a
>> builder to create a single instance so its defaults are set, then reuse
>> that instance as you write, setting only those few fields you need to
>> differ from the default.  (This only works if you're setting the same
>> fields every time.  Otherwise you'd need to restore the default value.)
>>
>> An optimization for Avro here might be to inline default values for
>> immutable types when generating the build() method.
>>
>> Doug
>>
>> On Fri, Jan 26, 2018 at 9:04 AM, Nishanth S <[email protected]>
>> wrote:
>>
>>> Hello Every One,
>>>
>>> We have a process that reads data from a local file share, serializes
>>> it, and writes to HDFS in Avro format. I am just wondering if I am
>>> building the Avro objects correctly. For every record that is read from
>>> the binary file we create an equivalent Avro object in the format below.
>>>
>>> Parent p = new Parent();
>>> LOGHDR hdr = LOGHDR.newBuilder().build();
>>> MSGHDR msg = MSGHDR.newBuilder().build();
>>> p.setHdr(hdr);
>>> p.setMsg(msg);
>>> p..
>>> p..set
>>> datumFileWriter.write(p);
>>>
>>> This Avro schema has around 1800 fields, including 26 nested types. I
>>> did some load testing and found that if I serialize the same object to
>>> disk, performance is 6x faster than constructing a new object each time
>>> (p.build). When a new Avro object is constructed every time using
>>> RecordBuilder.build(), much of the time is spent in
>>> GenericData.deepCopy(). Has anyone run into a similar problem? We are
>>> using Avro 1.8.2.
>>>
>>> Thanks,
>>> Nishanth
>>>
>>>
>>>
>>>
>>>
>>
>
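Doug's single-build-then-reuse suggestion above, including its restore-the-default caveat, can be sketched like this (a plain `Map` stands in for the generated Avro record here, and the field names and default values are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class TemplateReuse {
    // Hypothetical schema defaults; a real Avro builder fills these in
    // from the schema when build() is called.
    static Map<String, Object> defaults() {
        Map<String, Object> m = new HashMap<>();
        m.put("firstName", "unknown");
        m.put("favoriteColor", "none");
        m.put("favoriteNumber", 0);
        return m;
    }

    static Map<String, Object> writeAll(int n) {
        Map<String, Object> user = defaults();  // "built" once, defaults populated
        for (int i = 0; i < n; i++) {
            // Overwrite only the fields that vary per record; untouched
            // fields keep the defaults from the single build.
            user.put("firstName", "testName" + i);
            user.put("favoriteNumber", i);
            // dataFileWriter.append(user) would go here in the real code
        }
        // Doug's caveat: a field set in one record but not the next keeps
        // its stale value, so restore the default explicitly when needed.
        user.put("firstName", defaults().get("firstName"));
        return user;
    }

    public static void main(String[] args) {
        System.out.println(writeAll(3).get("firstName"));  // prints "unknown"
    }
}
```

This only pays off when the same few fields are set on every record; otherwise the bookkeeping to restore defaults can eat the savings.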
