Your code builds a new builder and instance each time through the loop:
for (int i = 0; i < 1000000; i++) {
    user = User.newBuilder().build();
    ...
How does it perform if you move that second line outside the loop?
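
For example, here is a minimal sketch of the reuse pattern. `UserStandIn` is a hypothetical stand-in for the generated Avro `User` class (which isn't reproduced in this thread); the only point is where the allocation happens.

```java
import java.util.Random;

// Hypothetical stand-in for the generated Avro User class; only here so
// the sketch is self-contained.
class UserStandIn {
    private String firstName;
    private int favoriteNumber;
    void setFirstName(String v) { firstName = v; }
    void setFavoriteNumber(int v) { favoriteNumber = v; }
    String getFirstName() { return firstName; }
    int getFavoriteNumber() { return favoriteNumber; }
}

public class ReuseSketch {
    public static void main(String[] args) {
        Random random = new Random();
        // Build once, outside the loop, instead of once per iteration.
        UserStandIn user = new UserStandIn();
        for (int i = 0; i < 1000000; i++) {
            // Mutate only the fields that change per record. DataFileWriter
            // serializes the datum at append() time, so reusing one instance
            // across iterations is safe.
            user.setFirstName("testName" + random.nextLong());
            user.setFavoriteNumber(random.nextInt());
            // dataFileWriter.append(user);
        }
        System.out.println("simulated 1000000 records");
    }
}
```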
Thanks,
Doug
On Fri, Feb 2, 2018 at 3:50 PM, Nishanth S <[email protected]> wrote:
> Thanks Doug . Here is a comparison .
>
> Load Avro Record Size: roughly 15 KB
>
> I used the same payload with a schema that has around 2k fields and
> with another schema that has 5 fields. In both cases I reused the Avro
> object, building it once. The test wrote 1M records, i.e. the same
> amount of data (1 GB), to a local drive, and I ran it a few times
> single threaded. Average TPS with the smaller schema is 40K, whereas
> with the bigger schema it drops to 10K even though both write the same
> amount of data. Since I only create the Avro object once in both
> cases, it looks like there is overhead in the DataFileWriter too for
> bigger schemas.
>
>
>
> public static void main(String[] args) {
>     try {
>         new LoadGenerator().load();
>     } catch (IOException e) {
>         e.printStackTrace();
>     }
> }
>
> DataFileWriter<User> dataFileWriter;
> DatumWriter<User> datumWriter;
> FileSystem hdfsFileSystem;
> Configuration conf;
> Path path;
> OutputStream outStream;
> User user;
> com.google.common.base.Stopwatch stopwatch =
>     new com.google.common.base.Stopwatch().start();
>
> public void load() throws IOException {
>     conf = new Configuration();
>     hdfsFileSystem = FileSystem.get(conf);
>     datumWriter = new SpecificDatumWriter<User>(User.class);
>     dataFileWriter = new DataFileWriter<User>(datumWriter);
>     dataFileWriter.setCodec(CodecFactory.snappyCodec());
>     path = new Path("/projects/tmp/load.avro");
>     outStream = hdfsFileSystem.create(path, true);
>     dataFileWriter.create(User.getClassSchema(), outStream);
>     dataFileWriter.setFlushOnEveryBlock(false);
>     // Create and load User records
>     Random random = new Random();
>     int numRecords = 1000000;
>     for (int i = 0; i < numRecords; i++) {
>         user = User.newBuilder().build();
>         user.setFirstName("testName" + random.nextLong());
>         user.setFavoriteNumber(random.nextInt());
>         user.setFavoriteColor("blue" + random.nextFloat());
>         user.setData(ByteBuffer.wrap(new byte[15000]));
>         dataFileWriter.append(user);
>     }
>     dataFileWriter.close();
>     stopwatch.stop();
>     long elapsedTime = stopwatch.elapsedTime(TimeUnit.SECONDS);
>     System.out.println("Time elapsed for load() is " + elapsedTime);
> }
>
> On Mon, Jan 29, 2018 at 11:01 AM, Doug Cutting <[email protected]> wrote:
>
>> Builders have some inherent overheads. Things could be optimized to
>> better minimize this, but it will likely always be faster to reuse a single
>> instance when writing.
>>
>> The deepCopy calls are probably copying the default values of each field
>> you're not setting. If you're only setting a few fields, then you might use
>> a builder to create a single instance so its defaults are set, then reuse
>> that instance as you write, setting only those few fields you need to
>> differ from the defaults. (This only works if you're setting the same
>> fields every time. Otherwise you'd need to restore the default value.)
>>
>> An optimization for Avro here might be to inline default values for
>> immutable types when generating the build() method.
>>
>> Doug
>>
>> On Fri, Jan 26, 2018 at 9:04 AM, Nishanth S <[email protected]>
>> wrote:
>>
>>> Hello Every One,
>>>
>>> We have a process that reads data from a local file share, serializes
>>> it, and writes it to HDFS in Avro format. I am wondering if I am building
>>> the Avro objects correctly. For every record read from the binary file we
>>> create an equivalent Avro object in the format below.
>>>
>>> Parent p = new Parent();
>>> LOGHDR hdr = LOGHDR.newBuilder().build();
>>> MSGHDR msg = MSGHDR.newBuilder().build();
>>> p.setHdr(hdr);
>>> p.setMsg(msg);
>>> // ... set the remaining fields on p ...
>>> datumFileWriter.write(p);
>>>
>>> This Avro schema has around 1800 fields, including 26 nested types.
>>> I did some load testing and found that serializing the same object to
>>> disk is about 6x faster than constructing a new object each time. When
>>> a new Avro object is constructed every time using
>>> RecordBuilder.build(), much of the time is spent in
>>> GenericData.deepCopy(). Has anyone run into a similar problem? We are
>>> using Avro 1.8.2.
>>>
>>> Thanks,
>>> Nishanth