How to perform a logical diff on two PigStorage files

Clément MATHIEU Fri, 30 Nov 2012 05:49:09 -0800

Hi all,

I'm trying to build a non-regression testing tool to verify that the files
produced by two Pig scripts are equal.

The files are in PigStorage format. The first field is a key and the remaining
fields are opaque data (primitive or complex types).

Example:
        1       43      {(10), (12), (14)}      {(55), (90)}    0       60

I want to check that each key is present in both files or in neither, and that
for each key the lines are equal. By equal I mean logical equality, not string
or byte equality. For example, the following two lines should be equal:
        1       43      {(10), (12), (14)}      {(55), (90)}    0       60
        1       43      {(12), (10), (14)}      {(90), (55)}    0       60
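
To make the intended check concrete, here is a rough sketch in plain Python, outside of Pig. The names group_by_key, logical_diff and normalize are made up for illustration, and it assumes both files fit in memory. The normalize argument stands for whatever turns the non-key part of a line into a canonical form; with the default, it falls back to plain string equality.

```python
from collections import Counter, defaultdict


def group_by_key(lines, normalize):
    """Map each key (first tab-separated field) to a multiset of normalized rows."""
    groups = defaultdict(Counter)
    for line in lines:
        key, _, rest = line.rstrip("\n").partition("\t")
        groups[key][normalize(rest)] += 1
    return groups


def logical_diff(lines1, lines2, normalize=lambda rest: rest):
    """Return the sorted keys that are missing on one side or whose rows differ."""
    g1 = group_by_key(lines1, normalize)
    g2 = group_by_key(lines2, normalize)
    # .get() returns None for a missing key, which never equals a Counter,
    # so a key present on only one side is always reported.
    return sorted(k for k in g1.keys() | g2.keys() if g1.get(k) != g2.get(k))


if __name__ == "__main__":
    a = ["1\tx,y\n", "2\tz\n"]
    b = ["2\tz\n", "1\ty,x\n", "3\tw\n"]
    print(logical_diff(a, b))
    # Hypothetical normalizer: treat the rest of the line as an unordered list.
    print(logical_diff(a, b, lambda rest: tuple(sorted(rest.split(",")))))
```

An empty result would mean the two files are logically equal under the chosen normalizer.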


My issue is that since this tool needs to operate on a lot of different
files, it should not rely on a predefined schema. I experimented with
the following idea:

------
        f1 = LOAD '$FILE1' USING PigStorage();
        f2 = LOAD '$FILE2' USING PigStorage();

        g_f1 = GROUP f1 BY $0;
        g_f2 = GROUP f2 BY $0;

        joined = JOIN
                g_f1  by group full outer,
                g_f2  by group;

        cmp = FILTER joined by
                g_f1::group is null
                or  g_f2::group is null
                or  SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;

        dump cmp;
------

Unfortunately, since no schema is specified at load time, the fields in
g_f1::f1 and g_f2::f2 are instances of DataByteArray. This means that the DIFF
function does not behave as wanted: a byte-by-byte comparison is performed
rather than a logical comparison. For example, "1 {(2),(1)}" and "1 {(1),(2)}"
are different since their byte representations are not the same.

Do you know if such a tool already exists, or how to write one?

I currently foresee three options:

1- Specify the schema. This could be done using scripting and a
   file-to-schema mapping. The schema would be inserted using a variable.
   However, the schema of each file has to be described manually, which is
   a cumbersome process.

2- Use PigStorageSchema instead of PigStorage. I believe this would solve
   the issue, but being stuck with 0.8.1 I'm wondering if PigStorageSchema
   is robust and side-effect free enough to be used in production scripts.

3- Write a custom DIFF UDF taking two DataByteArray. This option avoids
   modifying production scripts, but I don't know how much effort is
   required to write such a UDF. Parsing the DataByteArray to rebuild a
   set/list/string structure seems quite easy. Do you think some part of
   the Pig code, like Utf8StorageConverter, can be reused, or should I
   simply write my own parser?
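
For the "write my own parser" route, this is roughly what I have in mind, sketched in plain Python rather than as an actual Pig UDF. It assumes the text form shown in the examples above: tab-separated fields, bags written as {...}, tuples as (...), everything else an atom. Maps and escaping are not handled, and atoms are compared as strings, so "1" and "1.0" would differ. The function names are made up for illustration.

```python
def _skip_ws(s, i):
    while i < len(s) and s[i] == " ":
        i += 1
    return i


def _parse_items(s, i, closer):
    """Parse comma-separated items up to `closer`; return (items, next index)."""
    items = []
    i = _skip_ws(s, i)
    if i < len(s) and s[i] == closer:  # empty bag or tuple
        return items, i + 1
    while True:
        item, i = _parse(s, i)
        items.append(item)
        i = _skip_ws(s, i)
        if i >= len(s):
            raise ValueError("unterminated bag or tuple")
        if s[i] == ",":
            i += 1
        elif s[i] == closer:
            return items, i + 1
        else:
            raise ValueError("unexpected character %r" % s[i])


def _parse(s, i):
    i = _skip_ws(s, i)
    if i < len(s) and s[i] == "{":
        items, i = _parse_items(s, i + 1, "}")
        # Bag: order is irrelevant, so store the items in a canonical order.
        return ("bag", tuple(sorted(items, key=repr))), i
    if i < len(s) and s[i] == "(":
        items, i = _parse_items(s, i + 1, ")")
        # Tuple: order is significant, keep it as is.
        return ("tuple", tuple(items)), i
    j = i
    while j < len(s) and s[j] not in ",)}":
        j += 1
    return s[i:j].strip(), j


def parse_field(text):
    """Turn one field into a canonical value comparable with ==."""
    text = text.strip()
    if not text.startswith(("{", "(")):
        return text  # top-level atom, kept verbatim
    value, pos = _parse(text, 0)
    if pos != len(text):
        raise ValueError("trailing data in field %r" % text)
    return value


def parse_line(line):
    return tuple(parse_field(f) for f in line.rstrip("\n").split("\t"))


if __name__ == "__main__":
    a = "1\t43\t{(10), (12), (14)}\t{(55), (90)}\t0\t60"
    b = "1\t43\t{(12), (10), (14)}\t{(90), (55)}\t0\t60"
    print(parse_line(a) == parse_line(b))
```

Two lines would then be logically equal when parse_line gives the same result for both. Duplicates inside a bag still count, so {(1),(1)} and {(1)} compare as different.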


Thanks !

- Clément
