Hi all,

I'm trying to build a non-regression testing tool to verify that the files
produced by two Pig scripts are equal.

The files are in PigStorage format. The first field is a key and the
remaining fields are opaque data (primitive or complex types).

Example:
        1       43      {(10), (12), (14)}      {(55), (90)}    0       60

I want to check that each key is present in both files or in neither, and
that for each key the lines are equal. By equal I mean logically equal, not
string- or byte-equal: the order of the elements inside a bag should not
matter. For example, the following two lines should be equal:
        1       43      {(10), (12), (14)}      {(55), (90)}    0       60
        1       43      {(12), (10), (14)}      {(90), (55)}    0       60
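
To make "logical equality" concrete, here is a minimal Python sketch (not Pig code, and not a tested tool) that turns one PigStorage line into a canonical value. It assumes tab-separated fields, bags written as {...} and tuples written as (...); maps are not handled, and atoms are compared as strings, so "1" and "1.0" would still differ. The function names are mine.

```python
def _skip_ws(s, i):
    """Return the index of the first non-space character at or after i."""
    while i < len(s) and s[i] == " ":
        i += 1
    return i


def _parse_items(s, i, close):
    """Parse comma-separated values up to `close`; return (items, next index)."""
    items = []
    i = _skip_ws(s, i)
    if i < len(s) and s[i] == close:  # empty bag or tuple
        return items, i + 1
    while True:
        item, i = _parse_value(s, i)
        items.append(item)
        i = _skip_ws(s, i)
        if i >= len(s):
            raise ValueError("missing %r in %r" % (close, s))
        if s[i] == close:
            return items, i + 1
        if s[i] != ",":
            raise ValueError("unexpected %r in %r" % (s[i], s))
        i += 1


def _parse_value(s, i):
    """Parse a bag, a tuple or an atom starting at index i."""
    i = _skip_ws(s, i)
    if i < len(s) and s[i] == "{":
        items, i = _parse_items(s, i + 1, "}")
        # A bag is unordered: sort its elements so that order is ignored.
        return ("bag", tuple(sorted(items, key=repr))), i
    if i < len(s) and s[i] == "(":
        items, i = _parse_items(s, i + 1, ")")
        # A tuple is ordered: keep the elements as they are.
        return ("tuple", tuple(items)), i
    j = i
    while j < len(s) and s[j] not in ",)}":
        j += 1
    return s[i:j].strip(), j


def canonical_field(text):
    """Canonical form of one field; falls back to the raw text if unparsable."""
    stripped = text.strip()
    if stripped[:1] in ("{", "("):
        try:
            value, end = _parse_value(stripped, 0)
            if end == len(stripped):
                return value
        except ValueError:
            pass
    return stripped


def canonical_line(line):
    """Canonical form of a whole tab-separated PigStorage line."""
    return tuple(canonical_field(f) for f in line.rstrip("\n").split("\t"))
```

With this sketch, canonical_line() returns the same value for the two lines above (once their fields are tab-separated), while two tuples such as (1,2) and (2,1) remain different.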


My issue is that since this tool needs to operate on a lot of different
files, it should not rely on a predefined schema. I experimented with
the following idea:

------
        f1 = LOAD '$FILE1' USING PigStorage();
        f2 = LOAD '$FILE2' USING PigStorage();

        g_f1 = GROUP f1 BY $0;
        g_f2 = GROUP f2 BY $0;

        joined = JOIN
                g_f1  by group full outer,
                g_f2  by group;

        cmp = FILTER joined by
                g_f1::group is null
                or  g_f2::group is null
                or  SIZE(DIFF(g_f1::f1, g_f2::f2)) != 0;

        dump cmp;
------

Unfortunately, since no schema is specified at load time, the fields in
g_f1::f1 and g_f2::f2 are instances of DataByteArray. This means that the
DIFF function does not behave as wanted: a byte-by-byte comparison is
performed rather than a logical comparison. For example, "1 {(2),(1)}" and
"1 {(1),(2)}" are considered different since their byte representations are
not the same.

Do you know if such a tool already exists, or how to write one?

I currently foresee three options:

1- Specify the schema. This could be done using scripting and a
     file-to-schema mapping, with the schema inserted through a variable.
     However, the schema of each file has to be described manually, which
     is a cumbersome process.

2- Use PigStorageSchema instead of PigStorage. I believe this would solve
     the issue, but being stuck with 0.8.1 I'm wondering whether
     PigStorageSchema is reasonably robust and free of side effects for
     use in production scripts.

3- Write a custom DIFF UDF taking two DataByteArray. This option avoids
     modifying the production scripts, but I don't know how much effort is
     required to write such a UDF. Parsing the DataByteArray to rebuild a
     set/list/string structure seems quite easy. Do you think some part of
     the Pig code, such as Utf8StorageConverter, can be reused, or should
     I simply write my own parser?
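
Whichever option ends up providing the per-line normalization, the key-level check itself (key present on both sides, same lines for each key) could also be prototyped outside Pig. The Python sketch below is only an illustration under assumptions of mine: both files fit in memory, fields are tab-separated, and compare_files and its normalize parameter are hypothetical names. The default normalize gives plain string equality; a logical normalizer would be plugged in there.

```python
from collections import Counter, defaultdict


def group_by_key(lines, normalize):
    """Map each key (first field) to a multiset of its normalized data fields."""
    groups = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        key, *data = line.split("\t")
        groups[key][normalize(data)] += 1
    return groups


def compare_files(lines1, lines2, normalize=tuple):
    """Return {key: reason} for every key whose lines differ between the inputs.

    `normalize` receives the list of data fields of one line and must return
    a hashable value; two lines are equal when their normalized values are.
    """
    g1 = group_by_key(lines1, normalize)
    g2 = group_by_key(lines2, normalize)
    report = {}
    for key in sorted(g1.keys() | g2.keys()):
        if key not in g2:
            report[key] = "only in first"
        elif key not in g1:
            report[key] = "only in second"
        elif g1[key] != g2[key]:
            report[key] = "differs"
    return report
```

This mirrors the GROUP / full outer JOIN / FILTER structure of the script above, except that the lines of a key are compared as a multiset (duplicates count) and an empty report means the two files are equal.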


Thanks !

- Clément

