Hi,

I have around 4 million time series. ~1000 of them had a special
occurrence at some point. Now, I want to draw 10 samples for each
special time-series based on a similarity comparison.

What I have currently implemented is a Python script that consumes the
time-series one by one and compares each against all 1000 special
time-series. If the similarity with one of them is sufficient, I pass
it back to Pig and strike out the corresponding special time-series, so
subsequent time-series are not compared against it.
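
To make the matching step concrete, here is a minimal sketch of the
logic described above. It is an illustration only: the input format
(an id, a tab, then comma-separated values), the Pearson-correlation
`similarity` function, the 0.9 threshold, and the names `parse` and
`match_stream` are all assumptions, not the actual script.

```python
import math


def similarity(a, b):
    """Pearson correlation of two equal-length sequences (assumed measure)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A constant series has no defined correlation; treat as dissimilar.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def parse(line):
    """Split a hypothetical 'id<TAB>v1,v2,...' line into (id, [floats])."""
    key, values = line.rstrip("\n").split("\t")
    return key, [float(v) for v in values.split(",")]


def match_stream(lines, specials, threshold=0.9):
    """Yield (series_id, special_id) for each incoming series that is
    sufficiently similar to a special series that is still available."""
    remaining = dict(specials)  # copy, so the caller's dict is untouched
    for line in lines:
        key, series = parse(line)
        for special_id, special in remaining.items():
            if len(special) != len(series):
                continue
            if similarity(series, special) >= threshold:
                # Strike out the matched special series so that later
                # series are no longer compared against it.
                del remaining[special_id]
                yield key, special_id
                break  # safe: iteration over `remaining` stops here
```

In a streaming script, `lines` would be `sys.stdin`, and each yielded
pair would be written to standard output as a tab-separated line for
Pig to read back.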

This routine works, but it takes around 6 hours.

One of the problems I'm facing is that Pig starts more than 160
instances of the script, although 10 would be sufficient. Is there a
way to set the number of script instances Pig starts in a
`STREAM THROUGH` step? I tried setting `default_parallel` to 10, but
it doesn't seem to have any effect.

I'm also open to any other ideas on how to accomplish the task.

Regards,
        Thomas Bach.
