IEEE International Conference on Data Mining (ICDM 2008)
Data Mining Contest: Radioxenon Monitoring for Verification of the Comprehensive Nuclear-Test-Ban Treaty

Organisers
----------
Kurt Ungar, Trevor Stocki and Jing Yi, Health Canada
Nathalie Japkowicz, University of Ottawa
Arno Siebes, Universiteit Utrecht

Website: http://www.cs.uu.nl/groups/ADA/icdm08cup/

The IEEE ICDM 2008 Data Mining Contest is, simply put, about keeping the world safe using data mining. The contest is about developing and testing data mining techniques to verify worldwide compliance with the global ban on nuclear tests. Such tests can be detected by measuring the amounts of particular xenon isotopes. It is not quite that simple, however: these isotopes are also emitted during various legal activities.

Timeline for the Contest
------------------------
1) September 12, 2008: Release of the training and test data sets and of the software tools that will be used to evaluate the results.
2) October 22, 2008: Results from the labelled set are due.
3) November 8, 2008: Results obtained on the unlabelled data set are due.
4) December 15-19: Results of the competition are announced at the conference.

General Description of the Problem
-----------------------------------
Compliance verification of the Comprehensive Nuclear-Test-Ban Treaty (CTBT), when the treaty enters into force, will employ four remote sensing technologies to detect nuclear explosions. Only radionuclide detection can unequivocally establish that an explosion was due to a nuclear detonation. Radioactive noble gases, specifically the xenon isotopes Xe-131m, Xe-133m, Xe-133, and Xe-135, are sampled and measured in a procedure called radionuclide monitoring. Different relative combinations of these isotopes correspond to different signatures that can be mapped to distinct sources (such as nuclear power plants, medical isotope production facilities, or various types of weapons).

The problem of attributing a specific observation of airborne concentrations of radioxenon to an explosion is twofold.
Firstly, in the first few weeks after an explosion, the four isotopes are expected to occur in “fingerprint” relative concentrations quite distinct from those of other background sources. Since the CTBT stations are not located at the source of the explosion, the radioxenon is detected at a location that can be well over a thousand kilometres away. This atmospheric transport process can take weeks, which can increase the complexity of the signature. Secondly, one can never observe radioxenon emitted purely from an explosion source, only admixtures of this gas with the radioxenon released from all background sources. These two points constitute an interesting data mining problem for the Preparatory Commission for the Comprehensive Nuclear-Test-Ban Treaty Organization (CTBTO).

Description of the dataset to be used
-------------------------------------
Radioxenon measurements from four to five CTBTO monitoring sites will be provided. These will comprise a few hundred to a few thousand sets of observations of the four isotopes for each site. A synthesized set of explosion observations at these same sites will be added to the actual radioxenon concentrations caused by background sources.

The data sets are composed of two classes, Background (B) and Background plus Explosion (B+E). Each class consists of a set of quadruplets representing the four activity concentrations of Xe-131m, Xe-133m, Xe-133, and Xe-135 for a given air sample. We will issue labelled data sets containing both classes during the first phase of the competition, while teams develop a classification method appropriate for this task. In a second phase, we will issue a new data set that also contains data from both classes, but we will withhold the labels. This testing data set will be used for our final evaluation.

Description of the computational tasks
--------------------------------------
Two versions of the data set will be provided.
The first will describe each datum by its station of origin, by a unique randomly assigned tracking number that allows the contest evaluators to trace the datum back to the original explosion release scenario, and by its class: Background or Background plus Explosion. The second version will describe each datum by its station of origin, using the same stations as the first data set, and by a unique randomly assigned tracking number serving the same purpose. The second data set will contain both B and B+E cases, but the class of each datum will be unknown to the contestants. The first version of the data will be employed in Tasks 1 and 2. The second version will be employed in Task 3.

Task 1: The first task is to classify the results as Background (B) or Background plus Explosion (B+E) as accurately as possible over the entire set of stations provided, using one classifier. Contestants may combine data as they see fit. They may tune classifier parameters separately for each station, but they may not use separate classifier parameter types or separate classifiers for each station. Contestants may report on more than one classifier for this task.

Task 2: In the second task, conversely, the contestant is asked to identify an optimal algorithm for each station given.

Task 3: In the third task, the contestants will apply the classifiers developed in Tasks 1 and 2 to the second data set and report their results for evaluation.

The primary goal of this contest is to produce methods that are broadly applicable over different station background measurement distributions and explosion source hypotheses. The best methods will also have a very efficient learning curve. Recognition will also be given to methods that are more proficient at properly categorizing data arising from specific classes of explosion release hypotheses or station background types, because these methods add a forensic or diagnostic dimension.
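
To make the Task 1 constraint concrete, the following minimal Python sketch shows one way to read "one classifier, parameters tuned per station". It is only an illustration under stated assumptions: the `Sample` record layout, the station names "S1" and "S2", the toy activity values, and the rule itself (a threshold on the summed activity concentration) are all invented here and are not part of the contest material or a recommended method.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Order of the four activity concentrations in each quadruplet.
ISOTOPES = ("Xe-131m", "Xe-133m", "Xe-133", "Xe-135")


@dataclass(frozen=True)
class Sample:
    """Hypothetical record layout; the real file format is not specified here."""
    station: str                                  # station of origin
    tracking_id: int                              # randomly assigned tracking number
    activity: Tuple[float, float, float, float]   # values in ISOTOPES order
    label: Optional[str] = None                   # "B", "B+E", or None if withheld


def total_activity(sample: Sample) -> float:
    """Placeholder feature: sum of the four activity concentrations."""
    return sum(sample.activity)


def tune_threshold(samples: List[Sample]) -> float:
    """Pick the threshold with the best training accuracy for one station.

    The rule predicts "B+E" when total_activity exceeds the threshold.
    Ties are resolved in favour of the smallest candidate threshold.
    """
    candidates = sorted({total_activity(s) for s in samples})
    best_threshold, best_accuracy = candidates[0], -1.0
    for t in candidates:
        correct = sum(
            (total_activity(s) > t) == (s.label == "B+E") for s in samples
        )
        accuracy = correct / len(samples)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold


def fit_per_station(samples: List[Sample]) -> Dict[str, float]:
    """Same classifier type everywhere; only the threshold differs by station."""
    by_station: Dict[str, List[Sample]] = {}
    for s in samples:
        by_station.setdefault(s.station, []).append(s)
    return {station: tune_threshold(group) for station, group in by_station.items()}


def predict(sample: Sample, thresholds: Dict[str, float]) -> str:
    """Classify one sample using the threshold tuned for its station."""
    return "B+E" if total_activity(sample) > thresholds[sample.station] else "B"


# Toy labelled data with invented values, for illustration only.
train = [
    Sample("S1", 1, (0.1, 0.0, 0.5, 0.0), "B"),
    Sample("S1", 2, (0.2, 0.1, 2.0, 0.4), "B+E"),
    Sample("S2", 3, (0.5, 0.2, 3.0, 0.1), "B"),
    Sample("S2", 4, (0.6, 0.5, 6.0, 1.0), "B+E"),
]
thresholds = fit_per_station(train)
```

With these toy values, a sample with a total activity of about 3.5 is assigned to B+E at station "S1" but to B at station "S2", because the second station's background is higher. A Task 2 entry, by contrast, would be free to replace `tune_threshold` with an entirely different algorithm for each station.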