IIRC, enabling symlink creation for your cached files should solve the problem. Call DistributedCache.createSymlink(conf) on your JobConf before submitting the job.
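Something along these lines on the submission side (untested sketch -- it assumes the HDFS paths from your mail and that 0.17 honors the "#name" fragment on a cache URI as the symlink name, the way later releases document it; the class and method names are just for illustration):

    import java.net.URI;
    import java.net.URISyntaxException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup
    {
        // Call this on the JobConf before submitting the job.
        static void addMapFileToCache(JobConf conf) throws URISyntaxException
        {
            // The "#name" fragment becomes the name of the symlink that the
            // TaskTracker creates in each task's working directory.
            DistributedCache.addCacheFile(new URI("/2008-12-19/url/data#data"), conf);
            DistributedCache.addCacheFile(new URI("/2008-12-19/url/index#index"), conf);
            // Without this call no symlinks are created at all.
            DistributedCache.createSymlink(conf);
        }
    }

With "data" and "index" as the link names, the task's working directory ends up looking like a MapFile directory (a "data" file next to an "index" file), which is what MapFile.Reader expects. A matching mapper-side sketch follows below the quoted mail.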
On 12/25/08 10:40 AM, "Sean Shanny" <[email protected]> wrote:

> To all,
>
> Version: hadoop-0.17.2.1-core.jar
>
> I created a MapFile on a local node.
>
> I put the files into the HDFS using the following commands:
>
> $ bin/hadoop fs -copyFromLocal /tmp/ur/data /2008-12-19/url/data
> $ bin/hadoop fs -copyFromLocal /tmp/ur/index /2008-12-19/url/index
>
> and placed them in the DistributedCache using the following calls in
> the JobConf class:
>
> DistributedCache.addCacheFile(new URI("/2008-12-19/url/data"), conf);
> DistributedCache.addCacheFile(new URI("/2008-12-19/url/index"), conf);
>
> What I cannot figure out how to do is actually access the MapFile now
> within my Map code. I tried the following but I am getting file not
> found errors when I try to run the job.
>
>     private FileSystem fs;
>     private MapFile.Reader myReader;
>     private Path[] localFiles;
>
>     ....
>
>     public void configure(JobConf conf)
>     {
>         String[] s = conf.getStrings("map.input.file");
>         m_sFileName = s[0];
>
>         try
>         {
>             localFiles = DistributedCache.getLocalCacheFiles(conf);
>
>             for (Path localFile : localFiles)
>             {
>                 String sFileName = localFile.getName();
>
>                 if (sFileName.equalsIgnoreCase("data"))
>                 {
>                     System.out.println("Full Path: " + localFile.toString());
>                     System.out.println("Parent: " + localFile.getParent().toString());
>
>                     fs = FileSystem.get(localFile.toUri(), conf);
>                     myReader = new MapFile.Reader(fs, localFile.getParent().toString(), conf);
>                 }
>             }
>         }
>         catch (IOException e)
>         {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>
> The following exception is thrown and I cannot figure out why it is
> adding the extra data element at the end of the path. The data is
> actually at
>
> Task Logs: 'task_200812250002_0001_m_000000_0'
>
> stdout logs
>
> Full Path: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
> Parent: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data
>
> stderr logs
>
> java.io.FileNotFoundException: File does not exist: /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data/data
>     at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:628)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1426)
>     at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:301)
>     at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:283)
>     at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:272)
>     at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:259)
>     at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:252)
>     at com.TripResearch.warehouse.etl.EtlTestUrlMapLookup.configure(EtlTestUrlMapLookup.java:84)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
>
> The files do exist but I don't understand why they were placed in
> their own directories.
> I would have expected both files to exist at /2008-12-19/url/ not
> /2008-12-19/url/data/ and /2008-12-19/url/index/
>
> ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/data
> total 740640
> drwxr-xr-x 2 root root      4096 Dec 24 23:49 .
> drwxr-xr-x 4 root root      4096 Dec 24 23:49 ..
> -rwxr-xr-x 1 root root 751776245 Dec 24 23:49 data
> -rw-r--r-- 1 root root   5873260 Dec 24 23:49 .data.crc
>
> [r...@hdp01n warehouse]# ls -la /tmp/hadoop-root/mapred/local/taskTracker/archive/hdp01n/2008-12-19/url/index
> total 2148
> drwxr-xr-x 2 root root    4096 Dec 25 00:04 .
> drwxr-xr-x 4 root root    4096 Dec 25 00:04 ..
> -rwxr-xr-x 1 root root 2165220 Dec 25 00:04 index
> -rw-r--r-- 1 root root   16924 Dec 25 00:04 .index.crc
>
> ....
>
> I know I must be doing something really stupid here as I am sure this
> has been done by lots of folks prior to my feeble attempt. I did a
> google search but really could not come up with any examples of using
> a MapFile on the DistributedCache.
>
> Thanks.
>
> --sean
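PS: on the mapper side, the MapFile could then be opened straight out of the task's working directory via the local file system, instead of from the per-file archive paths that getLocalCacheFiles() returns. Again only a sketch under the assumptions above (symlinks named "data" and "index" in the working directory; I haven't checked whether the relative "." path needs to be made absolute, and the class name is just for illustration):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class UrlLookupMapperBase extends MapReduceBase
    {
        private MapFile.Reader myReader;

        public void configure(JobConf conf)
        {
            try
            {
                // The working directory now contains the "data" and "index"
                // symlinks, so treat it as the MapFile directory and read it
                // through the local file system, not through HDFS.
                FileSystem localFs = FileSystem.getLocal(conf);
                myReader = new MapFile.Reader(localFs, ".", conf);
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        }
    }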
