Re: Hadoop Compatibility and EMR

2010-03-24 Thread Vibhooti Verma
Upgrade information is given at http://wiki.apache.org/hadoop/Hadoop_Upgrade Our team did the upgrade from 15.5 to 18 while keeping the data intact, but it required a lot of testing. I suggest first doing the upgrade on test data and only then on production data. On Thu, Mar 25, 2010 at 9:58 A…

Re: posted again: how are the splits for map tasks computed?

2010-03-24 Thread Ravi Phulari
Hello Abhishek, unless you have modified the conf/mapred-site.xml file, MapReduce will use the configuration values specified in $HADOOP_HOME/src/mapred/mapred-default.xml. In that file mapred.map.tasks is set to 2, which is why your job is running 2 map tasks. mapred.map.tas…
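The way mapred.map.tasks interacts with the actual split count can be sketched in Python. This follows the split-size formula of the old FileInputFormat (goal size = total size / requested maps, clamped between the minimum split size and the block size); the 1368654-byte file size comes from the thread, the block size and minimum are illustrative defaults:

```python
def compute_splits(total_size, num_map_tasks, block_size, min_split_size=1):
    """Sketch of the old FileInputFormat.getSplits() split-length logic."""
    SPLIT_SLOP = 1.1  # a trailing chunk up to 10% over splitSize is folded in
    goal_size = total_size // max(1, num_map_tasks)
    split_size = max(min_split_size, min(goal_size, block_size))
    splits, remaining = [], total_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append(split_size)
        remaining -= split_size
    if remaining > 0:
        splits.append(remaining)
    return splits

# The single 1368654-byte input from the thread, with mapred.map.tasks = 2:
print(compute_splits(1368654, 2, block_size=64 * 1024 * 1024))
# → [684327, 684327]  (two splits, hence two map tasks)
```

Because the file is far smaller than a block, the goal size (total / 2) wins the clamp and the single input path still yields two splits, matching the two map tasks observed in the thread.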

posted again: how are the splits for map tasks computed?

2010-03-24 Thread abhishek sharma
I realized that I made a mistake in my earlier post, so here is the correct one. I have a job ("loadgen") with only 1 input, (say) part-0, of size 1368654 bytes. When I submit this job, I get the following output: INFO mapred.FileInputFormat: Total input paths to process : 1 However, in th…

how are the splits for map tasks computed?

2010-03-24 Thread abhishek sharma
Hi all, I have a job ("loadgen") with only 1 input, (say) part-0, of size 1368654 bytes. When I submit this job, I get the following output: INFO mapred.FileInputFormat: Total input paths to process : 1 However, in the JobTracker log, I see the following entry: Split info for job:job_201…

Re: [DISCUSSION] Release process

2010-03-24 Thread Jeff Hammerbacher
Hey Tom, That sounds like a great idea. +1. Thanks, Jeff On Wed, Mar 24, 2010 at 4:25 PM, Tom White wrote: > I agree that getting the release process restarted is of utmost > importance to the project. To help make that happen I'm happy to > volunteer to be a release manager for the next relea…

Re: [DISCUSSION] Release process

2010-03-24 Thread Tom White
I agree that getting the release process restarted is of utmost importance to the project. To help make that happen I'm happy to volunteer to be a release manager for the next release. This will be the first release post-split, so there will undoubtedly be some issues to work out. I think the focus…

[jira] Created: (HADOOP-6659) Switch RPC to use Avro

2010-03-24 Thread Doug Cutting (JIRA)
Switch RPC to use Avro Key: HADOOP-6659 URL: https://issues.apache.org/jira/browse/HADOOP-6659 Project: Hadoop Common Issue Type: Improvement Components: ipc Reporter: Doug Cutting This is an um…

[jira] Resolved: (HADOOP-6646) Move HarfileSystem out of Hadoop Common.

2010-03-24 Thread Mahadev konar (JIRA)
[ https://issues.apache.org/jira/browse/HADOOP-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar resolved HADOOP-6646. Resolution: Fixed Hadoop Flags: [Reviewed] I just committed this. I only moved the t…

De-Duplication Technique

2010-03-24 Thread Joseph Stein
I have been researching ways to handle de-duplicating data while running a map/reduce program (so as not to re-calculate/re-aggregate data that we have seen before, possibly months before). The data sets we have are littered with repeats of data from mobile devices which continue to come in over time…
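The message is truncated, so the author's actual approach is not shown; one common map/reduce de-duplication pattern, sketched here in Python, is to key each record on a fingerprint of its contents in the map phase and emit a single record per key in the reduce phase. The example records are made up, and note that duplicates arriving months apart (as in the thread) would additionally require joining against previously seen keys, which this sketch does not cover:

```python
import hashlib

def map_phase(records):
    """Map: emit (fingerprint, record) so duplicates share a key."""
    for record in records:
        key = hashlib.sha1(record.encode("utf-8")).hexdigest()
        yield key, record

def reduce_phase(pairs):
    """Shuffle + reduce: group by key and emit one record per key."""
    seen = {}
    for key, record in pairs:
        seen.setdefault(key, record)  # keep only the first copy of each key
    return list(seen.values())

records = ["evt:42,dev:a", "evt:42,dev:a", "evt:99,dev:b"]
print(reduce_phase(map_phase(records)))
# → ['evt:42,dev:a', 'evt:99,dev:b']
```

Since the shuffle delivers all records with the same key to one reducer, the reducer can drop repeats without any global coordination, which is what makes this pattern map/reduce-friendly.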

Re: [DISCUSSION] Release process

2010-03-24 Thread Brian Bockelman
Hey Allen, Your post provoked a few thoughts: 1) Hadoop is a large but relatively immature project (as in, there are still a lot of major features coming down the pipe). If we wait to release on features, especially when there are critical bugs, we end up with a large number of patches between…

Re: [DISCUSSION] Release process

2010-03-24 Thread Allen Wittenauer
On 3/15/10 9:06 AM, "Owen O'Malley" wrote: > From our 21 experience, it looks like our old release strategy is > failing. Maybe this is a dumb question but... Are we sure it isn't the community failing? From where I stand, the major committers (PMC?) have essentially forked Hadoop into…

Re: adding new filesystems

2010-03-24 Thread Tom White
Steve, For testing, have a look at FileSystemContractBaseTest and the FileContext*BaseTest classes. You can subclass these to get a lot of basic tests for free. (These don't cover the kind of stress tests you mentioned, though.) Cheers, Tom On Wed, Mar 24, 2010 at 3:28 AM, Steve Loughran wrote: >…
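The subclass-the-contract-test pattern Tom describes can be illustrated in miniature. FileSystemContractBaseTest itself is a JUnit base class in Hadoop's Java test tree; this Python analogue only shows the shape of the idea, and the InMemoryFS here is purely hypothetical:

```python
import unittest

class FSContractBase:
    """Shared contract tests; a subclass supplies the concrete filesystem
    (the same idea as Hadoop's FileSystemContractBaseTest)."""
    def make_fs(self):
        raise NotImplementedError

    def test_write_then_read(self):
        fs = self.make_fs()
        fs.write("/a", b"data")
        self.assertEqual(fs.read("/a"), b"data")

    def test_missing_path_raises(self):
        fs = self.make_fs()
        with self.assertRaises(KeyError):
            fs.read("/nope")

class InMemoryFS:
    """Hypothetical toy filesystem used to exercise the contract."""
    def __init__(self):
        self.files = {}
    def write(self, path, data):
        self.files[path] = data
    def read(self, path):
        return self.files[path]

# Mixing the contract into a TestCase gives all its tests "for free";
# a second filesystem would just need its own make_fs() override.
class InMemoryFSTest(FSContractBase, unittest.TestCase):
    def make_fs(self):
        return InMemoryFS()
```

Each new back-end filesystem then gets the whole behavioral suite by writing only the fixture method, which is exactly the economy Tom is pointing at.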

Re: adding new filesystems

2010-03-24 Thread Eli Collins
On Wed, Mar 24, 2010 at 3:28 AM, Steve Loughran wrote: > > I'm looking at what it currently takes to implement new back-end > filestores, getting lost in the details. > > This is my current understanding - am I wrong? > > 1. There is an AbstractFileSystem, came in with HADOOP-6223, and is in > SVN…

adding new filesystems

2010-03-24 Thread Steve Loughran
I'm looking at what it currently takes to implement new back-end filestores and getting lost in the details. This is my current understanding - am I wrong? 1. There is an AbstractFileSystem, which came in with HADOOP-6223 and is in SVN_HEAD only https://issues.apache.org/jira/browse/HADOOP-6223 2.…