That makes sense. The problem is that I jumped directly to using pig, which abstracts some of the data flow from me. I guess I'll have to figure out what it's doing under the covers to know how to optimize/fix bottlenecks.
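As a starting point for that, Pig's EXPLAIN operator dumps the logical, physical, and MapReduce plans it compiles a script into, which should show how many M/R jobs (and intermediate writes) a given script really produces. A rough sketch of what I mean, with placeholder keyspace/column family names, and assuming the cassandra pig loader jars are already registered/on the classpath:

    -- placeholder keyspace/column family; assumes the cassandra pig jars are registered
    rows    = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
              USING org.apache.cassandra.hadoop.pig.CassandraStorage();
    grouped = GROUP rows BY $0;                 -- group on the row key
    counts  = FOREACH grouped GENERATE group, COUNT(rows);
    EXPLAIN counts;                             -- prints the logical, physical, and MapReduce plans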
But for now, I'm taking this information to mean "I should run datanodes with HDFS on larger non-root disks on all tasktracker nodes to ensure my pig scripts work, until I'm willing to either write the M/R code myself, or figure out how to optimize pig and/or the pig script".

will

On Wed, Jul 6, 2011 at 3:29 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> On Wed, Jul 6, 2011 at 2:48 PM, William Oberman <ober...@civicscience.com> wrote:
>
>> I have a few cassandra/hadoop/pig questions. I currently have things set up in a test environment, and for the most part everything works. But, before I start to roll things out to production, I wanted to check on/confirm some things.
>>
>> When I originally set things up, I used:
>> http://wiki.apache.org/cassandra/HadoopSupport
>> http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html
>>
>> One difference I noticed between the two guides, which I ignored at the time, was how "datanodes" are treated. The wiki said: "At least one node in your cluster will also need to be a datanode. That's because Hadoop uses HDFS to store information like jar dependencies for your job, static data (like stop words for a word count), and things like that - it's the distributed cache. It's a very small amount of data but the Hadoop cluster needs it to run properly." But the hadoop guide (if you follow it blindly like I did) creates a datanode on all TaskTracker nodes. I _think_ that is controlled by the conf/slaves file, but I haven't proved that yet. Is there any good reason to run datanodes on only the JobTracker vs. on all nodes? If I should only run it on the JobTracker, how do I properly stop the datanodes from starting automatically (when both start-dfs and start-mapred seem to draw from the same slaves file)?
>>
>> I noticed a second issue/oddness with datanodes, in that the HDFS data isn't always small. The other day I ran out of disk running my pig script. I checked, and by default, hadoop creates HDFS in /tmp, and I'm using EC2 (and /tmp is on the boot device), which is only 10G by default. Do other people put HDFS on a different disk? If yes, I'll really want to run only one datanode, as I don't want to re-template all of my cassandra nodes to have HDFS disks vs. one new JobTracker node.
>>
>> In terms of hardware, I am running small instances (32bit, 2GB) in the test cluster, while my production cluster is larges (64bit, 7 or 8GB). I was going to check the performance impact there, but even on smalls in test I was able to run hadoop jobs while serving web requests. I am wondering if smalls are causing the high HDFS usage though (I think data might "spill" more, if I'm understanding things correctly).
>>
>> If these are more hadoop than cassandra questions, let me know and I'll move my questions around.
>>
>> I did want to mention that these are small details compared to the amount of complicated things that worked like a charm during my configuration and testing of the combination of cassandra/hadoop/pig. It was impressive :-)
>>
>> Thanks!
>>
>> will
>
> The logic that "only one datanode is needed" is not an absolute truth. If your jobs use ColumnFamilyInputFormat to read and ColumnFamilyOutputFormat to write, then technically you only need one DataNode to hold the distributed cache.
> However, if you have a large amount of intermediate results, or even a multiphase job that has to persist data between phases (this is very, very common), then that single DataNode is a bottleneck. Most hadoop clusters run a DataNode and TaskTracker on each slave.
>
> Most situations would use datanodes very heavily. For example, suppose you have 4 map/reduce jobs to run on the same Cassandra data. Ingesting the data from Cassandra at the beginning of each job would be wasteful. It might be better to pull the data into HDFS during the first job and save it; your subsequent jobs could use that instead of re-acquiring it from Cassandra.
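If I go that "land it in HDFS during the first job" route, I'm assuming the Pig version looks roughly like the sketch below (keyspace, column family, and HDFS path are placeholders; BinStorage is just Pig's built-in format for intermediate data):

    -- first script: read from Cassandra once and keep a copy in HDFS
    rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
           USING org.apache.cassandra.hadoop.pig.CassandraStorage();
    STORE rows INTO '/pig_work/mycf_copy' USING BinStorage();

    -- later scripts: load the HDFS copy instead of re-reading Cassandra
    rows = LOAD '/pig_work/mycf_copy' USING BinStorage();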