Re: Announcing CloudBase-1.3.1 release

2009-06-22 Thread Edward Capriolo
It would be interesting to see a CloudBase vs Hive benchmark/comparison. Has anyone ever run the two side by side? 2009/6/21 imcaptor : > When you use cloudbase, you can create different table for different > daily files. > > For example, your directory will like this. > > logs > /200905 >

Re: Multiple NIC Cards

2009-06-09 Thread Edward Capriolo
Also if you are using a topology rack map, make sure your script responds correctly to every possible hostname or IP address as well. On Tue, Jun 9, 2009 at 1:19 PM, John Martyniak wrote: > It seems that this is the issue, as there several posts related to same > topic but with no resolution. > >
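If anyone wants to skip the script entirely, recent versions also let you plug in a Java class via topology.node.switch.mapping.impl. The sketch below is only an illustration (the rackForHost() table is made up); the point is that resolve() gets handed whatever the framework has -- short names, FQDNs, or raw IPs -- and must return a rack for every one of them:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    // Illustrative only: map every node identifier the framework may pass in
    // (hostname, FQDN, or IP address) to a rack path, never returning null.
    public class StaticRackMapping implements DNSToSwitchMapping {
      public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>(names.size());
        for (String name : names) {
          racks.add(rackForHost(name));
        }
        return racks;
      }

      private String rackForHost(String name) {
        // hypothetical lookup; a real mapping would read a table or config file
        if (name.startsWith("10.0.1.") || name.startsWith("dn1")) {
          return "/dc1/rack1";
        }
        return "/default-rack";
      }
    }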

Re: Multiple NIC Cards

2009-06-09 Thread Edward Capriolo
On Tue, Jun 9, 2009 at 11:59 AM, Steve Loughran wrote: > John Martyniak wrote: >> >> When I run either of those on either of the two machines, it is trying to >> resolve against the DNS servers configured for the external addresses for >> the box. >> >> Here is the result >> Server:        xxx.xxx.

Re: Monitoring hadoop?

2009-06-05 Thread Edward Capriolo
On Fri, Jun 5, 2009 at 10:10 AM, Brian Bockelman wrote: > Hey Anthony, > > Look into hooking your Hadoop system into Ganglia; this produces about 20 > real-time statistics per node. > > Hadoop also does JMX, which hooks into more "enterprise"-y monitoring > systems. > > Brian > > On Jun 5, 2009, at

Re: Blocks amount is "stuck" in statistics

2009-05-25 Thread Edward Capriolo
On Mon, May 25, 2009 at 6:34 AM, Stas Oskin wrote: > Hi. > > Ok, was too eager to report :). > > It got sorted out after some time. > > Regards. > > 2009/5/25 Stas Oskin > >> Hi. >> >> I just did an erase of large test folder with about 20,000 blocks, and >> created a new one. I copied about 128

Re: ssh issues

2009-05-22 Thread Edward Capriolo
Pankil, I used to be very confused by hadoop and SSH keys. SSH is NOT required. Each component can be started by hand. This gem of knowledge is hidden away in the hundreds of DIGG style articles entitled 'HOW TO RUN A HADOOP MULTI-MASTER CLUSTER!' The SSH keys are only required by the shell s
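For the record, the hand-start commands are just the per-daemon wrappers that start-all.sh calls for you over SSH (paths from a 0.18/0.19-era install; adjust to yours):

    bin/hadoop-daemon.sh start namenode           # on the master
    bin/hadoop-daemon.sh start secondarynamenode  # on whichever host runs it
    bin/hadoop-daemon.sh start jobtracker         # on the master
    bin/hadoop-daemon.sh start datanode           # on each slave
    bin/hadoop-daemon.sh start tasktracker        # on each slave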

Re: Optimal Filesystem (and Settings) for HDFS

2009-05-18 Thread Edward Capriolo
Do not forget 'tune2fs -m 2'. By default this value gets set at 5%. With 1 TB disks we got 33 GB more usable space. Talk about instant savings! On Mon, May 18, 2009 at 1:31 PM, Alex Loddengaard wrote: > I believe Yahoo! uses ext3, though I know other people have said that XFS > has performed bett
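The arithmetic, roughly (marketing terabytes, so the exact figure depends on the formatted size):

    reserved at 5%: 0.05 x 1 TB ~= 50 GB
    reserved at 2%: 0.02 x 1 TB ~= 20 GB
    reclaimed:      ~30 GB per 1 TB disk, in line with the ~33 GB above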

Re: Linking against Hive in Hadoop development tree

2009-05-15 Thread Edward Capriolo
On Fri, May 15, 2009 at 5:05 PM, Aaron Kimball wrote: > Hi all, > > For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition > to uploading data into HDFS and using MapReduce to load/transform the data, > I'd like to integrate more closely with Hive. Specifically, to run the > CR

Hadoop JMX and Cacti

2009-05-14 Thread Edward Capriolo
Hey all, I have come pretty far along with using Cacti to graph Hadoop JMX variables. http://www.jointhegrid.com/hadoop/. Currently I have about 8 different hadoop graph types available for the NameNode and the DataNode. The NameNode has many fairly complete and detailed counters. I h

Re: sub 60 second performance

2009-05-11 Thread Edward Capriolo
On Mon, May 11, 2009 at 12:08 PM, Todd Lipcon wrote: > In addition to Jason's suggestion, you could also see about setting some of > Hadoop's directories to subdirs of /dev/shm. If the dataset is really small, > it should be easy to re-load it onto the cluster if it's lost, so even > putting dfs.d

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Edward Capriolo
2009/5/7 Jeff Hammerbacher : > Hey, > > You can read more about why small files are difficult for HDFS at > http://www.cloudera.com/blog/2009/02/02/the-small-files-problem. > > Regards, > Jeff > > 2009/5/7 Piotr Praczyk > >> If You want to use many small files, they are probably having the same >>

Cacti Templates for Hadoop

2009-05-06 Thread Edward Capriolo
For those of you that would like to graph the hadoop JMX variables with cacti I have created cacti templates and data input scripts. Currently the package gathers and graphs the following information from the NameNode: Blocks Total Files Total Capacity Used/Capacity Free Live Data Nodes/Dead Data

Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

2009-05-05 Thread Edward Capriolo
'cloud computing' is a hot term. According to the definition provided by wikipedia http://en.wikipedia.org/wiki/Cloud_computing, Hadoop+HBase+Lucene+Zookeeper, fits some of the criteria but not well. Hadoop is scalable, with HOD it is dynamically scalable. I do not think (Hadoop+HBase+Lucene+Zook

Re: What User Accounts Do People Use For Team Dev?

2009-05-05 Thread Edward Capriolo
On Tue, May 5, 2009 at 10:44 AM, Dan Milstein wrote: > Best-practices-type question: when a single cluster is being used by a team > of folks to run jobs, how do people on this list handle user accounts? > > Many of the examples seem to show everything being run as root on the > master, which is h

Re: Getting free and used space

2009-05-02 Thread Edward Capriolo
You can also pull these variables from the name node and datanode with JMX. I am doing this to graph them with cacti. Both the JMX READ/WRITE and READ users can access these variables. On Tue, Apr 28, 2009 at 8:29 AM, Stas Oskin wrote: > Hi. > > Any idea if the getDiskStatus() function requires superus

Re: Hadoop / MySQL

2009-04-29 Thread Edward Capriolo
On Wed, Apr 29, 2009 at 2:48 PM, Todd Lipcon wrote: > On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski wrote: > >> If you have trouble loading your data into mysql using INSERTs or LOAD >> DATA, consider that MySQL supports CSV directly using the CSV storage >> engine. The only thing you have t

Re: Hadoop / MySQL

2009-04-29 Thread Edward Capriolo
On Wed, Apr 29, 2009 at 10:19 AM, Stefan Podkowinski wrote: > If you have trouble loading your data into mysql using INSERTs or LOAD > DATA, consider that MySQL supports CSV directly using the CSV storage > engine. The only thing you have to do is to copy your hadoop produced > csv file into the m

Re: max value for a dataset

2009-04-21 Thread Edward Capriolo
>>>> reduce would be the simplest. >>>> >>>> On your question a Mapper and Reducer defines 3 entry points, configure, >>>> called once on task start, the map/reduce called once for each >>>> record, >>>> and close, cal

Re: max value for a dataset

2009-04-20 Thread Edward Capriolo
once on task start, the map/reduce called once for each record, >> > and close, called once after the last call to map/reduce. >> > at least through 0.19, the close is not provided with the output >> collector >> > or the reporter, so you need to save them in the

max value for a dataset

2009-04-18 Thread Edward Capriolo
I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase support the ability to max(). I am writing my own max() over a simple one-column dataset. The best solution I came up with was using MapRunner. With MapRunner I can store the highest value in a private member variable. I can read
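For anyone searching the archives later, the pattern discussed in this thread (save the OutputCollector in map(), emit the running max in close()) looks roughly like this with the old mapred API -- a sketch, not the exact code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, LongWritable> {

      private long max = Long.MIN_VALUE;
      private OutputCollector<NullWritable, LongWritable> out;

      public void map(LongWritable key, Text value,
          OutputCollector<NullWritable, LongWritable> output, Reporter reporter)
          throws IOException {
        out = output;                 // close() gets no collector, so keep a reference
        long v = Long.parseLong(value.toString().trim());  // assumes one number per line
        if (v > max) {
          max = v;
        }
      }

      public void close() throws IOException {
        if (out != null) {            // emit one value per map task
          out.collect(NullWritable.get(), new LongWritable(max));
        }
      }
    }

A single reducer that does the same comparison over the per-task maxima then produces the final answer.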

Re: Using HDFS to serve www requests

2009-03-27 Thread Edward Capriolo
>>but does Sun's Lustre follow in the steps of Gluster then Yes. IMHO GlusterFS advertises benchmarks vs Lustre. The main difference is that GlusterFS is a FUSE (userspace) filesystem while Lustre has to be patched into the kernel, or a module.

Re: Using HDFS to serve www requests

2009-03-26 Thread Edward Capriolo
It is a little more natural to connect to HDFS from apache tomcat. This will allow you to skip the FUSE mounts and just use the HDFS-API. I have modified this code to run inside tomcat. http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample I will not testify to how well this setup will perform
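The read path of that example is only a few lines; here is a trimmed sketch (the fs.default.name value and paths are my assumptions, not taken from the wiki page) of what code inside Tomcat would call:

    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsFetch {
      // Copy one HDFS file to any OutputStream (e.g. the servlet response).
      public static void fetch(String file, OutputStream dest) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");   // assumed namenode URI
        FileSystem fs = FileSystem.get(conf);
        InputStream in = fs.open(new Path(file));
        try {
          IOUtils.copyBytes(in, dest, 4096, false);            // do not close the response stream
        } finally {
          in.close();
        }
      }
    }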

Re: virtualization with hadoop

2009-03-26 Thread Edward Capriolo
I use linux-vserver http://linux-vserver.org/ The Linux-VServer technology is a soft partitioning concept based on Security Contexts which permits the creation of many independent Virtual Private Servers (VPS) that run simultaneously on a single physical server at full speed, efficiently sharing h

Re: Using Hadoop for near real-time processing of log data

2009-02-25 Thread Edward Capriolo
>>Yeah, but what's the point of using Hadoop then? i.e. we lost all the >>parallelism? Some jobs do not need it. For example, I am working with the Hive sub project. If I have a table that is less than my block size, having a large number of mappers or reducers is counterproductive. Hadoop will s

Re: Using Hadoop for near real-time processing of log data

2009-02-25 Thread Edward Capriolo
On Wed, Feb 25, 2009 at 1:13 PM, Mikhail Yakshin wrote: > Hi, > >> Is anyone using Hadoop as more of a near/almost real-time processing >> of log data for their systems to aggregate stats, etc? > > We do, although "near realtime" is pretty relative subject and your > mileage may vary. For example,

Re: Batching key/value pairs to map

2009-02-23 Thread Edward Capriolo
We have a MR program that collects once for each token on a line. What types of applications can benefit from batch mapping?

Hadoop JMX

2009-02-20 Thread Edward Capriolo
I am working to graph the hadoop JMX variables. http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/dfs/namenode/metrics/NameNodeStatistics.html I have two nodes, one running 0.17 and the other running 0.19. The NameNode JMX objects and attributes seem to be working well. I am graphi
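For anyone who would rather pull these values programmatically than through jconsole, the plain javax.management client API works. This is only a sketch: the JMX remote port has to be enabled in hadoop-env.sh, and the host, port, ObjectName, and attribute name below are assumptions -- they differ between 0.17 and 0.19, so copy the real ones out of jconsole:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class NameNodeJmxProbe {
      public static void main(String[] args) throws Exception {
        // Host, port, ObjectName, and attribute are placeholders; verify with jconsole.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://namenode:8004/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
          MBeanServerConnection mbs = connector.getMBeanServerConnection();
          ObjectName name = new ObjectName(
              "hadoop.dfs:service=NameNode,name=NameNodeStatistics");
          System.out.println("FilesTotal = " + mbs.getAttribute(name, "FilesTotal"));
        } finally {
          connector.close();
        }
      }
    }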

Re: Pluggable JDBC schemas [Was: How to use DBInputFormat?]

2009-02-13 Thread Edward Capriolo
One thing to mention is that 'LIMIT' is not standard SQL. Microsoft SQL Server uses SELECT TOP 100 * FROM table. Some RDBMS may not support any such syntax. To be more portable you should key off a column such as an auto-increment ID or a DATE for the offset. It is tricky to write anything truly database ag
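A portable way around it is to page on an indexed key instead of LIMIT/TOP. A quick JDBC sketch (driver, URL, table, and column names are all made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class KeysetPager {
      public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://dbhost/logs", "user", "pass");   // assumed connection details
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id, line FROM weblog WHERE id > ? ORDER BY id");
        ps.setMaxRows(1000);                 // portable page size, unlike LIMIT
        long lastId = 0;
        while (true) {
          ps.setLong(1, lastId);
          ResultSet rs = ps.executeQuery();
          int rows = 0;
          while (rs.next()) {
            lastId = rs.getLong("id");       // remember where this page ended
            rows++;                          // process rs.getString("line") here
          }
          rs.close();
          if (rows == 0) break;              // no more pages
        }
        ps.close();
        conn.close();
      }
    }

setMaxRows() caps the page size on the client side, and the WHERE id > ? predicate keeps each page cheap on any database that can use the index.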

Using HOD to manage a production cluster

2009-01-31 Thread Edward Capriolo
I am looking at using HOD (Hadoop On Demand) to manage a production cluster. After reading the documentation, it seems that HOD is missing some things that would need to be carefully set in a production cluster. Rack Locality: HOD uses the -N 5 option and starts a cluster of N nodes. There seems to

Re: Question about HDFS capacity and remaining

2009-01-30 Thread Edward Capriolo
Very interesting note for a new cluster checklist. Good to tune the file system down from 5%. On a related note some operating systems ::cough:: FreeBSD will report negative disk space when you go over the quota. What does that mean? We run nagios with NRPE to run remote disk checks. We configure

Re: Zeroconf for hadoop

2009-01-26 Thread Edward Capriolo
Zeroconf is more focused on simplicity than security. One of the original problems, which may have been fixed, is that any program can announce any service, e.g. my laptop can announce that it is the DNS for google.com etc. I want to mention a related topic to the list. People are approaching the auto-

Re: Netbeans/Eclipse plugin

2009-01-25 Thread Edward Capriolo
On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar wrote: > Any one knows Netbeans or Eclipse plugin for Hadoop Map -Reduce job. I want > to make plugin for netbeans > > http://vinayakkatkar.wordpress.com > -- > Vinayak Katkar > Sun Campus Ambassador > Sun Microsytems,India > COEP > There is an ecp

Re: Why does Hadoop need ssh access to master and slaves?

2009-01-23 Thread Edward Capriolo
I am looking to create some RA scripts and experiment with starting hadoop via linux-ha cluster manager. Linux HA would handle restarting downed nodes and eliminate the ssh key dependency.

Re: When I system.out.println() in a map or reduce, where does it go?

2008-12-10 Thread Edward Capriolo
Also be careful when you do this. If you are running map/reduce on a large file the map and reduce operations will be called many times. You can end up with a lot of output. Use log4j instead.
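A minimal sketch of what that looks like in a mapper -- Hadoop already ships commons-logging on top of log4j, so these messages land in the per-task logs instead of flooding stdout:

    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LoggingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

      public void map(LongWritable key, Text value,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        if (LOG.isDebugEnabled()) {          // guard chatty per-record messages
          LOG.debug("processing offset " + key);
        }
        output.collect(new Text(value.toString()), new LongWritable(1));
      }
    }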

Re: File loss at Nebraska

2008-12-09 Thread Edward Capriolo
Also it might be useful to strongly word hadoop-default.conf as many people might not know a downside exists for using 2 rather than 3 as the replication factor. Before reading this thread I would have thought 2 to be sufficient.

JDBC input/output format

2008-12-08 Thread Edward Capriolo
Is anyone working on a JDBC RecordReader/InputFormat? I was thinking this would be very useful for sending data into mappers. Writing data to a relational database might be more application-dependent but is still possible.
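For reference, 0.19 ships org.apache.hadoop.mapred.lib.db.DBInputFormat, which covers the input side. A rough sketch of wiring it up (driver, URL, table, and column names are made up):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    // One row of a hypothetical weblog table, readable by DBInputFormat.
    public class WeblogRecord implements Writable, DBWritable {
      long id;
      String line;

      public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");
        line = rs.getString("line");
      }
      public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, line);
      }
      public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        line = in.readUTF();
      }
      public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(line);
      }

      // Job setup: each mapper then receives WeblogRecord values straight from the table.
      public static void configure(JobConf job) {
        job.setInputFormat(DBInputFormat.class);
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/logs", "user", "pass");
        DBInputFormat.setInput(job, WeblogRecord.class,
            "weblog", null /* conditions */, "id" /* orderBy */, "id", "line");
      }
    }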

Hadoop IP Tables configuration

2008-12-03 Thread Edward Capriolo
All, I always run iptables on my systems. Most of the hadoop setup guides I have found skip iptables/firewall configuration. My namenode and task tracker are the same node. My current configuration is not working: as I submit jobs from the namenode, jobs are kicked off on the slave nodes but they fa

Re: What do you do with task logs?

2008-11-18 Thread Edward Capriolo
We just set up a log4j server. This takes the logs off the cluster. Plus you get all the benefits of log4j http://timarcher.com/?q=node/10
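For the curious, the cluster side is only a few lines in conf/log4j.properties -- the host, port, and appender name here are placeholders, and something like log4j's SimpleSocketServer has to be listening on the receiving host:

    # send a copy of everything at INFO and above to the central log host
    log4j.rootLogger=INFO,console,SOCKET

    log4j.appender.SOCKET=org.apache.log4j.net.SocketAppender
    log4j.appender.SOCKET.RemoteHost=loghost.example.com
    log4j.appender.SOCKET.Port=4560
    log4j.appender.SOCKET.ReconnectionDelay=10000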

Re: To Compute or Not to Compute on Prod

2008-10-31 Thread Edward Capriolo
Shahab, This can be done. If your client speaks Java you can connect to hadoop and write as a stream. If your client does not have Java, the Thrift API will generate stubs in a variety of languages. Thrift API: http://wiki.apache.org/hadoop/HDFS-APIs Shameless plug -- If you just want to stream da
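The Java-client case really is just a stream. A minimal sketch (the namenode URI and target path are assumptions):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamIntoHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");    // assumed namenode URI
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/logs/incoming.log"));
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
          out.writeBytes(line + "\n");        // whatever the client produces
        }
        out.close();
        fs.close();
      }
    }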

Re: nagios to monitor hadoop datanodes!

2008-10-29 Thread Edward Capriolo
All I have to say is wow! I never tried jconsole before. I have hadoop_trunk checked out and the JMX has all kinds of great information. I am going to look at how I can get JMX, Cacti, and Hadoop working together. Just as an FYI there are separate ENV variables for each now. If you override hadoop_o

Re: How does an offline Datanode come back up ?

2008-10-29 Thread Edward Capriolo
Someone on the list is looking at monitoring hadoop features with nagios. Nagios can be configured with an event_handler. In the past I have written event handlers to do operations like this. If down --- use SSH key and restart. However, since you have an SSH key on your master node, you should

Re: LHadoop Server simple Hadoop input and output

2008-10-24 Thread Edward Capriolo
I came up with my line of thinking after reading this article: http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data As a guy who was intrigued by the Java coffee cup in '95 and now lives as a data center/NOC jock/Unix guy: let's say I look at a log manageme

Re: LHadoop Server simple Hadoop input and output

2008-10-23 Thread Edward Capriolo
I had downloaded thrift and ran the example applications after the Hive meet up. It is very cool stuff. The thriftfs interface is more elegant than what I was trying to do, and that implementation is more complete. Still, someone might be interested in what I did if they want a super-light API :)

LHadoop Server simple Hadoop input and output

2008-10-23 Thread Edward Capriolo
One of my first questions about hadoop was, "How do systems outside the cluster interact with the file system?" I read several documents that described streaming data into hadoop for processing, but I had trouble finding examples. The goal of LHadoop Server (L stands for Lightweight) is to produce

Hive Web-UI

2008-10-10 Thread Edward Capriolo
I was checking out this slide show. http://www.slideshare.net/jhammerb/2008-ur-tech-talk-zshao-presentation/ in the diagram a Web-UI exists. This was the first I have heard of this. Is this part of or planned to be a part of contrib/hive? I think a web interface for showing table schema and executi

Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Edward Capriolo
That all sounds good. By 'quick hack' I meant 'check_tcp' was not good enough, because an open TCP socket does not prove much. However, if the page returns useful attributes that show the cluster is alive, that is great and easy. Come to think of it, you can navigate the dfshealth page and get useful in

Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Edward Capriolo
The simple way would be to use NRPE and check_procs. I have never tested it, but a command like 'ps -ef | grep java | grep NameNode' would be a fairly decent check. That is not very robust but it should let you know if the process is alive. You could also monitor the web interfaces associated with t

Re: Hadoop and security.

2008-10-06 Thread Edward Capriolo
You bring up some valid points. This would be a great topic for a white paper. The first line of defense should be to apply inbound and outbound iptables rules. Only source IPs that have a direct need to interact with the cluster should be allowed to. The same is true with the web access. Only a

Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
I determined the problem once I set the log4j properties to debug. derbyclient.jar and derbytools.jar do not ship with Hive. As a result, when you try to load org.apache.derby.jdbc.ClientDriver you get an InvocationTargetException. The solution was to download Derby and place those files in
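For the archives, the working hive-site.xml entries end up looking roughly like this (my test hostname; double-check the property names against your Hive version):

    <property>
      <name>hive.metastore.local</name>
      <value>true</value>  <!-- metastore runs in-process; only the backing DB is remote -->
    </property>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby://nyhadoop1:1527/metastore_db;create=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.ClientDriver</value>
    </property>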

Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
>> hive.metastore.local >> true Why would I set this property to true? My goal is to store the metadata in an external database. If I set this to true, the metastore database is created in the working directory.

Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
I am doing a lot of testing with Hive, and I will be sure to add this information to the wiki once I get it going. Thus far I downloaded the same version of Derby that Hive uses. I have verified that the connection is up and running. ij version 10.4 ij> connect 'jdbc:derby://nyhadoop1:1527/metastore

Hive questions about the meta db

2008-10-01 Thread Edward Capriolo
I have been working with Hive for the past week. The ability to wrap an SQL-like tool over HDFS is very powerful. Now that I am comfortable with the concept, I am looking at an implementation of it. Currently I have a three-node cluster for testing: hadoop1, hadoop2, and hadoop3. I have hive instal

Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-10 Thread Edward Capriolo
wait and sleep are not what you are looking for. You can use 'nohup' to run a job in the background and have its output piped to a file. On Tue, Jun 10, 2008 at 5:48 PM, Meng Mao <[EMAIL PROTECTED]> wrote: > I'm interested in the same thing -- is there a recommended way to batch > Hadoop jobs toge

Re: text extraction from html based on uniqueness metric

2008-06-10 Thread Edward Capriolo
I have never tried this method. The concept came from a research paper I ran into. The goal was to detect the language of a piece of text by looking at several factors: average length of word, average length of sentence, average number of vowels in a word, etc. He used these to score an article, and

Re: Hadoop Distributed Virtualisation

2008-06-06 Thread Edward Capriolo
There is a place for virtualization: if you can justify the overhead for hands-free network management, if your systems really have enough power to run hadoop and something else, or if you need to run multiple truly isolated versions of hadoop (selling some type of hadoop grid services?). If you

Re: Hadoop Distributed Virtualisation

2008-06-06 Thread Edward Capriolo
I once asked a wise man in charge of a rather large multi-datacenter service, "Have you ever considered virtualization?" He replied, "All the CPUs here are pegged at 100%." There may be applications for this type of processing. I have thought about systems like this from time to time. This thinkin

Re: Realtime Map Reduce = Supercomputing for the Masses?

2008-06-01 Thread Edward Capriolo
I think that feature makes sense because starting a JVM has overhead. On Sun, Jun 1, 2008 at 4:26 AM, Christophe Taton <[EMAIL PROTECTED]> wrote: > Actually Hadoop could be made more friendly to such realtime Map/Reduce > jobs. > For instance, we could consider running all tasks inside the task trac

Re: Making the case for Hadoop

2008-05-16 Thread Edward Capriolo
Conservative IT executive... Sounds like you're working at my last job. :) Yahoo! uses Hadoop for a very large cluster. http://developer.yahoo.com/blogs/hadoop/ And after all, Hadoop is a work-alike of the Google File System; Google uses that for all types of satellite data. The New York Times is usi