It would be interesting to see a cloudbase vs hive
benchmark/comparison. Has anyone ever run the two side by side?
2009/6/21 imcaptor :
> When you use cloudbase, you can create a different table for each
> daily file.
>
> For example, your directory will like this.
>
> logs
> /200905
>
Also, if you are using a topology rack map, make sure your script
responds correctly to every possible hostname or IP address as well.
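If it helps, here is roughly the same idea in Java for anyone who would
rather plug in a DNSToSwitchMapping class than maintain an external
script. The hostnames, IPs, and rack labels below are made up; the point
is that resolve() should return a rack for every form a node can show up
as (short name, FQDN, IP) and never return null.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.net.DNSToSwitchMapping;

public class StaticRackMapping implements DNSToSwitchMapping {
  // Register every alias a node might show up as (example values).
  private static final Map<String, String> RACKS = new HashMap<String, String>();
  static {
    RACKS.put("node1", "/rack1");
    RACKS.put("node1.example.com", "/rack1");
    RACKS.put("10.0.0.11", "/rack1");
  }

  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      String rack = RACKS.get(name);
      // Fall back to a default rack instead of returning null.
      racks.add(rack != null ? rack : "/default-rack");
    }
    return racks;
  }
}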
On Tue, Jun 9, 2009 at 1:19 PM, John Martyniak wrote:
> It seems that this is the issue, as there are several posts related to the
> same topic but with no resolution.
>
>
On Tue, Jun 9, 2009 at 11:59 AM, Steve Loughran wrote:
> John Martyniak wrote:
>>
>> When I run either of those on either of the two machines, it is trying to
>> resolve against the DNS servers configured for the external addresses for
>> the box.
>>
>> Here is the result
>> Server: xxx.xxx.
On Fri, Jun 5, 2009 at 10:10 AM, Brian Bockelman wrote:
> Hey Anthony,
>
> Look into hooking your Hadoop system into Ganglia; this produces about 20
> real-time statistics per node.
>
> Hadoop also does JMX, which hooks into more "enterprise"-y monitoring
> systems.
>
> Brian
>
> On Jun 5, 2009, at
On Mon, May 25, 2009 at 6:34 AM, Stas Oskin wrote:
> Hi.
>
> Ok, was too eager to report :).
>
> It got sorted out after some time.
>
> Regards.
>
> 2009/5/25 Stas Oskin
>
>> Hi.
>>
>> I just did an erase of large test folder with about 20,000 blocks, and
>> created a new one. I copied about 128
Pankil,
I used to be very confused by hadoop and SSH keys. SSH is NOT
required. Each component can be started by hand. This gem of knowledge
is hidden away in the hundreds of Digg-style articles entitled 'HOW TO
RUN A HADOOP MULTI-MASTER CLUSTER!'
The SSH keys are only required by the shell s
Do not forget 'tune2fs -m 2'. By default this reserved-blocks percentage is set to 5%.
With 1 TB disks we got 33 GB more usable space. Talk about instant
savings!
On Mon, May 18, 2009 at 1:31 PM, Alex Loddengaard wrote:
> I believe Yahoo! uses ext3, though I know other people have said that XFS
> has performed bett
On Fri, May 15, 2009 at 5:05 PM, Aaron Kimball wrote:
> Hi all,
>
> For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition
> to uploading data into HDFS and using MapReduce to load/transform the data,
> I'd like to integrate more closely with Hive. Specifically, to run the
> CR
Hey all,
I have come pretty far along with using Cacti to graph Hadoop JMX
variables. http://www.jointhegrid.com/hadoop/. Currently
I have about 8 different hadoop graph types available for the NameNode
and the DataNode.
The NameNode has many fairly complete and detailed counters. I h
On Mon, May 11, 2009 at 12:08 PM, Todd Lipcon wrote:
> In addition to Jason's suggestion, you could also see about setting some of
> Hadoop's directories to subdirs of /dev/shm. If the dataset is really small,
> it should be easy to re-load it onto the cluster if it's lost, so even
> putting dfs.d
2009/5/7 Jeff Hammerbacher :
> Hey,
>
> You can read more about why small files are difficult for HDFS at
> http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
>
> Regards,
> Jeff
>
> 2009/5/7 Piotr Praczyk
>
>> If you want to use many small files, they probably have the same
>>
For those of you who would like to graph the Hadoop JMX variables
with Cacti, I have created Cacti templates and data input scripts.
Currently the package gathers and graphs the following information
from the NameNode:
Blocks Total
Files Total
Capacity Used/Capacity Free
Live Data Nodes/Dead Data
'cloud computing' is a hot term. According to the definition provided
by wikipedia http://en.wikipedia.org/wiki/Cloud_computing,
Hadoop+HBase+Lucene+Zookeeper fits some of the criteria, but not well.
Hadoop is scalable, and with HOD it is dynamically scalable.
I do not think (Hadoop+HBase+Lucene+Zook
On Tue, May 5, 2009 at 10:44 AM, Dan Milstein wrote:
> Best-practices-type question: when a single cluster is being used by a team
> of folks to run jobs, how do people on this list handle user accounts?
>
> Many of the examples seem to show everything being run as root on the
> master, which is h
You can also pull these variables from the NameNode and DataNode with
JMX. I am doing this to graph them with Cacti. Both the JMX read/write
and read-only users can access this variable.
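For anyone curious, the client side is only a few lines. A bare-bones
sketch is below; the JMX port and the ObjectName are assumptions, so
browse with jconsole first to find the exact bean and attribute names
for your Hadoop version.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxCapacityProbe {
  public static void main(String[] args) throws Exception {
    // Assumes the NameNode JVM was started with com.sun.management.jmxremote
    // on port 8004; adjust host/port to your setup.
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://namenode1:8004/jmxrmi");
    JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
    MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
    // Hypothetical ObjectName -- check jconsole for the real one.
    ObjectName nn = new ObjectName(
        "hadoop.dfs:service=NameNode,name=FSNamesystemState");
    System.out.println("CapacityRemaining = "
        + mbsc.getAttribute(nn, "CapacityRemaining"));
    jmxc.close();
  }
}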
On Tue, Apr 28, 2009 at 8:29 AM, Stas Oskin wrote:
> Hi.
>
> Any idea if the getDiskStatus() function requires superus
On Wed, Apr 29, 2009 at 2:48 PM, Todd Lipcon wrote:
> On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski wrote:
>
>> If you have trouble loading your data into mysql using INSERTs or LOAD
>> DATA, consider that MySQL supports CSV directly using the CSV storage
>> engine. The only thing you have t
On Wed, Apr 29, 2009 at 10:19 AM, Stefan Podkowinski wrote:
> If you have trouble loading your data into mysql using INSERTs or LOAD
> DATA, consider that MySQL supports CSV directly using the CSV storage
> engine. The only thing you have to do is to copy your hadoop produced
> csv file into the m
>>>> reduce would be the simplest.
>>>>
>>>> On your question, a Mapper and a Reducer define 3 entry points: configure,
>>>> called once on task start, the map/reduce called once for each
>>>> record,
>>>> and close, cal
once on task start, the map/reduce called once for each record,
>> > and close, called once after the last call to map/reduce.
>> > At least through 0.19, the close is not provided with the output
>> collector
>> > or the reporter, so you need to save them in the
I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
support the ability to compute max(). I am writing my own max() over a simple
one-column dataset.
The best solution I came up with was using MapRunner. With MapRunner I
can store the highest value in a private member variable. I can read
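A rough sketch of that approach against the old org.apache.hadoop.mapred
API is below. The one-column-of-longs input layout is my assumption. The
runner keeps the running maximum in a member variable while it drains the
split, then collects a single record; a single reducer can then take the
max of the per-split maxima.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MaxMapRunner
    implements MapRunnable<LongWritable, Text, NullWritable, LongWritable> {

  private long max = Long.MIN_VALUE;
  private boolean seen = false;

  public void configure(JobConf job) {
  }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<NullWritable, LongWritable> output,
                  Reporter reporter) throws IOException {
    LongWritable key = input.createKey();
    Text value = input.createValue();
    while (input.next(key, value)) {
      long current = Long.parseLong(value.toString().trim());
      if (current > max) {
        max = current;
      }
      seen = true;
    }
    if (seen) {
      // One output record per map task: the local maximum for this split.
      output.collect(NullWritable.get(), new LongWritable(max));
    }
  }
}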
>>but does Sun's Lustre follow in the steps of Gluster then
Yes. IMHO GlusterFS advertises benchmarks vs Lustre.
The main difference is that GlusterFS is a FUSE (userspace) filesystem,
while Lustre has to be patched into the kernel or loaded as a module.
It is a little more natural to connect to HDFS from Apache Tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS API.
I have modified this code to run inside Tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
I will not testify to how well this setup will perform
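For reference, the core of it boils down to something like the sketch
below, trimmed in the spirit of the wiki example above. The namenode
URI is a placeholder; inside Tomcat you would call this from a servlet's
doGet() with the response output stream.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DfsStreamer {
  public static void streamFile(String pathInDfs, OutputStream out)
      throws IOException {
    Configuration conf = new Configuration();
    // Normally picked up from hadoop-site.xml on the classpath.
    conf.set("fs.default.name", "hdfs://namenode1:9000");
    FileSystem fs = FileSystem.get(conf);
    InputStream in = fs.open(new Path(pathInDfs));
    try {
      // Copy the HDFS file to the caller's stream.
      IOUtils.copyBytes(in, out, conf, false);
    } finally {
      in.close();
    }
  }
}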
I use linux-vserver http://linux-vserver.org/
The Linux-VServer technology is a soft partitioning concept based on
Security Contexts which permits the creation of many independent
Virtual Private Servers (VPS) that run simultaneously on a single
physical server at full speed, efficiently sharing h
>>Yeah, but what's the point of using Hadoop then? i.e. we lost all the
>>parallelism?
Some jobs do not need it. For example, I am working with the Hive sub
project. If I have a table that is smaller than my block size, having a
large number of mappers or reducers is counterproductive. Hadoop will
s
On Wed, Feb 25, 2009 at 1:13 PM, Mikhail Yakshin
wrote:
> Hi,
>
>> Is anyone using Hadoop as more of a near/almost real-time processing
>> of log data for their systems to aggregate stats, etc?
>
> We do, although "near realtime" is a pretty relative subject and your
> mileage may vary. For example,
We have an MR program that collects once for each token on a line. What
types of applications can benefit from batch mapping?
I am working to graph the hadoop JMX variables.
http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/dfs/namenode/metrics/NameNodeStatistics.html
I have two nodes, one running 0.17 and the other running 0.19.
The NameNode JMX objects and attributes seem to be working well. I am
graphi
One thing to mention is that 'limit' is not SQL standard. Microsoft SQL
Server uses SELECT TOP 100 * FROM table. Some RDBMS may not support
any such syntax. To be more portable you should use something
like an auto-increment ID or a DATE column as an offset. It is tricky to write
anything truly database ag
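To make the ID-offset idea concrete, here is a small JDBC sketch; the
table, columns, and connection URL are made up. Instead of LIMIT/TOP, it
pages through the rows with a WHERE clause on an indexed, increasing
column.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class IdOffsetPager {
  public static void main(String[] args) throws SQLException {
    Connection conn =
        DriverManager.getConnection("jdbc:derby://dbhost:1527/sampledb");
    PreparedStatement ps = conn.prepareStatement(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id");
    ps.setMaxRows(100); // portable alternative to LIMIT 100
    long lastId = 0;
    int fetched;
    do {
      fetched = 0;
      ps.setLong(1, lastId);
      ResultSet rs = ps.executeQuery();
      while (rs.next()) {
        // Remember the highest id we have seen; it is the next offset.
        lastId = rs.getLong("id");
        System.out.println(rs.getString("payload"));
        fetched++;
      }
      rs.close();
    } while (fetched > 0);
    ps.close();
    conn.close();
  }
}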
I am looking at using HOD (Hadoop On Demand) to manage a production
cluster. After reading the documentation, it seems that HOD is missing
some things that would need to be carefully set in a production
cluster.
Rack Locality:
HOD uses the -N 5 option and starts a cluster of N nodes. There seems
to
Very interesting note for a new cluster checklist. Good to tune the
file system down from 5%.
On a related note, some operating systems ::cough:: FreeBSD will report
negative disk space when you go over the quota. What does that mean?
We run nagios with NRPE to run remote disk checks. We configure
Zeroconf is more focused on simplicity than security. One of the
original problems, which may have been fixed since, is that any program can
announce any service, e.g. my laptop can announce that it is the DNS for
google.com, etc.
I want to mention a related topic to the list. People are approaching
the auto-
On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar wrote:
> Any one knows Netbeans or Eclipse plugin for Hadoop Map -Reduce job. I want
> to make plugin for netbeans
>
> http://vinayakkatkar.wordpress.com
> --
> Vinayak Katkar
> Sun Campus Ambassador
> Sun Microsytems,India
> COEP
>
There is an ecp
I am looking to create some RA scripts and experiment with starting
Hadoop via the Linux-HA cluster manager. Linux-HA would handle restarting
downed nodes and eliminate the SSH key dependency.
Also be careful when you do this. If you are running map/reduce on a
large file, the map and reduce operations will be called many times.
You can end up with a lot of output. Use log4j instead.
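A tiny sketch of the log4j suggestion in a mapper is below (old mapred
API; class and message are placeholders). The messages go to the task
logs, visible through the JobTracker web UI, instead of your job output.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.log4j.Logger;

public class LoggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Logger LOG = Logger.getLogger(LoggingMapper.class);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    // Debug detail goes to the task log, not the output files.
    LOG.debug("processing record at offset " + key.get());
    output.collect(value, new LongWritable(1));
  }
}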
Also it might be useful to strongly word hadoop-default.xml, as many
people might not know a downside exists to using 2 rather than 3 as
the replication factor. Before reading this thread I would have
thought 2 to be sufficient.
Is anyone working on a JDBC RecordReader/InputFormat? I was thinking
this would be very useful for sending data into mappers. Writing data
to a relational database might be more application-dependent, but still
possible.
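To make the idea concrete, here is a very rough sketch of just the
RecordReader half against the 0.19 mapred API. The InputFormat/InputSplit
wiring, split boundaries, and connection pooling are left out, and the
single-string-column query is my assumption.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

public class JdbcRecordReader implements RecordReader<LongWritable, Text> {
  private Connection conn;
  private Statement stmt;
  private ResultSet rs;
  private long row = 0;

  public JdbcRecordReader(String jdbcUrl, String query) throws IOException {
    try {
      conn = DriverManager.getConnection(jdbcUrl);
      stmt = conn.createStatement();
      rs = stmt.executeQuery(query);
    } catch (SQLException e) {
      throw new IOException(e.toString());
    }
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    try {
      if (!rs.next()) {
        return false;
      }
      key.set(row++);
      // Assume the query selects a single string column.
      value.set(rs.getString(1));
      return true;
    } catch (SQLException e) {
      throw new IOException(e.toString());
    }
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() { return row; }
  public float getProgress() { return 0.0f; }

  public void close() throws IOException {
    try {
      rs.close();
      stmt.close();
      conn.close();
    } catch (SQLException e) {
      throw new IOException(e.toString());
    }
  }
}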
All,
I always run iptables on my systems. Most of the hadoop setup guides I
have found skip iptables/firewall configuration. My namenode and task
tracker are on the same node. My current configuration is not working: when
I submit jobs from the namenode, jobs are kicked off on the slave nodes,
but they fa
We just setup a log4j server. This takes the logs off the cluster.
Plus you get all the benefits of log4j
http://timarcher.com/?q=node/10
Shahab,
This can be done.
If your client speaks Java, you can connect to Hadoop and write as a stream.
If your client does not have Java, the Thrift API will generate stubs
in a variety of languages.
Thrift API: http://wiki.apache.org/hadoop/HDFS-APIs
Shameless plug -- If you just want to stream da
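As a sketch of the 'write as a stream' case, the Java side is only a few
lines; the namenode URI and target path below are placeholders.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsWriter {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode1:9000");
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/logs/incoming.log"));
    // Copy stdin into the HDFS file line by line.
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      out.writeBytes(line + "\n");
    }
    out.close();
  }
}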
All I have to say is wow! I never tried jconsole before. I have
hadoop_trunk checked out and the JMX has all kinds of great
information. I am going to look at how I can get JMX, Cacti, and Hadoop
working together.
Just as an FYI there are separate ENV variables for each now. If you
override hadoop_o
Someone on the list is looking at monitoring hadoop features with
nagios. Nagios can be configured with an event_handler. In the past I
have written event handlers to do operations like this. If down ---
use SSH key and restart.
However, since you have an SSH key on your master node, you should
I came up with my line of thinking after reading this article:
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
As a guy who was intrigued by the Java coffee cup in '95 and who now
lives as a data center/NOC jock/Unix guy, let's say I look at a log
manageme
I had downloaded thrift and ran the example applications after the
Hive meet up. It is very cool stuff. The thriftfs interface is more
elegant than what I was trying to do, and that implementation is more
complete.
Still, someone might be interested in what I did if they want a
super-light API :)
One of my first questions about hadoop was, "How do systems outside
the cluster interact with the file system?" I read several documents
that described streaming data into hadoop for processing, but I had
trouble finding examples.
The goal of LHadoop Server (L stands for Lightweight) is to produce
I was checking out this slide show.
http://www.slideshare.net/jhammerb/2008-ur-tech-talk-zshao-presentation/
in the diagram a Web UI exists. This was the first I have heard of
it. Is this part of, or planned to be a part of, contrib/hive? I think
a web interface for showing table schema and executi
That all sounds good. By 'quick hack' I meant that 'check_tcp' was not
good enough, because an open TCP socket does not prove much. However,
if the page returns useful attributes that show the cluster is alive, that
is great and easy.
Come to think of it you can navigate the dfshealth page and get useful
in
The simple way would be to use NRPE and check_procs. I have never
tested it, but a command like 'ps -ef | grep java | grep NameNode' would
be a fairly decent check. That is not very robust, but it should let
you know if the process is alive.
You could also monitor the web interfaces associated with t
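A check against the web interface can be as small as the sketch below.
Port 50070 and the dfshealth.jsp path are the usual NameNode web UI
defaults, but treat them as assumptions for your setup; a real Nagios
plugin should follow the plugin return-code conventions more carefully.

import java.net.HttpURLConnection;
import java.net.URL;

public class CheckNameNodeWeb {
  public static void main(String[] args) {
    try {
      URL url = new URL("http://namenode1:50070/dfshealth.jsp");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setConnectTimeout(5000);
      conn.setReadTimeout(5000);
      int code = conn.getResponseCode();
      if (code == 200) {
        System.out.println("OK - dfshealth responded with " + code);
        System.exit(0); // Nagios OK
      }
      System.out.println("CRITICAL - dfshealth responded with " + code);
      System.exit(2); // Nagios CRITICAL
    } catch (Exception e) {
      System.out.println("CRITICAL - " + e.getMessage());
      System.exit(2);
    }
  }
}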
You bring up some valid points. This would be a great topic for a
white paper. The first line of defense should be to apply inbound and
outbound iptables rules. Only source IPs that have a direct need to
interact with the cluster should be allowed to. The same is true with
the web access. Only a
I determined the problem once I set the log4j properties to debug.
derbyclient.jar and derbytools.jar do not ship with Hive. As a result,
when you try to load org.apache.derby.jdbc.ClientDriver you get an
InvocationTargetException.
The solution for this was to download Derby and place those files
in
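A quick way to confirm the jars are on the classpath is a tiny standalone
check like this; the host, port, and database name are just examples.

import java.sql.Connection;
import java.sql.DriverManager;

public class DerbyClientCheck {
  public static void main(String[] args) throws Exception {
    // Throws ClassNotFoundException if derbyclient.jar is missing.
    Class.forName("org.apache.derby.jdbc.ClientDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:derby://nyhadoop1:1527/metastore_db;create=true");
    System.out.println("connected: " + !conn.isClosed());
    conn.close();
  }
}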
>> hive.metastore.local
>> true
Why would I set this property to true? My goal is to store the meta
data in an external database. If I set this to true, the metastore is
created in the working directory.
I am doing a lot of testing with Hive; I will be sure to add this
information to the wiki once I get it going.
Thus far I downloaded the same version of Derby that Hive uses. I have
verified that the connection is up and running.
ij version 10.4
ij> connect 'jdbc:derby://nyhadoop1:1527/metastore
I have been working with Hive for the past week. The ability to wrap
an SQL-like tool over HDFS is very powerful. Now that I am comfortable
with the concept, I am looking at an implementation of it.
Currently I have a three-node cluster for testing: hadoop1, hadoop2,
and hadoop3. I have hive instal
wait and sleep are not what you are looking for. You can use 'nohup'
to run a job in the background and have its output piped to a file.
On Tue, Jun 10, 2008 at 5:48 PM, Meng Mao <[EMAIL PROTECTED]> wrote:
> I'm interested in the same thing -- is there a recommended way to batch
> Hadoop jobs toge
I have never tried this method. The concept came from a research paper
I ran into. The goal was to detect the language of a piece of text by
looking at several factors: average word length, average sentence length,
average number of vowels in a word, etc. He used these to
score an article, and
There is a place for virtualization: if you can justify the overhead
for hands-free network management, if your systems really have enough
power to run Hadoop and something else, or if you need to run
multiple truly isolated versions of Hadoop (selling some type of
Hadoop grid service?).
If you
I once asked a wise man in charge of a rather large multi-datacenter
service, "Have you ever considered virtualization?" He replied, "All
the CPUs here are pegged at 100%."
There may be applications for this type of processing. I have thought
about systems like this from time to time. This thinkin
I think that feature makes sense because starting a JVM has overhead.
On Sun, Jun 1, 2008 at 4:26 AM, Christophe Taton <[EMAIL PROTECTED]> wrote:
> Actually Hadoop could be made more friendly to such realtime Map/Reduce
> jobs.
> For instance, we could consider running all tasks inside the task trac
>> Conservative IT executive
Sounds like you're working at my last job. :)
Yahoo! uses Hadoop for a very large cluster.
http://developer.yahoo.com/blogs/hadoop/
And after all, Hadoop is a work-alike of the Google File System; Google
uses that for all types of satellite data.
The New York Times is usi