Hello,
you can use the Pydoop HDFS API to work with HDFS files:
>>> import pydoop.hdfs as hdfs
>>> with hdfs.open('hdfs://localhost:8020/user/myuser/filename') as f:
...     for line in f:
...         do_something(line)
As you can see, the API is very similar to that of ordinary Python file
objects. Check out the following tutorial for more details:
http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html
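As a rough sketch, the mapper from the original question could be adapted along these lines. Only hdfs.open is taken from the snippet above; the names hdfs_url, process and main, the host and port localhost:8020, and the base directory are assumptions made for illustration. Note that in "hdfs://user/hdfs/catalog3/" the word "user" sits in the host position of the URL, so the sketch spells out a host explicitly. The file opener is passed in as a parameter so the logic can be tried without a cluster.

```python
import sys

# Assumed base URL: host and port must match your NameNode, and the
# directory is taken from the original question.
BASE_URL = "hdfs://localhost:8020/user/hdfs/catalog3/"


def hdfs_url(filename, base=BASE_URL):
    """Join the base directory and a file name, dropping stray whitespace."""
    return base + filename.strip()


def process(records, opener, handle=sys.stdout.write):
    """Read "offset<TAB>filename" records and stream each named file.

    `opener` is any callable that returns a context manager yielding
    lines, for example pydoop.hdfs.open.
    """
    for record in records:
        offset, filename = record.rstrip("\n").split("\t", 1)
        with opener(hdfs_url(filename)) as f:
            for line in f:
                # Depending on the Pydoop version, lines may be bytes or
                # text; adapt `handle` accordingly.
                handle(line)


def main():
    # Imported here so the helpers above can be used without Pydoop installed.
    import pydoop.hdfs as hdfs
    process(sys.stdin, hdfs.open)
```

In a streaming job the script would end with a call to main(); passing a different opener to process() makes it possible to exercise the loop on local data first.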
Note that Pydoop also has a MapReduce API, so you can use it to rewrite
the whole program:
http://pydoop.sourceforge.net/docs/tutorial/mapred_api.html
It also has a more compact and easy-to-use scripting engine for simple
applications:
http://pydoop.sourceforge.net/docs/tutorial/pydoop_script.html
If you think Pydoop is right for you, read the installation guide:
http://pydoop.sourceforge.net/docs/installation.html
Simone
On 01/14/2013 11:24 PM, Andy Isaacson wrote:
Oh, another link I should have included!
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
-andy
On Mon, Jan 14, 2013 at 2:19 PM, Andy Isaacson <a...@cloudera.com> wrote:
Hadoop Streaming does not magically teach Python's open() how to read
from "hdfs://" URLs. You'll need to use a library or spawn an "hdfs dfs
-cat" process to read the file for you.
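For the second option, a minimal sketch using the standard subprocess module might look like the following. The helper name hdfs_cat and its command parameter are made up for illustration, and it assumes the hdfs executable is on the PATH of the task nodes.

```python
import subprocess


def hdfs_cat(path, command=("hdfs", "dfs", "-cat")):
    """Yield the lines of `path` as printed by `hdfs dfs -cat`.

    `command` is a hypothetical override hook, e.g. for a different
    launcher; the given path is appended as the last argument.
    """
    proc = subprocess.Popen(list(command) + [path],
                            stdout=subprocess.PIPE,
                            universal_newlines=True)  # decode output as text
    try:
        for line in proc.stdout:
            yield line
    finally:
        # Always release the pipe and reap the child process.
        proc.stdout.close()
        status = proc.wait()
    if status != 0:
        raise IOError("%s exited with status %d" % (command[0], status))
```

Inside the mapper, "for line in hdfs_cat(file):" would then take the place of the open(file) call. Starting one process per file is slow, so this is probably only reasonable when each task reads a handful of files.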
A few links that may help:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://stackoverflow.com/questions/12485718/python-read-file-as-stream-from-hdfs
https://bitbucket.org/turnaev/cyhdfs
-andy
On Sat, Jan 12, 2013 at 12:30 AM, springring <springr...@126.com> wrote:
Hi,
When I run the code below as a streaming job, the job fails with error N/A
and is killed. Running it step by step, I found that the error occurs at
"file_obj = open(file)". When I run the same code outside of Hadoop, everything
is OK.
#!/bin/env python

import sys

for line in sys.stdin:
    offset,filename = line.split("\t")
    file = "hdfs://user/hdfs/catalog3/" + filename
    print line
    print filename
    print file
    file_obj = open(file)
..................................
--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone....@crs4.it
http://www.crs4.it