Hello,

you can use the Pydoop HDFS API to work with HDFS files:

>>> import pydoop.hdfs as hdfs
>>> with hdfs.open('hdfs://localhost:8020/user/myuser/filename') as f:
...     for line in f:
...         do_something(line)

As you can see, the API is very similar to that of ordinary Python file objects. Check out the following tutorial for more details:

http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html
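
For example, the failing loop from the script below could be rewritten along
these lines (just a sketch: the namenode host/port and the processing step are
placeholders you would have to adapt to your setup):

#!/usr/bin/env python

import sys
import pydoop.hdfs as hdfs

for line in sys.stdin:
    offset, filename = line.rstrip("\n").split("\t")
    # build a full HDFS URL; host and port are assumptions for your cluster
    path = "hdfs://localhost:8020/user/hdfs/catalog3/" + filename
    with hdfs.open(path) as f:
        for record in f:
            do_something(record)  # placeholder for your actual processing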

Note that Pydoop also has a MapReduce API, so you can use it to rewrite the whole program:

http://pydoop.sourceforge.net/docs/tutorial/mapred_api.html
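
As a rough idea of what that looks like, here is a word-count sketch in the
style of that tutorial (the exact module, class and entry-point names depend
on your Pydoop version, so treat this as an approximation and follow the
tutorial for the real thing):

import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes

class Mapper(api.Mapper):
    def map(self, context):
        # emit a (word, 1) pair for every word in the input line
        for word in context.value.split():
            context.emit(word, 1)

class Reducer(api.Reducer):
    def reduce(self, context):
        # sum all the counts emitted for this word
        context.emit(context.key, sum(context.values))

def __main__():
    pipes.run_task(pipes.Factory(Mapper, Reducer))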

It also has a more compact and easy-to-use scripting engine for simple applications:

http://pydoop.sourceforge.net/docs/tutorial/pydoop_script.html
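
With Pydoop Script the same word count shrinks to two plain functions,
something like this (again a sketch; see the tutorial for the exact signatures
and for how to launch it with the pydoop script command):

def mapper(key, value, writer):
    # value is one line of input text; emit a (word, 1) pair per word
    for word in value.split():
        writer.emit(word, 1)

def reducer(word, icounts, writer):
    # icounts iterates over all the counts emitted for this word
    writer.emit(word, sum(map(int, icounts)))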

If you think Pydoop is right for you, read the installation guide:

http://pydoop.sourceforge.net/docs/installation.html

Simone

On 01/14/2013 11:24 PM, Andy Isaacson wrote:
Oh, another link I should have included!
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

-andy

On Mon, Jan 14, 2013 at 2:19 PM, Andy Isaacson <a...@cloudera.com> wrote:
Hadoop Streaming does not magically teach Python open() how to read
from "hdfs://" URLs. You'll need to use a library or fork a "hdfs dfs
-cat" to read the file for you.

A few links that may help:

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://stackoverflow.com/questions/12485718/python-read-file-as-stream-from-hdfs
https://bitbucket.org/turnaev/cyhdfs

-andy

On Sat, Jan 12, 2013 at 12:30 AM, springring <springr...@126.com> wrote:
Hi,

      When I run the code below as a streaming job, it errors out (N/A) and is
killed. Running it step by step, I found that it fails at
"file_obj = open(file)". When I run the same code outside of Hadoop,
everything is OK.

#!/bin/env python

import sys

for line in sys.stdin:
    offset,filename = line.split("\t")
    file = "hdfs://user/hdfs/catalog3/" + filename
    print line
    print filename
    print file
    file_obj = open(file)
..................................


--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone....@crs4.it
http://www.crs4.it
