Thanks for your inputs, Ashish and Hari. Ashish, I am attempting something similar (using WebHDFS) to what you mentioned inline under the 3rd point (whether to consider Flume for a daily batch job). Let me know if you have any idea about the error below. I'll update the group if setting the flag dfs.webhdfs.enabled = true helps.

regards
Sunita
---------- Forwarded message ----------
From: Sunita Arvind <[email protected]>
Date: Fri, Jul 19, 2013 at 7:30 PM
Subject: Re: Seeking advice over choice of language and implementation
To: [email protected]

Thank you, Israel. I will attempt option 1 and share my experiences.

In the meanwhile I tried a workaround: using WebHDFS to write the files directly to HDFS from a Python daemon (using this library - https://github.com/carlosmarin/webhdfs-py/blob/master/webhdfs/webhdfs.py). However, with this I am getting an exception:

07/19/2013 06:05:59 PM - webhdfs - DEBUG - HTTP Response: 404, Not Found

If I copy-paste the resulting URL into the browser address bar, I get something like this:

{"RemoteException":{"exception":"IllegalArgumentException","javaClassName":"java.lang.IllegalArgumentException","message":"Invalid value for webhdfs parameter \"op\": No enum const class org.apache.hadoop.hdfs.web.resources.GetOpParam$Op.CREATE"}}

I have no idea what this means. I am wondering whether it means HDFS is not configured with dfs.webhdfs.enabled = true. (I do not have permission to check or change this; I am requesting access from the admin.) Let me know your thoughts.

regards
Sunita
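A note on the exception above: GetOpParam$Op in the message indicates the NameNode received the request as an HTTP GET, and CREATE is only a valid op for HTTP PUT, so the HTTP method (rather than dfs.webhdfs.enabled) is the likely culprit. Below is a minimal sketch of the two-step WebHDFS create done with the requests library instead of webhdfs-py; the NameNode host/port, HDFS path, and user name are placeholders, not values taken from this thread.

    import requests

    # Placeholder values -- substitute your cluster's NameNode host/port,
    # target HDFS path, and user name.
    NAMENODE = "http://namenode:50070"
    HDFS_PATH = "/user/sunita/feeds/feed.json"
    USER = "sunita"

    url = "{0}/webhdfs/v1{1}?op=CREATE&user.name={2}&overwrite=true".format(
        NAMENODE, HDFS_PATH, USER)

    # Step 1: an empty-bodied PUT to the NameNode. It writes no data;
    # it answers 307 with a Location header pointing at a DataNode.
    resp = requests.put(url, allow_redirects=False)
    datanode_url = resp.headers["Location"]

    # Step 2: PUT the actual file content to that DataNode.
    with open("feed.json", "rb") as f:
        resp = requests.put(datanode_url, data=f)

    print(resp.status_code)  # 201 Created on success

The same two steps can be verified from a shell with curl -i -X PUT '<url>' before pointing the daemon at it.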
On Fri, Jul 19, 2013 at 12:51 AM, Israel Ekpo <[email protected]> wrote:

> Sunita,
>
> Depending on your level of comfort, you can do one of the following:
>
> 1. Use Python to fetch your data and then send the events via HTTP to
> the Flume HTTP Source [1]
> 2. Use Java to create a custom source [6] in Flume that handles the
> data fetching and then puts it in a channel [3] so that it can be
> funneled into the sinks [4] and [5]
>
> Option 1 would be easier for you, since you can get the data in Python
> and just stream it down via HTTP to Flume. (A sketch of this approach
> appears after the thread.)
>
> Option 2 would be more involved, since you would need to write code
> that communicates with external endpoints.
>
> References
> [1] http://goo.gl/5lHlg
> [2] http://goo.gl/GnVbE
> [3] http://goo.gl/t31Xh
> [4] http://goo.gl/G9xS8
> [5] http://goo.gl/Wn4W5
> [6] http://goo.gl/Q0yyn
>
> Author and Instructor for the Upcoming Book and Lecture Series
> Massive Log Data Aggregation, Processing, Searching and Visualization
> with Open Source Software
> http://massivelogdata.com
>
> On 18 July 2013 13:38, Sunita Arvind <[email protected]> wrote:
>
>> Hello friends,
>>
>> I am new to Flume and have written a Python script to fetch some data
>> from social media. The response is JSON. I am seeking help on the
>> following issues:
>>
>> 1. I am finding it hard to make Python and Flume talk. Is it just my
>> ignorance, or is it indeed a long route? AFAIK, I need to understand
>> the Thrift API, Avro, etc. to achieve this. I also read about pipes.
>> Would this be a simple implementation?
>>
>> 2. I am equally comfortable (uncomfortable) in Java, so I am wondering
>> whether it is better to rewrite my application in Java so that I can
>> integrate it with Flume easily. Are there any advantages to having a
>> Java application, given that all of Hadoop is Java?
>>
>> 3. I need to schedule the agent to run on a daily basis. Which of the
>> above approaches would help me achieve this easily?
>>
>> 4. Going by
>> http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%[email protected]%3E
>> it looks like we need to clean up disk space manually even with Flume.
>> I am not clear on the advantages Flume would give me over a simple
>> cron job that does the task. I could instead put statements like
>> "hadoop fs -put <location of output file on local> <location on hdfs>"
>> in the cron job. (A sketch of such a job appears at the end of this
>> message.)
>>
>> Appreciate your help and guidance.
>>
>> regards,
>> Sunita
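For option 1 in Israel's reply, here is a minimal sketch of posting events from Python to a Flume HTTPSource, assuming an agent whose HTTP source is bound to flume-host:5140 with the default JSONHandler (host and port are placeholders, not values from this thread). The handler expects the request body to be a JSON array of events, each carrying "headers" and a string "body".

    import json
    import requests

    # Placeholder endpoint for the Flume agent's HTTP source.
    FLUME_URL = "http://flume-host:5140"

    # JSONHandler format: a JSON array of events with string-valued
    # headers and a string body.
    events = [
        {"headers": {"topic": "social-media"},
         "body": json.dumps({"id": 1, "text": "sample record"})},
    ]

    resp = requests.post(FLUME_URL, data=json.dumps(events),
                         headers={"Content-Type": "application/json"})
    print(resp.status_code)  # 200 when the events are accepted

This keeps the fetch logic in Python and leaves buffering and the HDFS sink to Flume, which also answers question 2: there is no need to rewrite the application in Java just to talk to Flume.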
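And for the cron-job alternative raised in point 4, a sketch of a daily push script (all paths are placeholders):

    import datetime
    import subprocess

    # Placeholder paths; point these at the daemon's actual output file
    # and the desired HDFS directory.
    local_file = "/tmp/feed-{0}.json".format(datetime.date.today().isoformat())
    hdfs_dir = "/user/sunita/feeds"

    # "hadoop fs -put <local> <hdfs>" copies the file into HDFS;
    # check_call raises if the command exits non-zero, so a failed
    # upload surfaces in cron's mail/log.
    subprocess.check_call(["hadoop", "fs", "-put", local_file, hdfs_dir])

A crontab entry such as "0 2 * * * python /opt/jobs/push_to_hdfs.py" (script path hypothetical) would run it once a day. What this simple approach does not give you, and Flume does, is channel buffering, retries, and fan-out to multiple sinks.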
