Thanks Gwen. Regarding your comment in #3 ("if one collector is down,
the client can connect to another"): how does that relate to the
two-tier architecture? And what do "client" and "collector" refer to
in this case?

regards,
Lin

On Sun, Mar 8, 2015 at 10:42 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
> There are several benefits to the two-tier architecture:
>
> 1. Limit the number of processes writing to HDFS. As you correctly
> mentioned, there are some limitations there.
> 2. Enable us to create larger files faster. (We want to switch files
> on HDFS fast to allow querying new data faster, but we also don't
> want a gazillion small files.)
> 3. A two-tier architecture can support high availability and load
> balancing - if one collector is down, the client can connect to
> another.
>
> Gwen
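(For concreteness, a minimal sketch of what the failover in point 3
could look like on the client tier, assuming Avro RPC between the two
tiers. All agent names, hostnames, ports, and paths below are
illustrative, not taken from the thread.)

# Client-tier agent: reads local logs and forwards them to one of two
# collector-tier agents over Avro RPC.
client1.sources = src1
client1.channels = ch1
client1.sinks = sink1 sink2
client1.sinkgroups = g1

client1.channels.ch1.type = file

client1.sources.src1.type = spooldir
client1.sources.src1.spoolDir = /var/log/myapp/spool
client1.sources.src1.channels = ch1

# One Avro sink per collector host.
client1.sinks.sink1.type = avro
client1.sinks.sink1.hostname = collector1.example.com
client1.sinks.sink1.port = 4141
client1.sinks.sink1.channel = ch1

client1.sinks.sink2.type = avro
client1.sinks.sink2.hostname = collector2.example.com
client1.sinks.sink2.port = 4141
client1.sinks.sink2.channel = ch1

# Failover sink processor: events go to the higher-priority sink; if
# that collector is down, the other one takes over. Setting
# processor.type = load_balance would spread load across both instead.
client1.sinkgroups.g1.sinks = sink1 sink2
client1.sinkgroups.g1.processor.type = failover
client1.sinkgroups.g1.processor.priority.sink1 = 10
client1.sinkgroups.g1.processor.priority.sink2 = 5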
> On Sun, Mar 8, 2015 at 10:30 PM, Lin Ma <lin...@gmail.com> wrote:
> > Thanks Gwen,
> >
> > Is the purpose of the two-tier Flume architecture to reduce the
> > number of processes writing to HDFS? I remember that if too many
> > processes write to HDFS, the NameNode will have issues.
> >
> > regards,
> > Lin
> >
> > On Sun, Mar 8, 2015 at 8:26 PM, Gwen Shapira <gshap...@cloudera.com> wrote:
> >>
> >> As stated in the docs, you'll need to have the timestamp in the
> >> event header for HDFS to automatically place the events in the
> >> correct directory. This can be done using the timestamp
> >> interceptor.
> >>
> >> You can see an example here:
> >>
> >> https://github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch09-clickstream/Flume
> >>
> >> This example uses a 2-tier architecture (i.e. one Flume agent
> >> collecting logs from the web servers and the other writing to
> >> HDFS). However, you can see how in client.conf the
> >> spooling-directory source is configured with the timestamp
> >> interceptor, and how in collector.conf the HDFS sink has a
> >> parameterized target directory with the timestamp in it.
> >>
> >> Gwen
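(The pattern Gwen points to boils down to two pieces of configuration,
sketched below. Agent names and paths are illustrative; the property
names come from the Flume user guide.)

# First tier ("client.conf"): stamp each event with the current time
# as it is read, via the timestamp interceptor.
client1.sources.src1.type = spooldir
client1.sources.src1.spoolDir = /var/log/myapp/spool
client1.sources.src1.channels = ch1
client1.sources.src1.interceptors = ts
client1.sources.src1.interceptors.ts.type = timestamp

# Second tier ("collector.conf"): the HDFS sink expands the
# %Y/%m/%d/%H escape sequences from each event's "timestamp" header,
# creating new date/hour directories as time moves on -- nothing needs
# to be pre-created.
collector1.sinks.hdfs1.type = hdfs
collector1.sinks.hdfs1.channel = ch1
collector1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events/%Y/%m/%d/%H
collector1.sinks.hdfs1.hdfs.fileType = DataStream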
> >>
> >> On Sun, Mar 8, 2015 at 7:56 PM, Lin Ma <lin...@gmail.com> wrote:
> >> > Thanks Ashish,
> >> >
> >> > One further question on the HDFS sink. If I configure the
> >> > destination directory on HDFS with a Year/Month/Day/Hour
> >> > pattern, will Flume automatically put each event it receives
> >> > into the related directory and create new directories as time
> >> > passes? Or do I have to set some key/value headers on the event
> >> > for the HDFS sink to recognize the event time and put it into
> >> > the appropriate time-based folder?
> >> >
> >> > regards,
> >> > Lin
> >> >
> >> > On Sun, Mar 8, 2015 at 6:32 PM, Ashish <paliwalash...@gmail.com> wrote:
> >> >>
> >> >> Your understanding is correct :)
> >> >>
> >> >> On Mon, Mar 9, 2015 at 6:54 AM, Lin Ma <lin...@gmail.com> wrote:
> >> >> > Thanks Ashish,
> >> >> >
> >> >> > I followed your guidance and found the instructions below,
> >> >> > about which I have further questions to confirm with you. It
> >> >> > seems we need to close the files and never touch them again
> >> >> > for Flume to process them correctly. So is it good practice
> >> >> > to (1) let the application write log files the existing way,
> >> >> > e.g. in an hourly or 5-minute rotation pattern, and (2) close
> >> >> > and move the files to another directory that serves as the
> >> >> > input for the Flume agent's spooling directory source?
> >> >> >
> >> >> > "This source will watch the specified directory for new
> >> >> > files, and will parse events out of new files as they
> >> >> > appear."
> >> >> >
> >> >> > "If a file is written to after being placed into the spooling
> >> >> > directory, Flume will print an error to its log file and stop
> >> >> > processing.
> >> >> > If a file name is reused at a later time, Flume will print an
> >> >> > error to its log file and stop processing."
> >> >> >
> >> >> > regards,
> >> >> > Lin
> >> >> >
> >> >> > On Sun, Mar 8, 2015 at 12:23 AM, Ashish <paliwalash...@gmail.com> wrote:
> >> >> >>
> >> >> >> Please look at the following:
> >> >> >> Spooling Directory Source
> >> >> >> (http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source)
> >> >> >> and
> >> >> >> HDFS Sink (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink)
> >> >> >>
> >> >> >> The Spooling Directory Source needs immutable files, meaning
> >> >> >> files should not be written to once they are being consumed.
> >> >> >> In short, your application cannot write to the file being
> >> >> >> read by Flume.
> >> >> >>
> >> >> >> The log format is not an issue, as long as you don't want it
> >> >> >> to be interpreted by Flume components. Since it's a log, I'm
> >> >> >> assuming a single log entry per line, with a line separator
> >> >> >> at the end of each line.
> >> >> >>
> >> >> >> You can also look at the Exec source
> >> >> >> (http://flume.apache.org/FlumeUserGuide.html#exec-source)
> >> >> >> for tailing a file being written by the application. The
> >> >> >> documentation covers the details on all the links.
> >> >> >>
> >> >> >> HTH !
> >> >> >>
> >> >> >> On Sun, Mar 8, 2015 at 12:32 PM, Lin Ma <lin...@gmail.com> wrote:
> >> >> >> > Hi Flume masters,
> >> >> >> >
> >> >> >> > I want to install Flume on a box, consume a local log file
> >> >> >> > as the source, and send to a remote HDFS sink. The log
> >> >> >> > format is private, plain text (not Avro or JSON).
> >> >> >> >
> >> >> >> > I am reading the Flume guide and its many advanced source
> >> >> >> > configurations. Are there any reference samples for a
> >> >> >> > plain local log file source? Also, I'm not sure whether
> >> >> >> > Flume can consume the local file while the application is
> >> >> >> > still writing to it. Thanks.
> >> >> >> >
> >> >> >> > regards,
> >> >> >> > Lin
> >> >> >>
> >> >> >> --
> >> >> >> thanks
> >> >> >> ashish
> >> >> >>
> >> >> >> Blog: http://www.ashishpaliwal.com/blog
> >> >> >> My Photo Galleries: http://www.pbase.com/ashishpaliwal
> >> >>
> >> >> --
> >> >> thanks
> >> >> ashish
> >> >>
> >> >> Blog: http://www.ashishpaliwal.com/blog
> >> >> My Photo Galleries: http://www.pbase.com/ashishpaliwal
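(To make Ashish's advice concrete: a minimal sketch of a spooling
directory source, with the exec-source alternative shown as comments.
Agent names and paths are illustrative.)

# Spooling directory source: the application writes and rotates logs
# elsewhere; only *closed* files are moved into the spool directory,
# since files there must never be modified or have their names reused.
agent1.sources = spool1
agent1.channels = ch1
agent1.channels.ch1.type = file
agent1.sources.spool1.type = spooldir
agent1.sources.spool1.spoolDir = /var/log/myapp/spool
agent1.sources.spool1.channels = ch1
# Fully ingested files are renamed with this suffix (the default).
agent1.sources.spool1.fileSuffix = .COMPLETED

# Alternative: tail the live log file with an exec source. Simpler to
# set up, but the user guide warns it offers no delivery guarantees
# if the agent or the tail process dies.
# agent1.sources.tail1.type = exec
# agent1.sources.tail1.command = tail -F /var/log/myapp/app.log
# agent1.sources.tail1.channels = ch1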