Could you use sc.wholeTextFiles() with a custom parser or a regex to
extract the TransactionIDs?
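
Something like this minimal sketch (untested, written spark-shell style
where `sc` is predefined; the path hdfs:///logs/*.log and the regexes are
illustrative for your sample format, and it assumes each log file fits in
memory, since wholeTextFiles loads files whole):

// Load each file as one record so line order is preserved, then scan the
// lines sequentially, tracking the currently open transaction per thread.
val results = sc.wholeTextFiles("hdfs:///logs/*.log").flatMap { case (_, content) =>
  // Illustrative regexes, unanchored so they match anywhere in a line.
  val begin = """\[(\S+)\] - BEGIN TransactionID=(\S+)""".r.unanchored
  val sqlT  = """\[(\S+)\] - SQL execution time: (\S+)""".r.unanchored
  val end   = """\[(\S+)\] - END""".r.unanchored
  val open  = scala.collection.mutable.Map[String, String]() // thread -> open txn id
  content.split("\n").flatMap {
    case begin(thread, txId) => open(thread) = txId; None
    case sqlT(thread, time)  => open.get(thread).map(txId => (txId, time))
    case end(thread)         => open.remove(thread); None
    case _                   => None
  }
}
// results: RDD[(transactionId, sqlTime)], e.g. ("AA000000001", "500ms")

Because each file is processed as a single record, no shuffle ever touches
the line order, and interleaved threads are handled by keying the open
transaction on the thread name.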

Thanks
Best Regards

On Sat, Jul 11, 2015 at 8:18 AM, ssbiox <sergey.korytni...@gmail.com> wrote:

> Hello,
>
> I have a very specific question about how to search between particular
> lines of a log file. I did some research, and what I learned is that once
> a shuffle operation is applied to an RDD, there is no way to "reconstruct"
> the sequence of lines (except by zipping with an id). I'm looking for any
> useful approaches/workarounds that other developers use to solve this
> problem.
>
> Here is a sample:
> I have log4j log files where a specific BEGIN and END transaction marker
> is printed for each request/transaction. Somewhere in between, other
> classes may report useful statistics that need to be parsed, and
> unfortunately there is no way to keep the transaction id with those
> records. What is the best approach to link a transaction with a
> particular line between the BEGIN and END markers?
>
> Assume only the timestamp and thread name are available:
> 2015-01-01 20:00:00 DEBUG className [Thread-0] - BEGIN TransactionID=AA000000001
> 2015-01-01 20:00:00 DEBUG className [Thread-0] - ... {some other logs}
> 2015-01-01 20:00:01 DEBUG className [Thread-0] - SQL execution time: 500ms
> 2015-01-01 20:00:02 DEBUG className [Thread-0] - ... {some other logs}
> 2015-01-01 20:00:05 DEBUG className [Thread-0] - END
>
> In the end, I want to get a result pairing transaction ID AA000000001
> with the SQL execution time of 500ms.
>
> Another good example would probably be extracting a Java stack trace
> from the logs, where the stack trace lines don't contain any key strings
> (timestamp, thread id) at all to parse by.
>
> So far I've come up with one "idea" and one approach:
> 1) Find the file and position of the BEGIN line and run a separate
> non-Spark process to parse it line by line. In this case the question is:
> what is the best way to know which file the line belongs to, and at what
> position? Is zipWithUniqueId helpful for that? I'm not sure it's really
> effective or can help find the file name (or maybe the Hadoop partition).
>
> 2) I use the thread id as a key and map the BEGIN/END lines to that key.
> Then I create another RDD with the same key, but for the SQL execution
> time line. Then I do a left join of the RDDs by thread id and filter by
> the timestamps coming from both RDDs, keeping only the SQL line that
> comes before the END line (the SQL timestamp precedes the END timestamp).
> An approach like this becomes very confusing when more information
> (lines) needs to be extracted between BEGIN/END. Are there any
> recommendations on how to handle cases like that?
>
> Thank you,
> Sergey
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Linear-search-between-particular-log4j-log-lines-tp23773.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
