Can you not use sc.wholeTextFiles() with a custom parser or a regex to extract the TransactionIDs?
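Something along these lines might work in spark-shell (where sc is already defined) -- just a minimal sketch, assuming each BEGIN and its TransactionID sit on one physical log line; the path and regexes below are made up to match your sample:

// Hypothetical patterns for the sample lines in your post.
val Begin = """.*\[(Thread-\d+)\] - BEGIN TransactionID=(\S+)""".r
val Sql   = """.*\[(Thread-\d+)\] - SQL execution time: (\d+)ms""".r
val End   = """.*\[(Thread-\d+)\] - END""".r

// Each element is (fileName, wholeFileContents), so line order within a
// file is preserved -- no shuffle can reorder it.
val results = sc.wholeTextFiles("hdfs:///logs/*.log").flatMap { case (_, contents) =>
  // Track the open transaction per thread while scanning top to bottom.
  val open = scala.collection.mutable.Map[String, String]() // thread -> txn id
  contents.split("\\r?\\n").toSeq.flatMap {
    case Begin(thread, txn) => open(thread) = txn; None
    case Sql(thread, ms)    => open.get(thread).map(txn => (txn, ms + "ms"))
    case End(thread)        => open.remove(thread); None
    case _                  => None
  }
}
results.collect().foreach(println) // e.g. (AA000000001,500ms)

The caveat is that wholeTextFiles materializes each file as a single string, so this only helps while individual log files fit in an executor's memory.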
Thanks
Best Regards

On Sat, Jul 11, 2015 at 8:18 AM, ssbiox <sergey.korytni...@gmail.com> wrote:

> Hello,
>
> I have a very specific question on how to do a search between particular
> lines of a log file. I did some research to find the answer, and what I
> learned is that once a shuffle operation is applied to an RDD, there is no
> way to "reconstruct" the sequence of lines (except by zipping with an id).
> I'm looking for any useful approaches/workarounds that other developers
> use to solve that problem.
>
> Here is a sample:
> I have log4j log files where for each request/transaction a specific BEGIN
> and END transaction marker is printed. Somewhere in between, other classes
> may report useful statistics which need to be parsed, and unfortunately
> there is no way to keep the transaction ID with that record. What is the
> best approach to link a transaction with a particular line between the
> BEGIN and END markers?
>
> Assume only the timestamp and thread name are available:
> 2015-01-01 20:00:00 DEBUG className [Thread-0] - BEGIN TransactionID=AA000000001
> 2015-01-01 20:00:00 DEBUG className [Thread-0] - ... {some other logs}
> 2015-01-01 20:00:01 DEBUG className [Thread-0] - SQL execution time: 500ms
> 2015-01-01 20:00:02 DEBUG className [Thread-0] - ... {some other logs}
> 2015-01-01 20:00:05 DEBUG className [Thread-0] - END
>
> Finally, I want to get the result with transaction ID AA000000001 and SQL
> execution time 500ms.
>
> Another good example would probably be extracting a Java stack trace from
> logs, where the stack trace lines have no key strings (timestamp, thread
> id) at all to parse by.
>
> So far I've come up with one "idea" and one approach:
> 1) Find the file and position of the BEGIN line and run a separate
> non-Spark process to parse it line by line. In this case the question is:
> what is the best way to learn which file a line belongs to, and at what
> position? Is zipWithUniqueId helpful for that? I'm not sure it's really
> effective or that it can help to find the file name (or maybe the Hadoop
> partition).
>
> 2) Use the thread id as a key and map that key to the BEGIN/END lines.
> Then create another RDD with the same key, but for the SQL execution time
> line. Then left-join the RDDs by thread id and filter by the timestamps
> coming from both RDDs, leaving only the SQL line that comes before the END
> line (the SQL timestamp is before the END timestamp).
> An approach like this becomes very confusing when more information (lines)
> has to be extracted between BEGIN/END. Are there any recommendations on
> how to handle cases like that?
>
> Thank you,
> Sergey
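P.S. On your approach (2) above: rather than one join per extracted line type, it may generalize better to tag every line with a global index before any shuffle (the "zipping with id" you mentioned), key by thread, group, restore order by index, and then cut each thread's lines into BEGIN..END segments -- that way any number of lines per transaction falls out. A rough spark-shell sketch, again with made-up path and patterns:

val ThreadTag = """.*\[(Thread-\d+)\].*""".r
val BeginTag  = """.*BEGIN TransactionID=(\S+)""".r
val EndTag    = """.* - END""".r

val segments = sc.textFile("hdfs:///logs/*.log")
  .zipWithIndex()                 // global index, assigned before any shuffle
  .flatMap { case (line, idx) =>
    line match {
      case ThreadTag(t) => Some((t, (idx, line)))
      case _            => None   // e.g. stack-trace lines with no thread tag
    }
  }
  .groupByKey()                   // one group per thread
  .flatMap { case (_, events) =>
    // Restore original file order within the thread, then segment.
    val ordered = events.toSeq.sortBy(_._1).map(_._2)
    val out = scala.collection.mutable.ListBuffer[(String, List[String])]()
    var txn: Option[String] = None
    var buf = List.empty[String]
    for (l <- ordered) l match {
      case BeginTag(id) => txn = Some(id); buf = Nil
      case EndTag()     => txn.foreach(id => out += ((id, buf.reverse))); txn = None
      case other        => if (txn.isDefined) buf = other :: buf
    }
    out
  }
segments.collect().foreach(println) // e.g. (AA000000001, List(..., "... SQL execution time: 500ms", ...))

This doesn't cover your stack-trace case (those lines have no thread tag to key by), but the same zipWithIndex trick would let you attach each untagged line to the nearest preceding tagged one.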