Hi, I have a question about sampling Spark Streaming data, i.e. keeping only
part of the stream. For every minute, I want only the data read in during the
first 10 seconds, and I want to discard everything from the remaining 50
seconds. Is there any way to pause reading, or to discard the data received
during that period? I'm doing this to sample the stream.
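One way to approximate this (a sketch only; note the receiver keeps reading
and the unwanted batches are merely discarded, since Spark Streaming has no
built-in way to pause a receiver): use a 10-second batch interval and, in
transform(), keep only the batch covering the first 10 seconds of each
minute. The socket source, host, and port below are placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FirstTenSecondsSample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("FirstTenSecondsSample")
        val ssc  = new StreamingContext(conf, Seconds(10)) // one batch per 10 s

        val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

        // Batch times land on multiples of the batch interval, so the batch
        // ending 10 s past a minute boundary covers that minute's first 10 s.
        val sampled = lines.transform { (rdd, time) =>
          if (time.milliseconds % 60000 == 10000) rdd
          else rdd.sparkContext.emptyRDD[String]
        }

        sampled.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }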
Hi, I'm working with Spark Streaming in Scala and trying to figure out the
following problem. In my DStream[(Int, Int)], each record is a pair of Ints.
For each batch, I would like to filter out all records whose first element is
below the average of the first elements in that batch, and keep the rest.
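One per-batch approach (a sketch) is to compute the batch average inside
transform() and filter against it; note that mean() runs a job of its own,
so each batch is scanned twice.

    import org.apache.spark.streaming.dstream.DStream

    // pairs is assumed to be the DStream[(Int, Int)] from the question.
    def dropBelowBatchAverage(pairs: DStream[(Int, Int)]): DStream[(Int, Int)] = {
      pairs.transform { rdd =>
        if (rdd.isEmpty()) rdd
        else {
          // Average of the first elements in this batch.
          val avg = rdd.map(_._1.toDouble).mean()
          rdd.filter { case (first, _) => first >= avg }
        }
      }
    }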
What's the best practice for creating an RDD from the output of an external
unix command? I assume that if the output is large (say, millions of lines),
creating the RDD from an array of all the lines is not a good idea? Thanks!
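Two sketches of ways to avoid materializing the whole output on the driver
(the command name and paths are placeholders): land the output in a file and
let sc.textFile read it in partitions, or run the command on the executors
with RDD.pipe.

    import java.io.File
    import scala.sys.process._
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("CommandOutput"))

    // Option 1: write the output to a file (or HDFS), then read it in
    // partitions instead of building a driver-side array of all lines.
    ("mycommand" #> new File("/tmp/output.txt")).!
    val lines = sc.textFile("file:///tmp/output.txt")

    // Option 2: run the command on the executors, once per partition; each
    // partition's elements are fed to the command's stdin, and its stdout
    // lines become the resulting RDD.
    val piped = sc.parallelize(1 to 4, 4).pipe("mycommand")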
Hi, I'm new to Spark Streaming, and I want to create an application where
Spark Streaming builds a DStream from stdin. Basically, I have a command-line
utility that generates stream data, and I'd like to pipe that data into a
DStream. What's the best way to do that? I thought rdd.pipe() could help, but
it transforms an existing RDD through an external command rather than
ingesting new data.
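One alternative (a sketch; CommandReceiver and the command string are
hypothetical) is a custom Receiver that launches the utility on an executor
and pushes each line it prints into the stream:

    import scala.sys.process._
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class CommandReceiver(command: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      @volatile private var process: Process = _

      def onStart(): Unit = {
        // run() is non-blocking; ProcessLogger hands each output line to store().
        process = command.run(ProcessLogger(line => store(line)))
      }

      def onStop(): Unit = {
        if (process != null) process.destroy()
      }
    }

    // Usage: val stream = ssc.receiverStream(new CommandReceiver("mygen --stream"))

A simpler option, if the utility runs outside the cluster, is to pipe its
output into "nc -lk 9999" and read it with ssc.socketTextStream.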