Re: Plan on Structured Streaming in next major/minor release?

2018-11-04 Thread JackyLee
Can these things be added into this list?
1. [SPARK-24630] Support SQLStreaming in Spark
  This patch defines the Table API for Structured Streaming.
2. [SPARK-25937] Support user-defined schema in Kafka Source & Sink
  This patch makes it easier for users to work with Structured Streaming.
3. Support dynamic partition scheduling in SS
   SS uses a serial execution engine, which means it cannot catch up
with the incoming data when back pressure occurs or computing speed
drops. With dynamic partition scheduling, the number of partitions
would be increased automatically when needed, so that SS can catch up
with the required processing speed. The main idea is to trade
computing resources for time.






Re: Continuous task retry support

2018-11-04 Thread Yuanjian Li
> I found that task retries are currently not supported in continuous
> processing mode. Is there another way to recover from continuous
> task failures currently?

Yes, task-level retry is currently not supported in CP mode; recovery is
handled instead by restarting the whole stage.

> If not, are there plans to support this in a future release?

Task-level retry in CP mode is actually easy to implement for map-only
operators, but it needs more discussion before we support shuffled
stateful operators in CP. There is more discussion in
https://github.com/apache/spark/pull/20675.
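In the meantime, recovery works at the query level: run the continuous
query with a checkpoint location, and the restarted stage (or a manually
restarted query) resumes from the last committed epoch. A minimal sketch,
assuming a map-only Kafka-to-Kafka pipeline; the broker addresses, topic
names, and checkpoint path are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ContinuousRecoverySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cp-recovery").getOrCreate()

    // Source: Kafka topic (placeholder endpoint/topic).
    val in = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "input")
      .load()

    val query = in
      .selectExpr("key", "value") // map-only transformations work in CP mode
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "output")
      // Epoch offsets are committed here; after a stage or query restart,
      // processing resumes from the last committed epoch.
      .option("checkpointLocation", "/tmp/checkpoints/cp-recovery")
      .trigger(Trigger.Continuous("1 second"))
      .start()

    query.awaitTermination()
  }
}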

Basil Hariri wrote on Saturday, November 3, 2018 at 3:09 AM:

> Hi all,
>
> I found that task retries are currently not supported in continuous
> processing mode. Is there another way to recover from continuous task
> failures currently? If not, are there plans to support this in a future
> release?
>
> Thanks,
>
> Basil


Equivalent of emptyDataFrame in StructuredStreaming

2018-11-04 Thread Arun Manivannan
Hi,

This is going to come off as a silly question given that a stream is
unbounded, but it is a problem that I have (created for myself).

I am trying to build an ETL pipeline and I have a bunch of stages.

val pipelineStages = List(
  new AddRowKeyStage(EvergreenSchema),
  new WriteToHBaseStage(hBaseCatalog),
  new ReplaceCharDataStage(DoubleColsReplaceMap, EvergreenSchema, DoubleCols),
  new ReplaceCharDataStage(SpecialCharMap, EvergreenSchema, StringCols),
  new DataTypeValidatorStage(EvergreenSchema),
  new DataTypeCastStage(EvergreenSchema)
)

I would like to collect the errors at each of these stages into a
separate stream. I am using a Writer monad for this, and I have arranged
for the "collection" (log) part of the monad to also be a DataFrame.
Now, I would like to do:

val validRecords = pipelineStages.foldLeft(initDf) {
  case (dfWithErrors, stage) =>
    for {
      df      <- dfWithErrors
      applied <- stage.apply(df)
    } yield applied
}
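
Roughly, the shape of the plumbing is something like this (sketched with
cats' Writer for concreteness; the trait and type names are illustrative,
not my exact code):

import cats.Semigroup
import cats.data.Writer
import org.apache.spark.sql.DataFrame

// flatMap on Writer[DataFrame, A] needs a Semigroup for the log side;
// here it unions the accumulated error frames together.
implicit val errorFrameSemigroup: Semigroup[DataFrame] =
  Semigroup.instance((a, b) => a.union(b))

// Each stage transforms the data and logs its errors as a DataFrame.
trait PipelineStage {
  def apply(df: DataFrame): Writer[DataFrame, DataFrame]
}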

Now, the tricky bit is this:

val initDf = Writer(DataFrameOps.emptyErrorStream(spark), sourceRawDf)


The "empty" of the fold must be an empty stream. With Spark batch, I can
always use an "emptyDataFrame" but I have no clue on how to achieve this in
Spark streaming.  Unfortunately, "emptyDataFrame"  is not "isStreaming" and
therefore I won't be able to union the errors together.
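
One idea I have been toying with (not sure it is the right tool, since
MemoryStream lives in an internal package and is mostly used in Spark's
own tests) is to back the empty error stream with a MemoryStream that
never receives any data; the StageError record here is a made-up schema:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

// Hypothetical error record; the real schema would carry the stage
// name, row key, failure message, and so on.
case class StageError(stage: String, message: String)

object DataFrameOps {
  // A streaming DataFrame (isStreaming == true) that never emits rows,
  // so it can be unioned with the per-stage error streams.
  def emptyErrorStream(spark: SparkSession): DataFrame = {
    import spark.implicits._
    implicit val sqlCtx = spark.sqlContext
    MemoryStream[StageError].toDF()
  }
}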

I would appreciate it if you could give me some pointers.

Cheers,
Arun