Re: Asking for reviewing PRs regarding structured streaming

2018-07-12 Thread Jungtaek Lim
I recently added more test results to SPARK-24763 [1], which show that the
proposal reduces state size in proportion to the key-to-value size ratio,
with no performance hit and sometimes even a slight boost.

Please refer to the latest comment in the JIRA issue [2] for the numbers from
the performance tests.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24763
2.
https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367


On Mon, Jul 9, 2018 at 5:28 PM, Jungtaek Lim wrote:

> Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
> option that optimizes state size in streaming aggregation without hurting
> performance.
>
> The idea is to remove the key fields from the value side of the state row,
> since they are duplicated between key and value. This requires additional
> operations such as projection and join, but the smaller state rows also
> bring a performance benefit, so the two effects can offset each other.
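>
> A minimal sketch of the idea in plain Scala (this is not Spark's internal
> state store API; the row layout and names are illustrative only):
>
>   // Full aggregation row = key columns ++ aggregate buffer, e.g. (userId, count, sum)
>   def toStoredValue(fullRow: Seq[Any], numKeyCols: Int): Seq[Any] =
>     fullRow.drop(numKeyCols)            // "project away" the key before writing the value
>
>   def restoreFullRow(key: Seq[Any], storedValue: Seq[Any]): Seq[Any] =
>     key ++ storedValue                  // "join" the key back on read
>
>   val key    = Seq("user-1")
>   val full   = Seq("user-1", 42L, 1234.5)
>   val stored = toStoredValue(full, numKeyCols = 1)  // Seq(42L, 1234.5) -- smaller state row
>   assert(restoreFullRow(key, stored) == full)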
>
> Please refer to the comment in the JIRA issue [2] for the numbers from a
> simple performance test.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24763
>
>
> On Fri, Jul 6, 2018 at 1:54 PM, Jungtaek Lim wrote:
>
>> Ted Yu suggested posting the improved numbers to this thread, and I think
>> that's a good idea, so I'm posting them here as well. I also think that
>> explaining the rationale behind my issues would help clarify why I'm
>> submitting a couple of patches, so I'll explain that first. (Sorry for the
>> wall of text.)
>>
>> tl;dr: SPARK-24717 [1] can reduce the overall memory usage of the HDFS state
>> store provider from 10x-80x the size of a single batch's state (depending on
>> the stateful workload) to less than or around 2x. The new option is flexible,
>> so it can even get close to 1x or effectively disable the cache. Please refer
>> to the comment in the PR [2] for details. (It's hard to post detailed numbers
>> in mail format, so I'm linking a GitHub comment instead.)
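>>
>> As a usage sketch (the option name below is the one proposed in the PR and
>> may still change before merge, so treat it as an assumption; `spark` is a
>> SparkSession):
>>
>>   // Cap how many batch versions of state the HDFS-backed provider caches in memory.
>>   spark.conf.set("spark.sql.streaming.maxBatchesToRetainInMemory", "2")  // ~2x of a batch's state
>>   // A value of 1 brings it down to ~1x; 0 effectively disables the cache.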
>>
>> I'm interested in stateful stream processing in Structured Streaming, and I
>> have been learning from the codebase as well as analyzing memory usage and
>> latency (though I admit it is hard to measure latency correctly...).
>>
>>
>> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>>
>> While taking a look at HDFSBackedStateStoreProvider, I noticed a kind of
>> excessive caching. As I described in the section "The impact of storing
>> multiple versions from HDFSBackedStateStoreProvider" in the link above,
>> multiple versions share the same UnsafeRow unless the value changes, which
>> lessens the impact of caching multiple versions (credit to Jose Torres; I
>> realized this from his comment). But in workloads where a batch incurs lots
>> of writes to state, the overall memory usage of state can grow far beyond
>> expectations.
>>
>> A related patch [3] has also been submitted by another contributor (so I'm
>> not the only one to notice this behavior), though that patch may not be
>> generalized enough to apply.
>>
>> First, I decided to track the overall memory size of the state provider
>> cache and expose it in the UI as well as in the query status (SPARK-24441
>> [4]). The metric looked critical and worth monitoring, so I thought it would
>> be better to also expose it (and the watermark) via Dropwizard (SPARK-24637
>> [5]).
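>>
>> For reference, a minimal sketch of how the per-operator state metrics can
>> already be read from the public progress API (the provider cache metric and
>> the Dropwizard exposure are what the JIRAs above propose to add on top):
>>
>>   import org.apache.spark.sql.streaming.StreamingQuery
>>
>>   def logStateMemory(query: StreamingQuery): Unit =
>>     Option(query.lastProgress).foreach { progress =>  // null before the first progress
>>       progress.stateOperators.foreach { op =>
>>         println(s"state rows = ${op.numRowsTotal}, memory used = ${op.memoryUsedBytes} bytes")
>>       }
>>     }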
>>
>> Building on SPARK-24441, I found a more flexible way to resolve the issue
>> (SPARK-24717), which is what I mentioned in the tl;dr.
>>
>> So three of the five issues are coupled, all aimed at tracking and resolving
>> one underlying problem. I hope this explains why these patches are worth
>> reviewing.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24717
>> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
>> 3. https://github.com/apache/spark/pull/21500
>> 4. https://issues.apache.org/jira/browse/SPARK-24441
>> 5. https://issues.apache.org/jira/browse/SPARK-24637
>>
>> P.S. Before all the issues mentioned above, I also submitted some other
>> issues regarding feature additions/refactoring (the other two of the five).
>>
>>
>> On Fri, Jul 6, 2018 at 10:09 AM, Jungtaek Lim wrote:
>>
>>> Bump. I've been having a hard time creating additional PRs, since some of
>>> them rely on unmerged PRs, so I'm spending extra time decoupling these
>>> changes where possible.
>>>
>>> https://github.com/apache/spark/pulls/HeartSaVioR
>>>
>>> Five PRs are pending so far, and I may add more sooner or later.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sun, Jul 1, 2018 at 6:21 AM, Jungtaek Lim wrote:
>>>
 A kind reminder, since around two weeks have passed. I've added more PRs
 during those two weeks and am planning to do even more.

 On Tue, Jun 19, 2018 at 6:34 PM, Jungtaek Lim wrote:

> Hi Spark devs,
>
> I have a couple of pull requests for Structured Streaming which are
> getting older and fading out from the earlier pages of the PR

Re: Creating JDBC source table schema(DDL) dynamically

2018-07-12 Thread Thakrar, Jayesh
One option is to use plain JDBC to interrogate the PostgreSQL catalog for the
source table and generate the DDL for creating the destination table.
Then, using plain JDBC again, create the table at the destination.

See the link below for some pointers:

https://stackoverflow.com/questions/2593803/how-to-generate-the-create-table-sql-statement-for-an-existing-table-in-postgr
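
A minimal sketch of that approach in Scala (assumptions: a PostgreSQL source,
tables in the "public" schema, and only simple column types with an optional
character length; the helper name is illustrative):

import java.sql.DriverManager
import scala.collection.mutable.ListBuffer

def buildColumnTypeDdl(jdbcUrl: String, user: String, password: String, table: String): String = {
  val conn = DriverManager.getConnection(jdbcUrl, user, password)
  try {
    val stmt = conn.prepareStatement(
      """SELECT column_name, data_type, character_maximum_length
        |FROM information_schema.columns
        |WHERE table_schema = 'public' AND table_name = ?
        |ORDER BY ordinal_position""".stripMargin)
    stmt.setString(1, table)
    val rs = stmt.executeQuery()
    val cols = ListBuffer[String]()
    while (rs.next()) {
      val name = rs.getString("column_name")
      val dataType = rs.getString("data_type")  // e.g. "character varying", "integer"
      val len = Option(rs.getObject("character_maximum_length")).map(l => s"($l)").getOrElse("")
      cols += s"$name $dataType$len"
    }
    cols.mkString(",\n")
  } finally {
    conn.close()
  }
}

The returned string can then be passed straight to
.option("createTableColumnTypes", ...) on the writer.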


On 7/11/18, 9:55 PM, "Kadam, Gangadhar (GE Aviation, Non-GE)" 
 wrote:

Hi All,

I am trying to build a Spark application which will read data from
PostgreSQL (source) in one environment and write it to PostgreSQL/Aurora
(target) in a different environment (e.g., PROD to QA or QA to PROD) using
Spark JDBC.

When loading the DataFrame back into the target DB, I would like to ensure it
has the same schema as the source table, using

val targetTableSchema: String =
  """
|  operating_unit_nm character varying(20),
|  organization_id integer,
|  organization_cd character varying(30),
|  requesting_organization_id integer,
|  requesting_organization_cd character varying(50),
|  owning_organization_id integer,
|  owning_organization_cd character varying(50)
""".stripMargin


.option("createTableColumnTypes", targetTableSchema )

I would like to know if there is a way to create this targetTableSchema
(source table DDL) variable directly from the source table or from a CSV
file. I don't want Spark to apply its default type mapping. Given the table
name, how do I generate the DDL dynamically and pass it to the
targetTableSchema variable as a string?

Currently I am updating targetTableSchema manually and am looking for some
pointers to automate it.


Below is my code

import java.io.{File, FileInputStream}
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SparkSession}

// Define the parameters
val sourceDb: String = args(0)
val targetDb: String = args(1)
val sourceTable: String = args(2)
val targetTable: String = args(3)
val sourceEnv: String = args(4)
val targetEnv: String = args(5)

println("Arguments Provided: " + sourceDb, targetDb,sourceTable, 
targetTable, sourceEnv, targetEnv)

// Define the spark session
val spark: SparkSession = SparkSession
  .builder()
  .appName("Ca-Data-Transporter")
  .master("local")
  .config("driver", "org.postgresql.Driver")
  .getOrCreate()

// define the input directory
val inputDir: String = 
"/Users/gangadharkadam/projects/ca-spark-apps/src/main/resources/"

// Define the source DB properties
val sourceParmFile: String = if (sourceDb == "RDS") {
"rds-db-parms-" + sourceEnv + ".txt"
  }
  else if (sourceDb == "AURORA") {
"aws-db-parms-" + sourceEnv + ".txt"
  }
  else if (sourceDb == "GP") {
"gp-db-parms-" + sourceEnv + ".txt"
  }
  else "NA"

println(sourceParmFile)

val sourceDbParms: Properties = new Properties()
sourceDbParms.load(new FileInputStream(new File(inputDir + sourceParmFile)))
val sourceDbJdbcUrl: String = sourceDbParms.getProperty("jdbcUrl")

println(s"$sourceDb")
println(s"$sourceDbJdbcUrl")

// Define the target DB properties
val targetParmFile: String = if (targetDb == "RDS") {
s"rds-db-parms-" + targetEnv + ".txt"
  }
  else if (targetDb == "AURORA") {
s"aws-db-parms-" + targetEnv + ".txt"
  }
  else if (targetDb == "GP") {
s"gp-db-parms-" + targetEnv + ".txt"
  } else "aws-db-parms-$targetEnv.txt"

println(targetParmFile)

val targetDbParms: Properties = new Properties()
targetDbParms.load(new FileInputStream(new File(inputDir + targetParmFile)))
val targetDbJdbcUrl: String = targetDbParms.getProperty("jdbcUrl")

println(s"$targetDb")
println(s"$targetDbJdbcUrl")

// Read the source table as dataFrame
val sourceDF: DataFrame = spark
  .read
  .jdbc(url = sourceDbJdbcUrl,
    table = sourceTable,
    properties = sourceDbParms
  )
  //.filter("site_code is not null")

sourceDF.printSchema()
sourceDF.show()

val sourceDF1 = sourceDF.repartition(
  sourceDF("organization_id")
  //sourceDF("plan_id")
)


val targetTableSchema: String =
  """
|  operating_unit_nm character varying(20),
|  organization_id integer,
|  organization_cd character varying(30),
|  requesting_organization_id integer,
|  requesting_organization_cd character varying(50),
|  owning_organization_id integer,
|  owning_organization_cd character varying(50)
  """.stripMargin


// write the dataFrame
sourceDF1
  .write
  .option("createTableColumnTypes", targetTableSchema )
  .mode(saveMode = "Overwrite")
  .option("truncate", "true")
  .jdbc(targetDbJdbcUrl, targetTable, targetDbParms)


Thanks!
Gangadhar Kadam
Sr. Data Engineer
M

Re: Creating JDBC source table schema(DDL) dynamically

2018-07-12 Thread Kadam, Gangadhar (GE Aviation, Non-GE)
Thanks Jayesh.

I was aware of the catalog-table approach, but I was avoiding it because I
would hit the database twice per table: once to generate the DDL and once to
read the data. I have lots of tables to transport from one environment to the
other, and I don't want to create unnecessary load on the DB.



Re: Creating JDBC source table schema(DDL) dynamically

2018-07-12 Thread Thakrar, Jayesh
Unless the tables are very small (< 1000 rows), the impact of hitting the
catalog tables is negligible.
Furthermore, the catalog tables (or views) are usually already in memory,
because they are needed for query compilation, query execution (triggers,
referential integrity, etc.), and even for establishing a connection.

Re: Creating JDBC source table schema(DDL) dynamically

2018-07-12 Thread Kadam, Gangadhar (GE Aviation, Non-GE)
Ok. Thanks.


Re: Spark model serving

2018-07-12 Thread Maximiliano Felice
Hi,

Since I know many of you don't read / are not part of the user list, I'll
make a summary of what happened at the summit:

We discussed some needs we have in order to start serving our predictions
with Spark. We mostly talked about alternatives for this work and what we
could expect in these areas.

I'm going to share mine here, hoping it will trigger further discussion. We
currently:

   - Use Spark as an ETL tool, followed by
   - a Python (numpy/pandas based) pipeline to preprocess information and
   - use Tensorflow for training our Neural Networks


What we'd love to do, and why we don't:

   - Start using Spark for our full preprocessing pipeline. Because type
   safety. And distributed computation. And catalyst. But mainly because
   *not-python.*
   Our main issues:
  - We want to use the same code for online serving. We're not willing
  to duplicate the preprocessing operations. Spark is not
  *serving-friendly*.
  - If we want to preprocess online, we need to copy/paste our
  custom transformations to MLeap (see the sketch after this list for
  the kind of transformation involved).
  - It's an issue to communicate with a Tensorflow API to give it the
  preprocessed data to serve.
   - Use Spark to do hyperparameter tuning.
   We'd need:
  - GPU integration with Spark, letting us achieve finer tuning.
  - Better TensorFlow integration.
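
To make the duplication pain concrete, here is a minimal sketch of the kind of
custom preprocessing Transformer we'd like to reuse for online serving (the
column name and the transformation are illustrative only). Today, logic like
this has to be re-implemented, e.g. as an MLeap op, to run outside a
SparkSession:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, log1p}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

class LogAmount(override val uid: String = Identifiable.randomUID("logAmount"))
    extends Transformer {

  // Adds a log-scaled copy of the (assumed) "amount" column.
  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("amount_log", log1p(col("amount")))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("amount_log", DoubleType, nullable = true))

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}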


Now that I'm on the @dev list, do you think any of these issues could be
addressed? We talked at the summit about PFA (Portable Format for Analytics)
and how we would expect it to cover some of these issues. Another discussion
I remember was about *encoding operations (functions/lambdas) in PFA itself*.
And I don't remember having smoked anything at that point, although we could
as well have.

Oh, and @Holden Karau insisted that she would be much happier with us if we
started helping with code reviews. I'm willing to make some time for that.


Sorry again for the delay in replying to this email *(and now sorry for the
length)*. Looking forward to following up on this topic.

On Tue, Jul 3, 2018 at 3:37 PM, Saikat Kanjilal () wrote:

> Ping, would love to hear back on this.
>
>
> --
> *From:* Saikat Kanjilal 
> *Sent:* Tuesday, June 26, 2018 7:27 AM
> *To:* dev@spark.apache.org
> *Subject:* Spark model serving
>
> HoldenK and interested folks,
> Am just following up on the spark model serving discussions as this is
> highly relevant to what I’m embarking on at work.  Is there a concrete list
> of next steps or can someone summarize what was discussed at the summit ,
> would love to have a Seattle version of this discussion with some folks.
>
> Look forward to hearing back and driving this.
>
> Regards
>
> Sent from my iPhone
>


Re: Revisiting Online serving of Spark models?

2018-07-12 Thread Maximiliano Felice
Hi!

To keep things tidy, I just sent an update on an older thread asking for an
update on this, named "Spark model serving".

I propose we follow the discussion there (or here, but let's not branch it).

Bye!


On Tue, Jul 3, 2018 at 10:15 PM, Matei Zaharia () wrote:

> Just wondering, is there an update on this? I haven’t seen a summary of
> the offline discussion but maybe I’ve missed it.
>
> Matei
>
> > On Jun 11, 2018, at 8:51 PM, Holden Karau 
> wrote:
> >
> > > So I kicked off a thread on user@ to collect people's feedback there but
> I'll summarize the offline results later this week too.
> >
> > On Tue, Jun 12, 2018, 5:03 AM Liang-Chi Hsieh  wrote:
> >
> > Hi,
> >
> > It'd be great if there can be any sharing of the offline discussion.
> Thanks!

Re: Spark model serving

2018-07-12 Thread Saikat Kanjilal
Thanks so much for responding, Maximiliano; I didn't want this discussion to
disappear into the wilderness of dev emails :). Here's what I would like to
see, or contribute to, for model serving within Spark. First, I want to be
clear on what we mean by model serving, so here is my interpretation of the
definition: model serving is the ability to discover what models exist,
through the use of a model repository, and to serve up the contents of a
particular model for invocation/consumption. Before we dive into the details
from your summary, is this the definition that people have in mind? Finally,
as I mentioned earlier, when I think about models I'm initially targeting
deep/machine learning models, but eventually also models requiring lots of
compute or I/O, as frequently found in Operations Research and other worlds.
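
To make that concrete, a minimal sketch of the interfaces I have in mind (the
names and the serialized payload format are illustrative assumptions, not a
proposal):

trait ModelRepository {
  def listModels(): Seq[String]             // discovery: which models exist
  def loadModel(name: String): Array[Byte]  // fetch the serialized model contents
}

trait ModelServer {
  // invocation/consumption: score a feature vector against a named model
  def serve(modelName: String, features: Map[String, Double]): Double
}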

Given the above, I feel we also need a more robustified (nice word, huh?)
version of Livy: something that discovers and serves up any model for
downstream computation, in addition to hooking it up to Zeppelin or some
other downstream viz engine.

Would love to hear thoughts.



Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-12 Thread Ryan Blue
Thanks! I'm all for calling a vote on the SPIP. If I understand the process
correctly, the intent is for a "shepherd" to do it. I'm happy to call the
vote, or feel free to do so if you'd like to play that role.

Other comments:
* DeleteData API: I completely agree that we need a proposal for it. I think
the SQL side is easier because DELETE FROM is already a statement; we just
need to be able to identify v2 tables to use it. I'll come up with something
and send a proposal to the dev list (a purely illustrative sketch of the SQL
path follows this list).
* Table create/drop/alter/load API: I think we have agreement around the
proposed DataSourceV2 API, but we need to decide how the public API will
work and how this will fit in with ExternalCatalog (see the other thread
for discussion there). Do you think we need to get that entire SPIP
approved before we can start getting the API in? If so, what do you think
needs to be decided to get it ready?
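
For the SQL side, a purely illustrative sketch of what this could look like
once v2 tables can be identified (the catalog and table names are
hypothetical, and the statement is not generally supported yet):

// Assumes a SparkSession named `spark` and a v2 catalog registered as `testcat`.
spark.sql("DELETE FROM testcat.db.events WHERE event_date < DATE '2018-01-01'")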

Thanks!

rb

On Wed, Jul 11, 2018 at 8:24 PM Wenchen Fan  wrote:

> Hi Ryan,
>
> Great job on this! Shall we call a vote for the plan standardization SPIP?
> I think this is a good idea and we should do it.
>
> Notes:
> We definitely need new user-facing APIs to produce these new logical plans,
> like DeleteData, but we need a design doc for these new APIs after the SPIP
> passes.
> We definitely need the data source to provide the ability to
> create/drop/alter/look up tables, but that belongs to the other SPIP and
> should be voted on separately.
>
> Thanks,
> Wenchen
>
> On Fri, Apr 20, 2018 at 5:01 AM Ryan Blue 
> wrote:
>
>> Hi everyone,
>>
>> A few weeks ago, I wrote up a proposal to standardize SQL logical plans and
>> a supporting design doc for data source catalog APIs.
>> From the comments on those docs, it looks like we mostly have agreement
>> around standardizing plans and around the data source catalog API.
>>
>> We still need to work out details, like the transactional API extension,
>> but I'd like to get started implementing those proposals so we have
>> something working for the 2.4.0 release. I'm starting this thread because I
>> think we're about ready to vote on the proposal,
>> and I'd like to get any remaining discussion going or get anyone that
>> missed this to read through the docs.
>>
>> Thanks!
>>
>> rb
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix