Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Jason
need to look into custom DataSet encoders though I'm not sure what kind of gain (if any) you'd get with that approach. Jason On Fri, Jun 17, 2016, 12:38 PM Everett Anderson wrote: > Hi, > > I have a system with files in a variety of non-standard input formats, > though

Persisting sorted parquet tables for future sort merge joins

2015-08-25 Thread Jason
se, 0] [ TungstenExchange hashpartitioning(CHROM#28399,pos#28332)] [ConvertToUnsafe] [ Scan ParquetRelation[file:exploded_sorted.parquet][pos#2.399]] Thanks, Jason
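The thread predates the feature, but on later Spark releases (2.x+) one way to persist a table pre-partitioned and pre-sorted on the join keys, so later joins can use a sort-merge join without re-shuffling, is bucketing. A minimal sketch; the column and table names here are assumptions, not taken from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bucketed-write").getOrCreate()
    val variants = spark.read.parquet("exploded_sorted.parquet")

    // Bucket and sort by the join keys (column names are illustrative) so
    // subsequent joins on (chrom, pos) can skip the shuffle and sort.
    variants.write
      .bucketBy(200, "chrom", "pos")
      .sortBy("chrom", "pos")
      .format("parquet")
      .saveAsTable("variants_bucketed")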

Persisting sorted parquet tables for future sort merge joins

2015-08-26 Thread Jason
Thanks, Jason

Re: Alternative to Large Broadcast Variables

2015-08-28 Thread Jason
You could try using an external key value store (like HBase, Redis) and perform lookups/updates inside of your mappers (you'd need to create the connection within a mapPartitions code block to avoid the connection setup/teardown overhead)? I haven't done this myself though, so I'm just throwing th
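A rough Scala sketch of that mapPartitions pattern; KvClient is a made-up stand-in for whatever HBase/Redis client is actually used, and `keys` is assumed to be an RDD[String]:

    // Hypothetical key-value client standing in for an HBase/Redis connection.
    class KvClient(host: String) {
      def get(key: String): Option[String] = None // placeholder lookup
      def close(): Unit = ()
    }

    val enriched = keys.mapPartitions { iter =>
      val client = new KvClient("kv-host:6379")              // one connection per partition
      val results = iter.map(k => (k, client.get(k))).toList // materialize before closing
      client.close()
      results.iterator
    }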

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Jason
I had similar problems to this (reduce-side failures for large joins (25bn rows with 9bn)), and found the answer was to raise spark.sql.shuffle.partitions well above 1000. In my case, 16k partitions worked for me, but your tables look a little denser, so you may want to go even higher. On Thu, Aug 2
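For reference, raising that setting looks like the following (16000 mirrors the value mentioned above; the right number is workload-dependent, and the join shown assumes two DataFrames sharing an "id" column):

    // Spark 1.x
    sqlContext.setConf("spark.sql.shuffle.partitions", "16000")
    // Spark 2.x and later
    spark.conf.set("spark.sql.shuffle.partitions", "16000")

    // then run the large join as usual
    val joined = tableA.join(tableB, Seq("id"))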

Re: Getting number of physical machines in Spark

2015-08-28 Thread Jason
I've wanted similar functionality too: when network IO bound (for me I was trying to pull things from s3 to hdfs) I wish there was a `.mapMachines` api where I wouldn't have to try to guess at the proper partitioning of a 'driver' RDD for `sc.parallelize(1 to N, N).map( i=> pull the i'th chunk from S3
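The workaround being described, a one-element-per-partition "driver" RDD so each task copies exactly one chunk, looks roughly like this; the chunk count and the copy logic are placeholders:

    val numChunks = 1000 // assumption: number of S3 chunks to copy

    // One element per partition, so each task handles exactly one chunk.
    sc.parallelize(1 to numChunks, numChunks).foreach { i =>
      // placeholder: pull the i'th chunk from S3 and write it to HDFS
      println(s"copying chunk $i")
    }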

Re: Dynamic lookup table

2015-08-28 Thread Jason
your updates are atomic (transactions or CAS semantics) or you could run into race condition problems. Jason On Fri, Aug 28, 2015 at 11:39 AM N B wrote: > Hi all, > > I have the following use case that I wanted to get some insight on how to > go about doing in Spark Streaming. > &

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Jason
a which shouldn't in theory require more than 1000 reducers on my 10TB > memory cluster (~7GB of spill per reducer). > I'm now wondering if my shuffle partitions are uneven and I should use a > custom partitioner, is there a way to get stats on the partition sizes from > Spark ?
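On the follow-up question about stats on partition sizes, one cheap way is to count rows per partition; a sketch, assuming `rdd` is the shuffled RDD and `df` an equivalent DataFrame:

    // Rows per partition for an RDD
    val partitionCounts = rdd
      .mapPartitionsWithIndex { case (idx, iter) => Iterator((idx, iter.size)) }
      .collect()

    // DataFrame equivalent
    import org.apache.spark.sql.functions.spark_partition_id
    df.groupBy(spark_partition_id().as("partition")).count().show()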

[Announcement] Analytics Zoo 0.10.0 release

2021-04-28 Thread Jason Dai
Apache Spark/Flink and TF/PyTorch/BigDL/OpenVINO in a secure fashion on cloud Thanks, -Jason

[Announcement] Analytics Zoo 0.11.0 release

2021-07-21 Thread Jason Dai
wnd>, etc.) - Enhancements to Orca (scaling TF/PyTorch models to distributed Big Data) for end-to-end computer vision pipelines (distributed image preprocessing, training and inference); for more information, please see our CPVR 2021 tutorial <https://jason-dai.github.i

Spark 3 migration question

2022-05-18 Thread Jason Xu
Hi Spark user group, Spark 2.4 to 3 migration for existing Spark jobs seems a big challenge given a long list of changes in migration guide , they could introduce failures or output changes related to

[apache-spark] documentation on File Metadata _metadata struct

2024-01-10 Thread Jason Horner
All, the only documentation about the File Metadata (hidden _metadata struct) I can seem to find is on the databricks website https://docs.databricks.com/en/ingestion/file-metadata-column.html#file-metadata-column for reference here is the struct: _metadata: struct (nullable = false) |-- file_path:
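For what it's worth, the same hidden struct can be selected explicitly for file-based sources in recent open-source Spark releases as well; a quick sketch, with an assumed SparkSession named `spark` and a placeholder input path:

    val df = spark.read.parquet("/data/events") // placeholder path

    // The hidden column is not included in "*" but can be selected by name.
    df.select("*", "_metadata.file_path", "_metadata.file_name", "_metadata.file_size")
      .show(truncate = false)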

Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
unique id. To get the existing rules to be evaluated when we wanted, we kept a property on each fact that we mutated each time. Hackery, but it worked. I recommend you try hard to use a stateless KB, if it is possible. Thank you. Jason // brevity and poor typing by iPhone > On Apr 18, 2016, at

Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
Could you post some pseudo-code? val data = rdd.whatever(…) val facts: Array[DroolsCompatibleType] = convert(data) facts.map{ f => ksession.insert( f ) } > On Apr 18, 2016, at 9:20 AM, yaoxiaohua wrote: > > Thanks for your reply , Jason, > I can use stateless session in spar

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Jason Nerothin
Hi Erwan, You might consider InsightEdge: http://insightedge.io <http://insightedge.io/> . It has the capability of doing WAN between data grids and would save you the work of having to re-invent the wheel. Additionally, RDDs can be shared between developers in the same DC. Thanks,

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
a new data center > isn't really different from starting up a post-crash job in the original data > center. > > On Tue, Apr 19, 2016 at 3:32 AM, Erwan ALLAIN <mailto:eallain.po...@gmail.com>> wrote: > Thanks Jason and Cody. I'll try to explain a bit better

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
ed offset > > I understand that starting up a post crash job would work. > > Question is: how can we detect when DC2 crashes to start a new job ? > > dynamic topic partition (at each kafkaRDD creation for instance) + topic > subscription may be the answer ? > >

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
> > I understand that starting up a post crash job would work. > > Question is: how can we detect when DC2 crashes to start a new job ? > > dynamic topic partition (at each kafkaRDD creation for instance) + topic > subscription may be the answer ? > > I appreciate yo

Re: DeepSpark: where to start

2016-05-05 Thread Jason Nerothin
Just so that there is no confusion, there is a Spark user interface project called DeepSense that is actually useful: http://deepsense.io I am not affiliated with them in any way... On Thu, May 5, 2016 at 9:42 AM, Joice Joy wrote: > What the heck, I was already beginning to like it. > > On Thu,

Efficient join multiple times

2016-01-08 Thread Jason White
I'm trying to join a constant large-ish RDD to each RDD in a DStream, and I'm trying to keep the join as efficient as possible so each batch finishes within the batch window. I'm using PySpark on 1.6 I've tried the trick of keying the large RDD into (k, v) pairs and using .partitionBy(100).persist(
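In Scala, the shape of that trick (pre-partition and cache the static side once, then join inside transform) would be roughly the following; `largeRdd`, `stream`, and `keyOf` are stand-ins for the actual dataset, DStream, and key extractor:

    import org.apache.spark.HashPartitioner

    // Pre-partition and cache the static RDD once, outside the streaming loop.
    val largeByKey = largeRdd
      .map(r => (keyOf(r), r))
      .partitionBy(new HashPartitioner(100))
      .persist()

    // Key each micro-batch the same way so the big side is not reshuffled every interval.
    val joined = stream.transform { batch =>
      batch.map(r => (keyOf(r), r)).join(largeByKey)
    }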

local class incompatible: stream classdesc serialVersionUID

2016-01-28 Thread Jason Plurad
I've searched through the mailing list archive. It seems that if you try to run, for example, a Spark 1.5.2 program against a Spark 1.5.1 standalone server, you will run into an exception like this: WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 192.168.14.10

Re: local class incompatible: stream classdesc serialVersionUID

2016-01-29 Thread Jason Plurad
> Cheers > > On Thu, Jan 28, 2016 at 1:38 PM, Jason Plurad wrote: > >> I've searched through the mailing list archive. It seems that if you try >> to run, for example, a Spark 1.5.2 program against a Spark 1.5.1 standalone >> server, you will run into an exception lik

PySpark Checkpoints with Broadcast Variables

2015-09-29 Thread Jason White
I'm having trouble loading a streaming job from a checkpoint when a broadcast variable is defined. I've seen the solution by TD in Scala ( https://issues.apache.org/jira/browse/SPARK-5206) that uses a singleton to get/create an accumulator, but I can't seem to get it to work in PySpark with a broad

PySpark + Streaming + DataFrames

2015-10-16 Thread Jason White
I'm trying to create a DStream of DataFrames using PySpark. I receive data from Kafka in the form of a JSON string, and I'm parsing these RDDs of Strings into DataFrames. My code is: I get the following error at pyspark/streaming/util.py, line 64: I've verified that the sqlContext is properly
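The thread is PySpark, but the overall shape, parsing each micro-batch of JSON strings into a DataFrame, looks like this in Scala (Spark 2.x+ API; `stream` is assumed to be a DStream[String] and the view name is illustrative):

    import org.apache.spark.sql.SparkSession

    stream.foreachRDD { rdd =>                    // rdd: RDD[String] of JSON records
      val spark = SparkSession.builder().getOrCreate()
      import spark.implicits._
      val df = spark.read.json(rdd.toDS())        // build a DataFrame per batch
      df.createOrReplaceTempView("events")
      spark.sql("SELECT count(*) FROM events").show()
    }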

Re: PySpark + Streaming + DataFrames

2015-10-16 Thread Jason White
Hi Ken, thanks for replying. Unless I'm misunderstanding something, I don't believe that's correct. Dstream.transform() accepts a single argument, func. func should be a function that accepts a single RDD, and returns a single RDD. That's what transform_to_df does, except the RDD it returns is a D

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Jason White
Ah, that makes sense then, thanks TD. The conversion from RDD -> DF involves a `.take(10)` in PySpark, even if you provide the schema, so I was avoiding back-and-forth conversions. I’ll see if I can create a ‘trusted’ conversion that doesn’t involve the `take`. --  Jason On October 19, 2

Re: PySpark + Streaming + DataFrames

2015-11-02 Thread Jason White
d this, could reply back > with the trusted conversion that worked for you (for a clear solution)? > > TD > > > On Mon, Oct 19, 2015 at 3:08 PM, Jason White > wrote: > >> Ah, that makes sense then, thanks TD. >> >> The conversion from RDD -> DF involves a

Re: Upgrading Spark in EC2 clusters

2015-11-12 Thread Jason Rubenstein
wrote one so we could upgrade to a specific version of Spark (via commit-hash) and used it to upgrade from 1.4.1 to 1.5.0 best, Jason On Thu, Nov 12, 2015 at 9:49 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > spark-ec2 does not offer a way to upgrade an existing cluster

UDAF collect_list: Hive Query or spark sql expression

2016-09-23 Thread Jason Mop
Hi Spark Team, I see most Hive functions have been implemented as Spark SQL expressions, but collect_list is still using a Hive query. Will it also be implemented as an expression in the future? Any update? Cheers, Ming

[ML] Converting ml.DenseVector to mllib.Vector

2016-12-30 Thread Jason Wolosonovich
Vector` for use with `Statistics`? Thanks! -Jason - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Case class with POJO - encoder issues

2017-02-11 Thread Jason White
I'd like to create a Dataset using some classes from Geotools to do some geospatial analysis. In particular, I'm trying to use Spark to distribute the work based on ID and label fields that I extract from the polygon data. My simplified case class looks like this: implicit val geometryEncoder: Enc
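When a field type has no built-in encoder (like a Geotools geometry), the usual escape hatch is a Kryo encoder; a sketch with a placeholder class standing in for the real POJO, assuming a SparkSession named `spark`:

    import org.apache.spark.sql.{Encoder, Encoders}

    // Placeholder standing in for the Geotools geometry type used in the thread.
    class Geometry(val wkt: String) extends Serializable

    case class Feature(id: Long, label: String, geom: Geometry)

    // Kryo-encode the whole case class; rows become a single binary column, so keep
    // fields you need to query (id, label) as separate columns if that matters.
    implicit val featureEncoder: Encoder[Feature] = Encoders.kryo[Feature]
    val ds = spark.createDataset(Seq(Feature(1L, "park", new Geometry("POINT(0 0)"))))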

spark.pyspark.python is ignored?

2017-06-29 Thread Jason White
According to the documentation, `spark.pyspark.python` configures which python executable is run on the workers. It seems to be ignored in my simple test case. I'm running on a pip-installed Pyspark 2.1.1, completely stock. The only customization at this point is my Hadoop configuration directory.

[Spark SQL] How to run a custom meta query for `ANALYZE TABLE`

2018-01-02 Thread Jason Heo
ts into metastore_db manually. Is there any API exposed to handle metastore_db (e.g. insert/delete meta db)? Regards, Jason

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
Hi Geoff - Appreciate the help here - I do understand what you’re saying below. And I am able to get this working when I submit a job to a local cluster. I think part of the issue here is that there’s ambiguity in the terminology. When I say “LOCAL” spark, I mean an instance of spark that is

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
.config("Option", > "value")` when creating the spark session, and then other runtime options as > you describe above with `spark.conf.set`. At this point though I've just > moved everything out into a `spark-submit` script. > > On Fri, Apr 13, 2018 at 8:1

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
is not. Once again - I’m trying to solve the use case for master=local, NOT for a cluster and NOT with spark-submit. > On Apr 13, 2018, at 12:47 PM, yohann jardin wrote: > > Hey Jason, > Might be related to what is behind your variable ALLUXIO_SPARK_CLIENT and > where is loca

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
Ok thanks - I was basing my design on this: https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html Wherein it says: Once the SparkSession is instantiated, you can configure

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread Jason Boorn
this - I feel like my particular use case is clouding what is already a tricky issue. > On Apr 13, 2018, at 2:26 PM, Gene Pang wrote: > > Hi Jason, > > Alluxio does work with Spark in master=local mode. This is because both > spark-submit and spark-shell have command-line

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-14 Thread Jason Boorn
Ok great I’ll give that a shot - Thanks for all the help > On Apr 14, 2018, at 12:08 PM, Gene Pang wrote: > > Yes, I think that is the case. I haven't tried that before, but it should > work. > > Thanks, > Gene > > On Fri, Apr 13, 2018 at 11:32 AM, Jason

Upcoming talks on BigDL and Analytics Zoo this week

2019-03-24 Thread Jason Dai
rata/strata-ca/public/schedule/detail/72802> at *Strata Data Conference in San Francisco* (March 28, 3:50–4:30pm) If you plan to attend these events, please drop by and talk to the speakers :-) Thanks, -Jason

streaming - absolute maximum

2019-03-25 Thread Jason Nerothin
e during streaming operation. Do I write two flavors of the query - one as a static Dataset for initiation and another for realtime? Is my logic incorrect? Thanks, Jason -- Thanks, Jason

Re: How to extract data in parallel from RDBMS tables

2019-03-28 Thread Jason Nerothin
wrote: > Hi All, > > Is there any way to copy all the tables in parallel from RDBMS using > Spark? We are looking for a functionality similar to Sqoop. > > Thanks, > Surendra > > -- Thanks, Jason

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Jason Nerothin
2 workers with 2 executors. The link for running > application “name” goes back to my server, the machine that launched the > job. > > > > This is spark.submit.deployMode = “client” according to the docs. I set > the Driver to run on the cluster but it runs on the client, ignoring the > spark.submit.deployMode. > > > > Is this as expected? It is documented nowhere I can find. > > > > > -- > Marcelo > > -- Thanks, Jason

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Jason Nerothin
Meant this one: https://docs.databricks.com/api/latest/jobs.html On Thu, Mar 28, 2019 at 5:06 PM Pat Ferrel wrote: > Thanks, are you referring to > https://github.com/spark-jobserver/spark-jobserver or the undocumented > REST job server included in Spark? > > > From: Jason

Re: How to extract data in parallel from RDBMS tables

2019-03-29 Thread Jason Nerothin
How many tables? What DB? On Fri, Mar 29, 2019 at 00:50 Surendra , Manchikanti < surendra.manchika...@gmail.com> wrote: > Hi Jason, > > Thanks for your reply, But I am looking for a way to parallelly extract > all the tables in a Database. > > > On Thu, Mar 28, 201

Re: Spark SQL API taking longer time than DF API.

2019-03-30 Thread Jason Nerothin
-- > > > As per my understanding, both Spark SQL and DtaaFrame API generate the > same code under the hood and execution time has to be similar. > > > Regards, > > Neeraj > > > -- Thanks, Jason

Re: How to extract data in parallel from RDBMS tables

2019-04-02 Thread Jason Nerothin
number of tables. > > > On Fri, Mar 29, 2019 at 5:04 AM Jason Nerothin > wrote: > >> How many tables? What DB? >> >> On Fri, Mar 29, 2019 at 00:50 Surendra , Manchikanti < >> surendra.manchika...@gmail.com> wrote: >> >>> Hi Jason, >>> >&g

Re: Upcoming talks on BigDL and Analytics Zoo this week

2019-04-03 Thread Jason Dai
The slides of the two technical talks for BigDL and Analytics Zoo ( https://github.com/intel-analytics/analytics-zoo/) at Strata Data Conference have been uploaded to https://analytics-zoo.github.io/master/#presentations/. Thanks, -Jason On Sun, Mar 24, 2019 at 9:03 PM Jason Dai wrote: >

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
'invoice_id'))) >>>>>> out_df = df.filter(df['update_time'] == df['wanted_time']) >>>>>> .drop('wanted_time').dropDuplicates('invoice_id', 'update_time') >>>>>> >>>>>> The min() is faster than doing an orderBy() and a row_number(). >>>>>> And the dropDuplicates at the end ensures records with two values for >>>>>> the same 'update_time' don't cause issues. >>>>>> >>>>>> >>>>>> On Thu, Apr 4, 2019 at 10:22 AM Chetan Khatri < >>>>>> chetan.opensou...@gmail.com> wrote: >>>>>> >>>>>>> Hello Dear Spark Users, >>>>>>> >>>>>>> I am using dropDuplicate on a DataFrame generated from large parquet >>>>>>> file from(HDFS) and doing dropDuplicate based on timestamp based column, >>>>>>> every time I run it drops different - different rows based on same >>>>>>> timestamp. >>>>>>> >>>>>>> What I tried and worked >>>>>>> >>>>>>> val wSpec = Window.partitionBy($"invoice_ >>>>>>> id").orderBy($"update_time".desc) >>>>>>> >>>>>>> val irqDistinctDF = irqFilteredDF.withColumn("rn", >>>>>>> row_number.over(wSpec)).where($"rn" === 1) >>>>>>> .drop("rn").drop("update_time") >>>>>>> >>>>>>> But this is damn slow... >>>>>>> >>>>>>> Can someone please throw a light. >>>>>>> >>>>>>> Thanks >>>>>>> >>>>>>> -- Thanks, Jason
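A cleaned-up Scala version of the min()-over-window suggestion quoted above; the DataFrame and column names follow the thread:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, min}

    val w = Window.partitionBy(col("invoice_id"))

    // Keep only the earliest update_time per invoice, then drop exact ties.
    val deduped = irqFilteredDF
      .withColumn("wanted_time", min(col("update_time")).over(w))
      .filter(col("update_time") === col("wanted_time"))
      .drop("wanted_time")
      .dropDuplicates("invoice_id", "update_time")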

Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-04 Thread Jason Nerothin
nder immediately > by phone or email and permanently delete this Communication from your > computer without making a copy. Thank you. -- Thanks, Jason

Re: dropDuplicate on timestamp based column unexpected output

2019-04-04 Thread Jason Nerothin
My thinking is that if you run everything in one partition - say 12 GB - then you don't experience the partitioning problem - one partition will have all duplicates. If that's not the case, there are other options, but would probably require a design change. On Thu, Apr 4, 2019 at 8:4

Re: reporting use case

2019-04-04 Thread Jason Nerothin
re welcome. > > > Thanks, > Prasad > > > > Thanks, > Prasad > -- Thanks, Jason

Re: combineByKey

2019-04-05 Thread Jason Nerothin
accumulator2: Set[String]) => accumulator1 ++ > accumulator2 > > sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.id, (x.id, > x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner) > > *Compile Error:-* > found : (String, String) => scala.collection.immutable.Set[String] > required: ((String, String)) => ? > sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.id, (x.id, > x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner) > > Regards, > Rajesh > > -- Thanks, Jason
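The compile error quoted above comes from writing createCombiner as a two-argument function, while combineByKey expects single-argument functions over the value type, which here is the tuple (id, value). A sketch of the fixed shapes, with an illustrative Message class and sample data:

    case class Message(timeStamp: Long, id: String, value: String)
    val messages = Seq(Message(0L, "d1", "t1"), Message(0L, "d1", "t2"), Message(0L, "d2", "t1"))

    val createCombiner  = (v: (String, String)) => Set(v._2)
    val mergeValue      = (acc: Set[String], v: (String, String)) => acc + v._2
    val mergeCombiners  = (a: Set[String], b: Set[String]) => a ++ b

    val combined = sc.parallelize(messages)
      .map(x => (s"${x.timeStamp}-${x.id}", (x.id, x.value)))
      .combineByKey(createCombiner, mergeValue, mergeCombiners)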

Re: Checking if cascading graph computation is possible in Spark

2019-04-05 Thread Jason Nerothin
--- To > unsubscribe e-mail: user-unsubscr...@spark.apache.org -- Thanks, Jason

Re: combineByKey

2019-04-05 Thread Jason Nerothin
(x.Id, x.value))).aggregateByKey(Set[String]())( > (aggr, value) => aggr ++ Set(value._2), > (aggr1, aggr2) => aggr1 ++ aggr2).collect().toMap > > print(result) > > Map(0-d1 -> Set(t1, t2, t3, t4), 0-d2 -> Set(t1, t5, t6, t2), 0-d3 -> > Set(t1, t2

Re: Checking if cascading graph computation is possible in Spark

2019-04-05 Thread Jason Nerothin
eful > stabbing > > But, I am not sure how that helps here > > On Fri, 5 Apr 2019, 10:13 pm Jason Nerothin, > wrote: > >> Have you looked at Arbitrary Stateful Streaming and Broadcast >> Accumulators? >> >> On Fri, Apr 5, 2019 at 10:55 AM Basavara

Re: Structured streaming flatMapGroupWithState results out of order messages when reading from Kafka

2019-04-09 Thread Jason Nerothin
flatMapGroupWithState with Kafka > source? > > This is my pipeline; > > Kafka => GroupByKey(key from Kafka schema) => flatMapGroupWithState => > parquet > > When I printed out the Kafka offset for each key inside my state update > function they are not in order. I am using spark 2.3.3. > > Thanks & Regards, > Akila > > > > -- Thanks, Jason

Re: Spark2: Deciphering saving text file name

2019-04-09 Thread Jason Nerothin
each write action completes. HTH, Jason On Mon, Apr 8, 2019 at 19:55 Subash Prabakar wrote: > Hi, > While saving in Spark2 as text file - I see encoded/hash value attached in > the part files along with part number. I am curious to know what is that > value is about ?

BigDL and Analytics Zoo talks at upcoming Spark+AI Summit and Strata London

2019-04-18 Thread Jason Dai
Spark and BigDL <https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/74077> at *Strata Data Conference in London* (16:35-17:15pm, Wednesday, 1 May 2019) If you plan to attend these events, please drop by and talk to the speakers :-) Thanks, -Jason

Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread Jason Nerothin
to 226 jar files in $SPARK_HOME/jars and usually you're implementing against just a few of the essential ones like spark-sql. --jars is what to use for spark-shell. Final related tidbit: If you're implementing in Scala, make sure your jars are version-compatible with the scala compiler vers

Re: Update / Delete records in Parquet

2019-04-22 Thread Jason Nerothin
Hi Chetan, Do you have to use Parquet? It just feels like it might be the wrong sink for a high-frequency change scenario. What are you trying to accomplish? Thanks, Jason On Mon, Apr 22, 2019 at 2:09 PM Chetan Khatri wrote: > Hello All, > > If I am doing incremental load / delta

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-29 Thread Jason Nerothin
> Is there any way to override this, other than using na.fill functions >> >> Regards, >> Snehasish >> > -- Thanks, Jason

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-29 Thread Jason Nerothin
See also here: https://stackoverflow.com/questions/44671597/how-to-replace-null-values-with-a-specific-value-in-dataframe-using-spark-in-jav On Mon, Apr 29, 2019 at 5:27 PM Jason Nerothin wrote: > Spark SQL has had an na.fill function on it since at least 2.1. Would that > work f
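A minimal sketch of the na.fill approach mentioned above; the column names and default values are assumptions:

    // Replace nulls with defaults before writing or processing downstream.
    val filled = df
      .na.fill("unknown", Seq("name", "city"))  // string columns
      .na.fill(0, Seq("age"))                   // numeric columns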

Re: Deep Learning with Spark, what is your experience?

2019-05-05 Thread Jason Dai
https://software.intel.com/en-us/articles/talroo-uses-analytics-zoo-and-aws-to-leverage-deep-learning-for-job-recommendations> Thanks, -Jason On Sun, May 5, 2019 at 6:29 AM Riccardo Ferrari wrote: > Thank you for your answers! > > While it is clear each DL framework can solve the di

Re: Deep Learning with Spark, what is your experience?

2019-05-05 Thread Jason Dai
on of Keras 1.2.2 on Spark (using BigDL). Thanks, -Jason On Mon, May 6, 2019 at 5:37 AM Riccardo Ferrari wrote: > Thanks everyone, I really appreciate your contributions here. > > @Jason, thanks for the references I'll take a look. Quickly checking > github: > https://github.com/

Re: Streaming job, catch exceptions

2019-05-12 Thread Jason Nerothin
ubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Thanks, Jason

Re: BigDL and Analytics Zoo talks at upcoming Spark+AI Summit and Strata London

2019-05-14 Thread Jason Dai
The slides for the talks have been uploaded to https://analytics-zoo.github.io/master/#presentations/. Thanks, -Jason On Fri, Apr 19, 2019 at 9:55 PM Khare, Ankit wrote: > Thanks for sharing. > > Sent from my iPhone > > On 19. Apr 2019, at 01:35, Jason Dai wrote: > > Hi

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-15 Thread Jason Nerothin
thers) - to provide a single abstraction that works across Scala, Java, Python, and R. But what they came up with required the APIs you list to make it work.) Think carefully about what new things you're trying to provide and what things you're trying to hide beneath your abstraction. HTH

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
all). You might also look at the way the Apache Livy <https://livy.incubator.apache.org/> team is implementing their solution. HTH Jason On Tue, May 21, 2019 at 6:04 AM bsikander wrote: > Ok, I found the reason. > > In my QueueStream example, I have a while(true) which keeps o

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
Correction: The Driver manages the Tasks, the resource manager serves up resources to the Driver or Task. On Tue, May 21, 2019 at 9:11 AM Jason Nerothin wrote: > The behavior is a deliberate design decision by the Spark team. > > If Spark were to "fail fast", it would prev

Re: Streaming job, catch exceptions

2019-05-21 Thread Jason Nerothin
job failure (current behavior). By analogy to the halting problem <https://en.wikipedia.org/wiki/Chaitin%27s_constant#Relationship_to_the_halting_problem>, I believe that expecting a program to handle all possible exceptional states is unreasonable. Jm2c Jason On Tue, May 21, 2019 at 9

[Announcement] Analytics Zoo 0.5 release

2019-06-17 Thread Jason Dai
releases. For more details, you may refer to the project website at https://github.com/intel-analytics/analytics-zoo/ Thanks, -Jason

alternatives to shading

2019-12-17 Thread Jason Nerothin
. We are considering implementing our own ClassLoaders and/or rebuilding and shading the spark distribution. Are there better alternatives? -- Thanks, Jason

Re: Why Apache Spark doesn't use Calcite?

2020-01-13 Thread Jason Nerothin
-- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Thanks, Jason

[Announcement] Analytics Zoo 0.7.0 release

2020-01-20 Thread Jason Dai
https://github.com/intel-analytics/analytics-zoo/ Thanks, -Jason

Re: [Announcement] Analytics Zoo 0.7.0 release

2020-01-20 Thread Jason Dai
Fixed one typo below: should be TensorFlow 1.15 support in tfpark :-) On Tue, Jan 21, 2020 at 7:52 AM Jason Dai wrote: > Hi all, > > > > We are happy to announce the 0.7.0 release of Analytics Zoo > <https://github.com/intel-analytics/analytics-zoo/>, a unified Data >

[Announcement] Analytics Zoo 0.8 release

2020-04-27 Thread Jason Dai
https://analytics-zoo.github.io/0.8.1/#gettingstarted/> page. Thanks, -Jason

Fwd: [Announcement] Analytics Zoo 0.8 release

2020-04-27 Thread Jason Dai
FYI :-) -- Forwarded message - From: Jason Dai Date: Tue, Apr 28, 2020 at 10:31 AM Subject: [Announcement] Analytics Zoo 0.8 release To: BigDL User Group Hi all, We are happy to announce the 0.8 release of Analytics Zoo <https://github.com/intel-analytics/analytics-zoo/&

Re: Running Example Spark Program

2015-02-22 Thread Jason Bell
If you would like a more detailed walkthrough I wrote one recently. https://dataissexy.wordpress.com/2015/02/03/apache-spark-standalone-clusters-bigdata-hadoop-spark/ Regards Jason Bell On 22 Feb 2015 14:16, "VISHNU SUBRAMANIAN" wrote: > Try restarting your Spark cluster.

Re: Speed Benchmark

2015-02-27 Thread Jason Bell
How many machines are on the cluster? And what is the configuration of those machines (Cores/RAM)? "Small cluster" is a very subjective statement. Guillaume Guy wrote: Dear Spark users: I want to see if anyone has an idea of the performance for a small cluster.

Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
latest DataFrame API, and will provide a stable version for people to try soon. Thanks, -Jason On Wed, Mar 11, 2015 at 9:12 AM, Tobias Pfeiffer wrote: > Hi, > > On Wed, Mar 11, 2015 at 9:33 AM, Cheng, Hao wrote: > >> Intel has a prototype for doing this, SaiSai and Jason are the a

Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Sorry typo; should be https://github.com/intel-spark/stream-sql Thanks, -Jason On Wed, Mar 11, 2015 at 10:19 PM, Irfan Ahmad wrote: > Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql > > > *Irfan Ahmad* > CTO | Co-Founder | *CloudPhysics* <http://ww

EC2 Cluster script. Shark install fails

2014-07-10 Thread Jason H
Hi Just going through the process of installing Spark 1.0.0 on EC2 and noticed that the script throws an error when installing shark. Unpacking Spark ~/spark-ec2 Initializing shark ~ ~/spark-ec2 ERROR: Unknown Shark version The install completes in the end but shark is completely missed. Lo

RE: EC2 Cluster script. Shark install fails

2014-07-11 Thread Jason H
-on-spark.html http://spark.apache.org/docs/latest/sql-programming-guide.html On Thu, Jul 10, 2014 at 5:51 AM, Jason H wrote: Hi Just going through the process of installing Spark 1.0.0 on EC2 and noticed that the script throws an error when installing shark. Unpacking Spark ~/spark-

OutOfMemor​yError when spark streaming receive flume events

2014-08-12 Thread jason chen
I checked the javacore file, there is: Dump Event "systhrow" (0004) Detail "java/lang/OutOfMemoryError" "Java heap space" received After checking the failure thread, I found it occurs in the SparkFlumeEvent.readExternal() method: 71 for (i <- 0 until numHeaders) { 72 val keyLength = in.rea

Applications status missing when Spark HA(zookeeper) enabled

2014-09-11 Thread jason chen
Hi guys, I configured Spark with the configuration in spark-env.sh: export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=host1:2181,host2:2181,host3:2181 -Dspark.deploy.zookeeper.dir=/spark" And I started spark-shell on one master host1(active): MASTER

Re: Applications status missing when Spark HA(zookeeper) enabled

2014-09-12 Thread jason chen
Has anybody met this high availability problem with zookeeper? 2014-09-12 10:34 GMT+08:00 jason chen : > Hi guys, > > I configured Spark with the configuration in spark-env.sh: > export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER > -Dspark.deploy.zookeeper.

Is it possible to store graph directly into HDFS?

2014-12-30 Thread Jason Hong
your suggestion. Best regards. Jason Hong -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-store-graph-directly-into-HDFS-tp20908.html Sent from the Apache Spark User List mailing list archive at Nabbl

Re: Is it possible to store graph directly into HDFS?

2014-12-30 Thread Jason Hong
Thanks for your answer, Xuefeng Wu. But, I don't understand how to save a graph as object. :( Do you have any sample codes? 2014-12-31 13:27 GMT+09:00 Jason Hong : > Thanks for your answer, Xuefeng Wu. > > But, I don't understand how to save a graph as object. :( > &g

Re: Issue with VAE Model - KerasTensor Incompatibility with TensorFlow Functions

2024-10-21 Thread Jason Tan
Jjjb On Mon, 21 Oct 2024, 17:59 Mich Talebzadeh, wrote: > > Spark version 3.4.0 > Python 3.9.16 > tensorflow 2.17.0 > > Hi, > > I encountered an issue while building a VAE (Variational Autoencoder > ) model using the > following configuratio

Create dataset from dataframe with missing columns

2017-06-14 Thread Tokayer, Jason M.
Is it possible to concisely create a dataset from a dataframe with missing columns? Specifically, suppose I create a dataframe with: val df: DataFrame = Seq(("v1"),("v2")).toDF("f1") Then, I have a case class for a dataset defined as: case class CC(f1: String, f2: Option[String] = None) I’d lik
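One way to do this, under the assumption that adding the missing column as a typed null is acceptable: add f2 with lit(null) cast to the right type, then convert with as[CC]. A sketch, assuming a SparkSession named `spark`:

    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.StringType
    import spark.implicits._

    case class CC(f1: String, f2: Option[String] = None)

    val df = Seq("v1", "v2").toDF("f1")
    val ds = df.withColumn("f2", lit(null).cast(StringType)).as[CC]  // f2 comes back as None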

Create dataset from dataframe with missing columns

2017-06-14 Thread Tokayer, Jason M.
Is it possible to concisely create a dataset from a dataframe with missing columns? Specifically, suppose I create a dataframe with: val df: DataFrame = Seq(("v1"),("v2")).toDF("f1") Then, I have a case class for a dataset defined as: case class CC(f1: String, f2: Option[String] = None) I’d lik