Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

2020-08-17 Thread chris
Hi, I took your code and ran it on Spark 2.4.5 and it works OK for me. My first thought, like Sean's, is that you have some Spark ML version mismatch somewhere. Chris > On 17 Aug 2020, at 16:18, Sean Owen wrote: > > > Hm, next guess: you need a no-arg constructor this() on Fo

Re: Time-Series Forecasting

2018-09-19 Thread chris
There’s also flint: https://github.com/twosigma/flint > On 19 Sep 2018, at 17:55, Jörn Franke wrote: > > What functionality do you need ? Ie which methods? > >> On 19. Sep 2018, at 18:01, Mina Aslani wrote: >> >> Hi, >> I have a question for you. Do we have any Time-Series Forecasting library

Re: Caused by: java.io.NotSerializableException: com.softwaremill.sttp.FollowRedirectsBackend

2018-11-30 Thread chris
spark being able to use the serializable versions. That’s very much a last resort though! Chris > On 30 Nov 2018, at 05:08, Koert Kuipers wrote: > > if you only use it in the executors sometimes using lazy works > >> On Thu, Nov 29, 2018 at 9:45 AM James Starks >>

Re: Spark SQL API taking longer time than DF API.

2019-04-08 Thread chris
information should allow us to figure out what. Thanks, Chris > On 8 Apr 2019, at 09:21, neeraj bhadani wrote: > > Hi All, > Can anyone help me here with my query? > > Regards, > Neeraj > >> On Mon, Apr 1, 2019 at 9:44 AM neeraj bhadani >> wrote:

How to speedup your Spark ML training

2019-04-15 Thread chris
achine-learning/ Visit us at Spark Summit in San Francisco next week and we would be happy to provide more info on how we can help speed up your Spark ML applications and at the same time reduce the TCO. Chris Kachris www.inaccel.com <http://www.inaccel.com>

Re: Spark standalone and Pandas UDF from custom archive

2019-05-25 Thread chris
Hi, The solution we have is to make a generic spark-submit Python file (driver.py). This is just a main method which takes a single parameter: the module containing the app you want to run. The main method itself just dynamically loads the module and executes some well-known method on it (we us

Datasource V2- Heavy Metadata Query

2020-04-22 Thread chris
is cached for the lifetime of the dataframe. In the case of parquet files Spark solves this issue via FileScanRDD. For Datasource V2 it’s not obvious how one would solve a similar problem. Does anyone have any ideas or prior art here? Thanks, Chris

Re: Shuffle write increases in spark 1.2

2015-02-10 Thread chris
by providing an answer ;) Any help will be greatly appreciated, because otherwise I'm stuck with Spark 1.1.0, as quadrupling runtime is not an option. Sincerely, Chris 2015-02-09T14:06:06.328+01:00 INFO org.apache.spark.executor.Executor Running task 9.0 in stage 18.0 (TID 300)E

Re: Shuffle write increases in spark 1.2

2015-02-10 Thread chris
ee https://issues.apache.org/jira/browse/SPARK-5715 Any help will be greatly appreciated, because otherwise I'm stuck with Spark 1.1.0, as quadrupling runtime is not an option. Sincerely, Chris 2015-02-09T14:06:06.328+01:00 INFO org.apache.spark.executor.Executor Running task 9.0 in st

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Chris Coutinho
I'm also curious if this is possible, so while I can't offer a solution maybe you could try the following. The driver and executor nodes need to have access to the same (distributed) file system, so you could try to mount the file system to your laptop, locally, and then try to submit jobs and/or

Spark with External Shuffle Service - using saved shuffle files in the event of executor failure

2021-05-12 Thread Chris Thomas
  <https://stackoverflow.com/questions/67466878/can-spark-with-external-shuffle-service-use-saved-shuffle-files-in-the-event-of>open on the same if you would rather answer directly there). Kind regards, Chris

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Chris Martin
One thing I would check is this line: val fetchedRdd = rdd.map(r => (r.getGroup, r)) how many distinct groups do you end up with? If there's just one then I think you might see the behaviour you observe. Chris On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote: > Also just to f

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Chris Martin
oving "spark.task.cpus":"16" or setting spark.executor.cores to 1. 4. print out the group keys and see if there's any weird pattern to them. 5. See if the same thing happens in spark local. If you have a reproducible example you can post publically then I'm happy to take a look. Ch

Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Chris Coutinho
10MB spark.sql.files.minPartitionNum 1000 Unfortunately we still see a large number of empty partitions and a small number containing the rest of the data (see median vs max number of input records). [image: image.png] Any help would be much appreciated Chris

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Chris Coutinho
r, I don't quite understand the link between the splitting settings, row group configuration, and resulting number of records when reading from a delta table. For more specifics: we're running Spark 3.1.2 using ADLS as cloud storage. Best, Chris On Fri, Feb 11, 2022 at 1:40 PM Adam Binford w

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Chris Coutinho
483 num_row_groups: 1 format_version: 1.0 serialized_size: 6364 Columns ... Chris On Fri, Feb 11, 2022 at 3:37 PM Sean Owen wrote: > It should just be parquet.block.size indeed. > spark.write.option("parquet.block.size", "16m").parquet(...) > This is an issue
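For reference, Sean's suggestion as a minimal runnable sketch (the path and size are illustrative):

    // smaller Parquet row groups give Spark more split points on read
    df.write
      .option("parquet.block.size", (16 * 1024 * 1024).toString)
      .parquet("/path/to/output")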

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-12 Thread Chris Coutinho
. We set > spark.hadoop.parquet.block.size in our spark config for writing to Delta. > > Adam > > On Fri, Feb 11, 2022, 10:15 AM Chris Coutinho > wrote: > >> I tried re-writing the table with the updated block size but it doesn't >> appear to have an effect

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-12 Thread Chris Coutinho
es in the static table for that particular non-unique key. The key is a single column. Thanks, Chris On Fri, 2022-02-11 at 20:10 +, Gourav Sengupta wrote: > Gourav Sengupta

Re: cannot write spark log to s3a

2022-11-10 Thread Chris Nauroth
blob/v2.2.8/gcs/CONFIGURATION.md#io-configuration [3] https://issues.apache.org/jira/browse/HADOOP-13327 [4] https://issues.apache.org/jira/browse/HADOOP-17597 Chris Nauroth On Wed, Nov 9, 2022 at 11:04 PM second_co...@yahoo.com.INVALID wrote: > when running spark job, i used > > &q

Re: VolcanoFeatureStep( Custom Scheduler ) not found in Spark 3.3.1 archive

2022-11-18 Thread Chris Nauroth
ou'll hit after this. Speaking just in terms of what the build does though, this should be sufficient. I hope this helps. Chris Nauroth On Thu, Nov 17, 2022 at 11:32 PM Gnana Kumar wrote: > I have maven built the spark-kubernetes jar ( > spark-kubernetes_2.12-3.3.2-SNAPSHOT ) but wh

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-02 Thread Chris Nauroth
ficiency. (Of course, we're not always in complete control of the data formats we're given, so the support for bz2 is there.) [1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html [2] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aw

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-05 Thread Chris Nauroth
re also would not be any way to produce bgzip-style output like in the df.write.option code sample. To achieve either of those, it would require writing a custom Hadoop compression codec to integrate more closely with the data format. Chris Nauroth On Mon, Dec 5, 2022 at 2:08 PM Olive

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Chris Gore
d = (binary_test - binary_train.mean()) / binary_train.std() On a data set this small, the difference in models could also be the result of how the training/test sets were split. Have you tried running k-folds cross validation on a larger data set? Chris > On May 20, 2015, at 6:15 PM, DB Tsai wrote: >

Issues with `when` in Column class

2015-06-12 Thread Chris Freeman
I’m trying to iterate through a list of Columns and create new Columns based on a condition. However, the when method keeps giving me errors that don’t quite make sense. If I do `when(col === “abc”, 1).otherwise(0)` I get the following error at compile time: [error] not found: value when Howe

Re: Issues with `when` in Column class

2015-06-12 Thread Chris Freeman
That did it! Thanks! From: Yin Huai Date: Friday, June 12, 2015 at 10:31 AM To: Chris Freeman Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Re: Issues with `when` in Column class Hi Chris, Have you imported "org.apache.spark.sql.functions._"? Th
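For anyone landing here from a search, a minimal sketch of the fix (the DataFrame and column name are illustrative):

    import org.apache.spark.sql.functions._  // brings `when` and `col` into scope

    val flagged = df.withColumn("flag", when(col("category") === "abc", 1).otherwise(0))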

How to silence Parquet logging?

2015-06-13 Thread Chris Freeman
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR This does, in fact, silence the logging for everything else, but the Parquet config seems totally unchanged. Does anyone know how to do this? Thanks! -Chris Freeman

RE: How to silence Parquet logging?

2015-06-14 Thread Chris Freeman
and reading a local Parquet file from my hard drive. -Chris From: Cheng Lian [lian.cs@gmail.com] Sent: Saturday, June 13, 2015 6:56 PM To: Chris Freeman; user@spark.apache.org Subject: Re: How to silence Parquet logging? Hi Chris, Which Spark version were you

Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi, Welcome, you probably want RDD.saveAsTextFile(“hdfs:///my_file”) Chris > On Jun 22, 2015, at 5:28 PM, ravi tella wrote: > > > Hello All, > I am new to Spark. I have a very basic question.How do I write the output of > an action on a RDD to HDFS? > > Thanks

Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi, For this case, you could simply do sc.parallelize([rdd.first()]).saveAsTextFile(“hdfs:///my_file”) using pyspark or sc.parallelize(Array(rdd.first())).saveAsTextFile(“hdfs:///my_file”) using Scala Chris > On Jun 22, 2015, at 5:53 PM, ddpis...@gmail.com wrote: > > Hi Chris,
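Put together as a runnable Scala sketch (paths are illustrative), assuming the goal is to persist the single result of an action:

    val rdd = sc.textFile("hdfs:///input")
    // wrap the driver-side result back into an RDD so it can be saved to HDFS
    sc.parallelize(Seq(rdd.first())).saveAsTextFile("hdfs:///my_file")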

Avro SerDe Issue w/ Manual Partitions?

2016-03-02 Thread Chris Miller
add the partitions manually so that I can specify a location. For what it's worth, "ActionEnum" is the first field in my schema. This same table and query structure works fine with Hive. When I try to run this with SparkSQL, however, I get the above error. Anyone have any idea what the problem is here? Thanks! -- Chris Miller

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
.lang.Thread.run(Thread.java:745) **** In addition to the above, I also tried putting the test Avro files on HDFS instead of S3 -- the error is the same. I also tried querying from Scala instead of using Zeppelin, and I get the same error. Where should I begin with troubleshooting th

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
ool.run(DataFileWriteTool.java:99) at org.apache.avro.tool.Main.run(Main.java:84) at org.apache.avro.tool.Main.main(Main.java:73) **** Any other ideas? -- Chris Miller On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman wrote: > your field name is > *enum1_values* > > but

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
oding the same files work fine with Hive, and I imagine the same deserializer code is used there too. Thoughts? -- Chris Miller On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman wrote: > your field name is > *enum1_values* > > but you have data > { "foo1": "test123"

Re: Using netlib-java in Spark 1.6 on linux

2016-03-04 Thread Chris Fregly
p26386p26392.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-04 Thread Chris Fregly
ome learning overhead if we go the Scala route. What I want to >>>>> know is: is the Scala version of Spark still far enough ahead of pyspark >>>>> to >>>>> be well worth any initial training overhead? >>>>> >>>>> >>>>> If you are a very advanced Python shop and if you’ve in-house >>>>> libraries that you have written in Python that don’t exist in Scala or >>>>> some >>>>> ML libs that don’t exist in the Scala version and will require fair amount >>>>> of porting and gap is too large, then perhaps it makes sense to stay put >>>>> with Python. >>>>> >>>>> However, I believe, investing (or having some members of your group) >>>>> learn and invest in Scala is worthwhile for few reasons. One, you will get >>>>> the performance gain, especially now with Tungsten (not sure how it >>>>> relates >>>>> to Python, but some other knowledgeable people on the list, please chime >>>>> in). Two, since Spark is written in Scala, it gives you an enormous >>>>> advantage to read sources (which are well documented and highly readable) >>>>> should you have to consult or learn nuances of certain API method or >>>>> action >>>>> not covered comprehensively in the docs. And finally, there’s a long term >>>>> benefit in learning Scala for reasons other than Spark. For example, >>>>> writing other scalable and distributed applications. >>>>> >>>>> >>>>> Particularly, we will be using Spark Streaming. I know a couple of >>>>> years ago that practically forced the decision to use Scala. Is this >>>>> still >>>>> the case? >>>>> >>>>> >>>>> You’ll notice that certain APIs call are not available, at least for >>>>> now, in Python. >>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html >>>>> >>>>> >>>>> Cheers >>>>> Jules >>>>> >>>>> -- >>>>> The Best Ideas Are Simple >>>>> Jules S. Damji >>>>> e-mail:dmat...@comcast.net >>>>> e-mail:jules.da...@gmail.com >>>>> >>>>> > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Best way to merge files from streaming jobs‏ on S3

2016-03-04 Thread Chris Miller
instead of writing to the file from coalesce, sort that data structure, then write your file. -- Chris Miller On Sat, Mar 5, 2016 at 5:24 AM, jelez wrote: > My streaming job is creating files on S3. > The problem is that those files end up very small if I just write them to > S3 > dir
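A hedged sketch of that approach, assuming an RDD of strings small enough to collect and sort on the driver:

    // pull the contents to the driver, sort, then write a single file
    val merged = rdd.collect().sorted
    sc.parallelize(merged, numSlices = 1).saveAsTextFile("s3a://bucket/merged/")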

Re: MLLib + Streaming

2016-03-06 Thread Chris Miller
Guru: This is a really great response. Thanks for taking the time to explain all of this. Helpful for me too. -- Chris Miller On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani wrote: > Hi Lan, > > Streaming Means, Linear Regression and Logistic Regression support online > machine lea

Re: Is Spark right for us?

2016-03-06 Thread Chris Miller
Gut instinct is no, Spark is overkill for your needs... you should be able to accomplish all of that with a relational database or a column oriented database (depending on the types of queries you most frequently run and the performance requirements). -- Chris Miller On Mon, Mar 7, 2016 at 1:17

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-06 Thread Chris Miller
For anyone running into this same issue, it looks like Avro deserialization is just broken when used with SparkSQL and partitioned schemas. I created a bug report with details and a simplified example on how to reproduce: https://issues.apache.org/jira/browse/SPARK-13709 -- Chris Miller On Fri

Job submission failure exception - unserializable TaskEndReason

2016-03-10 Thread Chris Westin
I'm getting an exception when I try to submit a job (through prediction.io, if you know it): [INFO] [Runner$] Submission command: /home/pio/PredictionIO/vendors/spark-1.5.1/bin/spark-submit --class io.prediction.tools.imprt.FileToEvents --files file:/home/pio/PredictionIO/conf/log4j.properties,

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubs

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
the DAGScheduler, which > will be apportioning the Tasks from those concurrent Jobs across the > available Executor cores. > > On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly wrote: > >> Good stuff, Evan. Looks like this is utilizing the in-memory >> capabilities of FiloDB

Re: [MLlib - ALS] Merging two Models?

2016-03-10 Thread Chris Fregly
bra operation > > > > Unfortunately, I'm fairly ignorant as to the internal mechanics of ALS > > itself. Is what I'm asking possible? > > > > Thank you, > > Colin > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
for analytical queries > that the OP wants; and MySQL is great but not scalable. Probably > something like VectorWise, HANA, Vertica would work well, but those > are mostly not free solutions. Druid could work too if the use case > is right. > > Anyways, great discussio

Repeating Records w/ Spark + Avro?

2016-03-11 Thread Chris Miller
one the datum? Seems I'm not the only one who ran into this problem: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/102. I can't figure out how to fix it in my case without hacking away like the person in the linked PR did. Suggestions? -- Chris Miller

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Chris Miller
What exactly are you trying to do? Zeppelin is for interactive analysis of a dataset. What do you mean "realtime analytics" -- do you mean build a report or dashboard that automatically updates as new data comes in? -- Chris Miller On Sat, Mar 12, 2016 at 3:13 PM, trung kien wrote:

Re: Repeating Records w/ Spark + Avro?

2016-03-12 Thread Chris Miller
tln(record.get("myValue")) }) * What am I doing wrong? -- Chris Miller On Sat, Mar 12, 2016 at 1:48 PM, Peyman Mohajerian wrote: > Here is the reason for the behavior: > '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable > objec

Re: Repeating Records w/ Spark + Avro?

2016-03-12 Thread Chris Miller
removed. Finally, if I add rdd.persist(), then it doesn't work. I guess I would need to do .map(_._1.datum) again before the map that does the real work. -- Chris Miller On Sat, Mar 12, 2016 at 4:15 PM, Chris Miller wrote: > Wow! That sure is buried in the documentation! But yeah, that'
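For reference, one way to make the copy explicit for generic records, as a sketch assuming an RDD of (AvroKey[GenericRecord], NullWritable) pairs as returned by newAPIHadoopFile:

    import org.apache.avro.generic.GenericData

    // Hadoop's RecordReader re-uses the same object for every record, so
    // deep-copy each datum before caching/collecting, or every element will
    // end up pointing at the last record read
    val records = rdd.map { case (avroKey, _) =>
      val r = avroKey.datum()
      GenericData.get().deepCopy(r.getSchema, r)
    }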

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Chris Miller
trigger your code to push out an updated value to any clients via the websocket. You could use something like a Redis pub/sub channel to trigger the web app to notify clients of an update. There are about 5 million other ways you could design this, but I would just keep it as simple as possible. I

Re: Correct way to use spark streaming with apache zeppelin

2016-03-13 Thread Chris Miller
Cool! Thanks for sharing. -- Chris Miller On Sun, Mar 13, 2016 at 12:53 AM, Todd Nist wrote: > Below is a link to an example which Silvio Fiorito put together > demonstrating how to link Zeppelin with Spark Stream for real-time charts. > I think the original thread was back in early

Re: reading file from S3

2016-03-16 Thread Chris Miller
r you described. -- Chris Miller On Tue, Mar 15, 2016 at 11:22 PM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > There are many solutions to a problem. > > Also understand that sometimes your situation might be such. For ex what > if you are accessing S3 from

Re: Does parallelize and collect preserve the original order of list?

2016-03-16 Thread Chris Miller
Short answer: Nope Less short answer: Spark is not designed to maintain sort order in this case... it *may*, but there's no guarantee... generally, it would not be in the same order unless you implement something to order by and then sort the result based on that. -- Chris Miller On Wed, M
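If the original order matters, a minimal sketch of making it explicit rather than relying on implementation behaviour:

    // tag each element with its position, then restore the order after collect
    val indexed = sc.parallelize(items).zipWithIndex()
    val inOrder = indexed.collect().sortBy(_._2).map(_._1)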

Re: newbie HDFS S3 best practices

2016-03-16 Thread Chris Miller
If you have lots of small files, distcp should handle that well -- it's supposed to distribute the transfer of files across the nodes in your cluster. Conductor looks interesting if you're trying to distribute the transfer of single, large file(s)... right? -- Chris Miller On Wed, Ma

Re: Spark schema evolution

2016-03-22 Thread Chris Miller
With Avro you solve this by using a default value for the new field... maybe Parquet is the same? -- Chris Miller On Tue, Mar 22, 2016 at 9:34 PM, gtinside wrote: > Hi , > > I have a table sourced from* 2 parquet files* with few extra columns in one > of the parquet file. Simp
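For the Parquet side, schema merging across the two files can be requested explicitly, as a hedged sketch (paths are illustrative):

    // mergeSchema reconciles the differing column sets; rows from the file
    // missing the extra columns come back with nulls there
    val df = sqlContext.read.option("mergeSchema", "true").parquet("/data/part1", "/data/part2")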

Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
the processed logs to both elastic search and > kafka. So that Spark Streaming can pick data from Kafka for the complex use > cases, while logstash filters can be used for the simpler use cases. > > I was wondering if someone has already done this evaluation and could > provide me

Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
with production-ready Kafka Streams, so I can try this out myself - and hopefully remove a lot of extra plumbing. On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly wrote: > this is a very common pattern, yes. > > note that in Netflix's case, they're currently pushing all of their logs >

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Chris Fregly
perhaps renaming to Spark ML would actually clear up code and documentation confusion? +1 for rename > On Apr 5, 2016, at 7:00 PM, Reynold Xin wrote: > > +1 > > This is a no brainer IMO. > > >> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley wrote: >> +1 By the way, the JIRA for tracking

Re: How to remove empty strings from JavaRDD

2016-04-07 Thread Chris Miller
flatmap? -- Chris Miller On Thu, Apr 7, 2016 at 10:25 PM, greg huang wrote: > Hi All, > >Can someone give me a example code to get rid of the empty string in > JavaRDD? I kwon there is a filter method in JavaRDD: > https://spark.apache.org/docs/1.6.0/api/java/org/apache/spa
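The filter route is a one-liner, sketched here in Scala (the JavaRDD version takes a Function<String, Boolean> the same way):

    val nonEmpty = rdd.filter(s => s != null && s.nonEmpty)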

Re: Any NLP lib could be used on spark?

2016-04-20 Thread Chris Fregly
this took me a bit to get working, but I finally got it up and running with the package that Burak pointed out. here are some relevant links to my project that should give you some clues: https://github.com/fluxcapacitor/pipeline/blob/master/myapps/spark/ml/src/main/scala/com/advancedspark/ml/n

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Chris Fregly
Tue, May 17, 2016 at 1:36 AM, Todd wrote: >> >>> Hi, >>> We have a requirement to do count(distinct) in a processing batch >>> against all the streaming data(eg, last 24 hours' data),that is,when we do >>> count(distinct),we actually want to compute dis

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Chris Fregly
ere the cluster manager is running. >>>>>>>>> >>>>>>>>> The Driver node runs on the same host that the cluster manager is >>>>>>>>> running. The Driver requests the Cluster Manager for resources to run >>>>>>>>> tasks. The worker is tasked to create the executor (in this case >>>>>>>>> there is >>>>>>>>> only one executor) for the Driver. The Executor runs tasks for the >>>>>>>>> Driver. >>>>>>>>> Only one executor can be allocated on each worker per application. In >>>>>>>>> your >>>>>>>>> case you only have >>>>>>>>> >>>>>>>>> >>>>>>>>> The minimum you will need will be 2-4G of RAM and two cores. Well >>>>>>>>> that is my experience. Yes you can submit more than one spark-submit >>>>>>>>> (the >>>>>>>>> driver) but they may queue up behind the running one if there is not >>>>>>>>> enough >>>>>>>>> resources. >>>>>>>>> >>>>>>>>> >>>>>>>>> You pointed out that you will be running few applications in >>>>>>>>> parallel on the same host. The likelihood is that you are using a VM >>>>>>>>> machine for this purpose and the best option is to try running the >>>>>>>>> first >>>>>>>>> one, Check Web GUI on 4040 to see the progress of this Job. If you >>>>>>>>> start >>>>>>>>> the next JVM then assuming it is working, it will be using port 4041 >>>>>>>>> and so >>>>>>>>> forth. >>>>>>>>> >>>>>>>>> >>>>>>>>> In actual fact try the command "free" to see how much free memory >>>>>>>>> you have. >>>>>>>>> >>>>>>>>> >>>>>>>>> HTH >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Dr Mich Talebzadeh >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> LinkedIn * >>>>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> http://talebzadehmich.wordpress.com >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 28 May 2016 at 16:42, sujeet jog wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I have a question w.r.t production deployment mode of spark, >>>>>>>>>> >>>>>>>>>> I have 3 applications which i would like to run independently on >>>>>>>>>> a single machine, i need to run the drivers in the same machine. >>>>>>>>>> >>>>>>>>>> The amount of resources i have is also limited, like 4- 5GB RAM , >>>>>>>>>> 3 - 4 cores. >>>>>>>>>> >>>>>>>>>> For deployment in standalone mode : i believe i need >>>>>>>>>> >>>>>>>>>> 1 Driver JVM, 1 worker node ( 1 executor ) >>>>>>>>>> 1 Driver JVM, 1 worker node ( 1 executor ) >>>>>>>>>> 1 Driver JVM, 1 worker node ( 1 executor ) >>>>>>>>>> >>>>>>>>>> The issue here is i will require 6 JVM running in parallel, for >>>>>>>>>> which i do not have sufficient CPU/MEM resources, >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Hence i was looking more towards a local mode deployment mode, >>>>>>>>>> would like to know if anybody is using local mode where Driver + >>>>>>>>>> Executor >>>>>>>>>> run in a single JVM in production mode. >>>>>>>>>> >>>>>>>>>> Are there any inherent issues upfront using local mode for >>>>>>>>>> production base systems.?.. >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > -- *Chris Fregly* Research Scientist @ Flux Capacitor AI "Bringing AI Back to the Future!" San Francisco, CA http://fluxcapacitor.ai

Re: GraphX Java API

2016-05-30 Thread Chris Fregly
> > *Abhishek Kumar* > > > > This message (including any attachments) contains confidential information > intended for a specific individual and purpose, and is protected by law. If > you are not the intended recipient, you should delete this message and any > disclos

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Chris Fregly
0.0 in stage 0.0 >>>>>> (TID 0, localhost): java.lang.Error: Multiple ES-Hadoop versions detected >>>>>> in the classpath; please use only one >>>>>> >>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar >>>>>> >>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar >>>>>> >>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar >>>>>> >>>>>> at org.elasticsearch.hadoop.util.Version.(Version.java:73) >>>>>> at >>>>>> org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376) >>>>>> at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40) >>>>>> at >>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67) >>>>>> at >>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67) >>>>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) >>>>>> at org.apache.spark.scheduler.Task.run(Task.scala:89) >>>>>> at >>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) >>>>>> at >>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >>>>>> at >>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >>>>>> at java.lang.Thread.run(Thread.java:745) >>>>>> >>>>>> .. still tracking this down but was wondering if there is someting >>>>>> obvious I'm dong wrong. I'm going to take out >>>>>> elasticsearch-hadoop-2.3.2.jar and try again. >>>>>> >>>>>> Lots of trial and error here :-/ >>>>>> >>>>>> Kevin >>>>>> >>>>>> -- >>>>>> >>>>>> We’re hiring if you know of any awesome Java Devops or Linux >>>>>> Operations Engineers! >>>>>> >>>>>> Founder/CEO Spinn3r.com >>>>>> Location: *San Francisco, CA* >>>>>> blog: http://burtonator.wordpress.com >>>>>> … or check out my Google+ profile >>>>>> <https://plus.google.com/102718274791889610666/posts> >>>>>> >>>>>> >>>> >>>> >>>> -- >>>> >>>> We’re hiring if you know of any awesome Java Devops or Linux Operations >>>> Engineers! >>>> >>>> Founder/CEO Spinn3r.com >>>> Location: *San Francisco, CA* >>>> blog: http://burtonator.wordpress.com >>>> … or check out my Google+ profile >>>> <https://plus.google.com/102718274791889610666/posts> >>>> >>>> >> >> >> -- >> >> We’re hiring if you know of any awesome Java Devops or Linux Operations >> Engineers! >> >> Founder/CEO Spinn3r.com >> Location: *San Francisco, CA* >> blog: http://burtonator.wordpress.com >> … or check out my Google+ profile >> <https://plus.google.com/102718274791889610666/posts> >> >> -- *Chris Fregly* Research Scientist @ PipelineIO San Francisco, CA http://pipeline.io

Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-12 Thread Chris Fregly
=spark+mllib >>>>> >>>>> >>>>> https://www.amazon.com/Spark-Practical-Machine-Learning-Chinese/dp/7302420424/ref=sr_1_3?ie=UTF8&qid=1465657706&sr=8-3&keywords=spark+mllib >>>>> >>>>> >>>>> https:

Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread Chris Fregly
re are some doc regarding using kafka directly >> from the >> > reader.stream? >> > Has it been integrated already (I mean the source)? >> > >> > Sorry if the answer is RTFM (but then I'd appreciate a pointer anyway^^) >> > >> > thanks, >> > cheers >> > andy >> > -- >> > andy >> > -- > andy > -- *Chris Fregly* Research Scientist @ PipelineIO San Francisco, CA http://pipeline.io

Re: ml models distribution

2016-07-22 Thread Chris Fregly
S) or coordinated distribution of the models. But I >> wanted >> > to know if there is any infrastructure in Spark that specifically >> addresses >> > such need. >> > >> > Thanks. >> > >> > Cheers, >> > >> > P.S

UDF returning generic Seq

2016-07-25 Thread Chris Beavers
column of unknown type? Or is my only alternative here to reduce this to BinaryType and use whatever encoding/data structures I want under the covers there and in subsequent UDFs? Thanks, Chris

Re: UDF returning generic Seq

2016-07-26 Thread Chris Beavers
2.345] or {"name":"Chris","value":123}. Given the Spark SQL constraints that ArrayType and MapType need explicit and consistent element types, I don't see any way to support this in the current type system short of falling back to binary data. Open to other sugges

FW: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Chris Mattmann
On 8/8/16, 2:03 AM, "matthias.du...@fiduciagad.de" wrote: >Hello, > >I write to you because I am not really sure whether I did everything right >when registering and subscribing to the spark user list. > >I posted the appended question to Spark User list after subscribing and >receiving t

Re: Have I done everything correctly when subscribing to Spark User List

2016-08-08 Thread Chris Mattmann
>times: > >https://www.mail-archive.com/search?l=user%40spark.apache.org&q=dueckm&submit.x=0&submit.y=0 > >On Mon, Aug 8, 2016 at 3:03 PM, Chris Mattmann wrote: >> >> >> >> >> >> On 8/8/16, 2:03 AM, "matthias.du...@fiduciagad.de"

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
enkins builds to figure this out as i'll likely get it wrong. please provide the relevant snapshot/preview/nightly/whatever repos (or equivalent) that we need to include in our builds to have access to the absolute latest build assets for every major and minor release. thanks! -chris On Tue,

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
nditions placed on >> the package. If you find that the general public are downloading such test >> packages, then remove them. >> > > On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly wrote: > >> this is a valid question. there are many people building products and >&

Re: [ANNOUNCE] Apache Bahir 2.0.0

2016-08-15 Thread Chris Mattmann
Great work Luciano! On 8/15/16, 2:19 PM, "Luciano Resende" wrote: The Apache Bahir PMC is pleased to announce the release of Apache Bahir 2.0.0 which is our first major release and provides the following extensions for Apache Spark 2.0.0 : Akka Streaming MQTT Streamin

Re: Kafka - streaming from multiple topics

2015-12-20 Thread Chris Fregly
separating out your code into separate streaming jobs - especially when there are no dependencies between the jobs - is almost always the best route. it's easier to combine atoms (fusion) than to split them (fission). I recommend splitting out jobs along batch window, stream window, and state-tr

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Chris Fregly
hey Eran, I run into this all the time with Json. the problem is likely that your Json is "too pretty" and extending beyond a single line which trips up the Json reader. my solution is usually to de-pretty the Json - either manually or through an ETL step - by stripping all white space before p
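For context, the JSON reader of that era expected one complete JSON object per line. Later releases (Spark 2.2+) added a multiLine option that reads pretty-printed files directly, as a hedged sketch:

    // each file is parsed as a whole document instead of line by line
    val df = spark.read.option("multiLine", "true").json("/path/to/pretty.json")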

Re: Pyspark SQL Join Failure

2015-12-20 Thread Chris Fregly
how does Spark SQL/DataFrame know that train_users_2.csv has a field named, "id" or anything else domain specific? is there a header? if so, does sc.textFile() know about this header? I'd suggest using the Databricks spark-csv package for reading csv data. there is an option in there to spec
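A minimal sketch of the spark-csv suggestion (options as documented for the Databricks package; the path is illustrative):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // first line supplies column names like "id"
      .option("inferSchema", "true")
      .load("train_users_2.csv")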

Re: Hive error when starting up spark-shell in 1.5.2

2015-12-20 Thread Chris Fregly
hopping on a plane, but check the hive-site.xml that's in your spark/conf directory (or should be, anyway). I believe you can change the root path through this mechanism. if not, this should give you more info to google on. let me know as this comes up a fair amount. > On Dec 19, 2015, at 4:58 PM,

Re: How to do map join in Spark SQL

2015-12-20 Thread Chris Fregly
this type of broadcast should be handled by Spark SQL/DataFrames automatically. this is the primary cost-based, physical-plan query optimization that the Spark SQL Catalyst optimizer supports. in Spark 1.5 and before, you can trigger this optimization by properly setting the spark.sql.autobroad
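A hedged sketch of steering that optimization (the threshold is in bytes; the value here is illustrative):

    // tables whose estimated size falls below the threshold are broadcast to
    // every executor instead of being shuffled for the join
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)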

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-25 Thread Chris Fregly
M, Jakob Odersky >>>> wrote: >>>> >>>>> It might be a good idea to see how many files are open and try >>>>> increasing the open file limit (this is done on an os level). In some >>>>> application use-cases it is actually a legitim

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-25 Thread Chris Fregly
our PostgreSQL server, I get the following >>>> error. >>>> >>>> Error: java.sql.SQLException: No suitable driver found for >>>> jdbc:postgresql:/// >>>> (state=,code=0) >>>> >>>> Can someone help me understand why this is? >>>> >>>&g

Re: Memory allocation for Broadcast values

2015-12-25 Thread Chris Fregly
> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Fat jar can't find jdbc

2015-12-25 Thread Chris Fregly
connect jar file. >>>> > >>>> > I've tried: >>>> > • Using different versions of mysql-connector-java in my >>>> build.sbt file >>>> > • Copying the connector jar to my_project/src/main/lib >>>> > • Copying the connector jar to my_project/lib <-- (this is >>>> where I keep my build.sbt) >>>> > Everything loads fine and works, except my call that does >>>> "sqlContext.load("jdbc", myOptions)". I know this is a total newbie >>>> question but in my defense, I'm fairly new to Scala, and this is my first >>>> go at deploying a fat jar with sbt-assembly. >>>> > >>>> > Thanks for any advice! >>>> > >>>> > -- >>>> > David Yerrington >>>> > yerrington.net >>>> >>>> >>>> - >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>> For additional commands, e-mail: user-h...@spark.apache.org >>>> >>>> >>> >> >> >> -- >> David Yerrington >> yerrington.net >> > > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly
apache-spark-user-list.1001560.n3.nabble.com/Tips-for-Spark-s-Random-Forest-slow-performance-tp25766.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-un

Re: error while defining custom schema in Spark 1.5.0

2015-12-25 Thread Chris Fregly
; > (fields: > Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType > cannot be applied to (org.apache.spark.sql.types.StructField, > org.apache.spark.sql.types.StructField, > org.apache.spark.sql.types.StructField, > org.apache.spark.sql.types.StructField, > org.apache.spark.sql.types.StructField, > org.apache.spark.sql.types.StructField) >val customSchema = StructType( StructField("year", IntegerType, > true), StructField("make", StringType, true) ,StructField("model", > StringType, true) , StructField("comment", StringType, true) , > StructField("blank", StringType, true),StructField("blank", StringType, > true)) > ^ >Would really appreciate if somebody share the example which works with > Spark 1.4 or Spark 1.5.0 > > Thanks, > Divya > > ^ > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: fishing for help!

2015-12-25 Thread Chris Fregly
gt;>> >>>> Hi, >>>> I know it is a wide question but can you think of reasons why a pyspark >>>> job which runs on from server 1 using user 1 will run faster then the same >>>> job when running on server 2 with user 1 >>>> Eran >>>> >>> >>> >> -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Getting estimates and standard error using ml.LinearRegression

2015-12-25 Thread Chris Fregly
; PFB the code snippet > > val lr = new LinearRegression() > lr.setMaxIter(10) > .setRegParam(0.01) > .setFitIntercept(true) > val model= lr.fit(test) > val estimates = model.summary > > > -- > Thanks and Regards > Arun > --

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Chris Fregly
>>> "com.epam.parso.spark.ds.DefaultSource"); >>>>>> df.cache(); >>>>>> df.printSchema(); <-- prints the schema perfectly fine! >>>>>> >>>>>> df.show(); <-- Works perfectly fine (shows table >>>>>> with 20 lines)! >>>>>> df.registerTempTable("table"); >>>>>> df.select("select * from table limit 5").show(); <-- gives weird >>>>>> exception >>>>>> >>>>>> Exception is: >>>>>> >>>>>> AnalysisException: cannot resolve 'select * from table limit 5' given >>>>>> input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS >>>>>> >>>>>> I can do a collect on a dataframe, but cannot select any specific >>>>>> columns either "select * from table" or "select VER, CREATED from table". >>>>>> >>>>>> I use spark 1.5.2. >>>>>> The same code perfectly works through Zeppelin 0.5.5. >>>>>> >>>>>> Thanks. >>>>>> -- >>>>>> Be well! >>>>>> Jean Morozov >>>>>> >>>>> >>>>> >>>> >>> >> > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Chris Fregly
97 On Fri, Dec 25, 2015 at 2:17 PM, Chris Fregly wrote: > I assume by "The same code perfectly works through Zeppelin 0.5.5" that > you're using the %sql interpreter with your regular SQL SELECT statement, > correct? > > If so, the Zeppelin interpreter is conver

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly
ust say we didn't have this problem in the old mllib API so > it might be something in the new ml that I'm missing. > I will dig deeper into the problem after holidays. > > 2015-12-25 16:26 GMT+01:00 Chris Fregly : > > so it looks like you're increasing num tr

Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Chris Fregly
which version of spark is this? is there any chance that a single key - or set of keys - has a large number of values relative to the other keys (aka skew)? if so, spark 1.5 *should* fix this issue with the new tungsten stuff, although I had some issues still with 1.5.1 in a similar situati

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Chris Fregly
; expect it in the next few days. You will probably want to use the new API >> once it's available. >> >> >> On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot >> wrote: >> >>> Hi, >>> I am new bee to spark and a bit confused about RDDs and

Re: trouble understanding data frame memory usage “java.io.IOException: Unable to acquire memory”

2015-12-28 Thread Chris Fregly
results.col("labelIndex"), >> >> >> results.col("prediction"), >> >> >> results.col("words")); >> >> exploreDF.show(10); >> >> >> >> Yes I realize its strange to switch styles how ever this should not cause >> memory problems >> >> >> final String exploreTable = "exploreTable"; >> >> exploreDF.registerTempTable(exploreTable); >> >> String fmt = "SELECT * FROM %s where binomialLabel = ’signal'"; >> >> String stmt = String.format(fmt, exploreTable); >> >> >> DataFrame subsetToSave = sqlContext.sql(stmt);// .show(100); >> >> >> name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144 >> >> >> exploreDF.unpersist(true); does not resolve memory issue >> >> >> > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Problem with WINDOW functions?

2015-12-29 Thread Chris Fregly
on quick glance, it appears that you're calling collect() in there, which is bringing a huge amount of data down to the single Driver. this is why, when you allocated more memory to the Driver, a different error emerges, most definitely related to stop-the-world GC causing the node to beco

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Chris Fregly
ator.populateAndCacheStripeDetails(OrcInputFormat.java:927) >> at >> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:836) >> at >> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:702) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >> at java.lang.Thread.run(Thread.java:745) >> >> I will be glad for any help on that matter. >> >> Regards >> Dawid Wysakowicz >> > > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Using Experminal Spark Features

2015-12-30 Thread Chris Fregly
to the documentation. Has anyone used > this approach yet and if so what has you experience been with using it? If > it helps we’d be looking to implement it using Scala. Secondly, in general > what has people experience been with using experimental features in Spark? > > > &

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-30 Thread Chris Fregly
sisted >> to memory-only. I want to be able to run a count (actually >> "countApproxDistinct") after filtering by an, at compile time, unknown >> (specified by query) value. >> > >> > I've experimented with using (abusing) Spark Streaming, by str

Re: Re: trouble understanding data frame memory usage “java.io.IOException: Unable to acquire memory”

2015-12-30 Thread Chris Fregly
@Jim - I'm wondering if those docs are outdated, as it's my understanding (please correct me if I'm wrong) that we should never be seeing OOMs, as 1.5/Tungsten not only improved (reduced) the memory footprint of our data, but also introduced better task-level - and even key-level - external spilling

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
are the credentials visible from each Worker node to all the Executor JVMs on each Worker? > On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev > wrote: > > Dear Spark community, > > I faced the following issue with trying accessing data on S3a, my code is the > following: > > val sparkCo
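One way to rule that out is to put the credentials into the Hadoop configuration that ships with the job, rather than setting them only on the driver, as a sketch with placeholder values (property names per hadoop-aws):

    sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")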

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
- and at some point, you will want to autoscale. On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev < kudryavtsev.konstan...@gmail.com> wrote: > Chris, > > good question, as you can see from the code I set up them on driver, so I > expect they will be propagated t

Re: how to extend java transformer from Scala UnaryTransformer ?

2016-01-02 Thread Chris Fregly
ride > > public Function1, List> createTransformFunc() { > > // > http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters > > Function1, List> f = new > AbstractFunction1, List>() { > > public
