Re: Using netlib-java in Spark 1.6 on linux

2016-03-04 Thread Chris Fregly

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-04 Thread Chris Fregly
…some learning overhead if we go the Scala route. What I want to know is: is the Scala version of Spark still far enough ahead of PySpark to be well worth any initial training overhead?

If you are a very advanced Python shop, with in-house libraries written in Python that don't exist in Scala, or ML libs that don't exist in the Scala version and would require a fair amount of porting, and the gap is too large, then perhaps it makes sense to stay put with Python.

However, I believe that investing in Scala (or having some members of your group learn it) is worthwhile for a few reasons. One, you will get the performance gain, especially now with Tungsten (not sure how it relates to Python, but some other knowledgeable people on the list, please chime in). Two, since Spark is written in Scala, it gives you an enormous advantage to be able to read the sources (which are well documented and highly readable) should you have to consult or learn the nuances of a certain API method or action not covered comprehensively in the docs. And finally, there is a long-term benefit in learning Scala for reasons other than Spark, for example, writing other scalable and distributed applications.

Particularly, we will be using Spark Streaming. I know a couple of years ago that practically forced the decision to use Scala. Is this still the case?

You'll notice that certain API calls are not available, at least for now, in Python: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Cheers, Jules S. Damji

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
…the DAGScheduler, which will be apportioning the Tasks from those concurrent Jobs across the available Executor cores.

On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly wrote: Good stuff, Evan. Looks like this is utilizing the in-memory capabilities of FiloDB…

Re: [MLlib - ALS] Merging two Models?

2016-03-10 Thread Chris Fregly
…algebra operation. Unfortunately, I'm fairly ignorant as to the internal mechanics of ALS itself. Is what I'm asking possible? Thank you, Colin

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
…for the analytical queries that the OP wants; and MySQL is great but not scalable. Probably something like VectorWise, HANA, or Vertica would work well, but those are mostly not free solutions. Druid could work too if the use case is right. Anyways, great discussio…

Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
…the processed logs to both Elasticsearch and Kafka, so that Spark Streaming can pick data from Kafka for the complex use cases, while Logstash filters can be used for the simpler use cases. I was wondering if someone has already done this evaluation and could provide me…

Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
…with production-ready Kafka Streams, so I can try this out myself - and hopefully remove a lot of extra plumbing.

On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly wrote: this is a very common pattern, yes. note that in Netflix's case, they're currently pushing all of their logs…

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Chris Fregly
perhaps renaming to Spark ML would actually clear up code and documentation confusion? +1 for the rename.

On Apr 5, 2016, at 7:00 PM, Reynold Xin wrote: +1. This is a no-brainer IMO.

On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley wrote: +1. By the way, the JIRA for tracking…

Re: Any NLP lib could be used on spark?

2016-04-20 Thread Chris Fregly
this took me a bit to get working, but I finally got it up and running with the package that Burak pointed out. here are some relevant links to my project that should give you some clues: https://github.com/fluxcapacitor/pipeline/blob/master/myapps/spark/ml/src/main/scala/com/advancedspark/ml/n…

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Chris Fregly
On Tue, May 17, 2016 at 1:36 AM, Todd wrote: Hi, we have a requirement to do count(distinct) in a processing batch against all the streaming data (e.g., the last 24 hours' data); that is, when we do count(distinct), we actually want to compute dis…

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Chris Fregly
…where the cluster manager is running. The Driver node runs on the same host that the cluster manager is running on. The Driver requests resources from the Cluster Manager to run tasks. The worker is tasked to create the executor (in this case there is only one executor) for the Driver, and the Executor runs tasks for the Driver. Only one executor can be allocated on each worker per application.

The minimum you will need will be 2-4 GB of RAM and two cores. Well, that is my experience. Yes, you can submit more than one spark-submit (the driver), but they may queue up behind the running one if there are not enough resources.

You pointed out that you will be running a few applications in parallel on the same host. The likelihood is that you are using a VM for this purpose, and the best option is to try running the first one and check the Web GUI on port 4040 to see the progress of this job. If you start the next JVM then, assuming it is working, it will be using port 4041, and so forth.

In actual fact, try the command "free" to see how much free memory you have.

HTH, Dr Mich Talebzadeh

On 28 May 2016 at 16:42, sujeet jog wrote: Hi, I have a question w.r.t. the production deployment mode of Spark. I have 3 applications which I would like to run independently on a single machine; I need to run the drivers on the same machine. The amount of resources I have is also limited: 4-5 GB RAM and 3-4 cores. For deployment in standalone mode, I believe I need:

1 Driver JVM, 1 worker node (1 executor)
1 Driver JVM, 1 worker node (1 executor)
1 Driver JVM, 1 worker node (1 executor)

The issue here is that I will require 6 JVMs running in parallel, for which I do not have sufficient CPU/memory resources. Hence I was looking more towards a local deployment mode, and would like to know if anybody is using local mode (where the Driver and Executor run in a single JVM) in production. Are there any inherent issues upfront in using local mode for production systems?

Re: GraphX Java API

2016-05-30 Thread Chris Fregly

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Chris Fregly
…0.0 in stage 0.0 (TID 0, localhost): java.lang.Error: Multiple ES-Hadoop versions detected in the classpath; please use only one

jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar
jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar
jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar

at org.elasticsearch.hadoop.util.Version.(Version.java:73)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

…still tracking this down, but I was wondering if there is something obvious I'm doing wrong. I'm going to take out elasticsearch-hadoop-2.3.2.jar and try again. Lots of trial and error here :-/

Kevin (Founder/CEO, Spinn3r.com)

Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-12 Thread Chris Fregly
…=spark+mllib

https://www.amazon.com/Spark-Practical-Machine-Learning-Chinese/dp/7302420424/ref=sr_1_3?ie=UTF8&qid=1465657706&sr=8-3&keywords=spark+mllib

https:…

Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread Chris Fregly
…are there some docs regarding using Kafka directly from the reader.stream? Has it been integrated already (I mean the source)? Sorry if the answer is RTFM (but then I'd appreciate a pointer anyway^^). thanks, cheers, andy

Re: ml models distribution

2016-07-22 Thread Chris Fregly
…S) or coordinated distribution of the models. But I wanted to know if there is any infrastructure in Spark that specifically addresses such a need. Thanks. Cheers, P.S…

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
There currently are no releases beyond Spark 2.0.0.

On Tue, Aug 9, 2016 at 9:55 AM, Jestin Ma wrote: If we want to use versions of Spark beyond the official 2.0.0 release, specifically on Maven + Java, what steps should…

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
…conditions placed on the package. If you find that the general public are downloading such test packages, then remove them.

On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly wrote: this is a valid question. there are many people building products and…

Re: Kafka - streaming from multiple topics

2015-12-20 Thread Chris Fregly
separating out your code into separate streaming jobs - especially when there are no dependencies between the jobs - is almost always the best route. it's easier to combine atoms (fusion) than to split them (fission). I recommend splitting out jobs along batch window, stream window, and state-tr…

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Chris Fregly
hey Eran, I run into this all the time with JSON. the problem is likely that your JSON is "too pretty" and extends beyond a single line, which trips up the JSON reader. my solution is usually to de-pretty the JSON - either manually or through an ETL step - by stripping all whitespace before p…
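A minimal sketch of the one-object-per-line constraint and an in-Spark workaround, assuming Spark 1.5's sqlContext and small-ish files (wholeTextFiles reads each file fully into memory; the path is a placeholder):

// sqlContext.read.json expects one JSON object per line, so pretty-printed
// (multi-line) documents break it. One workaround: read each file whole,
// collapse the whitespace, and feed the one-line strings back in.
// note: this also collapses whitespace inside string values.
val oneLinePerDoc = sc.wholeTextFiles("hdfs:///data/pretty/*.json")
  .map { case (_, content) => content.replaceAll("\\s+", " ") }

val df = sqlContext.read.json(oneLinePerDoc)
df.printSchema()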

Re: Pyspark SQL Join Failure

2015-12-20 Thread Chris Fregly
how does Spark SQL/DataFrame know that train_users_2.csv has a field named, "id" or anything else domain specific? is there a header? if so, does sc.textFile() know about this header? I'd suggest using the Databricks spark-csv package for reading csv data. there is an option in there to spec
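A minimal sketch of the spark-csv route, assuming the package is on the classpath (e.g. via --packages com.databricks:spark-csv_2.10:1.3.0) and the file really does have a header row:

// spark-csv reads the header row into column names instead of treating it as data
val users = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line holds the column names
  .option("inferSchema", "true")   // sample the data to pick column types
  .load("train_users_2.csv")

users.select("id").show(5)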

Re: Hive error when starting up spark-shell in 1.5.2

2015-12-20 Thread Chris Fregly
hopping on a plane, but check the hive-site.xml that's in your spark/conf directory (or should be, anyway). I believe you can change the root path through this mechanism. if not, this should give you more info to google on. let me know, as this comes up a fair amount.

On Dec 19, 2015, at 4:58 PM,…

Re: How to do map join in Spark SQL

2015-12-20 Thread Chris Fregly
this type of broadcast should be handled by Spark SQL/DataFrames automatically. this is the primary cost-based, physical-plan query optimization that the Spark SQL Catalyst optimizer supports. in Spark 1.5 and before, you can trigger this optimization by properly setting the spark.sql.autobroad…
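A minimal sketch of both ways to get a broadcast (map-side) join, assuming Spark 1.5+; the DataFrame names are placeholders. The config being referenced above is spark.sql.autoBroadcastJoinThreshold:

// option 1: let Catalyst broadcast any table smaller than the threshold (in bytes)
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

// option 2: hint the planner explicitly
import org.apache.spark.sql.functions.broadcast
val joined = largeDF.join(broadcast(smallDF), "id")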

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-25 Thread Chris Fregly
…Jakob Odersky wrote: It might be a good idea to see how many files are open and try increasing the open file limit (this is done on an os level). In some application use-cases it is actually a legitim…

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-25 Thread Chris Fregly
…our PostgreSQL server, I get the following error:

Error: java.sql.SQLException: No suitable driver found for jdbc:postgresql:/// (state=,code=0)

Can someone help me understand why this is?…
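Two things usually fix this class of error (a sketch; the host, database, and table names are placeholders): the JDBC URL must be complete, and the PostgreSQL driver jar has to be visible to both the driver and the executors, e.g. via spark-submit --jars or --driver-class-path:

val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb",  // full host/port/database, not jdbc:postgresql:///
  "dbtable" -> "public.mytable",
  "driver"  -> "org.postgresql.Driver"                 // force-load the driver class on the executors
)).load()

df.show(5)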

Re: Memory allocation for Broadcast values

2015-12-25 Thread Chris Fregly

Re: Fat jar can't find jdbc

2015-12-25 Thread Chris Fregly
…connect jar file. I've tried:

- Using different versions of mysql-connector-java in my build.sbt file
- Copying the connector jar to my_project/src/main/lib
- Copying the connector jar to my_project/lib (this is where I keep my build.sbt)

Everything loads fine and works, except my call that does sqlContext.load("jdbc", myOptions). I know this is a total newbie question, but in my defense, I'm fairly new to Scala, and this is my first go at deploying a fat jar with sbt-assembly. Thanks for any advice!

David Yerrington, yerrington.net

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly

Re: error while defining custom schema in Spark 1.5.0

2015-12-25 Thread Chris Fregly
…(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)

val customSchema = StructType( StructField("year", IntegerType, true), StructField("make", StringType, true), StructField("model", StringType, true), StructField("comment", StringType, true), StructField("blank", StringType, true), StructField("blank", StringType, true))

Would really appreciate it if somebody shared an example which works with Spark 1.4 or Spark 1.5.0. Thanks, Divya
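The compiler message quoted above points at the fix: StructType's apply takes a single Seq[StructField], not a comma-separated list of fields. A minimal sketch of the corrected schema (the two columns both named "blank" would also collide, so renaming one is probably wanted; the new name here is a suggestion):

import org.apache.spark.sql.types._

val customSchema = StructType(Seq(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true),
  StructField("blank2", StringType, true)))  // renamed from the duplicate "blank"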

Re: fishing for help!

2015-12-25 Thread Chris Fregly
Hi, I know it is a wide question, but can you think of reasons why a pyspark job which runs on server 1 using user 1 will run faster than the same job when run on server 2 with user 1? Eran

Re: Getting estimates and standard error using ml.LinearRegression

2015-12-25 Thread Chris Fregly
…PFB the code snippet:

val lr = new LinearRegression()
lr.setMaxIter(10)
  .setRegParam(0.01)
  .setFitIntercept(true)
val model = lr.fit(test)
val estimates = model.summary

Thanks and Regards, Arun

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Chris Fregly
>>> "com.epam.parso.spark.ds.DefaultSource"); >>>>>> df.cache(); >>>>>> df.printSchema(); <-- prints the schema perfectly fine! >>>>>> >>>>>> df.show(); <-- Works perfectly fine (shows table >>>>>> with 20 lines)! >>>>>> df.registerTempTable("table"); >>>>>> df.select("select * from table limit 5").show(); <-- gives weird >>>>>> exception >>>>>> >>>>>> Exception is: >>>>>> >>>>>> AnalysisException: cannot resolve 'select * from table limit 5' given >>>>>> input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS >>>>>> >>>>>> I can do a collect on a dataframe, but cannot select any specific >>>>>> columns either "select * from table" or "select VER, CREATED from table". >>>>>> >>>>>> I use spark 1.5.2. >>>>>> The same code perfectly works through Zeppelin 0.5.5. >>>>>> >>>>>> Thanks. >>>>>> -- >>>>>> Be well! >>>>>> Jean Morozov >>>>>> >>>>> >>>>> >>>> >>> >> > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Chris Fregly
On Fri, Dec 25, 2015 at 2:17 PM, Chris Fregly wrote: I assume by "The same code perfectly works through Zeppelin 0.5.5" that you're using the %sql interpreter with your regular SQL SELECT statement, correct? If so, the Zeppelin interpreter is conver…

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly
…just say we didn't have this problem in the old mllib API, so it might be something in the new ml that I'm missing. I will dig deeper into the problem after the holidays.

2015-12-25 16:26 GMT+01:00 Chris Fregly: so it looks like you're increasing num tr…

Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Chris Fregly
which version of spark is this? is there any chance that a single key - or set of keys - has a large number of values relative to the other keys (aka skew)? if so, spark 1.5 *should* fix this issue with the new tungsten stuff, although I still had some issues with 1.5.1 in a similar situati…

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Chris Fregly
…expect it in the next few days. You will probably want to use the new API once it's available.

On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot wrote: Hi, I am a newbie to spark and a bit confused about RDDs and…

Re: trouble understanding data frame memory usage ³java.io.IOException: Unable to acquire memory²

2015-12-28 Thread Chris Fregly
…results.col("labelIndex"), results.col("prediction"), results.col("words"));
exploreDF.show(10);

Yes, I realize it's strange to switch styles; however, this should not cause memory problems.

final String exploreTable = "exploreTable";
exploreDF.registerTempTable(exploreTable);
String fmt = "SELECT * FROM %s where binomialLabel = 'signal'";
String stmt = String.format(fmt, exploreTable);
DataFrame subsetToSave = sqlContext.sql(stmt); // .show(100);

name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144

exploreDF.unpersist(true); does not resolve the memory issue

Re: Problem with WINDOW functions?

2015-12-29 Thread Chris Fregly
on quick glance, it appears that you're calling collect() in there, which is bringing a huge amount of data down to the single Driver. this is why, when you allocated more memory to the Driver, a different error emerged - most definitely related to stop-the-world GC causing the node to beco…

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Chris Fregly
…ator.populateAndCacheStripeDetails(OrcInputFormat.java:927)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:836)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:702)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I will be glad for any help on that matter. Regards, Dawid Wysakowicz

Re: Using Experminal Spark Features

2015-12-30 Thread Chris Fregly
…to the documentation. Has anyone used this approach yet, and if so, what has your experience been with using it? If it helps, we'd be looking to implement it using Scala. Secondly, in general, what has people's experience been with using experimental features in Spark?

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-30 Thread Chris Fregly
…persisted to memory-only. I want to be able to run a count (actually "countApproxDistinct") after filtering by an, at compile time, unknown (specified by query) value. I've experimented with using (abusing) Spark Streaming, by str…
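For reference, countApproxDistinct is a plain RDD method (HyperLogLog-based), so the filter-then-count pattern the OP describes is a one-liner once the query value arrives; a minimal sketch with a hypothetical record type:

// events: RDD[(String, String)] of (userId, action), cached in memory (assumed)
def distinctUsersFor(action: String): Long =
  events.filter { case (_, a) => a == action }
    .map { case (user, _) => user }
    .countApproxDistinct(relativeSD = 0.05)  // ~5% relative error, constant memory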

Re: 回复: trouble understanding data frame memory usage ³java.io.IOException: Unable to acquire memory²

2015-12-30 Thread Chris Fregly
@Jim- I'm wondering if those docs are outdated, as it's my understanding (please correct me if I'm wrong) that we should never be seeing OOMs, since 1.5/Tungsten not only improved (reduced) the memory footprint of our data, but also introduced better task-level - and even key-level - external spilling…

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
are the credentials visible from each Worker node to all the Executor JVMs on each Worker?

On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev wrote: Dear Spark community, I faced the following issue when trying to access data on S3a; my code is the following:

val sparkCo…

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
…to all nodes, won't they? Thank you, Konstantin Kudryavtsev

On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly wrote: are the credentials visible from each Worker node to all the Executor JVMs on each Worker?

On Dec 30, 2015, at 12:45 PM,…
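One reliable way to make S3a credentials visible to every executor JVM (rather than relying on environment variables that exist only on the submitting node) is to set them on the Hadoop configuration carried by the SparkContext; a minimal sketch, with placeholder keys and bucket:

// fs.s3a.* settings travel with the job to the Hadoop FileSystem layer on every executor
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("s3a://my-bucket/path/")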

Re: how to extend java transformer from Scala UnaryTransformer ?

2016-01-02 Thread Chris Fregly
…ride
public Function1<List<…>, List<…>> createTransformFunc() {
  // http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
  Function1<List<…>, List<…>> f = new AbstractFunction1<List<…>, List<…>>() {
    public …

Re: How to specify the numFeatures in HashingTF

2016-01-02 Thread Chris Fregly
mFeatures". I was >>> wondering what is the best way to set the value to this parameter. In the >>> use case of text categorization, do you need to know in advance the number >>> of words in your vocabulary? or do you set it to be a large value, greater >>> than the number of words in your vocabulary? >>> >>> Thanks, >>> >>> Jianguo >>> >> >> > -- *Chris Fregly* Principal Data Solutions Engineer IBM Spark Technology Center, San Francisco, CA http://spark.tc | http://advancedspark.com

Re: Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread Chris Fregly
…do we need to enable Tungsten and unsafe options, or are they enabled by default? I see in the documentation that the default sort manager is sort; I thought it was Tungsten, no? Please guide.

Re: Date Time Regression as Feature

2016-01-08 Thread Chris Fregly
…features had dates in it. Thanks, Jorge Machado (jo...@jmachado.me)

Re: Benchmarking with multiple users in Spark

2016-01-08 Thread Chris Fregly
…to just launch multiple spark-shells to simulate multiple users with dynamic resource allocation enabled. I haven't tried this yet. Are there any standard approaches for benchmarking with multiple users in Spark? Any pointers on this would be helpful.

Re: Allowing parallelism in spark local mode

2016-02-12 Thread Chris Fregly
…processing them one at a time. When 2 requests arrive simultaneously, the processing time for each of them is almost doubled. I tried setting spark.default.parallelism, spark.executor.cores, and spark.driver.cores, but that did not change the time in a meaningful way. Am I missing…
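Two settings worth checking for concurrent jobs inside a single local-mode application (a sketch, not a confirmed diagnosis of the OP's case): the task-slot count is baked into the master URL, and the default FIFO scheduler serializes jobs submitted from concurrent threads:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[8]")               // local[N] sets how many tasks can run at once
  .setAppName("concurrent-requests")
  .set("spark.scheduler.mode", "FAIR") // share task slots across concurrently submitted jobs
val sc = new SparkContext(conf)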

Re: Communication between two spark streaming Job

2016-02-19 Thread Chris Fregly
if you need update notifications, you could introduce ZooKeeper (eek!) or a Kafka queue between the jobs. I've seen internal Kafka queues (relative to external Spark Streaming queues) used for this type of incremental-update use case. think of the updates as transaction logs.

On Feb 19, 20…

Re: Evaluating spark streaming use case

2016-02-21 Thread Chris Fregly
good catch on the cleaner.ttl. @jatin- when you say "memory-persisted RDD", what do you mean exactly? and how much data are you talking about? remember that spark can evict these memory-persisted RDDs at any time. they can be recovered from Kafka, but this is not a good situation to be in.

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Chris Fregly
On Feb 28, 2016, at 1:48 PM, Ashok Kumar wrote: Hi Gurus, I'd appreciate it if you could recommend a good book on Spark, or documentation, for beginner-to-moderate knowledge. I very much want to skill myself up on the transformation and action methods. FYI, I have already looked at examples on the net; however, some of them are not clear, at least to me. Warmest regards

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-18 Thread Chris Fregly
…since Spark 1.6. But the Spark version on the current cluster is 1.5. Can we install Spark 2.0 on the master node to work around this? Thanks!

Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-08 Thread Chris Fregly
i've had the same problem with elasticsearch when auto-create is enabled for indexes. i've spent hours debugging what ended up being a typo combined with auto-create. you definitely want to fail fast in this scenario.

On Oct 8, 2016, at 6:26 PM, Cody Koeninger wrote: So I just now rete…

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Chris Fregly
…can get you up and running in your own cloud-based or on-premise environment in minutes. we support aws, google cloud, and azure - basically anywhere that runs docker. any time zone works. we're completely global, with free 24x7 support for everyone in the community. thanks! hope this i…

Re: Connection pool in workers

2015-03-01 Thread Chris Fregly
hey AKM! this is a very common problem. the streaming programming guide actually addresses this issue here: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

the tl;dr is this: 1) you want to use foreachPartition() to operate on a whole par…
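A minimal sketch of that design pattern: one connection (borrowed from a pool) per partition, instead of per record or per RDD. ConnectionPool here is a hypothetical, statically initialized pool on each executor (not a Spark API), and dstream is any input DStream:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // runs on the executor: one pooled connection per partition, reused across records
    val connection = ConnectionPool.getConnection()  // hypothetical lazily-created static pool
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)      // return for reuse by later batches
  }
}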

Re: Pushing data from AWS Kinesis -> Spark Streaming -> AWS Redshift

2015-03-01 Thread Chris Fregly
Hey Mike- Great to see you're using the AWS stack to its fullest! I've already created the Kinesis-Spark Streaming connector with examples, documentation, tests, and everything. You'll need to build Spark from source with the -Pkinesis-asl profile, otherwise they won't be included in the build.

Re: Spark Streaming S3 Performance Implications

2015-03-21 Thread Chris Fregly
hey mike! you'll definitely want to increase your parallelism by adding more shards to the stream - as well as spinning up 1 receiver per shard and unioning all the shards, per the KinesisWordCount example that is included with the kinesis streaming package. you'll need more cores (cluster) or t…
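A minimal sketch of the one-receiver-per-shard + union pattern, in the style of the bundled example (the app name, stream name, endpoint, region, and shard count are placeholders; ssc is an existing StreamingContext):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val numShards = 4  // match the number of shards in the Kinesis stream
val shardStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(ssc, "myKinesisApp", "myStream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST,
    Seconds(2),  // KCL checkpoint interval, often set to the batch interval
    StorageLevel.MEMORY_AND_DISK_2)
}

// union into a single DStream so downstream processing sees one logical stream
val unioned = ssc.union(shardStreams)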

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Chris Fregly
hey mike- as you pointed out here from my docs, changing the stream name is sometimes problematic due to the way the Kinesis Client Library manages leases, checkpoints, etc. in DynamoDB. I noticed this directly while developing the Kinesis connector, which is why I highlighted the issue here.

Re: Spark + Kinesis

2015-05-09 Thread Chris Fregly
hey vadim- sorry for the delay. if you're interested in trying to get Kinesis working one-on-one, shoot me a direct email and we'll get it going off-list. we can circle back and summarize our findings here. lots of people are using Spark Streaming + Kinesis successfully. would love to help you t…

Re: JavaKinesisWordCountASLYARN Example not working on EMR

2015-05-09 Thread Chris Fregly
Ankur- can you confirm that you got the stock JavaKinesisWordCountASL example working on EMR per Chris' suggestion? i want to stay ahead of any issues that you may encounter with the Kinesis + Spark Streaming + EMR integration, as this is a popular stack. Thanks! -Chris

On Fri, Mar 27, 2015 at…

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Chris Fregly
have you tried unioning the 2 streams, per the KinesisWordCountASL example, where 2 streams (against the same Kinesis stream in this case) are created…

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Chris Fregly
…kinesis DStreams are using the Kinesis application name (as they are in the same StreamingContext / SparkContext / Spark app name), KCL may be doing weird overwriting of the checkpoint information of both Kinesis streams into the same DynamoDB table. Either way,…

Re: Selecting first ten values in a RDD/partition

2014-06-29 Thread Chris Fregly
as brian g alluded to earlier, you can use DStream.mapPartitions() to return the partition-local top 10 for each partition. once you collect the results from all the partitions, you can do a global top-10 merge sort across all partitions. this leads to a much smaller dataset being shuffled…
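A minimal sketch of the per-partition top-k followed by a global merge, written against a plain RDD of (key, count) pairs (the DStream version is the same logic applied inside mapPartitions on each batch; counts is an assumed name):

// 1) local top 10 inside each partition: only 10 * numPartitions records survive
val topPerPartition = counts.mapPartitions { iter =>
  iter.toSeq.sortBy { case (_, n) => -n }.take(10).iterator
}

// 2) global merge of the small survivor set on the driver
val globalTop10 = topPerPartition.collect()
  .sortBy { case (_, n) => -n }
  .take(10)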

Re: Reconnect to an application/RDD

2014-06-29 Thread Chris Fregly
Tachyon is another option - this is the "off heap" StorageLevel specified when persisting RDDs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.storage.StorageLevel - or just use HDFS. this requires subsequent Applications/SparkContexts to reload the data from disk, of co…

Re: multiple passes in mapPartitions

2014-07-01 Thread Chris Fregly
also, multiple calls to mapPartitions() will be pipelined by the spark execution engine into a single stage, so the overhead is minimal.

On Fri, Jun 13, 2014 at 9:28 PM, zhen wrote: Thank you for your suggestion. We will try it out and see how it performs. We think the single call to mapP…

Re: Fw: How Spark Choose Worker Nodes for respective HDFS block

2014-07-01 Thread Chris Fregly
yes, spark attempts to achieve data locality (PROCESS_LOCAL or NODE_LOCAL) where possible, just like MapReduce. it's a best practice to co-locate your Spark Workers on the same nodes as your HDFS Data Nodes for just this reason. this is achieved through the RDD.preferredLocations() interface metho…

Re: Spark Streaming source from Amazon Kinesis

2014-07-22 Thread Chris Fregly
i took this over from parviz. i recently submitted a new PR for Kinesis Spark Streaming support: https://github.com/apache/spark/pull/1434 others have tested it with good success, so give it a whirl! waiting for it to be reviewed/merged. please put any feedback into the PR directly. thanks! -

Re: Task's "Scheduler Delay" in web ui

2014-08-19 Thread Chris Fregly
"Scheduling Delay" is the time required to assign a task to an available resource. if you're seeing large scheduler delays, this likely means that other jobs/tasks are using up all of the resources. here's some more info on how to setup Fair Scheduling versus the default FIFO Scheduler: https://

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-08-19 Thread Chris Fregly
this would be awesome. did a jira get created for this? I searched, but didn't find one. thanks! -chris

On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani wrote: Thanks a lot Xiangrui. This will help.

On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng wrote: Hi Rahul, We plan to…

Re: slower worker node in the cluster

2014-08-19 Thread Chris Fregly
perhaps creating Fair Scheduler Pools might help? there's no way to pin certain nodes to a pool, but you can specify minShares (CPUs). not sure if that would help, but it's worth looking into.

On Tue, Jul 8, 2014 at 7:37 PM, haopu wrote: In a standalone cluster, is there a way to specify the sta…

Re: Spark on Yarn: Connecting to Existing Instance

2014-08-21 Thread Chris Fregly
perhaps the author is referring to Spark Streaming applications? they're examples of long-running applications. the application/domain-level protocol still needs to be implemented yourself, as sandy pointed out.

On Wed, Jul 9, 2014 at 11:03 AM, John Omernik wrote: So how do I do the "long-l…

Re: Spark Streaming - What does Spark Streaming checkpoint?

2014-08-21 Thread Chris Fregly
The StreamingContext can be recreated from a checkpoint file, indeed. check out the following Spark Streaming source files for details: StreamingContext, Checkpoint, DStream, DStreamCheckpoint, and DStreamGraph.

On Wed, Jul 9, 2014 at 6:11 PM, Yan Fang wrote: Hi guys, I am a little co…

Re: Parsing Json object definition spanning multiple lines

2014-08-26 Thread Chris Fregly
i've seen this done using mapPartitions() where each partition represents a single, multi-line json file. you can rip through each partition (json file) and parse the json doc as a whole. this assumes you use sc.textFile("/*.json") or an equivalent to load in multiple files at once. each json file…

Re: Spark-Streaming collect/take functionality.

2014-08-26 Thread Chris Fregly
good suggestion, td. and i believe the optimization that jon.burns is referring to - from the big data mini-course - is a step earlier: the sorting mechanism that produces sortedCounts. you can use mapPartitions() to get a top k locally on each partition, then shuffle only (k * # of partitions)…

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Chris Fregly
great work, Dibyendu. looks like this would be a popular contribution. expanding on bharat's question a bit: what happens if you submit multiple receivers to the cluster by creating and unioning multiple DStreams, as in the kinesis example here: https://github.com/apache/spark/blob/ae58aea2d1435…

Re: Low Level Kafka Consumer for Spark

2014-08-28 Thread Chris Fregly
@bharat- overall, i've noticed a lot of confusion about how Spark Streaming scales - as well as how it handles failover and checkpointing - but we can discuss that separately. there are actually 2 dimensions to scaling here: receiving and processing.

*Receiving*: receiving can be scaled out by subm…

Re: Kinesis receiver & spark streaming partition

2014-08-28 Thread Chris Fregly
great question, wei. this is very important to understand from a performance perspective. and this extends beyond kinesis - it applies to any streaming source that supports shards/partitions. i need to do a little research into the internals to confirm my theory. lemme get back to you! -chris

Re: saveAsSequenceFile for DStream

2014-08-30 Thread Chris Fregly
couple things to add here: 1) you can import the org.apache.spark.streaming.dstream.PairDStreamFunctions implicit, which adds a whole ton of functionality to DStream itself. this lets you work at the DStream level versus digging into the underlying RDDs. 2) you can use ssc.fileStream(directory) t…
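For the original saveAsSequenceFile question, a minimal per-batch sketch that drops to the RDD level (assuming Spark 1.3+, where the pair-RDD implicits are in scope automatically; wordCounts and the output path are placeholders):

// wordCounts: DStream[(String, Int)] built earlier in the job (assumed)
wordCounts.foreachRDD { (rdd, time) =>
  // one sequence-file directory per batch interval, keyed by batch time
  rdd.saveAsSequenceFile(s"hdfs:///output/wordcounts-${time.milliseconds}")
}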

Re: data locality

2014-08-30 Thread Chris Fregly
you can view the Locality Level of each task within a stage by using the Spark Web UI under the Stages tab. levels are as follows (in order of decreasing desirability):

1) PROCESS_LOCAL <- data was found directly in the executor JVM
2) NODE_LOCAL <- data was found on the same node as the executor…

Re: Reading from Amazon S3 behaves inconsistently: returns different number of lines...

2014-08-30 Thread Chris Fregly
interesting and possibly-related blog post from netflix earlier this year: http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html On Fri, Aug 1, 2014 at 8:09 AM, nit wrote: > @sean - I am using latest code from master branch, up to commit# > a7d145e98c55fa66a541293930f25d9cdc25f3b

Re: spark RDD join Error

2014-09-04 Thread Chris Fregly
specifically, you're picking up the following implicit: import org.apache.spark.SparkContext.rddToPairRDDFunctions (in case you're a wildcard-phobe like me)

On Thu, Sep 4, 2014 at 5:15 PM, Veeranagouda Mukkanagoudar wrote: Thanks a lot, that fixed the issue :)
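A minimal sketch of why the import matters: join() and friends live on PairRDDFunctions, which only attach to an RDD[(K, V)] through that implicit conversion (explicit before Spark 1.3, automatic after):

import org.apache.spark.SparkContext.rddToPairRDDFunctions

val a = sc.parallelize(Seq(("k1", 1), ("k2", 2)))
val b = sc.parallelize(Seq(("k1", "x"), ("k3", "y")))

// join() comes from PairRDDFunctions, made available by the implicit above
val joined = a.join(b)  // RDD[(String, (Int, String))] - contains ("k1", (1, "x"))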

Re: Shared variable in Spark Streaming

2014-09-05 Thread Chris Fregly
good question, soumitra. it's a bit confusing. to break TD's code down a bit: dstream.count() is a transformation operation (it returns a new DStream), executes lazily, runs in the cluster on the underlying RDDs that come through in that batch, and returns a new DStream with a single element repres…
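A small sketch of that transformation-vs-output distinction, assuming a running StreamingContext and an input DStream named lines:

// count() is a transformation: a new DStream with one Long element per batch;
// nothing executes until an output operation is registered
val counts = lines.count()

// foreachRDD is the output operation that actually triggers execution
counts.foreachRDD { rdd =>
  // each batch's RDD holds exactly one element: that batch's record count
  rdd.collect().foreach(c => println(s"records in batch: $c"))
}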

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-10-29 Thread Chris Fregly
jira created with comments/references to this discussion: https://issues.apache.org/jira/browse/SPARK-4144

On Tue, Aug 19, 2014 at 4:47 PM, Xiangrui Meng wrote: No. Please create one, but it won't be able to catch the v1.1 train. -Xiangrui

On Tue, Aug 19, 2014 at 4:22…

Re: Out of memory with Spark Streaming

2014-10-30 Thread Chris Fregly
curious about why you're only seeing 50 records max per batch. how many receivers are you running? what is the rate at which you're putting data onto the stream? per the default AWS kinesis configuration, the producer can do 1000 PUTs per second, with a max of 50 KB per PUT and a max of 1 MB per second per…

Re: network wordcount example

2014-03-31 Thread Chris Fregly
@eric- i saw this exact issue recently while working on the KinesisWordCount. are you passing "local[2]" to your example as the MASTER arg, versus just "local" or "local[1]"? you need at least 2. it's documented as "n>1" in the scala source docs - which is easy to mistake for n>=1. i just ran t…
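A minimal sketch of why local[2] matters in the NetworkWordCount setup: the socket receiver permanently occupies one core, so at least one more is needed to process the received batches:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]": one core for the receiver, at least one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()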

Re: function state lost when next RDD is processed

2014-04-13 Thread Chris Fregly
or how about the updateStateByKey() operation? https://spark.apache.org/docs/0.9.0/streaming-programming-guide.html - the StatefulNetworkWordCount example demonstrates how to keep state across RDDs.

On Mar 28, 2014, at 8:44 PM, Mayur Rustagi wrote: Are you referring to Spark Streaming?
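A minimal sketch in the StatefulNetworkWordCount style (updateStateByKey requires a checkpoint directory, since state is carried across batches; ssc, wordPairs, and the path are placeholders):

ssc.checkpoint("hdfs:///checkpoints/stateful-wordcount")

// carry a running count per key across batches
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(runningCount.getOrElse(0) + newValues.sum)

// wordPairs: DStream[(String, Int)] (assumed)
val runningCounts = wordPairs.updateStateByKey[Int](updateFunc)
runningCounts.print()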

Re: Checkpoint Vs Cache

2014-05-02 Thread Chris Fregly
http://docs.sigmoidanalytics.com/index.php/Checkpoint_and_not_running_out_of_disk_space

On Mon, Apr 14, 2014 at 2:43 AM, Cheng Lian wrote: Checkpointed RDDs are materialized on disk, while cached RDDs are materialized in memory. When memory is insufficient, cached RDD blocks (1 block per…

Re: Multiple Streams with Spark Streaming

2014-05-03 Thread Chris Fregly
if you want to use true Spark Streaming (not the same as Hadoop Streaming/Piping, as Mayur pointed out), you can use the DStream.union() method as described in the following docs:

http://spark.apache.org/docs/0.9.1/streaming-custom-receivers.html
http://spark.apache.org/docs/0.9.1/streaming-progra…

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-03 Thread Chris Fregly
not sure if this directly addresses your issue, peter, but it's worth mentioning a handy AWS EMR utility called s3distcp that can upload a single HDFS file - in parallel - to a single, concatenated S3 file once all the partitions are uploaded. kinda cool. here's some info: http://docs.aws.amazon.…

Re: help me

2014-05-03 Thread Chris Fregly
as Mayur indicated, it's odd that you are seeing better performance from a less-local configuration. however, the non-deterministic behavior that you describe is likely caused by GC pauses in your JVM process. take note of the *spark.locality.wait* configuration parameter described here: http://s

Re: Equally weighted partitions in Spark

2014-05-03 Thread Chris Fregly
@deenar- i like the custom partitioner strategy that you mentioned. i think it's very useful. as a thought exercise, is it possible to re-partition your RDD to more evenly distribute the long-running tasks among the short-running tasks by ordering the keys differently? this would play nice wit…
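A minimal sketch of a cost-aware custom partitioner, assuming the expensive keys are known up front (the key set and partition count are placeholders):

import org.apache.spark.Partitioner

// spread known-expensive keys round-robin so no single partition collects them all
class CostAwarePartitioner(override val numPartitions: Int,
                           expensiveKeys: Seq[String]) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case k: String if expensiveKeys.contains(k) =>
      expensiveKeys.indexOf(k) % numPartitions
    case k =>
      val raw = k.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw  // keep hash-based partitions non-negative
  }
}

// usage: pairRdd.partitionBy(new CostAwarePartitioner(8, Seq("hotKey1", "hotKey2")))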

Re: spark streaming question

2014-05-04 Thread Chris Fregly
great questions, weide. in addition, i'd also like to hear more about how to horizontally scale a spark-streaming cluster. i've gone through the samples (standalone mode) and read the documentation, but it's still not clear to me how to scale this puppy out under high load. i assume i add more r

Re: Is there any problem on the spark mailing list?

2014-05-11 Thread Chris Fregly
btw, you can see all "missing" messages from May 7th (the start of the outage) here: http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/browser

the last message i received in my inbox was this one: Cheney Sun, "Re: master attempted to re-register the worker and then took all workers as unregis…"