Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Asher Krim
checkpointDirectory); sparkContext.setCheckpointDir(checkpointPath); Asher Krim Senior Software Engineer On Tue, May 30, 2017 at 12:37 PM, Everett Anderson wrote: > Still haven't found a --conf option. > > Regarding a temporary HDFS checkpoint directory, it looks like when using > -
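The checkpoint setup discussed in this thread can be sketched as follows. This is an illustrative sketch only: the bucket name and paths are hypothetical placeholders, not values from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: choosing between S3 and HDFS for checkpoints on EMR.
// Bucket name and paths below are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo"))

// Durable option: survives cluster termination, but slower writes
sc.setCheckpointDir("s3://my-bucket/spark-checkpoints/")

// Ephemeral option: faster, but lost when the EMR cluster is torn down
// sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints/")
```

As the thread notes, no dedicated `--conf` option was found for this; `SparkContext.setCheckpointDir` is the programmatic route.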

Re: KMean clustering resulting Skewed Issue

2017-03-29 Thread Asher Krim
e, so any bag-of-words approach to clustering will likely fail unless you first convert the features to a smaller and denser space Asher Krim Senior Software Engineer On Wed, Mar 29, 2017 at 5:49 PM, Reth RM wrote: > Hi Krim, > > The dataset that I am experimenting with is gold-trut
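The advice above (map bag-of-words features into a smaller, denser space before clustering) might be sketched like this. Column names, vector size, and k are illustrative assumptions, not values from the thread.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{Tokenizer, Word2Vec}

// Sketch: replace sparse bag-of-words with dense Word2Vec document
// vectors, then run KMeans on the dense space.
// The "text" column, vectorSize, and k are hypothetical choices.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val word2vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("features") // averaged word vectors per document
  .setVectorSize(100)
val kmeans = new KMeans().setK(10).setFeaturesCol("features")
```

LSA (TF-IDF plus SVD) or LDA topic distributions, mentioned in the sibling reply, would serve the same purpose of producing a low-dimensional dense representation.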

Re: KMean clustering resulting Skewed Issue

2017-03-26 Thread Asher Krim
(LSA, LDA, document2vec, etc). Other than that, this isn't a Spark question. Asher Krim Senior Software Engineer On Fri, Mar 24, 2017 at 9:37 PM, Reth RM wrote: > Hi, > > I am using spark k mean for clustering records that consist of news > documents, vectors are created by ap

Re: HBase Spark

2017-02-03 Thread Asher Krim
03 PM, Benjamin Kim wrote: > Asher, > > You’re right. I don’t see anything but 2.11 being pulled in. Do you know > where I can change this? > > Cheers, > Ben > > > On Feb 3, 2017, at 10:50 AM, Asher Krim wrote: > > Sorry for my persistence, but did you actually run &q

Re: HBase Spark

2017-02-03 Thread Asher Krim
> Ben > > > On Feb 3, 2017, at 8:16 AM, Asher Krim wrote: > > Did you check the actual maven dep tree? Something might be pulling in a > different version. Also, if you're seeing this locally, you might want to > check which version of the scala sdk your IDE is using >

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-03 Thread Asher Krim
mance differences between MLeap and vanilla Spark? What does Tensorflow support look like? I would love to serve models from a java stack while being agnostic to what framework was used to train them. Thanks, Asher Krim Senior Software Engineer On Fri, Feb 3, 2017 at 11:53 AM, Hollin Wilkins

Re: HBase Spark

2017-02-03 Thread Asher Krim
Did you check the actual maven dep tree? Something might be pulling in a different version. Also, if you're seeing this locally, you might want to check which version of the scala sdk your IDE is using Asher Krim Senior Software Engineer On Thu, Feb 2, 2017 at 5:43 PM, Benjamin Kim wrote:
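Checking the dependency tree for a Scala version mismatch, as suggested above, might look like this. The grep pattern is an illustrative assumption about how mixed 2.10/2.11 artifacts would surface.

```shell
# Sketch: print the full resolved dependency tree, then look for
# artifacts built against different Scala versions (e.g. _2.10 vs _2.11).
mvn dependency:tree | grep -E '_2\.1[01]'

# To see what is pulling in a specific unwanted artifact:
mvn dependency:tree -Dverbose
```

Any artifact ending in `_2.10` alongside `_2.11` ones would explain the kind of mismatch discussed in this thread.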

Re: HBase Spark

2017-02-02 Thread Asher Krim
Ben, That looks like a scala version mismatch. Have you checked your dep tree? Asher Krim Senior Software Engineer On Thu, Feb 2, 2017 at 1:28 PM, Benjamin Kim wrote: > Elek, > > Can you give me some sample code? I can’t get mine to work. > > import org.apache.spark.

Re: mysql and Spark jdbc

2017-01-12 Thread Asher Krim
Have you tried using an alias? You should be able to replace ("dbtable", "sometable") with ("dbtable", "SELECT utc_timestamp AS my_timestamp FROM sometable") -- Asher Krim Senior Software Engineer On Thu, Jan 12, 2017 at 10:49 AM, Jorge Machado wrote: > Hi Guy
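The alias workaround suggested above might be sketched like this. The JDBC URL and the outer subquery alias are assumptions (MySQL requires derived tables to be aliased); the column and table names come from the thread.

```scala
// Sketch: pass a subquery as "dbtable" so a reserved column name
// (utc_timestamp) is exposed under an alias. URL is a placeholder.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/mydb")
  .option("dbtable",
    "(SELECT utc_timestamp AS my_timestamp FROM sometable) AS t")
  .load()
```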

Re: How to save spark-ML model in Java?

2017-01-12 Thread Asher Krim
>> exception is thrown. >> >> >> java.lang.UnsupportedOperationException: Pipeline write will fail on >> this Pipeline because it contains a stage which does not implement >> Writable. Non-Writable stage: rfc_98f8c9e0bd04 of type class >> org.apache.spark.ml.classification.Rand >> >> >> Here is my code segment. >> >> >> model.write().overwrite().save("mypath"); >> >> >> How to resolve this? >> >> Thanks and regards! >> >> Minudika >> >> > -- Asher Krim Senior Software Engineer
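For reference, saving and reloading a fitted pipeline looks like the sketch below. The exception quoted above means one stage (here a random forest model) does not implement the ML writer interface in the Spark version being used; in later Spark releases tree-ensemble models gained persistence support. Variable names and the path are hypothetical.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}

// Sketch: persistence only works when every stage is writable.
val model: PipelineModel = pipeline.fit(trainingData) // hypothetical pipeline/data
model.write.overwrite().save("mypath")

// Reload later, e.g. in a serving job
val reloaded = PipelineModel.load("mypath")
```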

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-15 Thread Asher Krim
searched, but haven't found anything. > > Thanks! > -- > Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io > -- Asher Krim Senior Software Engineer
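One common route for the question in this subject line (cosine similarity from DataFrame-based ML vectors) is to drop down to the RDD-based `RowMatrix`. This is a sketch under the assumption of Spark 2.x, where `Vectors.fromML` bridges the two vector types; the column name is hypothetical.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch: convert ml.linalg vectors in a DataFrame column to the
// older mllib type, then compute column-wise cosine similarities.
val rows = df.select("features").rdd.map { row =>
  Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector](0))
}
val mat = new RowMatrix(rows)
val sims = mat.columnSimilarities() // CoordinateMatrix of cosine similarities
```

Note that `columnSimilarities` compares columns, not rows, so the data may need transposing depending on the use case.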

Re: example LDA code ClassCastException

2016-11-03 Thread Asher Krim
rk.rdd.MapPartitionsRDD.compute( > MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run( > Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker( > ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run( > ThreadPoolExecutor.java:617) > ... 1 more > > > > > -- > View this message in context: http://apache-spark-user-list. > 1001560.n3.nabble.com/example-LDA-code-ClassCastException-tp28009.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Asher Krim Senior Software Engineer

Re: LIMIT issue of SparkSQL

2016-10-29 Thread Asher Krim
We have also found LIMIT to take an unacceptable amount of time when reading parquet formatted data from s3. LIMIT was not strictly needed for our usecase, so we worked around it -- Asher Krim Senior Software Engineer On Fri, Oct 28, 2016 at 5:36 AM, Liz Bai wrote: > Sorry for the late re
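Where LIMIT itself is not strictly required, as in the workaround mentioned above, alternatives like the following sketch may avoid the slow path. The path and numbers are placeholders; whether these are faster depends on the Spark version and data layout.

```scala
// Sketch: alternatives to a slow `LIMIT n` over Parquet on S3.
val df = spark.read.parquet("s3://bucket/path/") // placeholder path

// Pull a small number of rows to the driver directly
val firstRows = df.take(100)

// Or work with a random fraction instead of an exact row count
val sampled = df.sample(withReplacement = false, fraction = 0.001)
```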

Re: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread Asher Krim
Yes, absolutely. Take a look at: https://spark.apache.org/docs/1.4.1/mllib-statistics.html#summary-statistics On Fri, Aug 28, 2015 at 8:39 AM, ashensw wrote: > Hi all, > > I have a dataset which consist of large number of features(columns). It is > in csv format. So I loaded it into a spark data
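The summary-statistics API linked above yields column-wise min and max in one pass. A minimal sketch, with made-up data:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Sketch: column-wise summary statistics over an RDD of vectors.
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)
))
val summary = Statistics.colStats(observations)
summary.min   // per-column minimums: [1.0, 10.0]
summary.max   // per-column maximums: [3.0, 30.0]
summary.mean  // per-column means, available from the same pass
```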

Re: Job hang when running random forest

2015-07-29 Thread Asher Krim
Did you get a thread dump? We have experienced similar problems during shuffle operations due to a deadlock in InetAddress. Specifically, look for a runnable thread at something like "java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)". Our "solution" has been to put a timeout around the c
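The "timeout around the call" workaround described above (the snippet is truncated) might be sketched as wrapping the lookup in a future with a bounded wait. This is purely illustrative, not the thread's actual code; the 5-second bound is an arbitrary choice.

```scala
import java.net.InetAddress
import java.util.concurrent.{Callable, Executors, TimeUnit, TimeoutException}

// Sketch: bound a potentially hanging InetAddress lookup with a timeout.
val pool = Executors.newSingleThreadExecutor()
val future = pool.submit(new Callable[InetAddress] {
  override def call(): InetAddress = InetAddress.getLocalHost
})
try {
  val addr = future.get(5, TimeUnit.SECONDS) // fail fast instead of hanging
} catch {
  case _: TimeoutException => future.cancel(true) // interrupt the stuck lookup
}
```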

spark task hangs at BinaryClassificationMetrics (InetAddress related)

2015-07-13 Thread Asher Krim
:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) Thanks, Asher Krim