High Availability for Spark Streaming application running in Kubernetes

2020-06-24 Thread Shenson Joseph
Hello, I have a Spark Streaming application running in Kubernetes, and we use the Spark Operator to submit Spark jobs. Any suggestions on: 1. How to handle high availability for Spark Streaming applications? 2. What would be the best approach to handle high availability of checkpoint data if we don't us

Re: share datasets across multiple spark-streaming applications for lookup

2017-10-31 Thread Joseph Pride
Folks: SnappyData. I’m fairly new to working with it myself, but it looks pretty promising. It marries Spark with a co-located in-memory GemFire (or something gem-related) database. So you can access the data with SQL, JDBC, ODBC (if you wanna go Enterprise instead of open-source) or natively

GraphFrames 0.5.0 - critical bug fix + other improvements

2017-05-19 Thread Joseph Bradley
eases/tag/release-0.5.0 *Docs*: http://graphframes.github.io/ *Spark Package*: https://spark-packages.org/package/graphframes/graphframes *Source*: https://github.com/graphframes/graphframes Thanks to all contributors and to the community for feedback! Joseph -- Joseph Bradley Software Eng

GraphFrames 0.4.0 release, with Apache Spark 2.1 support

2017-03-28 Thread Joseph Bradley
ocs*: http://graphframes.github.io/ *Spark Package*: https://spark-packages.org/package/graphframes/graphframes *Source*: https://github.com/graphframes/graphframes Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc.

Re: LDA in Spark

2017-03-23 Thread Joseph Bradley
A does not have this yet. Please feel free to make a feature request JIRA for it! Thanks, Joseph On Thu, Mar 23, 2017 at 4:54 PM, Mathieu DESPRIEE wrote: > Hello Joseph, > > I saw your contribution to online LDA in Spark (SPARK-5563). Please allow > me a very quick question : >

SQL warehouse dir

2017-02-10 Thread Joseph Naegele
Hi all, I've read the docs for Spark SQL 2.1.0 but I'm still having issues with the warehouse and related details. I'm not using Hive proper, so my hive-site.xml consists only of: javax.jdo.option.ConnectionURL jdbc:derby:;databaseName=/mnt/data/spark/metastore_db;create=true I've set "sp
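Spelled out, the embedded-Derby metastore setup quoted in this message is just the following hive-site.xml fragment (the path is the one from the message; adjust to your environment). Note that in Spark 2.x the warehouse location itself is normally controlled by the `spark.sql.warehouse.dir` property rather than by hive-site.xml alone.

```xml
<configuration>
  <!-- Embedded Derby metastore; only one Spark/Hive session can hold it at a time. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/mnt/data/spark/metastore_db;create=true</value>
  </property>
</configuration>
```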

Spark SQL 1.6.3 ORDER BY and partitions

2017-01-06 Thread Joseph Naegele
I have two separate but similar issues that I've narrowed down to a pretty good level of detail. I'm using Spark 1.6.3, particularly Spark SQL. I'm concerned with a single dataset for now, although the details apply to other, larger datasets. I'll call it "table". It's around 160 M records, ave

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Joseph Bradley
t(sc.hadoopConfiguration)
  .delete(new Path(s"${checkpointDir.get}/${iteration - checkpointInterval}"), true)
}
System.gc() // hint Spark to clean shuffle directories
}

Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> Do you have more of the exception stack?
>
> From: Ankur Srivastava
> Sent: Wednesday, January 4, 2017 4:40:02 PM
> To: user@spark.apache.org
> Subject: Spark GraphFrame ConnectedComponents
>
> Hi,
>
> I am trying to use the ConnectedComponents algorithm of GraphFrames, but by
> default it needs a checkpoint directory. As I am running my Spark cluster
> with S3 as the DFS and do not have access to an HDFS file system, I tried
> using an S3 directory as the checkpoint directory, but I run into the below
> exception:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
> s3n://, expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
>
> If I set the checkpoint interval to -1 to avoid checkpointing, the driver
> just hangs after 3 or 4 iterations.
>
> Is there some way I can set the default FileSystem to S3 for Spark, or any
> other option?
>
> Thanks
> Ankur

-- Joseph Bradley, Software Engineer - Machine Learning, Databricks, Inc.
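The cleanup loop in the quoted snippet deletes the checkpoint directory written `checkpointInterval` iterations earlier. The arithmetic can be sketched without Spark (names mirror the email; the real code would call `FileSystem.delete()` on the returned path):

```python
# Plain-Python sketch of the checkpoint-cleanup arithmetic from the quoted
# snippet: keep recent checkpoints, delete the one written
# `checkpoint_interval` iterations ago.
def stale_checkpoint_path(checkpoint_dir, iteration, checkpoint_interval):
    stale = iteration - checkpoint_interval
    if stale < 0:
        return None  # nothing old enough to delete yet
    return f"{checkpoint_dir}/{stale}"
```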

Storage history in web UI

2017-01-03 Thread Joseph Naegele
Hi all, Is there any way to observe Storage history in Spark, i.e. which RDDs were cached and where, etc. after an application completes? It appears the Storage tab in the History Server UI is useless. Thanks --- Joe Naegele Grier Forensics ---

RE: [Spark SQL] Task failed while writing rows

2016-12-19 Thread Joseph Naegele
. Thanks --- Joe Naegele Grier Forensics From: Michael Stratton [mailto:michael.strat...@komodohealth.com] Sent: Monday, December 19, 2016 10:00 AM To: Joseph Naegele Cc: user Subject: Re: [Spark SQL] Task failed while writing rows It seems like an issue w/ Hadoop. What do you get when

[Spark SQL] Task failed while writing rows

2016-12-18 Thread Joseph Naegele
Hi all, I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). Its current compressed size is around 13 GB, but my problem started when it was much smaller, maybe 5 GB. This dataset is generated by

spark nightly builds with Hadoop 2.7

2016-09-09 Thread Joseph Naegele
Hello, I'm using the Spark nightly build "spark-2.1.0-SNAPSHOT-bin-hadoop2.7" from http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/ due to bugs in Spark 2.0.0 (SPARK-16740, SPARK-16802), however I noticed that the recent builds only come in "-hadoop2.4-without-hive" and "-without

Re: GraphFrames 0.2.0 released

2016-08-26 Thread Joseph Bradley
This should do it: https://github.com/graphframes/graphframes/releases/tag/release-0.2.0 Thanks for the reminder! Joseph On Wed, Aug 24, 2016 at 10:11 AM, Maciej Bryński wrote: > Hi, > Do you plan to add tag for this release on github ? > https://github.com/graphframes/graphframes

Re: Spark Thrift Server Concurrency

2016-06-26 Thread Prabhu Joseph
nd not others? > > It sounds like an interesting problem… > > On Jun 23, 2016, at 5:21 AM, Prabhu Joseph > wrote: > > Hi All, > >On submitting 20 parallel same SQL query to Spark Thrift Server, the > query execution time for some queries are less than a second and some a

Spark Thrift Server Concurrency

2016-06-23 Thread Prabhu Joseph
concurrency is affected by the single driver. How can the concurrency be improved, and what are the best practices? Thanks, Prabhu Joseph

sparkR.init() can not load sparkPackages.

2016-06-16 Thread Joseph
t, "file:/home/hadoop/spark-1.6.1-bin-hadoop2.6/data/mllib/sample_tree_data.csv", "csv") registerTempTable(people, "people") teenagers <- sql(sqlContext, "SELECT * FROM people") head(teenagers) Joseph

The metastore database gives errors when starting the spark-sql CLI.

2016-05-13 Thread Joseph
EMENT: SELECT @@version This does not affect normal use, but maybe it is a bug! ( I use spark 1.6.1 and hive 1.2.1) Joseph

When starting spark-sql, PostgreSQL gives errors.

2016-05-13 Thread Joseph
EMENT: SELECT @@version This does not affect normal use, but maybe it is a bug! ( I use spark 1.6.1 and hive 1.2.1) Joseph

When starting spark-sql, PostgreSQL gives errors.

2016-05-11 Thread Joseph
EMENT: SELECT @@version This does not affect normal use, but maybe it is a bug! ( I use spark 1.6.1 and hive 1.2.1) Joseph

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Joseph Bradley
+1 By the way, the JIRA for tracking (Scala) API parity is: https://issues.apache.org/jira/browse/SPARK-4591 On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia wrote: > This sounds good to me as well. The one thing we should pay attention to > is how we update the docs so that people know to start w

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
Can you try reducing maxBins? That reduces communication (at the cost of coarser discretization of continuous features). On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley wrote: > In my experience, 20K is a lot but often doable; 2K is easy; 200 is > small. Communication scales linearly

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
In my experience, 20K is a lot but often doable; 2K is easy; 200 is small. Communication scales linearly in the number of features. On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov wrote: > Joseph, > > Correction, there 20k features. Is it still a lot? > What number of features can b

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Joseph Bradley
First thought: 70K features is *a lot* for the MLlib implementation (and any PLANET-like implementation) Using fewer partitions is a good idea. Which Spark version was this on? On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov wrote: > The questions I have in mind: > > Is it smth that the one mi

Re: Handling Missing Values in MLLIB Decision Tree

2016-03-22 Thread Joseph Bradley
It does not currently handle surrogate splits. You will need to preprocess your data to remove or fill in missing values. I'd recommend using the DataFrame API for that since it comes with a number of na methods. Joseph On Thu, Mar 17, 2016 at 9:51 PM, Abir Chakraborty wrote:

Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
reful thought, we could probably avoid using indices altogether. I just created https://issues.apache.org/jira/browse/SPARK-14043 to track this. On Mon, Mar 21, 2016 at 11:22 AM, Eugene Morozov wrote: > Hi, Joseph, > > I thought I understood, why it has a limit of 30 levels for decision t

Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
a design document (Google doc & PDF). Thanks in advance for feedback! Joseph

Spark: The built-in indexes in ORC files do not work.

2016-03-20 Thread Joseph
4 disks per datanode. Data size: total 800 ORC files, each file about 51 MB; 560,000,000 rows in total, 57 columns; only one table, named gprs (ORC format). Thanks! Joseph

Improving Spark Scheduler Delay

2016-03-19 Thread Prabhu Joseph
, Prabhu Joseph

Re: Re: The built-in indexes in ORC files do not work.

2016-03-19 Thread Joseph
select count(*) from gprs where terminal_type=0; -- scans all the data. Time taken: 395.968 seconds. The following is my environment: 3 nodes, 12 CPU cores per node, 48G memory free per node, 4 disks per node, 3 replications per block, hadoop 2.7.2,hi

The built-in indexes in ORC files do not work.

2016-03-16 Thread Joseph
efore (especially in spark SQL)? What's my issue? Thanks! Joseph

Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
Tasks in the > 163 Stages that were skipped. > > I think -- but the Spark UI's accounting may not be 100% accurate and bug > free. > > On Tue, Mar 15, 2016 at 6:34 PM, Prabhu Joseph > wrote: > >> Okay, so out of 164 stages, 163 are skipped. And how 41405 tas

Re: Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
pped -- i.e. no need to recompute that stage. > > On Tue, Mar 15, 2016 at 5:50 PM, Jeff Zhang wrote: > >> If RDD is cached, this RDD is only computed once and the stages for >> computing this RDD in the following jobs are skipped. >> >> >> On Wed, Mar 16, 2016 at

Spark UI Completed Jobs

2016-03-15 Thread Prabhu Joseph
/14 15:35:32 1.4 min 164/164 * (163 skipped) *19841/19788 *(41405 skipped)* Thanks, Prabhu Joseph

Re: Launch Spark shell using differnt python version

2016-03-15 Thread Prabhu Joseph
pyspark script. DEFAULT_PYTHON="/ANACONDA/anaconda2/bin/python2.7" Thanks, Prabhu Joseph On Tue, Mar 15, 2016 at 11:52 AM, Stuti Awasthi wrote: > Hi All, > > > > I have a Centos cluster (without any sudo permissions) which has by > default Python 2.6. Now I hav
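The workaround described (pointing pyspark at an Anaconda interpreter) is usually done with environment variables rather than editing the pyspark script itself; `PYSPARK_PYTHON` is the standard knob. A sketch, with the path taken from the message:

```shell
# Make pyspark use the Anaconda interpreter instead of the system Python 2.6
# (path illustrative; set PYSPARK_DRIVER_PYTHON too so driver and workers match).
export PYSPARK_PYTHON=/ANACONDA/anaconda2/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/ANACONDA/anaconda2/bin/python2.7
```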

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Prabhu Joseph
> > On 14 March 2016 at 08:06, Sabarish Sasidharan < > sabarish.sasidha...@manthan.com> wrote: > >> Which version of Spark are you using? The configuration varies by version. >> >> Regards >> Sab >> >> On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Jose

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Prabhu Joseph
our case. > > Regards > Sab > > On Mon, Mar 14, 2016 at 2:20 PM, Prabhu Joseph > wrote: > >> It is a Spark-SQL and the version used is Spark-1.2.1. >> >> On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan < >> sabarish.sasidha...@manthan.com>

Re: Hive Query on Spark fails with OOM

2016-03-14 Thread Prabhu Joseph
gt; >> >> >> On 14 March 2016 at 08:06, Sabarish Sasidharan < >> sabarish.sasidha...@manthan.com> wrote: >> >>> Which version of Spark are you using? The configuration varies by >>> version. >>> >>> Regards >>> Sab

Hive Query on Spark fails with OOM

2016-03-13 Thread Prabhu Joseph
of memory for cache. So, when a Spark Executor has lot of memory available for cache and does not use the cache but when there is a need to do lot of shuffle, will executors only use the shuffle fraction which is set for doing shuffle or will it use the free memory available for cache as well. Thanks, Prabhu Joseph
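For context on the question above: in Spark's legacy (pre-1.6) static memory management, the shuffle and storage pools are fixed, separate fractions, so shuffle spills once it exceeds its own fraction even if the cache fraction is unused; the unified memory manager introduced in Spark 1.6 does let execution borrow free storage memory. A sketch of the legacy knobs (values illustrative):

```properties
# spark-defaults.conf, legacy static memory management (pre-Spark 1.6).
# Shuffle cannot borrow from the storage (cache) pool under this scheme.
spark.shuffle.memoryFraction   0.2
spark.storage.memoryFraction   0.6
```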

Re: NullPointerException

2016-03-12 Thread Prabhu Joseph
ata through kafka. > > On Sat 12 Mar, 2016 20:28 Ted Yu, > wrote: > >> Interesting. >> If kv._1 was null, shouldn't the NPE have come from getPartition() (line >> 105) ? >> >> Was it possible that records.next() returned null ? >> >&

Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192, I suspect some input record has a null key. 189 while (records.hasNext) { 190 addElementsRead() 191 kv = records.next() 192 map.changeValue((getPartition(kv._1), kv._1), update) On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph wrote: > Look
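The suspicion in this thread is that a null key reaches `getPartition` (hashing a null key is what would throw the NPE at line 192). A plain-Python sketch of the defensive fix, filtering null-key records before partitioning (names are illustrative, not Spark APIs):

```python
# Plain-Python illustration: a null key fails inside partitioning, so records
# with null keys are dropped first.
def get_partition(key, num_partitions):
    if key is None:
        # In Spark's ExternalSorter this is where the NullPointerException
        # would surface (kv._1.hashCode on a null key).
        raise ValueError("null key")
    return hash(key) % num_partitions

def partition_safely(records, num_partitions):
    """Drop null-key records, then assign each survivor a partition."""
    return [(get_partition(k, num_partitions), (k, v))
            for k, v in records if k is not None]
```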

Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192: 189 while (records.hasNext) { addElementsRead() kv = records.next() map.changeValue((getPartition(kv._1), kv._1), update) maybeSpillCollection(usingMap = true) } On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru wrote: > I am seeing the following exception

Re: Spark configuration with 5 nodes

2016-03-10 Thread Prabhu Joseph
hanks, Prabhu Joseph On Fri, Mar 11, 2016 at 3:45 AM, Ashok Kumar wrote: > > Hi, > > We intend to use 5 servers which will be utilized for building Bigdata > Hadoop data warehouse system (not using any propriety distribution like > Hortonworks or Cloudera or others). > All server

Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Prabhu Joseph
cate hot cached blocks right? > > > On Tuesday, March 8, 2016, Prabhu Joseph > wrote: > >> Hi All, >> >> When a Spark Job is running, and one of the Spark Executor on Node A >> has some partitions cached. Later for some other stage, Scheduler tries to

Spark Scheduler creating Straggler Node

2016-03-08 Thread Prabhu Joseph
shuffle files from an external service instead of from each other which will offload the load on Spark Executors. We want to check whether a similar thing of an External Service is implemented for transferring the cached partition to other executors. Thanks, Prabhu Joseph

Spark Partitioner vs Spark Shuffle Manager

2016-03-07 Thread Prabhu Joseph
Hi All, What is the difference between the Spark Partitioner and the Spark shuffle manager? The Spark Partitioner is by default a hash partitioner, and the Spark shuffle manager is sort-based; the others are Hash and Tungsten Sort. Thanks, Prabhu Joseph

Spark Custom Partitioner not picked

2016-03-06 Thread Prabhu Joseph
= { val pieces = line.split(' ') val level = pieces(2).toString val one = pieces(0).toString val two = pieces(1).toString (level, LogClass(one, two)) } val output = logData.map(x => parse(x)) *val partitioned = output.partitionBy(new ExactPartitioner(5)).persist() val groups = partitioned.groupByKey(new ExactPartitioner(5))* groups.count() output.partitions.size partitioned.partitions.size } } Thanks, Prabhu Joseph
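A common reason a custom partitioner is "not picked" is partitioner equality: Spark only reuses an existing partitioning when the two partitioner instances compare equal, so a custom partitioner should define equality by its parameters. A plain-Python sketch of the `ExactPartitioner` from the snippet (with Spark it would extend `org.apache.spark.Partitioner`; the hash-modulo semantics here are an assumption, not taken from the email):

```python
# Plain-Python sketch of a custom partitioner. The equality methods are the
# important part: two ExactPartitioner(5) instances must compare equal, or a
# downstream partitionBy/groupByKey cannot recognize the existing partitioning.
class ExactPartitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, key):
        return hash(key) % self.num_partitions

    def __eq__(self, other):
        return (isinstance(other, ExactPartitioner)
                and other.num_partitions == self.num_partitions)

    def __hash__(self):
        return self.num_partitions
```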

Re: Spark on Yarn with Dynamic Resource Allocation. Container always marked as failed

2016-03-02 Thread Prabhu Joseph
Were all NodeManager services restarted after the change in yarn-site.xml? On Thu, Mar 3, 2016 at 6:00 AM, Jeff Zhang wrote: > The executor may fail to start. You need to check the executor logs, if > there's no executor log then you need to check node manager log. > > On Wed, Mar 2, 2016 at 4:26 P

Spark job on YARN ApplicationMaster DEBUG log

2016-03-02 Thread Prabhu Joseph
Hi All, I am trying to enable DEBUG logging for the Spark ApplicationMaster, but it is not working. On running the Spark job, I passed -Dlog4j.configuration=file:/opt/mapr/spark/spark-1.4.1/conf/log4j.properties. The log4j.properties has log4j.rootCategory=DEBUG, console. The Spark executor containers have DEBUG logs but
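In yarn-cluster mode the ApplicationMaster hosts the driver, and a `file:` URL pointing at a local path is not guaranteed to exist on the node the AM lands on. One common approach (a sketch, not the only way) is to ship the log4j file to every container with `spark.yarn.dist.files` and reference it by its container-local name:

```properties
# spark-defaults.conf sketch (paths illustrative): distribute log4j.properties
# to all YARN containers, then point AM/driver and executor JVMs at the copy
# in each container's working directory.
spark.yarn.dist.files            file:/opt/mapr/spark/spark-1.4.1/conf/log4j.properties
spark.driver.extraJavaOptions    -Dlog4j.configuration=log4j.properties
spark.executor.extraJavaOptions  -Dlog4j.configuration=log4j.properties
```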

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Prabhu Joseph
Matthias, Can you check appending the jars in LAUNCH_CLASSPATH of spark-1.4.1/sbin/spark_class 2016-03-02 21:39 GMT+05:30 Matthias Niehoff : > no, not to driver and executor but to the master and worker instances of > the spark standalone cluster > > Am 2. März 2016 um 17:05 schrieb Igor Be

Re: Concurreny does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Prabhu Joseph
java old threading is used somewhere. On Friday, February 19, 2016, Jörn Franke wrote: > How did you configure YARN queues? What scheduler? Preemption ? > > > On 19 Feb 2016, at 06:51, Prabhu Joseph > wrote: > > > > Hi All, > > > >When running con

Concurreny does not improve for Spark Jobs with Same Spark Context

2016-02-18 Thread Prabhu Joseph
taking 2-3 times longer than A, which shows concurrency does not improve with shared Spark Context. [Spark Job Server] Thanks, Prabhu Joseph

Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Prabhu Joseph
ext] > > res0: Boolean = true > > > > On Mon, Feb 15, 2016 at 8:51 PM, Prabhu Joseph > wrote: > >> Hi All, >> >> On creating HiveContext in spark-shell, fails with >> >> Caused by: ERROR XSDB6: Another instance of Derby may have already

Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Prabhu Joseph
. But without HiveContext, i am able to query the data using SqlContext . scala> var df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/SPARK/abc") df: org.apache.spar

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Prabhu Joseph
wrong SPARK_MASTER_IP at worker nodes. Check the logs of other workers running to see what SPARK_MASTER_IP it has connected, I don't think it is using a wrong Master IP. Thanks, Prabhu Joseph On Mon, Feb 15, 2016 at 12:34 PM, Kartik Mathur wrote: > Thanks Prabhu , > > I had

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Prabhu Joseph
Worker nodes are exactly the same as what Spark Master GUI shows. Thanks, Prabhu Joseph On Mon, Feb 15, 2016 at 11:51 AM, Kartik Mathur wrote: > on spark 1.5.2 > I have a spark standalone cluster with 6 workers , I left the cluster idle > for 3 days and after 3 days I saw only 4 worke

Re: Spark Job on YARN accessing Hbase Table

2016-02-10 Thread Prabhu Joseph
hadoop-2.5.1 and hence spark.yarn.dist.files does not work with hadoop-2.5.1, spark.yarn.dist.files works fine on hadoop-2.7.0, as CWD/* is included in container classpath through some bug fix. Searching for the JIRA. Thanks, Prabhu Joseph On Wed, Feb 10, 2016 at 4:04 PM, Ted Yu wrote: > H

Re: Spark Job on YARN accessing Hbase Table

2016-02-10 Thread Prabhu Joseph
of hbase client jars; when I checked launch_container.sh, the classpath does not have $PWD/* and hence all the hbase client jars are ignored. Is spark.yarn.dist.files not for adding jars into the executor classpath? Thanks, Prabhu Joseph On Tue, Feb 9, 2016 at 1:42 PM, Prabhu Joseph wrote: >

Spark Job on YARN accessing Hbase Table

2016-02-09 Thread Prabhu Joseph
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Thanks, Prabhu Joseph

Re: Long running Spark job on YARN throws "No AMRMToken"

2016-02-08 Thread Prabhu Joseph
+ Spark-Dev On Tue, Feb 9, 2016 at 10:04 AM, Prabhu Joseph wrote: > Hi All, > > A long running Spark job on YARN throws below exception after running > for few days. > > yarn.ApplicationMaster: Reporter thread fails 1 time(s) in a row. > org.apache.hadoop.yarn.exceptio

Long running Spark job on YARN throws "No AMRMToken"

2016-02-08 Thread Prabhu Joseph
://issues.apache.org/jira/browse/SPARK-5342 spark.yarn.credentials.file How to renew the AMRMToken for a long running job on YARN? Thanks, Prabhu Joseph

Re: Spark job does not perform well when some RDD in memory and some on Disk

2016-02-04 Thread Prabhu Joseph
> must be the process of putting ..." > - Edsger Dijkstra > > "If you pay peanuts you get monkeys" > > > 2016-02-04 11:33 GMT+01:00 Prabhu Joseph : > >> Okay, the reason for the task delay within executor when some RDD in >> memory and some in Hadoop i.

Re: Spark job does not perform well when some RDD in memory and some on Disk

2016-02-04 Thread Prabhu Joseph
up and launching it on a less-local node. So after making it 0, all tasks started parallel. But learned that it is better not to reduce it to 0. On Mon, Feb 1, 2016 at 2:02 PM, Prabhu Joseph wrote: > Hi All, > > > Sample Spark application which reads a logfile from hadoop (1.2GB

Re: About cache table performance in spark sql

2016-02-03 Thread Prabhu Joseph
executor does not have enough heap. Thanks, Prabhu Joseph On Thu, Feb 4, 2016 at 11:25 AM, fightf...@163.com wrote: > Hi, > > I want to make sure that the cache table indeed would accelerate sql > queries. Here is one of my use case : > impala table size : 24.59GB, no pa

Spark saveAsHadoopFile stage fails with ExecutorLostfailure

2016-02-02 Thread Prabhu Joseph
, saveAsHadoopFile runs fine. What could be the reason for ExecutorLostFailure failing when cores per executor is high. Error: ExecutorLostFailure (executor 3 lost) 16/02/02 04:22:40 WARN TaskSetManager: Lost task 1.3 in stage 15.0 (TID 1318, hdnprd-c01-r01-14): Thanks, Prabhu Joseph

Re: Spark Executor retries infinitely

2016-02-01 Thread Prabhu Joseph
Thanks Ted. My concern is how to avoid this kind of user error on a production cluster; it would be better if Spark handled this instead of creating an executor every second that fails, overloading the Spark Master. Shall I report a Spark JIRA to handle this? Thanks, Prabhu Joseph On

Spark Executor retries infinitely

2016-02-01 Thread Prabhu Joseph
ores, 2.0 GB RAM 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now LOADING 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now RUNNING .... Thanks, Prabhu Joseph

Re: Spark LDA model reuse with new set of data

2016-01-26 Thread Joseph Bradley
e new data. If you're using ml.clustering.LDAModel (DataFrame API), then you can call transform() on new data. Does that work? Joseph On Tue, Jan 19, 2016 at 6:21 AM, doruchiulan wrote: > Hi, > > Just so you know, I am new to Spark, and also very new to ML (this is my > first conta

Spark on YARN job continuously reports "Application does not exist in cache"

2016-01-13 Thread Prabhu Joseph
application attempt, there are many finishApplicationMaster request causing the ERROR. Need your help to understand on what scenario the above happens. JIRA's related are https://issues.apache.org/jira/browse/SPARK-1032 https://issues.apache.org/jira/browse/SPARK-3072 Thanks, Prabhu Joseph

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-11 Thread Prabhu Joseph
machine and jps -l will list all java processes, jstack -l will give the stack trace. Thanks, Prabhu Joseph On Mon, Jan 11, 2016 at 7:56 PM, Umesh Kacha wrote: > Hi Prabhu thanks for the response. How do I find pid of a slow running > task. Task is running in yarn cluster node. When I
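The sampling procedure described here (find the PID with `jps -l`, then take repeated `jstack` dumps) can be scripted. A minimal sketch, assuming `jstack` is on the PATH; the defaults (30 samples, 2 seconds apart) mirror the "every 2 seconds for a minute" advice in the follow-up message:

```shell
# Sample a JVM's stack traces repeatedly so hot or blocked threads stand out.
# Usage: sample_stacks <pid> [samples] [interval-seconds]
sample_stacks() {
  pid=$1
  samples=${2:-30}    # 30 samples...
  interval=${3:-2}    # ...2 seconds apart ~= one minute of data
  if [ -z "$pid" ]; then
    echo "usage: sample_stacks <pid> [samples] [interval]" >&2
    return 1
  fi
  i=0
  while [ "$i" -lt "$samples" ]; do
    jstack -l "$pid" > "jstack-$pid-$i.txt"
    i=$((i + 1))
    sleep "$interval"
  done
}
```

Diffing the resulting dump files shows which stack frames recur across samples, i.e. where the threads spend their time.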

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-02 Thread Prabhu Joseph
for every 2 seconds and total 1 minute. This will help to identify the code where threads are spending lot of time and then try to tune. Thanks, Prabhu Joseph On Sat, Jan 2, 2016 at 1:28 PM, Umesh Kacha wrote: > Hi thanks I did that and I have attached thread dump images. That was

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-01 Thread Prabhu Joseph
Take a thread dump of the executor process several times in a short time period and check what each thread is doing at different times, which will help to identify the expensive sections in user code. Thanks, Prabhu Joseph On Sat, Jan 2, 2016 at 3:28 AM, unk1102 wrote: > Sorry please see attac

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a problem with the Parquet dependency. What version of Parquet are you building Spark 1.5 off of? (I'm not that familiar with Parquet issues myself, but hopefully a SQL person can chime in.) On Tue, Dec 15, 2015 at 3:23 PM, Rac

Re: SparkML algos limitations question.

2015-12-15 Thread Joseph Bradley
s not an analogous limit for the GLMs you listed, but I'm not very familiar with the perceptron implementation. Joseph On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov wrote: > Hello! > > I'm currently working on POC and try to use Random Forest (classification > and regres

Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
You can do grid search if you set the evaluator to a MulticlassClassificationEvaluator, which expects a prediction column, not a rawPrediction column. There's a JIRA for making BinaryClassificationEvaluator accept prediction instead of rawPrediction. Joseph On Tue, Dec 1, 2015 at 5:

Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+. On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar wrote: > > Hi folks, > > Does anyone know whether the Grid Search capability is enabled since the > issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol > column doesn't exist" when trying to perform a g

Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi, Could you please submit this via JIRA as a bug report? It will be very helpful if you include the Spark version, system details, and other info too. Thanks! Joseph On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > *Issue:* > >

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about """ 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing the benefits to many tree based methods.

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
or (3): We use Breeze, but we have to modify it in order to do distributed optimization based on Spark. Joseph On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu wrote: > Hi everyone, > > I'm curious about the difference between > ml.classifica

Re: Serializing MLlib MatrixFactorizationModel

2015-08-17 Thread Joseph Bradley
I'd recommend using the built-in save and load, which will be better for cross-version compatibility. You should be able to call myModel.save(path), and load it back with MatrixFactorizationModel.load(path). On Mon, Aug 17, 2015 at 6:31 AM, Madawa Soysa wrote: > Hi All, > > I have an issue when

Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
tion? > > On Sat, Jul 25, 2015 at 8:07 AM, Joseph Bradley > wrote: > >> I'd recommend starting with a few of the code examples to get a sense of >> Spark usage (in the examples/ folder when you check out the code). Then, >> you can work through the Spark met

Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
nd an interesting (small) JIRA, examine the piece of code it mentions, and explore out from that initial "entry point." That's how I mostly did it. Good luck! Joseph On Fri, Jul 24, 2015 at 10:48 AM, shashank kapoor < shashank.prof...@gmail.com> wrote: > > > Hi

Error running sbt package on Windows 7 for Spark 1.3.1 and SimpleApp.scala

2015-06-04 Thread Joseph Washington
Hi all, I'm trying to run the standalone application SimpleApp.scala following the instructions on the http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala I was able to create a .jar file by doing sbt package. However when I tried to do $ YOUR_SPARK_HOME/bin/spark-submit

Re: ALS Rating Object

2015-06-03 Thread Joseph Bradley
ain as a DeveloperApi to allow users to use Long instead of Int. We're also thinking about better ways to permit Long IDs. Joseph On Wed, Jun 3, 2015 at 5:04 AM, Yasemin Kaya wrote: > Hi, > > I want to use Spark's ALS in my project. I have the userid > like 3001139722322712

Re: Restricting the number of iterations in Mllib Kmeans

2015-06-01 Thread Joseph Bradley
Hi Suman & Meethu, Apologies---I was wrong about KMeans supporting an initial set of centroids! JIRA created: https://issues.apache.org/jira/browse/SPARK-8018 If you're interested in submitting a PR, please do! Thanks, Joseph On Mon, Jun 1, 2015 at 2:25 AM, MEETHU MATHEW wrote: &g

Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-30 Thread Joseph Bradley
This is really getting into an understanding of how optimization and GLMs work. I'd recommend reading some intro ML or stats literature on how Generalized Linear Models are estimated, as well as how convex optimization is used in ML. There are some free online texts as well as MOOCs which have go

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-30 Thread Joseph Bradley
g a high lambda as a parameter of the logistic regression keeps only > a few significant variables and "deletes" the others with a zero in the > coefficients? What is a high lambda for you? > Is the lambda a parameter available in Spark 1.4 only or can I see it in > Spark 1.

Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread Joseph Bradley
ready calculated. If you have access to a build of the current Spark master (or can wait for 1.4), then the org.apache.spark.ml.classification.LogisticRegression implementation has been compared with R and should get very similar results. Good luck! Joseph On Wed, May 27, 2015 at 8:22 AM, SparknewUs

Re: Multilabel classification using logistic regression

2015-05-27 Thread Joseph Bradley
It looks like you are training each model i (for label i) by only using data with label i. You need to use all of your data to train each model so the models can compare each label i with the other labels (roughly speaking). However, what you're doing is multiclass (not multilabel) classification

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread Joseph Bradley
rithms, by using convex optimization to minimize a regularized log loss. Good luck! Joseph On Fri, May 22, 2015 at 1:07 PM, DB Tsai wrote: > In Spark 1.4, Logistic Regression with elasticNet is implemented in ML > pipeline framework. Model selection can be achieved through high > lamb

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Joseph Bradley
e spark.ml Pipelines API will be a good option. It now has LogisticRegression which does not do feature scaling, and it uses LBFGS or OWLQN (depending on the regularization type) for optimization. It's also been compared with R in unit tests. Good luck! Joseph On Wed, May 20, 2015 at 3:42 PM,

Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Joseph Bradley
One more comment: That's a lot of categories for a feature. If it makes sense for your data, it will run faster if you can group the categories or split the 1895 categories into a few features which have fewer categories. On Wed, May 20, 2015 at 3:17 PM, Burak Yavuz wrote: > Could you please op
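The grouping advice above can be sketched in plain Python: collapse categories seen fewer than a threshold number of times into one bucket, shrinking a high-cardinality feature (1895 categories in this thread) before tree training. The threshold and the "__other__" label are illustrative:

```python
from collections import Counter

# Collapse rare categories into a single "__other__" bucket so a tree learner
# sees far fewer distinct category values.
def collapse_rare(values, min_count):
    counts = Counter(values)
    return [v if counts[v] >= min_count else "__other__" for v in values]
```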

Re: Getting the best parameter set back from CrossValidatorModel

2015-05-19 Thread Joseph Bradley
of improvements to Params and Pipelines, so this should become easier to use! Joseph On Sun, May 17, 2015 at 10:17 PM, Justin Yip wrote: > > Thanks Ram. > > Your sample look is very helpful. (there is a minor bug that > PipelineModel.stages is hidden under private[ml

Re: Implementing custom metrics under MLPipeline's BinaryClassificationEvaluator

2015-05-18 Thread Joseph Bradley
Hi Justin, It sounds like you're on the right track. The best way to write a custom Evaluator will probably be to modify an existing Evaluator as you described. It's best if you don't remove the other code, which handles parameter set/get and schema validation. Joseph On Sun,

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread Joseph Bradley
ions" in the current master instead of "maxIterations," which is sort of a bug in the example). If that does not cap the max iterations, then please report it as a bug. To specify the initial centroids, you will need to modify the DenseKMeans example code. Please see the KMeans API d

Re: Predict.scala using model for clustering In reference

2015-05-07 Thread Joseph Bradley
A KMeansModel was trained in the previous step, and it was saved to "modelFile" as a Java object file. This step is loading the model back and reconstructing the KMeansModel, which can then be used to classify new tweets into different clusters. Joseph On Thu, May 7, 2015 at 12:40

Re: Multilabel Classification in spark

2015-05-05 Thread Joseph Bradley
as well as DecisionTree and RandomForest. Joseph On Tue, May 5, 2015 at 1:27 PM, DB Tsai wrote: > LogisticRegression in MLlib package supports multilable classification. > > Sincerely, > > DB Tsai > --- > Blog: https://www.

Re: MLLib SVM probability

2015-05-04 Thread Joseph Bradley
/jira/browse/SPARK-7015 I agree that naive one-vs-all reductions might not work that well, but that the raw scores could be calibrated using the scaling you mentioned, or other methods. Joseph On Mon, May 4, 2015 at 6:29 AM, Driesprong, Fokko wrote: > Hi Robert, > > I would say, taking t

Re: Multiclass classification using Ml logisticRegression

2015-04-26 Thread Joseph Bradley
ssion API to do multiclass for now, until that JIRA gets completed. (If you're interested in doing it, let me know via the JIRA!) Joseph On Fri, Apr 24, 2015 at 3:26 AM, Selim Namsi wrote: > Hi, > > I just started using spark ML pipeline to implement a multiclass classifier > using Lo

Re: [Ml][Dataframe] Ml pipeline & dataframe repartitioning

2015-04-26 Thread Joseph Bradley
there's a good way to set the buffer size automatically, though. Joseph On Fri, Apr 24, 2015 at 8:20 AM, Peter Rudenko wrote: > Hi i have a next problem. I have a dataset with 30 columns (15 numeric, > 15 categorical) and using ml transformers/estimators to transform each > c

Re: How can I retrieve item-pair after calculating similarity using RowMatrix

2015-04-25 Thread Joseph Bradley
It looks like your code is making 1 Row per item, which means that columnSimilarities will compute similarities between users. If you transpose the matrix (or construct it as the transpose), then columnSimilarities should do what you want, and it will return meaningful indices. Joseph On Fri
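The transpose advice follows from what `columnSimilarities` computes: cosine similarity between columns. A plain-Python illustration (no Spark) of why one row per item yields user-vs-user scores, so items must be stored as columns to compare items:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

def column_similarity(matrix, i, j):
    # What columnSimilarities does, for one pair: compare COLUMNS i and j.
    col_i = [row[i] for row in matrix]
    col_j = [row[j] for row in matrix]
    return cosine(col_i, col_j)
```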
