Hi,
I am trying to save a TF-IDF model in PySpark. Looks like this is not
supported.
Using `model.save()` causes:
AttributeError: 'IDFModel' object has no attribute 'save'
Using `pickle` causes:
TypeError: can't pickle lock objects
Does anyone have suggestions?
Thanks!
Asim
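One workaround to consider (a sketch, not an official save/load path; it
assumes a PySpark version where IDFModel.idf() is available): pickle just the
IDF weight vector, which is a plain local vector, and reapply the weighting by
hand when loading.

    import pickle
    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF

    sc = SparkContext(appName="tfidf-save-sketch")
    docs = sc.parallelize([["spark", "streaming"], ["spark", "mllib"]])

    tf = HashingTF(numFeatures=1 << 18).transform(docs)
    tf.cache()
    model = IDF().fit(tf)

    # The IDFModel wraps a JVM object (hence the pickle error), but the IDF
    # weights themselves are a local vector and pickle fine.
    with open("idf_weights.pkl", "wb") as f:
        pickle.dump(model.idf().toArray(), f)

    # Later, reload the weights and apply TF-IDF manually (elementwise product).
    with open("idf_weights.pkl", "rb") as f:
        idf_weights = pickle.load(f)

    tfidf = tf.map(lambda v: v.toArray() * idf_weights)  # densifies each vector
    print(tfidf.first())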
Here is the full
examples/src/main/scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala>
> .
>
> Yanbo
>
> 2015-12-17 13:41 GMT+08:00 Asim Jalis :
>
I wanted to get the feature importances for a Random Forest as described in
this JIRA: https://issues.apache.org/jira/browse/SPARK-5133
However, I don’t see how to call this. I don't see any methods exposed on
org.apache.spark.mllib.tree.RandomForest
How can I get featureImportances when
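For reference, a sketch of the spark.ml route, assuming a Spark version whose
PySpark RandomForestClassificationModel exposes featureImportances (the column
names, data path, and `spark` session below are assumptions):

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler

    # Hypothetical DataFrame with feature columns f1, f2, f3 and a label column.
    df = spark.read.parquet("hdfs:///data/training.parquet")
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    train = assembler.transform(df)

    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
    model = rf.fit(train)

    # featureImportances is a vector aligned with the assembled feature order.
    print(model.featureImportances)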
When I use reduceByKeyAndWindow with func and invFunc (in PySpark), the size
of the window keeps growing. I am appending the code that reproduces this
issue. It prints out the count() of the DStream, which goes up by 10 elements
every batch.
Is this a bug in the Python version or is this
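The repro code did not survive in this archived snippet; a minimal sketch of
the kind of setup being described (a queue-backed test stream with 10 fresh
keys per batch, reduced over a window with an inverse function; durations and
data are made up) might look like this:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="window-repro-sketch")
    ssc = StreamingContext(sc, 1)          # 1-second batches
    ssc.checkpoint("/tmp/window-checkpoint")

    # Test source: 10 fresh (key, 1) pairs per batch.
    rdd_queue = [sc.parallelize([(b * 10 + i, 1) for i in range(10)])
                 for b in range(60)]
    stream = ssc.queueStream(rdd_queue)

    counts = stream.reduceByKeyAndWindow(
        lambda a, b: a + b,                # func: values entering the window
        lambda a, b: a - b,                # invFunc: values leaving the window
        windowDuration=10,
        slideDuration=1,
        # filterFunc=lambda kv: kv[1] > 0  # documented knob: retain only pairs
        #                                  # whose count has not fallen to zero
    )
    counts.count().pprint()                # number of keys currently in the window

    ssc.start()
    ssc.awaitTermination(30)
    ssc.stop(True, False)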
Another fix might be to remove the exception that is thrown when windowing
and other stateful operations are used without checkpointing.
On Fri, Aug 14, 2015 at 5:43 PM, Asim Jalis wrote:
> I feel the real fix here is to remove the exception from QueueInputDStream
> class by reverting t
e.scala#L61>
>> does something similar for Spark Streaming unit tests.
>>
>> TD
>>
>> On Fri, Aug 14, 2015 at 1:04 PM, Asim Jalis wrote:
>>
I want to test some Spark Streaming code that is using
reduceByKeyAndWindow. If I do not enable checkpointing, I get the error:
java.lang.IllegalArgumentException: requirement failed: The checkpoint
directory has not been set. Please set it by StreamingContext.checkpoint().
But if I enable che
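For the checkpoint requirement itself, one small test-side workaround (a
sketch) is to point checkpointing at a throwaway local temp directory:

    import tempfile
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-test-sketch")
    ssc = StreamingContext(sc, 1)
    # A throwaway local directory keeps the windowed/stateful operators happy.
    ssc.checkpoint(tempfile.mkdtemp())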
In Spark Streaming apps, if I enable ssc.checkpoint(dir), does this
checkpoint all RDDs? Or does it just checkpoint windowing and state RDDs?
For example, if in a DStream I am using an iterative algorithm on a
non-state non-window RDD, do I have to checkpoint it explicitly myself, or
can I assume t
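If explicit checkpointing does turn out to be needed for that kind of RDD, a
rough sketch of doing it inside an output operation (the iteration body and
`stream` are stand-ins, not real code from this thread):

    def run_iterations(time, rdd):
        current = rdd
        for i in range(1, 21):
            current = current.map(lambda x: x)   # stand-in for one iteration
            if i % 5 == 0:
                current.checkpoint()             # explicitly truncate the lineage
                current.count()                  # force materialization
        print(time, current.count())

    # ssc.checkpoint(dir) also sets the SparkContext checkpoint directory,
    # which RDD.checkpoint() needs.
    stream.foreachRDD(run_iterations)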
In Spark Streaming, is there a way to initialize the state
of updateStateByKey before it starts processing RDDs? I noticed that there
is an overload of updateStateByKey that takes an initialRDD in the latest
sources (although not in the 1.2.0 release). Is there another way to do
this until this fea
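One possible stop-gap until that overload is released (a sketch; it assumes
the initial state is small keyed data you can load as an RDD, and that
`input_stream` is your existing (key, count) DStream): feed the initial state
through a one-shot queue stream and union it with the real input, so
updateStateByKey sees it on the first batch.

    # Hypothetical: initial per-key state stored as "key,count" lines in HDFS.
    initial_rdd = sc.textFile("hdfs:///state/initial").map(
        lambda line: (line.split(",")[0], int(line.split(",")[1])))

    # Emits initial_rdd on the first batch only, then an empty RDD afterwards.
    seed_stream = ssc.queueStream([initial_rdd], oneAtATime=True,
                                  default=sc.parallelize([]))

    def update(new_values, state):
        return (state or 0) + sum(new_values)

    seeded = input_stream.union(seed_stream).updateStateByKey(update)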
Since checkpointing in streaming apps happens every checkpoint duration, in
the event of failure, how is the system able to recover the state changes
that happened after the last checkpoint?
Is there a way to join non-DStream RDDs with DStream RDDs?
Here is the use case. I have a lookup table stored in HDFS that I want to
read as an RDD. Then I want to join it with the RDDs that are coming in
through the DStream. How can I do this?
Thanks.
Asim
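A common way to do this (a sketch; the path, record layout, and the `events`
DStream are made up) is to load the lookup table once as a regular RDD, cache
it, and join it against each batch with DStream.transform:

    # Lookup table: "key,attribute" lines in HDFS, loaded once and cached.
    lookup = sc.textFile("hdfs:///data/lookup.csv") \
               .map(lambda line: (line.split(",")[0], line.split(",")[1]))
    lookup.cache()

    # Key each incoming record by the lookup key (hypothetical layout).
    keyed = events.map(lambda e: (e[0], e))

    # transform() exposes each batch as a plain RDD, so a normal join works.
    joined = keyed.transform(lambda rdd: rdd.join(lookup))
    joined.pprint()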
Another option is to make a copy of log4j.properties in the current
directory where you start spark-shell from, and modify
"log4j.rootCategory=INFO,
console" to "log4j.rootCategory=ERROR, console". Then start the shell.
On Wed, Jan 7, 2015 at 3:39 AM, Akhil wrote:
> Edit your conf/log4j.properti
n RDD, not one value.
>
> On Wed, Jan 7, 2015 at 12:10 AM, Paolo Platter
> wrote:
>
>> In my opinion you should use the fold pattern, obviously after a sortBy
>> transformation.
>>
>> Paolo
>>
>> Sent from my Windows Phone
>> --
rs,
> then group. I think this would implement 1-minute buckets, sliding by 10
> seconds:
>
> tickerRDD.flatMap(ticker =>
> (ticker.timestamp - 60000 to ticker.timestamp by 15000).map(ts => (ts,
> ticker))
> ).map { case(ts, ticker) =>
> ((ts / 60000) * 60000,
I guess I can use a similar groupBy approach. Map each event to all the
windows that it can belong to. Then do a groupBy, etc. I was wondering if
there was a more elegant approach.
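For reference, the same idea sketched in PySpark (60-second windows sliding
every 10 seconds; `events` is assumed to be an RDD of (timestamp_ms, price)
pairs, which is a made-up layout):

    WINDOW_MS = 60 * 1000
    SLIDE_MS = 10 * 1000

    def window_starts(ts):
        # Every aligned window start s with s <= ts < s + WINDOW_MS.
        k_min = (ts - WINDOW_MS) // SLIDE_MS + 1
        k_max = ts // SLIDE_MS
        return [k * SLIDE_MS for k in range(k_min, k_max + 1)]

    def mean(values):
        vals = list(values)
        return sum(vals) / len(vals)

    per_window = (events
                  .flatMap(lambda e: [(w, e[1]) for w in window_starts(e[0])])
                  .groupByKey()
                  .mapValues(mean))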
On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis wrote:
> Except I want it to be a sliding window. So the same rec
some kind
> there are more efficient methods than this most generic method, like the
> aggregate methods.
>
> On Tue, Jan 6, 2015 at 8:34 PM, Asim Jalis wrote:
>
>> Thanks. Another question. I have event data with timestamps. I want to
>> create a sliding window using t
rk/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L43
>
> On Tue, Jan 6, 2015 at 5:25 PM, Asim Jalis wrote:
>
Is there an easy way to do a moving average across a single RDD (in a
non-streaming app)? Here is the use case. I have an RDD made up of stock
prices. I want to calculate a moving average using a window size of N.
Thanks.
Asim
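The sliding helper linked above lives on the Scala side; a plain PySpark
alternative (a sketch; N and the prices are placeholders) is to index each
price, emit it into every window it should contribute to, and average per
window:

    N = 5  # window size (placeholder)
    prices = sc.parallelize([10.0, 11.0, 12.5, 13.0, 11.5, 12.0, 14.0])
    count = prices.count()

    def mean(values):
        vals = list(values)
        return sum(vals) / len(vals)

    # (index, price) pairs; window w covers indices w - N + 1 .. w.
    indexed = prices.zipWithIndex().map(lambda p: (p[1], p[0]))
    contributions = indexed.flatMap(
        lambda kv: [(w, kv[1]) for w in range(kv[0], kv[0] + N)])

    moving_avg = (contributions
                  .filter(lambda kv: N - 1 <= kv[0] < count)   # keep only full windows
                  .groupByKey()
                  .mapValues(mean)
                  .sortByKey())
    print(moving_avg.collect())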
Q: In Spark Streaming, if your DStream transformation and output action take
longer than the batch duration, will the system process the next batch in
another thread? Or will it just wait until the first batch’s RDD is
processed? In other words, does it build up a queue of buffered RDDs
awaiting proce
Is there a way I can keep a JDBC connection open through a streaming job? I
have a foreach that runs once per batch. However, I don’t want to open a
connection for each batch, but would rather have a persistent connection
that I can reuse. How can I do this?
Thanks.
Asim
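A common pattern for this (a sketch, not from this thread; get_connection()
is a hypothetical helper around whatever JDBC bridge or connection pool is
available on the workers) is to open one connection per partition inside
foreachPartition, rather than per record or per batch on the driver:

    def write_partition(rows):
        conn = get_connection()          # hypothetical: returns a (pooled) DB connection
        cursor = conn.cursor()
        for key, value in rows:
            cursor.execute("INSERT INTO events (k, v) VALUES (?, ?)", (key, value))
        conn.commit()
        conn.close()                     # or return it to the pool

    # Runs on the executors, once per partition per batch.
    stream.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))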