Hi Raju,
Have you tried setNumPartitions with a larger number?
2017-03-07 0:30 GMT-08:00 Eli Super :
> Hi
>
> This is a broad area of knowledge; you will need to spend several hours
> reading about it online.
>
> What is your programming language?
>
> Try searching online: "machine learning binning %my_programing_la
To add to what Michael said, my experience was that Structured Streaming in
2.0 was half-baked / alpha, but in 2.1 it is significantly more robust.
Also, a lot of its "missing functionality" was not available in Spark
Streaming either way.
HOWEVER, you mentioned that you are thinking about rewriting your
Hi Yuhao,
I have tried numPartitions values of (numExecutors * numExecutorCores),
1000, 2000, and 1, and did not see much improvement.
Having more partitions solved some performance issues, but I did not see any
improvement when I set a lower minSupport.
It is generating 260 million frequent item sets with 6
Thanks a lot Michal & Ofir for your insights.
To Ofir - I have not yet finalized my Spark Streaming code; it is still a
work in progress. Now that Structured Streaming is available, I thought I
would rewrite it to gain maximum benefit in the future. As of now, there are
no specific functional or performanc
On 14 Mar 2017 4:19 p.m., Gaurav Pandya wrote:
On 13 Mar 2017, at 13:24, Sam Elamin <hussam.ela...@gmail.com> wrote:
Hi Jorn
Thanks for the prompt reply. Really, we have 2 main concerns with CD:
ensuring tests pass, and linting the code.
I'd add "providing diagnostics when tests fail", which is a combination of:
tests providing
I agree the reporting is an important aspect. SonarQube (or a similar tool)
can report over time, but it does not support Scala directly (only
indirectly via JaCoCo). In the end, you will need to think about a dashboard
that displays results over time.
> On 14 Mar 2017, at 12:44, Steve Loughran wrote:
Thank you both
Steve, that's a very interesting point. I have to admit I have never
thought of doing analysis over time on the tests, but it makes sense, as the
failures over time tell you quite a bit about your data platform.
Thanks for highlighting! We are using PySpark for now, so I hope some
frame
Hi everybody,
I am running some experiments with the Spark k-means implementation of the
new DataFrame API. I am comparing clustering results of different runs with
different parameters. I noticed that for the random initialization mode, the
seed value is the same every time. How is it calculated? In my
unders
I'm sorry, I missed some important information. I use Spark version 2.0.2
with Scala 2.11.8.
2017-03-14 13:44 GMT+01:00 Julian Keppel :
> Hi everybody,
>
> I make some experiments with the Spark kmeans implementation of the new
> DataFrame-API. I compare clustering results of different runs with
>
Thanks, Kwon. The goal is to preserve whitespace, not to alter data in
general or to do so via user-provided options. It's causing our downstream
jobs to fail.
On Mon, Mar 13, 2017 at 7:23 PM, Hyukjin Kwon wrote:
> Hi, all the options are documented in https://spark.apache.org/
> docs/latest/api/scala
I am hoping to open a discussion around the cross validation in mllib. I
found that I often wanted to evaluate multiple estimators/pipelines (with
different algorithms) or the same estimator with different parameter grids.
The CrossValidator and TrainValidationSplit only allow a single estimator
an
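The workaround today is to wrap one CrossValidator per estimator and compare the results yourself. The plain-Python sketch below shows that manual loop in miniature, with hypothetical stand-in "estimators" (a constant predictor and a mean predictor) scored by negative mean absolute error; it is the pattern, not MLlib code.

```python
# Pattern sketch: evaluate several (estimator, grid) pairs with the same
# folds and keep the best. All names here are illustrative stand-ins.
from itertools import product

def cross_val_score(fit, score, params, folds):
    """Average score of fit(train, params) over (train, test) folds."""
    scores = [score(fit(train, params), test) for train, test in folds]
    return sum(scores) / len(scores)

def select_best(candidates, folds):
    """candidates: list of (name, fit, score, grid) tuples, where grid
    maps a parameter name to a list of values. Returns (score, name, params)."""
    best = None
    for name, fit, score, grid in candidates:
        keys = list(grid)
        for values in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, values))
            s = cross_val_score(fit, score, params, folds)
            if best is None or s > best[0]:
                best = (s, name, params)
    return best

# Toy demo: a constant predictor vs. a mean predictor.
fit_const = lambda train, params: params["c"]            # ignores the data
fit_mean = lambda train, params: sum(train) / len(train)
score_mae = lambda model, test: -sum(abs(model - x) for x in test) / len(test)

folds = [([1.0, 2.0], [3.0, 4.0]), ([3.0, 4.0], [1.0, 2.0])]
candidates = [
    ("const", fit_const, score_mae, {"c": [0.0, 2.5, 10.0]}),
    ("mean", fit_mean, score_mae, {}),
]
best = select_best(candidates, folds)
```

A CrossValidator that accepted such a list of candidates directly would remove exactly this boilerplate.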
To work around an out-of-space issue in a Direct Kafka Streaming
application, we create topics with a low retention policy (retention.ms=30),
which works fine from the Kafka perspective. However, this results in an
OffsetOutOfRangeException in the Spark job (red line below). Is there any
configuration in S
Thanks, TD, for the response. Could you please provide more explanation? I
am running multiple streams in the Spark Streaming application (Spark 2.0.2
using DStreams). I know many people use this setting, so your explanation
will help a lot of people.
Thanks
On Fri, Mar 10, 2017 at 6:24 PM, Tathag
Hi All,
I am currently on Spark 1.6, and I was doing a SQL join on two tables that
are over 100 million rows each, and I noticed that it was spawning 3+ tasks
(this is the progress meter that we are seeing show up). We tried
coalesce, repartition, and shuffle partitions to drop the number of ta
Hi,
Try using this parameter: --conf spark.sql.shuffle.partitions=1000
Thanks,
Mohini
On Tue, Mar 14, 2017 at 3:30 PM, kpeng1 wrote:
> Hi All,
>
> I am currently on Spark 1.6 and I was doing a sql join on two tables that
> are over 100 million rows each and I noticed that it was spawn 3+ ta
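For reference, the suggested setting can be applied at submit time or at runtime; 200 is the default, and 1000 is only an example value (the job file name below is hypothetical).

```shell
# Raise the number of post-shuffle partitions at submit time.
spark-submit --conf spark.sql.shuffle.partitions=1000 my_join_job.py

# Or at runtime, from Spark SQL:
#   SET spark.sql.shuffle.partitions=1000;
```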
This setting allows multiple Spark jobs generated through multiple
foreachRDD calls to run concurrently, even if they are across batches. So
output op2 from batch X can run concurrently with op1 of batch X+1.
This is not safe because it breaks the checkpointing logic in subtle ways.
Note that this was ne
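For readers following along, this is the undocumented setting under discussion; given TD's warning that it can break checkpointing in subtle ways, it should be treated as experimental (the job file name is hypothetical).

```shell
# Allows output operations from different batches to run concurrently.
# Experimental: unsafe in combination with checkpoint-based recovery.
spark-submit --conf spark.streaming.concurrentJobs=2 my_streaming_job.py
```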
How can we scale, or possibly auto-scale, a Spark Streaming application
consuming from Kafka using Kafka direct streams? We are using Spark 1.6.3
and cannot move to 2.x unless there is a strong reason.
Scenario:
Kafka topic with 10 partitions
Standalone cluster running on kubernetes with 1 master and 2
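With the direct stream, read parallelism is capped by the topic's partition count (10 here), so scaling out usually means adding Kafka partitions and matching executor cores. What Spark 1.6 does offer is rate control; a hedged sketch of those settings (the job file name and rate value are illustrative):

```shell
# Backpressure adapts the ingest rate to the observed processing rate;
# maxRatePerPartition caps records/sec read from each Kafka partition.
spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.kafka.maxRatePerPartition=1000 \
  my_streaming_job.py
```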
Hi Jinhong,
Based on the error message, your second collection of vectors has a
dimension of 804202, while the dimension of your training vectors
was 144109. So please make sure your test dataset is of the same dimension
as the training data.
From the test dataset you posted, the vector dimens
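The check being suggested can be done up front in plain Python before calling predict/transform; the helper below is a hypothetical sketch, with the thread's dimensions appearing only in the comment.

```python
# Hedged sketch: fail fast if any test vector's dimension differs from
# the dimension the model was trained with.
def check_dimensions(training_dim, vectors):
    """Raise ValueError on the first vector whose length != training_dim."""
    for i, v in enumerate(vectors):
        if len(v) != training_dim:
            raise ValueError(
                "vector %d has dimension %d, expected %d"
                % (i, len(v), training_dim))

# In the thread, the model was trained on 144109-dimensional vectors,
# so an 804202-dimensional test vector would be rejected here.
check_dimensions(3, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # passes silently
```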
Hello there,
I have one or more Parquet files to read and perform some aggregate queries
on using Spark DataFrames. I would like to find a reasonably fast datastore
that allows me to write the results for subsequent (simpler) queries.
I did attempt to use Elasticsearch to write the query results using
Hello,
I have a schema and the names of columns to apply a UDF to. The column
names are user input, and they can differ in number for each input.
Is there a way to apply UDFs to N columns in a DataFrame?
Thanks!
Hi Julian,
Thanks for reporting this. This is a valid issue and I created
https://issues.apache.org/jira/browse/SPARK-19957 to track it.
Right now the seed is set to this.getClass.getName.hashCode.toLong by
default, which indeed stays the same across multiple fits. Feel free to
leave your comments