I have worked it out; I just let Java call the Scala class's function. Thanks a
lot, Xiaomeng~~
On Friday, November 25, 2016 1:50 AM, Xiaomeng Wan wrote:
Here is the Scala code I use to get the best model; I have never used Java:
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(ne
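A hedged sketch of a full CrossValidator setup (the evaluator, the empty
parameter grid, and the names pipeline/trainingData are assumptions, not
necessarily the exact code used here):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val cv = new CrossValidator()
  .setEstimator(pipeline)                                // the Pipeline built earlier
  .setEvaluator(new BinaryClassificationEvaluator())     // assumes a binary label
  .setEstimatorParamMaps(new ParamGridBuilder().build()) // empty grid: default params
  .setNumFolds(3)

val cvModel = cv.fit(trainingData) // trainingData assumed
val best = cvModel.bestModel       // "the best model" referred to above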
Hi All,
I need to print AUC and PRC for a GBTClassifier model. It seems to work for
RandomForestClassifier but not for GBTClassifier, even though the rawPrediction
column is not in the original data in either case.
The code is:
.. // Set up Pipeline
val stages = new mutable.Arra
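One workaround sketch (the DataFrame predictions, produced by
model.transform(testData), is an assumption): since GBTClassifier in Spark 2.0
emits no rawPrediction column, compute AUC and PRC from (score, label) pairs
with the RDD-based BinaryClassificationMetrics. Ideally the score is
continuous; with hard 0/1 predictions the curves are very coarse.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val scoreAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1))) // (score, label) pairs

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"AUC = ${metrics.areaUnderROC()}")
println(s"PRC = ${metrics.areaUnderPR()}")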
Hi,
Can anyone tell me what is causing this error?
Spark 2.0.0
Python 2.7.5
df = sqlContext.createDataFrame(foo, schema)
https://gist.github.com/mooperd/368e3453c29694c8b2c038d6b7b4413a
Traceback (most recent call last):
File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py",
li
I get a slightly different error when not specifying a schema:
Traceback (most recent call last):
File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py",
line 61, in
df = sqlContext.createDataFrame(foo)
File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/co
Hi
Pickle errors normally point to a serialisation issue. I suspect something is
wrong with your S3 data, but that is just a wild guess...
Is your S3 object publicly available?
A few suggestions to nail down the problem:
1 - try to see if you can read your object from S3 using the boto3 library
'offline',
On 27 Nov 2016, at 02:55, kant kodali <kanth...@gmail.com> wrote:
I would say that instead of LD_LIBRARY_PATH you might want to use
java.library.path, in the following way:
java -Djava.library.path=/path/to/my/library
or pass java.library.path along with spark-submit.
This is only going to s
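For reference, a sketch of passing it with spark-submit (the library path and
app jar are placeholders; the native library must be present on every node):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Djava.library.path=/path/to/my/library" \
  --conf "spark.executor.extraJavaOptions=-Djava.library.path=/path/to/my/library" \
  <your-app.jar>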
Hi Takeshi,
Thank you for your comment. I changed it to RDD and it's a lot better.
Zhuo
On Fri, Nov 25, 2016 at 7:04 PM, Takeshi Yamamuro wrote:
> Hi,
>
> I think this is just the overhead of representing nested elements as internal
> rows at runtime
> (e.g., it consumes null bits for each nested
I've been toying around with Spark SQL lately, trying to move some workloads
over from Hive. In the Hive world, the partitions below are recovered by an
ALTER TABLE RECOVER PARTITIONS:
Path:
s3://bucket-company/path/2016/03/11
s3://bucket-company/path/2016/03/12
s3://bucket-company/path/2016/03/13
Hi team,
I am using Apache Spark version 1.6.1, and I am writing Spark SQL queries in it.
I have found two ways of writing SQL queries: one is the plain SQL syntax, and
the other is the Spark DataFrame functions.
I need to execute if conditions using DataFrame functions. Please explain how
I can do th
Use the when() and otherwise() functions. For example:

import org.apache.spark.sql.functions._

val rows = Seq(("bob", 1), ("lucy", 2), ("pat", 3)).toDF("name", "genderCode")
rows.show

+----+----------+
|name|genderCode|
+----+----------+
| bob|         1|
|lucy|         2|
| pat|         3|
+----+----------+
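A hedged sketch of the likely continuation, mapping the numeric code to a
label with when()/otherwise() (the code-to-label mapping is an assumption):

val withGender = rows.withColumn("gender",
  when(col("genderCode") === 1, "male")      // assumed mapping
    .when(col("genderCode") === 2, "female") // assumed mapping
    .otherwise("unknown"))                   // fallback for unmapped codes
withGender.show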
Hi Everyone,
Does anyone know the best practice for writing Parquet files from Spark?
When my Spark app writes data to Parquet, the target directory contains heaps
of very small Parquet files (such as
e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each Parquet file is only
15KB
Generally, yes - you should try to have larger file sizes due to the
overhead of opening up files. Typical guidance is between 64MB-1GB;
personally I usually stick with 128MB-512MB with the default snappy
codec compression with Parquet. A good reference is Vida Ha's
presentation Data Storage T
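As an illustration of that guidance (a sketch with an assumed DataFrame df and
a placeholder output path, not code from this thread), repartitioning before
the write is the usual way to get fewer, larger files:

// Each resulting partition becomes roughly one Parquet file; pick a count
// so that files land in the 128MB-512MB range mentioned above.
df.repartition(16)
  .write
  .option("compression", "snappy") // snappy is Spark's default Parquet codec
  .parquet("s3://bucket/path/out/")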
I tried this, but it is throwing an error that the method "when" is not
applicable.
I am doing this in Java instead of Scala.
Note: I am using Spark version 1.6.1.
-----Original Message-----
From: Stuart White [mailto:stuart.whi...@gmail.com]
Sent: Monday, November 28, 2016 10:26 AM
To: Hitesh
Prasanna,
AFAIK Spark does not handle folders without partition column names in them,
and there is no way to get Spark to discover them automatically.
I think the reason for this is that Parquet file hierarchies carried this info
and historically Spark deals more with those.
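One workaround sketch (the table name company_data is hypothetical, and it
assumes a Hive-backed table with year/month/day partition columns): register
each such folder explicitly, declaring its partition values and location.

spark.sql("""
  ALTER TABLE company_data ADD IF NOT EXISTS
  PARTITION (year=2016, month=3, day=11)
  LOCATION 's3://bucket-company/path/2016/03/11'
""")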
On Mon, Nov 28, 2016 at 9:48 AM, Prasanna Santhan
Hi,
Component: SparkR
Level: Beginner
Scenario: Does SparkR support nonlinear optimization with nonlinear
constraints?
Our business application supports two types of functions, convex and S-shaped
curves, and linear & nonlinear constraints. These constraints can be combined
with any one type