Can someone tell me how I can write unit tests for PySpark?
(a book, a tutorial, ...)
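A minimal sketch of one way to unit-test PySpark code with the standard unittest module, assuming a local pyspark installation (the spark-testing-base package is also worth a look); class and function names here are just examples:

import unittest
from pyspark import SparkConf, SparkContext


class WordCountTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # one local SparkContext shared by all tests in this class
        conf = SparkConf().setMaster("local[2]").setAppName("unit-tests")
        cls.sc = SparkContext(conf=conf)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_word_count(self):
        rdd = self.sc.parallelize(["a b", "a c"])
        counts = dict(rdd.flatMap(lambda line: line.split())
                         .map(lambda w: (w, 1))
                         .reduceByKey(lambda a, b: a + b)
                         .collect())
        self.assertEqual(counts["a"], 2)


if __name__ == "__main__":
    unittest.main()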
Hi,
how can we create a new SparkContext from an IPython or Jupyter session?
I mean, if I use the current SparkContext and I run sc.stop(),
how can I launch a new one from IPython without restarting the IPython session
by refreshing the browser?
This is because I code some functions and then figure out I forgot something inside f
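A minimal sketch, assuming a plain pyspark setup: once sc.stop() has run, a fresh SparkContext can be built in the same IPython/Jupyter session without restarting the kernel:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc.stop()                                     # stop the current context

conf = SparkConf().setAppName("new-session")  # placeholder app name
sc = SparkContext(conf=conf)                  # new context, same IPython session
sqlContext = SQLContext(sc)                   # rebuild the SQL context on top of it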
Hi,
how can I add a jar to an IPython notebook?
I tried PYSPARK_SUBMIT_ARGS without success.
Thanks
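A hedged sketch of the usual way to pass extra jars through PYSPARK_SUBMIT_ARGS from a plain IPython process: the variable must be set before the SparkContext is created and, in this setup, has to end with "pyspark-shell". The jar paths are placeholders:

import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/spark-csv_2.10-1.4.0.jar,/path/to/commons-csv-1.1.jar pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext(appName="with-extra-jars")   # now starts with the extra jars on the classpath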
hi,
can someone show me an example of a broadcast join in version
1.5.0 with DataFrames in PySpark?
Thanks
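A hedged sketch for 1.5.0: the explicit broadcast() hint is only in the Python API from 1.6 on, so the usual approach in 1.5 is to keep the small side below spark.sql.autoBroadcastJoinThreshold and let the optimizer pick a broadcast join. Table and column names are placeholders:

# make sure the small table fits under the broadcast threshold (here 50 MB)
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

joined = big_df.join(small_df, big_df.key == small_df.key, "inner")
joined.explain()   # the physical plan should show a BroadcastHashJoin

# from Spark 1.6 onwards there is also an explicit hint:
# from pyspark.sql.functions import broadcast
# joined = big_df.join(broadcast(small_df), "key")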
Hi,
how can we deal with a StackOverflowError triggered by a long lineage?
I mean, I have this error; how can I resolve it without creating a new session?
Thanks
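A minimal sketch of the usual fix: truncate the lineage with checkpointing (or by writing an intermediate result out and reading it back) instead of restarting the session. The path and the loop are placeholders:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # placeholder directory

rdd = sc.parallelize(range(1000))
for i in range(200):
    rdd = rdd.map(lambda x: x + 1)
    if i % 50 == 0:
        rdd.checkpoint()   # cut the lineage here
        rdd.count()        # force materialisation so the checkpoint is actually written

# for DataFrames, saving an intermediate result to parquet and re-reading it
# has a similar lineage-resetting effect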
hi,
I create new columns with a UDF and afterwards I try to filter on those columns:
I get this error, why?
: java.lang.UnsupportedOperationException: Cannot evaluate expression:
fun_nm(input[0, string, true])
at
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:221)
at
hi,
how can I export a whole PySpark project as a zip from a local session to a
cluster and deploy it with spark-submit? I mean, I have a large project with
all its dependencies and I want to create a zip containing all of the dependencies and
deploy it on the cluster
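A hedged sketch of one common packaging route, with placeholder paths: zip the package directory and ship it with --py-files so the workers can import it:

import shutil

# creates my_project.zip from the my_project/ package directory
shutil.make_archive("my_project", "zip", root_dir=".", base_dir="my_project")

# then submit from the edge node (shell command shown as a comment):
#   spark-submit --master yarn --deploy-mode cluster \
#       --py-files my_project.zip main.py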
8)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Process finished with exit code 1
2016-08-05 15:35 GMT+02:00 pseudo oduesp :
> Hi,
>
> I configured PyCharm as describ
Hi,
I configured PyCharm as described on Stack Overflow, with SPARK_HOME and
HADOOP_CONF_DIR set, and downloaded winutils to use it with the prebuilt version of
Spark 2.0 (PySpark 2.0),
and I get this error; if you can help me find a solution, thanks:
C:\Users\AppData\Local\Continuum\Anaconda2\python.ex
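A hedged sketch of the environment a PyCharm run configuration typically needs on Windows; all paths are placeholders and the py4j zip name depends on the Spark build:

import os
import sys

os.environ["SPARK_HOME"] = r"C:\spark-2.0.0-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"C:\hadoop"        # folder containing bin\winutils.exe
os.environ["PYSPARK_PYTHON"] = sys.executable

sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python", "lib", "py4j-0.10.1-src.zip"))

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("pycharm-test").getOrCreate()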
hi,
with PySpark 2.0 I get these errors:
WindowsError: [Error 2] The system cannot find the file specified
Can someone help me find a solution?
Thanks
da2\lib\subprocess.py", line
711, in __init__
errread, errwrite)
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\subprocess.py", line
959, in _execute_child
startupinfo)
WindowsError: [Error 2] Le fichier spécifié est introuvable (the specified file cannot be found)
Process finished with exit code 1
2016-
Hi,
what is a good configuration for PySpark and PyCharm on Windows?
Thanks
Hi,
in Spark 1.5.0 I used the describe function with more than 100 columns.
Can someone tell me if any limit exists now?
Thanks
Can someone help me please?
2016-08-01 11:51 GMT+02:00 pseudo oduesp :
> hi
> I get the following errors when I try using PySpark 2.0 with IPython on
> YARN.
> Can someone help me please?
> java.lang.IllegalArgumentException: java.net.UnknownHostException:
> s001.big
hi,
I get the following errors when I try using PySpark 2.0 with IPython on
YARN.
Can someone help me please?
java.lang.IllegalArgumentException: java.net.UnknownHostException:
s001.bigdata.;s003.bigdata;s008bigdata.
at
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil
Hi,
in Hive we have an awesome function for estimating the execution time
before launching a query.
In Spark, can we find any function to estimate the execution time of a Spark DAG
(lineage) before it runs?
Thanks
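As far as I know there is no built-in runtime estimator in Spark comparable to Hive's; what can be inspected before running the job is the logical and physical plan. A minimal sketch with placeholder names:

df = sqlContext.table("my_table")       # placeholder table
result = df.groupBy("key").count()

result.explain(True)   # prints the logical and physical plans without executing the job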
Hi,
with StandardScaler we get a sparse vector; how can I transform it into a list or
a dense vector without losing the sparse values?
Thanks
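A minimal sketch, with placeholder column names: SparseVector keeps the zeros implicitly, so toArray() recovers every position, including the zero entries:

from pyspark.mllib.linalg import DenseVector

dense_rdd = scaled_df.select("scaled_features").rdd.map(
    lambda row: DenseVector(row[0].toArray())     # all positions, zeros included
)
as_lists = dense_rdd.map(lambda v: list(v.toArray()))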
e for each value in the feature vector the name of the variable.
How can I identify the names of the principal components in the second vector?
2016-07-26 10:39 GMT+02:00 pseudo oduesp :
> Hi,
> when I perform PCA dimensionality reduction I get a dense vector with length of
> the number of principal componen
Hi,
when I perform PCA dimensionality reduction I get a dense vector whose length is the
number of principal components. My questions:
- How do I get the names of the features behind these vectors?
- Are the values inside the resulting vectors the projections of all the
features onto these components?
- How do I use it?
th
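A hedged sketch, assuming a PySpark version whose PCAModel exposes the loadings matrix as pc (2.0+); in that case the rows of pc line up with the order of the columns given to VectorAssembler, and each entry is the weight of that original variable in the corresponding component. Names are placeholders:

from pyspark.ml.feature import PCA, VectorAssembler

feature_cols = ["f1", "f2", "f3", "f4"]                       # placeholder features
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df)

model = PCA(k=2, inputCol="features", outputCol="pca_features").fit(assembled)

loadings = model.pc.toArray()    # shape: (number of features, k)
for name, row in zip(feature_cols, loadings):
    print(name, row)             # how much each original variable loads on each component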
PYSPARK_SUBMIT_ARGS = --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar
without success.
Thanks
2016-07-25 13:27 GMT+02:00 pseudo oduesp :
> Hi,
> can someone tell me how I can add jars to IPython? I tried spark
>
>
>
Hi,
can someone tell me how I can add jars to IPython? I tried spark
Hi,
I know Spark is an engine for computing on large data sets, but as for me, I work with
PySpark and it is a wonderful machine.
My question: we don't have tools for plotting data; each time we have to
switch and go back to Python to use plots.
But when you have a large result, a scatter plot or a ROC curve yo
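A minimal sketch of the usual workaround: aggregate or sample in Spark, pull the small result to the driver with toPandas(), and plot with matplotlib. Column names are placeholders:

import matplotlib.pyplot as plt

pdf = df.sample(False, 0.01).select("x", "y").toPandas()   # keep the collected part small

plt.scatter(pdf["x"], pdf["y"], s=2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()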
hi,
we have parameters named
labelCol="label",
featuresCol="features",
when I specify the values here (label and features) and I train my model on a
DataFrame with other columns, does the algorithm use only the label column and
the features column?
thanks
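A minimal sketch: the ml estimators only read the columns named by labelCol and featuresCol, and any other columns are simply carried through to the output. Names are placeholders:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(train_df)                 # train_df may contain many other columns
predictions = model.transform(test_df)   # the extra columns are preserved in the result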
Hi,
how can we calculate the lift coefficient from PySpark prediction results?
Thanks
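A hedged sketch of one decile-based lift computation, assuming a predictions DataFrame with a binary label column and a probability vector column (both names are assumptions): sort by score, cut into deciles, and compare each decile's positive rate with the overall rate:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

get_p1 = F.udf(lambda v: float(v[1]), DoubleType())          # probability of the positive class
scored = predictions.select(F.col("label").cast("double").alias("label"),
                            get_p1("probability").alias("score"))

overall_rate = scored.agg(F.avg("label")).first()[0]

# a global ntile funnels the data through one partition; fine for a sketch
deciles = scored.withColumn("decile", F.ntile(10).over(Window.orderBy(F.desc("score"))))

lift = (deciles.groupBy("decile")
               .agg(F.avg("label").alias("decile_rate"))
               .withColumn("lift", F.col("decile_rate") / overall_rate)
               .orderBy("decile"))
lift.show()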
Hi,
I don't really understand why we have two libraries, ML and MLlib.
ML you can use with DataFrames and MLlib with RDDs, but ML has some gaps,
like:
saving a model, which is most important if you want to create a web API for scoring.
My question: why don't we have all the MLlib features in ML?
(I use PySpark 1.5.0 bec
Hi,
how can I save a model under PySpark 1.5.0?
I use RandomForestClassifier().
Thanks in advance.
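A hedged sketch: in 1.5.0 the DataFrame-based RandomForestClassifier model has no Python save(), so one common workaround is to train the mllib equivalent, whose model does support save()/load(). Paths and parameters are placeholders:

from pyspark.mllib.tree import RandomForest, RandomForestModel

model = RandomForest.trainClassifier(training_rdd,            # RDD of LabeledPoint
                                     numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=50)

model.save(sc, "hdfs:///models/rf_model")
loaded = RandomForestModel.load(sc, "hdfs:///models/rf_model")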
Hi,
I use PySpark 1.5.0.
Can I ask how I can get feature importances for a random forest
algorithm in PySpark? Please give me an example.
Thanks in advance.
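A hedged sketch: the ml RandomForestClassificationModel exposes featureImportances in newer PySpark versions (it may not be reachable from Python in 1.5.0). Names are placeholders:

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = rf.fit(train_df)

importances = model.featureImportances        # a vector with one weight per feature
for idx, value in enumerate(importances.toArray()):
    print(idx, value)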
Hi,
how can I use this option in Random Forest?
When I transform my vector (100 features), I have 20 categorical features
included.
If I understand categoricalFeaturesInfo, I should pass the positions of my 20
categorical features inside the vector containing 100 with a map {
position of feature insid
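A minimal sketch of how categoricalFeaturesInfo is built: a dict mapping the position of each categorical feature inside the vector to its number of categories; positions not listed are treated as continuous, and the categorical values themselves must already be encoded as 0..k-1. Indices and arities below are placeholders:

from pyspark.mllib.tree import RandomForest

categorical_info = {3: 4, 7: 10, 12: 2}    # e.g. feature 3 has 4 categories, feature 7 has 10, ...

model = RandomForest.trainClassifier(training_rdd,
                                     numClasses=2,
                                     categoricalFeaturesInfo=categorical_info,
                                     numTrees=50)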
Hi,
how can I remove rows from a DataFrame that satisfy some condition on some
columns?
Thanks
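A minimal sketch: rows are "removed" by keeping the complement with filter(); the condition and column names are placeholders:

from pyspark.sql import functions as F

cleaned = df.filter(~((F.col("age") < 18) & (F.col("country") == "FR")))
# df.where(...) with the negated condition is equivalent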
Hi,
how can I alter a table by adding new columns to it in HiveContext?
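A minimal sketch, with placeholder table and column names, of issuing the HiveQL statement through the HiveContext:

sqlContext.sql("ALTER TABLE my_db.my_table ADD COLUMNS (new_col1 STRING, new_col2 INT)")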
Hi, how can I add multiple columns to a DataFrame?
withColumn allows adding one column, but when I have multiple, do I have to
loop over each column?
Thanks
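A minimal sketch of the two usual options, with placeholder names and expressions: loop over withColumn, or build one select with the new expressions appended:

from pyspark.sql import functions as F

new_cols = {"total": F.col("a") + F.col("b"),
            "ratio": F.col("a") / F.col("b")}

# option 1: loop (each call returns a new DataFrame)
out = df
for name, expr in new_cols.items():
    out = out.withColumn(name, expr)

# option 2: a single projection
out = df.select("*", *[expr.alias(name) for name, expr in new_cols.items()])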
,f_index))
That way I keep the order of the variables; in this order I have all f_index from
517 to 824,
but when I create the LabeledPoint I lose this order and I lose the int type.
2016-06-24 9:40 GMT+02:00 pseudo oduesp :
> Hi,
> how can I keep the type of my variable, like int,
> because I get this err
Hi,
how can I keep the type of my variable, like int?
Because I get this error when I call the random forest algorithm with:
model = RandomForest.trainClassifier(rdf,
numClasses=2,
categoricalFeaturesInfo=d,
Hi,
I am a PySpark user and I want to test the RandomForest algorithms.
I found this parameter, categoricalFeaturesInfo; how can I build it from a list
of categorical variables?
Thanks.
hi,
I am a PySpark user and I want to extract variable importances from a random forest
model for a plot.
How can I do that?
Thanks
Hi,
I am a PySpark user and I want to test RandomForest.
I have a DataFrame with 100 columns.
Should I give an RDD or a DataFrame to the algorithm? I transformed my DataFrame into
only two columns,
a label and a features column:
df.label df.features
0(517,(0,1,2,333,56 ...
1 (517,(0,11,0,3
hi,
help me please to resolve this issue:
) failed: Exception: It
appears that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on the
driver, not in code that it run on workers. For more information, see
SPARK-5063.>
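A minimal sketch of the pattern behind SPARK-5063: the SparkContext (and anything built directly from it) can only be used on the driver, never inside a function shipped to executors; ship plain values or broadcast variables instead. Names are placeholders:

# WRONG: referencing sc inside a transformation raises the SPARK-5063 error
# rdd.map(lambda x: sc.parallelize([x]).count())

# RIGHT: keep driver-side objects out of the closure
lookup = sc.broadcast({"a": 1, "b": 2})
result = rdd.map(lambda x: lookup.value.get(x, 0)).collect()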
Hi,
with fillna we can select some columns on which to replace some values,
choosing the columns with a dict
{column: value},
but how can I do the same with cast? I have a DataFrame with 300 columns and I
want to cast just 4 from a list of columns, with a select query like this:
df.select(columns1.cast(i
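A minimal sketch of casting only a few of many columns by rebuilding the select list and leaving the rest untouched; column names and types are placeholders:

to_cast = {"col1": "double", "col2": "int", "col3": "date", "col4": "string"}

df2 = df.select([df[c].cast(to_cast[c]).alias(c) if c in to_cast else df[c]
                 for c in df.columns])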
hi,
really, I am getting angry about Parquet files; each time I get an error like
Could not read footer: java.lang.RuntimeException:
or an error occurring in o127.load.
Why do we have so many issues with this format?
Thanks
Hi ,
I have no idea why I get this error:
Py4JJavaError: An error occurred while calling o69143.parquet.
: org.apache.spark.SparkException: Job aborted.
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
hi,
how can I get a score for each row from classification algorithms, and how can I
plot the feature importances of the variables, like scikit-learn?
Thanks.
Hi,
in R we have functions named cbind and rbind for data frames.
How can I reproduce these functions in PySpark?
df1.col1 df1.col2 df1.col3
df2.col1 df2.col2 df2.col3
Final result:
a new data frame
df1.col1 df1.col2 df1.col3 df2.col1 df2.col2 df2.col3
Thanks
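A hedged sketch: rbind maps to unionAll (same schema on both sides); there is no direct cbind, and the usual workaround is to add a synthetic row id to both frames and join on it, which assumes both have the same number of rows. Names are placeholders:

# rbind
stacked = df1.unionAll(df2)

# cbind-like: join on a generated row index
add_index = lambda t: tuple(t[0]) + (t[1],)
df1_i = df1.rdd.zipWithIndex().map(add_index).toDF(df1.columns + ["rid"])
df2_i = df2.rdd.zipWithIndex().map(add_index).toDF(df2.columns + ["rid"])
side_by_side = df1_i.join(df2_i, "rid").drop("rid")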
Hi,
how can I update a DataFrame inside a function?
Why?
I have to apply StringIndexer multiple times, because I tried a Pipeline but
it is still extremely slow:
for 84 columns to be string-indexed, each one with 10 modalities, and a DataFrame
with 21 million rows,
I need 15 hours of processing.
Now I want to try o
Hi,
I want to apply string indexers on multiple columns, but when I use
StringIndexer and a Pipeline it takes a long time.
Indexer = StringIndexer(inputCol="Feature1", outputCol="indexed1")
This is fine for one, two, or ten lines, but when you have more
than 1000 lines, how can you do it?
Thanks
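A minimal sketch of generating the stages with a list comprehension instead of writing each line by hand; the column list is a placeholder:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

cols_to_index = ["Feature1", "Feature2", "Feature3"]     # could be df.columns or a subset

indexers = [StringIndexer(inputCol=c, outputCol=c + "_indexed") for c in cols_to_index]
indexed_df = Pipeline(stages=indexers).fit(df).transform(df)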
hi,
what is the difference between a DataFrame and a DataFrameWriter?
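A minimal sketch of the distinction: a DataFrame holds the data (or rather the plan to compute it), while df.write returns a DataFrameWriter, the helper object used to configure and run the save:

df = sqlContext.read.parquet("/input/path")            # DataFrameReader -> DataFrame
df.write.mode("overwrite").parquet("/output/path")     # DataFrame -> DataFrameWriter -> save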
hi,
I use PySpark 1.5.0 on a YARN cluster with 19 nodes, 200 GB of memory
and 4 cores each (including the driver).
2016-06-16 15:42 GMT+02:00 pseudo oduesp :
> Hi,
> how can I dummy-encode a large set of columns with StringIndexer quickly?
> Because I tested it with 89 variables, each one with at most 10 distinc
Hi,
how can I dummy-encode a large set of columns with StringIndexer quickly?
Because I tested it with 89 variables, each one with at most 10 distinct values,
and it takes
a lot of time.
Thanks
hi,
if I cache a DataFrame and then transform it and add columns, should I cache it
a second time?
df.cache()
transformation
add new columns
df.cache()
?
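A minimal sketch of the point in question: withColumn returns a new DataFrame, so caching the original does not cover the derived one; cache whichever DataFrame is actually reused. Names are placeholders:

df.cache()
df.count()                                   # materialise the cache if df itself is reused

df2 = df.withColumn("new_col", df["x"] * 2)  # a different DataFrame
df2.cache()                                  # needs its own cache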
hi,
what is the limit on the number of modalities in StringIndexer?
If I have a column with 1000 modalities, is it good to use StringIndexer,
or should I try another function, and if so which one, please?
Thanks
Hi,
I have a DataFrame with 1000 columns to dummy-encode with StringIndexer.
When I apply the pipeline it takes a long time whenever I want to merge the result with the other
data frame,
I mean:
the original data frame + the columns indexed by StringIndexer.
The problem: the save stage is long, why?
Code:
indexers = [StringIndexer(inp
hi,
I want to ask a question about dense or sparse vectors:
imagine I have a DataFrame with several columns and one of them contains vectors.
My question: can I give this column to machine learning algorithms as
one value?
df.col1 | df.col2 |
1 | (1,[2],[3] ,[] ...[6])
2 | (1,[5],[3] ,[]
hello,
why do I get this error
when using:
assembleur = VectorAssembler( inputCols=l_CDMVT,
outputCol="aev"+"CODEM")
output = assembler.transform(df_aev)
L_CDMTV is the list of columns.
Thanks
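A hedged sketch of a working call, for comparison: note that in the snippet above the variable names do not all match (assembleur vs assembler, l_CDMVT vs L_CDMTV), which by itself would cause a NameError. Column names below are placeholders:

from pyspark.ml.feature import VectorAssembler

l_CDMVT = ["col_a", "col_b", "col_c"]                          # numeric input columns
assembler = VectorAssembler(inputCols=l_CDMVT, outputCol="aevCODEM")
output = assembler.transform(df_aev)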
Hi,
since Spark 1.3 we have DataFrames (thank goodness), instead of just RDDs.
For machine learning algorithms, should we give them an RDD or a DataFrame?
I mean, when I build a model:
Model = algorithme(rdd)
or
Model = algorithme(df)
If you have an example with DataFrames, I prefer to work with
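A minimal sketch of the DataFrame-based route: pyspark.mllib estimators take RDDs of LabeledPoint, while pyspark.ml estimators take a DataFrame with label and features columns. Names are placeholders:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=20)
model = lr.fit(train_df)                 # DataFrame in
predictions = model.transform(test_df)   # DataFrame out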
Hi,
how can we compare multiple columns in a DataFrame? I mean,
if df is a DataFrame like this:
df.col1 | df.col2 | df.col3
0.2 0.3 0.4
how can we compare the columns to get the max of each row (not of each column) and get the name of
the column where the max it
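A minimal sketch: greatest() gives the row-wise max, and a chain of when() expressions recovers the name of the column holding it. Column names are placeholders:

from pyspark.sql import functions as F

cols = ["col1", "col2", "col3"]
with_max = df.withColumn("row_max", F.greatest(*[F.col(c) for c in cols]))

name_expr = F.when(F.col(cols[0]) == F.col("row_max"), cols[0])
for c in cols[1:]:
    name_expr = name_expr.when(F.col(c) == F.col("row_max"), c)

result = with_max.withColumn("max_col", name_expr)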
hi,
I want to ask if someone has used Oozie with Spark?
If so, can you give me an example:
how can we configure it on YARN?
Thanks
Hi,
why does np.unique return an array instead of a list in this function?
def unique_item_df(df, list_var):
    # collect() returns a list of Row objects; np.unique flattens them into a numpy array
    l = df.select(list_var).distinct().collect()
    return np.unique(l)
df is a DataFrame and list_var is a list of variables.
(PySpark) code
Thanks
.
Hi,
can I ask how we can convert a string like dd/mm/ to a date type in
HiveContext?
I tried with unix_timestamp and with a date format but I get null.
Thank you.
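A hedged sketch, assuming the strings are in dd/MM/yyyy form (the exact format in the question is cut off): unix_timestamp returns null whenever the pattern does not match the data, which is the symptom described. Names are placeholders:

from pyspark.sql import functions as F

df2 = df.withColumn(
    "dt",
    F.from_unixtime(F.unix_timestamp(F.col("date_str"), "dd/MM/yyyy")).cast("date")
)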
hi guys,
is it the same thing to do:
sqlcontext.sql("select * from t1 join t2 on condition") and
df1.join(df2, condition, 'inner') ??
PS: df1.registerTempTable('t1')
PS: df2.registerTempTable('t2')
Thanks
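A minimal sketch of the equivalence: registering the temp tables only makes the DataFrames visible to SQL; the two joins below express the same thing. Column names are placeholders:

df1.registerTempTable("t1")
df2.registerTempTable("t2")

via_sql = sqlContext.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id")
via_api = df1.join(df2, df1.id == df2.id, "inner")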
hi guys,
- I get these errors with PySpark 1.5.0 under Cloudera CDH 5.5 (YARN).
- I use YARN to deploy the job on the cluster.
- I use HiveContext and Parquet files to save my data.
The container limit is 16 GB.
The executor memory I tested before was 12 GB.
- I tried increasing the number of partitions.
Can someone help me with this issue?
py4j.protocol.Py4JJavaError: An error occurred while calling o126.parquet.
: org.apache.spark.SparkException: Job aborted.
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation
hi guys,
I get this error after 5 hours of processing; I do a lot of joins, 14 left
joins
with small tables.
I looked in the Spark UI and the console log and everything was OK, but when it saves the
last join I get this error:
Py4JJavaError: An error occurred while calling o115.parquet. _metadata is
not a Parquet f
hi, I spent two months of my time making 10 joins with the following tables:
1 GB table 1
3 GB table 2
500 MB table 3
400 MB table 4
20 MB table 5
100 MB table 6
30 MB table 7
40 MB table 8
700 MB table 9
800 MB table 10
I use hivecontext.sql("select * from table1 lef