Can someone tell me how I can write unit tests for PySpark?
(a book, a tutorial, ...)
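A minimal sketch of one way to unit-test PySpark code with the standard unittest module, assuming a local pyspark installation (the spark-testing-base package is also worth a look); class and function names here are just examples:

import unittest
from pyspark import SparkConf, SparkContext


class WordCountTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # one local SparkContext shared by all tests in this class
        conf = SparkConf().setMaster("local[2]").setAppName("unit-tests")
        cls.sc = SparkContext(conf=conf)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_word_count(self):
        rdd = self.sc.parallelize(["a b", "a c"])
        counts = dict(rdd.flatMap(lambda line: line.split())
                         .map(lambda w: (w, 1))
                         .reduceByKey(lambda a, b: a + b)
                         .collect())
        self.assertEqual(counts["a"], 2)


if __name__ == "__main__":
    unittest.main()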
Hi,
how can we create a new SparkContext from an IPython or Jupyter session?
I mean, if I use the current SparkContext and I run sc.stop(),
how can I launch a new one from IPython without restarting the IPython session
by refreshing the browser?
This is because I code some functions and then figure out I forgot something inside f
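A minimal sketch, assuming a plain pyspark setup: once sc.stop() has run, a fresh SparkContext can be built in the same IPython/Jupyter session without restarting the kernel:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc.stop()                                     # stop the current context

conf = SparkConf().setAppName("new-session")  # placeholder app name
sc = SparkContext(conf=conf)                  # new context, same IPython session
sqlContext = SQLContext(sc)                   # rebuild the SQL context on top of it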
Hi,
how can I add a jar to an IPython notebook?
I tried PYSPARK_SUBMIT_ARGS without success.
Thanks
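A hedged sketch of the usual way to pass extra jars through PYSPARK_SUBMIT_ARGS from a plain IPython process: the variable must be set before the SparkContext is created and, in this setup, has to end with "pyspark-shell". The jar paths are placeholders:

import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/spark-csv_2.10-1.4.0.jar,/path/to/commons-csv-1.1.jar pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext(appName="with-extra-jars")   # now starts with the extra jars on the classpath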
hi,
can someone show me an example of a broadcast join in version
1.5.0 with DataFrames in PySpark?
Thanks
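A hedged sketch for 1.5.0: the explicit broadcast() hint is only in the Python API from 1.6 on, so the usual approach in 1.5 is to keep the small side below spark.sql.autoBroadcastJoinThreshold and let the optimizer pick a broadcast join. Table and column names are placeholders:

# make sure the small table fits under the broadcast threshold (here 50 MB)
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

joined = big_df.join(small_df, big_df.key == small_df.key, "inner")
joined.explain()   # the physical plan should show a BroadcastHashJoin

# from Spark 1.6 onwards there is also an explicit hint:
# from pyspark.sql.functions import broadcast
# joined = big_df.join(broadcast(small_df), "key")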
Hi,
how can we deal with a StackOverflowError triggered by a long lineage?
I mean, I have this error; how can I resolve it without creating a new session?
Thanks
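A minimal sketch of the usual fix: truncate the lineage with checkpointing (or by writing an intermediate result out and reading it back) instead of restarting the session. The path and the loop are placeholders:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # placeholder directory

rdd = sc.parallelize(range(1000))
for i in range(200):
    rdd = rdd.map(lambda x: x + 1)
    if i % 50 == 0:
        rdd.checkpoint()   # cut the lineage here
        rdd.count()        # force materialisation so the checkpoint is actually written

# for DataFrames, saving an intermediate result to parquet and re-reading it
# has a similar lineage-resetting effect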
hi,
I create new columns with a UDF and afterwards I try to filter on those columns:
I get this error, why?
: java.lang.UnsupportedOperationException: Cannot evaluate expression:
fun_nm(input[0, string, true])
at
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:221)
at
hi,
how can I export a whole PySpark project as a zip from a local session to a
cluster and deploy it with spark-submit? I mean, I have a large project with
all its dependencies and I want to create a zip containing all of the dependencies and
deploy it on the cluster
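A hedged sketch of one common packaging route, with placeholder paths: zip the package directory and ship it with --py-files so the workers can import it:

import shutil

# creates my_project.zip from the my_project/ package directory
shutil.make_archive("my_project", "zip", root_dir=".", base_dir="my_project")

# then submit from the edge node (shell command shown as a comment):
#   spark-submit --master yarn --deploy-mode cluster \
#       --py-files my_project.zip main.py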
8)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857)
at
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Process finished with exit code 1
2016-08-05 15:35 GMT+02:00 pseudo oduesp :
> Hi,
>
> I configured PyCharm as describ
Hi,
I configured PyCharm as described on Stack Overflow, with SPARK_HOME and
HADOOP_CONF_DIR set, and downloaded winutils to use it with the prebuilt version of
Spark 2.0 (PySpark 2.0),
and I get this error; if you can help me find a solution, thanks:
C:\Users\AppData\Local\Continuum\Anaconda2\python.ex
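A hedged sketch of the environment a PyCharm run configuration typically needs on Windows; all paths are placeholders and the py4j zip name depends on the Spark build:

import os
import sys

os.environ["SPARK_HOME"] = r"C:\spark-2.0.0-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = r"C:\hadoop"        # folder containing bin\winutils.exe
os.environ["PYSPARK_PYTHON"] = sys.executable

sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python", "lib", "py4j-0.10.1-src.zip"))

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("pycharm-test").getOrCreate()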
hi,
with PySpark 2.0 I get these errors:
WindowsError: [Error 2] The system cannot find the file specified
Can someone help me find a solution?
Thanks
da2\lib\subprocess.py", line
711, in __init__
errread, errwrite)
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\subprocess.py", line
959, in _execute_child
startupinfo)
WindowsError: [Error 2] Le fichier spécifié est introuvable (the specified file cannot be found)
Process finished with exit code 1
2016-
Hi,
what is a good configuration for PySpark and PyCharm on Windows?
Thanks
Hi,
in Spark 1.5.0 I used the describe function with more than 100 columns.
Can someone tell me if any limit exists now?
Thanks
Can someone help me please?
2016-08-01 11:51 GMT+02:00 pseudo oduesp :
> hi
> I get the following errors when I try using PySpark 2.0 with IPython on
> YARN.
> Can someone help me please?
> java.lang.IllegalArgumentException: java.net.UnknownHostException:
> s001.big
hi,
I get the following errors when I try using PySpark 2.0 with IPython on
YARN.
Can someone help me please?
java.lang.IllegalArgumentException: java.net.UnknownHostException:
s001.bigdata.;s003.bigdata;s008bigdata.
at
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil
Hi,
in Hive we have an awesome function for estimating the execution time
before launching a query.
In Spark, can we find any function to estimate the execution time of a Spark DAG
(lineage) before it runs?
Thanks
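As far as I know there is no built-in runtime estimator in Spark comparable to Hive's; what can be inspected before running the job is the logical and physical plan. A minimal sketch with placeholder names:

df = sqlContext.table("my_table")       # placeholder table
result = df.groupBy("key").count()

result.explain(True)   # prints the logical and physical plans without executing the job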
Hi,
with StandardScaler we get a sparse vector; how can I transform it into a list or
a dense vector without losing the sparse values?
Thanks
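A minimal sketch, with placeholder column names: SparseVector keeps the zeros implicitly, so toArray() recovers every position, including the zero entries:

from pyspark.mllib.linalg import DenseVector

dense_rdd = scaled_df.select("scaled_features").rdd.map(
    lambda row: DenseVector(row[0].toArray())     # all positions, zeros included
)
as_lists = dense_rdd.map(lambda v: list(v.toArray()))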
e for each value in the feature vector the name of the variable.
How can I identify the names of the principal components in the second vector?
2016-07-26 10:39 GMT+02:00 pseudo oduesp :
> Hi,
> when I perform PCA dimensionality reduction I get a dense vector with length of
> the number of principal componen
Hi,
when I perform PCA dimensionality reduction I get a dense vector whose length is the
number of principal components. My questions:
- How do I get the names of the features behind these vectors?
- Are the values inside the resulting vectors the projections of all the
features onto these components?
- How do I use it?
th
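A hedged sketch, assuming a PySpark version whose PCAModel exposes the loadings matrix as pc (2.0+); in that case the rows of pc line up with the order of the columns given to VectorAssembler, and each entry is the weight of that original variable in the corresponding component. Names are placeholders:

from pyspark.ml.feature import PCA, VectorAssembler

feature_cols = ["f1", "f2", "f3", "f4"]                       # placeholder features
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df)

model = PCA(k=2, inputCol="features", outputCol="pca_features").fit(assembled)

loadings = model.pc.toArray()    # shape: (number of features, k)
for name, row in zip(feature_cols, loadings):
    print(name, row)             # how much each original variable loads on each component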
PYSPARK_SUBMIT_ARGS = --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar
without success.
Thanks
2016-07-25 13:27 GMT+02:00 pseudo oduesp :
> Hi,
> can someone tell me how I can add jars to IPython? I tried spark
>
>
>
Hi,
can someone tell me how I can add jars to IPython? I tried spark
Hi,
I know Spark is an engine for computing on large data sets, but as for me, I work with
PySpark and it is a wonderful machine.
My question: we don't have tools for plotting data; each time we have to
switch and go back to Python to use plots.
But when you have a large result, a scatter plot or a ROC curve yo
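A minimal sketch of the usual workaround: aggregate or sample in Spark, pull the small result to the driver with toPandas(), and plot with matplotlib. Column names are placeholders:

import matplotlib.pyplot as plt

pdf = df.sample(False, 0.01).select("x", "y").toPandas()   # keep the collected part small

plt.scatter(pdf["x"], pdf["y"], s=2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()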
hi,
we have parameters named
labelCol="label",
featuresCol="features",
when I specify the values here (label and features) and I train my model on a
DataFrame with other columns, does the algorithm use only the label column and
the features column?
thanks
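A minimal sketch: the ml estimators only read the columns named by labelCol and featuresCol, and any other columns are simply carried through to the output. Names are placeholders:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(train_df)                 # train_df may contain many other columns
predictions = model.transform(test_df)   # the extra columns are preserved in the result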
Hi,
how can we calculate the lift coefficient from PySpark prediction results?
Thanks
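A hedged sketch of one decile-based lift computation, assuming a predictions DataFrame with a binary label column and a probability vector column (both names are assumptions): sort by score, cut into deciles, and compare each decile's positive rate with the overall rate:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

get_p1 = F.udf(lambda v: float(v[1]), DoubleType())          # probability of the positive class
scored = predictions.select(F.col("label").cast("double").alias("label"),
                            get_p1("probability").alias("score"))

overall_rate = scored.agg(F.avg("label")).first()[0]

# a global ntile funnels the data through one partition; fine for a sketch
deciles = scored.withColumn("decile", F.ntile(10).over(Window.orderBy(F.desc("score"))))

lift = (deciles.groupBy("decile")
               .agg(F.avg("label").alias("decile_rate"))
               .withColumn("lift", F.col("decile_rate") / overall_rate)
               .orderBy("decile"))
lift.show()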
Hi,
I don't really understand why we have two libraries, ML and MLlib.
ML you can use with DataFrames and MLlib with RDDs, but ML has some gaps,
like:
saving a model, which is most important if you want to create a web API for scoring.
My question: why don't we have all the MLlib features in ML?
(I use PySpark 1.5.0 bec
Hi,
how can I save a model under PySpark 1.5.0?
I use RandomForestClassifier().
Thanks in advance.
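A hedged sketch: in 1.5.0 the DataFrame-based RandomForestClassifier model has no Python save(), so one common workaround is to train the mllib equivalent, whose model does support save()/load(). Paths and parameters are placeholders:

from pyspark.mllib.tree import RandomForest, RandomForestModel

model = RandomForest.trainClassifier(training_rdd,            # RDD of LabeledPoint
                                     numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=50)

model.save(sc, "hdfs:///models/rf_model")
loaded = RandomForestModel.load(sc, "hdfs:///models/rf_model")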
Hi,
I use PySpark 1.5.0.
Can I ask how I can get feature importances for a random forest
algorithm in PySpark? Please give me an example.
Thanks in advance.
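A hedged sketch: the ml RandomForestClassificationModel exposes featureImportances in newer PySpark versions (it may not be reachable from Python in 1.5.0). Names are placeholders:

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
model = rf.fit(train_df)

importances = model.featureImportances        # a vector with one weight per feature
for idx, value in enumerate(importances.toArray()):
    print(idx, value)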
Hi,
how can I use this option in Random Forest?
When I transform my vector (100 features), I have 20 categorical features
included.
If I understand categoricalFeaturesInfo, I should pass the positions of my 20
categorical features inside the vector containing 100 with a map {
position of feature insid
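A minimal sketch of how categoricalFeaturesInfo is built: a dict mapping the position of each categorical feature inside the vector to its number of categories; positions not listed are treated as continuous, and the categorical values themselves must already be encoded as 0..k-1. Indices and arities below are placeholders:

from pyspark.mllib.tree import RandomForest

categorical_info = {3: 4, 7: 10, 12: 2}    # e.g. feature 3 has 4 categories, feature 7 has 10, ...

model = RandomForest.trainClassifier(training_rdd,
                                     numClasses=2,
                                     categoricalFeaturesInfo=categorical_info,
                                     numTrees=50)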
Hi,
how can I remove rows from a DataFrame that satisfy some condition on some
columns?
Thanks
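A minimal sketch: rows are "removed" by keeping the complement with filter(); the condition and column names are placeholders:

from pyspark.sql import functions as F

cleaned = df.filter(~((F.col("age") < 18) & (F.col("country") == "FR")))
# df.where(...) with the negated condition is equivalent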
Hi,
how can I alter a table by adding new columns to it in HiveContext?
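A minimal sketch, with placeholder table and column names, of issuing the HiveQL statement through the HiveContext:

sqlContext.sql("ALTER TABLE my_db.my_table ADD COLUMNS (new_col1 STRING, new_col2 INT)")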
Hi, how can I add multiple columns to a DataFrame?
withColumn allows adding one column, but when I have multiple, do I have to
loop over each column?
Thanks
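A minimal sketch of the two usual options, with placeholder names and expressions: loop over withColumn, or build one select with the new expressions appended:

from pyspark.sql import functions as F

new_cols = {"total": F.col("a") + F.col("b"),
            "ratio": F.col("a") / F.col("b")}

# option 1: loop (each call returns a new DataFrame)
out = df
for name, expr in new_cols.items():
    out = out.withColumn(name, expr)

# option 2: a single projection
out = df.select("*", *[expr.alias(name) for name, expr in new_cols.items()])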
,f_index))
That way I keep the order of the variables; in this order I have all f_index from
517 to 824,
but when I create the LabeledPoint I lose this order and I lose the int type.
2016-06-24 9:40 GMT+02:00 pseudo oduesp :
> Hi,
> how can I keep the type of my variable, like int,
> because I get this err
Hi,
how can I keep the type of my variable, like int?
Because I get this error when I call the random forest algorithm with:
model = RandomForest.trainClassifier(rdf,
numClasses=2,
categoricalFeaturesInfo=d,
Hi,
I am a PySpark user and I want to test the RandomForest algorithms.
I found this parameter, categoricalFeaturesInfo; how can I build it from a list
of categorical variables?
Thanks.
hi,
I am a PySpark user and I want to extract variable importances from a random forest
model for a plot.
How can I do that?
Thanks
Hi,
I am a PySpark user and I want to test RandomForest.
I have a DataFrame with 100 columns.
Should I give an RDD or a DataFrame to the algorithm? I transformed my DataFrame into
only two columns,
a label and a features column:
df.label df.features
0(517,(0,1,2,333,56 ...
1 (517,(0,11,0,3
hi,
help me please to resolve this issue:
) failed: Exception: It
appears that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on the
driver, not in code that it run on workers. For more information, see
SPARK-5063.>
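A minimal sketch of the pattern behind SPARK-5063: the SparkContext (and anything built directly from it) can only be used on the driver, never inside a function shipped to executors; ship plain values or broadcast variables instead. Names are placeholders:

# WRONG: referencing sc inside a transformation raises the SPARK-5063 error
# rdd.map(lambda x: sc.parallelize([x]).count())

# RIGHT: keep driver-side objects out of the closure
lookup = sc.broadcast({"a": 1, "b": 2})
result = rdd.map(lambda x: lookup.value.get(x, 0)).collect()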
Hi,
with fillna we can select some columns on which to replace some values,
choosing the columns with a dict
{column: value},
but how can I do the same with cast? I have a DataFrame with 300 columns and I
want to cast just 4 from a list of columns, with a select query like this:
df.select(columns1.cast(i
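A minimal sketch of casting only a few of many columns by rebuilding the select list and leaving the rest untouched; column names and types are placeholders:

to_cast = {"col1": "double", "col2": "int", "col3": "date", "col4": "string"}

df2 = df.select([df[c].cast(to_cast[c]).alias(c) if c in to_cast else df[c]
                 for c in df.columns])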
hi,
really, I am getting angry about Parquet files; each time I get an error like
Could not read footer: java.lang.RuntimeException:
or an error occurring in o127.load.
Why do we have so many issues with this format?
Thanks
Hi ,
I have no idea why I get this error:
Py4JJavaError: An error occurred while calling o69143.parquet.
: org.apache.spark.SparkException: Job aborted.
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
hi,
how can I get a score for each row from classification algorithms, and how can I
plot the feature importances of the variables, like scikit-learn?
Thanks.
Hi,
in R we have functions named cbind and rbind for data frames.
How can I reproduce these functions in PySpark?
df1.col1 df1.col2 df1.col3
df2.col1 df2.col2 df2.col3
Final result:
a new data frame
df1.col1 df1.col2 df1.col3 df2.col1 df2.col2 df2.col3
Thanks
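A hedged sketch: rbind maps to unionAll (same schema on both sides); there is no direct cbind, and the usual workaround is to add a synthetic row id to both frames and join on it, which assumes both have the same number of rows. Names are placeholders:

# rbind
stacked = df1.unionAll(df2)

# cbind-like: join on a generated row index
add_index = lambda t: tuple(t[0]) + (t[1],)
df1_i = df1.rdd.zipWithIndex().map(add_index).toDF(df1.columns + ["rid"])
df2_i = df2.rdd.zipWithIndex().map(add_index).toDF(df2.columns + ["rid"])
side_by_side = df1_i.join(df2_i, "rid").drop("rid")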
Hi,
how can I update a DataFrame inside a function?
Why?
I have to apply StringIndexer multiple times, because I tried a Pipeline but
it is still extremely slow:
for 84 columns to be string-indexed, each one with 10 modalities, and a DataFrame
with 21 million rows,
I need 15 hours of processing.
Now I want to try o
Hi,
I want to apply string indexers on multiple columns, but when I use
StringIndexer and a Pipeline it takes a long time.
Indexer = StringIndexer(inputCol="Feature1", outputCol="indexed1")
This is fine for one, two, or ten lines, but when you have more
than 1000 lines, how can you do it?
Thanks
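A minimal sketch of generating the stages with a list comprehension instead of writing each line by hand; the column list is a placeholder:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

cols_to_index = ["Feature1", "Feature2", "Feature3"]     # could be df.columns or a subset

indexers = [StringIndexer(inputCol=c, outputCol=c + "_indexed") for c in cols_to_index]
indexed_df = Pipeline(stages=indexers).fit(df).transform(df)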
hi,
what is the difference between a DataFrame and a DataFrameWriter?
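A minimal sketch of the distinction: a DataFrame holds the data (or rather the plan to compute it), while df.write returns a DataFrameWriter, the helper object used to configure and run the save:

df = sqlContext.read.parquet("/input/path")            # DataFrameReader -> DataFrame
df.write.mode("overwrite").parquet("/output/path")     # DataFrame -> DataFrameWriter -> save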
hi,
I use PySpark 1.5.0 on a YARN cluster with 19 nodes, 200 GB of memory
and 4 cores each (including the driver).
2016-06-16 15:42 GMT+02:00 pseudo oduesp :
> Hi,
> how can I dummy-encode a large set of columns with StringIndexer quickly?
> Because I tested it with 89 variables, each one with at most 10 distinc
Hi,
how can I dummy-encode a large set of columns with StringIndexer quickly?
Because I tested it with 89 variables, each one with at most 10 distinct values,
and it takes
a lot of time.
Thanks
hi,
if I cache a DataFrame and then transform it and add columns, should I cache it
a second time?
df.cache()
transformation
add new columns
df.cache()
?
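A minimal sketch of the point in question: withColumn returns a new DataFrame, so caching the original does not cover the derived one; cache whichever DataFrame is actually reused. Names are placeholders:

df.cache()
df.count()                                   # materialise the cache if df itself is reused

df2 = df.withColumn("new_col", df["x"] * 2)  # a different DataFrame
df2.cache()                                  # needs its own cache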
hi,
what is the limit on the number of modalities in StringIndexer?
If I have a column with 1000 modalities, is it good to use StringIndexer,
or should I try another function, and if so which one, please?
Thanks
Hi,
I have a DataFrame with 1000 columns to dummy-encode with StringIndexer.
When I apply the pipeline it takes a long time whenever I want to merge the result with the other
data frame,
I mean:
the original data frame + the columns indexed by StringIndexer.
The problem: the save stage is long, why?
Code:
indexers = [StringIndexer(inp
hi,
I want to ask a question about dense or sparse vectors:
imagine I have a DataFrame with several columns and one of them contains vectors.
My question: can I give this column to machine learning algorithms as
one value?
df.col1 | df.col2 |
1 | (1,[2],[3] ,[] ...[6])
2 | (1,[5],[3] ,[]
hello,
why do I get this error
when using:
assembleur = VectorAssembler( inputCols=l_CDMVT,
outputCol="aev"+"CODEM")
output = assembler.transform(df_aev)
L_CDMTV is the list of columns.
Thanks
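A hedged sketch of a working call, for comparison: note that in the snippet above the variable names do not all match (assembleur vs assembler, l_CDMVT vs L_CDMTV), which by itself would cause a NameError. Column names below are placeholders:

from pyspark.ml.feature import VectorAssembler

l_CDMVT = ["col_a", "col_b", "col_c"]                          # numeric input columns
assembler = VectorAssembler(inputCols=l_CDMVT, outputCol="aevCODEM")
output = assembler.transform(df_aev)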
Hi,
since Spark 1.3 we have DataFrames (thank goodness), instead of just RDDs.
For machine learning algorithms, should we give them an RDD or a DataFrame?
I mean, when I build a model:
Model = algorithme(rdd)
or
Model = algorithme(df)
If you have an example with DataFrames, I prefer to work with
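A minimal sketch of the DataFrame-based route: pyspark.mllib estimators take RDDs of LabeledPoint, while pyspark.ml estimators take a DataFrame with label and features columns. Names are placeholders:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=20)
model = lr.fit(train_df)                 # DataFrame in
predictions = model.transform(test_df)   # DataFrame out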
Hi,
how can we compare multiple columns in a DataFrame? I mean,
if df is a DataFrame like this:
df.col1 | df.col2 | df.col3
0.2 0.3 0.4
how can we compare the columns to get the max of each row (not of each column) and get the name of
the column where the max it
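A minimal sketch: greatest() gives the row-wise max, and a chain of when() expressions recovers the name of the column holding it. Column names are placeholders:

from pyspark.sql import functions as F

cols = ["col1", "col2", "col3"]
with_max = df.withColumn("row_max", F.greatest(*[F.col(c) for c in cols]))

name_expr = F.when(F.col(cols[0]) == F.col("row_max"), cols[0])
for c in cols[1:]:
    name_expr = name_expr.when(F.col(c) == F.col("row_max"), c)

result = with_max.withColumn("max_col", name_expr)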
hi,
I want to ask if someone has used Oozie with Spark?
If so, can you give me an example:
how can we configure it on YARN?
Thanks
Hi,
why does np.unique return an array instead of a list in this function?
def unique_item_df(df, list_var):
    # collect() returns a list of Row objects; np.unique flattens them into a numpy array
    l = df.select(list_var).distinct().collect()
    return np.unique(l)
df is a DataFrame and list_var is a list of variables.
(PySpark) code
Thanks
.
Hi,
can I ask how we can convert a string like dd/mm/ to a date type in
HiveContext?
I tried with unix_timestamp and with a date format but I get null.
Thank you.
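A hedged sketch, assuming the strings are in dd/MM/yyyy form (the exact format in the question is cut off): unix_timestamp returns null whenever the pattern does not match the data, which is the symptom described. Names are placeholders:

from pyspark.sql import functions as F

df2 = df.withColumn(
    "dt",
    F.from_unixtime(F.unix_timestamp(F.col("date_str"), "dd/MM/yyyy")).cast("date")
)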
hi guys,
is it the same thing to do:
sqlcontext.sql("select * from t1 join t2 on condition") and
df1.join(df2, condition, 'inner') ??
PS: df1.registerTempTable('t1')
PS: df2.registerTempTable('t2')
Thanks
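A minimal sketch of the equivalence: registering the temp tables only makes the DataFrames visible to SQL; the two joins below express the same thing. Column names are placeholders:

df1.registerTempTable("t1")
df2.registerTempTable("t2")

via_sql = sqlContext.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id")
via_api = df1.join(df2, df1.id == df2.id, "inner")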
hi guys,
- I get these errors with PySpark 1.5.0 under Cloudera CDH 5.5 (YARN).
- I use YARN to deploy the job on the cluster.
- I use HiveContext and Parquet files to save my data.
The container limit is 16 GB.
The executor memory I tested before was 12 GB.
- I tried increasing the number of partitions.
Can someone help me with this issue?
py4j.protocol.Py4JJavaError: An error occurred while calling o126.parquet.
: org.apache.spark.SparkException: Job aborted.
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation
hi guys,
I get this error after 5 hours of processing; I do a lot of joins, 14 left
joins
with small tables.
I looked in the Spark UI and the console log and everything was OK, but when it saves the
last join I get this error:
Py4JJavaError: An error occurred while calling o115.parquet. _metadata is
not a Parquet f
hi, I spent two months of my time making 10 joins with the following tables:
1 GB table 1
3 GB table 2
500 MB table 3
400 MB table 4
20 MB table 5
100 MB table 6
30 MB table 7
40 MB table 8
700 MB table 9
800 MB table 10
I use hivecontext.sql("select * from table1 lef