Re: java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-18 Thread trsell
Hi, The stack trace suggests you're doing a join as well, and it's Python. I wonder if you're seeing this? https://issues.apache.org/jira/browse/SPARK-17100 Are you using Spark 2.0.0? Tim On Tue, 16 Aug 2016 at 16:58 Sumit Khanna wrote: > This is just the stacktrace, but where is it you cca

Spark streaming

2016-08-18 Thread Diwakar Dhanuskodi
Hi, Is there a way to specify in createDirectStream to receive only the last 'n' offsets of a specific topic and partition? I don't want to filter them out in foreachRDD. Sent from Samsung Mobile.
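
There is no built-in "last n" option, but createDirectStream accepts explicit starting offsets. A minimal pyspark sketch, assuming the latest offset has been fetched out of band with a plain Kafka client (the topic name, broker, and offset values below are hypothetical):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

sc = SparkContext(appName="last-n-offsets")
ssc = StreamingContext(sc, 10)  # 10s batches

# Spark does not fetch "latest minus n" for you; the latest offset must
# come from a regular Kafka client (assumed known here).
latest_offset = 100000
n = 1000
from_offsets = {TopicAndPartition("mytopic", 0): latest_offset - n}

stream = KafkaUtils.createDirectStream(
    ssc,
    ["mytopic"],
    {"metadata.broker.list": "broker:9092"},
    fromOffsets=from_offsets)
```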

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Ah, BTW, there is an issue, SPARK-16216, about printing dates and timestamps here. So please ignore the integer values for dates. 2016-08-19 9:54 GMT+09:00 Hyukjin Kwon: > Ah, sorry, I should have read this carefully. Do you mind if I ask your > codes to test? > > I would like to reproduce. > > >

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Ah, sorry, I should have read this carefully. Do you mind if I ask your codes to test? I would like to reproduce. I just tested this by myself but I couldn't reproduce as below (is this what you're doing, right?): case class ClassData(a: String, b: Date) val ds: Dataset[ClassData] = Seq( ("a",
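
For reference, a pyspark analog of this repro as a sketch, assuming a running Spark 2.0 session named spark (the output path is hypothetical):

```python
import datetime

# assumes a running SparkSession `spark` (the Spark 2.0 shell default)
df = spark.createDataFrame([("a", datetime.date(2016, 8, 18))], ["a", "b"])
df.printSchema()                      # column b is inferred as DateType
df.write.csv("/tmp/date-csv-repro")   # native Spark 2.0 CSV writer
```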

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Efe Selcuk
Thanks for the response. The problem with that thought is that I don't think I'm dealing with a complex nested type. It's just a dataset where every record is a case class with only simple types as fields, strings and dates. There's no nesting. That's what confuses me about how it's interpreting t

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-18 Thread Arun Luthra
This might be caused by a few large Map objects that Spark is trying to serialize. These are not broadcast variables or anything, they're just regular objects. Would it help if I further indexed these maps into a two-level Map, i.e. Map[String, Map[String, Int]]? Or would this still count against
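
Re-indexing likely won't shrink what gets serialized; the usual fix for large closure-captured maps is a broadcast variable. A minimal pyspark sketch (the map contents and keys are made up):

```python
# wrap the big lookup map in a broadcast variable so each executor gets
# one copy, instead of one copy per serialized task closure
big_map = {"key1": {"sub": 1}, "key2": {"sub": 2}}  # stand-in for the real maps
bc_map = sc.broadcast(big_map)

def lookup(record):
    # always read through bc_map.value inside tasks, never big_map directly
    return bc_map.value.get(record, {})

result = sc.parallelize(["key1", "key2"]).map(lookup).collect()
```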

Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Felix Cheung
I think lots of components expect to have read/write permission to a /tmp directory on HDFS. Glad it worked out! From: Andy Davidson <a...@santacruzintegration.com> Sent: Thursday, August 18, 2016 5:12 PM Subject: Re: pyspark unable to create UDF: java.lang.

Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andy Davidson
NICE CATCH!!! Many thanks. I spent all day on this bug. The error msg reported /tmp; I did not think to look on HDFS. [ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/ Found 1 items -rw-r--r-- 3 ec2-user supergroup 418 2016-04-13 22:49 hdfs:///tmp [ec2-user@ip-172-
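
A hedged cleanup sketch in Python, assuming the stray hdfs:///tmp file is safe to delete (the permissions are the conventional world-writable-with-sticky-bit setting, not verified against this cluster):

```python
import subprocess

# hdfs:///tmp exists as a *file* here; replace it with a world-writable dir
subprocess.check_call(["hadoop", "fs", "-rm", "hdfs:///tmp"])
subprocess.check_call(["hadoop", "fs", "-mkdir", "hdfs:///tmp"])
subprocess.check_call(["hadoop", "fs", "-chmod", "1777", "hdfs:///tmp"])
```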

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Hi Efe, If my understanding is correct, writing/reading complex types is not supported because the CSV format can't represent nested types. I guess that supporting them when writing in the external CSV library was rather a bug. I think it'd be great if we can write and read back CSV in it
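
A common workaround sketch, assuming the offending columns can be stringified before the write (the column names and path are hypothetical):

```python
from pyspark.sql import functions as F

# cast columns that CSV can't represent cleanly to strings before writing
out = df.select(F.col("a"), F.col("b").cast("string").alias("b"))
out.write.csv("/tmp/flattened-csv")
```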

Re: Unable to see external table that is created from Hive Context in the list of hive tables

2016-08-18 Thread Mich Talebzadeh
Which Hive database did you specify in your CREATE EXTERNAL TABLE statement? How did you specify the location of the file? This one creates an external table testme in database test, in a location under database test: scala> sqltext = """ | CREATE EXTERNAL TABLE test.testme ( | col1
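
The same statement from pyspark, as a sketch with hypothetical columns and location, assuming a HiveContext (or Hive-enabled session):

```python
sqltext = """
CREATE EXTERNAL TABLE IF NOT EXISTS test.testme (
  col1 INT,
  col2 STRING
)
STORED AS ORC
LOCATION 'hdfs:///data/test/testme'
"""
sqlContext.sql(sqltext)                       # sqlContext must be a HiveContext
sqlContext.sql("SHOW TABLES IN test").show()  # verify the table is registered
```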

Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Felix Cheung
Do you have a file called tmp at / on HDFS? On Thu, Aug 18, 2016 at 2:57 PM -0700, "Andy Davidson" <a...@santacruzintegration.com> wrote: For an unknown reason I cannot create a UDF when I run the attached notebook on my cluster. I get the following error Py4JJavaError: An error occurr

Unable to see external table that is created from Hive Context in the list of hive tables

2016-08-18 Thread SRK
Hi, I created an external table in Spark SQL using hiveContext ... something like CREATE EXTERNAL TABLE IF NOT EXISTS sampleTable stored as ORC LOCATION ... I can see the files getting created under the location I specified and am able to query it as well... but I don't see the table in Hive when I

pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andy Davidson
For an unknown reason I cannot create a UDF when I run the attached notebook on my cluster. I get the following error Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is

[Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Efe Selcuk
We have an application working in Spark 1.6. It uses the databricks csv library for the output format when writing out. I'm attempting an upgrade to Spark 2. When writing with both the native DataFrameWriter#csv() method and with first specifying the "com.databricks.spark.csv" format (I suspect un
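
For contrast, the two write paths in pyspark form, as a sketch (in Spark 2.0 the databricks format name is, as far as I know, aliased to the built-in CSV source; paths are hypothetical):

```python
# Spark 1.6 style, via the external spark-csv package:
df.write.format("com.databricks.spark.csv").save("/tmp/out-databricks")

# Spark 2.0 native writer:
df.write.csv("/tmp/out-native")
```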

Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-18 Thread Andy Davidson
Hi I am using python3, Java 8 and spark-1.6.1. I am running my code in a Jupyter notebook. The following code runs fine on my Mac udfRetType = ArrayType(StringType(), True) findEmojiUDF = udf(lambda s : re.findall(emojiPattern2, s), udfRetType) retDF = (emojiSpecialDF # convert into a li
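
A self-contained version of that UDF as a sketch; emojiPattern2 and the input DataFrame/column are not shown in the thread, so the ones below are hypothetical stand-ins:

```python
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# hypothetical stand-in for Andy's emojiPattern2
emoji_pattern = re.compile(u"[\U0001F600-\U0001F64F]")

# guard against null input rows, which would crash the lambda
find_emoji_udf = udf(lambda s: re.findall(emoji_pattern, s) if s else [],
                     ArrayType(StringType(), True))

# emojiSpecialDF and the "body" column are assumed names
ret_df = emojiSpecialDF.withColumn("emojis", find_emoji_udf("body"))
```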

Re: Model Persistence

2016-08-18 Thread Nick Pentreath
Model metadata (mostly parameter values) is usually tiny. The parquet data is most often for model coefficients. So this depends on the size of your model, i.e. your feature dimension. A high-dimensional linear model can be quite large - but still typically easy to fit into main memory on a singl

py4j.Py4JException: Method lower([class java.lang.String]) does not exist

2016-08-18 Thread Andy Davidson
I am still working on spark-1.6.1. I am using java8. Any idea why (df.select("*", functions.lower("rawTag").alias("tag")) Would raise py4j.Py4JException: Method lower([class java.lang.String]) does not exist Thanks in advance Andy https://spark.apache.org/docs/1.6.0/api/pytho
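
A hedged fix sketch: in the 1.6 Python API, lower() hands its argument to the JVM unchanged, and org.apache.spark.sql.functions.lower only has a Column overload — hence the Py4JException for a plain string. Passing a Column should work:

```python
from pyspark.sql import functions

# pass a Column, not the bare column name string
df2 = df.select("*", functions.lower(df["rawTag"]).alias("tag"))
```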

Extra string added to column name? (withColumn & expr)

2016-08-18 Thread rachmaninovquartet
Hi, I'm trying to implement a custom one hot encoder, since I want the output to be a specific way, suitable to theano. Basically, it will give a new column for each distinct member of the original features and have it set to 1 if the observation contains the specific member of the distinct featur
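
One way to sketch such an encoder with withColumn, avoiding expr() string-building entirely (the input column name "feature" is hypothetical):

```python
from pyspark.sql import functions as F

# one 0/1 output column per distinct value of the input column
distinct_vals = [r[0] for r in df.select("feature").distinct().collect()]
for v in distinct_vals:
    df = df.withColumn("feature_" + str(v),
                       F.when(F.col("feature") == v, 1).otherwise(0))
```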

Re: Large where clause StackOverflow 1.5.2

2016-08-18 Thread rachmaninovquartet
I solved this by using a Window partitioned by 'id'. I used lead and lag to create columns, which contained nulls in the places that I needed to delete, in each fold. I then removed those rows with the nulls and my additional columns.
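
A sketch of that pattern, with hypothetical column names (ts/value) standing in for the real ones:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id").orderBy("ts")
marked = (df.withColumn("prev", F.lag("value").over(w))
            .withColumn("next", F.lead("value").over(w)))
# drop the rows flagged by the nulls, then drop the helper columns
cleaned = (marked.filter(F.col("prev").isNotNull() & F.col("next").isNotNull())
                 .drop("prev", "next"))
```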

Model Persistence

2016-08-18 Thread Rich Tarro
The following Databricks blog on Model Persistence states "Internally, we save the model metadata and parameters as JSON and the data as Parquet." https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html What data associated with a model or Pipeline
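
For concreteness, a minimal save/load round trip in pyspark (Spark 2.0 ML persistence; the model type and paths are illustrative):

```python
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

lr = LogisticRegression(maxIter=10)
model = lr.fit(training_df)        # training_df assumed to exist
model.save("/models/lr")           # JSON metadata + Parquet coefficient data
loaded = LogisticRegressionModel.load("/models/lr")
```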

Structured Stream Behavior on failure

2016-08-18 Thread Cornelio
Hi, I have a couple of questions. 1. When Spark shuts down or fails, the docs state that "In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off." To achieve this, do I just need to set the checkpoint dir as "
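
Roughly yes, but it is set per query rather than globally. A sketch (the sink format and paths are hypothetical):

```python
# recovery state lives in the per-query checkpointLocation
query = (df.writeStream
           .format("parquet")
           .option("path", "/data/out")
           .option("checkpointLocation", "/chk/my-query")
           .start())
```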

Re: Aggregations with scala pairs

2016-08-18 Thread Andrés Ivaldi
Thanks!!! On Thu, Aug 18, 2016 at 3:35 AM, Jean-Baptiste Onofré wrote: > Agreed. > > Regards > JB > On Aug 18, 2016, at 07:32, Olivier Girardot wrote: >> >> CC'ing dev list, >> you should open a Jira and a PR related to it to discuss it c.f. >> https://cwiki.apache.org/confluence/display/S

createDirectStream parallelism

2016-08-18 Thread Diwakar Dhanuskodi
We are using the createDirectStream API to receive messages from a 48-partition topic. I am setting --num-executors 48 & --executor-cores 1 in spark-submit. All partitions were received in parallel and the corresponding RDDs in foreachRDD were executed in parallel. But when I join a transformed RDD js

Re: Standalone executor memory is fixed while executor cores are load balanced between workers

2016-08-18 Thread Mich Talebzadeh
Can you provide some info? In your conf/spark-env.sh, what do you set these to? # Options for the daemons used in the standalone deploy mode SPARK_WORKER_CORES=? ##, total number of cores to be used by executors by each worker SPARK_WORKER_MEMORY=?g ##, to set how much total memory workers have to giv

RE: pyspark.sql.functions.last not working as expected

2016-08-18 Thread Alexander Peletz
Is the issue that the default rangeBetween = rangeBetween(-sys.maxsize, 0)? That would explain the behavior below. Is this default documented somewhere? From: Alexander Peletz [mailto:alexand...@slalom.com] Sent: Wednesday, August 17, 2016 8:48 PM To: user Subject: RE: pyspark.sql.functions.last
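
That matches the default: with an orderBy, the frame runs from unbounded preceding to the current row, so last() just returns the current row's value. A hedged fix sketch (column names assumed):

```python
import sys
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# extend the frame to the end of the partition so last() sees all rows
w = (Window.partitionBy("k").orderBy("t")
           .rowsBetween(0, sys.maxsize))
df2 = df.withColumn("last_v", F.last("v").over(w))
```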

Standalone executor memory is fixed while executor cores are load balanced between workers

2016-08-18 Thread Petr Novak
Hello, when I set spark.executor.cores to e.g. 8 cores and spark.executor.memory to 8GB, it can allocate more executors with fewer cores for my app, but each executor gets 8GB RAM. It is a problem because I can allocate more memory across the cluster than expected; the worst case is 8x 1-core executors,

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-18 Thread janardhan shetty
There is a spark-ts package developed by Sandy which has an RDD version. Not sure about the DataFrame roadmap. http://sryza.github.io/spark-timeseries/0.3.0/index.html On Aug 18, 2016 12:42 AM, "ayan guha" wrote: > Thanks a lot. I resolved it using a UDF. > > Qs: does spark support any time series

Reporting errors from spark sql

2016-08-18 Thread yael aharon
Hello, I am working on an SQL editor which is powered by Spark SQL. When the SQL is not valid, I would like to provide the user with the line number and column number where the first error occurred. I am having a hard time finding a mechanism that will give me that information programmatically. Most
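
One hedged avenue in Spark 2.x pyspark: invalid SQL raises pyspark.sql.utils.ParseException, whose message embeds the position, e.g. "(line 1, pos 0)". Parsing it back out of the message text is a sketch, not a stable API:

```python
from pyspark.sql.utils import ParseException

try:
    spark.sql("SELEC 1")   # deliberately invalid SQL
except ParseException as e:
    print(str(e))          # message text contains line/pos markers
```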

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread chandan prakash
Yes, I looked into the source code implementation. sparkConf is serialized and saved during checkpointing and re-created from the checkpoint directory at the time of restart. So any sparkConf parameter which you load from application.config and set on the sparkConf object in code cannot be changed and re
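
The behavior follows from the standard getOrCreate pattern, sketched here (the checkpoint path is illustrative):

```python
from pyspark.streaming import StreamingContext

def create_context():
    # everything set up here -- including values read from application.config --
    # is baked into the checkpoint on the first run and NOT re-read on restart
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint("/chk/app")
    # ... create DStreams and set up processing ...
    return ssc

# on restart, create_context is skipped and the checkpointed state is used
ssc = StreamingContext.getOrCreate("/chk/app", create_context)
```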

Re: JavaRDD to DataFrame fails with null pointer exception in 1.6.0

2016-08-18 Thread Aditya
Check if the schema is generated correctly. On Wednesday 17 August 2016 10:15 AM, sudhir patil wrote: Tested with Java 7 & 8, same issue on both versions. On Aug 17, 2016 12:29 PM, "spats" wrote: Cannot convert JavaRDD to DataFrame in spark 1.6.0, throws n

Re: SparkStreaming source code

2016-08-18 Thread Adonis Settouf
Might it be this? https://github.com/apache/spark/tree/master/streaming Software engineer at TTTech

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread Cody Koeninger
Checkpointing is not kafka-specific. It encompasses metadata about the application. You can't re-use a checkpoint if your application has changed. http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash w

Re: [Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Aditya
Try using --files /path/of/hive-site.xml in spark-submit and run. On Thursday 18 August 2016 05:26 PM, Diwakar Dhanuskodi wrote: Hi Can you cross check by providing the same library path in --jars of spark-submit and run. Sent from Samsung Mobile. Original message From:

SparkStreaming source code

2016-08-18 Thread Aditya
Hi, I need to set up the Spark Streaming source code for exploration purposes. Can anyone suggest the link for the Spark Streaming source code? Regards, Aditya Calangutkar

RE: [Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Diwakar Dhanuskodi
Hi, Can you cross check by providing the same library path in --jars of spark-submit and run. Sent from Samsung Mobile. Original message From: "颜发才(Yan Facai)" Date: 18/08/2016 15:17 (GMT+05:30) To: "user.spark" Cc: Subject: [Spark 2.0] ClassNotFoundException is thrown wh

Re: Converting Dataframe to resultSet in Spark Java

2016-08-18 Thread rakesh sharma
Hi Sree, I don't think what you are trying to do is correct. DataFrame and ResultSet are two different types, and no strongly typed language will allow you to do that. If your intention is to traverse the DataFrame or get the individual rows and columns then you must try the map function and pass
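
A pyspark sketch of that row-by-row access (the column names are hypothetical); the point is that a DataFrame is traversed via its Row objects, not a JDBC-style ResultSet:

```python
# collect() for small results; df.toLocalIterator() (Spark 2.0+) for large ones
for row in df.collect():
    print(row["col1"], row["col2"])   # hypothetical column names
```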

Re: [Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Mich Talebzadeh
When you start spark-shell does it work, or is this issue only with spark-submit? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzad

Spark Streaming application failing with Token issue

2016-08-18 Thread Kamesh
Hi all, I am running a spark streaming application that stores events into a secure (Kerberized) HBase cluster. I launched this spark streaming application by passing --principal and --keytab. Despite this, the spark streaming application is failing after *7 days* with a token issue. Can someone please sugg

[Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Yan Facai
Hi, all. I copied hdfs-site.xml, core-site.xml and hive-site.xml to $SPARK_HOME/conf. And spark-submit is used to submit the task to yarn, and run in **client** mode. However, ClassNotFoundException is thrown. Some details of the logs are listed below: ``` 16/08/12 17:07:32 INFO hive.HiveUtils: Initializin

Re: 2.0.1/2.1.x release dates

2016-08-18 Thread Sean Owen
Historically, minor releases happen every ~4 months, and maintenance releases are a bit ad hoc but come about a month after the minor release. It's up to the release manager to decide to do them but maybe realistic to expect 2.0.1 in early September. On Thu, Aug 18, 2016 at 10:35 AM, Adrian Bridge

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread chandan prakash
Is it possible to use the checkpoint directory to restart streaming, but with a modified parameter value in the config file (e.g. username/password for the db connection)? Thanks in advance. Regards, Chandan On Thu, Aug 18, 2016 at 1:10 PM, chandan prakash wrote: > Hi, > I am using direct kafka with ch

2.0.1/2.1.x release dates

2016-08-18 Thread Adrian Bridgett
Just wondering if there were any rumoured release dates for either of the above. I'm seeing some odd hangs with 2.0.0 and mesos (and I know that the mesos integration has had a bit of updating in 2.1.x). Looking at JIRA, there's no suggested release date and issues seem to be added to a rele

GraphX VerticesRDD issue - java.lang.ArrayStoreException: java.lang.Long

2016-08-18 Thread Gerard Casey
Dear all, I am building a graph from two JSON files. Spark version 1.6.1 Creating Edge and Vertex RDDs from JSON files. The vertex JSON files looks like this: {"toid": "osgb400031043205", "index": 1, "point": [508180.748, 195333.973]} {"toid": "osgb400031043206", "inde

Re: How to Improve Random Forest classifier accuracy

2016-08-18 Thread Jörn Franke
Depends on your data... How did you split training and test set? How does the model fit to the data? You could of course also try to have more data to feed into the model. Have you considered alternative machine learning models? I do not think this is a Spark problem, but you should ask the mac

How to Improve Random Forest classifier accuracy

2016-08-18 Thread 陈哲
Hi All, I am using the spark.ml Random Forest classifier. I have only two label categories (1, 0), about 30 features, and data size over 100,000. I ran the spark JavaRandomForestClassifierExample code, and the model came out with these results (I made some changes to show more detailed results): Test Error = 0.0223
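
A hedged tuning sketch in pyspark (equivalent knobs exist in the Java API); numTrees and maxDepth are usually the first parameters to sweep, and the values below are illustrative:

```python
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=100, maxDepth=10)
model = rf.fit(train_df)   # train_df assumed prepared with a features vector
```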

Converting Dataframe to resultSet in Spark Java

2016-08-18 Thread Sree Eedupuganti
Retrieved the data into a DataFrame but I can't convert it into a ResultSet. Is there any possible way to convert it? Any suggestions please... Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.DataFrame cannot be cast to com.datastax.driver.core.ResultSet -- Best Regards,

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-18 Thread ayan guha
Thanks a lot. I resolved it using a UDF. Qs: does Spark support any time series model? Is there any roadmap to know when a feature will be roughly available? On 18 Aug 2016 16:46, "Yanbo Liang" wrote: > If you want to tie them with other data, I think the best way is to use > DataFrame join ope

spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread chandan prakash
Hi, I am using direct kafka with checkpointing of offsets, same as: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/IdempotentExample.scala I need to change some parameters like db connection params: username/password for the db connection. I stopped streaming grac

Re: Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-18 Thread Yanbo Liang
It looks like you mixed the use of ALS from the spark.ml and spark.mllib packages. You can train the model with either one; meanwhile, you should use the corresponding save/load functions. You cannot train/save the model with spark.mllib ALS and then use spark.ml ALS to load the model; it will throw exceptions. I
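
A matched round trip as a sketch, keeping train/save/load within spark.ml (Spark 2.0+; the column names, input DataFrame, and path are illustrative):

```python
from pyspark.ml.recommendation import ALS, ALSModel

model = ALS(userCol="user", itemCol="item",
            ratingCol="rating").fit(ratings_df)   # ratings_df assumed to exist
model.save("/models/als")
loaded = ALSModel.load("/models/als")             # load with the same package
```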