Re: Third party library

2016-12-13 Thread Jakob Odersky
Hi Vineet, great to see you solved the problem! Since this just appeared in my inbox, I wanted to take the opportunity for a shameless plug: https://github.com/jodersky/sbt-jni. In case you're using sbt and also developing the native library, this plugin may help with the pains of building and pack

Re: Optimization for Processing a million of HTML files

2016-12-12 Thread Jakob Odersky
Assuming the bottleneck is IO, you could try saving your files to HDFS. This will distribute your data and allow for better concurrent reads. On Mon, Dec 12, 2016 at 3:06 PM, Reth RM wrote: > Hi, > > I have millions of html files in a directory, using "wholeTextFiles" api to > load them and proce
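A minimal sketch of that approach (assuming sc is an existing SparkContext; the HDFS paths, namenode address and the title-extraction step are placeholders, made up just to show per-file processing):

    // wholeTextFiles yields (path, fileContent) pairs; with the data on HDFS,
    // the partitions can be read concurrently across executors.
    val pages = sc.wholeTextFiles("hdfs://namenode:8020/data/html", minPartitions = 200)
    val titles = pages.map { case (path, html) =>
      // toy processing: pull out the <title> tag, if any
      (path, "<title>(.*?)</title>".r.findFirstMatchIn(html).map(_.group(1)))
    }
    titles.saveAsTextFile("hdfs://namenode:8020/data/html-titles")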

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Also, in case the issue was not due to the string length (however it is still valid and may get you later), the issue may be due to some other indexing issues which are currently being worked on here https://issues.apache.org/jira/browse/SPARK-6235 On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Hi Pradeep, I'm afraid you're running into a hard Java issue. Strings are indexed with signed integers and can therefore not be longer than approximately 2 billion characters. Could you use `textFile` as a workaround? It will give you an RDD of the files' lines instead. In general, this guide htt
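A hedged sketch of the suggested workaround (sc is an existing SparkContext, the path is a placeholder):

    // textFile splits each file into lines, so no single String has to hold a whole file
    val lines = sc.textFile("hdfs:///data/big-files/*.txt")
    // downstream per-line processing works as usual
    val errors = lines.filter(_.contains("ERROR")).count()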

Re: custom generate spark application id

2016-12-05 Thread Jakob Odersky
The app ID is assigned internally by spark's task scheduler https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala#L35. You could probably change the naming, however I'm pretty sure that the ID will always have to be unique for a context on a

Re: SparkILoop doesn't run

2016-11-21 Thread Jakob Odersky
libraries of multiple scala versions on the same classpath. You mention that it worked before, can you recall what libraries you upgraded before it broke? --Jakob On Mon, Nov 21, 2016 at 2:34 PM, Jakob Odersky wrote: > Trying it out locally gave me an NPE. I'll look into it in more > deta

Re: SparkILoop doesn't run

2016-11-21 Thread Jakob Odersky
Trying it out locally gave me an NPE. I'll look into it in more detail, however the SparkILoop.run() method is dead code. It's used nowhere in spark and can be removed without any issues. On Thu, Nov 17, 2016 at 11:16 AM, Mohit Jaggi wrote: > Thanks Holden. I did post to the user list but since t

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread Jakob Odersky
> how do I tell my spark driver program to not create so many? This may depend on your driver program. Do you spawn any threads in it? Could you share some more information on the driver program, spark version and your environment? It would greatly help others to help you On Mon, Oct 31, 2016 at

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
further then. Thanks for the quick help. > > On Mon, Oct 24, 2016 at 7:34 PM Jakob Odersky wrote: >> >> What you're seeing is merely a strange representation, 0E-18 is zero. >> The E-18 represents the precision that Spark uses to store the decimal >> >> On

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
What you're seeing is merely a strange representation, 0E-18 is zero. The E-18 represents the precision that Spark uses to store the decimal On Mon, Oct 24, 2016 at 7:32 PM, Jakob Odersky wrote: > An even smaller example that demonstrates the same behaviour: > > Seq(Dat
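A quick plain-Scala check that 0E-18 and 0 are the same value (no Spark needed):

    val a = BigDecimal("0E-18")      // how a zero stored with scale 18 is rendered
    val b = BigDecimal(0)
    println(a.compare(b) == 0)       // true: equal values, different scale/representation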

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
An even smaller example that demonstrates the same behaviour: Seq(Data(BigDecimal(0))).toDS.head On Mon, Oct 24, 2016 at 7:03 PM, Efe Selcuk wrote: > I’m trying to track down what seems to be a very slight imprecision in our > Spark application; two of our columns, which should be netting ou

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-19 Thread Jakob Odersky
Another reason I could imagine is that files are often read from HDFS, which by default uses line terminators to separate records. It is possible to implement your own hdfs delimiter finder, however for arbitrary json data, finding that delimiter would require stateful parsing of the file and woul

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
Just thought of another potential issue: you should use the "provided" scope when depending on spark. I.e. in your project's pom: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.11</artifactId> <version>2.0.1</version> <scope>provided</scope> </dependency> On Mon, Oct 10, 2016 at 2:00 PM, Jakob Odersky wrot

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread Jakob Odersky
How do you submit the application? A version mismatch between the launcher, driver and workers could lead to the bug you're seeing. A common reason for a mismatch is if the SPARK_HOME environment variable is set. This will cause the spark-submit script to use the launcher determined by that environm

Re: How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-07 Thread Jakob Odersky
ROR"); > Receiver receiver = new Receiver(config); > JavaReceiverInputDStream jsonMessagesDStream = > ssc.receiverStream(receiver); > jsonMessagesDStream.count() > ssc.start(); > ssc.awaitTermination(); > > > Not using Mixmax yet?

Re: How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-06 Thread Jakob Odersky
You can change the kind of log messages that are shown by calling "context.setLogLevel()" with an appropriate level: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN. See http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@setLogLevel(logLevel:String):Unit for fu
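A minimal sketch of that call in a driver program (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("quiet-app"))
    // suppress everything below ERROR from this point on
    sc.setLogLevel("ERROR")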

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-05 Thread Jakob Odersky
Oct 5, 2016 at 3:35 PM, kant kodali wrote: > I am running locally so they all are on one host > > > > On Wed, Oct 5, 2016 3:12 PM, Jakob Odersky ja...@odersky.com wrote: > >> Are all spark and scala versions the same? By "all" I mean the master, >> worker and driver instances. >> >

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-05 Thread Jakob Odersky
Are all spark and scala versions the same? By "all" I mean the master, worker and driver instances.

Re: Error downloading Spark 2.0.1

2016-10-04 Thread Jakob Odersky
confirmed On Tue, Oct 4, 2016 at 11:56 AM, Daniel wrote: > When you try download Spark 2.0.1 from official webpage you get this error: > > NoSuchKeyThe specified key does not > exist.spark-2.0.1-bin-hadoop2.7.tgz6EA5F8FFFE6CCAEFg8UIuHetxWoGE0J/w2UtHn7DjKwATRKtHHHKu/2Mj2SmUPhPBZ+aoDPb+2uwn5J4Uj2vo

Re: Package org.apache.spark.annotation no longer exist in Spark 2.0?

2016-10-04 Thread Jakob Odersky
It's still there on master. It is in the "spark-tags" module however (under common/tags), maybe something changed in the build environment and it isn't made available as a dependency to your project? What happens if you include the module as a direct dependency? --Jakob On Tue, Oct 4, 2016 at 10:

Re: get different results when debugging and running scala program

2016-09-30 Thread Jakob Odersky
There is no image attached, I'm not sure how the apache mailing lists handle them. Can you provide the output as text? best, --Jakob On Fri, Sep 30, 2016 at 8:25 AM, chen yong wrote: > Hello All, > > > > I am using IDEA 15.0.4 to debug a scala program. It is strange to me that > the results were

Re: Apache Spark JavaRDD pipe() need help

2016-09-22 Thread Jakob Odersky
e know. This code will be executed in > all the nodes in a cluster. > > Hope my requirement is now clear. How to do this? > > Regards, > Shash > > On Thu, Sep 22, 2016 at 4:13 AM, Jakob Odersky wrote: >> >> Can you provide more details? It's unclear what

Re: Apache Spark JavaRDD pipe() need help

2016-09-21 Thread Jakob Odersky
Can you provide more details? It's unclear what you're asking On Wed, Sep 21, 2016 at 10:14 AM, shashikant.kulka...@gmail.com wrote: > Hi All, > > I am trying to use the JavaRDD.pipe() API. > > I have one object with me from the JavaRDD ---

Re: Has anyone installed the scala kernel for Jupyter notebook

2016-09-21 Thread Jakob Odersky
One option would be to use Apache Toree. A quick setup guide can be found here https://toree.incubator.apache.org/documentation/user/quick-start On Wed, Sep 21, 2016 at 2:02 PM, Arif,Mubaraka wrote: > Has anyone installed the scala kernel for Jupyter notebook. > > > > Any blogs or steps to follow

Re: Task Deserialization Error

2016-09-21 Thread Jakob Odersky
Your app is fine, I think the error has to do with the way IntelliJ launches applications. Is your app forked in a new jvm when you run it? On Wed, Sep 21, 2016 at 2:28 PM, Gokula Krishnan D wrote: > Hello Sumit - > > I could see that SparkConf() specification is not being mentioned in your > pr

Re: Can I assign affinity for spark executor processes?

2016-09-13 Thread Jakob Odersky
Hi Xiaoye, could it be that the executors were spawned before the affinity was set on the worker? Would it help to start spark worker with taskset from the beginning, i.e. "taskset [mask] start-slave.sh"? Workers in spark (standalone mode) simply create processes with the standard java process API.

Re: iterating over DataFrame Partitions sequentially

2016-09-09 Thread Jakob Odersky
not the best use-case of Spark though and will probably be a performance bottleneck. On Fri, Sep 9, 2016 at 11:45 AM, Jakob Odersky wrote: > Hi Sujeet, > > going sequentially over all parallel, distributed data seems like a > counter-productive thing to do. What are you trying to acc

Re: iterating over DataFrame Partitions sequentially

2016-09-09 Thread Jakob Odersky
Hi Sujeet, going sequentially over all parallel, distributed data seems like a counter-productive thing to do. What are you trying to accomplish? regards, --Jakob On Fri, Sep 9, 2016 at 3:29 AM, sujeet jog wrote: > Hi, > Is there a way to iterate over a DataFrame with n partitions sequentially,

Re: Returning DataFrame as Scala method return type

2016-09-08 Thread Jakob Odersky
(Maybe unrelated FYI): in case you're using only Scala or Java with Spark, I would recommend to use Datasets instead of DataFrames. They provide exactly the same functionality, yet offer more type-safety. On Thu, Sep 8, 2016 at 11:05 AM, Lee Becker wrote: > > On Thu, Sep 8, 2016 at 11:35 AM, Ashi
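A small illustration of that difference, assuming a Spark 2.x SparkSession named spark and a made-up case class:

    case class Person(name: String, age: Int)

    import spark.implicits._
    val ds = Seq(Person("Ada", 36)).toDS()   // Dataset[Person]
    val df = ds.toDF()                       // DataFrame, i.e. Dataset[Row]

    ds.map(p => p.age + 1)                   // field access checked at compile time
    df.select($"agee" + 1)                   // the typo only surfaces at runtime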

Re: Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Jakob Odersky
Spark currently requires at least Java 1.7, so adding a Java 1.8-specific encoder will not be straightforward without affecting requirements. I can think of two solutions: 1. add a Java 1.8 build profile which includes such encoders (this may be useful for Scala 2.12 support in the future as well)

Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
Forgot to answer your question about feature parity of Python w.r.t. Spark's different components I mostly work with scala so I can't say for sure but I think that all pre-2.0 features (that's basically everything except Structured Streaming) are on par. Structured Streaming is a pretty new feature

Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
As you point out, often the reason that Python support lags behind is that functionality is implemented in Scala, so the API in that language is "free" whereas Python support needs to be added explicitly. Nevertheless, Python bindings are an important part of Spark and are used by many people (this

Re: Possible Code Generation Bug: Can Spark 2.0 Datasets handle Scala Value Classes?

2016-09-01 Thread Jakob Odersky
I'm not sure how the shepherd thing works, but just FYI Michael Armbrust originally wrote Catalyst, the engine behind Datasets. You can find a list of all committers here https://cwiki.apache.org/confluence/display/SPARK/Committers. Another good resource is to check https://spark-prs.appspot.com/

Re: Scala Vs Python

2016-09-01 Thread Jakob Odersky
gt;>> http://talebzadehmich.wordpress.com >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>> any loss, damage or destruction of data or any other property which may >>> arise from relying on this email's technical

Re: Scala Vs Python

2016-09-01 Thread Jakob Odersky
> However, what really worries me is not having Dataset APIs at all in Python. I think thats a deal breaker. What is the functionality you are missing? In Spark 2.0 a DataFrame is just an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in core/.../o/a/s/sql/package.scala). Since python is
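For reference, a tiny sketch of what that alias means on the Scala side (Spark 2.x):

    import org.apache.spark.sql.{DataFrame, Dataset, Row}

    // org.apache.spark.sql declares: type DataFrame = Dataset[Row]
    def rowCount(df: DataFrame): Long = df.count()
    val sameThing: Dataset[Row] => Long = rowCount   // compiles: the types are identical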

Re: Possible Code Generation Bug: Can Spark 2.0 Datasets handle Scala Value Classes?

2016-09-01 Thread Jakob Odersky
Hi Aris, thanks for sharing this issue. I can confirm that value classes currently don't work, however I can't think of reason why they shouldn't be supported. I would therefore recommend that you report this as a bug. (Btw, value classes also currently aren't definable in the REPL. See https://is

Re: How to use custom class in DataSet

2016-08-30 Thread Jakob Odersky
Implementing custom encoders is unfortunately not well supported at the moment (IIRC there are plans to eventually add an api for user defined encoders). That being said, there are a couple of encoders that can work with generic, serializable data types: "javaSerialization" and "kryo", found here
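A hedged sketch using one of those generic encoders (Spark 2.x; MyClass is a placeholder type, and a SparkSession named spark is assumed):

    import org.apache.spark.sql.{Encoder, Encoders}

    class MyClass(val id: Int, val payload: String) extends Serializable

    // fall back to kryo (or Encoders.javaSerialization) when no built-in encoder applies
    implicit val myClassEncoder: Encoder[MyClass] = Encoders.kryo[MyClass]

    val ds = spark.createDataset(Seq(new MyClass(1, "a"), new MyClass(2, "b")))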

Re: Error in Word Count Program

2016-07-19 Thread Jakob Odersky
Does the file /home/user/spark-1.5.1-bin-hadoop2.4/bin/README.md exist? On Tue, Jul 19, 2016 at 4:30 AM, RK Spark wrote: > val textFile = sc.textFile("README.md")val linesWithSpark = > textFile.filter(line => line.contains("Spark")) > linesWithSpark.saveAsTextFile("output1") > > > Same error: >

Re: I'm trying to understand how to compile Spark

2016-07-19 Thread Jakob Odersky
Hi Eli, to build spark, just run build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests package in your source directory, where package is the actual word "package". This will recompile the whole project, so it may take a while when running the first time. Replacing a single file i

Re: why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Jakob Odersky
Spark actually used to depend on Akka. Unfortunately this brought in all of Akka's dependencies (in addition to Spark's already quite complex dependency graph) and, as Todd mentioned, led to conflicts with projects using both Spark and Akka. It would probably be possible to use Akka and shade it t

Re: I want to unsubscribe

2016-04-05 Thread Jakob Odersky
to unsubscribe, send an email to user-unsubscr...@spark.apache.org On Tue, Apr 5, 2016 at 4:50 PM, Ranjana Rajendran wrote: > I get to see the threads in the public mailing list. I don't want so many > messages in my inbox. I want to unsubscribe. -

Re: Building spark submodule source code

2016-03-21 Thread Jakob Odersky
Another gotcha to watch out for are the SPARK_* environment variables. Have you exported SPARK_HOME? In that case, 'spark-shell' will use Spark from the variable, regardless of the place the script is called from. I.e. if SPARK_HOME points to a release version of Spark, your code changes will never

Re: Can't zip RDDs with unequal numbers of partitions

2016-03-20 Thread Jakob Odersky
Can you share a snippet that reproduces the error? What was spark.sql.autoBroadcastJoinThreshold before your last change? On Thu, Mar 17, 2016 at 10:03 AM, Jiří Syrový wrote: > Hi, > > any idea what could be causing this issue? It started appearing after > changing parameter > > spark.sql.aut

Re: ClassNotFoundException in RDD.map

2016-03-20 Thread Jakob Odersky
The error is very strange indeed, however without code that reproduces it, we can't really provide much help beyond speculation. One thing that stood out to me immediately is that you say you have an RDD of Any where every Any should be a BigDecimal, so why not specify that type information? When

Re: The error to read HDFS custom file in spark.

2016-03-19 Thread Jakob Odersky
Doesn't FileInputFormat require type parameters? Like so: class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord] extends FileInputFormat[LW, RD] I haven't verified this but it could be related to the compile error you're getting. On Thu, Mar 17, 2016 at 9:53 AM, Benyi Wang wrote: >

Re: installing packages with pyspark

2016-03-19 Thread Jakob Odersky
Hi, regarding 1, packages are resolved locally. That means that when you specify a package, spark-submit will resolve the dependencies and download any jars on the local machine, before shipping* them to the cluster. So, without a priori knowledge of dataproc clusters, it should be no different to

Re: installing packages with pyspark

2016-03-19 Thread Jakob Odersky
/spark.apache.org/docs/latest/submitting-applications.html >> >> _ >> From: Jakob Odersky >> Sent: Thursday, March 17, 2016 6:40 PM >> Subject: Re: installing packages with pyspark >> To: Ajinkya Kale >> Cc: >> >> >&g

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
se two changes to dependencies >>>> >>>> >>>> org.apache.spark >>>> spark-core_2.10 >>>> 1.5.1 >>>> >>>> >>>> org.apache.spark >>>> spark-sql_2.10 >>>> 1.5.1 >>>> >>>> >>>> >>>> [D

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
Hi Mich, probably unrelated to the current error you're seeing, however the following dependencies will bite you later: spark-hive_2.10 spark-csv_2.11 the problem here is that you're using libraries built for different Scala binary versions (the numbers after the underscore). The simple fix here is

Re: Installing Spark on Mac

2016-03-15 Thread Jakob Odersky
version1.5.2 with Java > 7 and SCALA version 2.10.6; got the same error messages > > Do you think it would be worth me trying to change the IP address in > SPARK_MASTER_IP to the IP address of the master node? If so, how would I go > about doing that? > > Thanks, >

Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-14 Thread Jakob Odersky
Have you tried setting the configuration `spark.executor.extraLibraryPath` to point to a location where your .so's are available? (Not sure if non-local files, such as HDFS, are supported) On Mon, Mar 14, 2016 at 2:12 PM, Tristan Nixon wrote: > What build system are you using to compile your code

Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
regarding my previous message, I forgot to mention to run netstat as root (sudo netstat -plunt) sorry for the noise On Fri, Mar 11, 2016 at 12:29 AM, Jakob Odersky wrote: > Some more diagnostics/suggestions: > > 1) are other services listening to ports in the 4000 range (run > &qu

Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
is the directory I created for Spark once I >>>> downloaded the tgz file; comes back with PWD=/Users/aidatefera/Spark >>>> >>>> Tried running ./bin/spark-shell ; comes back with same error as below; i.e >>>> could not bind to port 0 etc. >>

Re: Installing Spark on Mac

2016-03-09 Thread Jakob Odersky
Sorry had a typo in my previous message: > try running just "/bin/spark-shell" please remove the leading slash (/) On Wed, Mar 9, 2016 at 1:39 PM, Aida Tefera wrote: > Hi there, tried echo $SPARK_HOME but nothing comes back so I guess I need to > set it. How would I do that? > > Thanks > > Sent

Re: Installing Spark on Mac

2016-03-09 Thread Jakob Odersky
As Tristan mentioned, it looks as though Spark is trying to bind on port 0 and then 1 (which is not allowed). Could it be that some environment variables from you previous installation attempts are polluting your configuration? What does running "env | grep SPARK" show you? Also, try running just

Re: Confusing RDD function

2016-03-08 Thread Jakob Odersky
Hi Jeff, > But in our development environment, the returned RDD results were empty and > b.function(_) was never executed what do you mean by "the returned RDD results were empty", did you try running a foreach, collect or any other action on the returned RDD[C]? Spark provides two kinds of oper
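A tiny illustration of the transformation-vs-action point (sc is assumed to be a SparkContext; the numbers are arbitrary):

    val b = sc.parallelize(1 to 10)
    // map is a transformation: nothing runs yet, the function is only recorded
    val c = b.map { x => println(s"processing $x"); x * 2 }
    // an action (collect, foreach, count, ...) is what actually triggers execution
    val results = c.collect()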

Re: Installing Spark on Mac

2016-03-08 Thread Jakob Odersky
I've had some issues myself with the user-provided-Hadoop version. If you simply just want to get started, I would recommend downloading Spark (pre-built, with any of the hadoop versions) as Cody suggested. A simple step-by-step guide: 1. curl http://apache.arvixe.com/spark/spark-1.6.0/spark-1.6.

Re: How could I do this algorithm in Spark?

2016-02-24 Thread Jakob Odersky
Hi Guillermo, assuming that the first "a,b" is a typo and you actually meant "a,d", this is a sorting problem. You could easily model your data as an RDD or tuples (or as a dataframe/set) and use the sortBy (or orderBy for dataframe/sets) methods. best, --Jakob On Wed, Feb 24, 2016 at 2:26 PM, G
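A minimal sketch of the sortBy suggestion (the tuple layout is assumed from the question and may differ):

    // assume records shaped like (key, value), e.g. ("a","d"), ("b","c"), ...
    val records = sc.parallelize(Seq(("b", "c"), ("a", "d"), ("c", "a")))
    val sorted  = records.sortBy(_._1)          // sort by the first field, ascending
    // DataFrame/Dataset equivalent: records.toDF("k", "v").orderBy("k")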

Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Jakob Odersky
You can `filter` your dataframes (see the scaladoc for DataFrame.filter) before saving them to, or after reading them from, parquet files On Wed, Feb 24, 2016 at 1:28 AM, Cheng Lian wrote: > Parquet is a rea
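A hedged sketch of that idea with the Spark 1.6-era SQLContext API (paths, column name and value are placeholders):

    import org.apache.spark.sql.functions.col

    val users = sqlContext.read.parquet("/data/users")
    // "deleting" a record amounts to keeping everything else and writing a new copy
    val cleaned = users.filter(!(col("id") === 12345))
    cleaned.write.parquet("/data/users_cleaned")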

Re: SparkMaster IP

2016-02-22 Thread Jakob Odersky
Spark master by default binds to whatever ip address your current host resolves to. You have a few options to change that: - override the ip by setting the environment variable SPARK_LOCAL_IP - change the ip in your local "hosts" file (/etc/hosts on linux, not sure on windows) - specify a different

Re: Option[Long] parameter in case class parsed from JSON DataFrame failing when key not present in JSON

2016-02-22 Thread Jakob Odersky
I think the issue is that the `json.read` function has no idea of the underlying schema, in fact the documentation (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader) says: > Unless the schema is specified using schema function, this function goes > thr
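A sketch of supplying the schema up front instead of letting the reader infer it (field names and path are placeholders):

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),    // absent keys come back as null
      StructField("name", StringType, nullable = true)
    ))
    val df = sqlContext.read.schema(schema).json("/data/events.json")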

Re: How to parallel read files in a directory

2016-02-11 Thread Jakob Odersky
Hi Junjie, How do you access the files currently? Have you considered using hdfs? It's designed to be distributed across a cluster and Spark has built-in support. Best, --Jakob On Feb 11, 2016 9:33 AM, "Junjie Qian" wrote: > Hi all, > > I am working with Spark 1.6, scala and have a big dataset

Re: How to collect/take arbitrary number of records in the driver?

2016-02-10 Thread Jakob Odersky
Another alternative: rdd.take(1000).drop(100) //this also preserves ordering Note however that this can lead to an OOM if the data you're taking is too large. If you want to perform some operation sequentially on your driver and don't care about performance, you could do something similar as Moha

Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
> > > So it basically boils down to this demarcation as suggested which looks > clearer > > val errlog = sc.textFile("/unix_files/*.ksh") > errlog.filter(line => line.contains("sed")).collect().foreach(line => > println(line)) > > Regards,

Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
Hi Mich, your assumptions 1 to 3 are all correct (nitpick: they're method *calls*, the methods being the part before the parentheses, but I assume that's what you meant). The last one is also a method call but uses syntactic sugar on top: `foreach(println)` boils down to `foreach(line => println(li

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jakob Odersky
To address one specific question: > Docs says it usues sun.misc.unsafe to convert physical rdd structure into byte array at some point for optimized GC and memory. My question is why is it only applicable to SQL/Dataframe and not RDD? RDD has types too! A principal difference between RDDs and Dat

Re: Spark 1.5.2 memory error

2016-02-02 Thread Jakob Odersky
Can you share some code that produces the error? It is probably not due to spark but rather the way data is handled in the user code. Does your code call any reduceByKey actions? These are often a source for OOM errors. On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov wrote: > Hi Guys, > > I need

Re: Spark 2.0.0 release plan

2016-01-29 Thread Jakob Odersky
I'm not an authoritative source but I think it is indeed the plan to move the default build to 2.11. See this discussion for more detail http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html On Fri, Jan 29, 2016 at 11:43 AM, Deenar Toraskar wrote: > A re

Re: Maintain state outside rdd

2016-01-27 Thread Jakob Odersky
Be careful with mapPartitions though, since it is executed on worker nodes, you may not see side-effects locally. Is it not possible to represent your state changes as part of your rdd's transformations? I.e. return a tuple containing the modified data and some accumulated state. If that really do

Re: Python UDFs

2016-01-27 Thread Jakob Odersky
Have you checked: - the mllib doc for python https://spark.apache.org/docs/1.6.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseVector - the udf doc https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.udf You should be fine in returning a DenseVector as

Re: How to debug ClassCastException: java.lang.String cannot be cast to java.lang.Long in SparkSQL

2016-01-27 Thread Jakob Odersky
> the data type mapping has been taken care of in my code, could you share this? On Tue, Jan 26, 2016 at 8:30 PM, Anfernee Xu wrote: > Hi, > > I'm using Spark 1.5.0, I wrote a custom Hadoop InputFormat to load data from > 3rdparty datasource, the data type mapping has been taken care of in my > c

Re: Escaping tabs and newlines not working

2016-01-27 Thread Jakob Odersky
Can you provide some code that reproduces the issue, specifically in a spark job? The linked stackoverflow question is related to plain scala and the proposed answers offer a solution. On Wed, Jan 27, 2016 at 1:57 PM, Harshvardhan Chauhan wrote: > > > Hi, > > Escaping newline and tab doesn't seem

Re: Using Spark in mixed Java/Scala project

2016-01-27 Thread Jakob Odersky
JavaSparkContext has a wrapper constructor for the "scala" SparkContext. In this case all you need to do is declare a SparkContext that is accessible both from the Java and Scala sides of your project and wrap the context with a JavaSparkContext. Search for java source compatibility with scala for
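A minimal sketch of that wrapping, shown from the Scala side (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.api.java.JavaSparkContext

    val sc  = new SparkContext(new SparkConf().setAppName("mixed-project"))
    // wrap the Scala context so the Java side of the project can use the Java-friendly API
    val jsc = new JavaSparkContext(sc)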

Re: trouble using eclipse to view spark source code

2016-01-18 Thread Jakob Odersky
Have you followed the guide on how to import spark into eclipse https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse ? On 18 January 2016 at 13:04, Andy Davidson wrote: > Hi > > My project is implemented using Java 8 and Python. Some times its han

Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
the time, i guess in > different threads indeed (its in akka) > > On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia > wrote: > >> RDDs actually are thread-safe, and quite a few applications use them this >> way, e.g. the JDBC server. >> >> Matei >> >

Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I don't think RDDs are threadsafe. More fundamentally however, why would you want to run RDD actions in parallel? The idea behind RDDs is to provide you with an abstraction for computing parallel operations on distributed data. Even if you were to call actions from several threads at once, the indi

Re: Put all elements of RDD into array

2016-01-11 Thread Jakob Odersky
Hey, I just reread your question and saw I overlooked some crucial information. Here's a solution: val data = model.asInstanceOf[DistributedLDAModel].topicDistributions.sortByKey().collect() val tpdist = data.map(doc => doc._2.toArray) hope it works this time On 11 January 2016 at 17:1

Re: Put all elements of RDD into array

2016-01-11 Thread Jakob Odersky
Hi Daniel, You're actually not modifying the original array: `array :+ x ` will give you a new array with `x` appended to it. In your case the fix is simple: collect() already returns an array, use it as the assignment value to your val. In case you ever want to append values iteratively, search
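A short plain-Scala illustration of why :+ leaves the original array untouched, plus the direct assignment of collect()'s result:

    val original = Array(1, 2, 3)
    val appended = original :+ 4        // a new array; `original` is unchanged
    println(original.mkString(","))     // 1,2,3
    println(appended.mkString(","))     // 1,2,3,4
    // with Spark, collect() already returns an Array, so assign its result directly:
    // val data: Array[Int] = rdd.collect()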

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Jakob Odersky
Check the configuration guide for a description on units ( http://spark.apache.org/docs/latest/configuration.html#spark-properties). In your case, 5GB would be specified as 5g. On 6 January 2016 at 10:29, unk1102 wrote: > Hi As part of Spark 1.6 release what should be ideal value or unit for > s
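A hedged example of setting it from code (note that the configuration guide spells the key camel-cased, spark.memory.offHeap.size, and off-heap use also requires spark.memory.offHeap.enabled):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.memory.offHeap.enabled", "true")   // off-heap use must be enabled explicitly
      .set("spark.memory.offHeap.size", "5g")        // 5 gigabytes, using the unit suffix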

Re: Why is this job running since one hour?

2016-01-06 Thread Jakob Odersky
What is the job doing? How much data are you processing? On 6 January 2016 at 10:33, unk1102 wrote: > Hi I have one main Spark job which spawns multiple child spark jobs. One of > the child spark job is running for an hour and it keeps on hanging there I > have taken snap shot please see > < > h

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-17 Thread Jakob Odersky
It might be a good idea to see how many files are open and try increasing the open file limit (this is done on an os level). In some application use-cases it is actually a legitimate need. If that doesn't help, make sure you close any unused files and streams in your code. It will also be easier t

Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
For future reference, this should be fixed with PR #10337 ( https://github.com/apache/spark/pull/10337) On 16 December 2015 at 11:01, Jakob Odersky wrote: > Yeah, the same kind of error actually happens in the JIRA. It actually > succeeds but a load of exceptions are thrown. Subsequen

Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
ng > the result that surprised me. > > I want to see if there is a way of getting rid of the exceptions. > > Thanks > > On Wed, Dec 16, 2015 at 10:53 AM, Jakob Odersky > wrote: > >> When you re-run the last statement a second time, does it work? Could it >> be r

Re: File not found error running query in spark-shell

2015-12-16 Thread Jakob Odersky
When you re-run the last statement a second time, does it work? Could it be related to https://issues.apache.org/jira/browse/SPARK-12350 ? On 16 December 2015 at 10:39, Ted Yu wrote: > Hi, > I used the following command on a recently refreshed checkout of master > branch: > > ~/apache-maven-3.3.

Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
With DataFrames you lose type-safety. Depending on the language you are using this can also be considered a drawback. On 15 December 2015 at 15:08, Jakob Odersky wrote: > By using DataFrames you will not need to specify RDD operations explicitly, > instead the operations are built and opt

Re: what are the cons/drawbacks of a Spark DataFrames

2015-12-15 Thread Jakob Odersky
By using DataFrames you will not need to specify RDD operations explicitly; instead the operations are built and optimized using the information available in the DataFrame's schema. The only drawback I can think of is some loss of generality: given a dataframe containing types A, you will be

Re: ideal number of executors per machine

2015-12-15 Thread Jakob Odersky
Hi Veljko, I would assume keeping the number of executors per machine to a minimum is best for performance (as long as you consider memory requirements as well). Each executor is a process that can run tasks in multiple threads. On a kernel/hardware level, thread switches are much cheaper than proc

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
sorry typo, I meant *without* the addJar On 14 December 2015 at 11:13, Jakob Odersky wrote: > > Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop > 2.4.1.but I also find something strange like this : > > > > http://apache-spark-user-list.1001560

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-14 Thread Jakob Odersky
> Sorry,I'm late.I try again and again ,now I use spark 1.4.0 ,hadoop 2.4.1.but I also find something strange like this : > http://apache-spark-user-list.1001560.n3.nabble.com/worker-java-lang-ClassNotFoundException-ttt-test-anonfun-1-td25696.html > (if i use "textFile",It can't run.) In the

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
Btw, Spark 1.5 comes with support for hadoop 2.2 by default On 11 December 2015 at 03:08, Bonsen wrote: > Thank you,and I find the problem is my package is test,but I write package > org.apache.spark.examples ,and IDEA had imported the > spark-examples-1.5.2-hadoop2.6.0.jar ,so I can run it,and

Re: Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-11 Thread Jakob Odersky
It looks like you have an issue with your classpath, I think it is because you add a jar containing Spark twice: first, you have a dependency on Spark somewhere in your build tool (this allows you to compile and run your application), second you re-add Spark here > sc.addJar("/home/hadoop/spark-a

Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Jakob Odersky
Is there any other process using port 7077? On 10 December 2015 at 08:52, Andy Davidson wrote: > Hi > > I am using spark-1.5.1-bin-hadoop2.6. Any idea why I get this warning. My > job seems to run with out any problem. > > Kind regards > > Andy > > + /root/spark/bin/spark-submit --class > com.pw

Re: StackOverflowError when writing dataframe to table

2015-12-10 Thread Jakob Odersky
Can you give us some more info about the dataframe and caching? Ideally a set of steps to reproduce the issue On 9 December 2015 at 14:59, apu mishra . rr wrote: > The command > > mydataframe.write.saveAsTable(name="tablename") > > sometimes results in java.lang.StackOverflowError (see below fo

Re: HELP! I get "java.lang.String cannot be cast to java.lang.Intege " for a long time.

2015-12-10 Thread Jakob Odersky
Could you provide some more context? What is rawData? On 10 December 2015 at 06:38, Bonsen wrote: > I do like this "val secondData = rawData.flatMap(_.split("\t").take(3))" > > and I find: > 15/12/10 22:36:55 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 219.216.65.129): java.lang.Cl

Re: Help with type check

2015-11-30 Thread Jakob Odersky
Hi Eyal, what you're seeing is not a Spark issue, it is related to boxed types. I assume 'b' in your code is some kind of java buffer, where b.getDouble() returns an instance of java.lang.Double and not a scala.Double. Hence muCouch is an Array[java.lang.Double], an array containing boxed doubles
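A small plain-Scala sketch of the boxing difference and one way to unbox (the values are made up):

    // e.g. what a Java API might hand back
    val boxed: Array[java.lang.Double] = Array(java.lang.Double.valueOf(1.5), java.lang.Double.valueOf(2.5))
    // unbox to scala.Double so ordinary numeric code applies
    val unboxed: Array[Double] = boxed.map(_.doubleValue)
    println(unboxed.sum)   // 4.0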

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
to the RDD API for > constructing dataflows that are backed by catalyst logical plans > > So everything is still operating on RDDs but I anticipate most users will > eventually migrate to the higher level APIs for convenience and automatic > optimization > > On Mon, Nov 2

Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Hi everyone, I'm doing some reading-up on all the newer features of Spark such as DataFrames, DataSets and Project Tungsten. This got me a bit confused on the relation between all these concepts. When starting to learn Spark, I read a book and the original paper on RDDs; this led me to basically

Re: Blocked REPL commands

2015-11-19 Thread Jakob Odersky
; Jacek Laskowski | https://medium.com/@jaceklaskowski/ | > http://blog.jaceklaskowski.pl > Mastering Apache Spark > https://jaceklaskowski.gitbooks.io/mastering-apache-spark/ > Follow me at https://twitter.com/jaceklaskowski > Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski >

Blocked REPL commands

2015-11-19 Thread Jakob Odersky
I was just going through the spark shell code and saw this: private val blockedCommands = Set("implicits", "javap", "power", "type", "kind") What is the reason as to why these commands are blocked? thanks, --Jakob

Re: Status of 2.11 support?

2015-11-11 Thread Jakob Odersky
Hi Sukant, Regarding the first point: when building spark during my daily work, I always use Scala 2.11 and have only run into build problems once. Assuming a working build I have never had any issues with the resulting artifacts. More generally however, I would advise you to go with Scala 2.11 u
