Re: Questions about count() performance with dataframes and parquet files

2020-02-17 Thread Enrico Minack
It is not a question of very large or small; it is about how large your cluster is w.r.t. your data. Caching is only useful if you have the respective memory available across your executors. Otherwise you could either materialize the DataFrame on HDFS (e.g. as parquet or via checkpoint) or indeed have to do t

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Mich Talebzadeh
I stripped everything from the jar list. This is all I have: spark-shell --jars shc-core-1.1.1-2.1-s_2.11.jar, \ json4s-native_2.11-3.5.3.jar, \ json4s-jackson_2.11-3.5.3.jar, \ hbase-client-1.2.3.jar, \ hbase-common-1.2.3.jar Now I still ge

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Mich Talebzadeh

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Jörn Franke
Is there a reason why different Scala versions (it seems at least 2.10/2.11) are mixed? This never works. Do you by accident include a dependency with an old Scala version, i.e. the HBase datasource maybe? > On 17.02.2020 at 22:15, Mich Talebzadeh wrote: > >  > Thanks Muthu, > > > I am u

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Muthu Jayakumar
Hello Mich, Thank you for the mail. From what I can understand from the json4s history, Spark and the versions you have... 1. Apache Spark 2.4.3 uses json4s 3.5.3 (to be specific, it uses json4s-jackson) 2. json4s 3.2.11 and 3.2.10 are not compatible (ref: https://github.com/json4s/json4s/issues/212) 3

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Mich Talebzadeh
Thanks Muthu, I am using the following jar files for now in local mode, i.e. spark-shell_local --jars ….. json4s-jackson_2.10-3.2.10.jar json4s_2.11-3.2.11.jar json4s-native_2.10-3.4.0.jar Which one is the incorrect one, please? Regards, Mich

Re: Questions about count() performance with dataframes and parquet files

2020-02-17 Thread Nicolas PARIS
> .dropDuplicates() \ > .cache() > Since df_actions is cached, you can count inserts and updates quickly > with only that one join on df_actions: Hi Enrico. I am wondering if this is OK for very large tables? Is caching faster than recomputing both inserts/updates? Thanks. Enrico Minack writes

Re: Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Muthu Jayakumar
I suspect the Spark job somehow has an incorrect (newer) version of json4s in the classpath. json4s 3.5.3 is the highest version that can be used. Thanks, Muthu On Mon, Feb 17, 2020, 06:43 Mich Talebzadeh wrote: > Hi, > > Spark version 2.4.3 > Hbase 1.2.7 > > Data is stored in Hbase as Jso

Spark reading from Hbase throws java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods

2020-02-17 Thread Mich Talebzadeh
Hi, Spark version 2.4.3, HBase 1.2.7. Data is stored in HBase as JSON; an example of a row is shown below. [image: image.png] I am trying to read this table in Spark Scala: import org.apache.spark.sql.{SQLContext, _} import org.apache.spark.sql.execution.datasources.hbase._ import org.apache.spark.{SparkC

Re: Apache Arrow support for Apache Spark

2020-02-17 Thread Chris Teoh
1. I'd also consider how you're structuring the data before applying the join; naively doing the join could be expensive, so a bit of data preparation may be necessary to improve join performance. Try to get a baseline as well. Arrow would help improve it. 2. Try storing it back as Parquet bu

[ML] [How-to]: How to unload the loaded W2V model in Pyspark?

2020-02-17 Thread Zhefu PENG
Hi all, I'm using PySpark and Spark ML to train and use a Word2Vec model. Here is the logic of my program: model = Word2VecModel.load("save path") result_list = model.findSynonymsArray(target, top_N) Then I use GraphFrames and result_list to create a graph and do some computing. However the pro