RDD.cacheDataSet() not working intermittently

2017-05-08 Thread jasbir.sing
Hi, I have a scenario in which I am caching my RDDs for future use. But I observed that when I use my RDD, the complete DAG is re-executed and the RDD gets created again. How can I avoid this and make sure that RDD.cacheDataSet() caches the RDD every time? Regards, Jasbir Singh
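A note on the API, in case it helps: core Spark exposes RDD.cache()/RDD.persist() rather than a cacheDataSet() method, and caching is lazy, so nothing is stored until the first action runs after the cache() call. A minimal Scala sketch of that behavior (the input path and transformation are assumptions, not from the question):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("cache-demo").getOrCreate()
    val sc = spark.sparkContext

    // cache() only marks the RDD for reuse; no data is stored yet
    val rdd = sc.textFile("/data/input.txt").map(_.toUpperCase).cache()

    // The first action materializes the cached partitions...
    rdd.count()
    // ...and later actions reuse them instead of re-running the DAG, unless
    // executors were lost or partitions were evicted under memory pressure,
    // in which case Spark silently recomputes them.
    rdd.take(10)

    // MEMORY_AND_DISK spills evicted partitions to disk instead of
    // recomputing them, which can help with intermittent re-execution
    val durable = sc.textFile("/data/input.txt").persist(StorageLevel.MEMORY_AND_DISK)

Silent recomputation after eviction or executor loss is one common explanation for a cache that only seems to work some of the time.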

How to read large files from a directory?

2017-05-08 Thread ashwini anand
I am reading each file of a directory using wholeTextFiles. After that I am calling a function on each element of the rdd using map. The whole program uses just 50 lines of each file. The code is as below: def processFiles(fileNameContentsPair): fileName = fileNameContentsPair[0] result = "\n\n"+f
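For reference, a rough Scala equivalent of the pattern described (the 50-line figure comes from the question; the directory path and everything else are assumptions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("first-lines").getOrCreate()
    val sc = spark.sparkContext

    // wholeTextFiles yields one (fileName, fullContents) pair per file
    val files = sc.wholeTextFiles("/data/input-dir")

    // Keep only 50 lines of each file before any further processing
    val trimmed = files.map { case (fileName, contents) =>
      (fileName, contents.split("\n").take(50).mkString("\n"))
    }

    trimmed.collect().foreach { case (name, head) => println(s"$name\n$head") }

Note that wholeTextFiles still pulls each entire file into memory before the map runs, so for very large files a per-file streaming read that stops after the needed lines would be cheaper.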

Updating schemas

2017-05-08 Thread Jorge Magallón
Hello, I'm trying to update a Parquet schema. The only way that I know is to read the file, change the schema, and write the file again. The problem, I think, is that with large data this is too slow and inefficient. Is there another way to update the schema? Do you know of another solution? Thanks in advance.
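For what it's worth, the read-change-write approach described looks roughly like the sketch below (paths and the example cast are assumptions). Parquet files are immutable, so a rewrite is hard to avoid for changes readers cannot reconcile; for purely additive changes, writing only new files with the extra columns and reading with mergeSchema avoids rewriting old data:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("schema-update").getOrCreate()

    // Read-modify-write: adjust the schema, then write a new copy
    val df = spark.read.parquet("/data/old")
    val updated = df.withColumn("amount", col("amount").cast("double"))
    updated.write.parquet("/data/new")

    // Additive evolution: mergeSchema reconciles files whose schemas
    // differ only by added columns at read time, with no rewrite
    val merged = spark.read.option("mergeSchema", "true").parquet("/data/new")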

Re: Why does dataset.union fail but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Ok, great. Well, I haven't provided a good example of what I'm doing. Let's assume that my case class is case class A(tons of fields, with sub classes) val df = sqlContext.sql("select * from a").as[A] val df2 = spark.emptyDataset[A] df.union(df2) This code will throw the exception. Is this expec

Re: Why does dataset.union fail but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
Yes, unfortunately. This should actually be fixed, and the union's schema should have the less restrictive of the two DataFrames' schemas. On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote: > Hi Burak, > By nullability you mean that if I have exactly the same sche

Re: Why does dataset.union fail but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Hi Burak, by nullability you mean that if I have exactly the same schema, but one side supports null and the other doesn't, this exception (in the Dataset union) will be thrown? 2017-05-08 16:41 GMT-03:00 Burak Yavuz: > I also want to add that generally these may be caused by the `nullability`

Re: Why does dataset.union fail but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
I also want to add that generally these may be caused by the `nullability` field in the schema. On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu wrote: > This is because RDD.union doesn't check the schema, so you won't see the > problem unless you run the RDD and hit the incompatible column probl

Re: Why does dataset.union fail but dataset.rdd.union execute correctly?

2017-05-08 Thread Shixiong(Ryan) Zhu
This is because RDD.union doesn't check the schema, so you won't see the problem unless you run the RDD and hit the incompatible column problem. For RDD, you may not see any error if you don't use the incompatible column. Dataset.union requires a compatible schema. You can print ds.schema and ds1.schema
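To make that concrete, a small diagnostic sketch in the spirit of the suggestion (variable names follow the thread; the workaround at the end is an assumption, not something confirmed here):

    // Print both schemas and flag fields that differ in type or nullability
    ds.printSchema()
    ds1.printSchema()

    ds.schema.fields.zip(ds1.schema.fields).foreach { case (a, b) =>
      if (a.dataType != b.dataType || a.nullable != b.nullable)
        println(s"mismatch on ${a.name}: $a vs $b")
    }

    // One possible workaround: rebuild one side against the other side's
    // schema so the nullability flags line up (assumes A's encoder is in
    // scope via spark.implicits._ and that the field order matches)
    val aligned = spark.createDataFrame(ds1.toDF().rdd, ds.schema).as[A]
    ds.union(aligned)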

Re: Why does dataset.union fail but dataset.rdd.union execute correctly?

2017-05-08 Thread Bruce Packer
> On May 8, 2017, at 11:07 AM, Dirceu Semighini Filho wrote: > > Hello, > I have a very complex case class structure, with a lot of fields. > When I try to union two datasets of this class, it fails with the > following error: > ds.union(ds1) > Exception in thread "main" org.apache.spa

Why does dataset.union fail but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Hello, I have a very complex case class structure, with a lot of fields. When I try to union two datasets of this class, it fails with the following error: ds.union(ds1) Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatibl

Re: Spark Shell issue on HDInsight

2017-05-08 Thread Denny Lee
This appears to be an issue with the Spark to DocumentDB connector, specifically version 0.0.1. Could you run the 0.0.3 version of the jar and see if you're still getting the same error? i.e. spark-shell --master yarn --jars azure-documentdb-spark-0.0.3-SNAPSHOT.jar,azure-documentdb-1.10.0.ja

Re: Join streams Apache Spark

2017-05-08 Thread saulshanabrook
Actually, I just ran it in a Docker image. But the point is, it doesn't need to run in the JVM, because it just runs as a separate process. Then your Java (or any other client) code sends messages to it over TCP and it relays them to Spark. On Mon, May 8, 2017 at 4:07 AM tencas [via Apache Spark
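On the Spark side, receiving those TCP messages is straightforward with Spark Streaming; a minimal sketch of joining two socket streams, which is what this thread is about (hosts, ports, batch interval, and the comma-separated key format are all assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("tcp-stream-join")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each line arriving on a socket becomes one record in the stream
    val sensorsA = ssc.socketTextStream("localhost", 9999)
    val sensorsB = ssc.socketTextStream("localhost", 9998)

    // Key both streams on the first CSV field and join within each batch
    val keyedA = sensorsA.map(line => (line.split(",")(0), line))
    val keyedB = sensorsB.map(line => (line.split(",")(0), line))
    keyedA.join(keyedB).print()

    ssc.start()
    ssc.awaitTermination()

The external process plays the TCP server here: Spark connects to it and reads newline-delimited text.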

Spark Shell issue on HDInsight

2017-05-08 Thread ayan guha
Hi, I am facing an issue while trying to use the Azure DocumentDB connector from Microsoft (instructions are on GitHub). Error while trying to add the jar in spark-shell: spark-shell --jars azure-documentdb-spark

How to set up H2O Sparkling Water in a Jupyter notebook on a Windows machine

2017-05-08 Thread Zeming Yu
Hi, I'm a newbie, so please bear with me. I'm using a Windows 10 machine. I installed Spark here: C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7 I also installed H2O Sparkling Water here: C:\sparkling-water-2.1.1 I use this code in the command line to launch a Jupyter notebook for pysp
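The command is cut off, but the usual way to start pyspark inside Jupyter on Windows is via two driver environment variables; a hedged sketch (only the install paths above are from the message, the rest is the standard pattern):

    set SPARK_HOME=C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7
    set PYSPARK_DRIVER_PYTHON=jupyter
    set PYSPARK_DRIVER_PYTHON_OPTS=notebook

    rem pyspark now opens a Jupyter notebook instead of the plain shell
    %SPARK_HOME%\bin\pyspark

Sparkling Water's assembly jar (somewhere under C:\sparkling-water-2.1.1) can then be attached with --jars; the exact path depends on the distribution.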

Re: Join streams Apache Spark

2017-05-08 Thread tencas
Yep, I mean the first script you posted. So, you can compile it to Java binaries, for example? OK, I have no idea about Go.

hbase + spark + hdfs

2017-05-08 Thread mathieu ferlay
Hi everybody. I'm totally new to Spark and I want to know one thing that I have not managed to find out. I have a full Ambari install with HBase, Hadoop and Spark. My code reads and writes in HDFS via HBase. Thus, as I understand it, all data is stored in bytes format in HDFS. Now, I know that it's possible t
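As far as the bytes question goes, the observation is right: HBase stores raw byte arrays, and a Spark job reading through the standard InputFormat decodes them with the Bytes utility class. A hedged Scala sketch (table, column family, and qualifier names are assumptions):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // Each record is (rowKey, Result); every cell value is a byte array
    val raw = sc.newAPIHadoopRDD(hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Decode the bytes back into usable types with Bytes.to*
    val decoded = raw.map { case (key, result) =>
      val row = Bytes.toString(key.get())
      val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
      (row, value)
    }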

Spark 2.1.0 with Hive 2.1.1?

2017-05-08 Thread Lohith Samaga M
Hi, Good day. My setup:
1. Single-node Hadoop 2.7.3 on Ubuntu 16.04.
2. Hive 2.1.1 with metastore in MySQL.
3. Spark 2.1.0 configured using hive-site.xml to use the MySQL metastore.
4. The VERSION table contains SCHEMA_VERSION = 2.1.0 Hive
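One knob worth knowing here, offered as a pointer rather than a fix: Spark selects its metastore client via two settings, sketched below with illustrative values. Spark 2.1.0 ships a built-in Hive 1.2.1 client, and each release only accepts a documented range of metastore versions, so check whether the release in use can talk to a Hive 2.1.1 schema before pointing it there.

    # spark-defaults.conf -- values are illustrative assumptions
    spark.sql.hive.metastore.version   1.2.1
    spark.sql.hive.metastore.jars      builtin

A version mismatch between the client and a MySQL metastore created by Hive 2.1.1 (SCHEMA_VERSION = 2.1.0, as above) is a common source of startup errors in this kind of setup.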