Thanks
From: Nick Pentreath [mailto:nick.pentre...@gmail.com]
Sent: Tuesday, April 07, 2015 5:52 PM
To: Puneet Kumar Ojha
Cc: user@spark.apache.org
Subject: Re: Difference between textFile Vs hadoopFile (textInoutFormat) on
HDFS data
There is no difference - textFile calls hadoopFile with a Text
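In Spark 1.x the two end up doing the same work; here is a rough sketch of the
equivalent user-side call (assuming sc is a SparkContext, e.g. from spark-shell,
and the path is just a placeholder):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// roughly what sc.textFile(path) does under the hood
val lines = sc.hadoopFile("hdfs:///some/path", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
  .map { case (_, text) => text.toString }  // drop the byte offset, keep the line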
Hi Cheng,
I tried both these patches, and it seems they still do not resolve my issue. I
found that most of the time is spent on this line in newParquet.scala:
ParquetFileReader.readAllFootersInParallel(
sparkContext.hadoopConfiguration, seqAsJavaList(leaves), taskSideMetaData)
which needs to read all the files.
I use EMR 3.3.1 which comes with Java 7. Do you think that this may cause
the issue? Did you test it with Java 8?
Hi All, how can I subscribe myself to this group so that every mail sent to
this group comes to me as well?
I already sent a request to user-subscr...@spark.apache.org, but still I am not
getting the mail sent to this group by other people.
Regards
Jeetendra
Check your spam folder or any filters.
On Wed, Apr 8, 2015 at 2:17 PM, Jeetendra Gangele
wrote:
> Hi All how can I subscribe myself in this group so that every mail sent to
> this group comes to me as well.
> I already sent request to user-subscr...@spark.apache.org ,still Iam not
> getting mail sent t
Hi folks,
I am writing to ask how to filter and partition a set of files through Spark.
The situation is that I have N big files (they cannot fit on a single machine),
and each line of the files starts with a category (say Sport, Food, etc.), while
there are actually fewer than 100 categories. I need a progr
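From the part of the question that made it into the archive, one possible
approach is the sketch below (it assumes comma-delimited lines that start with
the category; the paths and category names are placeholders, and sc is a
SparkContext):

val lines = sc.textFile("hdfs:///input/*")                 // the N big files
val byCategory = lines.map { line =>
  val category = line.takeWhile(_ != ',')                  // assumes "Category,rest..." lines
  (category, line)
}
val wanted = Set("Sport", "Food")                          // example filter
val filtered = byCategory.filter { case (cat, _) => wanted.contains(cat) }
// with fewer than 100 categories, hashing into 100 partitions groups all lines
// of a category into the same partition (hash collisions may share a partition)
val partitioned = filtered.partitionBy(new org.apache.spark.HashPartitioner(100))
partitioned.values.saveAsTextFile("hdfs:///output")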
I have a spark stage that has 8 tasks. 7/8 have completed. However 1 task
is failing with Cannot find address
Aggregated Metrics by Executor (columns: Executor ID, Address, Task Time,
Total Tasks, Failed Tasks, Succeeded Tasks, Shuffle Read Size / Records,
Shuffle Write Size / Records, Shuffle Spill (Memory), Shuffle Spill
Spark Version 1.3
Command:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-company/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1
This means the Spark workers exited with code "15"; it is probably nothing
YARN-related itself (unless there are classpath-related problems).
Have a look at the logs of the app/container via the resource manager. You can
also increase the time that logs get kept on the nodes themselves to something
Hi Michael,
In fact, I find that all workers hang while the SQL/DF join is running.
So I picked the master and one of the workers. jstack is the following:
Master
2015-04-08
If you are using a Spark Standalone deployment, make sure you set
WORKER_MEMORY to over 20G, and that you actually have 20G of physical memory.
Yong
> Date: Tue, 7 Apr 2015 20:58:42 -0700
> From: li...@adobe.com
> To: user@spark.apache.org
> Subject: EC2 spark-submit --executor-memory
>
> Dear Spark team,
>
>
To use the HiveThriftServer2.startWithContext, I thought one would use the
following artifact in the build:
"org.apache.spark"%% "spark-hive-thriftserver" % "1.3.0"
But I am unable to resolve the artifact. I do not see it in maven central
or any other repo. Do I need to build Spark and p
Hi guys
I’ve got:
180 days of log data in Parquet.
Each day is stored in a separate folder in S3.
Each day consists of 20-30 Parquet files of 256 MB each.
Spark 1.3 on Amazon EMR
This makes approximately 5000 Parquet files with a total size of 1.5 TB.
My code:
val in = sqlContext.parquetFile(“da
How do I build Spark SQL Avro Library for Spark 1.2 ?
I was following this https://github.com/databricks/spark-avro and was able
to build spark-avro_2.10-1.0.0.jar by simply running sbt/sbt package from
the project root.
But we are on Spark 1.2 and need a compatible spark-avro jar.
Any idea how do
Please email user-subscr...@spark.apache.org
> On Apr 8, 2015, at 6:28 AM, Idris Ali wrote:
>
>
Hi,
We have a cluster running CDH 5.3.2 and Spark 1.2 (which is the current
version in CDH 5.3.2), but we want to try Spark 1.3 without breaking the
existing setup. Is it possible to have Spark 1.3 on the existing setup?
Thanks
I am trying to run a Spark application using spark-submit on a cluster
using Cloudera manager. I get the error
"Exception in thread "main" java.io.IOException: Error in creating log
directory: file:/user/spark/applicationHistory//app-20150408094126-0008"
Adding the below lines in /etc/spark/conf/
You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1
Cheers
On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom <
eric.eijkelenb...@gmail.com> wrote:
> Hi guys
>
> *I’ve got:*
>
>- 180 days of log data in Parquet.
>- Each day is stored in a separate folder in S3.
>- Ea
Did anybody by any chance have a look at this bug? It keeps happening to
me, and it's quite blocking. I would like to understand if there's something
wrong in what I'm doing, or whether there's a workaround or not.
Thank you all,
--
Dott. Stefano Parmesan
Backend Web Developer and Data Lover ~
Yes, should be fine since you are running on YARN. This is probably more
appropriate for the cdh-user list.
On Apr 8, 2015 9:35 AM, "roy" wrote:
> Hi,
>
> We have cluster running on CDH 5.3.2 and Spark 1.2 (Which is current
> version in CDH5.3.2), But We want to try Spark 1.3 without breaking
>
It should be noted that I'm a newbie to Spark, so please have patience ...
I'm trying to convert an existing application over to Spark and am running
into some "high level" questions that I can't seem to resolve, possibly
because what I'm trying to do is not supported.
In a nutshell as I process t
+1
Interestingly, I ran into exactly the same issue yesterday. I couldn't
find any documentation about which project to include as a dependency in
build.sbt to use HiveThriftServer2. Would appreciate help.
Mohammed
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Wednesday, April 8, 2015
Hi folks, I am noticing a pesky and persistent warning in my logs (this is
from Spark 1.2.1):
15/04/08 15:23:05 WARN ShellBasedUnixGroupsMapping: got exception
trying to get groups for user anonymous
org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user
at org.apach
I will look into this today.
On Wed, Apr 8, 2015 at 7:35 AM, Stefano Parmesan wrote:
> Did anybody by any chance had a look at this bug? It keeps on happening to
> me, and it's quite blocking, I would like to understand if there's something
> wrong in what I'm doing, or whether there's a workarou
Hi All,
In some cases, I get the exception below when I run Spark in local mode (I
haven't seen this in a cluster). This is weird, but it also affects my local
unit test cases (it does not always happen, but usually once per 4-5 runs). From
the stack, it looks like the error happens when creating the context, bu
I am trying to start the worker by:
sbin/start-slave.sh spark://ip-10-241-251-232:7077
In the logs it's complaining about:
Master must be a URL of the form spark://hostname:port
I also have this in spark-defaults.conf
spark.master spark://ip-10-241-251-232:7077
Did I miss
"spark.eventLog.dir" should contain the full HDFS URL. In general,
this should be sufficient:
spark.eventLog.dir=hdfs:/user/spark/applicationHistory
On Wed, Apr 8, 2015 at 6:45 AM, Vijayasarathy Kannan wrote:
> I am trying to run a Spark application using spark-submit on a cluster using
> Cloud
There are a couple of options. Increase timeout (see Spark configuration).
Also see past mails in the mailing list.
Another option you may try (I have a gut feeling that it may work, but I am not
sure) is calling GC on the driver periodically. The cleaning up of stuff is
tied to GCing of RDD objects a
Hi,
Does SparkContext's textFile() method handle files with Unicode characters?
How about files in UTF-8 format?
Going further, is it possible to specify encodings to the method? If not,
what should one do if the files to be read are in some encoding?
Thanks,
arun
Some additional context:
Since I am using features of Spark 1.3.0, I have downloaded Spark 1.3.0 and
used spark-submit from there.
The cluster is still on Spark 1.2.0.
So it looks to me like, at runtime, the executors could not find some
libraries of Spark 1.3.0, even though I ran spark-subm
Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop pretty
much only supports Linux, UTF-8 is the only encoding supported, as it is the
one used on Linux.
If you have data in another encoding, you may want to vote for this
JIRA: https://issues.apache.org/jira/browse/MAPREDUCE-232
Yon
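One possible workaround (a sketch only, assuming the charset is known, the
records are newline-delimited, and sc is a SparkContext): read the raw Text
bytes via hadoopFile and decode them yourself:

import java.nio.charset.Charset
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val charset = Charset.forName("ISO-8859-1")                // example encoding
val decoded = sc.hadoopFile("hdfs:///path/to/files", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
  // Text.getBytes returns the backing array, so only read up to getLength
  .map { case (_, text) => new String(text.getBytes, 0, text.getLength, charset) }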
Hi all,
I am using Spark Streaming to monitor an S3 bucket for objects that contain
JSON. I want
to import that JSON into Spark SQL DataFrame.
Here's my current code:
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
import json
from pyspark.sql im
Thanks for the report. We improved the speed here in 1.3.1 so would be
interesting to know if this helps. You should also try disabling schema
merging if you do not need that feature (i.e. all of your files are the
same schema).
sqlContext.load("path", "parquet", Map("mergeSchema" -> "false"))
I think your thread dump for the master is actually just a thread dump for
SBT that is waiting on a forked driver program.
...
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x7fed624ff528> (a java.lang.UNIXProcess)
at java.lang.Obj
Hi Muhammad,
There are lots of ways to do it. My company actually develops a text
mining solution which embeds a very fast Approximate Neighbours solution
(a demo with real time queries on the wikipedia dataset can be seen at
wikinsights.org). For the record, we now prepare a dataset of 4.5
m
Sorry guys. I didn't realize that
https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.
You can publish locally in the mean time (sbt/sbt publishLocal).
On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller
wrote:
> +1
>
>
>
> Interestingly, I ran into the exactly the same issue yeste
Back to the user list so everyone can see the result of the discussion...
Ah. It all makes sense now. The issue is that when I created the parquet
> files, I included an unnecessary directory name (data.parquet) below the
> partition directories. It’s just a leftover from when I started with
> Mic
Michael,
Thank you!
Looks like the sbt build is broken for 1.3. I downloaded the source code for
1.3, but I get the following error a few minutes after I run “sbt/sbt
publishLocal”
[error] (network-shuffle/*:update) sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-network-co
I am seeing the following, is this because of my maven version?
15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
ip-10-241-251-232.us-west-2.compute.internal):
java.io.InvalidClassException: org.apache.spark.Aggregator; local class
incompatible: stream classdesc serialVers
Hi,
I have an RDD with objects containing Joda's LocalDate. When trying to save
the RDD as Parquet, I get an exception. Here is the code:
-
val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._
myRDD.s
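One possible workaround, sketched under the assumption that the RDD elements
look like a hypothetical case class with a Joda LocalDate field (the names
below are made up): convert the LocalDate to java.sql.Date, which Spark SQL
knows how to store, before creating the DataFrame.

import java.sql.Date
import org.joda.time.LocalDate

case class Event(id: String, date: LocalDate)      // hypothetical original element type
case class EventRow(id: String, date: Date)        // Parquet-friendly version

// assumes myRDD holds Event-like objects
val converted = myRDD.map { e =>
  EventRow(e.id, new Date(e.date.toDate.getTime))  // Joda LocalDate -> java.sql.Date
}
sqlC.createDataFrame(converted).saveAsParquetFile("myrdd.parquet")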
I am trying to unit test some code which takes an existing HiveContext and
uses it to execute a CREATE TABLE query (among other things). Unfortunately
I've run into some hurdles trying to unit test this, and I'm wondering if
anyone has a good approach.
The metastore DB is automatically created in
When I call transform or foreachRDD on a DStream, I keep getting an
error that I have an empty RDD, which makes sense since my batch interval
may be smaller than the rate at which new data is coming in. How do I guard
against it?
Thanks,
Vadim
What version of Java do you use to build ?
Cheers
On Wed, Apr 8, 2015 at 12:43 PM, Mohit Anchlia
wrote:
> I am seeing the following, is this because of my maven version?
>
> 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> ip-10-241-251-232.us-west-2.compute.internal)
Hi TD,
Thanks for the response. Since you mentioned GC, this got me thinking.
Given that we are running in local mode (all in a single JVM) for now, does
the option "spark.executor.extraJavaOptions" set to
"-XX:+UseConcMarkSweepGC" inside SparkConf object take effect at all before
we use it to cr
It does take effect on the executors, not on the driver. Which is okay,
because the executors have all the data and therefore have GC issues; not so,
usually, for the driver. If you want to be doubly sure, print the JVM flags
(e.g. http://stackoverflow.com/questions/10486375/print-all-jvm-flags)
However, th
Please take a look at
sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala :
protected def configure(): Unit = {
  warehousePath.delete()
  metastorePath.delete()
  setConf("javax.jdo.option.ConnectionURL",
    s"jdbc:derby:;databaseName=$metastorePath;create=true")
Hi Mohammed,
I think you just need to add -DskipTests to your build. Here is how I built
it:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
-DskipTests clean package install
build/sbt, however, fails even when only doing package, which should skip the
tests.
I am able to b
Which version of Joda are you using ?
Here is snippet of dependency:tree out w.r.t. Joda :
[INFO] +- org.apache.flume:flume-ng-core:jar:1.4.0:compile
...
[INFO] | +- joda-time:joda-time:jar:2.1:compile
FYI
On Wed, Apr 8, 2015 at 12:53 PM, Patrick Grandjean
wrote:
> Hi,
>
> I have an RDD with
Since we are running in local mode, won't all the executors be in the same
JVM as the driver?
Thanks
NB
On Wed, Apr 8, 2015 at 1:29 PM, Tathagata Das wrote:
> Its does take effect on the executors, not on the driver. Which is okay
> because executors have all the data and therefore have GC issu
I am loading some avro data into spark using the following code:
sqlContext.sql("CREATE TEMPORARY TABLE foo USING com.databricks.spark.avro
OPTIONS (path 'hdfs://*.avro')")
The avro data contains some binary fields that get translated to the
BinaryType data type. I am struggling with how to use
A more generic version of the question below:
Is it possible to append a column to an existing DataFrame at all? I understand
that this is not an easy task in the Spark environment, but is there any
workaround?
Hi,
Thanks a lot for such a detailed response.
On Wed, Apr 8, 2015 at 8:55 PM, Guillaume Pitel
wrote:
> Hi Muhammad,
>
> There are lots of ways to do it. My company actually develops a text
> mining solution which embeds a very fast Approximate Neighbours solution (a
> demo with real time quer
Hi Xiangrui,
I tried running this on my local machine (laptop) and got the same error:
Here is what I did:
1. downloaded the Spark 1.3.0 release version (prebuilt for Hadoop 2.4 and later)
"spark-1.3.0-bin-hadoop2.4.tgz".
2. Ran the following command:
spark-submit --class ALSNew --master local[8
Hi Bojan,
Could you please expand on your idea of how to append to an RDD? I can think of
how to append a constant value to each row of an RDD:
// oldRDD - RDD[Array[String]]
val c = "const"
val newRDD = oldRDD.map(r => c +: r)
But how do I append a custom column to the RDD? Something like:
val colToAppend = sc.ma
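Two sketches using the names from the question (oldRDD: RDD[Array[String]]):
if the column values live in a second RDD, RDD.zip can pair them up, but it
requires both RDDs to have the same number of partitions and the same number
of elements per partition; if the new column only depends on the position,
zipWithIndex avoids the second RDD entirely.

// colToAppend: RDD[String], built so it lines up element-for-element with oldRDD
val withCol = oldRDD.zip(colToAppend).map { case (row, c) => row :+ c }

// position-based column: zipWithIndex is 0-based
val withIndex = oldRDD.zipWithIndex().map { case (row, i) => row :+ i.toString }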
Hi all,
I figured it out! The DataFrames and SQL example in the Spark Streaming docs
was useful.
Best,
Vadim
On Wed, Apr 8, 2015 at 2:38 PM, Vadim Bichutskiy wrote:
> Hi all,
>
> I am using Spark Streaming to monitor an S3 bucket for objects that
> contain JSON. I want
> to import that JSON int
Hi,
If I perform a sortByKey(true, 2).saveAsTextFile("filename") on a cluster,
will the data be sorted per partition, or in total? (And is this
guaranteed?)
Example:
Input 4,2,3,6,5,7
Sorted per partition:
part-0: 2,3,7
part-1: 4,5,6
Sorted in total:
part-0: 2,3,4
part-1: 5,6,7
You could convert the DF to an RDD, then add the new column in a map phase or
in a join, and then convert back to a DF. I know this is not an elegant
solution, and maybe it is not a solution at all. :) But this is the first
thing that popped into my mind.
I am also new to the DF API.
Best
Bojan
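A rough sketch of that round trip in Spark 1.3 (df, the new column name and
the constant value are placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val withExtra = df.rdd.map(row => Row.fromSeq(row.toSeq :+ "const"))  // append a value to each row
val newSchema = StructType(df.schema.fields :+
  StructField("newCol", StringType, nullable = true))
val newDF = sqlContext.createDataFrame(withExtra, newSchema)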
On Apr 9, 2015 00:37, "olegsh
Thanks TD. I believe that might have been the issue. Will try for a few
days after passing in the GC option on the java command line when we start
the process.
Thanks for your timely help.
NB
On Wed, Apr 8, 2015 at 6:08 PM, Tathagata Das wrote:
> Yes, in local mode they the driver and executor
Yes, in local mode the driver and executor will be the same process, and in
that case the Java options in the SparkConf configuration will not work.
On Wed, Apr 8, 2015 at 1:44 PM, N B wrote:
> Since we are running in local mode, won't all the executors be in the same
> JVM as the driver?
>
>
See the scaladoc from OrderedRDDFunctions.scala :
* Sort the RDD by key, so that each partition contains a sorted range of
the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an
ordered list of records
* (in the `save` case, they will be written to multi
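A small local sketch with the numbers from the question (assuming a
spark-shell sc) that illustrates the "sorted in total" layout:

val sorted = sc.parallelize(Seq(4, 2, 3, 6, 5, 7)).map(x => (x, x)).sortByKey(true, 2)
sorted.keys.glom().collect()
// expected to look roughly like Array(Array(2, 3, 4), Array(5, 6, 7)):
// each partition holds a sorted, non-overlapping range, so saveAsTextFile
// writes part files that together form a totally ordered output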
bq. one is Oracle and the other is OpenJDK
I don't have experience with mixed JDKs.
Can you try using a single JDK?
Cheers
On Wed, Apr 8, 2015 at 3:26 PM, Mohit Anchlia
wrote:
> For the build I am using java version "1.7.0_65" which seems to be the
> same as the one on the spark host. How
Hi Eric - Would you mind trying either disabling schema merging, as Michael
suggested, or disabling the new Parquet data source with
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
Cheng
On 4/9/15 2:43 AM, Michael Armbrust wrote:
Thanks for the report. We improved the spee
Hey Patrick, Michael and Todd,
Thank you for your help!
As you guys recommended, I did a local install and got my code to compile.
As an FYI, on my local machine the sbt build fails even if I add -DskipTests.
So I used mvn.
Mohammed
From: Patrick Wendell [mailto:patr...@databricks.com]
Sent:
On 4/9/15 3:09 AM, Michael Armbrust wrote:
Back to the user list so everyone can see the result of the discussion...
Ah. It all makes sense now. The issue is that when I created the
parquet files, I included an unnecessary directory name
(data.parquet) below the partition directori
Aah yes. The jsonRDD method needs to walk through the whole RDD to
understand the schema, and does not work if there is no data in it. Checking
that there is data in it using take(1) should work.
TD
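In Scala the guard could look like this sketch (dstream and sqlContext stand
in for the stream and SQLContext in your application):

dstream.foreachRDD { rdd =>
  if (rdd.take(1).nonEmpty) {          // skip empty batches
    val df = sqlContext.jsonRDD(rdd)   // rdd: RDD[String] of JSON documents
    // ... work with df
  }
}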
It's because your tests are running in parallel and you can only have one
context running at a time.
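If the tests run under sbt, one common way to serialize them (a sketch; your
build may differ) is to set this in build.sbt:

// run test suites one at a time, so only one SparkContext exists at once
parallelExecution in Test := false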
I wanted to run groupBy(partition) but this is not working.
Here the first part (key) of pairvendorData will be repeated for multiple
second parts (values). Both are objects; do I need to override equals and
hashCode? Is groupBy fast enough?
JavaPairRDD pairvendorData
= matchRdd.flatMapToPair(new PairFlatMapFunc
Thanks!
arun
On Wed, Apr 8, 2015 at 10:51 AM, java8964 wrote:
> Spark use the Hadoop TextInputFormat to read the file. Since Hadoop is
> almost only supporting Linux, so UTF-8 is the only encoding supported, as
> it is the the one on Linux.
>
> If you have other encoding data, you may want to v
Thanks TD!
> On Apr 8, 2015, at 9:36 PM, Tathagata Das wrote:
>
> Aah yes. The jsonRDD method needs to walk through the whole RDD to understand
> the schema, and does not work if there is no data in it. Checking that there
> is data in it using take(1) should work.
>
> TD
--
We noticed similar perf degradation using Parquet (outside of Spark), and it
happened due to merging of multiple schemas. It would be good to know if
disabling schema merging (if the schema is the same), as Michael suggested,
helps in your case.
On Wed, Apr 8, 2015 at 11:43 AM, Michael Armbrust
wrote:
>
Please take a look at zipWithIndex() of RDD.
Cheers
On Wed, Apr 8, 2015 at 3:40 PM, Jeetendra Gangele
wrote:
> Hi All, I have an RDD of SomeObject and I want to convert it to an RDD of
> (SomeObject, sequence number) pairs, where the sequence number is 1 for the
> first SomeObject, 2 for the second SomeObject, and so on.
>
>
> Regards
> jeet
>
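A sketch of what that looks like (rdd stands for the questioner's RDD of
SomeObject):

// zipWithIndex is 0-based, so add 1 to get 1, 2, 3, ...
val numbered = rdd.zipWithIndex().map { case (obj, i) => (obj, i + 1) }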
The Thrift server doesn't support authentication or Hadoop doAs yet, so
you can simply ignore this warning.
To avoid this, when connecting via JDBC you may specify the user to the
same user who starts the Thrift server process. For Beeline, use "-n
".
On 4/8/15 11:49 PM, Yana Kadiyska wrote:
What is the computation you are doing in the foreachRDD, that is throwing
the exception?
One way to guard against it is to do a take(1) to see whether you get back any
data. If there is none, then don't do anything with the RDD.
TD
On Wed, Apr 8, 2015 at 1:08 PM, Vadim Bichutskiy wrote:
> When I call *
Hi All, I have an RDD of SomeObject and I want to convert it to an RDD of
(SomeObject, sequence number) pairs, where the sequence number is 1 for the
first SomeObject, 2 for the second SomeObject, and so on.
Regards
jeet
For the build I am using java version "1.7.0_65" which seems to be the same
as the one on the spark host. However one is Oracle and the other is
OpenJDK. Does that make any difference?
On Wed, Apr 8, 2015 at 1:24 PM, Ted Yu wrote:
> What version of Java do you use to build ?
>
> Cheers
>
> On We