Re: Distinct on key-value pair of JavaRDD

2015-11-19 Thread Ramkumar V
I thought there would be some specific function for this, but I'm using reduceByKey now. It's working fine. Thanks a lot. *Thanks*, On Tue, Nov 17, 2015 at 6:21 PM, ayan guha wrote: > How about using reducebykey? > On 17 Nov 2015 22:00, "Ramkumar V" wrote: > >>
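The approach accepted above can be sketched as follows; this is a minimal illustration assuming the spark-shell's `sc` (the RDD name and contents are hypothetical):

```scala
// In the spark-shell, `sc` is the provided SparkContext.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 2)))

// reduceByKey keeps a single value per key; when duplicate keys carry
// equal values this yields the distinct key-value pairs.
val distinctPairs = pairs.reduceByKey((v1, _) => v1)
```

Note that `pairs.distinct()` also works directly on a pair RDD when full (key, value) distinctness is wanted rather than one value per key.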

Re: Calculating Timeseries Aggregation

2015-11-19 Thread Sanket Patil
Hey Sandip: TD has already outlined the right approach, but let me add a couple of thoughts as I recently worked on a similar project. I had to compute some real-time metrics on streaming data. Also, these metrics had to be aggregated for hour/day/week/month. My data pipeline was Kafka --> Spark S

[SPARK STREAMING] multiple hosts and multiple ports for Stream job

2015-11-19 Thread diplomatic Guru
Hello team, I was wondering whether it is a good idea to have multiple hosts and multiple ports for a Spark job. Let's say there are two hosts, each with 2 ports: is this a good idea? If this is not an issue, then what is the best way to do it? Currently, we pass it as an argument comma sep

[no subject]

2015-11-19 Thread aman solanki
Hi All, I want to know how one can get historical data of jobs, stages, tasks etc. of a running Spark application. Please share the information regarding the same. Thanks, Aman Solanki

Re:

2015-11-19 Thread Chintan Bhatt
yes you can get it by: https://data.gov.in/ On Thu, Nov 19, 2015 at 2:23 PM, aman solanki wrote: > Hi All, > > I want to know how one can get historical data of jobs,stages,tasks etc of > a running spark application. > > Please share the information regarding the same. > > Thanks, > Aman Solank

Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread D
I am trying to write a simple "Hello World" kind of application using spark streaming and RabbitMq, in which Apache Spark Streaming will read message from RabbitMq via the RabbitMqReceiver and print it in the console. But some how I am not able to

Re:

2015-11-19 Thread 金国栋
I don't really know what you mean by historical data, but if you have configured logs, you can always get historical data of jobs, stages, tasks etc., unless you delete them. Best, Jelly 2015-11-19 16:53 GMT+08:00 aman solanki : > Hi All, > > I want to know how one can get historical da

SparkSQL optimizations and Spark streaming

2015-11-19 Thread Sela, Amit
There is a lot of work done on SparkSQL and DataFrames which optimizes the execution, with some of it working on the data source – I.e., optimizing read from Parquet. I was wondering if using SparkSQL with streaming (in transform/foreachRDD) could benefit in optimization ? Although (currently)

Re: doubts on parquet

2015-11-19 Thread Cheng Lian
/cc Spark user list I'm confused here, you mentioned that you were writing Parquet files using MR jobs. What's the relation between that Parquet writing task and this JavaPairRDD one? Is it a separate problem? Spark supports dynamic partitioning (e.g. df.write.partitionBy("col1", "col2").for
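The dynamic-partitioning call mentioned above is truncated in the archive; a hedged completion (the output format and path are illustrative, not from the original message) would look like:

```scala
// Assuming `df` is an existing DataFrame with columns col1 and col2.
// partitionBy lays the output out as .../col1=<v1>/col2=<v2>/part-*
df.write
  .partitionBy("col1", "col2")
  .format("parquet")
  .save("hdfs:///tmp/partitioned_output") // illustrative path
```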

Re: Calculating Timeseries Aggregation

2015-11-19 Thread Sandip Mehta
Thank you Sanket for the feedback. Regards SM > On 19-Nov-2015, at 1:57 PM, Sanket Patil wrote: > > Hey Sandip: > > TD has already outlined the right approach, but let me add a couple of > thoughts as I recently worked on a similar project. I had to compute some > real-time metrics on streami

ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi, I'm trying to throw an exception of my own exception class (MyException extends SparkException) on one of the executors. This works fine on Spark 1.3.x and 1.4.x, but throws a deserialization/ClassNotFound exception on Spark 1.5.x. This happens only when I throw it on an executor; on the driver it succ

Re: Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-11-19 Thread Jeff Zhang
Seems your JDBC URL is not correct. It should be jdbc:mysql://192.168.41.229:3306 On Thu, Nov 19, 2015 at 6:03 PM, wrote: > hi guy, > >I also found --driver-class-path and spark.driver.extraClassPath > is not working when I'm accessing mysql driver in my spark APP. > >the attache
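For reference, a JDBC read with a well-formed URL looks like the following sketch (the database name, table, and driver choice are illustrative, not from the thread):

```scala
// Spark 1.4+ DataFrame JDBC source; note there is no space inside the URL.
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://192.168.41.229:3306/mydb", // mydb is illustrative
  "dbtable" -> "mytable",                               // illustrative
  "driver"  -> "com.mysql.jdbc.Driver"
)).load()
```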

Re: Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-11-19 Thread Zsolt Tóth
Hi, this is exactly the same as my issue, seems to be a bug in 1.5.x. (see my thread for details) 2015-11-19 11:20 GMT+01:00 Jeff Zhang : > Seems your jdbc url is not correct. Should be jdbc:mysql:// > 192.168.41.229:3306 > > On Thu, Nov 19, 2015 at 6:03 PM, wrote: > >> hi guy, >> >>I a

Why Spark Streaming keeps all batches in memory after processing?

2015-11-19 Thread Artem Moskvin
Hello there! I wonder why Spark Streaming keeps all processed batches in memory? It leads to out-of-memory errors on executors, but I really don't need the batches after processing. Can this be configured somewhere so that batches are not kept in memory after processing? Respectfully, Artem Moskvin

Re: ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Tamas Szuromi
Hi Zsolt, How do you load the jar, and how do you prepend it to the classpath? Tamas On 19 November 2015 at 11:02, Zsolt Tóth wrote: > Hi, > > I try to throw an exception of my own exception class (MyException extends > SparkException) on one of the executors. This works fine on Spark 1.3.x, > 1.4

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-19 Thread swetha kasireddy
That was actually an issue with our Mesos. On Wed, Nov 18, 2015 at 5:29 PM, Tathagata Das wrote: > If possible, could you give us the root cause and solution for future > readers of this thread. > > On Wed, Nov 18, 2015 at 6:37 AM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> It w

FastUtil DataStructures in Spark

2015-11-19 Thread swetha
Hi, Has anybody used FastUtil equivalent to HashSet for Strings in Spark? Any example would be of great help. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/FastUtil-DataStructures-in-Spark-tp25429.html Sent from the Apache Spark User List
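As a sketch of what the question is after, assuming the fastutil artifact is on the classpath: `ObjectOpenHashSet` is fastutil's hash set for reference types and can stand in for a `HashSet[String]`:

```scala
import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet

// A fastutil hash set of Strings; a lower-overhead alternative to
// java.util.HashSet. The contents here are illustrative.
val set = new ObjectOpenHashSet[String]()
set.add("spark")
set.add("streaming")
val present = set.contains("spark") // true
```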

Re: ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi Tamás, the exception class is in the application jar, I'm using the spark-submit script. 2015-11-19 11:54 GMT+01:00 Tamas Szuromi : > Hi Zsolt, > > How you load the jar and how you prepend it to the classpath? > > Tamas > > > > > On 19 November 2015 at 11:02, Zsolt Tóth wrote: > >> Hi, >> >>

Re: Issue while Spark Job fetching data from Cassandra DB

2015-11-19 Thread satish chandra j
HI Ted, CASSANDRA-7894 does have the same error details "UnauthorizedException: User has no SELECT permission on or any of its parents", and there the cause for the error is a Java Runtime Exception, but my stack trace has the cause as "UnauthorizedException: User has no SELECT permission on or any of its par

Error not found value sqlContext

2015-11-19 Thread satish chandra j
Hi All, we have recently migrated from Spark 1.2.1 to Spark 1.4.0. I am fetching data from an RDBMS using JDBCRDD and registering it as a temp table to perform SQL queries. The approach below works fine in Spark 1.2.1: JDBCRDD --> apply map using Case Class --> apply createSchemaRDD --> registerTempTabl

Moving avg in saprk streaming

2015-11-19 Thread anshu shukla
Is there a formal way to do a moving average over a fixed window duration? I calculated a simple moving average by creating a count stream and a sum stream, then joining them and finally computing the mean. This was not per time window, since the time periods were part of the tuples. -- Thanks & Regards, Anshu Shu
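The count-stream/sum-stream approach described above can be collapsed into one pass over a windowed DStream; a sketch assuming a `DStream[Double]` named `values` and illustrative window/slide durations:

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Pair each value with a count of 1, reduce (sum, count) over a sliding
// window, then divide to get the moving average per window.
def movingAverage(values: DStream[Double]): DStream[Double] =
  values.map(v => (v, 1L))
    .reduceByWindow(
      { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) },
      Seconds(60),  // window duration (illustrative)
      Seconds(10))  // slide duration (illustrative)
    .map { case (sum, count) => sum / count }
```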

Re: How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-19 Thread swetha kasireddy
OK. We have a long running streaming job. I was thinking that may be we should have a cron to clear files that are older than 2 days. What would be an appropriate way to do that? On Wed, Nov 18, 2015 at 7:43 PM, Ted Yu wrote: > Have you seen SPARK-5836 ? > Note TD's comment at the end. > > Cheer

RE: SequenceFile and object reuse

2015-11-19 Thread jeff saremi
Sandy, Ryan, Andrew, Thanks very much. I think I now understand it better. Jeff From: ryan.blake.willi...@gmail.com Date: Thu, 19 Nov 2015 06:00:30 + Subject: Re: SequenceFile and object reuse To: sandy.r...@cloudera.com; jeffsar...@hotmail.com CC: user@spark.apache.org Hey Jeff, in addition t

Re:

2015-11-19 Thread Dean Wampler
If you mean retaining data from past jobs, try running the history server, documented here: http://spark.apache.org/docs/latest/monitoring.html Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @

Re: spark with breeze error of NoClassDefFoundError

2015-11-19 Thread Ted Yu
I don't have Spark 1.4 source code on hand. You can use the following command: mvn dependency:tree to find out the answer to your question. Cheers On Wed, Nov 18, 2015 at 10:18 PM, Jack Yang wrote: > Back to my question. If I use “*provided*”, the jar file > will expect some libraries are p

Receiver stage fails but Spark application stands RUNNING

2015-11-19 Thread Pierre Van Ingelandt
Hi, I currently have two Spark Streaming applications (Spark 1.3.1), one using a custom JMS Receiver and the other one a Kafka Receiver. Most of the time when a job fails (i.e. something like "Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times"), the application gets either state

Spark streaming and custom partitioning

2015-11-19 Thread Sachin Mousli
Hi, I would like to implement a custom partitioning algorithm in a streaming environment, preferably in a generic manner using a single pass. The only sources I could find mention Apache Kafka, which adds complexity I would like to avoid, since there seems to be no documentation regarding ways to a

Re: Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread Daniel Carroza
Hi, Just answered on Github: https://github.com/Stratio/rabbitmq-receiver/issues/20 Regards Daniel Carroza Santana Vía de las Dos Castillas, 33, Ática 4, 3ª Planta. 28224 Pozuelo de Alarcón. Madrid. Tel: +34 91 828 64 73 // @stratiobd 2015-11-19 10:02 GMT+01:0

Spark 1.5.3 release

2015-11-19 Thread Madabhattula Rajesh Kumar
Hi, Please let me know the release date details for Spark 1.5.3. Regards, Rajesh

PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL transforms on a JSON data set that I load into a data frame. The data set is not large (~100GB) and most stages execute without any issues. However, some more complex stages tend to lose executors/nodes regularly. What wou

Re: Spark streaming and custom partitioning

2015-11-19 Thread Cody Koeninger
Not sure what you mean by "no documentation regarding ways to achieve effective communication between the 2", but the docs on integrating with kafka are at http://spark.apache.org/docs/latest/streaming-kafka-integration.html As far as custom partitioners go, Learning Spark from O'Reilly has a good
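For reference, a custom partitioner in Spark is just a subclass of `Partitioner`; a minimal sketch (the class name and partitioning rule are illustrative):

```scala
import org.apache.spark.Partitioner

// A simple hash-based partitioner; replace the rule in getPartition
// with whatever domain-specific placement is needed.
class CustomPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h // keep the result non-negative
  }
}

// Usage: pairRdd.partitionBy(new CustomPartitioner(8))
```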

Re: PySpark Lost Executors

2015-11-19 Thread Ted Yu
Do you have YARN log aggregation enabled ? You can try retrieving log for the container using the following command: yarn logs -applicationId application_1445957755572_0176 -containerId container_1445957755572_0176_01_03 Cheers On Thu, Nov 19, 2015 at 8:02 AM, wrote: > I am running Spark

Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Hmm, I guess I do not - I get 'application_1445957755572_0176 does not have any log files.' Where can I enable log aggregation? On Nov 19, 2015, at 11:07 AM, Ted Yu wrote: Do you have YARN log aggregation enabled ? You can try retrieving log for the container using t

create a table for csv files

2015-11-19 Thread xiaohe lan
Hi, I have some csv files in HDFS with headers like col1, col2, col3. I want to add a column named id to each record. How can I do this using Spark SQL? Can id be auto-increment? Thanks, Xiaohe

Want 1-1 map between input files and output files in map-only job

2015-11-19 Thread Arun Luthra
Hello, Is there some technique for guaranteeing that there is a 1-1 correspondence between the input files and the output files? For example if my input directory has files called input001.txt, input002.txt, ... etc. I would like Spark to generate output files named something like part-1, part

Re: Issue while Spark Job fetching data from Cassandra DB

2015-11-19 Thread Ted Yu
I am not Cassandra expert :-) Please consider consulting Cassandra mailing list. On Thu, Nov 19, 2015 at 4:03 AM, satish chandra j wrote: > HI Ted, > CASSANDRA-7894 does has the same error details "UnauthorizedException: > User has no SELECT permission on > or any of its parents" and cause fo

Re: PySpark Lost Executors

2015-11-19 Thread Sandy Ryza
Hi Ross, This is most likely occurring because YARN is killing containers for exceeding physical memory limits. You can make this less likely to happen by bumping spark.yarn.executor.memoryOverhead to something higher than 10% of your spark.executor.memory. -Sandy On Thu, Nov 19, 2015 at 8:14 A
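Sandy's suggestion translates into settings like the following (the values are illustrative, not a recommendation); they can go in spark-defaults.conf or be passed via --conf to spark-submit:

```properties
# memoryOverhead is in MB; the default is max(384, 10% of spark.executor.memory)
spark.executor.memory                 20g
spark.yarn.executor.memoryOverhead    4096
```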

Re: PySpark Lost Executors

2015-11-19 Thread Ted Yu
Here are the parameters related to log aggregation: yarn.log-aggregation-enable = true, yarn.log-aggregation.retain-seconds = 2592000, yarn.nodemanager.log-aggregation.compression-type = gz, yarn.nodemanager.log-aggregation.deb

what is algorithm to optimize function with nonlinear constraints

2015-11-19 Thread Zhiliang Zhu
Hi all, I have an optimization problem. I have googled a lot but still did not find the exact algorithm or third-party open package to apply to it. Its type is like this. Objective function: f(x1, x2, ..., xn) (n >= 100, and f may be linear or non-linear). Constraint functions: x1 + x2 + ... + x

Shuffle performance tuning. How to tune netty?

2015-11-19 Thread t3l
I am facing a very tricky issue here. I have a treeReduce task. The reduce-function returns a very large object. In fact it is a Map[Int, Array[Double]]. Each reduce task inserts and/or updates values into the map or updates the array. My problem is that this Map can become very large. Currently,

Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-19 Thread David Rosenstrauch
I ran into this recently. Turned out we had an old org-xerial-snappy.properties file in one of our conf directories that had the setting: # Disables loading Snappy-Java native library bundled in the # snappy-java-*.jar file forcing to load the Snappy-Java native # library from the java.library

Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Thank you Ted and Sandy for getting me pointed in the right direction. From the logs: WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 25.4 GB of 25.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. On Nov 19, 2015, at 12:20 PM, Ted

Re: Invocation of StreamingContext.stop() hangs in 1.5

2015-11-19 Thread jiten
Hi, Thanks to Ted Yu and Nilanjan. Stopping the streaming context asynchronously did the trick! Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Invocation-of-StreamingContext-stop-hangs-in-1-5-tp25402p25434.html Sent from the Apache Spark User L

Re: Reading from RabbitMq via Apache Spark Streaming

2015-11-19 Thread Sabarish Sasidharan
The stack trace is clear enough: Caused by: com.rabbitmq.client.ShutdownSignalException: channel error; protocol method: #method(reply-code=406, reply-text=PRECONDITION_FAILED - inequivalent arg 'durable' for queue 'hello1' in vhost '/': received 'true' but current is 'false', class-id=50, method-
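The error above means the queue 'hello1' already exists with durable=false while the receiver is re-declaring it with durable=true; the declaration flags must match the existing queue. A sketch with the plain RabbitMQ Java client (connection details are illustrative):

```scala
import com.rabbitmq.client.ConnectionFactory

val factory = new ConnectionFactory()
factory.setHost("localhost") // illustrative host
val conn = factory.newConnection()
val channel = conn.createChannel()

// durable must match how the queue was originally declared (false here);
// the other flags are exclusive, autoDelete, and extra arguments.
channel.queueDeclare("hello1", /* durable = */ false,
  /* exclusive = */ false, /* autoDelete = */ false, /* arguments = */ null)
```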

External Table not getting updated from parquet files written by spark streaming

2015-11-19 Thread Abhishek Anand
Hi, I am using Spark Streaming to write the aggregated output as parquet files to HDFS using SaveMode.Append. I have an external table created like: CREATE TABLE if not exists rolluptable USING org.apache.spark.sql.parquet OPTIONS ( path "hdfs:" ); I had the impression that in case o

Re: DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-19 Thread Eran Medan
You mean the top example, right? You are right, and I'm aware of this, as I stated :) But for the 2nd example (the initial JDBC load), I debugged it and apparently JDBCRDD does the select before the filter is pushed down. I mean the WHERE clause of the initial partition loading is not taking the fil

Blocked REPL commands

2015-11-19 Thread Jakob Odersky
I was just going through the spark shell code and saw this: private val blockedCommands = Set("implicits", "javap", "power", "type", "kind") What is the reason as to why these commands are blocked? thanks, --Jakob

Re: Error not found value sqlContext

2015-11-19 Thread Michael Armbrust
http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13 On Thu, Nov 19, 2015 at 4:19 AM, satish chandra j wrote: > HI All, > we have recently migrated from Spark 1.2.1 to Spark 1.4.0, I am fetching > data from an RDBMS using JDBCRDD and register it as
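For reference, the 1.3+ replacement for createSchemaRDD is the implicit toDF conversion from the SQLContext's implicits; a sketch (the case class, RDD name, and query are illustrative):

```scala
// Spark 1.3+: createSchemaRDD was removed in favor of rdd.toDF(),
// which is made available by importing the SQLContext's implicits.
case class Person(name: String, age: Int)

import sqlContext.implicits._ // sqlContext is the existing SQLContext

val df = rdd.map { case (n, a) => Person(n, a) }.toDF()
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
```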

Re: External Table not getting updated from parquet files written by spark streaming

2015-11-19 Thread Michael Armbrust
We cache metadata to speed up interactive querying. try REFRESH TABLE On Thu, Nov 19, 2015 at 11:18 AM, Abhishek Anand wrote: > Hi , > > I am using spark streaming to write the aggregated output as parquet files > to the hdfs using SaveMode.Append. I have an external table created like : > > >
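Applied to the table from the question, the suggestion is a one-liner (sketch):

```scala
// Invalidate Spark SQL's cached metadata so newly appended
// parquet files become visible to subsequent queries.
sqlContext.sql("REFRESH TABLE rolluptable")
```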

Re: Blocked REPL commands

2015-11-19 Thread Jacek Laskowski
Hi, Dunno the answer, but :reset should be blocked, too, for obvious reasons. ➜ spark git:(master) ✗ ./bin/spark-shell ... Welcome to [Spark ASCII-art banner] version 1.6.0-SNAPSHOT Using Scala v

spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Rachana Srivastava
Issue: I have a random forest model that I am trying to load during streaming using the following code. The code works fine when I run it from Eclipse, but I get an NPE when running it using spark-submit. JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.secon

Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Shuai Zheng
Hi All, I am facing a very weird case. I have already simplified the scenario as much as possible so everyone can replay it. My env: AWS EMR 4.1.0, Spark 1.5. My code runs without any problem in local mode, and it has no problem when it runs on an EMR cluster with one mas

Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-19 Thread Afshartous, Nick
Hi, On Spark 1.5 on EMR 4.1 the message below appears in stderr in the Yarn UI. ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. I do see that there is /usr/lib/spark/conf/log4j.properties Can someone please advise o

Drop multiple columns in the DataFrame API

2015-11-19 Thread Benjamin Fradet
Hi everyone, I was wondering if there is a better way to drop multiple columns from a DataFrame, or why there is no drop(cols: Column*) method in the DataFrame API. Indeed, I tend to write code like this: val filteredDF = df.drop("colA") .drop("colB") .drop("colC") // etc which is a bit
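One way to avoid the chain of drop calls is a foldLeft over the column names; a sketch assuming an existing DataFrame `df`:

```scala
val colsToDrop = Seq("colA", "colB", "colC")
// Each step drops one column from the accumulated DataFrame.
val filteredDF = colsToDrop.foldLeft(df)((acc, col) => acc.drop(col))
```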

Re: Blocked REPL commands

2015-11-19 Thread Jakob Odersky
That definitely looks like a bug; go ahead with filing an issue. I'll check the Scala REPL source code to see what, if any, other commands there are that should be disabled. On 19 November 2015 at 12:54, Jacek Laskowski wrote: > Hi, > > Dunno the answer, but :reset should be blocked, too, for obvi

Re: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-19 Thread Jonathan Kelly
This file only exists on the master and not the slave nodes, so you are probably running into https://issues.apache.org/jira/browse/SPARK-11105, which has already been fixed in the not-yet-released Spark 1.6.0. EMR will upgrade to Spark 1.6.0 once it is released. ~ Jonathan On Thu, Nov 19, 2015 a

Re: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Jonathan Kelly
I don't know if this actually has anything to do with why your job is hanging, but since you are using EMR you should probably not set those fs.s3 properties but rather let it use EMRFS, EMR's optimized Hadoop FileSystem implementation for interacting with S3. One benefit is that it will automatica

newbie: unable to use all my cores and memory

2015-11-19 Thread Andy Davidson
I am having a heck of a time figuring out how to utilize my cluster effectively. I am using the stand alone cluster manager. I have a master and 3 slaves. Each machine has 2 cores. I am trying to run a streaming app in cluster mode and pyspark at the same time. t1) On my console I see *

Re: create a table for csv files

2015-11-19 Thread Andrew Or
There's not an easy way. The closest thing you can do is: import org.apache.spark.sql.functions._ val df = ... df.withColumn("id", monotonicallyIncreasingId()) -Andrew 2015-11-19 8:23 GMT-08:00 xiaohe lan : > Hi, > > I have some csv file in HDFS with headers like col1, col2, col3, I want to >

Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi, Could you please submit this via JIRA as a bug report? It will be very helpful if you include the Spark version, system details, and other info too. Thanks! Joseph On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > *Issue:* > > I have a random

RE: spark with breeze error of NoClassDefFoundError

2015-11-19 Thread Jack Yang
Thanks for all the help so far. OK, so I got this from the mllib pom.xml: org.scalanlp breeze_${scala.binary.version} 0.11.2 (exclusions: junit junit, org.apache.commons commons-math3). Presumab

question about combining small input splits

2015-11-19 Thread Nezih Yigitbasi
Hey everyone, I have a Hive table that has a lot of small parquet files and I am creating a data frame out of it to do some processing, but since I have a large number of splits/files my job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive provid

Re: python.worker.memory parameter

2015-11-19 Thread Ted Yu
I found this parameter in python/pyspark/rdd.py partitionBy() has some explanation on how it tries to reduce amount of data transferred to Java. FYI On Tue, Oct 27, 2015 at 6:29 PM, Connor Zanin wrote: > Hi all, > > I am running a simple word count job on a cluster of 4 nodes (24 cores per > n

spark streaming problem saveAsTextFiles() does not write valid JSON to HDFS

2015-11-19 Thread Andy Davidson
I am working on a simple POS. I am running into a really strange problem. I wrote a Java streaming app that collects tweets using the spark twitter package and stores them to disk in JSON format. I noticed that when I run the code on my Mac, the files are written to the local file system as I expect

Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2 when compile spark-1.5.2

2015-11-19 Thread ck...@126.com
Hey everyone, when compiling spark-1.5.2 using IntelliJ IDEA there is an error like this: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-core_2.10: wrap: java.lang.ClassNotFoundException: xsbt.CompilerInterface: invalid

RE: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-19 Thread Afshartous, Nick
> log4j.properties file only exists on the master and not the slave nodes, so you are probably running into https://issues.apache.org/jira/browse/SPARK-11105, which has already been fixed in the not-yet-released Spark 1.6.0. EMR will upgrade to Spark 1.6.0 once it is released. Thanks for the

SparkR DataFrame , Out of memory exception for very small file.

2015-11-19 Thread vipulrai
Hi Users, I have a general doubt regarding DataFrames in SparkR. I am trying to read a file from Hive and it gets created as DataFrame. sqlContext <- sparkRHive.init(sc) #DF sales <- read.df(sqlContext, "hdfs://sample.csv", header ='true', source = "com.databricks.spark.csv",

Re: spark streaming problem saveAsTextFiles() does not write valid JSON to HDFS

2015-11-19 Thread Andy Davidson
Turns out the data is in Python format. The ETL pipeline was overwriting the original data. Andy From: Andrew Davidson Date: Thursday, November 19, 2015 at 6:58 PM To: "user @spark" Subject: spark streaming problem saveAsTextFiles() does not write valid JSON to HDFS > I am working on a simple POS. I a

has any spark write orc document

2015-11-19 Thread zhangjp
Hi, is there any Spark ORC write documentation, like the Parquet documentation? http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files Thanks

Re: has any spark write orc document

2015-11-19 Thread Jeff Zhang
It should be very similar to Parquet from the API perspective. Please refer to this doc: http://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/ On Fri, Nov 20, 2015 at 2:59 PM, zhangjp <592426...@qq.com> wrote: > Hi, > has any spark write orc document which like the parquet d
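In the DataFrame API the ORC calls mirror the Parquet ones; a sketch (paths are illustrative, and in Spark 1.5 ORC support requires a HiveContext):

```scala
// Write a DataFrame as ORC, then read it back.
df.write.format("orc").save("/tmp/orc_out")           // illustrative path
val back = sqlContext.read.format("orc").load("/tmp/orc_out")
```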

Re: has any spark write orc document

2015-11-19 Thread Fengdong Yu
You can use the DataFrame API: df.write.format("orc").save("...") > On Nov 20, 2015, at 2:59 PM, zhangjp <592426...@qq.com> wrote: > > Hi, > has any spark write orc document which like the parquet document. > http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files >

Re: has any spark write orc document

2015-11-19 Thread zhangjp
Thanks to Jeff Zhang and FengDong. When I use Hive I know how to set ORC table properties, but I don't know how to set orcfile properties when I write an orcfile using the Spark API, for example strip.size and index.rows. -- From: "Fengdong Yu" -- 2015