Hi,
I am interested in building an application that uses sliding windows not
based on the time when the item was received, but on either
* a timestamp embedded in the data, or
* a count (like: every 10 items, look at the last 100 items).
Also, I want to do this on stream data received from Kafka,
Hello,
I've successfully built a very simple Spark Streaming application in Java
that is based on the HdfsCount example in Scala at
https://github.com/apache/spark/blob/branch-1.1/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala
.
When I submit this application to m
Hi,
I know there are not so many conferences on Spark in Paris, so I just
wanted to let you know that Ippon will be holding one on Thursday next
week (11th of December):
http://blog.ippon.fr/2014/12/03/ippevent-spark-ou-comment-traiter-des-donnees-a-la-vitesse-de-leclair/
There will be 3 talk
Hi,
There have been some efforts to provide column-level
encryption/decryption on Hive tables:
https://issues.apache.org/jira/browse/HIVE-7934
Is there any plan to extend this functionality to Spark SQL as well?
Thanks,
Chirag
Yeah, the dot notation works. It works even for arrays. But I am not sure
if it can handle complex hierarchies.
On Mon Dec 08 2014 at 11:55:19 AM Cheng Lian wrote:
> You may access it via something like SELECT filterIp.element FROM tb,
> just like Hive. Or if you’re using Spark SQL DSL, you can
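A minimal sketch of the dot-notation access Cheng describes, assuming sqlContext is a SQLContext; only the table name (tb) and the filterIp field come from this thread, the JSON layout is made up:

// Register a tiny nested document as table "tb"
val tb = sqlContext.jsonRDD(sc.parallelize(Seq(
  """{"filterIp": ["10.0.0.1", "10.0.0.2"], "meta": {"host": "a"}}""")))
tb.registerTempTable("tb")

// Dot notation reaches into nested structs; Cheng's filterIp.element form
// is the analogous access for the array column.
sqlContext.sql("SELECT meta.host FROM tb").collect()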
Hello all,
I am working on a graph problem using vanilla Spark (not GraphX) and at some
point I would like to do a
self join on an edges RDD[(srcID, dstID, w)] on the dst key, in order to get
all pairs of incoming edges.
Since this is the performance bottleneck for my code, I was wondering if
the
I am trying to create (yet another) Spark-as-a-service tool that lets you
submit jobs via REST APIs. I think I have nearly gotten it to work barring a
few issues, some of which seem already fixed in 1.2.0 (like SPARK-2889), but
I have hit a road block with the following issue.
I have created a sim
Spark can do efficient joins if both RDDs have the same partitioner, so in
the case of a self join I would recommend creating an RDD that has an
explicit partitioner and has been cached.
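A minimal sketch of what this looks like; edges: RDD[(srcID, dstID, w)] and numParts are assumptions based on the thread, not code from it:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

// Key the edge RDD by dstID, give it an explicit partitioner, and cache it.
val byDst = edges.map { case (src, dst, w) => (dst, (src, w)) }
  .partitionBy(new HashPartitioner(numParts))
  .cache()

// The self join now reuses the existing partitioning instead of reshuffling,
// giving all pairs of incoming edges per destination.
val incomingPairs = byDst.join(byDst)
  .filter { case (_, ((src1, _), (src2, _))) => src1 < src2 } // assumes numeric IDs; drops self-pairs/duplicates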
On Dec 8, 2014 8:52 AM, "Theodore Vasiloudis" <
theodoros.vasilou...@gmail.com> wrote:
> Hello all,
>
> I am working on
Hi Guys,
I used applySchema to store a set of nested dictionaries and lists in a
parquet file.
http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-td20228.html#a20461
It was successful and I could successf
I am relatively new to Spark. I am planning to use Spark Streaming for my
OLAP use case, but I would like to know how RDDs are shared between multiple
workers.
If I need to constantly compute some stats on the streaming data, presumably
shared state would have to be updated serially by different spar
Could you not use a groupByKey instead of the join? I mean something like
this:
val byDst = rdd.map { case (src, dst, w) => dst -> (src, w) }
byDst.groupByKey.map { case (dst, edges) =>
  for {
    (src1, w1) <- edges
    (src2, w2) <- edges
  } {
    ??? // Do something.
  }
  ??? // Return something.
}
As a temporary fix, it works when I convert field six to a list manually. That
is:
def generateRecords(line):
    # input : the row stored in parquet file
    # output : a python dictionary with all the key value pairs
    field1 = line.field1
    summary = {}
    summary['
Using groupByKey was our first approach, and, as noted in the docs, it is
highly inefficient due to the need to shuffle all the data. See
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
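For reference, the pattern the linked page describes, sketched on an assumed pairs: RDD[(String, Int)]:

import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

// groupByKey ships every value across the network before anything is combined...
val sums1 = pairs.groupByKey().mapValues(_.sum)
// ...while reduceByKey combines values within each partition first, shuffling far less.
val sums2 = pairs.reduceByKey(_ + _)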
On Mon, Dec 8, 2014 at 3:47 PM, Daniel D
I do not see how you hope to generate all incoming edge pairs without
repartitioning the data by dstID. You need to perform this shuffle for
joining too. Otherwise two incoming edges could be in separate partitions
and never meet. Am I missing something?
On Mon, Dec 8, 2014 at 3:53 PM, Theodore Va
Hi,
I'm confused about the Stage times reported on the Spark UI (Spark 1.1.0)
for a Spark Streaming job. I'm hoping somebody can shine some light on it.
Let's do this with an example:
On the /stages page, stage #232 is reported to have lasted 18 seconds:
232: runJob at RDDFunctions.scala:23
Hi Julius,
You can add those external jars to Spark when creating the SparkContext
(sc.addJar("/path/to/the/jar")). If you are submitting the job using
spark-submit, then you can use the --jars option and get those jars shipped.
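A minimal sketch of both options; the paths and app name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setJars(Seq("/path/to/external-dependency.jar")) // jars shipped with the application
val sc = new SparkContext(conf)

sc.addJar("/path/to/another.jar") // or add a jar after the context has been created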
Thanks
Best Regards
On Sun, Dec 7, 2014 at 11:05 PM, Julius K wrot
Check your worker logs for the exact reason; if you let the job run for 2
days, then most likely you ran out of disk space or something similar.
Looking at the worker logs will give you a clearer picture.
Thanks
Best Regards
On Mon, Dec 8, 2014 at 12:49 PM, Hafiz Mujadid
wrote:
> I am facing
You can set up and customize Nagios for all these requirements, or you can
use Ganglia if you are not looking for something with alerts (email etc.).
Thanks
Best Regards
On Mon, Dec 8, 2014 at 1:05 PM, Judy Nash
wrote:
> Hello,
>
>
>
> Are there ways we can programmatically get health status of m
How are you submitting the job? You need to create a jar of your code (sbt
package will give you one inside target/scala-*/projectname-*.jar) and then
use it while submitting. If you are not using spark-submit, then you can
simply add this jar to Spark by
sc.addJar("/path/to/target/scala*/projectnam
I went through complex hierarchical JSON structures, and Spark seems to fail
in querying them no matter what syntax is used.
Hope this helps,
Regards,
Alessandro
> On Dec 8, 2014, at 6:05 AM, Raghavendra Pandey
> wrote:
>
> Yeah, the dot notation works. It works even for arrays. But I am not
@Daniel
It's true that the first map in your code is needed, i.e. mapping so that
dstID is the new RDD key.
The self-join on the dstKey will then create all the pairs of incoming
edges (plus self-referential and duplicates that need to be filtered out).
@Koert
Are there any guidelines about setti
On Mon, Dec 8, 2014 at 5:26 PM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:
> @Daniel
> It's true that the first map in your code is needed, i.e. mapping so that
> dstID is the new RDD key.
>
You wrote groupByKey is "highly inefficient due to the need to shuffle all
the data", bu
My thesis is related to big data mining, and I have a cluster in the
laboratory of my university. My task is to install Apache Spark on it and
use it for extraction purposes. Is there any understandable guidance on how
to do this?
On a rough note:
Step 1: Install Hadoop 2.x on all the machines in the cluster.
Step 2: Check that the Hadoop cluster is working.
Step 3: Set up Apache Spark as described on the documentation page for the
cluster.
Check the status of the cluster on the master UI.
As it is a data mining project, configure Hive too.
Y
Hi,
I am intending to save the streaming data from Kafka into Cassandra
using spark-streaming, but there seems to be a problem with this line:
javaFunctions(data).writerBuilder("testkeyspace", "test_table",
mapToRow(TestTable.class)).saveToCassandra();
I am getting 2 errors.
the code, the error-log and
This is fixed in 1.2. Also, in 1.2+ you can call row.asDict() to
convert the Row object into a dict.
On Mon, Dec 8, 2014 at 6:38 AM, sahanbull wrote:
> Hi Guys,
>
> I used applySchema to store a set of nested dictionaries and lists in a
> parquet file.
>
> http://apache-spark-user-list.1001560.n3
@Daniel
Not an expert either, I'm just going by what I see performance-wise
currently. Our groupByKey implementation was more than an order of
magnitude slower than using the self join and then reduceByKey.
FTA:
*"pairs on the same machine with the same key are combined (by using the
lambda funct
Hi,
https://github.com/databricks/spark-perf/tree/master/streaming-tests/src/main/scala/streaming/perf
contains some performance tests for streaming. There are examples of how to
generate synthetic files during the test in that repo, maybe you
can find some code snippets that you can use there.
Hello Everyone,
I'm brand new to spark and was wondering if there's a JDBC driver to access
spark-SQL directly. I'm running spark in standalone mode and don't have
hadoop in this environment.
--
*Best Regards/أطيب المنى,*
*Anas Mosaad*
I have a function which generates a Java object, and I want to explore
failures which only happen when processing large numbers of these objects.
The real code is reading a many-gigabyte file, but in the test code I can
generate similar objects programmatically. I could create a small list,
paralleli
You can use the Thrift server for this purpose and then test it with Beeline.
See the doc:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
From: Anas Mosaad [mailto:anas.mos...@incorta.com]
Sent: Monday, December 8, 2014 11:01 AM
To: user@spark.apache.org
Subje
This is by hive's design. From the Hive documentation:
The column change command will only modify Hive's metadata, and will not
> modify data. Users should make sure the actual data layout of the
> table/partition conforms with the metadata definition.
On Sat, Dec 6, 2014 at 8:28 PM, Jianshi H
Update:
The issue in my previous post was solved:
I had to change the sbt file name from .sbt to build.sbt.
-
Thanks!
-Caron
Hey Tobias,
Can you try using the YARN Fair Scheduler and set
yarn.scheduler.fair.continuous-scheduling-enabled to true?
-Sandy
On Sun, Dec 7, 2014 at 5:39 PM, Tobias Pfeiffer wrote:
> Hi,
>
> thanks for your responses!
>
> On Sat, Dec 6, 2014 at 4:22 AM, Sandy Ryza
> wrote:
>>
>> What versio
Hello Jianshi,
You meant you want to convert a Map to a Struct, right? We can extract some
useful functions from JsonRDD.scala, so others can access them.
Thanks,
Yin
On Mon, Dec 8, 2014 at 1:29 AM, Jianshi Huang
wrote:
> I checked the source code for inferSchema. Looks like this is exactly w
Hi,
I think you have the right idea. I would not even worry about flatMap.
val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x =>
generateRandomObject(x))
Then when you try to evaluate something on this RDD, it will happen
partition-by-partition. So 1000 random objects will be generate
OK, I have waded into implementing this and have gotten pretty far, but am now
hitting something I don't understand: a NoSuchMethodError.
The code looks like
[...]
val conf = new SparkConf().setAppName(appName)
//conf.set("fs.default.name", "file://");
val sc = new SparkContext(c
Hi All,
I was able to run LinearRegressionWithSGD for a larger dataset (> 2 GB, sparse).
I have now filtered the data and I am running regression on a subset of it (~
200 MB). I see this error, which is strange since it was running fine with the
superset data. Is this a formatting issue
You just need to use the latest master code, without any configuration,
to get the performance improvement from my PR.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Mon, Dec 8, 2014 at 7:53 AM
Looks good, but how do I say that in Java?
As far as I can see, sc.parallelize (in Java) has only one implementation,
which takes a List, requiring an in-memory representation.
On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> Hi,
> I think you have the rig
You can call .schema on SchemaRDDs. For example:
results.schema.fields.map(_.name)
On Sun, Dec 7, 2014 at 11:36 PM, abhishek wrote:
> Hi,
>
> I have iplRDD, which is JSON, and I do the steps below and query through
> HiveContext. I get the results but without column headers. Is there a
> way
Hi,
I need to generate some flags based on certain columns and add them back to
the schemaRDD for further operations. Do I have to use a case class
(reflection or programmatically)? I am using parquet files, so the schema is
being automatically derived. This is a great feature; thanks to the Spark
developers,
Hi Jake,
The "toString" method should print the full model in versions 1.1.x.
The current master branch has a method "toDebugString" for
DecisionTreeModel which should print out all the node classes, and the
"toString" method has been updated to print the summary only, so there is a
slight change i
Hi,
Is there any way to auto-calculate the optimum learning rate or step size via
MLlib for SGD?
Thx
tri
At 2014-12-08 12:12:16 -0800, spr wrote:
> OK, have waded into implementing this and have gotten pretty far, but am now
> hitting something I don't understand, an NoSuchMethodError.
> [...]
> The (short) traceback looks like
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache
Hi Bui,
Please use BFGS-based solvers... For BFGS you don't have to specify a step
size, since the line search will find a sufficient decrease each time...
For regularization you still have to do a grid search... it's not possible to
automate that, but on master you will find nice ways to automate grid
search.
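For example, a minimal sketch with the L-BFGS-based classifier that ships with MLlib; the RDD[LabeledPoint] input is an assumption, and for plain regression you would wire the mllib.optimization.LBFGS optimizer in yourself:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// No stepSize parameter here: the line search inside L-BFGS picks the step each iteration.
def train(training: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .run(training)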
I am trying to create a CSV file. I have managed to create the actual string
I want to output to a file, but when I try to write the file, I get the
following error:
/saveAsTextFile is not a member of String/
My string is perfect; when I call this line to actually save the file, I get
the above e
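For reference, a minimal sketch of the usual workaround, assuming csvString is the string mentioned above: saveAsTextFile is defined on RDDs, not on String, so the string has to be wrapped in an RDD first (or written with plain java.io if HDFS is not needed):

// Wrap the single string in a one-partition RDD so saveAsTextFile is available.
// The output path is a placeholder.
sc.parallelize(Seq(csvString), numSlices = 1)
  .saveAsTextFile("/output/path/my-csv")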
Running JavaAPISuite (master branch) on Linux, I got:
testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 32.945 sec
<<< ERROR!
org.apache.spark.SparkException: Job aborted due to stage failure: Master
removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org
$apa
Yeah, spark has very little overhead per partition, so generally more
partitions is better.
On Mon, Dec 8, 2014 at 1:46 PM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:
> @Daniel
>
> Not an expert either, I'm just going by what I see performance-wise
> currently. Our groupByKey im
Hi all,
I am running into an out of memory error while running ALS using MLLIB on a
reasonably small data set consisting of around 6 Million ratings.
The stack trace is below:
java.lang.OutOfMemoryError: Java heap space
at org.jblas.DoubleMatrix.<init>(DoubleMatrix.java:323)
at org.jbl
Assume I don't care about values which may be created in a later map. In
Scala I can say
val rdd = sc.parallelize(1 to 10, numSlices = 1000)
but in Java, JavaSparkContext can only parallelize a List, which is limited to
Integer.MAX_VALUE elements and required to exist in memory. The best I can
do
Hi -
Does anybody have any ideas how to dynamically allocate cores instead of
statically partitioning them among multiple applications? Thanks.
Mohammed
From: Mohammed Guller
Sent: Friday, December 5, 2014 11:26 PM
To: user@spark.apache.org
Subject: Fair scheduling accross applications in stand
Hi,
I'm wondering whether there is an efficient way to continuously append
new data to a registered Spark SQL table.
This is what I want: I want to make an ad-hoc query service for a
JSON-formatted system log. Certainly, the system log is continuously generated.
I will use spark
I'm using CDH 5.1.0 and Spark 1.0.0, and I'd like to write out data
as Snappy-compressed files but encountered a problem.
My code is as follows:
val InputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/text/new.txt"
val OutputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/compressedText/"
val
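The snippet above is cut off; for reference, a minimal sketch of one way to request Snappy output from saveAsTextFile, assuming the Snappy native libraries are available on the nodes:

import org.apache.hadoop.io.compress.SnappyCodec

// Pass the codec class explicitly when writing out the text file.
sc.textFile(InputTextFilePath)
  .saveAsTextFile(OutputTextFilePath, classOf[SnappyCodec])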
Steve, something like this will do I think: sc.parallelize(1 to 1000,
1000).flatMap(x => 1 to 100000)
The above will launch 1000 tasks (maps), with each task creating 10^5
numbers (a total of 100 million elements).
On Mon, Dec 8, 2014 at 6:17 PM, Steve Lewis wrote:
> assume I don't care about
I have a question, as the title says. The question link is
http://stackoverflow.com/questions/27370170/query-classification-using-apache-spark-mlib
Thanks,
Jin
Pretty straightforward: Using Scala, I have an RDD that represents a table
with four columns. What is the recommended way to convert the entire RDD to
one JSON object?
Hi,
On Tue, Dec 9, 2014 at 4:39 AM, Sandy Ryza wrote:
>
> Can you try using the YARN Fair Scheduler and set
> yarn.scheduler.fair.continuous-scheduling-enabled to true?
>
I'm using Cloudera 5.2.0 and my configuration says
yarn.resourcemanager.scheduler.class =
org.apache.hadoop.yarn.server.re
If you are using spark SQL in 1.2, you can use toJson to convert a
SchemaRDD to an RDD[String] that contains one JSON object per string value.
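A minimal sketch, assuming the Spark 1.2 SchemaRDD method is spelled toJSON and using a made-up four-column table:

// Hypothetical four-column case class standing in for the real table.
case class Row4(a: String, b: String, c: String, d: Int)

val schemaRDD = sqlContext.createSchemaRDD(sc.parallelize(Seq(Row4("x", "y", "z", 1))))
val json = schemaRDD.toJSON          // RDD[String], one JSON object per element
json.saveAsTextFile("/output/json")  // placeholder path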
Thanks,
Yin
On Mon, Dec 8, 2014 at 11:52 PM, YaoPau wrote:
> Pretty straightforward: Using Scala, I have an RDD that represents a table
> with four col
Hi Huai,
Exactly, I'll probably implement one using the new data source API when I
have time... I've found the utility functions in JsonRDD.
Jianshi
On Tue, Dec 9, 2014 at 3:41 AM, Yin Huai wrote:
> Hello Jianshi,
>
> You meant you want to convert a Map to a Struct, right? We can extract
> som
Ah... I see. Thanks for pointing it out.
Then it means we cannot mount an external table using customized column names.
Hmm...
Then the only option left is to use a subquery to add a bunch of column
aliases. I'll try it later.
Thanks,
Jianshi
On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust
wrote:
You don't need to worry about locks as such, since one thread/worker is
responsible exclusively for one partition of the RDD. You can use the
Accumulator variables that Spark provides to get the state updates.
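A minimal sketch of that accumulator pattern; stream and the counted stat are assumptions:

// Workers add to the accumulator independently; only the driver reads the total.
val eventCount = sc.accumulator(0L, "events seen")

stream.foreachRDD { rdd =>
  rdd.foreach(_ => eventCount += 1)
  println(s"events so far: ${eventCount.value}") // .value is read on the driver
}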
On Mon Dec 08 2014 at 8:14:28 PM aditya.athalye
wrote:
> I am relatively new to Spark. I am pl
Hi Experts!
I want to save a DStream to HDFS only if it is not empty, i.e. it contains
some Kafka messages to be stored. What is an efficient way to do this?
var data = KafkaUtils.createStream[Array[Byte], Array[Byte],
DefaultDecoder, DefaultDecoder](ssc, params, topicMap,
Storag
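A minimal sketch of one common approach: check each batch RDD before writing, using take(1) to avoid a full count. The HDFS path is a placeholder, and you would normally map the raw Kafka bytes to something printable first:

data.foreachRDD { (rdd, time) =>
  // Only write batches that actually contain messages.
  if (rdd.take(1).nonEmpty) {
    rdd.saveAsTextFile(s"hdfs:///user/spark/kafka-batches/batch-${time.milliseconds}")
  }
}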
I am facing a somewhat confusing problem:
My spark app reads data from a database, calculates certain values and then
runs a shortest path Pregel operation on them. If I save the RDD to disk and
then read the information out again, my app runs between 30-50% faster than
keeping it in memory, plus
Hi all,
In my Spark application I load a CSV file and map the data to a Map
variable for later use on the driver node, then broadcast it. Everything
works fine until a java.io.FileNotFoundException occurs. The console log
information shows me the broadcast is unavailable. I googled this
You cannot pass the sc object (*val b = Utils.load(sc, ip_lib_path)*) inside
a map function, and that's why the serialization exception is popping up
(since sc is not serializable). You can try Tachyon's cache if you want to
persist the data in memory kind of forever.
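A minimal sketch of the pattern described here: do the load once on the driver and only ship the result, never sc itself. loadCsvAsMap is a hypothetical stand-in for Utils.load, and records is an assumed input RDD:

// Runs on the driver: build the lookup table once.
val lookup: Map[String, String] = loadCsvAsMap(ip_lib_path) // hypothetical helper, no sc needed
val lookupBc = sc.broadcast(lookup)

// Inside the map only the broadcast value is referenced, never the SparkContext.
val enriched = records.map { rec => (rec, lookupBc.value.getOrElse(rec, "unknown")) }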
Thanks
Best Regards
On Tue, De
Hi yuemeng,
Are you possibly running the Capacity Scheduler with the default resource
calculator?
-Sandy
On Sat, Dec 6, 2014 at 7:29 PM, yuemeng1 wrote:
> Hi, all
> When i running an app with this cmd: ./bin/spark-sql --master
> yarn-client --num-executors 2 --executor-cores 3, i noticed
Another thing to be aware of is that YARN will round up containers to the
nearest increment of yarn.scheduler.minimum-allocation-mb, which defaults
to 1024.
-Sandy
On Sat, Dec 6, 2014 at 3:48 PM, Denny Lee wrote:
> Got it - thanks!
>
> On Sat, Dec 6, 2014 at 14:56 Arun Ahuja wrote:
>
>> Hi Den
Hi All,
I am facing the following problem on Spark 1.2 RC1, where I get a TreeNode
exception (unresolved attributes):
https://issues.apache.org/jira/browse/SPARK-2063
To avoid this, I do something following :-
val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
existingSchemaRDD.schema)
It
I am hitting the same issue. Any solution?
On Wed, Nov 12, 2014 at 2:54 PM, tridib wrote:
> Hi Friends,
> I am trying to save a json file to parquet. I got error "Unsupported
> datatype TimestampType".
> Is not parquet support date? Which parquet version does spark uses? Is
> there
> any work around?
An RDD is just a wrapper around a Scala collection. Maybe you can use the
.collect() method to get the Scala collection type; you can then transform
it to a JSON object using Scala methods.
Hi all,
I was running HiveFromSpark on yarn-cluster. When I got the Hive select's
result SchemaRDD and tried to run `collect()` on it, the application got
stuck, and I don't know what's wrong with it. Here is my code:
val sqlStat = s"SELECT * FROM $TABLE_NAME"
val result = hiveContext.hql(sqlStat)