This should work. Make sure that you use HiveContext.sql and sqlContext
correctly
This is an example in Spark: reading a CSV file, doing some manipulation,
creating a temp table, saving the data as an ORC file, adding another column, and
inserting values into a Hive table with default values for the new rows.
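A minimal sketch of that kind of flow, assuming Spark 1.6 with a HiveContext, the spark-csv package on the classpath, and made-up paths, column names and a pre-existing people_hive target table with one extra column:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("csv-to-orc"))
val hiveContext = new HiveContext(sc)

// Read the CSV (hypothetical path) and do a simple manipulation.
val df = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/input/people.csv")
  .withColumnRenamed("fname", "first_name")

// Register a temp table and also persist the data as ORC.
df.registerTempTable("people_temp")
df.write.format("orc").mode("overwrite").save("/tmp/output/people_orc")

// Insert into the Hive table, supplying a default value for the extra
// column that only the target table has.
hiveContext.sql(
  "INSERT INTO TABLE people_hive SELECT *, 'N/A' AS country FROM people_temp")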
I am new to Spark Streaming. When I tried to submit the Spark-Twitter streaming
job, I got the following error:
---
Lost task 0.0 in stage 0.0 (TID 0, sandbox.hortonworks.com): java.lang.NullPointerException
at org.apache.spark.util.Utils$.decodeFileNameInURI(Utils.scala:340)
at org.apache.spark.util
Hi spark-user,
I am using Spark 1.6 to build a reverse index for one month of Twitter data
(~50GB). The HDFS split size is 1GB, so by default sc.textFile creates
50 partitions. I'd like to increase the parallelism by increasing the number
of input partitions. Thus, I use textFile(..., 200) to yie
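For reference, a small sketch of the two usual knobs, assuming a placeholder HDFS path: the minPartitions argument is only a hint to the input format, while repartition forces the count at the price of a shuffle.

// Ask the input format for more splits up front...
val tweets = sc.textFile("hdfs:///data/twitter/2016-03/*", 200)
println(tweets.partitions.length)   // usually at least 200

// ...or reshuffle after loading if the split count cannot be raised.
val reshaped = sc.textFile("hdfs:///data/twitter/2016-03/*").repartition(200)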
Thanks Ted for the input. I was able to get it working with pyspark shell
but the same job submitted via 'spark-submit' using client or cluster
deploy mode ends up with these errors:
~
java.lang.OutOfMemoryError: Java heap space
at java.lang.Object.clone(Native Method)
at akka.util.CompactByte
Hi,
I am running Spark 1.6 on EMR. I have a workflow which does the following
things:
1. Read the 2 flat files, create the dataframes and join them.
2. Read the particular partition from the Hive table and join it with the
dataframe from step 1.
3. Finally, insert overwrite into the Hive table whi
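A rough sketch of those three steps, assuming Spark 1.6 with a HiveContext; the S3 paths, table names, join key, partition filter and the spark-csv package are all assumptions made for illustration:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: existing SparkContext

// 1. Read the two flat files as dataframes and join them.
val left = hiveContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("s3://bucket/flat_file_1.csv")
val right = hiveContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("s3://bucket/flat_file_2.csv")
val joined = left.join(right, "id")

// 2. Read one partition of the Hive table and join it with the result of step 1.
val partitionDF = hiveContext.sql(
  "SELECT * FROM mydb.events WHERE event_date = '2016-04-10'")
val enriched = joined.join(partitionDF, "id")

// 3. Insert overwrite into the target Hive table via a temp table.
enriched.registerTempTable("enriched_tmp")
hiveContext.sql("INSERT OVERWRITE TABLE mydb.events_final SELECT * FROM enriched_tmp")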
yes it is
On Apr 10, 2016 3:17 PM, "Amit Sela" wrote:
> I think *org.apache.spark.sql.expressions.Aggregator* is what I'm looking
> for, makes sense ?
>
> On Sun, Apr 10, 2016 at 4:08 PM Amit Sela wrote:
>
>> I'm mapping RDD API to Datasets API and I was wondering if I was missing
>> something o
Your solution works in Hive, but not in Spark, even if I use a HiveContext.
I tried to create a temp table and then this query:
- sqlContext.sql("insert into table myTable select * from myTable_temp")
But I still get the same error.
thanks
From: Mich Talebzadeh <mich.talebza...@gmail.com>
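For reference, this is roughly how the insert is usually wired up in Spark 1.6 through a HiveContext rather than a plain SQLContext; newRowsDF is only a stand-in for however the new rows are built:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: existing SparkContext

// Register the dataframe holding the new rows as a temp table...
newRowsDF.registerTempTable("myTable_temp")

// ...then run the insert against the Hive-backed target table.
hiveContext.sql("INSERT INTO TABLE myTable SELECT * FROM myTable_temp")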
Hi,
I am looking at how to add multiple folders to the Spark context and then turn
them into a single dataframe.
Let's say I have the folders below:
/daas/marts/US/file1.txt
/daas/marts/CH/file2.txt
/daas/marts/SG/file3.txt
The above files have the same schema. I don't want to create multiple dataframes;
instead create only
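A small sketch of one way to do that with a single load; the pipe delimiter, the two-column layout and the Mart case class are assumptions, and a glob over the country folders works as well:

import sqlContext.implicits._   // sqlContext: existing SQLContext

case class Mart(country: String, value: String)   // assumed two-column layout

// textFile accepts a comma-separated list of paths (or a glob such as
// "/daas/marts/*/*.txt"), so all three folders land in one dataframe.
val paths = Seq(
  "/daas/marts/US/file1.txt",
  "/daas/marts/CH/file2.txt",
  "/daas/marts/SG/file3.txt").mkString(",")

val df = sc.textFile(paths)
  .map(_.split("\\|"))
  .map(f => Mart(f(0), f(1)))
  .toDF()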
Hi,
I am confining myself to Hive tables. As I stated before, I have not tried it
in Spark, so I stand corrected.
Let us try this simple test in Hive
-- Create table
hive> create table testme(col1 int);
OK
-- insert a row
hive> insert into testme values(1);
Loading data to table test.testm
I think *org.apache.spark.sql.expressions.Aggregator* is what I'm looking
for, makes sense ?
On Sun, Apr 10, 2016 at 4:08 PM Amit Sela wrote:
> I'm mapping RDD API to Datasets API and I was wondering if I was missing
> something or if this functionality is missing.
>
>
> On Sun, Apr 10, 2016 at
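A toy sketch of how an Aggregator plugs into a Dataset in 1.6; the (String, Int) pair type and the sum are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.expressions.Aggregator

// Typed sum over the Int half of (String, Int) pairs, in the 1.6 Aggregator shape.
object SumValue extends Aggregator[(String, Int), Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: (String, Int)): Int = b + a._2
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(reduction: Int): Int = reduction
}

val sc = new SparkContext(
  new SparkConf().setAppName("aggregator-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 5)).toDS()
// groupBy on a function gives a GroupedDataset; agg takes the typed column.
ds.groupBy(_._1).agg(SumValue.toColumn).show()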
Hi,
So basically you are telling me that I need to recreate a table, and re-insert
everything every time I update a column?
I understand the constraints, but that solution doesn’t look good to me. I am
updating the schema every day and the table is a couple of TB of data.
Do you see any other op
I am not 100% sure, but you could export to CSV in Oracle using external tables.
Oracle also has the Hadoop Loader, which seems to support Avro. However, I
think you need to buy the Big Data solution.
> On 10 Apr 2016, at 16:12, Mich Talebzadeh wrote:
>
> Yes I meant MR.
>
> Again one cannot
I'm getting a StackOverflowError from inside the createDataFrame call in this
example. It originates in Scala code involving Java type inference which
calls itself in an infinite loop.
final EventParser parser = new EventParser();
JavaRDD eventRDD = sc.textFile(path)
.map(new Function(
Hello,
I want to know: when doing spark-submit, how is the application jar copied to
the worker machines? Who does the copying of the jars?
Similarly, who copies the DAG from the driver to the executors?
--
Regards
Hemalatha
Have you considered using PairRDDFunctions.aggregateByKey
or PairRDDFunctions.reduceByKey in place of the groupBy to achieve better
performance ?
Cheers
On Sat, Apr 9, 2016 at 2:00 PM, SURAJ SHETH wrote:
> Hi,
> I am using Spark 1.5.2
>
> The file contains 900K rows each with twelve fields (tab
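Something along those lines, as a sketch: assume the file is tab-delimited, keyed on the first field, with a numeric value in the last of the twelve fields (path and field positions are placeholders).

val pairs = sc.textFile("hdfs:///data/rows.tsv")
  .map(_.split("\t"))
  .map(f => (f(0), f(11).toDouble))

// groupByKey ships every value across the network before anything is combined:
// val grouped = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines on the map side first, which is usually much cheaper.
val sums = pairs.reduceByKey(_ + _)

// aggregateByKey when the result type differs from the value type, e.g. (sum, count).
val sumCount = pairs.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2))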
Jasmine:
Let us know if listening to more events would give you a better picture.
Thanks
On Thu, Apr 7, 2016 at 1:54 PM, Jasmine George wrote:
> Hi Ted,
>
>
>
> Thanks for replying so fast.
>
>
>
> We are using spark 1.5.2.
>
> I was collecting only TaskEnd Events.
>
> I can do the event wise summ
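In case a concrete starting point helps, a bare-bones listener that looks beyond task-end events might look like the sketch below; the metrics printed are only examples.

import org.apache.spark.scheduler._

class SummaryListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage ${taskEnd.stageId}: task ran ${m.executorRunTime} ms, " +
        s"GC ${m.jvmGCTime} ms")
    }
  }
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"stage ${stage.stageInfo.stageId} completed, ${stage.stageInfo.numTasks} tasks")
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
}

sc.addSparkListener(new SummaryListener)   // sc: existing SparkContext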
Hi Talebzadeh,
Thanks for your quick response.
>>in 1.6, how many executors do you see for each node?
I have 1 executor per node with SPARK_WORKER_INSTANCES=1.
>>in standalone mode how are you increasing the number of worker instances.
Are you starting another slave on each node?
No, I am not st
Yes I meant MR.
Again, one cannot beat the RDBMS export utility. I was specifically
referring to Oracle in the above case, which does not provide any specific
text-based export apart from the binary ones (Exp, Data Pump, etc.).
In the case of SAP ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy) that
can be p
Sqoop doesn’t use MapR… unless you meant to say M/R (Map Reduce)
The largest problem with sqoop is that in order to gain parallelism you need to
know how your underlying table is partitioned and to do multiple range queries.
This may not be known, or your data may or may not be equally distribu
I'm mapping RDD API to Datasets API and I was wondering if I was missing
something or if this functionality is missing.
On Sun, Apr 10, 2016 at 3:00 PM Ted Yu wrote:
> Haven't found any JIRA w.r.t. combineByKey for Dataset.
>
> What's your use case ?
>
> Thanks
>
> On Sat, Apr 9, 2016 at 7:38 PM
Could you follow this guide
http://spark.apache.org/docs/latest/running-on-yarn.html#configuration?
Thanks,
Yucai
-Original Message-
From: maheshmath [mailto:mahesh.m...@gmail.com]
Sent: Saturday, April 9, 2016 1:58 PM
To: user@spark.apache.org
Subject: Unable run Spark in YARN mode
I
Hello All,
I am looking for a use case where anyone has used Spark Streaming
integration with LinkedIn.
--
Thanks
Deepak
Haven't found any JIRA w.r.t. combineByKey for Dataset.
What's your use case ?
Thanks
On Sat, Apr 9, 2016 at 7:38 PM, Amit Sela wrote:
> Is there (planned ?) a combineByKey support for Dataset ?
> Is / Will there be a support for combiner lifting ?
>
> Thanks,
> Amit
>
Looks like the exception occurred on the driver.
Consider increasing the values for the following config:
conf.set("spark.driver.memory", "10240m")
conf.set("spark.driver.maxResultSize", "2g")
Cheers
On Sat, Apr 9, 2016 at 9:02 PM, Buntu Dev wrote:
> I'm running it via pyspark against yarn in cli
I have not tried it on Spark, but a column added in Hive to an existing
table cannot be updated for existing rows. In other words, the new column is
set to null, which does not require a change in the existing file length.
So basically, as I understand it, when a column is added to an already
existing table, the existing rows simply show null for it.
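A short illustration of that point, building on the test.testme table from the Hive transcript earlier in the digest (HiveContext assumed; the extra column name is arbitrary):

// hiveContext: existing HiveContext
// Add a column to the existing table; the files already on disk are not rewritten.
hiveContext.sql("ALTER TABLE test.testme ADD COLUMNS (col2 STRING)")

// Rows written before the ALTER come back with col2 = NULL; only rows written
// afterwards can carry a value for the new column.
hiveContext.sql("SELECT col1, col2 FROM test.testme").show()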
Hi,
in 1.6, how many executors do you see for each node?
in standalone mode how are you increasing the number of worker instances.
Are you starting another slave on each node?
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8P
Hi,
I have upgraded a 5-node Spark cluster from spark-1.5 to spark-1.6 (to use
the mapWithState function).
After moving to spark-1.6, I am seeing strange behaviour from Spark: jobs are
not using multiple executors on different nodes at a time, which means there is
no parallel processing if each node having single