This should work. Make sure that you use HiveContext.sql and sqlContext
correctly
This is an example in Spark: reading a CSV file, doing some manipulation,
creating a temp table, saving the data as an ORC file, adding another column, and
inserting values into a Hive table with default values for the new rows.
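A minimal sketch of that kind of flow, assuming Spark 1.6 with a HiveContext, the spark-csv package on the classpath, and made-up paths, column names and a pre-existing people_hive target table with one extra column:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("csv-to-orc"))
val hiveContext = new HiveContext(sc)

// Read the CSV (hypothetical path) and do a simple manipulation.
val df = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/input/people.csv")
  .withColumnRenamed("fname", "first_name")

// Register a temp table and also persist the data as ORC.
df.registerTempTable("people_temp")
df.write.format("orc").mode("overwrite").save("/tmp/output/people_orc")

// Insert into the Hive table, supplying a default value for the extra
// column that only the target table has.
hiveContext.sql(
  "INSERT INTO TABLE people_hive SELECT *, 'N/A' AS country FROM people_temp")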
I am new to Spark Streaming. When I tried to submit the Spark-Twitter streaming
job, I got the following error:
---
Lost task 0.0 in stage 0.0 (TID 0, sandbox.hortonworks.com): java.lang.NullPointerException
at org.apache.spark.util.Utils$.decodeFileNameInURI(Utils.scala:340)
at org.apache.spark.util
Hi spark-user,
I am using Spark 1.6 to build a reverse index for one month of Twitter data
(~50GB). The HDFS split size is 1GB, so by default sc.textFile creates
50 partitions. I'd like to increase the parallelism by increasing the number
of input partitions. Thus, I use textFile(..., 200) to yie
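For reference, a small sketch of the two usual knobs, assuming a placeholder HDFS path: the minPartitions argument is only a hint to the input format, while repartition forces the count at the price of a shuffle.

// Ask the input format for more splits up front...
val tweets = sc.textFile("hdfs:///data/twitter/2016-03/*", 200)
println(tweets.partitions.length)   // usually at least 200

// ...or reshuffle after loading if the split count cannot be raised.
val reshaped = sc.textFile("hdfs:///data/twitter/2016-03/*").repartition(200)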
Thanks Ted for the input. I was able to get it working with pyspark shell
but the same job submitted via 'spark-submit' using client or cluster
deploy mode ends up with these errors:
~
java.lang.OutOfMemoryError: Java heap space
at java.lang.Object.clone(Native Method)
at akka.util.CompactByte
Hi,
I am running Spark 1.6 on EMR. I have a workflow which does the following
things:
1. Read the 2 flat files, create the dataframes and join them.
2. Read the particular partition from the Hive table and join it with the
dataframe from step 1.
3. Finally, insert overwrite into the Hive table whi
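A rough sketch of those three steps, assuming Spark 1.6 with a HiveContext; the S3 paths, table names, join key, partition filter and the spark-csv package are all assumptions made for illustration:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: existing SparkContext

// 1. Read the two flat files as dataframes and join them.
val left = hiveContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("s3://bucket/flat_file_1.csv")
val right = hiveContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("s3://bucket/flat_file_2.csv")
val joined = left.join(right, "id")

// 2. Read one partition of the Hive table and join it with the result of step 1.
val partitionDF = hiveContext.sql(
  "SELECT * FROM mydb.events WHERE event_date = '2016-04-10'")
val enriched = joined.join(partitionDF, "id")

// 3. Insert overwrite into the target Hive table via a temp table.
enriched.registerTempTable("enriched_tmp")
hiveContext.sql("INSERT OVERWRITE TABLE mydb.events_final SELECT * FROM enriched_tmp")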
yes it is
On Apr 10, 2016 3:17 PM, "Amit Sela" wrote:
> I think *org.apache.spark.sql.expressions.Aggregator* is what I'm looking
> for, makes sense ?
>
> On Sun, Apr 10, 2016 at 4:08 PM Amit Sela wrote:
>
>> I'm mapping RDD API to Datasets API and I was wondering if I was missing
>> something o
Your solution works in Hive, but not in Spark, even if I use a HiveContext.
I tried to create a temp table and then this query:
- sqlContext.sql("insert into table myTable select * from myTable_temp")
But I still get the same error.
thanks
From: Mich Talebzadeh <mich.talebza...@gmail.com>
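For reference, this is roughly how the insert is usually wired up in Spark 1.6 through a HiveContext rather than a plain SQLContext; newRowsDF is only a stand-in for however the new rows are built:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: existing SparkContext

// Register the dataframe holding the new rows as a temp table...
newRowsDF.registerTempTable("myTable_temp")

// ...then run the insert against the Hive-backed target table.
hiveContext.sql("INSERT INTO TABLE myTable SELECT * FROM myTable_temp")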
Hi,
I am looking at how to add multiple folders to the Spark context and then turn
them into a single dataframe.
Let's say I have the folders below:
/daas/marts/US/file1.txt
/daas/marts/CH/file2.txt
/daas/marts/SG/file3.txt
The above files have the same schema. I don't want to create multiple dataframes;
instead create only
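A small sketch of one way to do that with a single load; the pipe delimiter, the two-column layout and the Mart case class are assumptions, and a glob over the country folders works as well:

import sqlContext.implicits._   // sqlContext: existing SQLContext

case class Mart(country: String, value: String)   // assumed two-column layout

// textFile accepts a comma-separated list of paths (or a glob such as
// "/daas/marts/*/*.txt"), so all three folders land in one dataframe.
val paths = Seq(
  "/daas/marts/US/file1.txt",
  "/daas/marts/CH/file2.txt",
  "/daas/marts/SG/file3.txt").mkString(",")

val df = sc.textFile(paths)
  .map(_.split("\\|"))
  .map(f => Mart(f(0), f(1)))
  .toDF()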
Hi,
I am confining myself to Hive tables. As I stated before, I have not tried it
in Spark, so I stand corrected.
Let us try this simple test in Hive
-- Create table
hive> create table testme(col1 int);
OK
-- insert a row
hive> insert into testme values(1);
Loading data to table test.testm
I think *org.apache.spark.sql.expressions.Aggregator* is what I'm looking
for, makes sense ?
On Sun, Apr 10, 2016 at 4:08 PM Amit Sela wrote:
> I'm mapping RDD API to Datasets API and I was wondering if I was missing
> something or if this functionality is missing.
>
>
> On Sun, Apr 10, 2016 at
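A toy sketch of how an Aggregator plugs into a Dataset in 1.6; the (String, Int) pair type and the sum are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.expressions.Aggregator

// Typed sum over the Int half of (String, Int) pairs, in the 1.6 Aggregator shape.
object SumValue extends Aggregator[(String, Int), Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: (String, Int)): Int = b + a._2
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(reduction: Int): Int = reduction
}

val sc = new SparkContext(
  new SparkConf().setAppName("aggregator-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 5)).toDS()
// groupBy on a function gives a GroupedDataset; agg takes the typed column.
ds.groupBy(_._1).agg(SumValue.toColumn).show()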
Hi,
So basically you are telling me that I need to recreate a table, and re-insert
everything every time I update a column?
I understand the constraints, but that solution doesn’t look good to me. I am
updating the schema every day and the table is a couple of TB of data.
Do you see any other op
I am not 100% sure, but you could export to CSV in Oracle using external tables.
Oracle also has the Hadoop Loader, which seems to support Avro. However, I
think you need to buy the Big Data solution.
> On 10 Apr 2016, at 16:12, Mich Talebzadeh wrote:
>
> Yes I meant MR.
>
> Again one cannot
I'm getting a StackOverflowError from inside the createDataFrame call in this
example. It originates in Scala code involving Java type inference which
calls itself in an infinite loop.
final EventParser parser = new EventParser();
JavaRDD eventRDD = sc.textFile(path)
.map(new Function(
Hello,
I want to know: when doing spark-submit, how is the application jar copied to
the worker machines? Who does the copying of the jars?
Similarly, who copies the DAG from the driver to the executors?
--
Regards
Hemalatha
Have you considered using PairRDDFunctions.aggregateByKey
or PairRDDFunctions.reduceByKey in place of the groupBy to achieve better
performance ?
Cheers
On Sat, Apr 9, 2016 at 2:00 PM, SURAJ SHETH wrote:
> Hi,
> I am using Spark 1.5.2
>
> The file contains 900K rows each with twelve fields (tab
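Something along those lines, as a sketch: assume the file is tab-delimited, keyed on the first field, with a numeric value in the last of the twelve fields (path and field positions are placeholders).

val pairs = sc.textFile("hdfs:///data/rows.tsv")
  .map(_.split("\t"))
  .map(f => (f(0), f(11).toDouble))

// groupByKey ships every value across the network before anything is combined:
// val grouped = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines on the map side first, which is usually much cheaper.
val sums = pairs.reduceByKey(_ + _)

// aggregateByKey when the result type differs from the value type, e.g. (sum, count).
val sumCount = pairs.aggregateByKey((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2))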
Jasmine:
Let us know if listening to more events would give you a better picture.
Thanks
On Thu, Apr 7, 2016 at 1:54 PM, Jasmine George wrote:
> Hi Ted,
>
>
>
> Thanks for replying so fast.
>
>
>
> We are using spark 1.5.2.
>
> I was collecting only TaskEnd Events.
>
> I can do the event wise summ
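In case a concrete starting point helps, a bare-bones listener that looks beyond task-end events might look like the sketch below; the metrics printed are only examples.

import org.apache.spark.scheduler._

class SummaryListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage ${taskEnd.stageId}: task ran ${m.executorRunTime} ms, " +
        s"GC ${m.jvmGCTime} ms")
    }
  }
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"stage ${stage.stageInfo.stageId} completed, ${stage.stageInfo.numTasks} tasks")
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
}

sc.addSparkListener(new SummaryListener)   // sc: existing SparkContext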
Hi Talebzadeh,
Thanks for your quick response.
>>in 1.6, how many executors do you see for each node?
I have 1 executor per node with SPARK_WORKER_INSTANCES=1.
>>in standalone mode how are you increasing the number of worker instances.
Are you starting another slave on each node?
No, I am not st
Yes I meant MR.
Again, one cannot beat the RDBMS export utility. I was specifically
referring to Oracle in the above case, which does not provide any specific
text-based export apart from the binary ones (Exp, Data Pump, etc.).
In the case of SAP ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy) that
can be p
Sqoop doesn’t use MapR… unless you meant to say M/R (Map Reduce)
The largest problem with sqoop is that in order to gain parallelism you need to
know how your underlying table is partitioned and to do multiple range queries.
This may not be known, or your data may or may not be equally distribu
I'm mapping RDD API to Datasets API and I was wondering if I was missing
something or if this functionality is missing.
On Sun, Apr 10, 2016 at 3:00 PM Ted Yu wrote:
> Haven't found any JIRA w.r.t. combineByKey for Dataset.
>
> What's your use case ?
>
> Thanks
>
> On Sat, Apr 9, 2016 at 7:38 PM
Could you follow this guide
http://spark.apache.org/docs/latest/running-on-yarn.html#configuration?
Thanks,
Yucai
-Original Message-
From: maheshmath [mailto:mahesh.m...@gmail.com]
Sent: Saturday, April 9, 2016 1:58 PM
To: user@spark.apache.org
Subject: Unable run Spark in YARN mode
I
Hello All,
I am looking for a use case where anyone has used Spark Streaming
integration with LinkedIn.
--
Thanks
Deepak
Haven't found any JIRA w.r.t. combineByKey for Dataset.
What's your use case ?
Thanks
On Sat, Apr 9, 2016 at 7:38 PM, Amit Sela wrote:
> Is there (planned ?) a combineByKey support for Dataset ?
> Is / Will there be a support for combiner lifting ?
>
> Thanks,
> Amit
>
Looks like the exception occurred on the driver.
Consider increasing the values for the following config:
conf.set("spark.driver.memory", "10240m")
conf.set("spark.driver.maxResultSize", "2g")
Cheers
On Sat, Apr 9, 2016 at 9:02 PM, Buntu Dev wrote:
> I'm running it via pyspark against yarn in cli
I have not tried it on Spark, but a column added in Hive to an existing
table cannot be updated for existing rows. In other words, the new column is
set to null, which does not require a change in the existing file length.
So basically, as I understand it, when a column is added to an already
existing table, the existing rows simply show null for it.
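A short illustration of that point, building on the test.testme table from the Hive transcript earlier in the digest (HiveContext assumed; the extra column name is arbitrary):

// hiveContext: existing HiveContext
// Add a column to the existing table; the files already on disk are not rewritten.
hiveContext.sql("ALTER TABLE test.testme ADD COLUMNS (col2 STRING)")

// Rows written before the ALTER come back with col2 = NULL; only rows written
// afterwards can carry a value for the new column.
hiveContext.sql("SELECT col1, col2 FROM test.testme").show()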
Hi,
in 1.6, how many executors do you see for each node?
in standalone mode how are you increasing the number of worker instances.
Are you starting another slave on each node?
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8P
Hi,
I have upgraded a 5-node Spark cluster from spark-1.5 to spark-1.6 (to use
the mapWithState function).
After moving to spark-1.6, I am seeing strange behaviour from Spark: jobs are
not using multiple executors on different nodes at a time, which means there is
no parallel processing if each node having single