Hi All,
I was trying out Spark + the Salesforce library on Cloudera 5.9 and I am hitting an SSL
handshake issue. Please check out my question on Stack Overflow; no one has
answered it yet.
The library works fine on Windows but fails when I try to run it on the Cloudera edge
node.
https://stackoverflow.com/questions/47820372/ssl
Hi All,
I managed to write the business requirement in Spark SQL and Hive, but I am still
learning Scala. How would the SQL below be written using Spark RDDs rather than
DataFrames?
SELECT DATE,balance,
SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW) daily_balance
FROM table
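The thread doesn't include an RDD answer, so here is a minimal sketch of one way to do it with plain RDDs (dates, values and the app name are made up): sort by date, pre-compute per-partition sums, then carry a running offset into each partition.

import org.apache.spark.{SparkConf, SparkContext}

object RunningBalance {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RunningBalance").setMaster("local[*]"))
    val rows = sc.parallelize(Seq(("2016-01-01", 100.0), ("2016-01-02", 50.0), ("2016-01-03", -30.0)))

    val sorted = rows.sortByKey()                                        // ORDER BY DATE
    // Sum each partition, then compute the offset contributed by all earlier partitions.
    val partSums = sorted
      .mapPartitionsWithIndex((i, it) => Iterator((i, it.map(_._2).sum)))
      .collect().sortBy(_._1).map(_._2)
    val offsets = sc.broadcast(partSums.scanLeft(0.0)(_ + _))            // offsets(i) = sum of partitions before i

    val dailyBalance = sorted.mapPartitionsWithIndex { (i, it) =>
      var acc = offsets.value(i)
      it.map { case (date, bal) => acc += bal; (date, bal, acc) }        // (DATE, balance, daily_balance)
    }
    dailyBalance.collect().foreach(println)
    sc.stop()
  }
}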
Hi All,
Can I set spark.local.dir to an HDFS location instead of the /tmp folder?
I tried pointing the temp directory at HDFS but it didn't work. Can
spark.local.dir write to HDFS?
.set("spark.local.dir","hdfs://namednode/spark_tmp/")
16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local d
I found the JIRA for this issue; will there be a fix in the future, or is none planned?
https://issues.apache.org/jira/browse/SPARK-6221
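For reference, SPARK-6221 tracks exactly this: spark.local.dir has to be local disk, not HDFS. A minimal sketch of the intended usage (the paths are illustrative):

import org.apache.spark.SparkConf

// spark.local.dir must be a local filesystem path that exists on every node;
// list several local disks comma-separated to spread the shuffle/spill files.
val conf = new SparkConf()
  .setAppName("LocalDirExample")
  .set("spark.local.dir", "/data1/spark_tmp,/data2/spark_tmp")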
Hi Neelesh,
As I said in my earlier emails, it's not a Spark/Scala application; I am working in
plain Spark SQL.
I am launching the spark-sql shell and running my Hive code inside that shell.
The spark-sql shell only accepts functions that belong to Spark SQL; it doesn't accept
functions like coalesce, which is
Hi All,
I am running Hive SQL through spark-sql in YARN client mode; the SQL is pretty simple,
just loading dynamic partitions into a target Parquet table.
I used Hive configuration parameters such as (set
hive.merge.smallfiles.avgsize=25600; set
hive.merge.size.per.task=256000;), which usually merge small files
Try using the coalesce function to repartition to the desired number of output
files. To merge files that have already been written, use Hive and INSERT OVERWRITE
TABLE with the options below.
set hive.merge.smallfiles.avgsize=256;
set hive.merge.size.per.task=256;
set
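The settings above are cut off; here is a minimal sketch of the coalesce approach in spark-shell (table name, path and the target file count are illustrative, and sqlContext is the shell's predefined context):

// Collapse the job's output into roughly 10 files before writing the Parquet data.
val df = sqlContext.table("staging_db.staging_table")
df.coalesce(10)
  .write
  .mode("overwrite")
  .parquet("/apps/hive/warehouse/target_table")

// Files that are already on disk can then be merged on the Hive side with the
// hive.merge.* settings above plus an INSERT OVERWRITE TABLE ... SELECT * FROM the same table.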
I am not sure, but you can use the coalesce function to reduce the number of output
files.
Thanks
Sri
Hi All,
I have worked with Spark installed on a Hadoop cluster but never with Spark on a
standalone cluster.
My question is how to set the number of partitions in Spark when it's running on a
standalone cluster.
With Spark on Hadoop I calculate my formula using HDFS block sizes, but how do I
calculate it with
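The message is cut off here; for what it's worth, a minimal sketch of the usual standalone rule of thumb (core count, master URL and path are made up): derive partition counts from total executor cores, roughly 2-3 tasks per core, instead of from HDFS block sizes.

import org.apache.spark.{SparkConf, SparkContext}

object StandalonePartitions {
  def main(args: Array[String]): Unit = {
    val totalCores = 48                                               // assumed: workers * cores per worker
    val conf = new SparkConf()
      .setAppName("StandalonePartitions")
      .setMaster("spark://master-host:7077")                          // hypothetical standalone master
      .set("spark.default.parallelism", (totalCores * 2).toString)    // ~2 tasks per core
    val sc = new SparkContext(conf)

    // Or pass an explicit partition count where the data is read:
    val rdd = sc.textFile("hdfs:///data/input", totalCores * 2)
    println(rdd.partitions.length)
    sc.stop()
  }
}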
Hi All,
I have used broadcast joins in Spark/Scala applications, along with partitionBy
(HashPartitioner) and then persist for wide dependencies. The project I am working on
now is pretty much a Hive migration to Spark SQL, which is essentially SQL, to be
honest, with no Scala or Python apps.
My question
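The question is cut off here; for reference, a hedged sketch of how that tuning usually carries over to pure Spark SQL (the values are illustrative; run SET directly in the spark-sql shell or via sqlContext.sql):

// Tables whose known size is below this threshold (bytes) are broadcast automatically,
// the SQL-side counterpart of a hand-coded broadcast join.
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")   // ~100 MB, illustrative

// The shuffle partition count plays the role that partitionBy/HashPartitioner tuning
// played in the RDD code, for SQL joins and aggregations.
sqlContext.sql("SET spark.sql.shuffle.partitions=400")                  // illustrative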
Hi ,
I am using the spark-sql shell; while launching it I run it as spark-sql
--conf spark.driver.maxResultSize=20g
I also tried spark-sql --conf "spark.driver.maxResults"="20g" but still no
luck. Do I need to use the set command, something like
spark-sql --conf set "spark.driver.maxReults"="20g" ?
Hi All ,
I am getting a Spark driver memory issue even after overriding the conf with
--conf spark.driver.maxResultSize=20g, and I also set it in my SQL
script (set spark.driver.maxResultSize=16;), but the same error keeps
happening.
Job aborted due to stage failure: Total size of serialized
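A hedged note on what usually trips this up (values illustrative): the property name is spark.driver.maxResultSize, it needs a size suffix (a bare 16 means 16 bytes), and being a driver setting it should be supplied when the driver is launched rather than with SET inside the SQL script.

import org.apache.spark.SparkConf

// Equivalent to launching with: spark-sql --conf spark.driver.maxResultSize=20g
val conf = new SparkConf()
  .setAppName("MaxResultSizeExample")
  .set("spark.driver.maxResultSize", "20g")   // "0" disables the limit entirely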
Hi All ,
Is there a way to ask Spark and spark-sql to use Hive 0.14 instead
of the built-in Hive 1.2.1?
I am testing spark-sql locally with a Spark 1.6 download. I want to execute my Hive
queries in Spark SQL using Hive version 0.14; can I
go back to the previous version just for a s
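If the goal is just to talk to a Hive 0.14 metastore from Spark 1.6, a minimal sketch (the jar classpath is illustrative); note that the execution-side Hive that Spark embeds stays at 1.2.1:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("Hive014Metastore")
  .set("spark.sql.hive.metastore.version", "0.14.0")
  .set("spark.sql.hive.metastore.jars", "/opt/hive-0.14.0/lib/*:/opt/hadoop/lib/*")   // hypothetical classpath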
Hi All,
I just realized the Cloudera version of Spark on my cluster is 1.2, while the jar
I built with Maven is against 1.6, which is causing the issue.
Is there a way to run Spark 1.6 code on a 1.2 cluster?
Thanks
Sri
Fixed by creating a new Netezza dialect and registering it via the
JdbcDialects.registerDialect(NetezzaDialect) method
(spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala).
package com.citi.ocean.spark.elt
/**
* Created by st84879 on 26/01/2016.
*/
import ja
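The code above is cut off; a minimal sketch of the dialect described here (the VARCHAR length and URL prefix are illustrative, not the original code):

import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Map Spark's StringType to VARCHAR instead of the default TEXT, which Netezza rejects.
case object NetezzaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:netezza")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(255)", Types.VARCHAR))
    case _          => None
  }
}

Register it once on the driver with org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(NetezzaDialect) before calling df.write.jdbc(...).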
Hi All,
I am using the Spark JDBC DataFrame writer to store data into Netezza. Spark is
trying to create the table using the TEXT data type for string columns, but Netezza
doesn't support the TEXT data type.
How do I override Spark's behaviour so it uses VARCHAR instead of TEXT?
val sourcedfmode = sourcedf.persist(Stora
Hi All,
It worked OK after adding the options below to the VM options.
-Xms128m -Xmx512m -XX:MaxPermSize=300m -ea
Thanks
Sri
Hi All,
I am running my app in IntelliJ IDEA (locally) with master local[*]. The code
worked fine with Spark 1.5, but after upgrading to 1.6 I am having the issue below.
Is this a bug in 1.6? When I change back to 1.5 it works without any error.
Do I need to pass executor memory while running in local in
Hi All,
Does anyone know the pass percentage for the Apache Spark certification exam?
Thanks
Sri
Hi All,
Imagine I have production Spark Streaming Kafka (direct connection)
subscriber and publisher jobs running, which publish data to and receive data
from a Kafka topic, and I save one day's worth of data using
dstream.slice to a daily Cassandra table (so I create the daily table before
runnin
Hi All,
https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L48
In the present Spark version there is a bug at line 48: checking whether a table
exists in a database using LIMIT doesn't work for all databases, e.g. SQL Server
Hi All,
Does anyone know the exact release date for Spark 1.6?
Thanks
Sri
Hi Prateek,
Do you mean writing Spark output to any storage system? Yes, you can.
Thanks
Sri
Hi Spark Contributors,
I am trying to append data to a target table using the df.write.mode("append")
functionality, but Spark throws a "table already exists" exception.
Is a fix scheduled for a later Spark release? I am using Spark 1.5.
val sourcedfmode=sourcedf.write.mode("append")
sourcedfmod
Hi All,
I get the same error in Spark 1.5; is there any way to work around
it? I also tried sourcedf.write.mode("append") but still no luck.
val sourcedfmode=sourcedf.write.mode("append")
sourcedfmode.jdbc(TargetDBinfo.url,TargetDBinfo.table,targetprops)
Thanks
Sri
Hi All,
I have implemented random numbers using Scala/Spark; is there a function
equivalent to row_number in Spark SQL?
For example:
SQL Server: row_number()
Netezza: sequence number
MySQL: sequence number
Example:
val testsql=sqlContext.sql("select column1,column2,column3,column4,colum
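For reference, a minimal sketch (table and column names are illustrative): with a HiveContext, Spark SQL 1.5+ supports row_number() as a window function directly.

val ranked = sqlContext.sql(
  """SELECT column1, column2, column3,
    |       row_number() OVER (PARTITION BY column1 ORDER BY column2) AS row_num
    |FROM TestTable""".stripMargin)
ranked.show()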
I found a way out.
import java.text.SimpleDateFormat
import java.util.Date;
val format = new SimpleDateFormat("-M-dd hh:mm:ss")
val testsql=sqlContext.sql("select column1,column2,column3,column4,column5
,'%s' as TIME_STAMP from TestTable limit 10".format(format.format(new
Date(
Thanks
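For reference, Spark 1.5 and later ship a built-in for this, so the SimpleDateFormat workaround is only needed on older versions; a minimal sketch (table and column names are illustrative):

val withTs = sqlContext.sql(
  "SELECT column1, column2, current_timestamp() AS TIME_STAMP FROM TestTable LIMIT 10")
withTs.show()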
Hi All,
Is there a Spark SQL function which returns the current timestamp?
Example:-
In Impala:- select NOW();
SQL Server:- select GETDATE();
Netezza:- select NOW();
Thanks
Sri
Hi All,
I rewrote my code to use sqlContext.read.jdbc, which lets me specify
upperBound, lowerBound, numPartitions, etc., and which might run in parallel;
I need to try it on a cluster, which I will do when I have time.
But please confirm: does read.jdbc do parallel reads?
Spark code:-
package com.kali
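To illustrate the parallel variant (connection details, bounds and table name are made up): the read.jdbc overload with columnName/lowerBound/upperBound/numPartitions issues one JDBC query per partition, so the partitions are read in parallel across the executors.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")            // hypothetical credentials
props.setProperty("password", "secret")

val df = sqlContext.read.jdbc(
  url = "jdbc:sqlserver://dbhost:1433;databaseName=sales",   // hypothetical URL
  table = "dbo.transactions",
  columnName = "transaction_id",               // numeric column used to split the range
  lowerBound = 1L,
  upperBound = 10000000L,
  numPartitions = 8,                           // 8 concurrent queries / tasks
  connectionProperties = props)

println(df.rdd.partitions.length)              // 8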
Hi all,
I wrote the Spark code below to extract data from SQL Server using Spark
SQLContext.read.format with several different options. My question: does the
sqlContext.read load function run in parallel by default, and does it use all the
available cores? When I am saving the output to a file it is
g
Hi All,
If write-ahead logs are enabled in Spark Streaming, does all the received
data get written to the HDFS path, or does it only write the metadata?
How does cleanup work? Does the HDFS path keep getting bigger every day,
and do I need to write a cleanup job to delete data from the write-ahead logs?
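To the best of my understanding (not confirmed in this thread): with the receiver write-ahead log enabled, the received blocks themselves, not just metadata, are written under the streaming checkpoint directory, and Spark removes old log segments automatically once the batches that reference them have completed, so no separate cleanup job should be needed. A minimal configuration sketch (paths illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WalExample")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // log received data, not just metadata

val ssc = new StreamingContext(conf, Seconds(30))
ssc.checkpoint("hdfs:///checkpoints/wal-example")                 // WAL segments live under this path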
Hi All,
In SQL, say for example I have table1 (movieid) and table2 (movieid, moviename).
In SQL we write something like select moviename, movieid, count(1) from
table2 inner join table1 on table1.movieid = table2.movieid group by
...; here table1 has only one column whereas table2 has tw
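The message is cut off, but assuming the question is how to express that join plus group-by on RDDs, a minimal sketch with made-up data (sc as in spark-shell):

val table1 = sc.parallelize(Seq(1, 2, 2, 3)).map(movieid => (movieid, 1))        // (movieid, 1)
val table2 = sc.parallelize(Seq((1, "Heat"), (2, "Up"), (3, "Jaws")))             // (movieid, moviename)

val counts = table1.join(table2)                                   // (movieid, (1, moviename))
  .map { case (movieid, (one, moviename)) => ((movieid, moviename), one) }
  .reduceByKey(_ + _)                                               // count(1) per movie

counts.collect().foreach(println)   // e.g. ((2,Up),2), ((1,Heat),1), ((3,Jaws),1)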
Full error:
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:104)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:83
Hi All,
I got this weird error when I tried to run Spark in yarn-cluster mode. I have
33 files and I am looping over them in bash, running Spark one file at a time;
most of them worked fine except a few.
Is the error below an HDFS error or a Spark error?
Exception in thread "Driver" java.lang.IllegalArgumentException: Pathname
/
Hi All,
Is there any performance impact if I use collectAsMap on my RDD instead of
rdd.collect().toMap?
I have a key/value RDD and I want to convert it to a HashMap. As far as I know,
collect() is not efficient on large data sets as it runs on the driver; can I use
collectAsMap instead, or is there any performa
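For reference, a minimal sketch of the difference: both bring the whole RDD to the driver (collectAsMap() calls collect() internally), so the scaling limit is the same; the practical difference is that collectAsMap() builds the map for you and keeps only one value per duplicate key.

val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val m1: scala.collection.Map[String, Int] = kv.collectAsMap()   // Map(b -> 2, a -> 3)
val m2: Map[String, Int] = kv.collect().toMap                    // same content via an Array[(K, V)]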
Hi All,
Can I pass the number of partitions for all the RDDs explicitly while submitting
the Spark job, or do I need to set it in my Spark code itself?
Thanks
Sri
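A minimal sketch of both options (the numbers are illustrative): a cluster-wide default can go on the submit command line, while per-RDD counts still come from the API.

// At submit time:
//   spark-submit --conf spark.default.parallelism=200 --conf spark.sql.shuffle.partitions=200 ...

// In the code itself (sc as in spark-shell):
val fromFile = sc.textFile("hdfs:///data/input", 200)   // minimum partition hint on read
val reshaped = fromFile.repartition(200)                 // explicit shuffle to 200 partitions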
Hi All,
In Unix I can print a warning or info message using LogMessage WARN "Hi All" or
LogMessage INFO "Hello World"; is there something similar in Spark?
Imagine I want to print the count of an RDD in the logs instead of using println.
Thanks
Sri
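A minimal sketch of the usual replacement for LogMessage (the logger name is illustrative): log4j is already on Spark's classpath, so the RDD count can go to the logs instead of println.

import org.apache.log4j.Logger

val log = Logger.getLogger("MyJob")
val count = sc.parallelize(1 to 100).count()
log.warn(s"RDD count = $count")    // roughly LogMessage WARN "..."
log.info("Hello World")            // roughly LogMessage INFO "..."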
Hi All,
Can Spark be used as an alternative to GemFire cache? We use GemFire
cache to save (cache) dimension data in memory, which is later used by our
custom-made Java ETL tool. Can I do something like the below?
Can I cache an RDD in memory for a whole day? As far as I know, an RDD will get
empty once t
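On the caching part, a minimal sketch (the path is made up): a persisted RDD lives as long as its SparkContext, so a long-running driver can keep the dimension data cached all day, although it is not a cache shared across applications the way GemFire is.

import org.apache.spark.storage.StorageLevel

val dimensions = sc.textFile("hdfs:///dim/customers")     // hypothetical dimension data
  .persist(StorageLevel.MEMORY_AND_DISK)                  // spill to disk instead of dropping under memory pressure

dimensions.count()                                        // materialise the cache up front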
Hi All,
Just wondering, is there any better way of writing the code below? I am new
to Spark, and I feel what I wrote is pretty simple, basic and straightforward;
is there a better way of writing it using a functional style?
val QuoteRDD=quotefile.map(x => x.split("\\|")).
filter(line
Got it: I created the hashmap and saved it to a file; please follow the steps below.
val QuoteRDD=quotefile.map(x => x.split("\\|")).
filter(line => line(0).contains("1017")).
map(x => ((x(5)+x(4)) , (x(5),x(4),x(1) ,
if (x(15) =="B")
(
{if (x(25) == "") x(9) else x(
Hi All,
I changed my approach and now I am able to load the data into a Map and get
data out using the get method.
val QuoteRDD=quotefile.map(x => x.split("\\|")).
filter(line => line(0).contains("1017")).
map(x => ((x(5)+x(4)) , (x(5),x(4),x(1) ,
if (x(15) =="B")
if (x(
Hi all,
I am trying to create a hashmap from two RDDs, but I am having a "key not
found" issue.
Do I need to convert the RDDs to lists first?
1) One RDD has the key data.
2) The other RDD has the value data.
Key RDD:
val quotekey=file.map(x => x.split("\\|")).filter(line =>
line(0).contains("1017")).map(x => x(5)+x(4))
Value
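A minimal sketch of an alternative that avoids the two separate RDDs (the field positions follow the snippets above; the input path and lookup key are illustrative): keep the key and value together in one pair RDD and collect it as a map.

val quotefile = sc.textFile("hdfs:///data/quotes")
val parsed = quotefile.map(x => x.split("\\|")).filter(line => line(0).contains("1017"))

val keyed  = parsed.map(x => (x(5) + x(4), (x(5), x(4), x(1))))   // key -> selected fields
val lookup = keyed.collectAsMap()                                  // driver-side map

println(lookup.get("someKey"))                                     // Option[...] lookup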
Sorry guys, I figured it out.
val x = p.collect().mkString("\n").replace("[","").replace("]","").replace(",","~")
Full Code:-
package com.examples
/**
* Created by kalit_000 on 22/09/2015.
*/
import kafka.producer.KeyedMessage
import kafka.producer.Producer
import kafka.producer.ProducerConfig
Hi All,
I am a newbie in Spark.
I managed to write a Kafka producer in Spark where the data comes from a
Cassandra table, but I have a few questions.
The Spark output from Cassandra looks like the below.
[2,Joe,Smith]
[1,Barack,Obama]
I would like something like this in my Kafka consumer, so I need to rem
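A minimal sketch of a cleaner alternative to the replace() chain above (the sample rows stand in for the Cassandra RDD): format each row's columns directly instead of post-processing the Row.toString output.

val rows = sc.parallelize(Seq((2, "Joe", "Smith"), (1, "Barack", "Obama")))

val messages = rows.map { case (id, first, last) => s"$id~$first~$last" }
messages.collect().foreach(println)   // 2~Joe~Smith and 1~Barack~Obama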
Hi All,
Figured it out: I forgot to set the master to local[2]; at least two threads
are required.
package com.examples
/**
* Created by kalit_000 on 19/09/2015.
*/
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
imp
Hi All,
I am unable to see the output getting printed in the console; can anyone
help?
package com.examples
/**
* Created by kalit_000 on 19/09/2015.
*/
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
impo
Hi ,
I am trying to develop the same code in IntelliJ IDEA and I am having the same issue;
is there any workaround?
Error in IntelliJ: cannot resolve symbol createDirectStream
import kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.
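For reference, a sketch of what usually makes the symbol resolve (broker, topic and artifact version are illustrative; the version must match your Spark build): createDirectStream lives in the separate spark-streaming-kafka artifact, so IntelliJ needs that dependency on the classpath plus the KafkaUtils import.

// build.sbt (or the Maven equivalent), version matching your Spark:
//   libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2"

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectKafka").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")   // hypothetical broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("test-topic"))                               // hypothetical topic

stream.map(_._2).print()
ssc.start()
ssc.awaitTermination()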
Hi All,
I would like to achieve the output below using Spark. I managed to write it
in Hive and call it from Spark, but not in plain Spark (Scala). How do I group
word counts by a particular user (column), for example?
Imagine users and their tweets: I want to do a word count based on user
name.
Input:-
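The input sample is cut off, so here is a minimal sketch with made-up data, assuming one record per tweet in the form "user,tweet text": count words per (user, word) pair in plain Spark.

val tweets = sc.parallelize(Seq(
  "alice,spark is fast and spark is fun",
  "bob,hadoop and spark"))

val counts = tweets
  .map(_.split(",", 2))                                                         // -> Array(user, tweet text)
  .flatMap { case Array(user, text) => text.split("\\s+").map(word => ((user, word), 1)) }
  .reduceByKey(_ + _)

counts.collect().foreach(println)   // ((alice,spark),2), ((bob,and),1), ...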
Hi All,
I am trying to do a word count over a number of tweets. My first step is to get
the data from a table using Spark SQL and then run the split function on top of it to
calculate the word count.
Error: value split is not a member of org.apache.spark.sql.SchemaRDD
Spark code that doesn't work to do the word coun
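A hedged sketch of the likely fix (table and column names are illustrative): split is a String method, so pull the string out of each Row inside a map instead of calling split on the SchemaRDD itself.

import org.apache.spark.SparkContext._   // pair-RDD functions for reduceByKey on older Spark

val tweetsRdd = sqlContext.sql("SELECT tweet_text FROM tweets")

val wordCounts = tweetsRdd
  .map(row => row.getString(0))          // take the String out of each Row
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.collect().foreach(println)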