I looked at the documentation on this. It is not clear what goes on behind
the scenes; there is very little documentation on it.
First, in Hive a database has to exist before it can be used, so sql("use
mytable") will not create a database for you.
Also, you cannot call your table mytable in database mytable!
Reme
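For example, a minimal sketch of creating the database explicitly first (the database and table names are placeholders; this assumes a HiveContext bound to sqlContext, as in spark-shell):

// "USE mydb" alone will not create the database, so create it explicitly first
sqlContext.sql("CREATE DATABASE IF NOT EXISTS mydb")
sqlContext.sql("USE mydb")
sqlContext.sql("CREATE TABLE IF NOT EXISTS mytab (id INT, name STRING)")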
That's very useful information.
The reason for the weird problem is the non-determinism of the RDD
before applying randomSplit.
By caching the RDD, we make it deterministic, and so the problem is
solved.
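For reference, a minimal sketch of that fix (the RDD name, weights and seed are placeholders):

val data = rawRdd.cache()   // cache so both splits see the same materialized data
data.count()                // force materialization before splitting
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)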
Thank you for your help.
2016-02-21 11:12 GMT+07:00 Ted Yu :
> Have you looked at:
>
I have a DataFrame that has a Python dict() as one of the columns. I'd like
to filter the DataFrame for those Rows where the dict() contains a
specific value, e.g. something like this:
DF2 = DF1.filter('name' in DF1.params)
but that gives me this error
ValueError: Cannot convert column i
Hello Spark users,
I have to aggregate messages from Kafka and, at some fixed interval (say
every half hour), update a memory-persisted RDD and run some computation.
This computation uses the last one day of data. Steps are:
- Read from realtime Kafka topic X in spark streaming batches of 5 seconds
- Filt
It sounds like another window operation on top of the 30-min window will
achieve the desired objective.
Just keep in mind that you'll need to set the cleaner TTL (spark.cleaner.ttl)
to a long enough value, and you will require enough resources (mem & disk)
to keep the required data.
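A rough sketch of what that could look like (keyedStream is an assumed DStream[(String, Long)] built from the Kafka input; the window lengths follow the question):

import org.apache.spark.streaming.Minutes
// 30-minute aggregation, emitted every 30 minutes
val halfHourly = keyedStream.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b, Minutes(30), Minutes(30))
// a second window on top of it, covering the last day of half-hourly results
val lastDay = halfHourly.window(Minutes(24 * 60), Minutes(30))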
-kr, Gerard.
O
I am trying to parse an xml file using spark-xml. But for some reason when I
print the schema it only shows root instead of the hierarchy. I am using
sqlContext to read the data. I am proceeding according to this video:
https://www.youtube.com/watch?v=NemEp53yGbI
The structure of the xml file is somewhat l
Can you paste the code you are using?
On Sun, 21 Feb 2016, 13:19 Prathamesh Dharangutte
wrote:
> I am trying to parse xml file using spark-xml. But for some reason when i
> print schema it only shows root instead of the hierarchy. I am using
> sqlcontext to read the data. I am proceeding accord
This is the code I am using for parsing xml file:
import org.apache.spark.{SparkConf,SparkContext}
import org.apache.spark.sql.{DataFrame,SQLContext}
import com.databricks.spark.xml
object XmlProcessing {
def main(args : Array[String]) = {
val conf = new SparkConf()
.setAppName("
Just ran that code and it works fine, here is the output:
What version are you using?
val ctx = SQLContext.getOrCreate(sc)
val df = ctx.read.format("com.databricks.spark.xml").option("rowTag",
"book").load("file:///tmp/sample.xml")
df.printSchema()
root
|-- name: long (nullable = true)
|-- ord
I am using spark 1.4.0 with scala 2.10.4 and 0.3.2 of spark-xml.
Orderid is empty for some books and there are multiple entries of it for
other books; did you include that in your xml file?
No because you didn't say that explicitly. Can you share a sample file too?
On Sun, 21 Feb 2016, 14:34 Prathamesh Dharangutte
wrote:
> I am using spark 1.4.0 with scala 2.10.4 and 0.3.2 of spark-xml
> Orderid is empty for some books and multiple entries of it for other
> books,did you include
Hi, everyone.
I have a spark program, where df0 is a DataFrame
val idList = List("1", "2", "3", ...)
val rdd0 = df0.rdd
val schema = df0.schema
val rdd1 = rdd0.filter(r => !idList.contains(r(0)))
val df1 = sc.createDataFrame(rdd1, schema)
When I run df1.count(), I got the following Exception at
I tried the following in spark-shell:
scala> val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B",
"C", "num")
df0: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more
fields]
scala> val idList = List("1", "2", "3")
idList: List[String] = List(1, 2, 3)
scala> val rdd
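For reference, the same filter can also be expressed directly on the DataFrame without dropping to the RDD; a sketch assuming Spark 1.5+ (where Column.isin is available) and reusing the names from the shell session above:

import org.apache.spark.sql.functions.col
val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
val idList = List("1", "2", "3")
// keep only rows whose column "A" is NOT in idList
val df1 = df0.filter(!col("A").isin(idList: _*))
df1.count()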
Make sure the xml input file is well formed (check your end tags).
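A quick local well-formedness check before handing the file to spark-xml (a sketch; scala.xml ships with Scala 2.10/2.11 and the path is a placeholder):

// throws org.xml.sax.SAXParseException, pointing at the offending line,
// if the file is not well formed (e.g. a missing end tag)
val doc = scala.xml.XML.loadFile("/tmp/sample.xml")
println(doc.label)   // prints the root element name if parsing succeeded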
Sent from my iPhone
> On Feb 21, 2016, at 8:14 AM, Prathamesh Dharangutte
> wrote:
>
> This is the code I am using for parsing xml file:
>
>
>
> import org.apache.spark.{SparkConf,SparkContext}
> import org.apache.spark.sq
Hello,
I have input lines like below
*Input*
t1, file1, 1, 1, 1
t1, file1, 1, 2, 3
t1, file2, 2, 2, 2, 2
t2, file1, 5, 5, 5
t2, file2, 1, 1, 2, 2
and I want to achieve the output below, which is a vertical (element-wise)
addition of the corresponding numbers.
*Output*
"file1" : [ 1+1+5, 1+2+5, 1+3+5
w.r.t. cleaner TTL, please see:
[SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
FYI
On Sun, Feb 21, 2016 at 4:16 AM, Gerard Maas wrote:
>
> It sounds like another window operation on top of the 30-min window will
> achieve the desired objective.
> Just keep in mind that you'll n
Thanks a lot!
Best Regards,
Weiwei
On Sat, Feb 20, 2016 at 11:53 PM, Hemant Bhanawat
wrote:
> toDF internally calls sqlcontext.createDataFrame which transforms the RDD
> to RDD[InternalRow]. This RDD[InternalRow] is then mapped to a dataframe.
>
> Type conversions (from scala types to catalyst
Hi,
I'm running a spark streaming application on a spark cluster that spans 6
machines/workers. I'm using spark standalone cluster mode. Each machine has
8 cores. Is there any way to specify that I want to run my application on
all 6 machines and use just 2 cores on each machine?
Thanks
Hello Vinti,
One way to get this done is to split your input line into a key and value
tuple and then simply use groupByKey and handle the values the way
you want. For example:
Assuming you have already split the values into a 5 tuple:
myDStream.map(record => (record._2, (record._3, record
Thanks for your reply Jatin. I changed my parsing logic to what you
suggested:
def parseCoverageLine(str: String) = {
val arr = str.split(",")
...
...
(arr(0), arr(1) :: count.toList) // (test, [file, 1, 1, 2])
}
Then in the grouping, can I use a global hash map p
Never mind, it seems an executor-level mutable map is not recommended, as
stated in
http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/
On Sun, Feb 21, 2016 at 9:54 AM, Vinti Maheshwari
wrote:
> Thanks for your reply Jatin. I changed my parsing logic to what you
> s
You will need to do a collect and update a global map if you want to.
myDStream.map(record => (record._2, (record._3, record._4, record._5)))
.groupByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2, r1._3 +
r2._3))
.foreachRDD(rdd => {
rdd.collect().foreach((fileName, valu
Hi Ted,
Thanks a lot for your reply.
I tried your code in spark-shell on my laptop and it works well.
But when I tried it on another computer with Spark installed, I got an error:
scala> val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B",
"C", "num")
:11: error: value toDF is not a m
good catch on the cleaner.ttl
@jatin - when you say "memory-persisted RDD", what do you mean exactly? And
how much data are you talking about? Remember that Spark can evict these
memory-persisted RDDs at any time. They can be recovered from Kafka, but this
is not a good situation to be in.
Do you have the following in your IDEA Scala console ?
import scala.language.implicitConversions
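For reference, a minimal console sequence that brings toDF into scope on Spark 1.x (a sketch; spark-shell performs the implicits import automatically, but an IDE console does not):

import scala.language.implicitConversions
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._   // provides toDF on local Seqs and RDDs
val df0 = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")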
On Sun, Feb 21, 2016 at 10:16 AM, Tenghuan He wrote:
> Hi Ted,
>
> Thanks a lot for you reply
>
> I tried your code in spark-shell on my laptop it works well.
> But when I tried it on another compu
w.r.t. the new mapWithState(), there have been some bug fixes since the
release of 1.6.0
e.g.
SPARK-13121 java mapWithState mishandles scala Option
Looks like 1.6.1 RC should come out next week.
FYI
On Sun, Feb 21, 2016 at 10:47 AM, Chris Fregly wrote:
> good catch on the cleaner.ttl
>
> @jat
Hi Tamara,
few basic questions first.
How many executors are you using?
Is the data getting all cached into the same executor?
How many partitions do you have of the data?
How many fields are you trying to use in the join?
If you need any help in finding answers to these questions, please let me
k
Sorry,
please add the following questions to the list above:
the SPARK version?
whether you are using RDD or DataFrames?
is the code run locally or in SPARK Cluster mode or in AWS EMR?
Regards,
Gourav Sengupta
On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta
wrote:
> Hi Tamara,
>
> few b
I believe the best way would be to use the reduceByKey operation.
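A rough sketch of that approach for the vertical addition (lines is an assumed DStream[String] of raw records; the field positions follow the sample input earlier in the thread):

import org.apache.spark.streaming.dstream.DStream

def verticalSum(lines: DStream[String]): DStream[(String, Array[Int])] =
  lines
    .map { line =>
      val arr = line.split(",").map(_.trim)
      (arr(1), arr.drop(2).map(_.toInt))   // (fileName, numbers)
    }
    .reduceByKey { (a, b) =>
      // element-wise ("vertical") addition; pad the shorter array with 0s
      a.zipAll(b, 0, 0).map { case (x, y) => x + y }
    }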
On Mon, Feb 22, 2016 at 5:10 AM, Jatin Kumar <
jku...@rocketfuelinc.com.invalid> wrote:
> You will need to do a collect and update a global map if you want to.
>
> myDStream.map(record => (record._2, (record._3, record._4, record._5)))
>
df.write.saveAsTable("db_name.tbl_name") // works, spark-shell, latest spark
version 1.6.0
df.write.saveAsTable("db_name.tbl_name") // NOT work, spark-shell, old spark
version 1.4
--
Jacky Wang
At 2016-02-21 17:35:53, "Mich Talebzadeh" wrote:
I looked at doc on this. It is not clear
Yeah, I tried with reduceByKey and was able to do it. Thanks Ayan and Jatin.
For reference, the final solution:
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("HBaseStream")
val sc = new SparkContext(conf)
// create a StreamingContext, the main entry point for a
>
> Yeah, i tried with reduceByKey, was able to do it. Thanks Ayan and Jatin.
> For reference, final solution:
>
> def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("HBaseStream")
> val sc = new SparkContext(conf)
> // create a StreamingContext, the main en
Well my version of Spark is 1.5.2
On 21/02/2016 23:54, Jacky Wang wrote:
> df.write.saveAsTable("db_name.tbl_name") // works, spark-shell, latest spark
> version 1.6.0
> df.write.saveAsTable("db_name.tbl_name") // NOT work, spark-shell, old spark
> version 1.4
>
> --
>
> Jacky Wang
>
Hi there,
I had a similar problem in Java with the standalone cluster on Linux but got
it working by passing the following option:
-Dspark.jars=file:/path/to/sparkapp.jar
sparkapp.jar contains the launch application.
Hope that helps.
Regards,
Patrick
-Original Message-
From: Arko Provo Mu
Hi,
I have observed that Spark SQL is not returning records for Hive bucketed
ORC tables on HDP.
In Spark SQL, I am able to list all tables, but queries on Hive bucketed
tables are not returning records.
I have also tried the same for non-bucketed Hive tables; that works fine.
Same is
Hi,
Is the transaction attribute set on your table? I observed that the Hive
transactional storage structure does not work with Spark yet. You can confirm
this by looking at the transactional attribute in the output of "desc
extended <table_name>" in the Hive console.
If you need to access a transactional table, conside
Hi Varadharajan,
Thanks for your response.
Yes, it is a transactional table; see below *show create table*.
The table has hardly 3 records, and after triggering compaction on the
tables, it started showing results in Spark SQL.
> *ALTER TABLE hivespark COMPACT 'major';*
> *show create table hiv
Yes, I was burned by this issue a couple of weeks back. This also means
that after every insert job, a compaction should be run to access new rows
from Spark. Sad that this issue is not documented / mentioned anywhere.
On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh
wrote:
> Hi Varadharajan,
>
>
Hi,
I am trying to dynamically create DataFrames by reading subdirectories under
a parent directory.
My code looks like
> import org.apache.spark._
> import org.apache.spark.sql._
> val hadoopConf = new org.apache.hadoop.conf.Configuration()
> val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new
>
The max number of cores per executor can be controlled using
spark.executor.cores, and the maximum number of executors on a single worker
can be controlled by the environment variable SPARK_WORKER_INSTANCES.
However, to ensure that all available cores are used, you will have to take
care of how the stream is
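A sketch of the relevant settings for the standalone-mode question above (the values follow that question: 6 workers with 2 cores each, i.e. 12 cores in total; the app name is a placeholder):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.executor.cores", "2")   // cores per executor
  .set("spark.cores.max", "12")       // total cores for the app across the cluster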
Compaction should have been triggered automatically, as the following
properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION*
property has not been set for these tables:
hive.compactor.initiator.on = true
hive.compactor.worker.threads = 1
Do
Thanks Gourav, Eduardo.
I tried http://localhost:8080 and http://OAhtvJ5MCA:8080/ . In both
cases Firefox just hangs.
I also tried with the lynx text-based browser. I get the message "HTTP
request sent; waiting for response." and it hangs as well.
Is there a way to enable debug logs in spark
On the line preceding the one that the compiler is complaining about (which
doesn't actually have a problem in itself), you declare df as
"df"+fileName, making it a string. Then you try to assign a DataFrame to
df, but it's already a string. I don't quite understand your intent with
that previous l
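One way around the dynamic-variable-name problem is to key the DataFrames by subdirectory name in a Map instead; a sketch assuming spark-shell with sqlContext in scope, a placeholder parent HDFS path, and Parquet data (all assumptions):

import org.apache.hadoop.fs.{FileSystem, Path}
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fs = FileSystem.get(hadoopConf)
val parent = new Path("/data/parent")   // placeholder parent directory
val dfByName = fs.listStatus(parent)
  .filter(_.isDirectory)
  .map(s => s.getPath.getName -> sqlContext.read.parquet(s.getPath.toString))
  .toMap
dfByName.foreach { case (name, df) => df.printSchema() }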
Hi
I am using an EMR cluster for running my spark jobs, but after the job
finishes the logs disappear.
I have added a log4j.properties in my jar, but all the logs still redirect
to the EMR resource manager, which vanishes after the job completes. Is there a way
I could redirect the logs to a location in f
Hi Folks,
I am exploring spark for streaming from two sources (a) Kinesis and (b)
HDFS for some of our use-cases. Since we maintain state gathered over the
last x hours in spark streaming, we would like to replay the data from the
last x hours as batches during deployment. I have gone through the Spark
Your logs are getting archived in your logs bucket in S3.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
Regards
Sab
On Mon, Feb 22, 2016 at 12:14 PM, HARSH TAKKAR
wrote:
> Hi
>
> In am using an EMR cluster for running my spark jobs, but after the jo
Hi,
Can anybody help me by providing an example of how we can read the schema of
a data set from a file?
Thanks,
Divya
On Mon, Feb 22, 2016 at 12:18 PM, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:
> Hi Folks,
>
>
>
> I am exploring spark for streaming from two sources (a) Kinesis and (b)
> HDFS for some of our use-cases. Since we maintain state gathered over last
> x hours in spark streaming, we