Hi,
I am trying to execute a simple query that joins 3 tables. When I look at
the execution plan, it varies with the position of the tables in the "from"
clause. The plan looks more optimized when the table with predicates is
listed before any other table.
Original query:
selec
More details:
Execution plan for the original query:
select distinct pge.portfolio_code
from table1 pge join table2 p
on p.perm_group = pge.anc_port_group
join table3 uge
on p.user_group=uge.anc_user_group
where uge.user_name = 'user' and p.perm_type = 'TEST'
== Physical Plan ==
TungstenAggre
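For reference, a minimal sketch of how the two orderings can be compared in the shell, assuming a sqlContext val and the DataFrame API (Spark 1.3+) with the three tables already registered as temp tables:

val q = sqlContext.sql("""
  select distinct pge.portfolio_code
  from table1 pge
  join table2 p on p.perm_group = pge.anc_port_group
  join table3 uge on p.user_group = uge.anc_user_group
  where uge.user_name = 'user' and p.perm_type = 'TEST'""")
q.explain(true)   // prints the extended (logical + physical) plan for this table ordering
// repeat with the other table order in the from clause and diff the output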
Hi,
I have a table sourced from 2 parquet files, with a few extra columns in one
of the parquet files. Simple queries work fine, but queries with a predicate
on an extra column don't work and I get a "column not found" error.
Column resp_party_type exists in just one parquet file.
a) Query that works:
select
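In case it is relevant, a minimal sketch of reading the same data with Parquet schema merging turned on (this assumes the Spark 1.5+ DataFrameReader API and a hypothetical path); mergeSchema asks Spark to union the schemas of all part files instead of picking one of them:

val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/path/to/table")                 // hypothetical location of the two parquet files
df.printSchema()                             // resp_party_type should now show up in the schema
df.filter("resp_party_type = 'X'").show()    // 'X' is a placeholder predicate value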
Hi,
I have a set of data that I need to group by a specific key and then save as
parquet. Refer to the code snippet below. I am querying trade and then
grouping by date.
val df = sqlContext.sql("SELECT * FROM trade")
val dfSchema = df.schema
val partitionKeyIndex = dfSchema.fieldNames.seq.indexOf("da
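A minimal sketch of one way to write the grouped output, assuming Spark 1.4+ (DataFrameWriter.partitionBy); one parquet directory is produced per distinct value of the partition column:

val df = sqlContext.sql("SELECT * FROM trade")
df.write
  .partitionBy("date")            // "date" is assumed from the description above
  .parquet("/tmp/trade_by_date")  // hypothetical output path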
Please refer to the code snippet below. I get the following error:
/tmp/temp/trade.parquet/part-r-00036.parquet is not a Parquet file.
expected magic number at tail [80, 65, 82, 49] but found [20, -28, -93, 93]
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
at
org
Hi,
What is the difference between SPARK_LOCAL_DIRS and SPARK_WORKER_DIR? Also,
does Spark clean these up after execution?
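For context, both are normally set in conf/spark-env.sh; a minimal sketch (the paths are placeholders):

# conf/spark-env.sh
export SPARK_LOCAL_DIRS=/data/spark/tmp    # scratch space: shuffle output and RDD blocks spilled to disk
export SPARK_WORKER_DIR=/data/spark/work   # standalone worker directory: application logs and jars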
Regards,
Gaurav
Hi,
I created some Hive tables and I am trying to list them through Beeline, but
I am not getting any results. I can list the tables through spark-sql.
When I connect with Beeline, it starts up with the following message:
Connecting to jdbc:hive2://tst001:10001
Enter username for jdbc:hive2://tst001:10001: <>
E
Hi,
I am trying to run a simple Hadoop job (that uses
CassandraHadoopInputOutputWriter) on Spark (v1.2, Hadoop v1.x) but I am
getting a NullPointerException in TaskSetManager:
WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost
task 14.2 in stage 0.0 (TID 29, devntom003.dev.black
Hi,
I am converting a jsonRDD to parquet by saving it as a parquet file
(saveAsParquetFile):
cacheContext.jsonFile("file:///u1/sample.json").saveAsParquetFile("sample.parquet")
I am then reading the parquet file and registering it as a table:
val parquet = cacheContext.parquetFile("sample_trades.parquet")
par
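A rough sketch of the registration and query step (the temp table name and the query here are placeholders):

val parquet = cacheContext.parquetFile("sample_trades.parquet")
parquet.registerTempTable("sample_trades")                        // hypothetical table name
cacheContext.sql("select count(*) from sample_trades").collect()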
Hi tsingfu,
Thanks for your reply. I tried with other columns, but the problem is the
same with other Integer columns.
Regards,
Gaurav
Hi,
I am playing around with Spark SQL 1.3 and noticed that the "max" function
does not give the correct result, i.e. it doesn't return the maximum value.
The same query works fine in Spark SQL 1.2.
Is anyone aware of this issue?
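For illustration, the query I mean is of this shape (the table and column names here are placeholders, not the real ones):

sqlContext.sql("SELECT max(trade_price) FROM trades").collect()   // hypothetical table/column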
Regards,
Gaurav
Hi,
I am using jsonRDD in Spark SQL and having trouble iterating through an
array inside the JSON object. Please refer to the schema below:
 |-- Preferences: struct (nullable = true)
 |    |-- destinations: array (nullable = true)
 |-- user: string (nullable = true)
Sample Data:
-- Preferences: st
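In case it helps show what I am attempting, a rough sketch using explode through a HiveContext (the temp table name and the jsonData variable are mine):

jsonData.registerTempTable("prefs")          // jsonData = hiveContext.jsonRDD(...) on the JSON above
hiveContext.sql(
  "SELECT user, dest FROM prefs LATERAL VIEW explode(Preferences.destinations) d AS dest"
).collect()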
Thanks. I am not using a HiveContext. I am loading data from Cassandra,
converting it into JSON, and then querying it through a SQLContext. Can I
use a HiveContext to query a jsonRDD?
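For reference, what I am trying would look roughly like this (the RDD and helper names are placeholders):

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
val jsonStrings = cassandraRdd.map(toJsonString)   // hypothetical RDD[String] of JSON documents
val table = hiveContext.jsonRDD(jsonStrings)       // HiveContext extends SQLContext, so jsonRDD is available
table.registerTempTable("json_data")
hiveContext.sql("SELECT count(*) FROM json_data").collect()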
Gaurav
Hi,
I am reading data from Cassandra through the DataStax spark-cassandra
connector, converting it into JSON, and then running Spark SQL on it. Refer
to the code snippet below:
step 1 > val o_rdd = sc.cassandraTable[CassandraRDDWrapper](
'', '')
step 2 > val tempObjectRDD = sc.parallelize(o_r
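The overall flow is roughly the following (keyspace, table, and field names are placeholders):

import com.datastax.spark.connector._
case class CassandraRDDWrapper(id: String, payload: String)        // hypothetical row fields
val o_rdd = sc.cassandraTable[CassandraRDDWrapper]("my_keyspace", "my_table")
val jsonRdd = o_rdd.map(r => s"""{"id": "${r.id}", "payload": "${r.payload}"}""")
val schemaRdd = sqlContext.jsonRDD(jsonRdd)
schemaRdd.registerTempTable("cassandra_json")
sqlContext.sql("SELECT count(*) FROM cassandra_json").collect()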
Hi,
Thanks, it worked; I really appreciate your help. I have also been trying to
do multiple LATERAL VIEWs, but it doesn't seem to be working.
Query:
hiveContext.sql("Select t2 from fav LATERAL VIEW explode(TABS) tabs1 as t1
LATERAL VIEW explode(t1) tabs2 as t2")
Exception
org.apache.spark.sql.c
Are you using Spark 1.1? If yes, you would have to update the DataStax
Cassandra connector code and remove the references to the log methods in
CassandraConnector.scala.
Regards,
Gaurav
My bad, please ignore, it works!
Hi,
I have been using the spark shell to execute all SQL. I am connecting to
Cassandra, converting the data to JSON, and then running queries on it. I
am using HiveContext (and not SQLContext) because of the "explode"
functionality in it.
I want to see how I can use the Spark SQL CLI for directly runnin
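For reference, the CLI itself is started as below (the master URL is a placeholder):

./bin/spark-sql --master spark://host:7077
spark-sql> show tables;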
Hi,
I have a bunch of XML files and I want to run Spark SQL on them; is there a
recommended approach? I am thinking of either converting the XML to JSON and
then using jsonRDD (a rough sketch of that idea is below).
Please let me know your thoughts.
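A rough sketch of that idea, assuming one XML document per file and hypothetical field names and paths:

import scala.xml.XML
val xmlFiles = sc.wholeTextFiles("file:///tmp/xml-dir")        // (path, content) pairs
val jsonStrings = xmlFiles.map { case (_, content) =>
  val doc = XML.loadString(content)
  // pull out the fields of interest and emit one JSON object per document
  s"""{"id": "${(doc \ "id").text}", "name": "${(doc \ "name").text}"}"""
}
val schemaRdd = sqlContext.jsonRDD(jsonStrings)
schemaRdd.registerTempTable("xml_docs")
sqlContext.sql("SELECT count(*) FROM xml_docs").collect()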
Regards,
Gaurav
Hi,
I have been working on Spark SQL and want to expose this functionality to
other applications. The idea is to let other applications send SQL to be
executed on the Spark cluster and get the results back. I looked at spark
job server (https://github.com/ooyala/spark-jobserver) but it provides a RESTf
Hi,
I am trying to read a CSV file in the following way:
val csvData = sc.textFile("file:///tmp/sample.csv")
csvData.collect().length
This works fine in spark-shell, but when I spark-submit the jar, I get the
following exception:
java.lang.IllegalStateException: unread block data
Does Spark guarantee to push the processing to the data? Before creating
tasks, does Spark always check the data location? For example, if I have 3
Spark nodes (Node1, Node2, Node3) and the data is local to just 2 nodes
(Node1 and Node2), will Spark always schedule tasks on the node for which the d
You can also look at Spark Job Server
https://github.com/spark-jobserver/spark-jobserver
- Gaurav
> On Jan 9, 2015, at 10:25 PM, Corey Nolet wrote:
>
> Cui Lin,
>
> The solution largely depends on how you want your services deployed (Java web
> container, Spray framework, etc...) and if you