Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-12 Thread Jacek Laskowski
Hi, You could use QueryExecutionListener or Spark listeners to intercept query execution events and extract whatever is required. That's what the web UI does (as it's simply a bunch of SparkListeners --> https://youtu.be/mVP9sZ6K__Y ;-)). Regards, Jacek Laskowski "The Internals Of" Online Boo

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-11 Thread Chitral Verma
Try explain codegen on your DF and then parse the string On Fri, 7 Apr, 2023, 3:53 pm Chenghao Lyu, wrote: > Hi, > > The detailed stage page shows the involved WholeStageCodegen Ids in its > DAG visualization from the Spark UI when running a SparkSQL. (e.g., under > the link > node:18088/histor

Re: SparkSQL vs Dataframe vs Dataset

2021-12-06 Thread yonghua
From my experience, SQL is easy for the folks who already know SQL syntax. With the correct indexing SQL is also fast. But within programs dataframes are much faster and more convenient for loading large data structures from external sources.   From: "rajat kumar" To: "user @spark" Sent: Monday, December 6

Re: [SparkSQL] Full Join Return Null Value For Function-Based Column

2021-01-18 Thread 刘 欢
Sorry, I know the reason. Closed. From: 刘 欢 Date: Monday, January 18, 2021, 1:39 PM To: "user@spark.apache.org" Subject: [SparkSQL] Full Join Return Null Value For Function-Based Column Hi All: Here I got two tables: Table A (name, num): tom 2; jerry 3; jerry 4; null null. Table B (name, score): tom 12; jerry

Re: sparksql 2.4.0 java.lang.NoClassDefFoundError: com/esotericsoftware/minlog/Log

2020-07-09 Thread Ivan Petrov
spark/jars -> minlog-1.3.0.jar I see that the jar is there. What am I doing wrong? Thu, 9 Jul 2020 at 20:43, Ivan Petrov: > Hi there! > I'm seeing this exception in Spark Driver log. > Executor log stays empty. No exceptions, nothing. > 8 tasks out of 402 failed with this exception. > What is the r

Re: sparksql in sparkR?

2019-06-07 Thread Felix Cheung
This seems to be more a question about the spark-sql shell? I may suggest you change the email title to get more attention. From: ya Sent: Wednesday, June 5, 2019 11:48:17 PM To: user@spark.apache.org Subject: sparksql in sparkR? Dear list, I am trying to use sparksql

Re: SparkSQL read Hive transactional table

2018-10-17 Thread Gourav Sengupta
- *From:* "Gourav Sengupta"; *Sent:* Tuesday, October 16, 2018, 6:35 PM *To:* "daily"; *Cc:* "user"; "dev"; *Subject:* Re: SparkSQL read Hive transactional table Hi, can I please ask which versions of Hive and Spark you are using? Regards, Gourav Sengupta On Tue, Oct

Re: SparkSQL read Hive transactional table

2018-10-16 Thread Gourav Sengupta
Hi, can I please ask which versions of Hive and Spark you are using? Regards, Gourav Sengupta On Tue, Oct 16, 2018 at 2:42 AM daily wrote: > Hi, > > I use the HCatalog Streaming Mutation API to write data to a hive transactional > table, and then I use SparkSQL to read data from the hive transaction

Re: [SparkSQL] Count Distinct issue

2018-09-17 Thread kathleen li
Hi, I can't reproduce your issue: scala> spark.sql("select distinct * from dfv").show() | a| b| c| d| e| f| g| h| i| j| k| l| m| n| o| p|

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Tin Vu
Hi Gourav, Thank you for your response. Here are the answers to your questions: 1. Spark 2.3.0 2. I was using the 'spark-sql' command, for example: 'spark-sql --master spark:/*:7077 --database tpcds_bin_partitioned_orc_100 -f $file_name' with file_name being the file that contains the SQL script ("select * fro

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-31 Thread Gourav Sengupta
Hi Tin, This sounds interesting. While I would prefer to think that Presto and Drill have can you please provide the following details: 1. SPARK version 2. The exact code used in SPARK (the full code that was used) 3. HADOOP version I do think that SPARK and DRILL have complementary and differen

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
need to make a call whether you want to take the upfront cost of a shuffle, or you want to live with a large number of tasks From: Tin Vu Date: Thursday, March 29, 2018 at 10:47 AM To: "Lalwani, Jayesh" Cc: "user@spark.apache.org" Subject: Re: [SparkSQL] SparkSQL performance o

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Tin Vu
You are right. Too many tasks were created. How can we reduce the number of tasks? On Thu, Mar 29, 2018, 7:44 AM Lalwani, Jayesh wrote: > Without knowing too many details, I can only guess. It could be that Spark > is creating a lot of tasks even though there are fewer records. Creation

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-29 Thread Lalwani, Jayesh
Without knowing too many details, I can only guess. It could be that Spark is creating a lot of tasks even though there are fewer records. Creation and distribution of tasks has a noticeable overhead on smaller datasets. You might want to look at the driver logs, or the Spark Application Detail U

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Thanks for your response. What do you mean when you said "immediately return"? On Wed, Mar 28, 2018, 10:33 PM Jörn Franke wrote: > I don’t think select * is a good benchmark. You should do a more complex > operation, otherwise the optimizer might see that you don’t do anything in the > query and im

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Jörn Franke
I don’t think select * is a good benchmark. You should do a more complex operation, otherwise the optimizer might see that you don’t do anything in the query and return immediately (similarly, count might return immediately by using some statistics). > On 29. Mar 2018, at 02:03, Tin Vu wrote: > >

Re: SparkSQL not support CharType

2017-11-23 Thread Jörn Franke
Or bytetype depending on the use case > On 23. Nov 2017, at 10:18, Herman van Hövell tot Westerflier > wrote: > > You need to use a StringType. The CharType and VarCharType are there to > ensure compatibility with Hive and ORC; they should not be used anywhere else. > >> On Thu, Nov 23, 2017

Re: SparkSQL not support CharType

2017-11-23 Thread Herman van Hövell tot Westerflier
You need to use a StringType. The CharType and VarCharType are there to ensure compatibility with Hive and ORC; they should not be used anywhere else. On Thu, Nov 23, 2017 at 4:09 AM, 163 wrote: > Hi, > when I use Dataframe with table schema, It goes wrong: > > val test_schema = StructType(
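A minimal sketch of the StringType approach above (Spark 2.x API, hypothetical column names and path):

import org.apache.spark.sql.types._

// Declare string-like columns as StringType; CharType/VarCharType exist only
// for Hive/ORC compatibility and should not appear in user-defined schemas.
val test_schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val df = spark.read.schema(test_schema).csv("/path/to/data.csv")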

Re: SparkSQL to read XML Blob data to create multiple rows

2017-07-08 Thread Amol Talap
> scala> df.withColumn("comment", explode(df("Comments.Comment"))).select($"comment.Description", $"comment.Title").show > | Description| Title|

RE: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Talap, Amol
Thanks so much Zhang. This definitely helps. From: Yong Zhang [mailto:java8...@hotmail.com] Sent: Thursday, June 29, 2017 4:59 PM To: Talap, Amol; Judit Planas; user@spark.apache.org Subject: Re: SparkSQL to read XML Blob data to create multiple rows scala>spark.version res6: String = 2.

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Yong Zhang
|Description_1.1|Title1.1| |Description_1.2|Title1.2| |Description_1.3|Title1.3| From: Talap, Amol Sent: Thursday, June 29, 2017 9:38 AM To: Judit Planas; user@spark.apache.org Subject: RE: SparkSQL to read XML Blo

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas
nt, Corets, Eva Regards, Amol *From:*Judit Planas [mailto:judit.pla...@epfl.ch] *Sent:* Thursday, June 29, 2017 3:46 AM *To:* user@spark.apache.org *Subject:* Re: SparkSQL to read XML Blob data to create multiple rows Hi Amol, Not sure I understand completely your question, but the SQL fun

RE: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Talap, Amol
From: Judit Planas [mailto:judit.pla...@epfl.ch] Sent: Thursday, June 29, 2017 3:46 AM To: user@spark.apache.org Subject: Re: SparkSQL to read XML Blob data to create multiple rows Hi Amol, Not sure I understand completely your question, but the SQL function "explode" may help you: http://s

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Judit Planas
Hi Amol, Not sure I understand completely your question, but the SQL function "explode" may help you: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode Here you can find a nice example: https://stackoverflow.com/questions/38210507/explode-in-pyspark

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread ayan guha
Hi, not sure if I follow your issue. Can you please post the output of books_inexp.show()? On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol wrote: > Hi: > > > > We are trying to parse XML data to get below output from given input > sample. > > Can someone suggest a way to pass one DFrames output into lo

RE: [SparkSQL] Escaping a query for a dataframe query

2017-06-16 Thread mark.jenki...@baesystems.com
ror(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36) From: Gourav Sengupta [mailto:gourav.sengu...@gmail.com] Sent: 15 June 2017 19:35 To: Michael Mior Cc: Jenkins, Mark (UK Guildford); user@spark.apache.org Subject: Re: [SparkSQL

Re: [SparkSQL] Escaping a query for a dataframe query

2017-06-15 Thread Gourav Sengupta
It might be something that I am saying wrong, but sometimes it may just make sense to see the difference between ” and ": <”> is Dec 8221, Hex 201D, Octal 20035, while <"> is Dec 34, Hex 22, Octal 042. Regards, Gourav On Thu, Jun 15, 2017 at 6:45 PM, Michael Mior wrote: > Assuming the parameter to your UDF sh

Re: [SparkSQL] Escaping a query for a dataframe query

2017-06-15 Thread Michael Mior
Assuming the parameter to your UDF should be start"end (with a quote in the middle) then you need to insert a backslash into the query (which must also be escaped in your code). So just add two extra backslashes before the quote inside the string. sqlContext.sql("SELECT * FROM mytable WHERE (mycol
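A minimal sketch of the escaping described above (hypothetical table, column and UDF names); the backslash escapes the quote for the SQL parser, and is itself escaped in the Scala string literal:

val result = sqlContext.sql("SELECT * FROM mytable WHERE mycol = myUdf(\"start\\\"end\")")
// Alternative: single-quote the SQL literal so the embedded double quote
// needs no escaping at the SQL level:
val result2 = sqlContext.sql("SELECT * FROM mytable WHERE mycol = myUdf('start\"end')")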

Re: SparkSQL not able to read an empty table location

2017-05-21 Thread Bajpai, Amit X. -ND
processed the query will again fail. How do I deal with this scenario? From: Sea <261810...@qq.com> Date: Sunday, May 21, 2017 at 8:04 AM To: Steve Loughran , "Bajpai, Amit X. -ND" Cc: "user@spark.apache.org" Subject: Re: SparkSQL not able to read an empty t

Re: SparkSQL not able to read an empty table location

2017-05-21 Thread Sea
please try spark.sql.hive.verifyPartitionPath true -- Original -- From: "Steve Loughran"; Date: Sat, May 20, 2017 09:19 PM To: "Bajpai, Amit X. -ND"; Cc: "user@spark.apache.org"; Subject: Re: SparkSQL not able to read an empt
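A minimal sketch of applying the suggested setting (Spark 1.x/2.x-era API):

// Programmatically:
sqlContext.setConf("spark.sql.hive.verifyPartitionPath", "true")
// Or at submission time:
// spark-submit --conf spark.sql.hive.verifyPartitionPath=true ...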

Re: SparkSQL not able to read an empty table location

2017-05-20 Thread Steve Loughran
On 20 May 2017, at 01:44, Bajpai, Amit X. -ND mailto:n...@disney.com>> wrote: Hi, I have a hive external table with the S3 location having no files (but the S3 location directory does exists). When I am trying to use Spark SQL to count the number of records in the table it is throwing error s

Re: [SparkSQL] too many open files although ulimit set to 1048576

2017-03-13 Thread darin
I think your settings are not taking effect; try adding `ulimit -n 10240` in spark-env.sh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-too-many-open-files-although-ulimit-set-to-1048576-tp28490p28491.html Sent from the Apache Spark User List mailing list archive a

RE: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Gurdit Singh
21 Feb 23:29 To: Yong Zhang <java8...@hotmail.com> Cc: Jacek Laskowski <ja...@japila.pl>; Linyuxin <linyu...@huawei.com>; user <user@spark.apache.org> Subject: Re: [SparkSQL] pre-check syntax before running spark job? You can also run it on REPL an

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Irving Duran
> > Yong > > > -- > *From:* Jacek Laskowski > *Sent:* Tuesday, February 21, 2017 4:34 AM > *To:* Linyuxin > *Cc:* user > *Subject:* Re: [SparkSQL] pre-check syntax before running spark job? > > Hi, > > Never heard about such a tool before. You could use Antlr to

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Yong Zhang
You can always use the explain method to validate your DF or SQL before any action. Yong From: Jacek Laskowski Sent: Tuesday, February 21, 2017 4:34 AM To: Linyuxin Cc: user Subject: Re: [SparkSQL] pre-check syntax before running spark job? Hi, Never heard
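A minimal sketch of the explain-based check (hypothetical table and query); building the DataFrame and calling explain runs parsing, analysis and planning without executing the job:

val df = sqlContext.sql("SELECT name, count(*) AS cnt FROM people GROUP BY name")
df.explain(true)  // prints the parsed, analyzed, optimized and physical plans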

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Jacek Laskowski
Hi, Never heard about such a tool before. You could use Antlr to parse SQLs (just as Spark SQL does while parsing queries). I think it's a one-hour project. Jacek On 21 Feb 2017 4:44 a.m., "Linyuxin" wrote: Hi All, Is there any tool/api to check the sql syntax without running spark job actuall

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Mich Talebzadeh
Right, let us simplify this. Can you run the whole thing *once* only and send the DAG execution output from the UI? You can use a snipping tool to take the image. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Rabin Banerjee
Hi, 1. You are doing some analytics I guess? *YES* 2. It is almost impossible to guess what is happening except that you are looping 50 times over the same set of sql? *I am not looping any SQL; all SQLs are called exactly once, and each requires output from the previous SQL.* 3. Your

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-09 Thread Mich Talebzadeh
Hi 1. You are doing some analytics I guess? 2. It is almost impossible to guess what is happening except that you are looping 50 times over the same set of sql? 3. Your sql step n depends on step n-1. So spark cannot get rid of 1 -n steps 4. you are not storing anything in memor

Re: SPARKSQL with HiveContext My job fails

2016-08-04 Thread Mich Talebzadeh
Well the error states Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded Cause: The detail message "GC overhead limit exceeded" indicates that the garbage collector is runni

Re: SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2016-06-20 Thread Satya
Hello, We are also experiencing the same error. Can you please provide the steps that resolved the issue. Thanks Satya -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-issue-Spark-1-3-1-hadoop-2-6-on-CDH5-3-with-parquet-tp22808p27197.html Sent from

Re: SparkSQL with large result size

2016-05-10 Thread Buntu Dev
Thanks Chris for pointing out the issue. I think I was able to get over this issue by: - repartitioning to increase the number of partitions (about 6k partitions) - apply sort() on the resulting dataframe to coalesce into single sorted partition file - read the sorted file and then adding just lim

Re: SparkSQL with large result size

2016-05-10 Thread Christophe Préaud
Hi, You may be hitting this bug: SPARK-9879 In other words: did you try without the LIMIT clause? Regards, Christophe. On 02/05/16 20:02, Gourav Sengupta wrote: Hi, I have worked on 300GB data by querying it from CSV (using SPARK CSV) and w

Re: SparkSQL with large result size

2016-05-02 Thread Gourav Sengupta
Hi, I have worked on 300GB data by querying it from CSV (using SPARK CSV), writing it to Parquet format, and then querying the Parquet data to partition it and write out individual csv files, without any issues, on a single-node SPARK cluster installation. Are you trying to cac

Re: SparkSQL with large result size

2016-05-02 Thread Ted Yu
That's my interpretation. On Mon, May 2, 2016 at 9:45 AM, Buntu Dev wrote: > Thanks Ted, I thought the avg. block size was already low and less than > the usual 128mb. If I need to reduce it further via parquet.block.size, it > would mean an increase in the number of blocks and that should incre

Re: SparkSQL with large result size

2016-05-02 Thread Buntu Dev
Thanks Ted, I thought the avg. block size was already low and less than the usual 128mb. If I need to reduce it further via parquet.block.size, it would mean an increase in the number of blocks and that should increase the number of tasks/executors. Is that the correct way to interpret this? On Mo

Re: SparkSQL with large result size

2016-05-02 Thread Ted Yu
Please consider decreasing block size. Thanks > On May 1, 2016, at 9:19 PM, Buntu Dev wrote: > > I got a 10g limitation on the executors and operating on parquet dataset with > block size 70M with 200 blocks. I keep hitting the memory limits when doing a > 'select * from t1 order by c1 limit
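A minimal sketch of lowering the parquet.block.size mentioned in this thread before writing (64 MB is an arbitrary example value; df and the output path are hypothetical):

sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
df.write.parquet("/path/to/output")  // df is the DataFrame being written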

Re: SparkSQL with large result size

2016-05-02 Thread ayan guha
How many executors are you running? Does your partition scheme ensure data is distributed evenly? It is possible that your data is skewed and one of the executors is failing. Maybe you can try reducing per-executor memory and increasing partitions. On 2 May 2016 14:19, "Buntu Dev" wrote: > I got a 10g li

Re: SparkSQL exception on spark.sql.codegen

2016-03-27 Thread song ma
Hi Eric and Michael: I run into this problem with Spark 1.4.1 too. The error stack is: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$ at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.sc

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Spencer Uresk
Ok, that helped a lot - and I understand the feature/change better now. Thank you! On Fri, Mar 25, 2016 at 4:32 PM, Michael Armbrust wrote: > Oh, I'm sorry I didn't fully understand what you were trying to do. If > you don't need partitioning, you can set > "spark.sql.sources.partitionDiscovery

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Oh, I'm sorry I didn't fully understand what you were trying to do. If you don't need partitioning, you can set "spark.sql.sources.partitionDiscovery.enabled=false". Otherwise, I think you need to use the unioning approach. On Fri, Mar 25, 2016 at 1:35 PM, Spencer Uresk wrote: > Thanks for the
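A minimal sketch of the two options above (Spark 1.6-era API; the glob path is taken from this thread, the per-root paths are hypothetical):

// Option 1: turn off partition discovery and read through a glob.
sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")
val df = sqlContext.read.json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")
// Option 2: read each root separately and union the results.
val roots = Seq("hdfs://user/hdfs/analytics/app1/PAGEVIEW/", "hdfs://user/hdfs/analytics/app2/PAGEVIEW/")
val unioned = roots.map(p => sqlContext.read.json(p)).reduce(_ unionAll _)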

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Spencer Uresk
Thanks for the suggestion - I didn't try it at first because it seems like I have multiple roots and not necessarily partitioned data. Is this the correct way to do that? sqlContext.read.option("basePath", "hdfs://user/hdfs/analytics/").json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*") If so, it

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Have you tried setting a base path for partition discovery? Starting from Spark 1.6.0, partition discovery only finds partitions under > the given paths by default. For the above example, if users pass > path/to/table/gender=male to either SQLContext.read.parquet or > SQLContext.read.load, gender

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Ted Yu
This is the original subject of the JIRA: Partition discovery fail if there is a _SUCCESS file in the table's root dir If I remember correctly, there were discussions on how (traditional) partition discovery slowed down Spark jobs. Cheers On Fri, Mar 25, 2016 at 10:15 AM, suresk wrote: > In pr

Re: SparkSQL/DataFrame - Is `JOIN USING` syntax null-safe?

2016-02-15 Thread Zhong Wang
Just checked the code and wrote some tests. Seems it is not null-safe... Shall we consider providing a null-safe option for `JOIN USING` syntax? Zhong On Mon, Feb 15, 2016 at 7:25 PM, Zhong Wang wrote: > Is it null-safe when we use this interface? > -- > > def join(right: DataFrame, usingColum
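A minimal sketch of a null-safe alternative (hypothetical data): the <=> column operator is Spark SQL's null-safe equality test, unlike the JOIN USING syntax discussed above.

import sqlContext.implicits._

val left = Seq((Some("a"), 1), (None, 2)).toDF("key", "v1")
val right = Seq((Some("a"), 10), (None, 20)).toDF("key", "v2")
// Null keys match each other under <=>, whereas plain equality drops them.
val joined = left.join(right, left("key") <=> right("key"), "fullouter")
joined.show()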

Re: SparkSQL parallelism

2016-02-11 Thread Rishi Mishra
I am not sure why all 3 nodes should query. If you have not specified any partitions, there should be only one partition of the JDBCRDD, where the whole dataset resides. On Fri, Feb 12, 2016 at 10:15 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a spark cluster with One Ma
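A minimal sketch of partitioning the JDBC read so it is not a single-partition JDBCRDD (hypothetical connection details and bounds):

val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "my_table")
  .option("partitionColumn", "id")   // must be a numeric column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")      // Spark issues 8 parallel range queries
  .load()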

Re: SparkSQL : "select non null values from column"

2016-01-25 Thread Deng Ching-Mallete
Hi, Have you tried using IS NOT NULL for the where condition? Thanks, Deng On Mon, Jan 25, 2016 at 7:00 PM, Eli Super wrote: > Hi > > I try to select all values but not NULL values from column contains NULL > values > > with > > sqlContext.sql("select my_column from my_table where my_column <>
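A minimal sketch of the point above (table and column names from the thread): comparisons against NULL such as <> NULL never match because they evaluate to NULL, so the predicate has to be IS NOT NULL.

val nonNull = sqlContext.sql("SELECT my_column FROM my_table WHERE my_column IS NOT NULL")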

Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Jerry Lam
Hi Kostiantyn, Yes. If security is a concern then this approach cannot satisfy it. The keys are visible in the properties files. If the goal is to hide them, you might be able to go a bit further with this approach. Have you looked at the spark security page? Best Regards, Jerry Sent from my iPhone

Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Kostiantyn Kudriavtsev
Hi guys, the one big issue with this approach: > spark.hadoop.s3a.access.key is now visible everywhere, in logs, in the spark > web UI, and is not secured at all... On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev wrote: > thanks Jerry, it works! > really appreciate your help > > Thank y

Re: SparkSQL integration issue with AWS S3a

2016-01-02 Thread KOSTIANTYN Kudriavtsev
thanks Jerry, it works! really appreciate your help Thank you, Konstantin Kudryavtsev On Fri, Jan 1, 2016 at 4:35 PM, Jerry Lam wrote: > Hi Kostiantyn, > > You should be able to use spark.conf to specify s3a keys. > > I don't remember exactly but you can add hadoop properties by prefixing > spa

Re: [SparkSQL][Parquet] Read from nested parquet data

2016-01-01 Thread lin
Hi Cheng, Thank you for your informative explanation; it is quite helpful. We'd like to try both approaches; should we have some progress, we would update this thread so that anybody interested can follow. Thanks again @yanboliang, @chenglian!

Re: SparkSQL integration issue with AWS S3a

2016-01-01 Thread Jerry Lam
Hi Kostiantyn, You should be able to use spark.conf to specify s3a keys. I don't remember exactly, but you can add hadoop properties by prefixing them with spark.hadoop.*, where * is the s3a property name. For instance, spark.hadoop.s3a.access.key wudjgdueyhsj. Of course, you need to make sure the property key is r
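A minimal sketch of the spark.hadoop.* prefix trick (assuming the standard s3a property names fs.s3a.access.key / fs.s3a.secret.key and placeholder values):

// At submission time:
// spark-submit --conf spark.hadoop.fs.s3a.access.key=AKIA... \
//              --conf spark.hadoop.fs.s3a.secret.key=... myapp.jar
// Or programmatically on the driver:
sc.hadoopConfiguration.set("fs.s3a.access.key", "AKIA...")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")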

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread KOSTIANTYN Kudriavtsev
Hi Jerry, what you suggested looks to be working (I put hdfs-site.xml into $SPARK_HOME/conf folder), but could you shed some light on how it can be federated per user? Thanks in advance! Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam wrote: > Hi Kostiantyn, > > I

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread KOSTIANTYN Kudriavtsev
Hi Jerry, thanks for the hint. Could you please be more specific: how can I pass a different spark-{usr}.conf per user during job submit, and which property can I use to specify a custom hdfs-site.xml? I tried to google it, but didn't find anything. Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 2:3

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread Brian London
Since you're running in standalone mode, can you try it using Spark 1.5.1 please? On Thu, Dec 31, 2015 at 9:09 AM Steve Loughran wrote: > > > On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev < > kudryavtsev.konstan...@gmail.com> wrote: > > > > Hi Jerry, > > > > I want to run different jobs on dif

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread Steve Loughran
> On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev > wrote: > > Hi Jerry, > > I want to run different jobs on different S3 buckets - different AWS creds - > on the same instances. Could you shed some light if it's possible to achieve > with hdfs-site? > > Thank you, > Konstantin Kudryavtsev

Re: [SparkSQL][Parquet] Read from nested parquet data

2015-12-31 Thread Cheng Lian
Hey Lin, This is a good question. The root cause of this issue lies in the analyzer. Currently, Spark SQL can only resolve a name to a top level column. (Hive suffers the same issue.) Take the SQL query and struct you provided as an example, col_b.col_d.col_g is resolved as two nested GetStru
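A minimal sketch of the access pattern under discussion (schema and column names from the thread, hypothetical path); whether only col_g is actually read from disk is exactly the pruning question above:

val df = sqlContext.read.parquet("/path/to/some_table")
df.select("col_b.col_d.col_g").show()
// or equivalently via SQL:
df.registerTempTable("some_table")
sqlContext.sql("SELECT col_b.col_d.col_g FROM some_table").show()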

Re: [SparkSQL][Parquet] Read from nested parquet data

2015-12-30 Thread lin
Hi yanbo, thanks for the quick response. Looks like we'll need to do some work-around. But before that, we'd like to dig into some related discussions first. We've looked through the following urls, but none seems helpful. Mailing list threads: http://search-hadoop.com/m/q3RTtLkgZl1K4oyx/v=thread

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn, I want to confirm that it works first by using hdfs-site.xml. If yes, you could define different spark-{user-x}.conf and source them during spark-submit. let us know if hdfs-site.xml works first. It should. Best Regards, Jerry Sent from my iPhone > On 30 Dec, 2015, at 2:31 pm,

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Hi Jerry, I want to run different jobs on different S3 buckets - different AWS creds - on the same instances. Could you shed some light if it's possible to achieve with hdfs-site? Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam wrote: > Hi Kostiantyn, > > Can you d

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn, Can you define those properties in hdfs-site.xml and make sure it is visible in the class path when you spark-submit? It looks like a conf sourcing issue to me. Cheers, Sent from my iPhone > On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev > wrote: > > Chris, > > thanks

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Chris, thanks for the hint with IAM roles, but in my case I need to run different jobs with different S3 permissions on the same cluster, so this approach doesn't work for me as far as I understood it Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly wrote: > cou

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
couple things: 1) switch to IAM roles if at all possible - explicitly passing AWS credentials is a long and lonely road in the end 2) one really bad workaround/hack is to run a job that hits every worker and writes the credentials to the proper location (~/.awscredentials or whatever) ^^ i would

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Chris, good question; as you can see from the code I set them up on the driver, so I expect they will be propagated to all nodes, won't they? Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly wrote: > are the credentials visible from each Worker node to all the Execu

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
are the credentials visible from each Worker node to all the Executor JVMs on each Worker? > On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev > wrote: > > Dear Spark community, > > I faced the following issue with trying accessing data on S3a, my code is the > following: > > val sparkCo

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Hi Blaz, I did, the same result Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 12:54 PM, Blaž Šnuderl wrote: > Try setting s3 credentials using keys specified here > https://github.com/Aloisius/hadoop-s3a/blob/master/README.md > > Blaz > On Dec 30, 2015 6:48 PM, "KOSTIANTYN Kudriavt

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Blaž Šnuderl
Try setting s3 credentials using keys specified here https://github.com/Aloisius/hadoop-s3a/blob/master/README.md Blaz On Dec 30, 2015 6:48 PM, "KOSTIANTYN Kudriavtsev" < kudryavtsev.konstan...@gmail.com> wrote: > Dear Spark community, > > I faced the following issue with trying accessing data on

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Dawid Wysakowicz
I do understand that Snappy is not splittable as such, but ORCFile is. In ORC, blocks are compressed with Snappy, so there should be no problem with it. Anyway, ZLIB (used by default in both ORC and Parquet) is also not splittable, but it works perfectly fine. 2015-12-30 16:26 GMT+01:00 Chris Fregly :

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Chris Fregly
Reminder that Snappy is not a splittable format. I've had success with Hive + LZF (splittable) and bzip2 (also splittable). Gzip is also not splittable, so you won't be utilizing your cluster to process this data in parallel as only 1 task can read and process unsplittable data - versus many task

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Dawid Wysakowicz
Hasn't anyone used Spark with ORC and Snappy compression? 2015-12-29 18:25 GMT+01:00 Dawid Wysakowicz : > Hi, > > I have a table in hive stored as orc with compression = snappy. I try to > execute a query on that table that fails (previously I ran it on a table in > orc-zlib format and parquet so i

Re: [SparkSQL][Parquet] Read from nested parquet data

2015-12-30 Thread Yanbo Liang
This problem has been discussed before, but I think there is no straightforward way to read only col_g. 2015-12-30 17:48 GMT+08:00 lin : > Hi all, > > We are trying to read from nested parquet data. SQL is "select > col_b.col_d.col_g from some_table" and the data schema for some_table is: >

Re: SparkSQL AVRO

2015-12-07 Thread Deenar Toraskar
By default Spark will create one file per partition. Spark SQL defaults to using 200 partitions. If you want to reduce the number of files written out, repartition your dataframe using repartition and give it the desired number of partitions. originalDF.repartition(10).write.avro("masterNew.avro")

Re: SparkSQL AVRO

2015-12-07 Thread Ruslan Dautkhanov
How many reducers did you have that created those avro files? Each reducer very likely creates its own avro part-file. We normally use Parquet, but it should be the same for Avro, so this might be relevant http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-ta

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-05 Thread Michael Armbrust
> > Follow up question in this case: what is the cost of registering a temp > table? Is there a limit to the number of temp tables that can be registered > by Spark context? > It is pretty cheap. Just an entry in an in-memory hashtable to a query plan (similar to a view).

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-04 Thread Isabelle Phan
Thanks all for your reply! I tested both approaches: registering the temp table then executing SQL vs. saving to HDFS filepath directly. The problem with the second approach is that I am inserting data into a Hive table, so if I create a new partition with this method, Hive metadata is not updated

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Michael Armbrust
you might also coalesce to 1 (or some small number) before writing to avoid creating a lot of files in that partition if you know that there is not a ton of data. On Wed, Dec 2, 2015 at 12:59 AM, Rishi Mishra wrote: > As long as all your data is being inserted by Spark , hence using the same > h
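A minimal sketch of the coalesce suggestion combined with the partitioned write shown later in this thread (hypothetical frame and path):

import org.apache.spark.sql.SaveMode
df.coalesce(1).write.mode(SaveMode.Append).partitionBy("date").save("/test/table")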

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Rishi Mishra
As long as all your data is being inserted by Spark , hence using the same hash partitioner, what Fengdong mentioned should work. On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu wrote: > Hi > you can try: > > if your table under location “/test/table/“ on HDFS > and has partitions: > > “/test/tabl

Re: sparkSQL Load multiple tables

2015-12-02 Thread censj
OK, thanks all. I'll change my approach! > On Dec 2, 2015, at 16:46, Jeff Zhang wrote: > > BTW it is also impossible to do the split if you want to use sql to load multiple > tables. > > On Wed, Dec 2, 2015 at 4:44 PM, Jeff Zhang > wrote: > Do you want to load multiple tables by using sql ?

Re: sparkSQL Load multiple tables

2015-12-02 Thread Jeff Zhang
BTW, it is also impossible to do the split if you want to use SQL to load multiple tables. On Wed, Dec 2, 2015 at 4:44 PM, Jeff Zhang wrote: > Do you want to load multiple tables by using sql ? JdbcRelation can currently > only load a single table. It doesn't accept SQL as the loading command. > > On Wed, Dec 2,

Re: sparkSQL Load multiple tables

2015-12-02 Thread Jeff Zhang
Do you want to load multiple tables by using sql ? JdbcRelation can currently only load a single table. It doesn't accept SQL as the loading command. On Wed, Dec 2, 2015 at 4:33 PM, censj wrote: > hi Fengdong Yu: > I want to use sqlContext.read.format('jdbc').options( ... ).load() > but this function o

Re: sparkSQL Load multiple tables

2015-12-02 Thread censj
Hi Fengdong Yu: I want to use sqlContext.read.format('jdbc').options( ... ).load(), but this function only loads a single table, so I want to know whether there are operations to load multiple tables? > On Dec 2, 2015, at 16:28, Fengdong Yu wrote: > > It cannot read multiple tables, > > but if your tables have t

Re: sparkSQL Load multiple tables

2015-12-02 Thread Fengdong Yu
It cannot read multiple tables, but if your tables have the same columns, you can read them one by one, then unionAll them, such as: val df1 = sqlContext.table("table1") val df2 = sqlContext.table("table2") val df = df1.unionAll(df2) > On Dec 2, 2015, at 4:06 PM, censj wrote: > > Dear a
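A minimal sketch of the same unionAll approach over the JDBC source (hypothetical connection details):

def loadTable(name: String) =
  sqlContext.read.format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/mydb")
    .option("dbtable", name)
    .load()
// Read each table separately, then union the results.
val df = Seq("table1", "table2").map(loadTable).reduce(_ unionAll _)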

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Fengdong Yu
Hi, you can try: if your table is under location "/test/table/" on HDFS and has partitions: "/test/table/dt=2012" "/test/table/dt=2013" df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table") > On Dec 2, 2015, at 10:50 AM, Isabelle Phan wrote: > > df.write.partitionBy("date").i

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Jeff Zhang
I don't think there's an API for that, but I think it is reasonable and helpful for ETL. As a workaround you can first register your dataframe as a temp table, and use SQL to insert into the static partition. On Wed, Dec 2, 2015 at 10:50 AM, Isabelle Phan wrote: > Hello, > > Is there any API to insert d
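A minimal sketch of the temp-table workaround (Spark 1.x HiveContext API; table and partition names are hypothetical):

df.registerTempTable("staging")
sqlContext.sql(
  "INSERT OVERWRITE TABLE target_table PARTITION (date = '2015-12-01') " +
  "SELECT * FROM staging")  // staging's columns must match the target's non-partition columns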

Re: SparkSQL JDBC to PostGIS

2015-11-05 Thread Mustafa Elbehery
Hi Stefano, Thanks for the prompt reply. Actually I am using *Magellan*, a geospatial library on top of Spark. I know that I can load the data into RDDs, or DFs, and use them directly. However, for requirement purposes, I am trying to query the data from PostGIS directly. So, as I have mentioned above,

Re: SparkSQL JDBC to PostGIS

2015-11-04 Thread Stefano Baghino
Hi Mustafa, are you trying to run geospatial queries on the PostGIS DB with SparkSQL? Correct me if I'm wrong, but I think SparkSQL itself would need to support the geospatial extensions in order for this to work. On Wed, Nov 4, 2015 at 1:46 PM, Mustafa Elbehery wrote: > Hi Folks, > > I am tryi

Re: SparkSQL implicit conversion on insert

2015-11-03 Thread Michael Armbrust
Today you have to do an explicit conversion. I'd really like to open up a public UDT interface as part of Spark Datasets (SPARK-) that would allow you to register custom classes with conversions, but this won't happen till Spark 1.7 likely. On Mon, Nov 2, 2015 at 8:40 PM, Bryan Jeffrey wrote

Re: SparkSQL: What is the cost of DataFrame.registerTempTable(String)? Can I have multiple tables referencing to the same DataFrame?

2015-10-29 Thread Michael Armbrust
It's super cheap. It's just a hashtable stored on the driver. Yes, you can have more than one name for the same DF. On Wed, Oct 28, 2015 at 6:17 PM, Anfernee Xu wrote: > Hi, > > I just want to understand the cost of DataFrame.registerTempTable(String), > is it just a trivial operation(like creati

RE: SparkSQL on hive error

2015-10-27 Thread Cheng, Hao
Hi Anand, can you paste the table creation statement? I'd like to reproduce that locally first. BTW, which version are you using? Hao From: Anand Nalya [mailto:anand.na...@gmail.com] Sent: Tuesday, October 27, 2015 11:35 PM To: spark users Subject: SparkSQL on hive error Hi, I've a part

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-12 Thread Xiao Li
Hi Lloyd, are both runs cold or warm? Memory/cache hits or misses could be a big factor if your application is IO intensive. You need to monitor your system to understand what your bottleneck is. Good luck, Xiao Li
