Re: dataframe cumulative sum

2015-05-29 Thread Yin Huai
Hi Cesar, We just added it in Spark 1.4. In Spark 1.4, you can use window functions in HiveContext to do it. Assuming you want to calculate the cumulative sum for every flag, import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions._ df.select( $"flag", $"price",
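A minimal sketch of the full pattern, assuming a DataFrame df with flag and price columns, a hypothetical ordering column ts, and a HiveContext whose implicits are imported for the $ syntax:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Running sum per flag: frame from the start of the partition to the current row.
    val w = Window.partitionBy("flag").orderBy("ts").rowsBetween(Long.MinValue, 0)

    val withCumSum = df.select(
      $"flag",
      $"price",
      sum($"price").over(w).as("cumulative_price"))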

Re: Spark 1.3.0 -> 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER

2015-05-30 Thread Yin Huai
Looks like your program somehow picked up an older version of Parquet (Spark 1.3.1 uses parquet 1.6.0rc3, and it seems the NO_FILTER field was introduced in 1.6.0rc2). Is it possible for you to check the Parquet lib version in your classpath? Thanks, Yin On Sat, May 30, 2015 at 2:44 PM, ogoh wrote: >

Re: Spark 1.4.0-rc3: Actor not found

2015-06-02 Thread Yin Huai
Does it happen every time you read a parquet source? On Tue, Jun 2, 2015 at 3:42 AM, Anders Arpteg wrote: > The log is from the log aggregation tool (hortonworks, "yarn logs ..."), > so both executors and driver. I'll send a private mail to you with the full > logs. Also, tried another job as yo

Re: Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-03 Thread Yin Huai
Hi Doug, Actually, sqlContext.table does not support database names in either Spark 1.3 or Spark 1.4. We will support it in a future version. Thanks, Yin On Wed, Jun 3, 2015 at 10:45 AM, Doug Balog wrote: > Hi, > > sqlContext.table(“db.tbl”) isn’t working for me, I get a > NoSuchTableException.

Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-03 Thread Yin Huai
Can you put the following setting in spark-defaults.conf and try again? spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni https://issues.apache.org/jira/browse/SPAR

Re: Spark 1.4 HiveContext fails to initialise with native libs

2015-06-04 Thread Yin Huai
Are you using RC4? On Wed, Jun 3, 2015 at 10:58 PM, Night Wolf wrote: > Thanks Yin, that seems to work with the Shell. But on a compiled > application with Spark-submit it still fails with the same exception. > > On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai wrote: > >> Can

Re: Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-04 Thread Yin Huai
Context, hiveContext, Catalog) could > be better. > > Thanks, > > Doug > > > On Jun 3, 2015, at 8:21 PM, Yin Huai wrote: > > > > Hi Doug, > > > > Actually, sqlContext.table does not support database name in both Spark > 1.3 and Spark 1.4. We will suppor

Re: Spark 1.4.0-rc4 HiveContext.table("db.tbl") NoSuchTableException

2015-06-05 Thread Yin Huai
Hi Doug, For now, I think you can use "sqlContext.sql("USE databaseName")" to change the current database. Thanks, Yin On Thu, Jun 4, 2015 at 12:04 PM, Yin Huai wrote: > Hi Doug, > > sqlContext.table does not officially support database name. It only > supp

Re: Issues with `when` in Column class

2015-06-12 Thread Yin Huai
Hi Chris, Have you imported "org.apache.spark.sql.functions._"? Thanks, Yin On Fri, Jun 12, 2015 at 8:05 AM, Chris Freeman wrote: > I’m trying to iterate through a list of Columns and create new Columns > based on a condition. However, the when method keeps giving me errors that > don’t quit

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-15 Thread Yin Huai
I saw it once but I was not clear how to reproduce it. The jira I created is https://issues.apache.org/jira/browse/SPARK-7837. More information will be very helpful. Were those errors from speculative tasks or regular tasks (the first attempt of the task)? Is this error deterministic (can you repr

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Yin Huai
Hi John, Did you also set spark.sql.planner.externalSort to true? You will probably not see executor loss with this conf. For now, maybe you can manually split the query into two parts, one for skewed keys and one for other records. Then, you union the results of these two parts together. Thanks,
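A rough sketch of the manual split, with hypothetical table and column names (facts joined to dims on key, where 'HOT' stands in for the skewed key value):

    sqlContext.setConf("spark.sql.planner.externalSort", "true")

    // Join the skewed key and the remaining keys separately, then union the results.
    val skewed = sqlContext.sql(
      "SELECT f.key, f.value, d.attr FROM facts f JOIN dims d ON f.key = d.key WHERE f.key = 'HOT'")
    val rest = sqlContext.sql(
      "SELECT f.key, f.value, d.attr FROM facts f JOIN dims d ON f.key = d.key WHERE f.key <> 'HOT'")
    val result = skewed.unionAll(rest)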

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-17 Thread Yin Huai
the same shell session, every time I do a write I > get a random number of tasks failing on the first run with the NPE. > > Using dynamic allocation of executors in YARN mode. No speculative > execution is enabled. > > On Tue, Jun 16, 2015 at 3:11 PM, Yin Huai wrote: > >> I sa

Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Yin Huai
btw, the user list will be a better place for this thread. On Thu, Jun 18, 2015 at 8:19 AM, Yin Huai wrote: > Is it the full stack trace? > > On Thu, Jun 18, 2015 at 6:39 AM, Sea <261810...@qq.com> wrote: > >> Hi, all: >> >> I want to run spark sql on yarn

Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
Are you writing to an existing hive orc table? On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian wrote: > Thanks for reporting this. Would you mind to help creating a JIRA for this? > > > On 6/16/15 2:25 AM, patcharee wrote: > >> I found if I move the partitioned columns in schemaString and in Row to

Re: HiveContext saveAsTable create wrong partition

2015-06-18 Thread Yin Huai
lease feel free to open a jira about removing this requirement for inserting into an existing hive table. Thanks, Yin On Thu, Jun 18, 2015 at 9:39 PM, Yin Huai wrote: > Are you writing to an existing hive orc table? > > On Wed, Jun 17, 2015 at 3:25 PM, Cheng Lian wrote: > >>

Re: Help optimising Spark SQL query

2015-06-22 Thread Yin Huai
Hi James, Maybe it's the DISTINCT causing the issue. I rewrote the query as follows. Maybe this one can finish faster. select sum(cnt) as uses, count(id) as users from ( select count(*) cnt, cast(id as string) as id, from usage_events where from_unixtime(cast(timestamp_mill
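The general shape of the rewrite, with hypothetical column names: group by the key in a subquery so the outer aggregation no longer needs DISTINCT.

    val rewritten = sqlContext.sql("""
      SELECT SUM(cnt) AS uses, COUNT(id) AS users
      FROM (SELECT id, COUNT(*) AS cnt FROM usage_events GROUP BY id) t
    """)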

Re: How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

2015-06-24 Thread Yin Huai
The function accepted by explode is f: Row => TraversableOnce[A]. Seems user_mentions is an array of structs. So, can you change your pattern matching to the following? case Row(rows: Seq[_]) => rows.asInstanceOf[Seq[Row]].map(elem => ...) On Wed, Jun 24, 2015 at 5:27 AM, Gustavo Arjones wrote:
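A sketch of that pattern match, assuming a tweets DataFrame whose user_mentions column is an array of structs with a hypothetical id field, and that the SQL implicits are in scope for the $ syntax:

    import org.apache.spark.sql.Row

    // The Row passed to explode wraps the array column; each array element is itself a Row.
    val mentionIds = df.explode($"user_mentions") {
      case Row(rows: Seq[_]) =>
        rows.asInstanceOf[Seq[Row]].map(elem => Tuple1(elem.getAs[Long]("id")))
    }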

Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Yin Huai
Axel, Can you file a jira and attach your code in the description of the jira? This looks like a bug. For the third row of df1, the name is "alice" instead of "carol", right? Otherwise, "carol" should appear in the expected output. Btw, to get rid of those columns with the same name after the jo

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
Hi Sim, Spark 1.4.0's memory consumption on PermGen is higher than Spark 1.3's (explained in https://issues.apache.org/jira/browse/SPARK-8776). Can you add --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m" in the command you used to launch Spark shell? This will increase the PermGen size f
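For example, a launch command along these lines (the memory setting is just a placeholder):

    ./bin/spark-shell --driver-memory 4g \
      --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=256m"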

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-02 Thread Yin Huai
HiveContext's methods). Can you just use the sqlContext created by the shell and try again? Thanks, Yin On Thu, Jul 2, 2015 at 12:50 PM, Yin Huai wrote: > Hi Sim, > > Spark 1.4.0's memory consumption on PermGen is higher then Spark 1.3 > (explained in https://issues.apache.org/

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
Sim, Can you increase the PermGen size? Please let me know what is your setting when the problem disappears. Thanks, Yin On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee wrote: > I had run into the same problem where everything was working swimmingly > with Spark 1.3.1. When I switched to Spark 1.4

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
ause the exercise started looking silly. It > is clear that 1.4.0 is using memory in a substantially different manner. > > I'd be happy to share the test file so you can reproduce this in your > own environment. > > /Sim > > Simeon Simeonov, Founder & CTO, Swoop &

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
t; spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages >> com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g --executor-memory 4g >> >> /Sim >> >> Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/> >> @simeons <http://twitter.com/simeon

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
the environment variable, however, as > the behavior of the shell changed from hanging to quitting when the env var > value got to 1g. > > /Sim > > Simeon Simeonov, Founder & CTO, Swoop <http://swoop.com/> > @simeons <http://twitter.com/simeons> | blog.simeo

Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Yin Huai
Hi Brandon, Can you explain what you meant by "It simply does not work"? Did you not see new data files? Thanks, Yin On Fri, Jul 10, 2015 at 11:55 AM, Brandon White wrote: > Why does this not work? Is insert into broken in 1.3.1? It does not throw > any errors, fail, or throw exceptions. I

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Yin Huai
Jerrick, Let me ask a few clarification questions. What is the version of Spark? Is the table a hive table? What is the format of the table? Is the table partitioned? Thanks, Yin On Sun, Jul 12, 2015 at 6:01 PM, ayan guha wrote: > Describe computes statistics, so it will try to query the tabl

Re: [SPARK-SQL] Window Functions optimization

2015-07-13 Thread Yin Huai
Your query will be partitioned once. Then, a single Window operator will evaluate these three functions. As mentioned by Harish, you can take a look at the plan (sql("your sql...").explain()). On Mon, Jul 13, 2015 at 12:26 PM, Harish Butani wrote: > Just once. > You can see this by printing the

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
We do this in SparkILoop ( https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037). What is the version of Spark you are using? How did you add the spark-csv jar? On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers wrote: > has a

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
> that i use >> >> but i also tried adding spark-csv with --package for spark-submit, and >> got the same error >> >> On Thu, Jul 16, 2015 at 4:31 PM, Yin Huai wrote: >> >>> We do this in SparkILookp ( >>> https://github.com/apache/spark/bl

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
No problem:) Glad to hear that! On Thu, Jul 16, 2015 at 8:22 PM, Koert Kuipers wrote: > that solved it, thanks! > > On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers wrote: > >> thanks i will try 1.4.1 >> >> On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai wrote: &g

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Yin Huai
Hi Eran, Can you try 1.6? With the change in https://github.com/apache/spark/pull/10288, the JSON data source will not throw a runtime exception if there is any record that it cannot parse. Instead, it will put the entire record into the "_corrupt_record" column. Thanks, Yin On Sun, Dec 20, 2015 a
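A small sketch of what that looks like in 1.6, assuming the default spark.sql.columnNameOfCorruptRecord setting and a hypothetical input path (the column only appears in the inferred schema when at least one record fails to parse):

    // Unparseable records keep their raw text in _corrupt_record instead of failing the job.
    val df = sqlContext.read.json("/path/to/data.json")
    val bad = df.filter($"_corrupt_record".isNotNull)
    bad.select("_corrupt_record").show()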

Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
Hi, we made the change because the partitioning discovery logic was too flexible and it introduced problems that were very confusing to users. To make your case work, we have introduced a new data source option called basePath. You can use DataFrame df = hiveContext.read().format("orc").option("ba
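A sketch of the option in Scala, with a hypothetical table layout (/user/hive/warehouse/mytable/entity=123/...):

    // basePath tells partition discovery where the table root is, so loading a single
    // partition directory still yields the partitioning column (entity) in the schema.
    val df = hiveContext.read
      .format("orc")
      .option("basePath", "/user/hive/warehouse/mytable/")
      .load("/user/hive/warehouse/mytable/entity=123/")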

Re: How to load specific Hive partition in DataFrame Spark 1.6?

2016-01-07 Thread Yin Huai
No problem! Glad it helped! On Thu, Jan 7, 2016 at 12:05 PM, Umesh Kacha wrote: > Hi Yin, thanks much your answer solved my problem. Really appreciate it! > > Regards > > > On Fri, Jan 8, 2016 at 1:26 AM, Yin Huai wrote: > >> Hi, we made the change because the parti

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
Hi Pala, Can you add the full stacktrace of the exception? For now, can you use create temporary function to workaround the issue? Thanks, Yin On Wed, Sep 30, 2015 at 11:01 AM, Pala M Muthaia < mchett...@rocketfuelinc.com.invalid> wrote: > +user list > > On Tue, Sep 29, 2015 at 3:43 PM, Pala M

Re: Hive permanent functions are not available in Spark SQL

2015-10-01 Thread Yin Huai
.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apa

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Yin Huai
btw, what version of Spark did you use? On Mon, Oct 19, 2015 at 1:08 PM, YaoPau wrote: > I've connected Spark SQL to the Hive Metastore and currently I'm running > SQL > code via pyspark. Typically everything works fine, but sometimes after a > long-running Spark SQL job I get the error below,

Re: Anyone feels sparkSQL in spark1.5.1 very slow?

2015-10-26 Thread Yin Huai
@filthysocks, can you get the output of jmap -histo before the OOM ( http://docs.oracle.com/javase/7/docs/technotes/tools/share/jmap.html)? On Mon, Oct 26, 2015 at 6:35 AM, filthysocks wrote: > We upgrade from 1.4.1 to 1.5 and it's a pain > see > > http://apache-spark-user-list.1001560.n3.nabble

Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Yin Huai
Hi Ross, What version of Spark are you using? There were two issues that affected the results of window functions in the Spark 1.5 branch. Both issues have been fixed and will be released with Spark 1.5.2 (this release will happen soon). For more details of these two issues, you can take a look at h

Re: DataFrame equality does not working in 1.5.1

2015-11-05 Thread Yin Huai
Can you attach the result of eventDF.filter($"entityType" === "user").select("entityId").distinct.explain(true)? Thanks, Yin On Thu, Nov 5, 2015 at 1:12 AM, 千成徳 wrote: > Hi All, > > I have data frame like this. > > Equality expression is not working in 1.5.1 but, works as expected in 1.4.0 > W

Re: About Databricks's spark-sql-perf

2015-08-13 Thread Yin Huai
Hi Todd, We have not had a chance to update it. We will update it after the 1.5 release. Thanks, Yin On Thu, Aug 13, 2015 at 6:49 AM, Todd wrote: > Hi, > I got a question about the spark-sql-perf project by Databricks at > https://github.com/databricks/spark-sql-perf/ > > > The Tables.scala ( >

Re: SQLContext Create Table Problem

2015-08-19 Thread Yin Huai
Can you try to use HiveContext instead of SQLContext? Your query is trying to create a table and persist the metadata of the table in the metastore, which is only supported by HiveContext. On Wed, Aug 19, 2015 at 8:44 AM, Yusuf Can Gürkan wrote: > Hello, > > I’m trying to create a table with sqlCont
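A minimal sketch of swapping in a HiveContext, assuming an existing SparkContext named sc (the table definition is only an example):

    import org.apache.spark.sql.hive.HiveContext

    // HiveContext persists table metadata in the Hive metastore, which plain SQLContext cannot do.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CREATE TABLE IF NOT EXISTS my_table (id INT, name STRING)")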

Re: How to evaluate custom UDF over window

2015-08-24 Thread Yin Huai
For now, user-defined window functions are not supported. We will add support in the future. On Mon, Aug 24, 2015 at 6:26 AM, xander92 wrote: > The ultimate aim of my program is to be able to wrap an arbitrary Scala > function (mostly will be statistics / customized rolling window metrics) in > a UDF and e

Re: java.util.NoSuchElementException: key not found

2015-09-11 Thread Yin Huai
Looks like you hit https://issues.apache.org/jira/browse/SPARK-10422, it has been fixed in branch 1.5. 1.5.1 release will have it. On Fri, Sep 11, 2015 at 3:35 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Hi all , > After upgrade spark to 1.5 , Streaming throw > java.util.No

Re: Null Value in DecimalType column of DataFrame

2015-09-14 Thread Yin Huai
btw, move it to user list. On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai wrote: > A scale of 10 means that there are 10 digits at the right of the decimal > point. If you also have precision 10, the range of your data will be [0, 1) > and casting "10.5" to DecimalType(10, 1

Re: Null Value in DecimalType column of DataFrame

2015-09-17 Thread Yin Huai
etting. Is this really the expected behavior? Never seen something > returning null in other Scala tools that I used. > > Regards, > > > 2015-09-14 18:54 GMT-03:00 Yin Huai : > >> btw, move it to user list. >> >> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai wrote: &g

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry, Looks like it is a Python-specific issue. Can you create a JIRA? Thanks, Yin On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > Hi Spark Developers, > > I just ran some very simple operations on a dataset. I was surprise by the > execution plan of take(1), head() or first(). > > Fo

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 has the same problem? On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > Hi Jerry, > > Looks like it is a Python-specific issue. Can you create a JIRA? > > Thanks, > > Yin > > On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > >> Hi Spark De

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue. On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > btw, does 1.4 has the same problem? > > On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > >> Hi Jerry, >> >> Looks like it is a Python-specific issue. Can you create a JIRA? >&g

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
>> pyspark code just calls the scala code via py4j. I didn't expect that this >> bug is pyspark specific. That surprises me actually a bit. I created a >> ticket for this (SPARK-10731 >> <https://issues.apache.org/jira/browse/SPARK-10731>). >> >> Best

Re: Generic DataType in UDAF

2015-09-25 Thread Yin Huai
Hi Ritesh, Right now, we only allow specific data types defined in the inputSchema. Supporting abstract types (e.g. NumericType) may cause the logic of a UDAF to be more complex. It would be great to understand the use cases first. What kinds of input data types do you want to support, and d

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Yin Huai
Hello Michael, Thank you for reporting this issue. It will be fixed by https://github.com/apache/spark/pull/16080. Thanks, Yin On Tue, Nov 29, 2016 at 11:34 PM, Timur Shenkao wrote: > Hi! > > Do you have real HIVE installation? > Have you built Spark 2.1 & Spark 2.0 with HIVE support ( -Phive

[ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hi all, Apache Spark 2.1.0 is the second release of Spark 2.x line. This release makes significant strides in the production readiness of Structured Streaming, with added support for event time watermarks

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-29 Thread Yin Huai
Hello, We spent some time preparing artifacts and changes to the website (including the release notes). I just sent out the announcement. 2.1.0 is officially released. Thanks, Yin On Wed, Dec 28, 2016 at 12:42 PM, Justin Miller < justin.mil...@protectwise.com> wrote: > Interesting, because

Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
st time when I noticed rxin stepped back and a > new release manager stepped in. Congrats on your first ANNOUNCE! > > I can only expect even more great stuff coming in to Spark from the dev > team after Reynold spared some time 😉 > > Can't wait to read the changes... > > Ja

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Yin Huai
Hello Michael, In Spark SQL, we have our internal concepts of Output Partitioning (representing the partitioning scheme of an operator's output) and Required Child Distribution (representing the requirement of input data distribution of an operator) for a physical operator. Let's say we have two o

Re: Spark SQL - Column name including a colon in a SELECT clause

2015-02-10 Thread Yin Huai
Can you try using backticks to quote the field name? Like `f:price`. On Tue, Feb 10, 2015 at 5:47 AM, presence2001 wrote: > Hi list, > > I have some data with a field name of f:price (it's actually part of a JSON > structure loaded from ElasticSearch via elasticsearch-hadoop connector, but > I d
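For example (the table name is hypothetical):

    // Backticks make the parser treat f:price as a single column name.
    val prices = sqlContext.sql("SELECT `f:price` FROM docs")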

Re: Can we execute "create table" and "load data" commands against Hive inside HiveContext?

2015-02-10 Thread Yin Huai
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory was introduced in Hive 0.14 and Spark SQL only supports Hive 0.12 and 0.13.1. Can you change the setting of hive.security.authorization.manager to one accepted by 0.12 or 0.13.1? On Thu, Feb 5, 2015

Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Yin Huai
Regarding backticks: Right. You need backticks to quote the column name timestamp because timestamp is a reserved keyword in our parser. On Tue, Feb 10, 2015 at 3:02 PM, Mohnish Kodnani wrote: > actually i tried in spark shell , got same error and then for some reason > i tried to back tick the

Re: SQLContext.applySchema strictness

2015-02-13 Thread Yin Huai
Hi Justin, It is expected. We do not check if the provided schema matches rows since all rows need to be scanned to give a correct answer. Thanks, Yin On Fri, Feb 13, 2015 at 1:33 PM, Justin Pihony wrote: > Per the documentation: > > It is important to make sure that the structure of every

Re: [SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yin Huai
Seems you hit https://issues.apache.org/jira/browse/SPARK-4296. It has been fixed in 1.2.1 and 1.3. On Thu, Feb 26, 2015 at 1:22 PM, Yana Kadiyska wrote: > Can someone confirm if they can run UDFs in group by in spark1.2? > > I have two builds running -- one from a custom build from early Decemb

Re: Issues reading in Json file with spark sql

2015-03-02 Thread Yin Huai
Is the string of the above JSON object on a single line? jsonFile requires that every line is a JSON object or an array of JSON objects. On Mon, Mar 2, 2015 at 11:28 AM, kpeng1 wrote: > Hi All, > > I am currently having issues reading in a json file using spark sql's api. > Here is what the json

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
Regarding current_date, I think it is not in either Hive 0.12.0 or 0.13.1 (versions that we support). Seems https://issues.apache.org/jira/browse/HIVE-5472 added it to Hive recently. On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao wrote: > The temp table in metastore can not be shared cross SQLContext

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
@Shahab, based on https://issues.apache.org/jira/browse/HIVE-5472, current_date was added in Hive *1.2.0 (not 0.12.0)*. For my previous email, I meant current_date is in neither Hive 0.12.0 nor Hive 0.13.1 (Spark SQL currently supports these two Hive versions). On Tue, Mar 3, 2015 at 8:55 AM,

Re: Spark-SQL and Hive - is Hive required?

2015-03-06 Thread Yin Huai
Hi Edmon, No, you do not need to install Hive to use Spark SQL. Thanks, Yin On Fri, Mar 6, 2015 at 6:31 AM, Edmon Begoli wrote: > Does Spark-SQL require installation of Hive for it to run correctly or > not? > > I could not tell from this statement: > > https://spark.apache.org/docs/latest/s

Re: Spark 1.3 SQL Type Parser Changes?

2015-03-10 Thread Yin Huai
Hi Nitay, Can you try using backticks to quote the column name? Like org.apache.spark.sql.hive.HiveMetastoreTypes.toDataType( "struct<`int`:bigint>")? Thanks, Yin On Tue, Mar 10, 2015 at 2:43 PM, Michael Armbrust wrote: > Thanks for reporting. This was a result of a change to our DDL parser

Re: Spark SQL. Cast to Bigint

2015-03-13 Thread Yin Huai
Are you using SQLContext? Right now, the parser in the SQLContext is quite limited on the data type keywords that it handles (see here ) and unfortunately "bigint" is not hand

Re: Loading in json with spark sql

2015-03-13 Thread Yin Huai
Seems you want to use an array for the "providers" field, like "providers":[{"id": ...}, {"id":...}] instead of "providers":{{"id": ...}, {"id":...}} On Fri, Mar 13, 2015 at 7:45 PM, kpeng1 wrote: > Hi All, > > I was noodling around with loading in a json file into spark sql's hive > context and

Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
Seems "basic_null_diluted_d" was not resolved? Can you check if basic_null_diluted_d is in you table? On Tue, Mar 17, 2015 at 9:34 AM, Ophir Cohen wrote: > Hi Guys, > I'm registering a function using: > sqlc.registerFunction("makeEstEntry",ReutersDataFunctions.makeEstEntry _) > > Then I register

Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
check it soon and update. > Can you elaborate what does it mean the # and the number? Is that a > reference to the field in the rdd? > Thank you, > Ophir > On Mar 17, 2015 7:06 PM, "Yin Huai" wrote: > >> Seems "basic_null_diluted_d" was not resolved?

Re: HiveContext can't find registered function

2015-03-17 Thread Yin Huai
ean > 'resolved attribute'? > On Mar 17, 2015 8:14 PM, "Yin Huai" wrote: > >> The number is an id we used internally to identify an resolved Attribute. >> Looks like basic_null_diluted_d was not resolved since there is no id >> associated with it. >>

Re: Date and decimal datatype not working

2015-03-17 Thread Yin Huai
p(0) is a String. So, you need to explicitly convert it to a Long, e.g. p(0).trim.toLong. You also need to do it for p(2). For the BigDecimal values, you need to create BigDecimal objects from your String values. On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA wrote: > Hi All, > > I am very ne
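A small sketch of the conversions, assuming each line is split into an Array[String] p where positions 0 and 2 hold long IDs and position 1 holds a decimal amount (the positions are hypothetical):

    import org.apache.spark.sql.Row

    // Convert explicitly before building the Row, since the schema expects Long / BigDecimal values.
    val row = Row(p(0).trim.toLong, BigDecimal(p(1).trim), p(2).trim.toLong)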

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Yin Huai
Hi Roberto, For now, if "timestamp" is a top-level column (not a field in a struct), you can use backticks to quote the column name, like `timestamp`. Thanks, Yin On Wed, Mar 18, 2015 at 12:10 PM, Roberto Coluccio < roberto.coluc...@gmail.com> wrote: > Hey Cheng, thank you so much for

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Yin Huai
Seems there are too many distinct groups processed in a task, which triggers the problem. How many files does your dataset have, and how large is each file? Seems your query will be executed with two stages: table scan and map-side aggregation in the first stage, and the final round of reduce-side aggrega

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
Hi Christian, Your table is stored correctly in Parquet format. For saveAsTable, the table created is *not* a Hive table, but a Spark SQL data source table ( http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources). We are only using Hive's metastore to store the metadata (to b

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
Created https://issues.apache.org/jira/browse/SPARK-6413 to track the improvement of the output of the DESCRIBE statement. On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai wrote: > Hi Christian, > > Your table is stored correctly in Parquet format. > > For saveAsTable, the table created

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yin Huai
Yin, > > Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The > number of tasks launched is equal to the number of parquet files. Do you > have any idea on how to deal with this situation? > > Thanks a lot > On 18 Mar 2015 17:35, "Yin Huai" wrote: >

Re: Spark SQL filter DataFrame by date?

2015-03-19 Thread Yin Huai
Can you add your code snippet? Seems it's missing in the original email. Thanks, Yin On Thu, Mar 19, 2015 at 3:22 PM, kamatsuoka wrote: > I'm trying to filter a DataFrame by a date column, with no luck so far. > Here's what I'm doing: > > > > When I run reqs_day.count() I get zero, apparently

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yin Huai
umber of tasks and again the same error. :( >> >> Thanks a lot! >> >> On 19 March 2015 at 17:40, Yiannis Gkoufas wrote: >> >>> Hi Yin, >>> >>> thanks a lot for that! Will give it a shot and let you know. >>> >>> On

Re: Use pig load function in spark

2015-03-23 Thread Yin Huai
Hello Kevin, You can take a look at our generic load function . For example, you can use val df = sqlContext.load("/myData", "parquet") To load a parquet dataset stored in "/myData" as a DataFrame

Re: Date and decimal datatype not working

2015-03-23 Thread Yin Huai
help me to write a proper syntax to store output in a CSV file? > > > > > > Thanks & Regards > > --- > > Ananda Basak > > Ph: 425-213-7092 > > > > *From:* BASAK, ANANDA > *Sent:* Tuesday, March 17, 2015 3:08 PM > *To:* Yin Hua

Re: rdd.toDF().saveAsParquetFile("tachyon://host:19998/test")

2015-03-27 Thread Yin Huai
You are hitting https://issues.apache.org/jira/browse/SPARK-6330. It has been fixed in 1.3.1, which will be released soon. On Fri, Mar 27, 2015 at 10:42 PM, sud_self <852677...@qq.com> wrote: > spark version is 1.3.0 with tanhyon-0.6.1 > > QUESTION DESCRIPTION: rdd.saveAsObjectFile("tachyon://hos

Re: [SQL] Simple DataFrame questions

2015-04-02 Thread Yin Huai
For cast, you can use selectExpr method. For example, df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2"). Or, df.select(df("colA").cast("int"), ...) On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust wrote: > val df = Seq(("test", 1)).toDF("col1", "col2") > > You can use SQ

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
Hi Justin, Does the schema of your data have any decimal, array, map, or struct type? Thanks, Yin On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip wrote: > Hello, > > I have a parquet file of around 55M rows (~ 1G on disk). Performing simple > grouping operation is pretty efficient (I get results w

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
gt; > On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai wrote: > >> Hi Justin, >> >> Does the schema of your data have any decimal, array, map, or struct type? >> >> Thanks, >> >> Yin >> >> On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip >> wrote

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Can you share code that can reproduce the problem? On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda wrote: > Hi > > As per JIRA this issue is resolved, but i am still facing this issue. > > SPARK-2734 - DROP TABLE should also uncache table > > > -- > > [image: Sigmoid Analytics]

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
; On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai wrote: > >> Can your code that can reproduce the problem? >> >> On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda < >> ar...@sigmoidanalytics.com> wrote: >> >>> Hi >>> >>> As per JIRA th

Re: Spark SQL query key/value in Map

2015-04-16 Thread Yin Huai
For a MapType column, fields['driver'] is the syntax to retrieve the map value (in the schema, you can see "fields: map"). The fields.driver syntax is used for struct types. On Thu, Apr 16, 2015 at 12:37 AM, jc.francisco wrote: > Hi, > > I'm new with both Cassandra and Spark and am experimentin
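For example, against a hypothetical readings table with that schema:

    // Bracket syntax looks up a MapType value by key; dot syntax is reserved for StructType fields.
    val drivers = sqlContext.sql("SELECT fields['driver'] AS driver FROM readings")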

Re: dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Yin Huai
Hi Neal, spark.sql.shuffle.partitions is the property to control the number of tasks after shuffle (to generate t2, there is a shuffle for the aggregations specified by groupBy and agg.) You can use sqlContext.setConf("spark.sql.shuffle.partitions", "newNumber") or sqlContext.sql("set spark.sql.sh

Re: dataframe can not find fields after loading from hive

2015-04-19 Thread Yin Huai
Hi Cesar, Can you try 1.3.1 ( https://spark.apache.org/releases/spark-release-1-3-1.html) and see if it still shows the error? Thanks, Yin On Fri, Apr 17, 2015 at 1:58 PM, Reynold Xin wrote: > This is strange. cc the dev list since it might be a bug. > > > > On Thu, Apr 16, 2015 at 3:18 PM, C

Re: Parquet Hive table become very slow on 1.3?

2015-04-22 Thread Yin Huai
Xudong and Rex, Can you try 1.3.1? With PR 5339, after we get a Hive Parquet table from the metastore and convert it to our native Parquet code path, we will cache the converted relation. For now, the first access to that Hive Parquet table reads all of the footer

Re: Bug? Can't reference to the column by name after join two DataFrame on a same name key

2015-04-23 Thread Yin Huai
Hi Shuai, You can use "as" to create a table alias. For example, df1.as("df1"). Then you can use $"df1.col" to refer it. Thanks, Yin On Thu, Apr 23, 2015 at 11:14 AM, Shuai Zheng wrote: > Hi All, > > > > I use 1.3.1 > > > > When I have two DF and join them on a same name key, after that, I ca
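A short sketch of the alias trick, assuming the SQL implicits are in scope for $ (df2, the key column, and the selected columns are hypothetical):

    // Alias both sides so columns with the same name can be referenced unambiguously.
    val joined = df1.as("a").join(df2.as("b"), $"a.key" === $"b.key")
    val picked = joined.select($"a.key", $"b.value")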

Re: Convert DStream to DataFrame

2015-04-24 Thread Yin Huai
Hi Sergio, I missed this thread somehow... For the error "case classes cannot have more than 22 parameters.", it is a limitation of Scala (see https://issues.scala-lang.org/browse/SI-7296). You can follow the instructions at https://spark.apache.org/docs/latest/sql-programming-guide.html#programm
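A sketch of the programmatic-schema route from that guide, which sidesteps the 22-field case class limit. The field count and types are made up, and it assumes a DStream[Seq[String]] named stream plus an existing sqlContext:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructField, StringType, StructType}

    // Build the schema from a list of names instead of a case class.
    val fieldNames = (1 to 30).map(i => s"col$i")
    val schema = StructType(fieldNames.map(n => StructField(n, StringType, nullable = true)))

    stream.foreachRDD { rdd =>
      val rowRDD = rdd.map(values => Row.fromSeq(values))
      val df = sqlContext.createDataFrame(rowRDD, schema)
      df.registerTempTable("events")
    }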

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
The exception looks like the one mentioned in https://issues.apache.org/jira/browse/SPARK-4520. What is the version of Spark? On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang wrote: > Hi, > > My data looks like this: > > +---++--+ > | col_name |

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
i, Apr 24, 2015 at 11:00 AM, Yin Huai wrote: > >> The exception looks like the one mentioned in >> https://issues.apache.org/jira/browse/SPARK-4520. What is the version of >> Spark? >> >> On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Hu

Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-08 Thread Yin Huai
Can you attach the full stack trace? Thanks, Yin On Fri, May 8, 2015 at 4:44 PM, barmaley wrote: > Given a registered table from data frame, I'm able to execute queries like > sqlContext.sql("SELECT STDDEV(col1) FROM table") from Spark Shell just > fine. > However, when I run exactly the same

Re: Problems with connecting Spark to Hive

2014-06-03 Thread Yin Huai
Hello Lars, Can you check the value of "hive.security.authenticator.manager" in hive-site.xml? I guess the value is "org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator". This class was introduced in hive 0.13, but Spark SQL is based on hive 0.12 right now. Can you change the value of "hive.

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Yin Huai
Hi Durin, I guess that blank lines caused the problem (like Aaron said). Right now, jsonFile does not skip faulty lines. Can you first use sc.textFile to load the file as an RDD[String] and then use filter to filter out those blank lines (code snippet can be found below)? val sqlContext = new org.ap
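A sketch of that workaround (the path is hypothetical); jsonRDD takes an RDD[String] of JSON, so the blank lines can be filtered out first:

    // Load as plain text, drop blank lines, then parse the cleaned RDD as JSON.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val cleaned = sc.textFile("/path/to/data.json").filter(_.trim.nonEmpty)
    val table = sqlContext.jsonRDD(cleaned)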

Re: jsonFile function in SQLContext does not work

2014-06-26 Thread Yin Huai
Yes. It will be added in later versions. Thanks, Yin On Wed, Jun 25, 2014 at 3:39 PM, durin wrote: > Hi Yin an Aaron, > > thanks for your help, this was indeed the problem. I've counted 1233 blank > lines using grep, and the code snippet below works with those. > > From what you said, I guess

Re: Spark SQL : Join throws exception

2014-07-01 Thread Yin Huai
Seems it is a bug. I have opened https://issues.apache.org/jira/browse/SPARK-2339 to track it. Thank you for reporting it. Yin On Tue, Jul 1, 2014 at 12:06 PM, Subacini B wrote: > Hi All, > > Running this join query > sql("SELECT * FROM A_TABLE A JOIN B_TABLE B WHERE > A.status=1").collect

Re: Spark SQL : Join throws exception

2014-07-07 Thread Yin Huai
Hi Subacini, Just want to follow up on this issue. SPARK-2339 has been merged into the master and 1.0 branch. Thanks, Yin On Tue, Jul 1, 2014 at 2:00 PM, Yin Huai wrote: > Seems it is a bug. I have opened > https://issues.apache.org/jira/browse/SPARK-2339 to track it. > > T
