Re: [PSA] Python 2, 3.4 and 3.5 are now dropped

2020-07-13 Thread Hyukjin Kwon
Cc'ing the user mailing list too. On Tue, Jul 14, 2020 at 11:27 AM, Hyukjin Kwon wrote: > I am sending another email to make sure dev people know. Python 2, 3.4 and > 3.5 are now dropped at https://github.com/apache/spark/pull/28957. > > >

PySpark documentation main page

2020-08-01 Thread Hyukjin Kwon
Hi all, I am trying to write up the main page of PySpark documentation at https://github.com/apache/spark/pull/29320. While I think the current proposal might be good enough, I would like to collect more feedback about the contents, structure and image since this is the entrance page of PySpark d

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Hyukjin Kwon
Nice summary. Thanks Dongjoon. One minor correction -> I believe we dropped R 3.5 and below at branch 2.4 as well. On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, wrote: > Hi, All. > > As of today, master branch (Apache Spark 3.1.0) resolved > 852+ JIRA issues and 606+ issues are 3.1.0-only patches. >

Re: [SparkR] gapply with strings with arrow

2020-10-10 Thread Hyukjin Kwon
If it works without Arrow optimization, it's likely a bug. Please feel free to file a JIRA for that. On Wed, 7 Oct 2020, 22:44 Jacek Pliszka, wrote: > Hi! > > Is there any place I can find information how to use gapply with arrow? > > I've tried something very simple > > collect(gapply( > df,

[ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Hyukjin Kwon
We are excited to announce Spark 3.1.1 today. Apache Spark 3.1.1 is the second release of the 3.x line. This release adds Python type annotations and Python dependency management support as part of Project Zen. Other major updates include improved ANSI SQL compliance support, history server suppor

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Hyukjin Kwon
es / >>>>>> Greenplum >>>>>> with Spark SQL and DataFrames, 10~100x faster.* >>>>>> *spark-func-extras <https://github.com/yaooqinn/spark-func-extras>A >>>>>> library that brings excellent and useful functions fro

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Hyukjin Kwon
Awesome! On Wed, Jun 2, 2021 at 9:59 AM, Dongjoon Hyun wrote: > We are happy to announce the availability of Spark 3.1.2! > > Spark 3.1.2 is a maintenance release containing stability fixes. This > release is based on the branch-3.1 maintenance branch of Spark. We strongly > recommend all 3.1 users to u

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
Thanks for pinging me, Sean. Yes, there's an optimization in DataFrame.collect which tries to collect the first few partitions and checks whether the requested number of rows has been found (and repeats if not). DataFrame.toPandas does not have such an optimization. I suspect that the shuffle isn't an actual shuffle but just collects lo

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
here. We could have a configuration to enable and disable it, but the implementation of this in DataFrame.toPandas would be complicated due to existing optimizations such as Arrow. I haven't taken a deeper look, but my gut says it's not worthwhile. On Sat, Nov 13, 2021 at 12:05 PM Hyuk

Re: [R] SparkR on conda-forge

2021-12-19 Thread Hyukjin Kwon
Awesome! On Mon, 20 Dec 2021 at 09:43, yonghua wrote: > Nice release. thanks for sharing. > > On 2021/12/20 3:55, Maciej wrote: > > FYI ‒ thanks to good folks from conda-forge we have now these: > > - > To unsubscribe e-mail: us

Re: Conda Python Env in K8S

2021-12-24 Thread Hyukjin Kwon
Can you share the logs, settings, environment, etc. and file a JIRA? There are integration test cases for K8S support, and I myself also tested it before. It would be helpful if you try what I did at https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html and see if

Re: Stickers and Swag

2022-06-14 Thread Hyukjin Kwon
Woohoo On Tue, 14 Jun 2022 at 15:04, Xiao Li wrote: > Hi, all, > > The ASF has an official store at RedBubble > that Apache Community > Development (ComDev) runs. If you are interested in buying Spark Swag, 70 > products featuring the Spark logo are

Re: [Feature Request] make unix_micros() and unix_millis() available in PySpark (pyspark.sql.functions)

2022-10-16 Thread Hyukjin Kwon
You can work around it by leveraging expr, e.g., expr("unix_micros(col)") for now. We should better have the Scala binding first before we have the Python one, FWIW. On Sat, 15 Oct 2022 at 06:19, Martin wrote: > Hi everyone, > > In *Spark SQL* there are several timestamp related functions > >- unix_micro
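A minimal Scala sketch of the expr() workaround described above; the sample DataFrame and its timestamp column "ts" are illustrative assumptions:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Hypothetical sample data: one timestamp column named "ts".
val df = Seq("2022-10-15 06:19:00").toDF("s").selectExpr("CAST(s AS TIMESTAMP) AS ts")

// unix_micros has no dedicated function binding here, but expr() reaches the SQL function directly.
df.select(expr("unix_micros(ts)").as("ts_micros")).show()
```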

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Hyukjin Kwon
Thanks, Yuming. On Wed, 26 Oct 2022 at 16:01, L. C. Hsieh wrote: > Thank you for driving the release of Apache Spark 3.3.1, Yuming! > > On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun > wrote: > > > > It's great. Thank you so much, Yuming! > > > > Dongjoon > > > > On Tue, Oct 25, 2022 at 11:23 P

Re: Slack for PySpark users

2023-03-27 Thread Hyukjin Kwon
Yeah, actually I think we had better have a Slack channel so we can easily discuss with users and developers. On Tue, 28 Mar 2023 at 03:08, keen wrote: > Hi all, > I really like *Slack *as communication channel for a tech community. > There is a Slack workspace for *delta lake users* ( > http

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks! On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: > > Thanks Dongjoon ! > > Regards, > Mridul > > On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > >> We are happy to announce the availability of Apache Spark 3.4.1! >> >> Spark 3.4.1 is a maintenance release containing st

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing. On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri wrote: > This is wonderful news! > > On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > >> Dear Apache Spark community, >> >> We are delighted to announce the launch of a groundbreaking tool that >> aims to make Apache

Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome! On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun wrote: > Hi, All. > > As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community > starts to have test coverage for all supported Python versions from Today. > > - https://github.com/apache/spark/actions/runs/7061665420 > > Her

Re: Architecture of Spark Connect

2023-12-14 Thread Hyukjin Kwon
By default for now, yes. One Spark Connect server handles multiple Spark Sessions. To multiplex or run multiple drivers, you need some extra work such as a gateway. On Thu, 14 Dec 2023 at 12:03, Kezhi Xiong wrote: > Hi, > > My understanding is there is only one driver/spark context for all user > sessio

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
Just FYI, the streaming Python data source is in progress at https://github.com/apache/spark/pull/44416; we will likely release this in Spark 4.0. On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович wrote: > Yes, it's actual data. > > > > Best regards, > > Stanislav Porotikov > > > > *From:*

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428? cc @Yang,Jie(INF) On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim wrote: > Shall we revisit this functionality? The API doc is built with individual > versions, and for each individual version we depend on other released > versions. This

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is SparkR releases in Conda channel ( https://github.com/conda-forge/r-sparkr-feedstock). This is fully run by the community unofficially. On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh wrote: > +1 for me > > Mich Talebzadeh, > Dad | Technologist | Solutions Architect | Engi

Documentation for "hidden" RESTful API for submitting jobs (not history server)

2016-03-14 Thread Hyukjin Kwon
Hi all, While googling Spark, I accidentally found a RESTful API existing in Spark for submitting jobs. The link is here, http://arturmkrtchyan.com/apache-spark-hidden-rest-api As Josh said, I can see the history of this RESTful API, https://issues.apache.org/jira/browse/SPARK-5388 and also goo

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread Hyukjin Kwon
Hi, For RESTful API for submitting an application, please take a look at this link. http://arturmkrtchyan.com/apache-spark-hidden-rest-api On 26 Mar 2016 12:07 p.m., "vetal king" wrote: > Prateek > > It's possible to submit spark application from outside application. If you > are using java the

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-28 Thread Hyukjin Kwon
Could I ask which version you are using? It looks like the cause is the empty line right after the header (because that case is not being checked in tests). However, empty lines before the header or inside the data are being tested. https://raw.githubusercontent.com/databricks/spark-csv/master/src/

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-29 Thread Hyukjin Kwon
Hi, I guess this is not a CSV-datasource-specific problem. Does loading any file (e.g. textFile()) work as well? I think this is related to this thread, http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html . 2016-03-

Re: Spark/Parquet

2016-04-14 Thread Hyukjin Kwon
Currently Spark uses Parquet 1.7.0 (parquet-mr). If you meant writer version 2 (parquet-format), you can specify this manually by setting it as below: sparkContext.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION, ParquetProperties.WriterVersion.PARQUET_2_0.toString) 2016-04-15 2:21 GMT+0
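A hedged sketch of that setting as it might look in a Spark 1.x spark-shell (where `sc` and `sqlContext` are predefined); the output path is illustrative:

```
import org.apache.parquet.column.ParquetProperties
import org.apache.parquet.hadoop.ParquetOutputFormat

// Ask parquet-mr to use the version 2 writer; Spark otherwise leaves the default (version 1).
sc.hadoopConfiguration.set(
  ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_2_0.toString)

sqlContext.range(10).write.mode("overwrite").parquet("/tmp/parquet_v2_sample")
```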

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Hyukjin Kwon
Hi, String comparison itself is pushed down fine, but the problem is dealing with Cast. It was pushed down before but it was reverted ( https://github.com/apache/spark/pull/8049). Several fixes were tried here, https://github.com/apache/spark/pull/11005 etc., but there were no changes to ma

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Hyukjin Kwon
I hope it was not too late :). It is possible. Please check csvRdd api here, https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150 . Thanks! On 2 Apr 2016 2:47 a.m., "Benjamin Kim" wrote: > Does anyone know if this is possible? I have

Re: In-Memory Only Spark Shuffle

2016-04-15 Thread Hyukjin Kwon
This reminds me of this Jira, https://issues.apache.org/jira/browse/SPARK-3376 and this PR, https://github.com/apache/spark/pull/5403. AFAIK, it is not and won't be supported. On 2 Apr 2016 4:13 a.m., "slavitch" wrote: > Hello; > > I’m working on spark with very large memory systems (2TB+) and n

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Hyukjin Kwon
l learning Scala on my > own. Can you help me to start? > > Thanks, > Ben > > On Apr 15, 2016, at 8:02 AM, Hyukjin Kwon wrote: > > I hope it was not too late :). > > It is possible. > > Please check csvRdd api here, > https://github.com/databricks/spa

Re: JSON Usage

2016-04-17 Thread Hyukjin Kwon
Hi! Personally, I don't think it necessarily needs to be a Dataset for your goal. Just select your data at "s3" from the DataFrame loaded by sqlContext.read.json(). You can try printSchema() to check the nested schema and then select the data. Also, I guess (from your code) you are trying to send
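A small sketch of the approach described above; the path and the nested "s3" field layout are assumptions taken from the thread, in a spark-shell where `sqlContext` is predefined:

```
// Load line-delimited JSON and inspect the inferred nested schema first.
val df = sqlContext.read.json("/tmp/events.json")
df.printSchema()

// Then select only the nested "s3" data.
df.select("s3").show(false)
```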

Re: WELCOME to user@spark.apache.org

2016-04-17 Thread Hyukjin Kwon
Hi Jinan, There are some examples for XML here, https://github.com/databricks/spark-xml/blob/master/src/test/java/com/databricks/spark/xml/JavaXmlSuite.java for test codes. Or, you can see documentation in README.md. https://github.com/databricks/spark-xml#java-api. There are other basic Java

Re: How does .jsonFile() work?

2016-04-19 Thread Hyukjin Kwon
Hi, I hope I understood correctly. This is a simplified procedure. Preconditions: - The JSON file is written line by line; each line is a JSON document. - A root array is supported, e.g. [{...}, {...}, {...}] Procedure: - Schema inference (if a user schema is not given) 1. R

Re: XML Data Source for Spark

2016-04-25 Thread Hyukjin Kwon
Hi Janan, Sorry, I was sleeping. I guess you sent an email to me first and then asked on the mailing list because I was not answering. I just tested this to double-check and could reproduce the same exception below: java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less; a

Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
Could you maybe share your codes? On 26 Apr 2016 9:51 p.m., "Ramkumar V" wrote: > Hi, > > I had loaded JSON file in parquet format into SparkSQL. I can't able to > read List which is inside JSON. > > Sample JSON > > { > "TOUR" : { > "CITIES" : ["Paris","Berlin","Prague"] > }, > "BUDJET" : 10

Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
) { public String call(Row row) throws Exception { return row.getString(1); } }); *Thanks*, <https://in.linkedin.com/in/ramkumarcs31> On Tue, Apr 26, 2016 at 3:48 PM, Hyukjin Kwon wrote: > Could you maybe share your codes? > On 26 Apr 2016 9:51 p.m., &quo

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
And also https://spark.apache.org/docs/1.6.0/programming-guide.html If the file is single file, then this would not be distributed. On 26 Apr 2016 11:52 p.m., "Ted Yu" wrote: > Please take a look at: > core/src/main/scala/org/apache/spark/SparkContext.scala > >* Do `val rdd = sparkContext.wh

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
The wholeTextFiles() API uses WholeTextFileInputFormat, https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala, which returns false for isSplittable. In this case, only a single mapper appears for the entire f

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
EDIT: not a mapper but a task for HadoopRDD, maybe, as far as I know. I think the clearest way is just to run a job on multiple files with the API and check the number of tasks in the job. On 27 Apr 2016 12:06 a.m., "Hyukjin Kwon" wrote: wholeTextFile() API uses WholeTextFileInputFor
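An illustrative way to run that check in a spark-shell, assuming a directory of small text files:

```
// Each file becomes one record; the partition count of the RDD determines the number of tasks.
val rdd = sc.wholeTextFiles("/tmp/many-small-files/*.txt")
println(rdd.partitions.length)

// Run a simple job and compare against the task count shown in the Spark UI.
rdd.count()
```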

Re: removing header from csv file

2016-04-26 Thread Hyukjin Kwon
There are two ways to do so. Firstly, this way will make sure it cleanly skips the header, but of course the use of mapPartitionsWithIndex decreases performance: rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter } Secondly, you can do val header = rdd.first() val data = rd
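A sketch completing the second approach above, under the assumption that no data line is identical to the header (path is illustrative):

```
val rdd = sc.textFile("/tmp/data.csv")

// Drop the header by filtering out the first line's exact contents.
val header = rdd.first()
val data = rdd.filter(_ != header)
```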

Re: Error in spark-xml

2016-04-30 Thread Hyukjin Kwon
Hi Sourav, I think it is an issue. spark-xml assumes the element identified by rowTag is an object. Could you please open an issue at https://github.com/databricks/spark-xml/issues? Thanks! 2016-05-01 5:08 GMT+09:00 Sourav Mazumder : > Hi, > > Looks like there is a problem in spark-xml if the xml

Re: Error in spark-xml

2016-05-01 Thread Hyukjin Kwon
I tested this with the code below: val path = "path-to-file" sqlContext.read .format("xml") .option("rowTag", "bkval") .load(path) .show() Thanks! 2016-05-01 15:11 GMT+09:00 Hyukjin Kwon : > Hi Sourav, > > I think it is an iss

Re: Parse Json in Spark

2016-05-08 Thread Hyukjin Kwon
I remember this Jira, https://issues.apache.org/jira/browse/SPARK-7366. Parsing multiple lines is not supported in the JSON data source. Instead, this can be done with sc.wholeTextFiles(). I found some examples here, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files Altho
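A minimal sketch of the wholeTextFiles() approach, assuming each file holds one (possibly multi-line) JSON document and a spark-shell with `sc` and `sqlContext` predefined:

```
// One whole file per record; pass the raw strings to the JSON reader.
val jsonRdd = sc.wholeTextFiles("/tmp/multiline-json/*.json").map(_._2)
val df = sqlContext.read.json(jsonRdd)
df.printSchema()
```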

Re: XML Processing using Spark SQL

2016-05-12 Thread Hyukjin Kwon
Hi Arunkumar, I guess your records are self-closing ones. There is an issue open here, https://github.com/databricks/spark-xml/issues/92 This is about XmlInputFormat.scala, and it seems a bit tricky to handle the case, so I have left it open until now. Thanks! 2016-05-13 5:03 GMT+09:00 Arunkumar Chan

Re: Does spark support Apache Arrow

2016-05-19 Thread Hyukjin Kwon
FYI, there is a JIRA for this, https://issues.apache.org/jira/browse/SPARK-13534 I hope this link is helpful. Thanks! 2016-05-20 11:18 GMT+09:00 Sun Rui : > 1. I don’t think so > 2. Arrow is for in-memory columnar execution. While cache is for in-memory > columnar storage > > On May 20, 2016,

Re: Writing empty Dataframes doesn't save any _metadata files in Spark 1.5.1 and 1.6

2016-06-14 Thread Hyukjin Kwon
Yea, I met this case before. I guess this is related with https://issues.apache.org/jira/browse/SPARK-15393. 2016-06-15 8:46 GMT+09:00 antoniosi : > I tried the following code in both Spark 1.5.1 and Spark 1.6.0: > > import org.apache.spark.sql.types.{ > StructType, StructField, StringType, I

Re: Writing empty Dataframes doesn't save any _metadata files in Spark 1.5.1 and 1.6

2016-06-14 Thread Hyukjin Kwon
reverted. I wrote your case in the comments in that JIRA. 2016-06-15 10:26 GMT+09:00 Hyukjin Kwon : > Yea, I met this case before. I guess this is related with > https://issues.apache.org/jira/browse/SPARK-15393. > > 2016-06-15 8:46 GMT+09:00 antoniosi : > >> I tried the f

Re: how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Hyukjin Kwon
It will 'auto-detect' the compression codec by the file extension and then will decompress and read it correctly. Thanks! 2016-06-16 20:27 GMT+09:00 Vamsi Krishna : > Hi, > > I'm using Spark 1.4.1 (HDP 2.3.2). > As per the spark-csv documentation ( > https://github.com/databricks/spark-csv), I s
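An illustrative read with spark-csv, where the `.gz` extension alone triggers decompression (path and options are assumptions):

```
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/data.csv.gz")   // codec detected from the extension
df.show()
```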

Re: Processing json document

2016-07-06 Thread Hyukjin Kwon
There is a good link for this here, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files If there are a lot of small files, then it would work pretty well in a distributed manner, but I am worried if it is a single large file. In this case, this would only work in single

Re: Processing json document

2016-07-06 Thread Hyukjin Kwon
d need to have a look at it, but one large file does > not mean one Executor independent of the underlying format. > > On 07 Jul 2016, at 08:12, Hyukjin Kwon wrote: > > There is a good link for this here, > http://searchdatascience.com/spark-adventures-1-processing-multi-line-js

RE: Processing json document

2016-07-07 Thread Hyukjin Kwon
;, "lastName":"Smith" }, { "firstName":"Peter", "lastName":"Jones"} ] } On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon wrote: The link uses wholeTextFiles() API which treats each file as each record. 2016-

Re: Large files with wholetextfile()

2016-07-12 Thread Hyukjin Kwon
Otherwise, please consider using https://github.com/databricks/spark-xml. Actually, there is a function to find the input file name, which is the input_file_name function: https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/func
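A sketch of tagging each row with its source file via input_file_name(); the row tag and path are hypothetical:

```
import org.apache.spark.sql.functions.input_file_name

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")          // hypothetical row tag
  .load("/tmp/xml-dir")

// Adds the originating file path as a column on every row.
df.withColumn("source_file", input_file_name()).show(false)
```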

Re: java.lang.RuntimeException: Unsupported type: vector

2016-07-24 Thread Hyukjin Kwon
I just wonder what your CSV data structure looks like. If my understanding is correct, the SQL type of VectorUDT is StructType, and the CSV data source does not support ArrayType and StructType. Anyhow, it seems CSV does not support UDTs for now anyway. https://github.com/apache/spark/blob/e1dc85373

Re: spark java - convert string to date

2016-07-31 Thread Hyukjin Kwon
I haven't used this myself but I guess those functions should work: unix_timestamp() See https://github.com/apache/spark/blob/480c870644595a71102be6597146d80b1c0816e4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2513-L2530 2016-07-31 22:57 GMT+09:00 Tony Lane : > Any bu
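A minimal sketch of the unix_timestamp() route in a spark-shell, assuming strings in a hypothetical "dd-MM-yyyy" pattern:

```
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}
import sqlContext.implicits._

val df = Seq("31-12-2016", "01-01-2017").toDF("date_str")   // sample strings

// Parse to a timestamp with an explicit pattern, then truncate to a date.
val parsed = df.withColumn(
  "date", to_date(unix_timestamp(col("date_str"), "dd-MM-yyyy").cast("timestamp")))
parsed.printSchema()
```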

Re: DataFramesWriter saving DataFrames timestamp in weird format

2016-08-11 Thread Hyukjin Kwon
Do you mind if I ask which format you used to save the data? I guess you used CSV and there is a related PR open here https://github.com/apache/spark/pull/14279#issuecomment-237434591 2016-08-12 6:04 GMT+09:00 Jestin Ma : > When I load in a timestamp column and try to save it immediately witho

Re: Flattening XML in a DataFrame

2016-08-12 Thread Hyukjin Kwon
Hi Sreekanth, Assuming you are using Spark 1.x, I believe this code below: sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "emp").load("/tmp/sample.xml") .selectExpr("manager.id", "manager.name", "explode(manager.subordinates.clerk) as clerk") .selectExpr("id", "name", "c

Re: Flattening XML in a DataFrame

2016-08-16 Thread Hyukjin Kwon
> Please suggest. Thanks in advance. > > > > Thanks, > > Sreekanth > > > > *From:* Sreekanth Jella [mailto:srikanth.je...@gmail.com] > *Sent:* Sunday, August 14, 2016 11:46 AM > *To:* 'Hyukjin Kwon' > *Cc:* 'user @spark' > *Subje

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Hi Efe, If my understanding is correct, writing/reading complex types is not supported because the CSV format can't represent nested types in its own format. I guess supporting them for writing in the external CSV library is rather a bug. I think it'd be great if we can write and read back CSV in it

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
just a dataset where > every record is a case class with only simple types as fields, strings and > dates. There's no nesting. > > That's what confuses me about how it's interpreting the schema. The schema > seems to be one complex field rather than a bunch of simple fields

Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Ah, BTW, there is an issue, SPARK-16216, about printing dates and timestamps here. So please ignore the integer values for dates 2016-08-19 9:54 GMT+09:00 Hyukjin Kwon : > Ah, sorry, I should have read this carefully. Do you mind if I ask your > codes to test? > > I would like

Re: what is the difference between coalese() and repartition() ?Re: trouble understanding data frame memory usage ³java.io.IOException: Unable to acquire memory²

2015-12-28 Thread Hyukjin Kwon
Hi Andy, This link explains the difference well. https://bzhangusc.wordpress.com/2015/08/11/repartition-vs-coalesce/ Simply, the difference is whether it "shuffles" partitions or not. Actually, coalesce() with shuffling performs exactly like repartition(). On 29 Dec 2015 08:10, "Andy Davidson
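A short RDD-level sketch of the distinction; on RDDs, repartition(n) is defined as coalesce(n, shuffle = true):

```
val rdd = sc.parallelize(1 to 1000, 100)

// Narrows to 10 partitions without a shuffle (upstream partitions are merged).
val narrowed = rdd.coalesce(10)

// These two are equivalent: both force a full shuffle.
val reshuffled = rdd.repartition(10)
val sameThing  = rdd.coalesce(10, shuffle = true)
```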

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-28 Thread Hyukjin Kwon
Hi Divya, Are you using or have you tried Spark CSV datasource https://github.com/databricks/spark-csv ? Thanks! 2015-12-28 18:42 GMT+09:00 Divya Gehlot : > Hi, > I have input data set which is CSV file where I have date columns. > My output will also be CSV file and will using this output CSV

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-29 Thread Hyukjin Kwon
UMN6: > string, COLUMN7: int, COLUMN8: int, COLUMN9: string, COLUMN10: int, COLUMN11: > int, COLUMN12: int, COLUMN13: string, COLUMN14: string, COLUMN15: string, > COLUMN16: string, COLUMN17: string, COLUMN18: string, COLUMN19: string, > COLUMN20: string, COLUMN21: string, COLUMN22: stri

Re: NA value handling in sparkR

2016-01-27 Thread Hyukjin Kwon
Hm.. As far as I remember, you can set the value to treat as null with the *nullValue* option. I am hitting network issues with GitHub so I can't check this now, but please try that option as described in https://github.com/databricks/spark-csv. 2016-01-28 0:55 GMT+09:00 Felix Cheung : > That

Re: Reading lzo+index with spark-csv (Splittable reads)

2016-01-31 Thread Hyukjin Kwon
Hm.. As I said here https://github.com/databricks/spark-csv/issues/245#issuecomment-177682354, it sounds reasonable in a way. To me, though, this might address only some narrow use cases. How about using csvRdd(), https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databrick

Re: spark-xml data source (com.databricks.spark.xml) not working with spark 1.6

2016-02-25 Thread Hyukjin Kwon
Hi, it looks like you forgot to specify the "rowTag" option, which is "book" in the case of the sample data. Thanks 2016-01-29 8:16 GMT+09:00 Andrés Ivaldi : > Hi, could you get it work, tomorrow I'll be using the xml parser also, On > windows 7, I'll let you know the results. > > Regards, > > > >

Fixed writer version as version 1 for Parquet when writing a Parquet file

2015-10-08 Thread Hyukjin Kwon
Hi all, While writing some Parquet files with Spark, I found it actually only writes the Parquet files with writer version 1. This affects the encoding types of the file. Is this intentionally fixed for some reason? I changed the code and tested writing as writer version 2, and it looks fine. In more d

Filter applied on merged Parquet schemas with new column fails.

2015-10-27 Thread Hyukjin Kwon
When enabling mergeSchema and predicate filtering, this fails since Parquet filters are pushed down regardless of the schema of each split (or rather each file). Dominic Ricard reported this issue ( https://issues.apache.org/jira/browse/SPARK-11103) Even though this would work okay by setting spark.sq

Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Hi all, I am writing this email to both the user group and the dev group since this is applicable to both. I am now working on the Spark XML datasource ( https://github.com/databricks/spark-xml). This uses an InputFormat implementation which I downgraded to Hadoop 1.x for version compatibility. However, I fo

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
a big change to 2.x API. if you agree, I can do, but I cannot > promise the time within one or two weeks because of my daily job. > > > > > > On Dec 9, 2015, at 5:01 PM, Hyukjin Kwon wrote: > > Hi all, > > I am writing this email to both user-group and dev-gr

Inquery about contributing codes

2015-08-10 Thread Hyukjin Kwon
Dear Sir / Madam, I have a plan to contribute some code for passing filters to a datasource during physical planning. In more detail, I understand when we want to build up filter operations for data like Parquet (when actually reading and filtering HDFS blocks up front, not filtering in memory wit

Re: Best way to read XML data from RDD

2016-08-21 Thread Hyukjin Kwon
Hi Diwakar, The Spark XML library can take an RDD as the source. ``` val df = new XmlReader() .withRowTag("book") .xmlRdd(sqlContext, rdd) ``` If performance is critical, I would also recommend taking care of the creation and destruction of the parser. If the parser is not serializable, then you can do th

Re: Entire XML data as one of the column in DataFrame

2016-08-21 Thread Hyukjin Kwon
I can't say this is the best way to do so but my instant thought is as below: Create two df sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"") sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"") sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8") val strXmlDf = sc

Re: [Spark2] Error writing "complex" type to CSV

2016-08-22 Thread Hyukjin Kwon
at has changed that this is no longer > possible? The pull request said that it prints garbage. Was that some > regression in 2.0? The same code prints fine in 1.6.1. The field prints as > an array of the values of its fields. > > On Thu, Aug 18, 2016 at 5:56 PM, Hyukjin Kwon wrot

Re: Best way to read XML data from RDD

2016-08-22 Thread Hyukjin Kwon
> > Original message > From: Darin McBeath > Date:21/08/2016 17:44 (GMT+05:30) > To: Hyukjin Kwon , Jörn Franke > > Cc: Diwakar Dhanuskodi , Felix Cheung < > felixcheun...@hotmail.com>, user > Subject: Re: Best way to read XML data from RDD > >

Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Hyukjin Kwon
Hi Don, I guess this should be fixed from 2.0.1. Please refer this PR. https://github.com/apache/spark/pull/14339 On 1 Sep 2016 2:48 a.m., "Don Drake" wrote: > I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark > 2.0 and have encountered some interesting issues. > > First,

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
Hi Selvam, If your report lines are commented out with some character (e.g. #), you can skip those lines via the comment option [1]. If you are using Spark 1.x, then you might be able to do this by manually skipping them in the RDD and then turning it into a DataFrame as below: I haven’t tested this but I think this
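An illustrative spark-csv read using the comment option, assuming the report lines start with '#' (path is hypothetical):

```
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("comment", "#")   // lines beginning with '#' are skipped
  .load("/tmp/report.csv")
```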

Re: Spark CSV output

2016-09-10 Thread Hyukjin Kwon
Have you tried the quote-related options (e.g. `quote` or `quoteMode`, https://github.com/databricks/spark-csv/blob/master/README.md#features)? On 11 Sep 2016 12:22 a.m., "ayan guha" wrote: > CSV standard uses quote to ide
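A hedged sketch of those write options with spark-csv; `df` and the output path are assumptions, and the quoteMode value is one of the Commons CSV modes:

```
// Quote every field on output instead of only the ones that need it.
df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quote", "\"")
  .option("quoteMode", "ALL")
  .save("/tmp/out-csv")
```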

Re: Reading a TSV file

2016-09-10 Thread Hyukjin Kwon
Yeap. Also, sep is preferred and has higher precedence than delimiter. 2016-09-11 0:44 GMT+09:00 Jacek Laskowski : > Hi Muhammad, > > sep or delimiter should both work fine. > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark 2.0 http://bit.

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
> | reader.readAll().map(data => Row(data(3),data(4),data(7), > data(9),data(14)))} > > The above code throws arrayoutofbounce exception for empty line and report > line. > > > On Sat, Sep 10, 2016 at 3:02 PM, Hyukjin Kwon wrote: > >> Hi Selvam, >>

Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Hyukjin Kwon
Hi Kevin, I have a few questions on this. Does it only fail with write.json()? I just wonder whether write.text, csv or another API fails as well, or whether it is a JSON-specific issue. Also, does it work with small data? I want to make sure whether this happens only on large data. Thanks! 2016

How many PySpark Windows users are there?

2016-09-17 Thread Hyukjin Kwon
Hi all, We are currently testing SparkR on Windows[1] and it seems several problems are being identified from time to time. Although it seems it is not easy to automate Spark's tests in Scala on Windows because I think we should introduce proper change detection to run only related tests rather than

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems not to be an issue in Spark. Does "CSVParser" work fine on the data without Spark? On 20 Sep 2016 2:15 a.m., "Mohamed ismail" wrote: > Hi all > > I am trying to read: > > sc.textFile(DataFile).mapPartitions(lines => { > val parser = new CSVParser(",") >

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems not to be an issue in Spark. Does "CSVParser" work fine on the data without Spark? BTW, it seems there is something wrong with your email address. I am sending this again. On 20 Sep 2016 8:32 a.m., "Hyukjin Kwon" wrote: > It seems not an issue in Spark. Does

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Hyukjin Kwon
Hi Mich, I guess you could use the nullValue option by setting it to null. If you are reading them into strings in the first place, then you would hit https://github.com/apache/spark/pull/14118 first, which is resolved from 2.0.1. Unfortunately, this bug also exists in the external CSV library for stri
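A minimal sketch of the nullValue suggestion on Spark 2.x; the literal token used in the file and the path are assumptions:

```
val df = spark.read
  .option("header", "true")
  .option("nullValue", "null")   // treat this literal string as SQL NULL
  .csv("/tmp/rogue.csv")
```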

Re: spark sql on json

2016-09-29 Thread Hyukjin Kwon
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java#L104-L181 2016-09-29 18:58 GMT+09:00 Hitesh Goyal : > Hi team, > > > > I have a json document. I want to put spark SQL to it. > > Can you please send me an example app built i

Re: pyspark: sqlContext.read.text() does not work with a list of paths

2016-10-06 Thread Hyukjin Kwon
It seems obviously a bug. It was introduced by my PR, https://github.com/apache/spark/commit/d37c7f7f042f7943b5b684e53cf4284c601fb347 +1 for creating a JIRA and PR. If you have any problem with this, I would like to do it quickly. On 5 Oct 2016 9:12 p.m., "Laurent Legrand" wrote: > Hello,

Re: Support for uniVocity in Spark 2.x

2016-10-06 Thread Hyukjin Kwon
Yeap, there was an option to switch from Apache Commons CSV to univocity in the external CSV library, but univocity became the default and Apache Commons CSV was removed while porting it into Spark 2.0. On 7 Oct 2016 2:53 a.m., "Sean Owen" wrote: > It still uses univocity, but this is an implementation de

Re: JSON Arrays and Spark

2016-10-10 Thread Hyukjin Kwon
FYI, it supports [{...}, {...}, ...] or {...} format as input. On 11 Oct 2016 3:19 a.m., "Jean Georges Perrin" wrote: > Thanks Luciano - I think this is my issue :( > > On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote: > > Please take a look at > http://spark.apache.org/docs/latest/sql-pro

Re: JSON Arrays and Spark

2016-10-11 Thread Hyukjin Kwon
No, I meant each record should be in a single line, but it also supports an array type as a root wrapper of JSON objects. If you need to parse multiple lines, I have a reference here. http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ 2016-10-12 15:04 GMT+09:00 Kappaganthu, Siva

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Hyukjin Kwon
Regarding his recent PR [1], I guess he meant multi-line JSON. As far as I know, single-line JSON also complies with the standard. I left a comment with the RFC in the PR, but please let me know if I am wrong at any point. Thanks! [1] https://github.com/apache/spark/pull/15511 On 19 Oct 2016 7:00 a.m.,

Re: how to extract arraytype data to file

2016-10-18 Thread Hyukjin Kwon
This reminds me of https://github.com/databricks/spark-xml/issues/141#issuecomment-234835577 Maybe using explode() would be helpful. Thanks! 2016-10-19 14:05 GMT+09:00 Divya Gehlot : > http://stackoverflow.com/questions/33864389/how-can-i- > create-a-spark-dataframe-from-a-nested-array-of-struc

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Hyukjin Kwon
I am also interested in this issue. I will try to look into this too within coming few days.. 2016-10-24 21:32 GMT+09:00 Sean Owen : > I actually think this is a general problem with usage of DateFormat and > SimpleDateFormat across the code, in that it relies on the default locale > of the JVM.

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
There are now timestampFormat for TimestampType and dateFormat for DateType. Do you mind if I ask you to share your code? On 27 Oct 2016 2:16 a.m., "Koert Kuipers" wrote: > is there a reason a column with dates in format yyyy-mm-dd in a csv file > is inferred to be TimestampType and not DateType?
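An illustrative Spark 2.x read showing where those two options plug in; the patterns and path are assumptions, and the options only control how date/timestamp strings are parsed:

```
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")               // pattern used for DateType values
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // pattern used for TimestampType values
  .csv("/tmp/test.csv")
df.printSchema()
```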

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
in spark 2.0.1: > spark.read > .format("csv") > .option("header", true) > .option("inferSchema", true) > .load("test.csv") > .printSchema > > the result is: > root > |-- date: timestamp (nullable = true) > > >

Re: csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Hyukjin Kwon
Hi Koert, I am curious about your case. I guess the purpose of timestampFormat and dateFormat is to infer timestamps/dates when parsing/inferring, not to exclude the type inference/parsing. Actually, it does try to infer/parse in 2.0.0 as well (but it fails), so I guess there wouldn't

Re: Spark XML ignore namespaces

2016-11-03 Thread Hyukjin Kwon
Oh, that PR was actually about not handling the namespaces (meaning leaving data as they are, including prefixes). The problem was, each partition needs to produce each record knowing the namespaces. It is fine to deal with them if they are within each XML document (represented as a

Re: Error creating SparkSession, in IntelliJ

2016-11-03 Thread Hyukjin Kwon
Hi Shyla, there is the documentation for setting up IDE - https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup I hope this is helpful. 2016-11-04 9:10 GMT+09:00 shyla deshpande : > Hello Everyone, > > I just installed Spark 2.0.1, spark shell w

Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Hyukjin Kwon
Hi Femi, Have you maybe tried the quote related options specified in the documentation? http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv Thanks. 2016-11-06 6:58 GMT+09:00 Femi Anthony : > Hi, I am trying to process a very large comma delimited csv
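A hedged example of the kind of options referenced above, for fields wrapped in double quotes with embedded commas; the escape character and path are assumptions about the data:

```
val df = spark.read
  .option("header", "true")
  .option("quote", "\"")    // fields containing commas are wrapped in double quotes
  .option("escape", "\"")   // quotes inside a field are escaped by doubling them
  .csv("/tmp/quoted.csv")
df.show(false)
```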
