Re: Parquet Metadata

2021-06-23 Thread Sam
Hi, I only know about comments, which you can add to each column and where you can add these key values. Thanks. On Wed, Jun 23, 2021 at 11:31 AM Bode, Meikel, NMA-CFD < meikel.b...@bertelsmann.de> wrote: > Hi folks, > > > > Maybe not the right audience, but maybe you came across such a requirement. >

Re: Parquet read performance for different schemas

2019-09-20 Thread Julien Laurenceau
Hi Tomas, Parquet tuning time! I strongly recommend reading the CERN posts on Spark Parquet tuning: https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example You have to check the size of the row groups in your parquet files and maybe tweak it a little
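
A minimal sketch of tuning the row group size before writing, assuming a SparkContext sc and a DataFrame df; the 128 MB value and paths are illustrative only:

    // set the Parquet row group (block) size for subsequent writes
    sc.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
    df.write.mode("overwrite").parquet("/data/events_tuned")
    // existing files can be inspected with parquet-tools, e.g.
    //   parquet-tools meta /data/events_tuned/part-00000.parquet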

Re: Parquet read performance for different schemas

2019-09-20 Thread Tomas Bartalos
I forgot to mention an important part: I'm issuing the same query to both parquets - selecting only one column: df.select(sum('amount)) BR, Tomas On Thu, Sep 19, 2019 at 18:10 Tomas Bartalos wrote: > Hello, > > I have 2 parquets (each containing 1 file): > >- parquet-wide - schema has 25 top le

Re: Parquet 'bucketBy' creates a ton of files

2019-07-10 Thread Silvio Fiorito
-from-the-field-episode-ii-applying-best-practices-to-your-apache-spark-applications-with-silvio-fiorito From: Gourav Sengupta Date: Wednesday, July 10, 2019 at 3:14 AM To: Silvio Fiorito Cc: Arwin Tio , "user@spark.apache.org" Subject: Re: Parquet 'bucketBy' creates a ton

Re: Parquet 'bucketBy' creates a ton of files

2019-07-10 Thread Gourav Sengupta
yeah makes sense, also is there any massive performance improvement using bucketBy in comparison to sorting? Regards, Gourav On Thu, Jul 4, 2019 at 1:34 PM Silvio Fiorito wrote: > You need to first repartition (at a minimum by bucketColumn1) since each > task will write out the buckets/files. I

Re: Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Silvio Fiorito
You need to first repartition (at a minimum by bucketColumn1) since each task will write out the buckets/files. If the bucket keys are distributed randomly across the RDD partitions, then you will get multiple files per bucket. From: Arwin Tio Date: Thursday, July 4, 2019 at 3:22 AM To: "user@s
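
A hedged sketch of that advice, assuming a DataFrame df and a hypothetical bucket column bucketColumn1; the bucket count and table name are illustrative:

    import org.apache.spark.sql.functions.col

    val numBuckets = 200
    df.repartition(numBuckets, col("bucketColumn1")) // one task per bucket before writing
      .write
      .bucketBy(numBuckets, "bucketColumn1")
      .sortBy("bucketColumn1")
      .mode("overwrite")
      .saveAsTable("bucketed_table")                 // bucketBy requires saveAsTable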

Re: Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Phillip Henry
Hi, Arwin. If I understand you correctly, this is totally expected behaviour. I don't know much about saving to S3 but maybe you could write to HDFS first then copy everything to S3? I think the write to HDFS will probably be much faster as Spark/HDFS will write locally or to a machine on the sam
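
One way to follow that suggestion, sketched under the assumption that an HDFS staging path is available; the copy step is shown only as a comment:

    // write to HDFS first, then copy to S3 out of band
    df.write.mode("overwrite").parquet("hdfs:///staging/output")
    // afterwards, e.g.: hadoop distcp hdfs:///staging/output s3a://my-bucket/output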

Re: Parquet File Output Sink - Spark Structured Streaming

2019-03-27 Thread Matt Kuiper
the Parquet file. Matt From: Gabor Somogyi Sent: Wednesday, March 27, 2019 10:20:18 AM To: Matt Kuiper Cc: user@spark.apache.org Subject: Re: Parquet File Output Sink - Spark Structured Streaming Hi Matt, Maybe you could set maxFilesPerTrigger to 1. BR, G

Re: Parquet File Output Sink - Spark Structured Streaming

2019-03-27 Thread Gabor Somogyi
Hi Matt, Maybe you could set maxFilesPerTrigger to 1. BR, G On Wed, Mar 27, 2019 at 4:45 PM Matt Kuiper wrote: > Hello, > > I am new to Spark and Structured Streaming and have the following File > Output Sink question: > > Wondering what (and how to modify) triggers a Spark Structured Streami
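
maxFilesPerTrigger is a file-source option; a minimal sketch, assuming a SparkSession spark and a hypothetical input schema and path:

    val stream = spark.readStream
      .schema(inputSchema)                 // a file source needs an explicit schema
      .option("maxFilesPerTrigger", 1)     // pick up at most one new file per micro-batch
      .parquet("/data/incoming")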

Re: Parquet

2018-07-20 Thread Muthu Jayakumar
I generally write to Parquet when I want to repeat the operation of reading data and performing different operations on it every time. This would save db time for me. Thanks Muthu On Thu, Jul 19, 2018, 18:34 amin mohebbi wrote: > We do have two big tables, each including 5 billion rows, so my que

Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
You can use EXPLAIN statement to see optimized plan for each query. ( https://stackoverflow.com/questions/35883620/spark-how-can-get-the-logical-physical-query-execution-using-thirft-hive ). 2018-03-19 0:52 GMT+07:00 CPC : > Hi nguyen, > > Thank you for quick response. But what i am trying to und
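
A short sketch of inspecting the plan for the query discussed in this thread; the table name in the SQL variant is assumed:

    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    df.select(sum($"amount")).explain(true)   // parsed, analyzed, optimized and physical plans
    spark.sql("EXPLAIN EXTENDED SELECT sum(amount) FROM events").show(false)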

Re: parquet late column materialization

2018-03-18 Thread CPC
Hi nguyen, Thank you for the quick response. But what I am trying to understand is that in both queries predicate evaluation requires only one column. So actually Spark does not need to read all columns in the projection if they are not used in the filter predicate. Just to give an example, Amazon Redshift has this kin

Re: parquet late column materialization

2018-03-18 Thread nguyen duc Tuan
Hi @CPC, Parquet is a columnar storage format, so if you want to read data from only one column, you can do that without accessing all of your data. Spark SQL includes a query optimizer (see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html), so it will optimi

Re: parquet vs orc files

2018-03-01 Thread Sushrut Ikhar
To add, schema evolution is better for parquet compared to orc (at the cost of a bit of slowness) as orc is truly index based; especially useful in case you want to delete some columns later. Regards, Sushrut Ikhar about.me/sushrutikhar

Re: parquet vs orc files

2018-02-22 Thread Jörn Franke
Look at the documentation of the formats. In any case: * additionally use partitions on the filesystem * sort the data on filter columns - otherwise you do not benefit from min/max and bloom filters > On 21. Feb 2018, at 22:58, Kane Kim wrote: > > Thanks, how does min/max index work? Can spar

Re: parquet vs orc files

2018-02-22 Thread Kurt Fehlhauer
Hi Kane, It really depends on your use case. I generally use Parquet because it seems to have better support beyond Spark. However, if you are dealing with partitioned Hive tables, the current versions of Spark have an issue where compression will not be applied. This will be fixed in version 2.3.

Re: parquet vs orc files

2018-02-21 Thread Stephen Joung
In the case of Parquet, the best source for me to configure and to ensure "min/max statistics" was https://www.slideshare.net/mobile/RyanBlue3/parquet-performance-tuning-the-missing-guide --- I don't have any experience in orc. On Thu, Feb 22, 2018 at 6:59 AM, Kane Kim wrote: > Thanks, how does min/max index

Re: parquet vs orc files

2018-02-21 Thread Kane Kim
Thanks, how does min/max index work? Can spark itself configure bloom filters when saving as orc? On Wed, Feb 21, 2018 at 1:40 PM, Jörn Franke wrote: > In the latest version both are equally well supported. > > You need to insert the data sorted on filtering columns > Then you will benefit from

Re: parquet vs orc files

2018-02-21 Thread Jörn Franke
In the latest version both are equally well supported. You need to insert the data sorted on the filtering columns. Then you will benefit from min/max indexes and, in the case of ORC, additionally from bloom filters, if you configure them. In any case I also recommend partitioning of files (do not confuse wit
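
A hedged sketch of that advice with hypothetical column names; the bloom-filter option name is taken from the ORC writer and applies to ORC output only:

    // Parquet: partition on the filesystem and sort on the filter column
    df.sortWithinPartitions("customerId")
      .write
      .partitionBy("eventDate")
      .parquet("/data/events_parquet")

    // ORC: same idea, plus bloom filters on the filter column
    df.sortWithinPartitions("customerId")
      .write
      .option("orc.bloom.filter.columns", "customerId")
      .partitionBy("eventDate")
      .orc("/data/events_orc")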

Re: Parquet files from spark not readable in Cascading

2017-11-20 Thread Vikas Gandham
I tried setting spark.sql.parquet.writeLegacyFormat to true but the issue still persists. Thanks Vikas Gandham On Thu, Nov 16, 2017 at 10:25 AM, Yong Zhang wrote: > I don't have experience with Cascading, but we saw similar issue for > importing the data generated in Spark into Hive. > > > Did you try this

Re: Parquet files from spark not readable in Cascading

2017-11-16 Thread Yong Zhang
I don't have experience with Cascading, but we saw a similar issue when importing the data generated in Spark into Hive. Did you try setting "spark.sql.parquet.writeLegacyFormat" to true? https://stackoverflow.com/questions/44279870/why-cant-impala-read-parquet-files-after-spark-sqls-write
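
The setting can be applied per session before writing; a minimal sketch assuming a SparkSession spark and an illustrative output path:

    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
    df.write.mode("overwrite").parquet("/data/legacy_layout")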

Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-13 Thread Yong Zhang
esday, June 13, 2017 1:54 AM To: Angel Francisco Orta Cc: Yong Zhang; user@spark.apache.org Subject: Re: Parquet file generated by Spark, but not compatible read by Hive Try setting following Param: conf.set("spark.sql.hive.convertMetastoreParquet","false") On Tue, Jun 1

Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread ayan guha
Try setting following Param: conf.set("spark.sql.hive.convertMetastoreParquet","false") On Tue, Jun 13, 2017 at 3:34 PM, Angel Francisco Orta < angel.francisco.o...@gmail.com> wrote: > Hello, > > Do you use df.write or you make with hivecontext.sql(" insert into ...")? > > Angel. > > El 12 jun.

Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread Angel Francisco Orta
Hello, Do you use df.write or do you do it with hivecontext.sql(" insert into ...")? Angel. On Jun 12, 2017 at 11:07 PM, "Yong Zhang" wrote: > We are using Spark *1.6.2* as ETL to generate parquet file for one > dataset, and partitioned by "brand" (which is a string to represent brand > in this

Re: Parquet file amazon s3a timeout

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 11:13, Karin Valisova <ka...@datapine.com> wrote: Hello! I'm working with some parquet files saved on amazon service and loading them to dataframe with Dataset df = spark.read() .parquet(parketFileLocation); however, after some time I get the "Timeout waiting for con
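
A possible mitigation for the connection-pool timeout (not necessarily the fix in the truncated reply above), using standard s3a settings; the values and bucket name are illustrative:

    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.connection.maximum", "100")    // enlarge the S3A connection pool
    hc.set("fs.s3a.connection.timeout", "200000") // socket timeout in milliseconds
    val df = spark.read.parquet("s3a://my-bucket/path/")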

Re: parquet optimal file structure - flat vs nested

2017-05-03 Thread Steve Loughran
> On 30 Apr 2017, at 09:19, Zeming Yu wrote: > > Hi, > > We're building a parquet based data lake. I was under the impression that > flat files are more efficient than deeply nested files (say 3 or 4 levels > down). Is that correct? > > Thanks, > Zeming Where's the data going to live: HDFS

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
Can you give more details on the schema? Is it 6 TB just airport information as below? > On 30. Apr 2017, at 23:08, Zeming Yu wrote: > > I thought relational databases with 6 TB of data can be quite expensive? > >> On 1 May 2017 12:56 am, "Muthu Jayakumar" wrote: >> I am not sure if parquet

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
You have to find out how the user filters - by code? By airport name? Then you can have the right structure. Although, in the scenario below ORC with bloom filters may have some advantages. It is crucial that you sort the data when inserting it on the columns your user wants to filter. E.g. If f

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
I thought relational databases with 6 TB of data can be quite expensive? On 1 May 2017 12:56 am, "Muthu Jayakumar" wrote: > I am not sure if parquet is a good fit for this? This seems more like > filter lookup than an aggregate like query. I am curious to see what others > have to say. > Would i

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Another question: I need to store airport info in a parquet file and present it when a user makes a query. For example: "airport": { "code": "TPE", "name": "Taipei (Taoyuan Intl.)",
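
For a nested layout like the one above, dotted paths can be selected and filtered directly; a minimal sketch with an assumed path and value:

    import spark.implicits._

    val flights = spark.read.parquet("/data/flights")
    flights
      .select($"airport.code", $"airport.name")
      .where($"airport.code" === "TPE")
      .show()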

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
Depends on your queries, the data structure etc. generally flat is better, but if your query filter is on the highest level then you may have better performance with a nested structure, but it really depends > On 30. Apr 2017, at 10:19, Zeming Yu wrote: > > Hi, > > We're building a parquet ba

Re: Parquet Gzipped Files

2017-02-14 Thread Benjamin Kim
Jörn, I agree with you, but the vendor is a little difficult to work with. For now, I will try to decompress it from S3 and save it plainly into HDFS. If someone already has this example, please let me know. Cheers, Ben > On Feb 13, 2017, at 9:50 AM, Jörn Franke wrote: > > Your vendor shoul

Re: Parquet Gzipped Files

2017-02-13 Thread Jörn Franke
Your vendor should use the parquet internal compression and not take a parquet file and gzip it. > On 13 Feb 2017, at 18:48, Benjamin Kim wrote: > > We are receiving files from an outside vendor who creates a Parquet data file > and Gzips it before delivery. Does anyone know how to Gunzip the
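
A short sketch of using Parquet's internal compression instead of gzipping the whole file, assuming a DataFrame df and illustrative paths:

    // per-write option
    df.write.option("compression", "gzip").parquet("/data/compressed")
    // or session-wide
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")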

Re: Parquet with group by queries

2016-12-21 Thread Anil Langote
I tried caching the parent data set, but it slows down the execution time. The last column in the input data set is a double array, and the requirement is to add the last-column double arrays after doing a group by. I have implemented an aggregation function which adds the last column. Hence the query is Select
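
A rough sketch of the array aggregation described above, assuming the group key is the first (string) column, the double array is the last column, and "adding" means element-wise summation; it drops to the RDD API rather than a custom UDAF:

    val summed = df.rdd
      .map(r => (r.getString(0), r.getSeq[Double](r.length - 1).toArray))
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })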

Re: parquet table in spark-sql

2016-05-03 Thread Sandeep Nemuri
We don't need any delimiters for the Parquet file format. On Tue, May 3, 2016 at 5:31 AM, Varadharajan Mukundan wrote: > Hi, > > Yes, it is not needed. Delimiters are needed only for text files. > > On Tue, May 3, 2016 at 12:49 PM, 喜之郎 <251922...@qq.com> wrote: > >> hi, I want to ask a question abo

Re: parquet table in spark-sql

2016-05-03 Thread Varadharajan Mukundan
Hi, Yes, it is not needed. Delimiters are needed only for text files. On Tue, May 3, 2016 at 12:49 PM, 喜之郎 <251922...@qq.com> wrote: > hi, I want to ask a question about parquet tables in spark-sql. > > I think that parquet has schema information in its own file. > so you don't need to define r

Re: Parquet block size from spark-sql cli

2016-01-28 Thread Ted Yu
Have you tried the following (sc is SparkContext)? sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE) On Thu, Jan 28, 2016 at 9:16 AM, ubet wrote: > Can I set the Parquet block size (parquet.block.size) in spark-sql. We are > loading about 80 table partitions in parallel on 1.5.2 a

Re: Parquet write optimization by row group size config

2016-01-21 Thread Pavel Plotnikov
I have got about 25 separate gzipped log files per hour. File sizes vary a lot, from 10MB to 50MB of gzipped JSON data. So, I convert this data to parquet each hour. The code is very simple, in Python: text_file = sc.textFile(src_file) df = sqlCtx.jsonRDD(text_file.map(lambda x: x.split('\t

Re: Parquet write optimization by row group size config

2016-01-20 Thread Jörn Franke
What is your data size, the algorithm and the expected time? Depending on this the group can recommend you optimizations or tell you that the expectations are wrong > On 20 Jan 2016, at 18:24, Pavel Plotnikov > wrote: > > Thanks, Akhil! It helps, but this jobs still not fast enough, maybe i mi

Re: Parquet write optimization by row group size config

2016-01-20 Thread Akhil Das
It would be good if you could share the code; someone here or I can guide you better if you post the code snippet. Thanks Best Regards On Wed, Jan 20, 2016 at 10:54 PM, Pavel Plotnikov < pavel.plotni...@team.wrike.com> wrote: > Thanks, Akhil! It helps, but this job is still not fast enough, mayb

Re: Parquet write optimization by row group size config

2016-01-20 Thread Pavel Plotnikov
Thanks, Akhil! It helps, but this job is still not fast enough; maybe I missed something. Regards, Pavel On Wed, Jan 20, 2016 at 9:51 AM Akhil Das wrote: > Did you try re-partitioning the data before doing the write? > > Thanks > Best Regards > > On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <

Re: Parquet write optimization by row group size config

2016-01-19 Thread Akhil Das
Did you try re-partitioning the data before doing the write? Thanks Best Regards On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov < pavel.plotni...@team.wrike.com> wrote: > Hello, > I'm using spark on some machines in standalone mode, data storage is > mounted on this machines via nfs. A have in
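
A minimal sketch of that suggestion, with an illustrative partition count and path:

    df.repartition(200)        // fewer, larger tasks => fewer, larger Parquet files
      .write
      .mode("overwrite")
      .parquet("/data/output")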

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-12 Thread Cheng Lian
I see. So there are actually 3000 tasks instead of 3000 jobs, right? Would you mind providing the full stack trace of the GC issue? At first I thought it was identical to the _metadata one in the mail thread you mentioned. Cheng On 1/11/16 5:30 PM, Gavin Yue wrote: Here is how I set the conf:

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-11 Thread Gavin Yue
Here is how I set the conf: sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") This actually works, I do not see the _metadata file anymore. I think I made a mistake. The 3000 jobs are coming from repartition("id"). I have 7600 json files and want to save as parquet. So if

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-11 Thread Cheng Lian
Hey Gavin, Could you please provide a snippet of your code to show how you disabled "parquet.enable.summary-metadata" and wrote the files? Especially, you mentioned you saw "3000 jobs" failed. Were you writing each Parquet file with an individual job? (Usually people use write.partitionBy

Re: parquet file doubts

2015-12-08 Thread Cheng Lian
eet mailto:absi...@informatica.com>> wrote: Yes, Parquet has min/max. *From:*Cheng Lian [mailto:l...@databricks.com <mailto:l...@databricks.com>] *Sent:* Monday, December 07, 2015 11:21 AM *To:* Ted Yu *Cc:* Shushant Arora; user@spark.apache.org <mailto

Re: parquet file doubts

2015-12-07 Thread Shushant Arora
Singh, Abhijeet wrote: > Yes, Parquet has min/max. > > > > *From:* Cheng Lian [mailto:l...@databricks.com] > *Sent:* Monday, December 07, 2015 11:21 AM > *To:* Ted Yu > *Cc:* Shushant Arora; user@spark.apache.org > *Subject:* Re: parquet file doubts > > > > O

RE: parquet file doubts

2015-12-07 Thread Singh, Abhijeet
Yes, Parquet has min/max. From: Cheng Lian [mailto:l...@databricks.com] Sent: Monday, December 07, 2015 11:21 AM To: Ted Yu Cc: Shushant Arora; user@spark.apache.org Subject: Re: parquet file doubts Oh sorry... At first I meant to cc spark-user list since Shushant and I had been discussed some

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
Oh sorry... At first I meant to cc spark-user list since Shushant and I had been discussed some Spark related issues before. Then I realized that this is a pure Parquet issue, but forgot to change the cc list. Thanks for pointing this out! Please ignore this thread. Cheng On 12/7/15 12:43 PM,

Re: parquet file doubts

2015-12-06 Thread Ted Yu
Cheng: I only see user@spark in the CC. FYI On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian wrote: > cc parquet-dev list (it would be nice to always do so for these general > questions.) > > Cheng > > On 12/6/15 3:10 PM, Shushant Arora wrote: > >> Hi >> >> I have few doubts on parquet file format. >

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
cc parquet-dev list (it would be nice to always do so for these general questions.) Cheng On 12/6/15 3:10 PM, Shushant Arora wrote: Hi I have a few doubts on the parquet file format. 1. Does parquet keep min/max statistics like ORC? How can I see the parquet version (whether it's 1.1, 1.2 or 1.3) for pa

Re: Parquet files not getting coalesced to smaller number of files

2015-11-29 Thread Cheng Lian
RDD.coalesce(n) returns a new RDD rather than modifying the original RDD. So what you need is: metricsToBeSaved.coalesce(1500).saveAsNewAPIHadoopFile(...) Cheng On 11/29/15 12:21 PM, SRK wrote: Hi, I have the following code that saves the parquet files in my hourly batch to hdfs. My idea

Re: Parquet file size

2015-10-08 Thread Cheng Lian
Lian; user@spark.apache.org *Subject:* Re: Parquet file size Hi, In our case, we're using the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE to increase the size of the RDD partitions when loading text files, so it would generate larger parquet files. We just set

RE: Parquet file size

2015-10-07 Thread Younes Naguib
orld.com> From: odeach...@gmail.com [odeach...@gmail.com] on behalf of Deng Ching-Mallete [och...@apache.org] Sent: Wednesday, October 07, 2015 9:14 PM To: Younes Naguib Cc: Cheng Lian; user@spark.apache.org Subject: Re: Parquet file size Hi, In our case, we&#

Re: Parquet file size

2015-10-07 Thread Deng Ching-Mallete
el.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | younes.naguib > @tritondigital.com > -- > *From:* Cheng Lian [lian.cs@gmail.com] > *Sent:* Wednesday, October 07, 2015 7:01 PM > > *To:* Younes Naguib; 'user@spark.apache.org' >

RE: Parquet file size

2015-10-07 Thread Younes Naguib
7:01 PM To: Younes Naguib; 'user@spark.apache.org' Subject: Re: Parquet file size The reason so many small files are generated is probably that you are inserting into a partitioned table with three partition columns. If you want large Parquet files, you may try
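
One common way, on a more recent Spark API than this 1.5-era thread, to get fewer, larger files per partition (a sketch of where the truncated advice above seems to be heading, not necessarily the exact suggestion):

    import org.apache.spark.sql.functions.col

    df.repartition(col("year"), col("month"), col("day"))   // one task per output partition
      .write
      .partitionBy("year", "month", "day")
      .mode("overwrite")
      .parquet("/data/tbl")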

Re: Parquet file size

2015-10-07 Thread Cheng Lian
:* Younes Naguib; 'user@spark.apache.org' *Subject:* Re: Parquet file size Why do you want larger files? Doesn't the result Parquet file contain all the data in the original TSV file? Cheng On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I’m reading a large tsv file, and c

RE: Parquet file size

2015-10-07 Thread Younes Naguib
The original TSV file is 600GB and generated 40k files of 15-25MB. y From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: October-07-15 3:18 PM To: Younes Naguib; 'user@spark.apache.org' Subject: Re: Parquet file size Why do you want larger files? Doesn't the result Parquet f

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the result Parquet file contain all the data in the original TSV file? Cheng On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I’m reading a large tsv file, and creating parquet files using sparksql: insert overwrite table tbl partition(year, month, day)..

Re: parquet error

2015-09-18 Thread Cheng Lian
Not sure what's happening here, but I guess it's probably a dependency version issue. Could you please give vanilla Apache Spark a try to see whether it's a CDH-specific issue or not? Cheng On 9/17/15 11:44 PM, Chengi Liu wrote: Hi, I did some digging.. I believe the error is caused by jets3

Re: parquet error

2015-09-17 Thread Chengi Liu
Hi, I did some digging.. I believe the error is caused by jets3t jar. Essentially these lines locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 'java/net/URI', 'org/apache/hadoop/conf/Configuration', 'org/apache/hadoop/fs/s3/S3Credentials', 'org/jets3t/service/security/AWSCr

Re: Parquet partitioning performance issue

2015-09-13 Thread Dean Wampler
One general technique is to perform a second pass later over the files, for example the next day or once a week, to concatenate smaller files into larger ones. This can be done for all file types and allows you to make recent data available to analysis tools, while avoiding a large build-up of small file

Re: Parquet Array Support Broken?

2015-09-08 Thread Cheng Lian
Yeah, this is a typical Parquet interoperability issue due to unfortunate historical reasons. Hive (actually parquet-hive) gives the following schema for array: message m0 { optional group f (LIST) { repeated group bag { optional int32 array_element; } } } while Spark SQL gives me

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
Thank you - it works if the file is created in Spark On Mon, Sep 7, 2015 at 3:06 PM, Ruslan Dautkhanov wrote: > Read response from Cheng Lian on Aug/27th - it > looks the same problem. > > Workarounds > 1. write that parquet file in Spark; > 2. upgrade to Spark 1.5. > > -- > Ruslan Dautkhanov >

Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
Read response from Cheng Lian on Aug/27th - it looks the same problem. Workarounds 1. write that parquet file in Spark; 2. upgrade to Spark 1.5. -- Ruslan Dautkhanov On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov wrote: > No, it was created in Hive by CTAS, but any help is appreciated... > > On

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
No, it was created in Hive by CTAS, but any help is appreciated... On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov wrote: > That parquet table wasn't created in Spark, is it? > > There was a recent discussion on this list that complex data types in > Spark prior to 1.5 often incompatible with

Re: Parquet Array Support Broken?

2015-09-07 Thread Ruslan Dautkhanov
That parquet table wasn't created in Spark, was it? There was a recent discussion on this list that complex data types in Spark prior to 1.5 are often incompatible with Hive, for example, if I remember correctly. On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote: > I am trying to read an (array typed) p

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
The same error if I do: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val results = sqlContext.sql("SELECT * FROM stats") but it does work from Hive shell directly... On Mon, Sep 7, 2015 at 1:56 PM, Alex Kozlov wrote: > I am trying to read an (array typed) parquet file in spar

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Cheng Lian
What version of Spark were you using? Have you tried increasing --executor-memory? This schema looks pretty normal. And Parquet stores all keys of a map in a single column. Cheng On 9/4/15 4:00 PM, Kohki Nishio wrote: The stack trace is this java.lang.OutOfMemoryError: Java heap space

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Kohki Nishio
The stack trace is this java.lang.OutOfMemoryError: Java heap space at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65) at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57) at parquet.column.va

Re: Parquet partitioning for unique identifier

2015-09-03 Thread Cheng Lian
Could you please provide the full stack trace of the OOM exception? Another common case of Parquet OOM is super wide tables, say hundreds or thousands of columns. And in this case, the number of rows is mostly irrelevant. Cheng On 9/4/15 1:24 AM, Kohki Nishio wrote: let's say I have a data li

Re: Parquet partitioning for unique identifier

2015-09-03 Thread Kohki Nishio
let's say I have data like this:
ID | Some1 | Some2 | Some3 |
A1 | kdsfajfsa | dsafsdafa | fdsfafa |
A2 | dfsfafasd | 23jfdsjkj | 980dfs |
A3 | 99989df | jksdljas | 48dsaas |
..
Z00.. | fdsafdsfa | fdsdafdas | 89sdaff |
My understanding is that if I giv

Re: Parquet partitioning for unique identifier

2015-09-02 Thread Adrien Mogenet
Any code / Parquet schema to provide? I'm not sure to understand which step fails right there... On 3 September 2015 at 04:12, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > Did you specify partitioning column while saving data.. > On Sep 3, 2015 5:41 AM, "Kohki Nishio" wrote: > >>

Re: Parquet partitioning for unique identifier

2015-09-02 Thread Raghavendra Pandey
Did you specify partitioning column while saving data.. On Sep 3, 2015 5:41 AM, "Kohki Nishio" wrote: > Hello experts, > > I have a huge json file (> 40G) and trying to use Parquet as a file > format. Each entry has a unique identifier but other than that, it doesn't > have 'well balanced value'

Re: Parquet without hadoop: Possible?

2015-08-12 Thread Cheng Lian
st 11, 2015 12:01 PM *To:* Ellafi, Saif A.; deanwamp...@gmail.com *Cc:* user@spark.apache.org *Subject:* RE: Parquet without hadoop: Possible? Sorry, I provided bad information. This example worked fine with reduced parallelism. It seems my problem have to do with something specific with the

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Jerry Lam
> From: Ellafi, Saif A. > Sent: Tuesday, August 11, 2015 12:01 PM > To: Ellafi, Saif A.; deanwamp...@gmail.com > Cc: user@spark.apache.org > Subject: RE: Parquet without hadoop: Possible? > > Sorry, I provided bad information. This example worked fine with reduced > parall

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
I confirm that it works, I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 Saif From: Ellafi, Saif A. Sent: Tuesday, August 11, 2015 12:01 PM To: Ellafi, Saif A.; deanwamp...@gmail.com Cc: user@spark.apache.org Subject: RE: Parquet without hadoop: Possible? Sorry

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
11:49 AM To: deanwamp...@gmail.com Cc: user@spark.apache.org Subject: RE: Parquet without hadoop: Possible? I am launching my spark-shell spark-1.4.1-bin-hadoop2.6/bin/spark-shell 15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
August 11, 2015 11:39 AM To: Ellafi, Saif A. Cc: user@spark.apache.org Subject: Re: Parquet without hadoop: Possible? It should work fine. I have an example script here: https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala (Spark 1

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Dean Wampler
It should work fine. I have an example script here: https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala (Spark 1.4.X) What does "I am failing to do so" mean? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: Parquet SaveMode.Append Trouble.

2015-08-04 Thread Cheng Lian
You need to import org.apache.spark.sql.SaveMode Cheng On 7/31/15 6:26 AM, satyajit vegesna wrote: Hi, I am new to using Spark and Parquet files, Below is what i am trying to do, on Spark-shell, val df = sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet") Hav
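
With that import in place, an append looks roughly like this (the paths are hypothetical):

    import org.apache.spark.sql.SaveMode

    val df = sqlContext.read.parquet("/data/LM/Parquet/Segment/pages")
    df.write.mode(SaveMode.Append).parquet("/data/LM/Parquet/Segment/pages_out")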

Re: [Parquet + Dataframes] Column names with spaces

2015-07-30 Thread Michael Armbrust
You can't use these names due to limitations in parquet (and the library itself will silently generate corrupt files that can't be read, hence the error we throw). You can alias a column by df.select(df("old").alias("new")), which is essentially what withColumnRenamed does. Alias in this case mean
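
A brief sketch of the two renaming approaches mentioned, with a hypothetical column name containing a space:

    val cleaned  = df.select(df("flight price").alias("flight_price"))
    // equivalent convenience method
    val cleaned2 = df.withColumnRenamed("flight price", "flight_price")
    cleaned.write.parquet("/data/no_spaces")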

Re: Parquet writing gets progressively slower

2015-07-26 Thread Cheng Lian
Actually no. In general, Spark SQL doesn't trust Parquet summary files. The reason is that it's not unusual to fail to write Parquet summary files. For example, Hive never writes summary files for Parquet tables because it uses NullOutputCommitter, which bypasses Parquet's own output committer.

Re: Parquet writing gets progressively slower

2015-07-25 Thread Michael Kelly
Thanks for the suggestion Cheng, I will try that today. Are there any implications when reading the parquet data if there are no summary files present? Michael On Sat, Jul 25, 2015 at 2:28 AM, Cheng Lian wrote: > The time is probably spent by ParquetOutputFormat.commitJob. While > committing a s

Re: Parquet writing gets progressively slower

2015-07-24 Thread Cheng Lian
The time is probably spent by ParquetOutputFormat.commitJob. While committing a successful write job, Parquet writes a pair of summary files, containing metadata like schema, user defined key-value metadata, and Parquet row group information. To gather all the necessary information, Parquet sca

Re: Parquet problems

2015-07-22 Thread Michael Misiewicz
For what it's worth, my data set has around 85 columns in Parquet format as well. I have tried bumping the permgen up to 512m but I'm still getting errors in the driver thread. On Wed, Jul 22, 2015 at 1:20 PM, Jerry Lam wrote: > Hi guys, > > I noticed that too. Anders, can you confirm that it wo

Re: Parquet problems

2015-07-22 Thread Jerry Lam
Hi guys, I noticed that too. Anders, can you confirm that it works on Spark 1.5 snapshot? This is what I tried at the end. It seems it is 1.4 issue. Best Regards, Jerry On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg wrote: > No, never really resolved the problem, except by increasing the per

Re: Parquet problems

2015-07-22 Thread Anders Arpteg
No, never really resolved the problem, except by increasing the permgen space, which only partially solved it. Still have to restart the job multiple times to make the whole job complete (it stores intermediate results). The parquet data sources have about 70 columns, and yes Cheng, it works fine w

Re: Parquet problems

2015-07-22 Thread Cheng Lian
How many columns are there in these Parquet files? Could you load a small portion of the original large dataset successfully? Cheng On 6/25/15 5:52 PM, Anders Arpteg wrote: Yes, both the driver and the executors. Works a little bit better with more space, but still a leak that will cause fai

Re: Parquet problems

2015-07-22 Thread Michael Misiewicz
Hi Anders, Did you ever get to the bottom of this issue? I'm encountering it too, but only in "yarn-cluster" mode running on spark 1.4.0. I was thinking of trying 1.4.1 today. Michael On Thu, Jun 25, 2015 at 5:52 AM, Anders Arpteg wrote: > Yes, both the driver and the executors. Works a little

Re: Parquet problems

2015-06-25 Thread Anders Arpteg
Yes, both the driver and the executors. Works a little bit better with more space, but still a leak that will cause failure after a number of reads. There are about 700 different data sources that need to be loaded, lots of data... On Thu, Jun 25, 2015 at 08:02 Sabarish Sasidharan wrote: > Did you try

Re: Parquet problems

2015-06-24 Thread Sabarish Sasidharan
Did you try increasing the perm gen for the driver? Regards Sab On 24-Jun-2015 4:40 pm, "Anders Arpteg" wrote: > When reading large (and many) datasets with the Spark 1.4.0 DataFrames > parquet reader (the org.apache.spark.sql.parquet format), the following > exceptions are thrown: > > Exception

Re: Parquet Multiple Output

2015-06-12 Thread Cheng Lian
Spark 1.4 supports dynamic partitioning, you can first convert your RDD to a DataFrame and then save the contents partitioned by date column. Say you have a DataFrame df containing three columns a, b, and c, you may have something like this: df.write.partitionBy("a", "b").mode("overwrite"
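
Completing the truncated example above as a sketch (format and path assumed):

    df.write
      .partitionBy("a", "b")
      .mode("overwrite")
      .parquet("/data/partitioned_by_a_b")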

Re: Parquet number of partitions

2015-05-07 Thread Eric Eijkelenboom
Funny enough, I observe different behaviour on EC2 vs EMR (Spark on EMR installed with https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark ). Both with Spark 1.3.1/Hadoop 2. Reading a folder with 12 Parquet give

Re: Parquet number of partitions

2015-05-07 Thread Archit Thakur
Hi. No. of partitions are determined by the RDD it uses in the plan it creates. It uses NewHadoopRDD which gives partitions by getSplits of input format it is using. It uses FilteringParquetRowInputFormat subclass of ParquetInputFormat. To change the no of partitions write a new input format and ma

Re: Parquet number of partitions

2015-05-05 Thread Masf
Hi Eric. Q1: When I read parquet files, I've tested that Spark generates as many partitions as there are parquet files in the path. Q2: To reduce the number of partitions you can use rdd.repartition(x), x => number of partitions. Depending on your case, repartition could be a heavy task. Regards. Miguel

Re: Parquet error reading data that contains array of structs

2015-04-29 Thread Cheng Lian
Thanks for the detailed information! Now I can confirm that this is a backwards-compatibility issue. The data written by parquet 1.6rc7 follows the standard LIST structure. However, Spark SQL still uses old parquet-avro style two-level structures, which causes the problem. Cheng On 4/27/15

Re: Parquet error reading data that contains array of structs

2015-04-27 Thread Jianshi Huang
FYI, Parquet schema output: message pig_schema { optional binary cust_id (UTF8); optional int32 part_num; optional group ip_list (LIST) { repeated group ip_t { optional binary ip (UTF8); } } optional group vid_list (LIST) { repeated group vid_t { optional binary

Re: Parquet error reading data that contains array of structs

2015-04-26 Thread Cheng Lian
Had an offline discussion with Jianshi, the dataset was generated by Pig. Jianshi - Could you please attach the output of "parquet-schema "? I guess this is a Parquet format backwards-compatibility issue. Parquet hadn't standardized representation of LIST and MAP until recently, thus many syst
