Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Arun Patel
issue? I tried playing with spark.memory.fraction and spark.memory.storageFraction, but it did not help. Appreciate your help on this!!! On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel wrote: > Thanks for the quick response. > > It's a single XML file and I am using a top-level rowTag

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
f you still face this problem? > > I will try to take a look as best I can. > > > Thank you. > > > 2016-11-16 9:12 GMT+09:00 Arun Patel : > >> I am trying to read an XML file which is 1GB in size. I am getting an >> error 'java.lang.OutOfMemoryErr

Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
I am trying to read an XML file which is 1GB in size. I am getting an error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit' after reading 7 partitions in local mode. In Yarn mode, it throws 'java.lang.OutOfMemoryError: Java heap space' error after reading 3 partitions. Any su
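A sketch of the usual mitigation, assuming (as the follow-up suggests) that the blowup comes from a top-level rowTag turning the whole 1GB document into one enormous row; it also assumes a pyspark shell where sqlContext exists. The tag name and path are placeholders, not the real schema:

```python
# A sketch, not a verified fix: with a top-level rowTag, the parser may try
# to allocate a single huge array for one giant row. Pointing rowTag at a
# repeating inner element keeps individual rows small.
# Heap flags belong on launch, e.g.:
#   bin/pyspark --driver-memory 4g --conf spark.executor.memory=4g
df = (sqlContext.read
      .format('com.databricks.spark.xml')
      .option('rowTag', 'record')   # placeholder: a repeating inner element
      .load('large.xml'))           # placeholder path
df.show()
```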

Spark XML ignore namespaces

2016-11-03 Thread Arun Patel
I see that 'ignoring namespaces' issue is resolved. https://github.com/databricks/spark-xml/pull/75 How do we enable this option and ignore namespace prefixes? - Arun
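For what it's worth, a hedged sketch of my reading of that PR: it appears to make spark-xml match elements by local name, ignoring the namespace prefix, rather than adding a separate option, so a plain rowTag should match prefixed tags. Tag and file names are placeholders, and this assumes a pyspark shell where sqlContext exists:

```python
# Assumption, not confirmed: after the linked PR, tag matching ignores
# namespace prefixes, so rowTag='book' should also match '<ns:book>'.
df = (sqlContext.read
      .format('com.databricks.spark.xml')
      .option('rowTag', 'book')                 # no prefix needed
      .load('books-with-namespaces.xml'))       # placeholder file
```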

Re: Best approach for processing all files parallelly

2016-10-10 Thread Arun Patel
) returns schema for FileType > > This for loop DOES NOT process files sequentially. It creates dataframes > on all files of the same type sequentially. > > On Fri, Oct 7, 2016 at 12:08 AM, Arun Patel > wrote: > >> Thanks Ayan. Couple of questions: >> >

Re: Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
is case, if you see, t[1] is NOT the file content, as I have added a > "FileType" field. So, this collect is just bringing in the list of file > types, should be fine > > On Thu, Oct 6, 2016 at 11:47 PM, Arun Patel > wrote: > >> Thanks Ayan. I am really concerned ab

Re: Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
ist.append(df) > > > > On Thu, Oct 6, 2016 at 10:26 PM, Arun Patel > wrote: > >> My Pyspark program currently identifies the list of files in a >> directory (using a Python Popen command with hadoop fs -ls arguments). For >> each file, a Dataframe is cr

Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
My Pyspark program currently identifies the list of files in a directory (using a Python Popen command with hadoop fs -ls arguments). For each file, a Dataframe is created and processed. This is sequential. How do I process all files in parallel? Please note that every file in the directory
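One common pattern, as a sketch under assumptions rather than the thread's answer: Spark's scheduler is thread-safe, so driving one job per file from a Python thread pool lets the jobs overlap instead of running one after another. The paths, file format, and per-file processing below are placeholders, and sqlContext is assumed from a pyspark shell:

```python
# Sketch: submit one Spark job per file from a thread pool so the jobs
# run concurrently rather than back to back.
from multiprocessing.pool import ThreadPool

files = ["/data/file1.json", "/data/file2.json"]   # e.g. from hadoop fs -ls

def process_file(path):
    df = sqlContext.read.json(path)                 # placeholder format
    df.write.mode("overwrite").parquet(path + ".out")  # placeholder processing

ThreadPool(4).map(process_file, files)
```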

Re: Check if a nested column exists in DataFrame

2016-09-13 Thread Arun Patel
at 5:28 PM, Arun Patel wrote: > I'm trying to analyze XML documents using the spark-xml package. Since all > XML columns are optional, some columns may or may not exist. When I > register the Dataframe as a table, how do I check whether a nested column > exists? My column na

Check if a nested column exists in DataFrame

2016-09-12 Thread Arun Patel
I'm trying to analyze XML documents using the spark-xml package. Since all XML columns are optional, some columns may or may not exist. When I register the Dataframe as a table, how do I check whether a nested column exists? My column name is "emp", which is already exploded, and I am trying to c
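A minimal sketch of one way to do this, assuming the check can happen driver-side before querying: walk the DataFrame's StructType one level per dotted path segment. The path "emp.name" and the df variable are placeholders from a shell session:

```python
# Sketch: descend the schema one level per path segment; the column
# exists only if every segment names a field at its level.
from pyspark.sql.types import StructType

def has_column(schema, path):
    for part in path.split("."):
        if not isinstance(schema, StructType):
            return False
        matches = [f for f in schema.fields if f.name == part]
        if not matches:
            return False
        schema = matches[0].dataType
    return True

has_column(df.schema, "emp.name")   # placeholder nested path
```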

Re: spark-xml to avro - SchemaParseException: Can't redefine

2016-09-09 Thread Arun Patel
ed the title from Save DF with nested records with the same > name to spark-avro fails to save DF with nested records having the same > name Jun 23, 2015 > > > > -- > *From:* Arun Patel > *Sent:* Thursday, September 8, 2016 5:31 PM > *To:* u

spark-xml to avro - SchemaParseException: Can't redefine

2016-09-08 Thread Arun Patel
I'm trying to convert XML to Avro, but I am getting a SchemaParseException for 'Rules', which exists in two separate containers. Any thoughts? The XML is attached. df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGLResponse',attributePrefix='').load('GGL.xml') df.show
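A hedged sketch of a common workaround, not something confirmed in this thread: Avro forbids two record definitions with the same name, and nested record names are derived from field names, so two 'Rules' structs in different containers collide. Selecting them out under distinct aliases before saving avoids the redefinition. The column paths below are placeholders for the real GGL schema:

```python
# Sketch: pull the clashing nested structs up under distinct names so the
# generated Avro schema defines each record type only once.
from pyspark.sql.functions import col

fixed = df.select(
    col("containerA.Rules").alias("rulesA"),   # placeholder paths
    col("containerB.Rules").alias("rulesB"))

fixed.write.format("com.databricks.spark.avro").save("out.avro")
```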

Re: Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
his to parquet file? >> >> There are two ways to specify "path": >> >> 1. Using option method >> 2. start(path: String): StreamingQuery >> >> Pozdrawiam, >> Jacek Laskowski >> >> https://medium.com/@jaceklaskowski/ >> Mas
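Both forms from the reply above, sketched in PySpark (the thread itself is in Scala, but the Python API mirrors it). The paths are placeholders, and a checkpoint location is required for the file sink:

```python
# way 1: supply 'path' via the option method
query = (streamingCountsDF.writeStream
         .format("parquet")
         .option("checkpointLocation", "/tmp/ckpt")   # placeholder
         .option("path", "/tmp/out")                  # placeholder
         .start())

# way 2: pass the path straight to start()
query2 = (streamingCountsDF.writeStream
          .format("parquet")
          .option("checkpointLocation", "/tmp/ckpt2")
          .start("/tmp/out2"))
```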

Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
I am trying out the Structured Streaming parquet sink. As per the documentation, parquet is the only available file sink. I am getting an error like 'path' is not specified. scala> val query = streamingCountsDF.writeStream.format("parquet").start() java.lang.IllegalArgumentException: 'path' is not speci

Re: Graphframe Error

2016-07-07 Thread Arun Patel
I have tried this already. It does not work. What version of Python is needed for this package? On Wed, Jul 6, 2016 at 12:45 AM, Felix Cheung wrote: > This could be the workaround: > > http://stackoverflow.com/a/36419857 > > > > > On Tue, Jul 5, 2016 at 5:

Re: Graphframe Error

2016-07-05 Thread Arun Patel
amples works well in my laptop. > > Or you can use try with > > bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar > > to launch PySpark with graphframes enabled. You should set "--py-files" > and "--jars" options with the directory where you saved

Graphframe Error

2016-07-03 Thread Arun Patel
I started my pyspark shell with the command (I am using Spark 1.6): bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6 I have also copied http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar to the lib directory of Spark. I w
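Assuming the --packages launch succeeds and the package's Python side is on the path (sqlContext from the shell), a minimal smoke test looks like this; the vertex and edge data are made up:

```python
from graphframes import GraphFrame   # fails here if the package isn't loaded

v = sqlContext.createDataFrame(
    [("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = sqlContext.createDataFrame(
    [("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()   # trivial query to confirm the install works
```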

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Arun Patel
Can anyone answer these questions, please? On Mon, Jun 13, 2016 at 6:51 PM, Arun Patel wrote: > Thanks Michael. > > I went through these slides already and could not find answers to these > specific questions. > > I created a Dataset and converted it to DataFrame in 1.6 and 2

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
k-dataframes-datasets-and-streaming-by-michael-armbrust > > On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel > wrote: > >> In Spark 2.0, DataFrames and Datasets are unified. DataFrame is simply an >> alias for a Dataset of type Row. I have a few questions. >> >> 1) What

Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
In Spark 2.0, DataFrames and Datasets are unified. DataFrame is simply an alias for a Dataset of type Row. I have a few questions. 1) What does this really mean to an application developer? 2) Why was this unification needed in Spark 2.0? 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
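A small sketch toward (3), based on my own reading of the 2.0 changes rather than this thread: the visible difference is the SparkSession entry point, while existing DataFrame code carries over. On the JVM side, DataFrame is literally the alias type DataFrame = Dataset[Row]; PySpark keeps only DataFrame, since Datasets depend on compile-time types. The file path is a placeholder:

```python
from pyspark.sql import SparkSession

# Spark 2.0: one entry point replaces SQLContext/HiveContext
spark = SparkSession.builder.appName("unification-demo").getOrCreate()

df = spark.read.json("people.json")   # a DataFrame, i.e. Dataset[Row] on the JVM
df.filter(df.age > 21).show()         # 1.6-style DataFrame code runs unchanged
```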

Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Thanks Sean and Jacek. Do we have any updated documentation for 2.0 somewhere? On Tue, Jun 7, 2016 at 9:34 AM, Jacek Laskowski wrote: > On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen wrote: > > That's not any kind of authoritative statement, just my opinion and > guess. > > Oh, come on. You're not

Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
ski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Thu, Apr 28, 2016 at 1:43 PM, Arun Patel > wrote: > > A small request. > > > > Would you min

Spark 2.0 Release Date

2016-04-28 Thread Arun Patel
A small request. Would you mind providing an approximate date for the Spark 2.0 release? Is it early May, mid May, or end of May? Thanks, Arun

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Arun Patel
So, does that mean only one RDD is created by all receivers? On Sun, Dec 20, 2015 at 10:23 PM, Saisai Shao wrote: > Normally there will be one RDD in each batch. > > You could refer to the implementation of DStream#getOrCompute. > > > On Mon, Dec 21, 2015 at 11:04 AM, A

Spark Streaming - Number of RDDs in Dstream

2015-12-20 Thread Arun Patel
It may be a simple question... but I am struggling to understand this. A DStream is a sequence of RDDs created in a batch window. So, how do I know how many RDDs are created in a batch? I am clear about the number of partitions created, which is Number of Partitions = (Batch Interval / spark.str
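A small way to observe it empirically (my own illustration, assuming a pyspark shell where sc exists and a socket source on a placeholder port): foreachRDD fires once per batch, so one RDD per batch interval shows up, and printing the partition count checks the blockInterval arithmetic:

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                       # 10-second batches
lines = ssc.socketTextStream("localhost", 9999)      # placeholder source

def describe(rdd):
    # called once per batch: one RDD, partitioned per blockInterval
    print("one RDD this batch, %d partitions" % rdd.getNumPartitions())

lines.foreachRDD(describe)
ssc.start()
ssc.awaitTermination()
```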

Scheduling across applications - Need suggestion

2015-04-22 Thread Arun Patel
I believe we can use properties like --executor-memory and --total-executor-cores to configure the resources allocated to each application. But in a multi-user environment, shells and applications are being submitted by multiple users at the same time. All users are requesting resources with d
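One option to sketch (my suggestion, not something from this thread): dynamic allocation, so idle shells give executors back to the cluster instead of pinning a fixed --total-executor-cores. This assumes the external shuffle service is running on the workers; the numbers are illustrative:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shared-cluster-app")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")      # required service
        .set("spark.dynamicAllocation.minExecutors", "1")  # illustrative bounds
        .set("spark.dynamicAllocation.maxExecutors", "10"))
sc = SparkContext(conf=conf)
```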

Re: mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
5 at 4:05 PM, Arun Patel > wrote: > >> What is the difference between mapPartitions and foreachPartition? >> >> When to use these? >> >> Thanks, >> Arun >> > >

mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
What is the difference between mapPartitions and foreachPartition? When should each be used? Thanks, Arun
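A small illustration of the distinction (my own example, assuming a shell where sc exists): mapPartitions is a transformation that returns a new RDD built from each partition's iterator, while foreachPartition is an action run purely for side effects, e.g. one database connection per partition:

```python
rdd = sc.parallelize(range(10), 2)

# mapPartitions: transformation, lazily yields a new RDD
doubled = rdd.mapPartitions(lambda it: (x * 2 for x in it))
print(doubled.collect())

# foreachPartition: action, returns nothing; good for per-partition setup
def save_partition(it):
    # e.g. open one connection here, write each element, then close
    for x in it:
        pass

rdd.foreachPartition(save_partition)
```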

Code Deployment tools in Production

2015-04-19 Thread Arun Patel
Generally, what tools are used to schedule Spark jobs in production? How is Spark Streaming code deployed? I am interested in knowing the tools used, like cron, Oozie, etc. Thanks, Arun

Re: Dataframes Question

2015-04-19 Thread Arun Patel
tand DataFrame is >> grandfathering SchemaRDD. This was done for API stability as Spark SQL >> matured out of alpha as part of the 1.3.0 release. >> >> It is forward-looking and brings (dataframe-like) syntax that was not >> available with the older SchemaRDD. >>

Dataframes Question

2015-04-18 Thread Arun Patel
Experts, I have a few basic questions on DataFrames vs Spark SQL. My confusion is more with DataFrames. 1) What is the difference between Spark SQL and DataFrames? Are they the same? 2) The documentation says SchemaRDD is renamed to DataFrame. Does this mean SchemaRDD no longer exists in 1.3? 3) As per doc
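Toward (1), a sketch of how the two relate in 1.3-era code (the data is made up, and sqlContext is assumed from a pyspark shell): they are two front ends over the same engine, so the same query can be written either way:

```python
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# DataFrame API
df.filter(df.id > 1).select("letter").show()

# Spark SQL over the same data (registerTempTable is the 1.3-era name)
df.registerTempTable("t")
sqlContext.sql("SELECT letter FROM t WHERE id > 1").show()
```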