Dataframe size using RDDStorageInfo objects

2018-03-16 Thread Bahubali Jain
Hi, I am trying to figure out a way to find the size of *persisted* dataframes using *sparkContext.getRDDStorageInfo()*. The RDDStorageInfo object has information related to the number of bytes stored in memory and on disk. For example, I have 3 dataframes which I have cached: df1.cache() df2.cache() d
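
A minimal sketch of reading those per-RDD byte counts from Java, assuming a SparkSession named spark and a hypothetical input path; recent Spark releases return org.apache.spark.storage.RDDInfo entries, and a cached DataFrame only shows up once it has actually been materialized:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.storage.RDDInfo;

    SparkSession spark = SparkSession.builder().appName("cached-size").getOrCreate();
    Dataset<Row> df1 = spark.read().parquet("hdfs:///data/input1");   // hypothetical path
    df1.cache();
    df1.count();   // force materialization so the cache has something to report

    // One entry per cached RDD/DataFrame, with in-memory and on-disk byte counts
    for (RDDInfo info : spark.sparkContext().getRDDStorageInfo()) {
        System.out.println(info.name() + " -> memory: " + info.memSize()
                + " bytes, disk: " + info.diskSize() + " bytes");
    }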

Compression during shuffle writes

2017-11-09 Thread Bahubali Jain
Hi, I have compressed data of size 500GB. I am repartitioning this data since the underlying data is very skewed and is causing a lot of issues for the downstream jobs. During repartitioning the *shuffle writes* are not getting compressed; due to this I am running into disk space issues. Below is the
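
For reference, shuffle-write compression is governed by a handful of Spark properties; a minimal sketch of setting them explicitly on the SparkConf (spark.shuffle.compress and spark.shuffle.spill.compress both default to true):

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
            .setAppName("repartition-job")
            .set("spark.shuffle.compress", "true")          // compress map-side shuffle output files
            .set("spark.shuffle.spill.compress", "true")    // compress data spilled to disk during shuffles
            .set("spark.io.compression.codec", "lz4");      // codec used for shuffle/spill compression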

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
If you are looking for a workaround, the JIRA ticket clearly shows you how to increase your driver heap. 1G in today's world really is kind of small. > Yong > -- > *From:* Bahubali Jain > *Sent:* Thursday, March 16, 2017 10:34 PM > *

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
; issues.apache.org > Executing a SQL statement with a large number of partitions requires a high memory space for the driver even when there are no requests to collect data back to the driver. > -- > *From:* Bahubali Jain > *Sent:* Thursday,

Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
Hi, While saving a dataset using *mydataset.write().csv("outputlocation")* I am running into an exception: *"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"*. Does it mean that for saving a dataset whole of
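
Two commonly used knobs for this error, as a hedged sketch: raise spark.driver.maxResultSize when building the session, and reduce the number of tasks with coalesce() so fewer serialized task results accumulate on the driver (the driver heap itself has to be raised through spark-submit rather than in code):

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .appName("save-dataset")
            .config("spark.driver.maxResultSize", "4g")   // default is 1g; "0" removes the limit
            .getOrCreate();

    // Fewer, larger tasks -> fewer serialized task results reported back to the driver
    mydataset.coalesce(200).write().csv("outputlocation");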

SPARK ML- Feature Selection Techniques

2016-09-05 Thread Bahubali Jain
Hi, Do we have any feature selection technique implementations (wrapper methods, embedded methods) available in Spark ML? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
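
Spark ML's built-in selectors are filter-style rather than wrapper or embedded methods; a minimal ChiSqSelector sketch, with the column names and the trainingData dataset as assumptions:

    import org.apache.spark.ml.feature.ChiSqSelector;
    import org.apache.spark.ml.feature.ChiSqSelectorModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Keep the 50 features with the highest chi-squared score against the label column
    ChiSqSelector selector = new ChiSqSelector()
            .setNumTopFeatures(50)
            .setFeaturesCol("features")
            .setLabelCol("label")
            .setOutputCol("selectedFeatures");

    ChiSqSelectorModel model = selector.fit(trainingData);   // trainingData: Dataset<Row> (assumed)
    Dataset<Row> reduced = model.transform(trainingData);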

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
s the model, then add the model to the pipeline where it will only transform. > val featureVectorIndexer = new VectorIndexer() .setInputCol("feature") .setOutputCol("indexedfeature") .setMaxCategories(180) .fit(completeDataset)

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
Hi, I had run into a similar exception, "java.util.NoSuchElementException: key not found:". After further investigation I realized it is happening due to the VectorIndexer being executed on the training dataset and not on the entire dataset. In the dataframe I have 5 categories, each of these has to go through
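
A minimal sketch of the fix described in these two messages: fit the VectorIndexer once on the complete dataset (column names and maxCategories are taken from the quoted reply; the dataset variables are assumptions), then reuse the fitted model to transform the train and test splits:

    import org.apache.spark.ml.feature.VectorIndexer;
    import org.apache.spark.ml.feature.VectorIndexerModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Fit once on the complete dataset so every category value is seen by the indexer
    VectorIndexerModel indexerModel = new VectorIndexer()
            .setInputCol("feature")
            .setOutputCol("indexedfeature")
            .setMaxCategories(180)
            .fit(completeDataset);            // completeDataset: Dataset<Row> with all rows (assumed)

    // Reuse the same fitted model on the splits, so no category is "unknown"
    Dataset<Row> indexedTrain = indexerModel.transform(trainingData);   // assumed split
    Dataset<Row> indexedTest  = indexerModel.transform(testData);       // assumed split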

Large files with wholetextfile()

2016-07-12 Thread Bahubali Jain
Hi, We have a requirement wherein we need to process a set of XML files; each of the XML files contains several records (e.g.: data of record 1.. data of record 2.. Expected output is Since we needed the file name as well in the output, we chose wholeTextFiles(). We had to go against
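
A minimal wholeTextFiles() sketch, assuming an existing JavaSparkContext jsc, a hypothetical HDFS path and a hypothetical record tag; each file becomes a single (fileName, content) pair, so very large files end up in one partition and need to fit in executor memory:

    import org.apache.spark.api.java.JavaPairRDD;

    // (fileName, fileContent) pairs - the file name stays available for the output
    JavaPairRDD<String, String> files = jsc.wholeTextFiles("hdfs:///data/xml/");   // hypothetical path

    // Example: count records per file, assuming records are delimited by a </record> tag
    JavaPairRDD<String, Integer> recordsPerFile =
            files.mapValues(content -> content.split("</record>", -1).length - 1);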

DAG related query

2015-08-20 Thread Bahubali Jain
Hi, How would the DAG look for the below code? JavaRDD rdd1 = context.textFile(); JavaRDD rdd2 = rdd1.map(); rdd1 = rdd2.map(); Does this lead to any kind of cycle? Thanks, Baahu
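
For what it's worth, reassigning the Java variable does not create a cycle; RDDs are immutable, so the lineage stays a linear chain. A small sketch that prints it (the path and map bodies are placeholders):

    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> rdd1 = context.textFile("hdfs:///input");        // placeholder path
    JavaRDD<String> rdd2 = rdd1.map(line -> line.toLowerCase());
    rdd1 = rdd2.map(line -> line.trim());    // only the variable is reused; the graph stays textFile -> map -> map

    System.out.println(rdd1.toDebugString());   // shows a linear lineage, no cycle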

JavaRDD and saveAsNewAPIHadoopFile()

2015-06-27 Thread Bahubali Jain
Hi, Why doesn't JavaRDD have saveAsNewAPIHadoopFile() associated with it? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
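
saveAsNewAPIHadoopFile() lives on JavaPairRDD, since the Hadoop output formats expect a key/value shape, so a plain JavaRDD is first mapped to pairs. A hedged sketch, assuming an existing JavaSparkContext jsc and hypothetical paths:

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    JavaRDD<String> lines = jsc.textFile("hdfs:///input");     // hypothetical input
    JavaPairRDD<NullWritable, Text> pairs =
            lines.mapToPair(s -> new Tuple2<>(NullWritable.get(), new Text(s)));
    pairs.saveAsNewAPIHadoopFile("hdfs:///output",             // hypothetical output
            NullWritable.class, Text.class, TextOutputFormat.class);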

Multiple dir support : newApiHadoopFile

2015-06-26 Thread Bahubali Jain
Hi, How do we read files from multiple directories using newAPIHadoopFile()? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
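
One common approach is a comma-separated list of directories (the path string is handed to Hadoop's FileInputFormat, which also accepts globs); a sketch assuming an existing JavaSparkContext jsc and hypothetical paths:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.api.java.JavaPairRDD;

    JavaPairRDD<LongWritable, Text> records = jsc.newAPIHadoopFile(
            "hdfs:///data/dir1,hdfs:///data/dir2",   // comma-separated directories (hypothetical)
            TextInputFormat.class, LongWritable.class, Text.class,
            jsc.hadoopConfiguration());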

Pseudo Spark Streaming ?

2015-04-05 Thread Bahubali Jain
Hi, I have a requirement in which I plan to use Spark Streaming. I am supposed to calculate the access count for certain webpages. I receive the webpage access information through log files. By access count I mean "how many times was the page accessed *till now*". I have the log files for the past 2 yea
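
For the two years of existing logs, a plain batch job may be enough, with streaming reserved for newly arriving data; a minimal counting sketch, assuming an existing JavaSparkContext jsc, a hypothetical log path, and an assumed layout where the URL is the seventh space-separated field:

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    JavaPairRDD<String, Integer> accessCounts = jsc.textFile("hdfs:///logs/")     // hypothetical path
            .mapToPair(line -> new Tuple2<>(line.split(" ")[6], 1))               // assumed: URL is field 7
            .reduceByKey((a, b) -> a + b);                                        // total hits per page so far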

Re: Writing to HDFS from spark Streaming

2015-02-15 Thread Bahubali Jain
Feb 11, 2015 at 6:35 AM, Akhil Das wrote: > Did you try: temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, (Class) TextOutputFormat.class); > Thanks > Best Regards > O

Writing to HDFS from spark Streaming

2015-02-10 Thread Bahubali Jain
Hi, I am facing issues while writing data from a streaming RDD to HDFS. JavaPairDStream temp; ... ... temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, TextOutputFormat.class); I see compilation issues as below... The method saveAsHadoopFiles(String, String, Class, Class, Cla
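
A cleaned-up version of the call suggested in the reply above: the raw (Class) cast works around the generic bound on the output-format parameter, and the old-API org.apache.hadoop.mapred.TextOutputFormat is the one saveAsHadoopFiles() expects (temp is the JavaPairDStream from the question):

    import org.apache.hadoop.mapred.TextOutputFormat;

    temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class,
            (Class) TextOutputFormat.class);   // raw cast, as suggested in the quoted reply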

textFileStream() issue?

2014-12-03 Thread Bahubali Jain
Hi, I am trying to use textFileStream("some_hdfs_location") to pick up new files from an HDFS location. I am seeing a pretty strange behavior though. textFileStream() is not detecting new files when I "move" them from a location within HDFS to the location at which textFileStream() is checking for new file
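
This matches commonly reported textFileStream() behaviour: files are selected by modification time within the current batch window, and a move/rename inside HDFS keeps the old timestamp, so the file never looks new; copying the file in, or writing it fresh, gives it a current timestamp. A minimal monitoring sketch, with the directory and batch interval as assumptions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf().setAppName("file-monitor");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));   // assumed interval

    // Only files whose modification time falls inside the current window are picked up
    JavaDStream<String> lines = ssc.textFileStream("hdfs:///incoming");   // monitored directory (assumption)
    lines.print();

    ssc.start();
    ssc.awaitTermination();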

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread Bahubali Jain
Hi, You can associate all the messages of a 3-minute interval with a unique key and then group by and finally add up. Thanks On Dec 1, 2014 9:02 PM, "pankaj" wrote: > Hi, > My incoming message has a time stamp as one field and I have to perform aggregation over a 3-minute time slice. > Message
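
A small sketch of that suggestion, assuming messages is an existing JavaDStream<String> whose first comma-separated field is an epoch-millisecond timestamp; the key is the start of the 3-minute bucket each message falls into:

    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    final long windowMs = 3 * 60 * 1000L;   // 3-minute buckets
    JavaPairDStream<Long, Integer> perBucket = messages
            .mapToPair(m -> {
                long ts = Long.parseLong(m.split(",")[0]);          // assumed timestamp field
                return new Tuple2<>(ts - (ts % windowMs), 1);       // key = start of the 3-minute bucket
            })
            .reduceByKey((a, b) -> a + b);                          // add up per bucket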

Re: Help with Spark Streaming

2014-11-16 Thread Bahubali Jain
Hi, Can anybody help me with this please? I haven't been able to find the problem :( Thanks. On Nov 15, 2014 4:48 PM, "Bahubali Jain" wrote: > Hi, > Trying to use Spark Streaming, but I am struggling with word count :( > I want to consolidate the output of the word count (not on a

Help with Spark Streaming

2014-11-15 Thread Bahubali Jain
Hi, Trying to use Spark Streaming, but I am struggling with word count :( I want to consolidate the output of the word count (not on a per-window basis), so I am using updateStateByKey(), but for some reason this is not working. The function itself is not being invoked (do not see the sysout output on con
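
One common gotcha here is a missing checkpoint directory, which updateStateByKey() requires. A minimal sketch, assuming ssc is the JavaStreamingContext, words an existing JavaDStream<String>, and the Spark 2.x org.apache.spark.api.java.Optional (older releases used Guava's Optional with the same pattern):

    import java.util.List;
    import org.apache.spark.api.java.Optional;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    ssc.checkpoint("hdfs:///checkpoints/wordcount");   // required for stateful operations (path is an assumption)

    JavaPairDStream<String, Integer> totals = words
            .mapToPair(w -> new Tuple2<>(w, 1))
            .updateStateByKey((List<Integer> newValues, Optional<Integer> state) -> {
                int sum = state.isPresent() ? state.get() : 0;     // running total across all batches
                for (Integer v : newValues) sum += v;
                return Optional.of(sum);
            });
    totals.print();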

Issue with Custom Key Class

2014-11-08 Thread Bahubali Jain
Hi, I have a custom key class. In this class equals() and hashCode() have been overridden. I have a JavaPairRDD which has this class as the key. When groupByKey() or reduceByKey() is called, a null object is being passed to the function *equals*(Object obj); as a result the grouping is failing. Is t
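
A minimal sketch of what a key class usually needs for groupByKey()/reduceByKey(): it must be Serializable (keys are shuffled across the network) and equals()/hashCode() must be consistent and null-safe. The field names here are hypothetical:

    import java.io.Serializable;
    import java.util.Objects;

    public class MyKey implements Serializable {          // hypothetical key class
        private final String id;
        private final int type;

        public MyKey(String id, int type) { this.id = id; this.type = type; }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof MyKey)) return false;     // also covers obj == null
            MyKey other = (MyKey) obj;
            return type == other.type && Objects.equals(id, other.id);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, type);                 // must stay consistent with equals()
        }
    }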