Dataframe size using RDDStorageInfo objects

2018-03-16 Thread Bahubali Jain
Hi, I am trying to figure out a way to find the size of *persisted* dataframes using *sparkContext.getRDDStorageInfo()*. The RDDStorageInfo object has information related to the number of bytes stored in memory and on disk. For example, I have 3 dataframes which I have cached: df1.cache() df2.cache() d
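
A minimal sketch of reading those per-RDD byte counts from Java, assuming a SparkSession named spark and a hypothetical input path; recent Spark releases return org.apache.spark.storage.RDDInfo entries, and a cached DataFrame only shows up once it has actually been materialized:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.storage.RDDInfo;

    SparkSession spark = SparkSession.builder().appName("cached-size").getOrCreate();
    Dataset<Row> df1 = spark.read().parquet("hdfs:///data/input1");   // hypothetical path
    df1.cache();
    df1.count();   // force materialization so the cache has something to report

    // One entry per cached RDD/DataFrame, with in-memory and on-disk byte counts
    for (RDDInfo info : spark.sparkContext().getRDDStorageInfo()) {
        System.out.println(info.name() + " -> memory: " + info.memSize()
                + " bytes, disk: " + info.diskSize() + " bytes");
    }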

Compression during shuffle writes

2017-11-09 Thread Bahubali Jain
Hi, I have compressed data of size 500GB. I am repartitioning this data since the underlying data is very skewed and is causing a lot of issues for the downstream jobs. During repartitioning the *shuffle writes* are not getting compressed; due to this I am running into disk space issues. Below is the
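
For reference, shuffle-write compression is governed by a handful of Spark properties; a minimal sketch of setting them explicitly on the SparkConf (spark.shuffle.compress and spark.shuffle.spill.compress both default to true):

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
            .setAppName("repartition-job")
            .set("spark.shuffle.compress", "true")          // compress map-side shuffle output files
            .set("spark.shuffle.spill.compress", "true")    // compress data spilled to disk during shuffles
            .set("spark.io.compression.codec", "lz4");      // codec used for shuffle/spill compression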

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
If you are looking for a workaround, the JIRA ticket clearly shows you how to increase your driver heap. 1G in today's world really is kind of small. > Yong > -- > *From:* Bahubali Jain > *Sent:* Thursday, March 16, 2017 10:34 PM > *

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
; issues.apache.org > Executing a SQL statement with a large number of partitions requires a high memory space for the driver even when there are no requests to collect data back to the driver. > -- > *From:* Bahubali Jain > *Sent:* Thursday,

Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
Hi, While saving a dataset using *mydataset.write().csv("outputlocation")* I am running into an exception: *"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"*. Does it mean that for saving a dataset whole of
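
Two commonly used knobs for this error, as a hedged sketch: raise spark.driver.maxResultSize when building the session, and reduce the number of tasks with coalesce() so fewer serialized task results accumulate on the driver (the driver heap itself has to be raised through spark-submit rather than in code):

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .appName("save-dataset")
            .config("spark.driver.maxResultSize", "4g")   // default is 1g; "0" removes the limit
            .getOrCreate();

    // Fewer, larger tasks -> fewer serialized task results reported back to the driver
    mydataset.coalesce(200).write().csv("outputlocation");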

SPARK ML- Feature Selection Techniques

2016-09-05 Thread Bahubali Jain
Hi, Do we have any feature selection technique implementations (wrapper methods, embedded methods) available in Spark ML? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
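
Spark ML's built-in selectors are filter-style rather than wrapper or embedded methods; a minimal ChiSqSelector sketch, with the column names and the trainingData dataset as assumptions:

    import org.apache.spark.ml.feature.ChiSqSelector;
    import org.apache.spark.ml.feature.ChiSqSelectorModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Keep the 50 features with the highest chi-squared score against the label column
    ChiSqSelector selector = new ChiSqSelector()
            .setNumTopFeatures(50)
            .setFeaturesCol("features")
            .setLabelCol("label")
            .setOutputCol("selectedFeatures");

    ChiSqSelectorModel model = selector.fit(trainingData);   // trainingData: Dataset<Row> (assumed)
    Dataset<Row> reduced = model.transform(trainingData);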

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
s the model, then add the model to the pipeline where it will only transform. > val featureVectorIndexer = new VectorIndexer() .setInputCol("feature") .setOutputCol("indexedfeature") .setMaxCategories(180) .fit(completeDataset)

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
Hi, I had run into a similar exception, "java.util.NoSuchElementException: key not found:". After further investigation I realized it is happening due to the VectorIndexer being executed on the training dataset and not on the entire dataset. In the dataframe I have 5 categories, each of these has to go through
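
A minimal sketch of the fix described in these two messages: fit the VectorIndexer once on the complete dataset (column names and maxCategories are taken from the quoted reply; the dataset variables are assumptions), then reuse the fitted model to transform the train and test splits:

    import org.apache.spark.ml.feature.VectorIndexer;
    import org.apache.spark.ml.feature.VectorIndexerModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Fit once on the complete dataset so every category value is seen by the indexer
    VectorIndexerModel indexerModel = new VectorIndexer()
            .setInputCol("feature")
            .setOutputCol("indexedfeature")
            .setMaxCategories(180)
            .fit(completeDataset);            // completeDataset: Dataset<Row> with all rows (assumed)

    // Reuse the same fitted model on the splits, so no category is "unknown"
    Dataset<Row> indexedTrain = indexerModel.transform(trainingData);   // assumed split
    Dataset<Row> indexedTest  = indexerModel.transform(testData);       // assumed split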

Large files with wholetextfile()

2016-07-12 Thread Bahubali Jain
Hi, We have a requirement wherein we need to process a set of XML files; each of the XML files contains several records (e.g.: data of record 1.. data of record 2.. Expected output is Since we needed the file name as well in the output, we chose wholeTextFiles(). We had to go against
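
A minimal wholeTextFiles() sketch, assuming an existing JavaSparkContext jsc, a hypothetical HDFS path and a hypothetical record tag; each file becomes a single (fileName, content) pair, so very large files end up in one partition and need to fit in executor memory:

    import org.apache.spark.api.java.JavaPairRDD;

    // (fileName, fileContent) pairs - the file name stays available for the output
    JavaPairRDD<String, String> files = jsc.wholeTextFiles("hdfs:///data/xml/");   // hypothetical path

    // Example: count records per file, assuming records are delimited by a </record> tag
    JavaPairRDD<String, Integer> recordsPerFile =
            files.mapValues(content -> content.split("</record>", -1).length - 1);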

DAG related query

2015-08-20 Thread Bahubali Jain
Hi, How would the DAG look for the below code? JavaRDD rdd1 = context.textFile(); JavaRDD rdd2 = rdd1.map(); rdd1 = rdd2.map(); Does this lead to any kind of cycle? Thanks, Baahu
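
For what it's worth, reassigning the Java variable does not create a cycle; RDDs are immutable, so the lineage stays a linear chain. A small sketch that prints it (the path and map bodies are placeholders):

    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> rdd1 = context.textFile("hdfs:///input");        // placeholder path
    JavaRDD<String> rdd2 = rdd1.map(line -> line.toLowerCase());
    rdd1 = rdd2.map(line -> line.trim());    // only the variable is reused; the graph stays textFile -> map -> map

    System.out.println(rdd1.toDebugString());   // shows a linear lineage, no cycle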

JavaRDD and saveAsNewAPIHadoopFile()

2015-06-27 Thread Bahubali Jain
Hi, Why doesn't JavaRDD have saveAsNewAPIHadoopFile() associated with it? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
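
saveAsNewAPIHadoopFile() lives on JavaPairRDD, since the Hadoop output formats expect a key/value shape, so a plain JavaRDD is first mapped to pairs. A hedged sketch, assuming an existing JavaSparkContext jsc and hypothetical paths:

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    JavaRDD<String> lines = jsc.textFile("hdfs:///input");     // hypothetical input
    JavaPairRDD<NullWritable, Text> pairs =
            lines.mapToPair(s -> new Tuple2<>(NullWritable.get(), new Text(s)));
    pairs.saveAsNewAPIHadoopFile("hdfs:///output",             // hypothetical output
            NullWritable.class, Text.class, TextOutputFormat.class);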

Multiple dir support : newApiHadoopFile

2015-06-26 Thread Bahubali Jain
Hi, How do we read files from multiple directories using newAPIHadoopFile()? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
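
One common approach is a comma-separated list of directories (the path string is handed to Hadoop's FileInputFormat, which also accepts globs); a sketch assuming an existing JavaSparkContext jsc and hypothetical paths:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.api.java.JavaPairRDD;

    JavaPairRDD<LongWritable, Text> records = jsc.newAPIHadoopFile(
            "hdfs:///data/dir1,hdfs:///data/dir2",   // comma-separated directories (hypothetical)
            TextInputFormat.class, LongWritable.class, Text.class,
            jsc.hadoopConfiguration());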

Pseudo Spark Streaming ?

2015-04-05 Thread Bahubali Jain
Hi, I have a requirement in which I plan to use Spark Streaming. I am supposed to calculate the access count for certain webpages. I receive the webpage access information through log files. By access count I mean "how many times was the page accessed *till now*". I have the log files for the past 2 yea
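
For the two years of existing logs, a plain batch job may be enough, with streaming reserved for newly arriving data; a minimal counting sketch, assuming an existing JavaSparkContext jsc, a hypothetical log path, and an assumed layout where the URL is the seventh space-separated field:

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    JavaPairRDD<String, Integer> accessCounts = jsc.textFile("hdfs:///logs/")     // hypothetical path
            .mapToPair(line -> new Tuple2<>(line.split(" ")[6], 1))               // assumed: URL is field 7
            .reduceByKey((a, b) -> a + b);                                        // total hits per page so far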

Re: Writing to HDFS from spark Streaming

2015-02-15 Thread Bahubali Jain
Feb 11, 2015 at 6:35 AM, Akhil Das wrote: > Did you try: temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, (Class) TextOutputFormat.class); > Thanks > Best Regards > O

Writing to HDFS from spark Streaming

2015-02-10 Thread Bahubali Jain
Hi, I am facing issues while writing data from a streaming RDD to HDFS. JavaPairDStream temp; ... ... temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, TextOutputFormat.class); I see compilation issues as below... The method saveAsHadoopFiles(String, String, Class, Class, Cla
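
A cleaned-up version of the call suggested in the reply above: the raw (Class) cast works around the generic bound on the output-format parameter, and the old-API org.apache.hadoop.mapred.TextOutputFormat is the one saveAsHadoopFiles() expects (temp is the JavaPairDStream from the question):

    import org.apache.hadoop.mapred.TextOutputFormat;

    temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class,
            (Class) TextOutputFormat.class);   // raw cast, as suggested in the quoted reply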

textFileStream() issue?

2014-12-03 Thread Bahubali Jain
Hi, I am trying to use textFileStream("some_hdfs_location") to pick up new files from an HDFS location. I am seeing a pretty strange behavior though. textFileStream() is not detecting new files when I "move" them from a location within HDFS to the location at which textFileStream() is checking for new file
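
This matches commonly reported textFileStream() behaviour: files are selected by modification time within the current batch window, and a move/rename inside HDFS keeps the old timestamp, so the file never looks new; copying the file in, or writing it fresh, gives it a current timestamp. A minimal monitoring sketch, with the directory and batch interval as assumptions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf().setAppName("file-monitor");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));   // assumed interval

    // Only files whose modification time falls inside the current window are picked up
    JavaDStream<String> lines = ssc.textFileStream("hdfs:///incoming");   // monitored directory (assumption)
    lines.print();

    ssc.start();
    ssc.awaitTermination();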

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread Bahubali Jain
Hi, You can associate all the messages of a 3-minute interval with a unique key and then group by and finally add up. Thanks On Dec 1, 2014 9:02 PM, "pankaj" wrote: > Hi, > My incoming message has a time stamp as one field and I have to perform aggregation over a 3-minute time slice. > Message
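
A small sketch of that suggestion, assuming messages is an existing JavaDStream<String> whose first comma-separated field is an epoch-millisecond timestamp; the key is the start of the 3-minute bucket each message falls into:

    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    final long windowMs = 3 * 60 * 1000L;   // 3-minute buckets
    JavaPairDStream<Long, Integer> perBucket = messages
            .mapToPair(m -> {
                long ts = Long.parseLong(m.split(",")[0]);          // assumed timestamp field
                return new Tuple2<>(ts - (ts % windowMs), 1);       // key = start of the 3-minute bucket
            })
            .reduceByKey((a, b) -> a + b);                          // add up per bucket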

Re: Help with Spark Streaming

2014-11-16 Thread Bahubali Jain
Hi, Can anybody help me with this please? I haven't been able to find the problem :( Thanks. On Nov 15, 2014 4:48 PM, "Bahubali Jain" wrote: > Hi, > Trying to use Spark Streaming, but I am struggling with word count :( > I want to consolidate the output of the word count (not on a

Help with Spark Streaming

2014-11-15 Thread Bahubali Jain
Hi, Trying to use Spark Streaming, but I am struggling with word count :( I want to consolidate the output of the word count (not on a per-window basis), so I am using updateStateByKey(), but for some reason this is not working. The function itself is not being invoked (do not see the sysout output on con
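
One common gotcha here is a missing checkpoint directory, which updateStateByKey() requires. A minimal sketch, assuming ssc is the JavaStreamingContext, words an existing JavaDStream<String>, and the Spark 2.x org.apache.spark.api.java.Optional (older releases used Guava's Optional with the same pattern):

    import java.util.List;
    import org.apache.spark.api.java.Optional;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import scala.Tuple2;

    ssc.checkpoint("hdfs:///checkpoints/wordcount");   // required for stateful operations (path is an assumption)

    JavaPairDStream<String, Integer> totals = words
            .mapToPair(w -> new Tuple2<>(w, 1))
            .updateStateByKey((List<Integer> newValues, Optional<Integer> state) -> {
                int sum = state.isPresent() ? state.get() : 0;     // running total across all batches
                for (Integer v : newValues) sum += v;
                return Optional.of(sum);
            });
    totals.print();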

Issue with Custom Key Class

2014-11-08 Thread Bahubali Jain
Hi, I have a custom key class. In this class equals() and hashCode() have been overridden. I have a JavaPairRDD which has this class as the key. When groupByKey() or reduceByKey() is called, a null object is being passed to the function *equals*(Object obj); as a result the grouping is failing. Is t
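
A minimal sketch of what a key class usually needs for groupByKey()/reduceByKey(): it must be Serializable (keys are shuffled across the network) and equals()/hashCode() must be consistent and null-safe. The field names here are hypothetical:

    import java.io.Serializable;
    import java.util.Objects;

    public class MyKey implements Serializable {          // hypothetical key class
        private final String id;
        private final int type;

        public MyKey(String id, int type) { this.id = id; this.type = type; }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof MyKey)) return false;     // also covers obj == null
            MyKey other = (MyKey) obj;
            return type == other.type && Objects.equals(id, other.id);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, type);                 // must stay consistent with equals()
        }
    }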