Multiple dir support : newApiHadoopFile

2015-06-26 Thread Bahubali Jain
Hi, How do we read files from multiple directories using newAPIHadoopFile()? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
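
One way to do this, sketched below in Java: in newer Spark releases the path argument of newAPIHadoopFile() accepts a comma-separated list of directories or globs (older releases only supported this for hadoopFile()), and unioning one RDD per directory works on any version. The paths, input format, and key/value classes are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("multiDir"));

    // Option 1: comma-separated paths (supported by newAPIHadoopFile in newer Spark versions).
    JavaPairRDD<LongWritable, Text> all = sc.newAPIHadoopFile(
        "hdfs:///data/dir1,hdfs:///data/dir2",   // hypothetical directories
        TextInputFormat.class, LongWritable.class, Text.class, new Configuration());

    // Option 2: one RDD per directory, then union them (works on any version).
    JavaPairRDD<LongWritable, Text> d1 = sc.newAPIHadoopFile(
        "hdfs:///data/dir1", TextInputFormat.class, LongWritable.class, Text.class, new Configuration());
    JavaPairRDD<LongWritable, Text> d2 = sc.newAPIHadoopFile(
        "hdfs:///data/dir2", TextInputFormat.class, LongWritable.class, Text.class, new Configuration());
    JavaPairRDD<LongWritable, Text> merged = d1.union(d2);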

JavaRDD and saveAsNewAPIHadoopFile()

2015-06-27 Thread Bahubali Jain
Hi, Why doesn't JavaRDD have saveAsNewAPIHadoopFile() associated with it? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
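
saveAsNewAPIHadoopFile() is defined on JavaPairRDD rather than JavaRDD, because the Hadoop OutputFormat model writes key/value pairs; a plain JavaRDD can be converted with mapToPair() first. A minimal sketch, with hypothetical paths:

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    JavaRDD<String> lines = sc.textFile("hdfs:///data/input");   // sc: an existing JavaSparkContext

    // Convert to key/value pairs; NullWritable keys make TextOutputFormat emit plain lines.
    JavaPairRDD<NullWritable, Text> pairs =
        lines.mapToPair(s -> new Tuple2<>(NullWritable.get(), new Text(s)));

    // If javac rejects the output format class on your Spark version, a raw (Class) cast
    // works around the generic bound (see the streaming thread further below).
    pairs.saveAsNewAPIHadoopFile("hdfs:///data/output",
        NullWritable.class, Text.class, TextOutputFormat.class);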

Large files with wholetextfile()

2016-07-12 Thread Bahubali Jain
Hi, We have a requirement wherein we need to process a set of XML files; each of the XML files contains several records (e.g., data of record 1.. data of record 2..). The expected output is … Since we needed the file name in the output as well, we chose wholeTextFiles(). We had to go against
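
For reference, wholeTextFiles() (on the SparkContext, not on an RDD) returns (fileName, fileContent) pairs, which is what makes the file name available downstream; the trade-off is that each file becomes a single record and must fit comfortably in executor memory. A small sketch with an assumed directory and record tag:

    import org.apache.spark.api.java.JavaPairRDD;

    // files: (full file path, entire file content) pairs; sc is an existing JavaSparkContext.
    JavaPairRDD<String, String> files = sc.wholeTextFiles("hdfs:///data/xml");   // hypothetical directory

    // Example downstream step: count <record> elements per file (the tag name is an assumption).
    JavaPairRDD<String, Integer> recordsPerFile =
        files.mapValues(content -> content.split("<record").length - 1);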

DAG related query

2015-08-20 Thread Bahubali Jain
Hi, What would the DAG look like for the code below? JavaRDD rdd1 = context.textFile(); JavaRDD rdd2 = rdd1.map(); rdd1 = rdd2.map(); Does this lead to any kind of cycle? Thanks, Baahu
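
No cycle is created: reassigning the Java variable only changes which RDD the name points to, while the lineage Spark records stays a straight chain (textFile -> map -> map). A quick way to confirm, sketched with placeholder transformations and a hypothetical path:

    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> rdd1 = sc.textFile("hdfs:///data/input");   // sc: an existing JavaSparkContext
    JavaRDD<String> rdd2 = rdd1.map(s -> s.toLowerCase());      // placeholder map functions
    rdd1 = rdd2.map(s -> s.trim());

    // Prints a linear lineage; RDDs are immutable, so a cycle cannot be formed this way.
    System.out.println(rdd1.toDebugString());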

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
Hi, I had run into a similar exception, "java.util.NoSuchElementException: key not found:". After further investigation I realized it is happening because the VectorIndexer is executed on the training dataset and not on the entire dataset. In the dataframe I have 5 categories, each of these has to go thru

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
s the model, then add the model to the pipeline where it will only transform. > val featureVectorIndexer = new VectorIndexer().setInputCol("feature").setOutputCol("indexedfeature").setMaxCategories(180).fit(completeDataset)
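
The same fix expressed in Java (the reply above is Scala): fit the VectorIndexer once on the complete dataset so it sees every category value, then reuse the fitted model to transform the train and test splits. The column names, maxCategories value, and dataset variables are assumptions carried over from the thread.

    import org.apache.spark.ml.feature.VectorIndexer;
    import org.apache.spark.ml.feature.VectorIndexerModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    static VectorIndexerModel fitIndexer(Dataset<Row> completeDataset) {
        // Fitting on the full data avoids "key not found" for categories absent from the training split.
        return new VectorIndexer()
            .setInputCol("feature")
            .setOutputCol("indexedfeature")
            .setMaxCategories(180)
            .fit(completeDataset);
    }

    // Usage: the fitted model only transforms, so unseen-category lookups no longer occur.
    // Dataset<Row> indexedTrain = fitIndexer(completeDataset).transform(trainingData);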

SPARK ML- Feature Selection Techniques

2016-09-05 Thread Bahubali Jain
Hi, Do we have any feature selection technique implementations (wrapper methods, embedded methods) available in Spark ML? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
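
For the record, what ships in Spark ML is mostly filter-style selection rather than wrapper or embedded methods; ChiSqSelector is the main built-in selector (alongside dimensionality reduction such as PCA). A minimal sketch, with assumed column names and an illustrative feature count:

    import org.apache.spark.ml.feature.ChiSqSelector;
    import org.apache.spark.ml.feature.ChiSqSelectorModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    static Dataset<Row> selectFeatures(Dataset<Row> df) {
        ChiSqSelectorModel selector = new ChiSqSelector()
            .setNumTopFeatures(50)                 // illustrative value
            .setFeaturesCol("features")
            .setLabelCol("label")
            .setOutputCol("selectedFeatures")
            .fit(df);
        return selector.transform(df);
    }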

Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
Hi, While saving a dataset using mydataset.write().csv("outputlocation") I am running into the exception "Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)". Does it mean that for saving a dataset the whole of

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
; issues.apache.org > Executing a SQL statement with a large number of partitions requires a high memory space for the driver even when there are no requests to collect data back to the driver. > -- > From: Bahubali Jain > Sent: Thursday,

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
If you are looking for a workaround, the JIRA ticket clearly shows you how to increase your driver heap. 1G in today's world really is kind of small. > Yong > -- > From: Bahubali Jain > Sent: Thursday, March 16, 2017 10:34 PM >
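
For reference, the knobs discussed in this thread, sketched with illustrative values: spark.driver.maxResultSize can be raised (or set to 0 for unlimited), while the driver heap itself must be set before the driver JVM starts, e.g. via spark-submit --driver-memory 4g, not from code.

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("datasetSave")
        .config("spark.driver.maxResultSize", "2g")   // default is 1g; 0 means unlimited
        .getOrCreate();

    // spark.driver.memory cannot be changed here once the driver JVM is running;
    // pass it on the command line instead: spark-submit --driver-memory 4g ...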

Compression during shuffle writes

2017-11-09 Thread Bahubali Jain
Hi, I have compressed data of size 500 GB. I am repartitioning this data since the underlying data is very skewed and is causing a lot of issues for the downstream jobs. During repartitioning the shuffle writes are not getting compressed, and due to this I am running into disk space issues. Below is the
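
The settings that govern this are shown below with their usual values: spark.shuffle.compress controls the map-side shuffle output files and spark.shuffle.spill.compress the data spilled during shuffles, and both default to true, so it is worth confirming neither was switched off and which codec is in use. The application name is illustrative.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
        .setAppName("repartitionJob")
        .set("spark.shuffle.compress", "true")          // compress map-side shuffle output (default true)
        .set("spark.shuffle.spill.compress", "true")    // compress data spilled to disk during shuffles (default true)
        .set("spark.io.compression.codec", "lz4");      // codec used for the above
    JavaSparkContext sc = new JavaSparkContext(conf);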

Dataframe size using RDDStorageInfo objects

2018-03-16 Thread Bahubali Jain
Hi, I am trying to figure out a way to find the size of persisted dataframes using sparkContext.getRDDStorageInfo(). The RDDStorageInfo object has information related to the number of bytes stored in memory and on disk. For example, I have 3 dataframes which I have cached: df1.cache() df2.cache() d
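
A sketch of reading those numbers back: getRDDStorageInfo() lives on the underlying SparkContext, reports only RDDs that are actually persisted and materialized, and the sizes are in bytes. The session setup is illustrative.

    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.storage.RDDInfo;

    SparkSession spark = SparkSession.builder().appName("storageInfo").getOrCreate();

    // ... df1.cache(); df2.cache(); df3.cache(); followed by an action so the caches fill ...

    for (RDDInfo info : spark.sparkContext().getRDDStorageInfo()) {
        System.out.println(info.name() + " -> memory: " + info.memSize()
            + " bytes, disk: " + info.diskSize() + " bytes");
    }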

Writing to HDFS from spark Streaming

2015-02-10 Thread Bahubali Jain
Hi, I am facing issues while writing data from a streaming RDD to HDFS. JavaPairDStream temp; ... ... temp.saveAsHadoopFiles("DailyCSV",".txt", String.class, String.class, TextOutputFormat.class); I see compilation issues as below... The method saveAsHadoopFiles(String, String, Class, Class, Cla

Re: Writing to HDFS from spark Streaming

2015-02-15 Thread Bahubali Jain
Feb 11, 2015 at 6:35 AM, Akhil Das wrote: > Did you try: temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, (Class) TextOutputFormat.class); > Thanks > Best Regards > O
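
Cleaned up, the suggestion from the reply looks like the sketch below; the raw (Class) cast sidesteps the generic bound on the outputFormatClass parameter that javac otherwise refuses for this overload (depending on the Spark version the cast may or may not be required).

    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    @SuppressWarnings({"unchecked", "rawtypes"})
    static void saveDaily(JavaPairDStream<String, String> temp) {
        temp.saveAsHadoopFiles("DailyCSV", ".txt",
            String.class, String.class,
            (Class) TextOutputFormat.class);   // raw cast as suggested in the reply above
    }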

Pseudo Spark Streaming ?

2015-04-05 Thread Bahubali Jain
Hi, I have a requirement in which I plan to use Spark Streaming. I am supposed to calculate the access count for certain webpages. I receive the webpage access information through log files. By access count I mean "how many times has the page been accessed till now". I have the log files for the past 2 yea
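
One common way to get a count "till now" is to bootstrap the totals once from the two years of historical logs as a plain batch job, and then let the streaming job keep updating them (for example as the initial state fed to updateStateByKey; see the stateful word-count sketch further below). The paths and the log-parsing helper here are assumptions.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    // Hypothetical: the page URL is the first whitespace-separated field of a log line.
    static String pageOf(String logLine) {
        return logLine.split(" ")[0];
    }

    static JavaPairRDD<String, Long> historicalCounts(JavaSparkContext sc) {
        JavaRDD<String> oldLogs = sc.textFile("hdfs:///logs/*");          // two years of log files
        JavaPairRDD<String, Long> ones = oldLogs.mapToPair(l -> new Tuple2<>(pageOf(l), 1L));
        return ones.reduceByKey((a, b) -> a + b);
    }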

Issue with Custom Key Class

2014-11-08 Thread Bahubali Jain
Hi, I have a custom key class. In this class equals() and hashCode() have been overridden. I have a JavaPairRDD which has this class as the key. When groupByKey() or reduceByKey() is called, a null object is being passed to the function equals(Object obj), and as a result the grouping is failing. Is t
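
For comparison, a key class along these lines is what groupByKey()/reduceByKey() expect: it must be Serializable (keys travel through the shuffle), and equals()/hashCode() should be null-safe and based only on the identifying fields. The class and field names are made up for illustration.

    import java.io.Serializable;
    import java.util.Objects;

    public class EventKey implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String userId;
        private final String pageId;

        public EventKey(String userId, String pageId) {
            this.userId = userId;
            this.pageId = pageId;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof EventKey)) return false;   // handles null without throwing
            EventKey other = (EventKey) obj;
            return Objects.equals(userId, other.userId) && Objects.equals(pageId, other.pageId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(userId, pageId);
        }
    }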

Help with Spark Streaming

2014-11-15 Thread Bahubali Jain
Hi, Trying to use Spark Streaming, but I am struggling with word count :( I want to consolidate the output of the word count (not on a per-window basis), so I am using updateStateByKey(), but for some reason this is not working. The function itself is not being invoked (I do not see the sysout output on con
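
A stateful word count along those lines, written against the Spark 2.x Java API (older releases use Guava's Optional and an Iterable-returning flatMap); the two things most often forgotten are the checkpoint directory that updateStateByKey requires and an output operation before start(), without which the update function is never invoked. Paths and the batch interval are illustrative.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.Optional;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    static void runningWordCount(SparkConf conf) throws Exception {
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));
        ssc.checkpoint("hdfs:///tmp/wordcount-checkpoint");               // required by updateStateByKey

        JavaDStream<String> lines = ssc.textFileStream("hdfs:///data/incoming");
        JavaDStream<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator());
        JavaPairDStream<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));

        // Merge each batch's values for a key into the running total held by Spark.
        Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateTotals =
            (newValues, state) -> {
                Integer sum = state.orElse(0);
                for (Integer v : newValues) {
                    sum += v;
                }
                return Optional.of(sum);
            };
        JavaPairDStream<String, Integer> totals = pairs.updateStateByKey(updateTotals);

        totals.print();            // an output operation is needed, otherwise nothing executes
        ssc.start();
        ssc.awaitTermination();
    }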

Re: Help with Spark Streaming

2014-11-16 Thread Bahubali Jain
Hi, Can anybody help me with this please? I haven't been able to find the problem :( Thanks. On Nov 15, 2014 4:48 PM, "Bahubali Jain" wrote: > Hi, > Trying to use Spark Streaming, but I am struggling with word count :( > I want to consolidate the output of the word count (not on a

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread Bahubali Jain
Hi, You can associate all the messages of a 3-minute interval with a unique key, then group by that key and finally add up. Thanks On Dec 1, 2014 9:02 PM, "pankaj" wrote: > Hi, > My incoming message has a timestamp as one field and I have to perform aggregation over a 3-minute time slice. > Message
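
A sketch of that approach in Java: derive the 3-minute bucket from each message's timestamp, use the bucket start as the key, and reduce. The assumption that the timestamp is epoch milliseconds in the first comma-separated field is purely illustrative.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    static JavaPairRDD<Long, Long> countPerThreeMinutes(JavaRDD<String> messages) {
        final long bucketMillis = 3 * 60 * 1000L;
        JavaPairRDD<Long, Long> ones = messages.mapToPair(msg -> {
            long ts = Long.parseLong(msg.split(",")[0]);            // hypothetical message layout
            long bucketStart = (ts / bucketMillis) * bucketMillis;  // align to the 3-minute boundary
            return new Tuple2<>(bucketStart, 1L);
        });
        return ones.reduceByKey((a, b) -> a + b);
    }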

textFileStream() issue?

2014-12-03 Thread Bahubali Jain
Hi, I am trying to use textFileStream("some_hdfs_location") to pick up new files from an HDFS location. I am seeing pretty strange behavior though. textFileStream() is not detecting new files when I "move" them from a location within HDFS to the location at which textFileStream() is checking for new file
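
For context, textFileStream() picks up files by their modification time within the current batch window, and an HDFS move/rename keeps the file's original modification time, so a file that existed before the stream started can be silently ignored after being moved in; files copied or freshly written into the directory get a new timestamp and are detected. A minimal setup for reference, with an illustrative path and batch interval:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    static void watchDirectory() throws Exception {
        SparkConf conf = new SparkConf().setAppName("fileWatcher");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.minutes(1));

        JavaDStream<String> newLines = ssc.textFileStream("hdfs:///data/landing");
        newLines.print();

        ssc.start();
        ssc.awaitTermination();
    }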