The list of tables is not large; the RDD is created on the table list to
parallelize the work of fetching tables in multiple mappers at the same time.
Since the time taken to fetch a single table is significant, the fetches
can't run sequentially.
The content fetched for each table by a map job is large, so one option is to
dump the content to HDFS using the filesystem API from inside the map
function, writing out every few rows as the table is fetched.
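Something along these lines is what I had in mind for that option (a rough
sketch only, assuming the Hadoop FileSystem API; the output path and the
dumpRows name are made up):

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// called from inside the map function: streams rows out as they arrive,
// so only a buffer's worth of the table is in memory at any time
void dumpRows(String tablename, Iterator<String> rows) throws Exception {
  FileSystem fs = FileSystem.get(new Configuration());
  BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
      fs.create(new Path("/dumps/" + tablename))));
  while (rows.hasNext()) {
    out.write(rows.next());
    out.newLine();
  }
  out.close();
}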
What I cannot do is keep the complete table in memory and then dump it to
HDFS, which a map function like the one below would require:

JavaRDD<String> tablecontent = tablelistrdd.flatMap(
    new FlatMapFunction<String, String>() {
      public Iterable<String> call(String tablename) {
        // ..make a JDBC connection, fetch the table data, populate a list and return it..
      }
    });
tablecontent.saveAsTextFile("hdfspath");
Here I wanted to create a custom RDD whose partitions would live in memory on
multiple executors, each holding part of a table's data, and I would have
called saveAsTextFile on that custom RDD directly to save it to HDFS.
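For concreteness, this is roughly the shape I was imagining for such an RDD
from Java (an untested sketch: the Scala-interop plumbing is just what the
RDD base class requires, and openRowIterator is a placeholder for the JDBC
cursor logic):

import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.Dependency;
import org.apache.spark.Partition;
import org.apache.spark.SparkContext;
import org.apache.spark.TaskContext;
import org.apache.spark.rdd.RDD;
import scala.collection.JavaConverters;
import scala.reflect.ClassTag$;

public class TableDumpRDD extends RDD<String> {

  // one partition per table name
  static class TablePartition implements Partition {
    final int idx;
    final String table;
    TablePartition(int idx, String table) { this.idx = idx; this.table = table; }
    @Override public int index() { return idx; }
  }

  private final List<String> tables;

  public TableDumpRDD(SparkContext sc, List<String> tables) {
    super(sc,
        JavaConverters.asScalaBufferConverter(
            Collections.<Dependency<?>>emptyList()).asScala().toSeq(),
        ClassTag$.MODULE$.apply(String.class));
    this.tables = tables;
  }

  @Override
  public Partition[] getPartitions() {
    Partition[] parts = new Partition[tables.size()];
    for (int i = 0; i < tables.size(); i++) {
      parts[i] = new TablePartition(i, tables.get(i));
    }
    return parts;
  }

  @Override
  public scala.collection.Iterator<String> compute(Partition split, TaskContext context) {
    // placeholder: open a JDBC cursor on the table and expose its rows
    // lazily, so the full table never sits in executor memory at once
    Iterator<String> rows = openRowIterator(((TablePartition) split).table);
    return JavaConverters.asScalaIteratorConverter(rows).asScala();
  }

  private Iterator<String> openRowIterator(String table) {
    throw new UnsupportedOperationException("JDBC cursor logic goes here");
  }
}

Then saving is just:

new TableDumpRDD(javasparkcontext.sc(), tables).saveAsTextFile("hdfspath");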
On Thu, Jul 2, 2015 at 12:59 AM, Feynman Liang <[email protected]>
wrote:
>
> On Wed, Jul 1, 2015 at 7:19 AM, Shushant Arora <[email protected]>
> wrote:
>
>> JavaRDD<String> rdd = javasparkcontext.parallelize(tables);
>
>
> You are already creating an RDD in Java here ;)
>
> However, it's not clear to me why you'd want to make this an RDD. Is the
> list of tables so large that it doesn't fit on a single machine? If not,
> you may be better off spinning up one Spark job for dumping each table in
> tables using a JDBC datasource
> <https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases>
> .
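>
> For reference, a minimal sketch of that route (assuming Spark 1.4's
> DataFrameReader; the connection URL, credentials, and output path below
> are placeholders):
>
> import java.util.Properties;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.SQLContext;
>
> SQLContext sqlContext = new SQLContext(javasparkcontext);
>
> Properties props = new Properties();
> props.setProperty("user", "...");       // placeholder credentials
> props.setProperty("password", "...");
>
> // one small job per table: read over JDBC, write straight to HDFS
> DataFrame df = sqlContext.read().jdbc(
>     "jdbc:sqlserver://host:1433;databaseName=dbname", "tablename", props);
> df.write().format("json").save("hdfs:///dumps/tablename");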
>
> On Wed, Jul 1, 2015 at 12:00 PM, Silvio Fiorito <
> [email protected]> wrote:
>
>> Sure, you can create custom RDDs. Haven’t done so in Java, but in
>> Scala absolutely.
>>
>> From: Shushant Arora
>> Date: Wednesday, July 1, 2015 at 1:44 PM
>> To: Silvio Fiorito
>> Cc: user
>> Subject: Re: custom RDD in java
>>
>> OK, will evaluate these options, but is it possible to create an RDD in
>> Java?
>>
>>
>> On Wed, Jul 1, 2015 at 8:29 PM, Silvio Fiorito <
>> [email protected]> wrote:
>>
>>> If all you’re doing is dumping tables from SQL Server to HDFS,
>>> have you looked at Sqoop?
>>>
>>> Otherwise, if you need to run this in Spark could you just use the
>>> existing JdbcRDD?
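>>>
>>> For example, something like this (a rough sketch only; the connection
>>> string, key range, and row-to-string mapping are assumptions to replace):
>>>
>>> import java.sql.Connection;
>>> import java.sql.DriverManager;
>>> import java.sql.ResultSet;
>>> import org.apache.spark.api.java.JavaRDD;
>>> import org.apache.spark.api.java.function.Function;
>>> import org.apache.spark.rdd.JdbcRDD;
>>>
>>> JavaRDD<String> rows = JdbcRDD.create(
>>>     javasparkcontext,
>>>     new JdbcRDD.ConnectionFactory() {
>>>       public Connection getConnection() throws Exception {
>>>         return DriverManager.getConnection(
>>>             "jdbc:sqlserver://host;databaseName=dbname");
>>>       }
>>>     },
>>>     // the query must contain two '?' placeholders for partition bounds
>>>     "SELECT * FROM tablename WHERE ? <= id AND id <= ?",
>>>     1L, 1000000L, 10,   // assumed key range, split into 10 partitions
>>>     new Function<ResultSet, String>() {
>>>       public String call(ResultSet rs) throws Exception {
>>>         return rs.getString(1);   // serialize each row however you need
>>>       }
>>>     });
>>> rows.saveAsTextFile("hdfs:///dumps/tablename");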
>>>
>>>
>>> From: Shushant Arora
>>> Date: Wednesday, July 1, 2015 at 10:19 AM
>>> To: user
>>> Subject: custom RDD in java
>>>
>>> Hi
>>>
>>> Is it possible to write a custom RDD in Java?
>>>
>>> The requirement is: I have a list of SQL Server tables that need to be
>>> dumped into HDFS.
>>>
>>> So I have a
>>> List<String> tables = Arrays.asList("dbname.tablename", "dbname.tablename2", ......);
>>>
>>> then
>>> JavaRDD<String> rdd = javasparkcontext.parallelize(tables);
>>>
>>> JavaRDD<String> tablecontent = rdd.map(new
>>> Function<String, Iterable<String>>(){ /* fetch table and return a populated iterable */ });
>>>
>>> tablecontent.saveAsTextFile("hdfs path");
>>>
>>>
>>> Inside rdd.map(new Function<String, ...>) I cannot keep the complete table
>>> content in memory, so I want to create my own RDD to handle it.
>>>
>>> Thanks
>>> Shushant
>>>