Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Hi, I'm trying to read a 60GB HDFS file using spark textFile("hdfs_file_path", minPartitions). How can I control the number of tasks by increasing the split size? With the default split size of 250 MB, several tasks are created, but I would like a specific number of tasks to be created while reading from HDFS …
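For reference, a minimal sketch of the call in question (the path and partition count are hypothetical); note that minPartitions is only a lower bound, so it can raise but not lower the task count:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Assuming an existing JavaSparkContext sc.
    // minPartitions is a minimum: Spark may create more partitions
    // than this, but never fewer, so it cannot merge small splits.
    JavaRDD<String> lines = sc.textFile("hdfs:///data/file_60gb.txt", 240);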

Re: Need help

2017-10-10 Thread Ilya Karpov
I suggest reading «Hadoop Application Architectures» (O'Reilly) by Mark Grover, Ted Malaska and others. There you can find answers to some of your questions. > On 10 Oct 2017, at 9:00, Mahender Sarangam wrote: > Hi, I'm new to Spark and big data; we are doing some POC and building …

Re: Does Spark 2.2.0 support Dataset<Seq<Map<String, String>>>?

2017-10-10 Thread kant kodali
I have also tried these, and none of them actually compile. dataset.map(new MapFunction<String, Seq<Map<String, String>>>() { @Override public Seq<Map<String, String>> call(String input) throws Exception { List<Map<String, String>> temp = new ArrayList<>(); temp.add(new HashMap<String, String>()); return JavaConverters.asScalaBufferConverter(temp).asScala() …
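A likely reason these variants fail to compile is that Dataset.map in the Java API requires an explicit Encoder argument. A minimal sketch, assuming ds is a Dataset<String> and the intended element type is Seq<Map<String, String>>, using a Kryo encoder since no built-in encoder covers that type:

    import java.util.*;
    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;
    import scala.collection.JavaConverters;
    import scala.collection.Seq;

    // Kryo encoder for Seq<Map<String, String>> (unchecked cast is unavoidable here).
    Encoder<Seq<Map<String, String>>> enc =
        Encoders.kryo((Class<Seq<Map<String, String>>>) (Class<?>) Seq.class);

    Dataset<Seq<Map<String, String>>> mapped = ds.map(
        (MapFunction<String, Seq<Map<String, String>>>) input -> {
          List<Map<String, String>> temp = new ArrayList<>();
          temp.add(new HashMap<String, String>());
          // Buffer returned by asScala() is a subtype of scala Seq.
          return JavaConverters.asScalaBufferConverter(temp).asScala();
        }, enc);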

best spark spatial lib?

2017-10-10 Thread Imran Rajjad
I need to have a location column inside my DataFrame so that I can do spatial queries and geometry operations. Are there any third-party packages that perform this kind of operation? I have seen a few, like GeoSpark and Magellan, but they don't support operations where spatial and logical operators can be combined …

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread 郭鹏飞
> On 4 Oct 2017, at 2:08 AM, Nicolas Paris wrote: > Hi > I wonder about the differences between accessing Hive tables in two different ways: with JDBC access, or with sparkContext. I would say that JDBC is better since it uses Hive, which is based on map-reduce / TEZ and therefore works on disk. Using Spark …

Unable to run Spark Jobs in yarn cluster mode

2017-10-10 Thread Debabrata Ghosh
Hi All, I am constantly hitting an error: "ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms" while running my Spark code in yarn cluster mode. Here is the command I am using: spark-submit --master yarn --deploy-mode cluster spark_code.py

Spark-submit on a sample program gives Syntax Error

2017-10-10 Thread shekar
Hi. My environment: Windows 10, Spark 1.6.1 built for Hadoop 2.6.0, Python 2.7, Java 1.8. Issue: I go to C:\Spark and run: bin\spark-submit --master local C:\Spark\examples\src\main\python\pi.py 10 which gives: File "<stdin>", line 1 bin\spark-submit --master local C:\Spark\examples\src\main\python\pi.py …
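A SyntaxError reported at File "<stdin>", line 1 usually means the spark-submit command was typed inside the interactive Python shell rather than at the Windows command prompt; assuming that is what happened here, running it from cmd.exe should work:

    REM From cmd.exe, not from inside the python interpreter:
    cd C:\Spark
    bin\spark-submit --master local examples\src\main\python\pi.py 10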

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Write your own input format/datasource, or split the file yourself beforehand (not recommended). > On 10 Oct 2017, at 09:14, Kanagha Kumar wrote: > …

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread ayan guha
That is not correct, IMHO. If I am not wrong, Spark will still load the data into executors, running some stats on the data itself to identify partitions. > On Tue, Oct 10, 2017 at 9:23 PM, 郭鹏飞 wrote: > > On 4 Oct 2017, at 2:08 AM, Nicolas Paris wrote: > …

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
I have not tested this, but you should be able to pass any map-reduce-like conf on to the underlying hadoop config. Essentially you should be able to control split behaviour just as you would in a map-reduce program, since Spark uses the same input format. > On Tue, Oct 10, 2017 at 10:21 PM, Jörn Franke wrote: > …
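Untested, as the poster says; a sketch of that idea, setting split sizes on the Hadoop configuration Spark hands to its input format (the 512 MB value is arbitrary):

    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
    // The mapreduce.* keys are the new-API equivalents of the
    // old mapred.* split settings; 536870912 bytes = 512 MB.
    jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.minsize", "536870912");
    jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.maxsize", "536870912");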

Re: best spark spatial lib?

2017-10-10 Thread Anastasios Zouzias
Hi, which spatial operations do you require exactly? Also, I don't follow what you mean by combining logical operators. I have created a library that wraps Lucene's spatial functionality here: https://github.com/zouzias/spark-lucenerdd/wiki/Spatial-search You could give the library a try; it …

Re: Unable to run Spark Jobs in yarn cluster mode

2017-10-10 Thread Vadim Semenov
Try increasing the `spark.yarn.am.waitTime` parameter; it is set to 100s by default, which might not be enough in certain cases. > On Tue, Oct 10, 2017 at 7:02 AM, Debabrata Ghosh wrote: > …
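For example (the 300s value is arbitrary; adjust it to your job's startup time):

    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.am.waitTime=300s \
      spark_code.py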

Re: EMR: Use extra mounted EBS volumes for spark.local.dir

2017-10-10 Thread Vadim Semenov
That's probably better directed to AWS support. > On Sun, Oct 8, 2017 at 9:54 PM, Tushar Sudake wrote: > Hello everyone, I'm using 'r4.8xlarge' instances on EMR for my Spark application. To each node, I'm attaching one 512 GB EBS volume. By logging in to the nodes I tried verifying t…

Re: best spark spatial lib?

2017-10-10 Thread Georg Heiler
What about something like GeoMesa? > On Tue, 10 Oct 2017 at 15:29, Anastasios Zouzias wrote: > …

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Thanks for the inputs!! I passed spark.mapred.max.split.size and spark.mapred.min.split.size set to the size I wanted to read, but it didn't take any effect. I also tried passing spark.dfs.block.size, with all the params set to the same value. JavaSparkContext.fromSparkContext(spark.sparkContext()).te…

Re: best spark spatial lib?

2017-10-10 Thread Silvio Fiorito
There are a number of packages for geospatial analysis, depending on the features you need. Here are a few I know of and/or have used:
Magellan: https://github.com/harsha2010/magellan
MrGeo: https://github.com/ngageoint/mrgeo
GeoMesa: http://www.geomesa.org/documentation/tutorials/spark.html
GeoSpark …

Re: best spark spatial lib?

2017-10-10 Thread Jim Hughes
Hi all, GeoMesa integrates with Spark SQL and allows for queries like:
select * from chicago where case_number = 1 and st_intersects(geom, st_makeBox2d(st_point(-77, 38), st_point(-76, 39)))
GeoMesa does this by calling package-protected Spark methods to implement geospatial user-defined typ…

RE: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-10 Thread JG Perrin
Something along the lines of: Dataset<Row> df = spark.read().json(jsonDf); ? From: kant kodali [mailto:kanth...@gmail.com] Sent: Saturday, October 07, 2017 2:31 AM To: user @spark Subject: How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0? I have a Dataset ds which …
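A minimal sketch of that approach, assuming ds is a Dataset<String> where each row holds one JSON document (the column names are hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Spark 2.2+ can read JSON directly from a Dataset<String>,
    // inferring the schema from the documents themselves.
    Dataset<Row> df = spark.read().json(ds);
    df.select("colA", "colB").show();   // hypothetical column names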

Re: best spark spatial lib?

2017-10-10 Thread Ram Sriharsha
Why can't you do this in Magellan? Can you post a sample query that you are trying to run that has spatial and logical operators combined? Maybe I am not understanding the issue properly. Ram > On Tue, Oct 10, 2017 at 2:21 AM, Imran Rajjad wrote: > …

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
Have you seen this: https://stackoverflow.com/questions/42796561/set-hadoop-configuration-values-on-spark-submit-command-line ? Please try it and let us know. > On Wed, Oct 11, 2017 at 2:53 AM, Kanagha Kumar wrote: > …

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Maybe you need to set the parameters for the mapreduce API and not the mapred API. I do not recall offhand how they differ, but the Hadoop web page should tell you ;-) > On 10 Oct 2017, at 17:53, Kanagha Kumar wrote: > …

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Thanks Ayan! Finally it worked!! Thanks a lot everyone for the inputs! Once I prefixed the params with "spark.hadoop.", I see the number of tasks getting reduced. I'm setting the following params: --conf spark.hadoop.dfs.block.size --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize …
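For reference, the working invocation presumably looks something like this (the 512 MB values and the application jar name are illustrative):

    spark-submit --master yarn \
      --conf spark.hadoop.dfs.block.size=536870912 \
      --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=536870912 \
      --conf spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=536870912 \
      my_app.jar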

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread weand
Is Hive from Spark via JDBC working for you? In case it does, I would be interested in your setup :-) We can't get this working. See the bug here, especially my last comment: https://issues.apache.org/jira/browse/SPARK-21063 Regards, Andreas

RE: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread Walia, Reema
I am able to connect to Spark via JDBC - tested with SQuirreL. I am referencing all the jars of the current Spark distribution under /usr/hdp/current/spark2-client/jars/*. Thanks, Reema -----Original Message----- From: weand [mailto:andreas.we...@gmail.com] Sent: Tuesday, October 10, 2017 5:14 PM …

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread Gourav Sengupta
Hi, I do not think that Spark will automatically determine the partitions; in fact, it does not. If a table has a few million records, it all goes through the driver. Of course, I have only tried JDBC connections to Aurora, Oracle and Postgres. Regards …
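For context, the usual way to avoid a single-connection read over JDBC is to supply explicit partitioning options; a sketch with hypothetical connection details and bounds:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Without partitionColumn/lowerBound/upperBound/numPartitions,
    // Spark reads the whole table through one connection.
    Dataset<Row> df = spark.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")  // hypothetical
        .option("dbtable", "big_table")                   // hypothetical
        .option("partitionColumn", "id")
        .option("lowerBound", "1")
        .option("upperBound", "1000000")
        .option("numPartitions", "8")
        .load();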

Re: Unable to run Spark Jobs in yarn cluster mode

2017-10-10 Thread mailfordebu
Thanks Vadim! Sent from my iPhone > On 10 Oct 2017, at 11:09 PM, Vadim Semenov wrote: > …

Re: best spark spatial lib?

2017-10-10 Thread Imran Rajjad
Thanks guys for the responses. Basically I am migrating an Oracle PL/SQL procedure to Spark (Java). In Oracle I have a table with a geometry column, on which I am able to do a "where col = 1 and geom.within(another_geom)". I am looking for a less complicated port into Spark for such queries. I will g…