Re: Log4j 1.2.17 spark CVE

2021-12-14 Thread Steve Loughran
log4j 1.2.17 is not vulnerable. There is an existing CVE there from a log aggregation servlet; Cloudera products ship a patched release with that servlet stripped...ASF projects are not allowed to do that. But: some recent Cloudera products do include log4j 2.x, so colleagues of mine are busy patc

Re: Missing module spark-hadoop-cloud in Maven central

2021-06-02 Thread Steve Loughran
off the record: Really irritates me too, as it forces me to do local builds even though I shouldn't have to. Sometimes I do that for other reasons, but still. Getting the cloud-storage module in was hard enough at the time that I wasn't going to push harder; I essentially stopped trying to get one

Re: java.lang.ClassNotFoundException for s3a committer

2020-07-21 Thread Steve Loughran
al.io.cloud.PathOutputCommitProtocol"); > hadoopConfiguration.set("spark.sql.parquet.output.committer.class", > "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"); > hadoopConfiguration.set("fs.s3a.connection.maximum", > Integer.toString(coreCount
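A minimal Scala sketch of the wiring quoted in the excerpt, assuming the spark-hadoop-cloud module is on the classpath; the bucket name and connection count are illustrative, not from the thread:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: route Spark SQL output through the cloud-committer binding classes
// named in the snippet above. Bucket and settings are hypothetical examples.
val spark = SparkSession.builder()
  .appName("s3a-committer-example")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .config("spark.hadoop.fs.s3a.connection.maximum", "64")
  .getOrCreate()

spark.range(100).write.parquet("s3a://example-bucket/output/")
```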

Re: java.lang.ClassNotFoundException for s3a committer

2020-06-29 Thread Steve Loughran
you are going to need hadoop-3.1 on your classpath, with hadoop-aws and the same aws-sdk it was built with (1.11.something). Mixing hadoop JARs is doomed. Using a different aws sdk JAR is a bit risky, though more recent upgrades have all been fairly low stress On Fri, 19 Jun 2020 at 05:39, murat mig
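As a hedged illustration of that version-matching advice, an sbt fragment might look like the sketch below. The version numbers are only examples; the correct SDK version must be read off the hadoop-aws POM of your exact Hadoop release:

```scala
// build.sbt sketch: hadoop-aws and the AWS SDK move in lock-step with the
// Hadoop version on the classpath. All versions here are illustrative.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.0" % Provided,
  "org.apache.hadoop" % "hadoop-aws" % "3.1.0",
  // the SDK that hadoop-aws 3.1.0 was built against; never bump it independently
  "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.271"
)
```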

Re: [Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-26 Thread Steve Loughran
On 25 Jun 2018, at 23:59, Farshid Zavareh <fhzava...@gmail.com> wrote: I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm considering two poss

Re: Palantir release under org.apache.spark?

2018-01-11 Thread Steve Loughran
On 9 Jan 2018, at 18:10, Sean Owen <so...@cloudera.com> wrote: Just to follow up -- those are actually in a Palantir repo, not Central. Deploying to Central would be uncourteous, but this approach is legitimate and how it has to work for vendors to release distros of Spark etc. ASF p

Re: Writing files to s3 without temporary directory

2017-12-01 Thread Steve Loughran
Hadoop trunk (i.e. 3.1 when it comes out) has the code to do 0-rename commits http://steveloughran.blogspot.co.uk/2017/11/subatomic.html if you want to play today, you can build Hadoop trunk & spark master, + a little glue JAR of mine to get Parquet to play properly http://steveloughran.blo

Re: Process large JSON file without causing OOM

2017-11-15 Thread Steve Loughran
On 14 Nov 2017, at 15:32, Alec Swan <alecs...@gmail.com> wrote: But I wonder if there is a way to stream/batch the content of JSON file in order to convert it to ORC piecemeal and avoid reading the whole JSON file in memory in the first place? That is what you'll need to do; you'd

Re: Anyone knows how to build and spark on jdk9?

2017-10-30 Thread Steve Loughran
On 27 Oct 2017, at 19:24, Sean Owen <so...@cloudera.com> wrote: Certainly, Scala 2.12 support precedes Java 9 support. A lot of the work is in place already, and the last issue is dealing with how Scala closures are now implemented quite differently with lambdas / invokedynamic. This affe

Re: Why does Spark need to set log levels

2017-10-12 Thread Steve Loughran
> On 9 Oct 2017, at 16:49, Daan Debie wrote: > > Hi all! > > I would love to use Spark with a somewhat more modern logging framework than > Log4j 1.2. I have Logback in mind, mostly because it integrates well with > central logging solutions such as the ELK stack. I've read up a bit on > get

Re: Quick one... AWS SDK version?

2017-10-04 Thread Steve Loughran
ed the SDK version to match the hadoop-aws JAR of the same version of Hadoop your JARs have. Similarly, if you were using spark-kinesis, it needs to be in sync there. From: Steve Loughran [mailto:ste...@hortonworks.com] Sent: Tuesday, October 03, 2017 2:20 PM To: JG Perrin <jper...@lumer

Re: Quick one... AWS SDK version?

2017-10-03 Thread Steve Loughran
On 3 Oct 2017, at 02:28, JG Perrin <jper...@lumeris.com> wrote: Hey Sparkians, What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the Hadoop 2.7.3 libs? You generally have to stick with the version which hadoop was built with I'm afraid...very brittle depende

Re: how do you deal with datetime in Spark?

2017-10-03 Thread Steve Loughran
On 3 Oct 2017, at 18:43, Adaryl Wakefield <adaryl.wakefi...@hotmail.com> wrote: I gave myself a project to start actually writing Spark programs. I'm using Scala and Spark 2.2.0. In my project, I had to do some grouping and filtering by dates. It was awful and took forever. I was trying

Re: More instances = slower Spark job

2017-10-01 Thread Steve Loughran
On 28 Sep 2017, at 15:27, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote: Can you kindly explain how Spark uses parallelism for bigger (say 1GB) text file? Does it use InputFormat to create multiple splits and creates 1 partition per split? Also, in case of S3 or NFS, how does

Re: More instances = slower Spark job

2017-10-01 Thread Steve Loughran
> On 28 Sep 2017, at 14:45, ayan guha wrote: > > Hi > > Can you kindly explain how Spark uses parallelism for bigger (say 1GB) text > file? Does it use InputFormat to create multiple splits and creates 1 > partition per split? Yes, input formats give you their splits, this is usually used to
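A sketch of how that split-to-partition mapping surfaces in the API, assuming an existing SparkContext `sc`; path and numbers are illustrative:

```scala
// minPartitions is a hint handed down to the InputFormat, which computes the
// splits; each split then becomes one partition of the RDD.
val lines = sc.textFile("hdfs://namenode:8020/data/big-1gb.txt", 16)
println(s"partitions = ${lines.getNumPartitions}")
// A gzipped file is not splittable, so the same call would yield 1 partition.
```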

Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.com> wrote: You will collect in the driver (often the master) and it will save the data, so for saving, you will not have to set up HDFS. no, it doesn't work quite like that. 1. workers generate their data and save somewhere 2. on "ta

Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
On 29 Sep 2017, at 15:59, Alexander Czech <alexander.cz...@googlemail.com> wrote: Yes I have identified the rename as the problem, that is why I think the extra bandwidth of the larger instances might not help. Also there is a consistency issue with S3 because of how the rename work

Re: More instances = slower Spark job

2017-09-28 Thread Steve Loughran
On 28 Sep 2017, at 09:41, Jeroen Miller <bluedasya...@gmail.com> wrote: Hello, I am experiencing a disappointing performance issue with my Spark jobs as I scale up the number of instances. The task is trivial: I am loading large (compressed) text files from S3, filtering out lines that

Re: CSV write to S3 failing silently with partial completion

2017-09-08 Thread Steve Loughran
On 7 Sep 2017, at 18:36, Mcclintic, Abbi <ab...@amazon.com> wrote: Thanks all – couple notes below. Generally all our partitions are of equal size (ie on a normal day in this particular case I see 10 equally sized partitions of 2.8 GB). We see the problem with repartitioning and withou

Re: How to authenticate to ADLS from within spark job on the fly

2017-08-19 Thread Steve Loughran
On 19 Aug 2017, at 02:42, Imtiaz Ahmed <emtiazah...@gmail.com> wrote: Hi All, I am building a spark library which developers will use when writing their spark jobs to get access to data on Azure Data Lake. But the authentication will depend on the dataset they ask for. I need to call

Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-11 Thread Steve Loughran
On 10 Aug 2017, at 09:51, Hemanth Gudela <hemanth.gud...@qvantel.com> wrote: Yeah, installing HDFS in our environment is unfortunately going to take lot of time (approvals/planning etc). I will have to live with local FS for now. The other option I had already tried is collect() and send

Re: SPARK Issue in Standalone cluster

2017-08-04 Thread Steve Loughran
> On 3 Aug 2017, at 19:59, Marco Mistroni wrote: > > Hello > my 2 cents here, hope it helps > If you want to just to play around with Spark, I'd leave Hadoop out, it's an > unnecessary dependency that you don't need for just running a python script > Instead do the following: > - go to the roo

Re: SPARK Issue in Standalone cluster

2017-08-03 Thread Steve Loughran
On 2 Aug 2017, at 20:05, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: Hi Steve, I have written a sincere note of apology to everyone in a separate email. I sincerely request your kind forgiveness beforehand if anything does sound impolite in my emails, in advance. Let me first

Re: PySpark Streaming S3 checkpointing

2017-08-02 Thread Steve Loughran
On 2 Aug 2017, at 10:34, Riccardo Ferrari <ferra...@gmail.com> wrote: Hi list! I am working on a pyspark streaming job (ver 2.2.0) and I need to enable checkpointing. At high level my python script goes like this: class StreamingJob(): def __init__(..): ... sparkContext._jsc.hadoop
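The thread itself is PySpark, but the same pattern of pushing credentials into the Hadoop configuration before checkpointing looks like this in Scala; the key names are real s3a options, while the bucket and the existing `spark`/`ssc` handles are assumptions:

```scala
// Sketch, assuming an existing SparkSession `spark` and StreamingContext `ssc`.
// Prefer instance roles or credential providers over raw keys in production.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
ssc.checkpoint("s3a://example-bucket/streaming-checkpoints")
```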

Re: SPARK Issue in Standalone cluster

2017-08-02 Thread Steve Loughran
On 2 Aug 2017, at 14:25, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: Hi, I am definitely sure that at this point of time everyone who has kindly cared to respond to my query do need to go and check this link https://spark.apache.org/docs/2.2.0/spark-standalone.html#spark-standal

Re: Spark, S3A, and 503 SlowDown / rate limit issues

2017-07-12 Thread Steve Loughran
On 10 Jul 2017, at 21:57, Everett Anderson <ever...@nuna.com> wrote: Hey, Thanks for the responses, guys! On Thu, Jul 6, 2017 at 7:08 AM, Steve Loughran <ste...@hortonworks.com> wrote: On 5 Jul 2017, at 14:40, Vadim Semenov <vadim.seme...@datadoghq.com>

Re: Using Spark as a simulator

2017-07-07 Thread Steve Loughran
On 7 Jul 2017, at 08:37, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote: I only want to simulate a very huge "network" with even millions of parallel time-synchronized actors (state machines). There is also communication between actors via some (key-value pairs) database. I also want th

Re: Spark, S3A, and 503 SlowDown / rate limit issues

2017-07-06 Thread Steve Loughran
On 5 Jul 2017, at 14:40, Vadim Semenov <vadim.seme...@datadoghq.com> wrote: Are you sure that you use S3A? Because EMR says that they do not support S3A https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/ > Amazon EMR does not currently support use of the Apache Ha

Re: Spark querying parquet data partitioned in S3

2017-07-05 Thread Steve Loughran
> On 29 Jun 2017, at 17:44, fran wrote: > > We have got data stored in S3 partitioned by several columns. Let's say > following this hierarchy: > s3://bucket/data/column1=X/column2=Y/parquet-files > > We run a Spark job in an EMR cluster (1 master, 3 slaves) and realised the > following: > > A)

Re: Question on Spark code

2017-06-26 Thread Steve Loughran
On 25 Jun 2017, at 20:57, kant kodali <kanth...@gmail.com> wrote: impressive! I need to learn more about scala. What I mean by stripping away the conditional check in Java is this: static final boolean isLogInfoEnabled = false; public void logMessage(String message) { if(isLogInfoEnabled)
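The Scala idiom at stake here: a by-name parameter defers building the log message, so no explicit isLogInfoEnabled guard is needed at the call site. A small self-contained sketch (the Logger class is hypothetical, for illustration only):

```scala
// `message` is by-name (=> String): the string is only constructed if the
// logger is enabled, which is what the Java guard was doing by hand.
class Logger(enabled: Boolean) {
  def info(message: => String): Unit = if (enabled) println(s"INFO: $message")
}

val log = new Logger(enabled = false)
log.info(s"expensive: ${(1 to 1000000).map(_.toLong).sum}") // never evaluated
```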

Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Steve Loughran
On 23 Jun 2017, at 10:22, Saisai Shao <sai.sai.s...@gmail.com> wrote: Spark running with the standalone cluster manager currently doesn't support accessing a secured Hadoop cluster. Basically the problem is that standalone mode Spark doesn't have the facility to distribute delegation tokens. Curren

Re: Using YARN w/o HDFS

2017-06-23 Thread Steve Loughran
you'll need a filesystem with * consistency * accessibility everywhere * supports a binding through one of the hadoop fs connectors NFS-style distributed filesystems work with file:// ; things like glusterfs need their own connectors. you can use azure's wasb:// as a drop in replacement for HDF

Re: SparkSQL not able to read a empty table location

2017-05-20 Thread Steve Loughran
On 20 May 2017, at 01:44, Bajpai, Amit X. -ND <n...@disney.com> wrote: Hi, I have a hive external table with the S3 location having no files (but the S3 location directory does exist). When I am trying to use Spark SQL to count the number of records in the table it is throwing error s

Re: Spark <--> S3 flakiness

2017-05-18 Thread Steve Loughran
ure=youtu.be Still doing some reading and will start testing in the next day or so. Thanks! Gary On 17 May 2017 at 03:19, Steve Loughran <ste...@hortonworks.com> wrote: On 17 May 2017, at 06:00, lucas.g...@gmail.com wrote: Steve, thanks for t

Re: s3 bucket access/read file

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 00:10, jazzed <crackshotm...@gmail.com> wrote: How did you solve the problem with V4? Which v4 problem? Authentication? You need to declare the explicit s3a endpoint via fs.s3a.endpoint, otherwise you get a generic "bad auth" message which is not a good place to st
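A sketch of that endpoint setting for a V4-auth-only region; the region endpoint and bucket are illustrative, and an existing SparkSession `spark` is assumed:

```scala
// Point s3a at the region's own endpoint so V4 signing is used.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
val df = spark.read.text("s3a://example-frankfurt-bucket/data.txt")
```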

Re: Parquet file amazon s3a timeout

2017-05-17 Thread Steve Loughran
On 17 May 2017, at 11:13, Karin Valisova <ka...@datapine.com> wrote: Hello! I'm working with some parquet files saved on amazon service and loading them to dataframe with Dataset df = spark.read() .parquet(parketFileLocation); however, after some time I get the "Timeout waiting for con
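"Timeout waiting for connection from pool" usually means the s3a HTTP connection pool is exhausted. One commonly suggested mitigation (an assumption here, not necessarily the fix given later in this thread) is raising fs.s3a.connection.maximum:

```scala
// Sketch: enlarge the s3a connection pool; 100 is an arbitrary example value,
// and `spark` is an assumed existing SparkSession.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.connection.maximum", "100")
val df = spark.read.parquet("s3a://example-bucket/parquet-files/")
```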

Re: Spark <--> S3 flakiness

2017-05-17 Thread Steve Loughran
rformance/ https://www.cloudera.com/documentation/enterprise/5-9-x/topics/spark_s3.html On 16 May 2017 at 10:10, Steve Loughran <ste...@hortonworks.com> wrote: On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: Hi users, we have a bunch of pyspark jobs t

Re: Spark <--> S3 flakiness

2017-05-16 Thread Steve Loughran
On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote: Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files. Please don't, not without a committer specially written to work against S3 in the

Re: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2017-05-16 Thread Steve Loughran
On 10 May 2017, at 13:40, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote: Hi all, When running spark I get the following warning: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Now I

Re: parquet optimal file structure - flat vs nested

2017-05-03 Thread Steve Loughran
> On 30 Apr 2017, at 09:19, Zeming Yu wrote: > > Hi, > > We're building a parquet based data lake. I was under the impression that > flat files are more efficient than deeply nested files (say 3 or 4 levels > down). Is that correct? > > Thanks, > Zeming Where's the data going to live: HDFS

Re: removing columns from file

2017-05-01 Thread Steve Loughran
On 28 Apr 2017, at 16:10, Anubhav Agarwal <anubha...@gmail.com> wrote: Are you using Spark's textFiles method? If so, go through this blog :- http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 old/dated blog post. If you get the Hadoop 2.8 binaries on your clas

Re: Questions related to writing data to S3

2017-04-24 Thread Steve Loughran
On 23 Apr 2017, at 19:49, Richard Hanson <rhan...@mailbox.org> wrote: I have a streaming job which writes data to S3. I know there are saveAs functions helping write data to S3. But it bundles all elements then writes out to S3. use Hadoop 2.8.x binaries and the fast output stream

Re: splitting a huge file

2017-04-24 Thread Steve Loughran
> On 21 Apr 2017, at 19:36, Paul Tremblay wrote: > > We are tasked with loading a big file (possibly 2TB) into a data warehouse. > In order to do this efficiently, we need to split the file into smaller files. > > I don't believe there is a way to do this with Spark, because in order for > Sp

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
CD is still applicable to building data products and data warehousing. I concur Regards, Gourav -Steve On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran <ste...@hortonworks.com> wrote: On 11 Apr 2017, at 20:46, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran
On 11 Apr 2017, at 20:46, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: And once again JAVA programmers are trying to solve a data analytics and data warehousing problem using programming paradigms. It is genuinely a pain to see this happen. While I'm happy to be faulted for treati

Re: optimising storage and ec2 instances

2017-04-11 Thread Steve Loughran
> On 11 Apr 2017, at 11:07, Zeming Yu wrote: > > Hi all, > > I'm a beginner with spark, and I'm wondering if someone could provide > guidance on the following 2 questions I have. > > Background: I have a data set growing by 6 TB p.a. I plan to use spark to > read in all the data, manipulate

Re: unit testing in spark

2017-04-11 Thread Steve Loughran
(sorry, sent an empty reply by accident) Unit testing is one of the easiest ways to isolate problems in an internal class, things you can get wrong. But: time spent writing unit tests is time *not* spent writing integration tests. Which biases me towards the integration. What I do find is go

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Steve Loughran
ion dataset". -Steve On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta mailto:gourav.sengu...@gmail.com>> wrote: Hi Steve, Why would you ever do that? You are suggesting the use of a CI tool as a workflow and orchestration engine. Regards, Gourav Sengupta On Fri, Apr 7, 2017 at 4:07 PM, Ste

Re: Does Spark uses its own HDFS client?

2017-04-07 Thread Steve Loughran
On 7 Apr 2017, at 15:32, Alvaro Brandon <alvarobran...@gmail.com> wrote: I was going through SparkContext.textFile() and I was wondering at what point Spark communicates with HDFS. Since when you download Spark binaries you also specify the Hadoop version you will use, I'm gues

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Steve Loughran
If you have Jenkins set up for some CI workflow, that can do scheduled builds and tests. Works well if you can do some build test before even submitting it to a remote cluster On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote: Hi Shyla You have multiple options really

Re: httpclient conflict in spark

2017-03-30 Thread Steve Loughran
On 29 Mar 2017, at 14:42, Arvind Kandaswamy <aravind.ka...@gmail.com> wrote: Hello, I am getting the following error. I get this error when trying to use AWS S3. This appears to be a conflict with httpclient. AWS S3 comes with httpclient-4.5.2.jar. I am not sure how to force spark to us

Re: Spark and continuous integration

2017-03-14 Thread Steve Loughran
On 13 Mar 2017, at 13:24, Sam Elamin <hussam.ela...@gmail.com> wrote: Hi Jorn Thanks for the prompt reply, really we have 2 main concerns with CD, ensuring tests pass and linting on the code. I'd add "providing diagnostics when tests fail", which is a combination of: tests providing

Re: Wrong runtime type when using newAPIHadoopFile in Java

2017-03-06 Thread Steve Loughran
On 6 Mar 2017, at 12:30, Nira Amit <amitn...@gmail.com> wrote: And it's very difficult if it's doing unexpected things. All serialisations do unexpected things. Nobody understands them. Sorry

Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Steve Loughran
try giving a resource of a file in the JAR, e.g. add a file "log4j-debugging.properties" into the jar, and give a config option of -Dlog4j.configuration=/log4j-debugging.properties (maybe also try without the "/") On 26 Feb 2017, at 16:31, Prithish <prith...@gmail.com> wrote: Hoping s

Re: Get S3 Parquet File

2017-02-25 Thread Steve Loughran
On 24 Feb 2017, at 07:47, Femi Anthony <femib...@gmail.com> wrote: Have you tried reading using s3n which is a slightly older protocol? I'm not sure how compatible s3a is with older versions of Spark. I would absolutely not use s3n with a 1.2 GB file. There is a WONTFIX JIRA on how it

Re: Will Spark ever run the same task at the same time

2017-02-20 Thread Steve Loughran
> On 16 Feb 2017, at 18:34, Ji Yan wrote: > > Dear spark users, > > Is there any mechanism in Spark that does not guarantee the idempotent > nature? For example, for stragglers, the framework might start another task > assuming the straggler is slow while the straggler is still running. This

Re: fault tolerant dataframe write with overwrite

2017-02-14 Thread Steve Loughran
On 14 Feb 2017, at 11:12, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote: I know how to get the filesystem, the problem is that this means using Hadoop directly so if in the future we change to something else (e.g. S3) I would need to rewrite the code. well, no, because the s3 and hf

Re: How to measure IO time in Spark over S3

2017-02-13 Thread Steve Loughran
Hadoop 2.8's s3a does a lot more metrics here, most of which you can find on HDP-2.5 if you can grab those JARs. Everything comes out as hadoop JMX metrics, also readable & aggregatable through a call to FileSystem.getStorageStatistics Measuring IO time isn't something picked up, because it's a
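A sketch of reading those statistics through the API named above (Hadoop 2.8+); the bucket is hypothetical and an existing SparkSession `spark` is assumed:

```scala
import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Grab the live s3a FileSystem instance and dump its counters.
val fs = FileSystem.get(new URI("s3a://example-bucket/"),
  spark.sparkContext.hadoopConfiguration)
val it = fs.getStorageStatistics.getLongStatistics
while (it.hasNext) {
  val stat = it.next()
  println(s"${stat.getName} = ${stat.getValue}")
}
```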

Re: using an alternative slf4j implementation

2017-02-06 Thread Steve Loughran
> On 6 Feb 2017, at 11:06, Mendelson, Assaf wrote: > > Found some questions (without answers) and I found some jira > (https://issues.apache.org/jira/browse/SPARK-4147 and > https://issues.apache.org/jira/browse/SPARK-14703), however they do not solve > the issue. > Nominally, a library shoul

Re: spark 2.02 error when writing to s3

2017-01-28 Thread Steve Loughran
On 27 Jan 2017, at 23:17, VND Tremblay, Paul <tremblay.p...@bcg.com> wrote: Not sure what you mean by "a consistency layer on top." Any explanation would be greatly appreciated! Paul netflix's s3mper: https://github.com/Netflix/s3mper EMR consistency: http://docs.aws.amazon.com/emr

Re: spark 2.02 error when writing to s3

2017-01-27 Thread Steve Loughran
tics Specialist THE BOSTON CONSULTING GROUP From: Neil Jonkers [mailto:neilod...@gmail.com] Sent: Friday, January 20, 2017 11:39 AM To: Steve Loughran; VND Tremblay, Paul Cc: Takeshi Yamam

Re: spark 2.02 error when writing to s3

2017-01-20 Thread Steve Loughran
AWS S3 is eventually consistent: even after something is deleted, a LIST/GET call may show it. You may be seeing that effect; even after the DELETE has got rid of the files, a listing sees something there, And I suspect the time it takes for the listing to "go away" will depend on the total numb

Re: "Unable to load native-hadoop library for your platform" while running Spark jobs

2017-01-20 Thread Steve Loughran
On 19 Jan 2017, at 10:59, Sean Owen <so...@cloudera.com> wrote: It's a message from Hadoop libs, not Spark. It can be safely ignored. It's just saying you haven't installed the additional (non-Apache-licensed) native libs that can accelerate some operations. This is something you can ea

Re: Anyone has any experience using spark in the banking industry?

2017-01-20 Thread Steve Loughran
> On 18 Jan 2017, at 21:50, kant kodali wrote: > > Anyone has any experience using spark in the banking industry? I have couple > of questions. > 2. How can I make spark cluster highly available across multi datacenter? Any > pointers? That's not, AFAIK, been a design goal. The communicatio

Re: Spark GraphFrame ConnectedComponents

2017-01-06 Thread Steve Loughran
On 5 Jan 2017, at 21:10, Ankur Srivastava <ankur.srivast...@gmail.com> wrote: Yes I did try it out and it chooses the local file system as my checkpoint location starts with s3n:// I am not sure how I can make it load the S3FileSystem. set fs.default.name to s3n://whatever, or, in spar
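A sketch of the two options mentioned: either make the object store the default filesystem, or set the checkpoint directory with a fully qualified URI. The bucket is hypothetical, s3a stands in for s3n where available, and `spark` is an assumed existing SparkSession:

```scala
// Option 1: make the object store the default FS for the job.
spark.sparkContext.hadoopConfiguration
  .set("fs.defaultFS", "s3a://example-bucket")

// Option 2: give the checkpoint directory a fully qualified URI.
spark.sparkContext.setCheckpointDir("s3a://example-bucket/checkpoints")
```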

Re: Spark Read from Google store and save in AWS s3

2017-01-06 Thread Steve Loughran
On 5 Jan 2017, at 20:07, Manohar Reddy <manohar.re...@happiestminds.com> wrote: Hi Steve, Thanks for the reply and below is follow-up help needed from you. Do you mean we can set up two native file systems on a single SparkContext, so then based on URL prefix (gs://bucket/path and dest s3a

Re: Spark Read from Google store and save in AWS s3

2017-01-05 Thread Steve Loughran
On 5 Jan 2017, at 09:58, Manohar753 <manohar.re...@happiestminds.com> wrote: Hi All, Using spark, is interoperability/communication between two clouds (Google, AWS) possible? In my use case I need to take Google store as input to spark and do some processing and finally needs to store in S

Re: How to load a big csv to dataframe in Spark 1.6

2017-01-03 Thread Steve Loughran
On 31 Dec 2016, at 16:09, Raymond Xie <xie3208...@gmail.com> wrote: Hello Felix, I followed the instruction and ran the command: > $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 and I received the following error message: java.lang.RuntimeException: java.net

Re: Question about Spark and filesystems

2017-01-03 Thread Steve Loughran
On 18 Dec 2016, at 19:50, joa...@verona.se wrote: Since each Spark worker node needs to access the same files, we have tried using Hdfs. This worked, but there were some oddities making me a bit uneasy. For dependency hell reasons I compiled a modified Spark, and this ver

Re: Gradle dependency problem with spark

2016-12-16 Thread Steve Loughran
FWIW, although the underlying Hadoop declared guava dependency is pretty low, everything in org.apache.hadoop is set up to run against later versions. It just sticks with the old one to avoid breaking anything downstream which does expect a low version number. See HADOOP-10101 for the ongoing pa

Re: Handling Exception or Control in spark dataframe write()

2016-12-16 Thread Steve Loughran
> On 14 Dec 2016, at 18:10, bhayat wrote: > > Hello, > > I am writing my RDD into parquet format but what I understand is that the write() > method is still experimental and I do not know how I will deal with possible > exceptions. > > For example: > > schemaXXX.write().mode(saveMode).parquet(parque
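Whatever the committer does internally, failures surface as exceptions from the write() call, so the caller can wrap it. A minimal sketch with an illustrative path and an assumed DataFrame `df`:

```scala
import scala.util.{Failure, Success, Try}

// Sketch: treat the write as an all-or-nothing action and react to failure.
Try(df.write.mode("overwrite").parquet("s3a://example-bucket/out/")) match {
  case Success(_) => println("write committed")
  case Failure(ex) => // retry, alert, or clean up partial output here
    System.err.println(s"write failed: ${ex.getMessage}")
}
```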

Re: Few questions on reliability of accumulators value.

2016-12-15 Thread Steve Loughran
On 12 Dec 2016, at 19:57, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote: Accumulators are generally unreliable and should not be used. The answer to (2) and (4) is yes. The answer to (3) is both. Here's a more in-depth explanation: http://imranrashid.com/posts/Spark-Accumulato

Re: WARN util.NativeCodeLoader

2016-12-12 Thread Steve Loughran
> On 8 Dec 2016, at 06:38, baipeng wrote: > > Hi ALL > > I'm new to Spark. When I execute spark-shell, the first line is as > follows WARN util.NativeCodeLoader: Unable to load native-hadoop library for your > platform... using builtin-java classes where applicable. > Can someone tell

Re: Access multiple cluster

2016-12-05 Thread Steve Loughran
if the remote filesystem is visible from the other, then a different HDFS value, e.g. hdfs://analytics:8000/historical/ can be used for reads & writes, even if your defaultFS (the one where you get max performance) is, say, hdfs://processing:8000/ -performance will be slower, in both directions
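A sketch of that cross-cluster pattern: fully qualified URIs select the filesystem per path, independent of fs.defaultFS. Host names and paths are illustrative, and `spark` is an assumed existing SparkSession:

```scala
// Read from the remote cluster, write back to the local (default) one.
val historical = spark.read.parquet("hdfs://analytics:8000/historical/events")
historical.filter("year = 2016")
  .write.parquet("hdfs://processing:8000/staging/events-2016")
```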

Re: What benefits do we really get out of colocation?

2016-12-03 Thread Steve Loughran
On 3 Dec 2016, at 09:16, Manish Malhotra <manish.malhotra.w...@gmail.com> wrote: thanks for sharing numbers as well! Nowadays even the network can have very high throughput, and might outperform the disk, but as Sean mentioned data on network will have other dependencies like network

Re: Spark ignoring partition names without equals (=) separator

2016-11-29 Thread Steve Loughran
On 29 Nov 2016, at 05:19, Prasanna Santhanam <t...@apache.org> wrote: On Mon, Nov 28, 2016 at 4:39 PM, Steve Loughran <ste...@hortonworks.com> wrote: irrespective of naming, know that deep directory trees are performance killers when listing files on s3 and setting

Re: Spark ignoring partition names without equals (=) separator

2016-11-28 Thread Steve Loughran
irrespective of naming, know that deep directory trees are performance killers when listing files on s3 and setting up jobs. You might actually be better off having them in the same directory and using a pattern like 2016-03-11-* as the pattern to find files. On 28 Nov 2016, at 04:18, Prasann
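A sketch of that flat layout: one directory, date-stamped file names, and a glob instead of a deep directory walk. The bucket and format are illustrative:

```scala
// One glob over a flat directory beats walking year=/month=/day= trees on S3.
val day = spark.read.json("s3a://example-bucket/events/2016-03-11-*")
```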

Re: Third party library

2016-11-27 Thread Steve Loughran
On 27 Nov 2016, at 02:55, kant kodali <kanth...@gmail.com> wrote: I would say instead of LD_LIBRARY_PATH you might want to use java.library.path in the following way: java -Djava.library.path=/path/to/my/library or pass java.library.path along with spark-submit This is only going to s

Re: convert local tsv file to orc file on distributed cloud storage (openstack).

2016-11-25 Thread Steve Loughran
t isolate the upload problem from the conversion On 24 Nov 2016, at 18:44, vr spark <vrspark...@gmail.com> wrote: Hi, The source file I have is on a local machine and it's pretty huge, like 150 GB. How to go about it? On Sun, Nov 20, 2016 at 8:52 AM, Steve Loughran <ste...@h

Re: How to write a custom file system?

2016-11-22 Thread Steve Loughran
On 21 Nov 2016, at 17:26, Samy Dindane <s...@dindane.com> wrote: Hi, I'd like to extend the file:// file system and add some custom logic to the API that lists files. I think I need to extend FileSystem or LocalFileSystem from org.apache.hadoop.fs, but I am not sure how to go about it

Re: convert local tsv file to orc file on distributed cloud storage (openstack).

2016-11-20 Thread Steve Loughran
On 19 Nov 2016, at 17:21, vr spark <vrspark...@gmail.com> wrote: Hi, I am looking for scala or python code samples to convert a local tsv file to an orc file and store on distributed cloud storage (openstack). So, need these 3 samples. Please suggest. 1. read tsv 2. convert to orc 3. store on

Re: Run spark with hadoop snapshot

2016-11-19 Thread Steve Loughran
I'd recommend you build a full spark release with the new hadoop version; you should have built that locally earlier the same day (so that ivy/maven pick up the snapshot) dev/make-distribution.sh -Pyarn,hadoop-2.7,hive -Dhadoop.version=2.9.0-SNAPSHOT; > On 18 Nov 2016, at 19:31, lminer wrot

Re: Long-running job OOMs driver process

2016-11-18 Thread Steve Loughran
On 18 Nov 2016, at 14:31, Keith Bourgoin <ke...@parsely.com> wrote: We thread the file processing to amortize the cost of things like getting files from S3. Define cost here: actual $ amount, or merely time to read the data? If it's read times, you should really be trying the new stuff

Re: Any with S3 experience with Spark? Having ListBucket issues

2016-11-18 Thread Steve Loughran
On 16 Nov 2016, at 22:34, Edden Burrow <eddenbur...@gmail.com> wrote: Anyone dealing with a lot of files with spark? We're trying s3a with 2.0.1 because we're seeing intermittent errors in S3 where jobs fail and saveAsTextFile fails. Using pyspark. How many files? Thousands? Millions

Re: Delegation Token renewal in yarn-cluster

2016-11-04 Thread Steve Loughran
On 4 Nov 2016, at 01:37, Marcelo Vanzin <van...@cloudera.com> wrote: On Thu, Nov 3, 2016 at 3:47 PM, Zsolt Tóth <toth.zsolt@gmail.com> wrote: What is the purpose of the delegation token renewal (the one that is done automatically by Hadoop libraries, after 1 day by default)? I

Re: sanboxing spark executors

2016-11-04 Thread Steve Loughran
> On 4 Nov 2016, at 06:41, blazespinnaker wrote: > > Is there a good method / discussion / documentation on how to sandbox a spark > executor? Assume the code is untrusted and you don't want it to be able to > make unvalidated network connections or do unvalidated alluxio/hdfs/file use Kerb

Re: Spark 2.0 with Hadoop 3.0?

2016-10-29 Thread Steve Loughran
On 27 Oct 2016, at 23:04, adam kramer <ada...@gmail.com> wrote: Is the version of Spark built for Hadoop 2.7 and later only for 2.x releases? Is there any reason why Hadoop 3.0 is a non-starter for use with Spark 2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which would

Re: Spark security

2016-10-27 Thread Steve Loughran
On 13 Oct 2016, at 14:40, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote: Hi, We have a spark cluster and we wanted to add some security for it. I was looking at the documentation (in http://spark.apache.org/docs/latest/security.html) and had some questions. 1. Do all executors

Re: spark infers date to be timestamp type

2016-10-27 Thread Steve Loughran
CSV type inference isn't really ideal: it does a full scan of a file to determine this; you are doubling the amount of data you need to read. Unless you are just exploring files in your notebook, I'd recommend doing it once, getting the schema from it then using that as the basis for the code sn
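A sketch of that infer-once pattern, with hypothetical paths: run inference on a sample, capture the schema, then pin it for production reads so they stay single-pass:

```scala
// One-off: infer from a small sample and capture the result.
val sample = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://example-bucket/sample.csv")
println(sample.schema) // record this as the fixed schema in the code

// Production: a single pass over the data, no inference scan.
val full = spark.read
  .option("header", "true")
  .schema(sample.schema) // or the hard-coded StructType
  .csv("s3a://example-bucket/full/*.csv")
```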

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-24 Thread Steve Loughran
On 24 Oct 2016, at 20:32, Cheng Lian <lian.cs@gmail.com> wrote: On 10/22/16 6:18 AM, Steve Loughran wrote: ... On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <lian.cs@gmail.com> wrote: What version of Spark are you using and how many output files does the job

Re: Getting the IP address of Spark Driver in yarn-cluster mode

2016-10-24 Thread Steve Loughran
On 24 Oct 2016, at 19:34, Masood Krohy <masood.kr...@intact.net> wrote: Hi everyone, Is there a way to set the IP address/hostname that the Spark Driver is going to be running on when launching a program through spark-submit in yarn-cluster mode (PySpark 1.6.0)? I do not see an option

Re: Issues with reading gz files with Spark Streaming

2016-10-24 Thread Steve Loughran
g the class: org.apache.spark.sql.execution.streaming.FileStreamSource On 22 October 2016 at 15:14, Steve Loughran <ste...@hortonworks.com> wrote: > On 21 Oct 2016, at 15:53, Nkechi Achara <nkach...@googlemail.com> wrote: > > Hi, > > I am using Spark 1.5.0 to read gz files w

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-22 Thread Steve Loughran
On 22 Oct 2016, at 00:48, Chetan Khatri <ckhatriman...@gmail.com> wrote: Hello Cheng, Thank you for response. I am using spark 1.6.1, I am writing around 350 gz parquet part files for single table. Processed around 180 GB of Data using Spark. Are you writing to GCS storage to the l

Re: Issues with reading gz files with Spark Streaming

2016-10-22 Thread Steve Loughran
> On 21 Oct 2016, at 15:53, Nkechi Achara wrote: > > Hi, > > I am using Spark 1.5.0 to read gz files with textFileStream, but when new > files are dropped in the specified directory. I know this is only the case > with gz files as when i extract the file into the directory specified the > f

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-20 Thread Steve Loughran
> On 19 Oct 2016, at 21:46, Jakob Odersky wrote: > > Another reason I could imagine is that files are often read from HDFS, > which by default uses line terminators to separate records. > > It is possible to implement your own hdfs delimiter finder, however > for arbitrary json data, finding th
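A sketch of the contrast being discussed: line-delimited JSON splits on the same newline boundaries the text input format already uses, while whole-document JSON (via the multiLine option, Spark 2.2+) gives up splitting. Paths are hypothetical:

```scala
// Default: one JSON object per line, splittable across workers.
val events = spark.read.json("s3a://example-bucket/events.jsonl")

// Spark 2.2+: a single multi-line document; the whole file becomes one
// unsplittable record source.
val doc = spark.read.option("multiLine", "true")
  .json("s3a://example-bucket/doc.json")
```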

Re: spark with kerberos

2016-10-19 Thread Steve Loughran
On 19 Oct 2016, at 00:18, Michael Segel <msegel_had...@hotmail.com> wrote: (Sorry, sent reply via wrong account..) Steve, Kinda hijacking the thread, but I promise it's still on topic to OP's issue.. ;-) Usually you will end up having a local Kerberos set up per cluster. So your machine

Re: About Error while reading large JSON file in Spark

2016-10-19 Thread Steve Loughran
On 18 Oct 2016, at 10:58, Chetan Khatri <ckhatriman...@gmail.com> wrote: Dear Xi Shen, Thank you for getting back to the question. The approach I am following is as below: I have MSSQL server as enterprise data lake. 1. Run Java jobs and generate JSON files; every file is almost 6 GB. Co

Re: About Error while reading large JSON file in Spark

2016-10-18 Thread Steve Loughran
On 18 Oct 2016, at 08:43, Chetan Khatri <ckhatriman...@gmail.com> wrote: Hello Community members, I am getting an error while reading a large JSON file in spark; the underlying read code can't handle more than 2^31 bytes in a single line: if (bytesConsumed > Integer.MAX_VALUE) {

Re: spark with kerberos

2016-10-18 Thread Steve Loughran
the data to a client (edge node) before pushing it out to the secured cluster. Does that make sense? On Oct 14, 2016, at 1:32 PM, Steve Loughran <ste...@hortonworks.com> wrote: On 13 Oct 2016, at 10:50, dbolshak <bolshakov.de...@gmail.com> wrote: Hello community

Re: spark with kerberos

2016-10-14 Thread Steve Loughran
On 13 Oct 2016, at 10:50, dbolshak <bolshakov.de...@gmail.com> wrote: Hello community, We've a challenge and no ideas how to solve it. The problem: say we have the following environment: 1. `cluster A`, the cluster does not use kerberos and we use it as a source of data, important thin
