Hi!
In this doc
http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark
initialization is described with SparkContext. Do you think it is reasonable
to change it to SparkSession, or just mention it at the end? I can prepare
it and make a PR for this, but I want to know your opinion.
It is not outdated at all, because there are other methods that depend on
SparkContext, so you still have to create it.
For example,
https://gist.github.com/chetkhatri/f75c2b743e6cb2d7066188687448c5a1
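For reference, a minimal sketch of the SparkSession-based entry point (Spark 2.x); the session still exposes the underlying SparkContext for APIs that need it. The app name and master setting below are illustrative:

import org.apache.spark.sql.SparkSession

// Unified entry point since Spark 2.0; getOrCreate() reuses an existing session if there is one.
val spark = SparkSession.builder()
  .appName("example-app")
  .master("local[*]")  // illustrative; normally set via spark-submit
  .getOrCreate()

// Methods that still depend on SparkContext can obtain it from the session.
val sc = spark.sparkContext
val rdd = sc.parallelize(Seq(1, 2, 3))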
On Fri, Jan 27, 2017 at 2:06 PM, Wojciech Indyk
wrote:
> Hi!
> In this doc http://spark.apache.org/d
IIUC, once the references to the RDDs are gone, the related files (e.g.,
shuffled data) of these
RDDs are automatically removed by `ContextCleaner` (
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L178
).
Since Spark can recompute from datasources (
Hi,
Just a guess, but Kinesis shards sometimes have skewed data.
So, before you compute something from the Kinesis RDDs, you'd better
repartition them for better parallelism.
// maropu
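A hedged sketch of what that repartitioning could look like (the stream variable, record handling, and partition count below are illustrative, not from the original message):

import java.nio.charset.StandardCharsets

// Illustrative: spread records evenly before any heavy computation,
// so one skewed shard does not pin most of the work on a single executor.
val numPartitions = 32  // assumption: roughly 2-3x the total executor cores
kinesisStream
  .repartition(numPartitions)
  .map(bytes => new String(bytes, StandardCharsets.UTF_8))  // Kinesis records arrive as Array[Byte]
  .foreachRDD { rdd =>
    // process each micro-batch with even parallelism
    println(s"records in batch: ${rdd.count()}")
  }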
On Fri, Jan 27, 2017 at 2:54 PM, Graham Clark wrote:
> Hi everyone - I am building a small prototype in Sp
Maybe a naive question: why are you creating one DStream per shard? Shouldn't
it be one DStream corresponding to the Kinesis stream?
On Fri, Jan 27, 2017 at 8:09 PM, Takeshi Yamamuro
wrote:
> Hi,
>
> Just a guess though, Kinesis shards sometimes have skew data.
> So, before you compute somethin
Hi All,
I read a text file using sparkContext.textFile(filename), assign it to
an RDD, process the RDD (replace some words), and finally write it to
a text file using rdd.saveAsTextFile(output).
Is there any way to be sure the order of the sentences will not be changed?
I need to have the same
Probably, he referred to the word-couting example in kinesis here:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L114
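For context, the pattern in that example is roughly the following (a sketch; the application, stream, and endpoint names are placeholders):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// One receiver (and thus one DStream) per shard, then union them into a single DStream.
val kinesisStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(ssc, "appName", "streamName",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, checkpointInterval, StorageLevel.MEMORY_AND_DISK_2)
}
val unionStream = ssc.union(kinesisStreams)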
On Fri, Jan 27, 2017 at 6:41 PM, ayan guha wrote:
> Maybe a naive question: why a
OK
Nobody should be committing output directly to S3 without having something add
a consistency layer on top, not if you want reliable (as in "doesn't
lose/corrupt data" reliable) work.
On 26 Jan 2017, at 19:09, VND Tremblay, Paul
<tremblay.p...@bcg.com> wrote:
This seems to have done t
Some operations like map, filter, flatMap and coalesce (with shuffle=false)
usually preserve the order. However, sortBy, reduceByKey, partitionBy, join,
etc. do not.
Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National Unive
I would not count on the order-preserving nature of the operations, because it
is not guaranteed. I would assign some order to the sentences and sort at
the end before writing back.
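If it helps, a sketch of that approach (paths and the transformation are placeholders):

// Tag every line with its index, transform the values, then sort by the index before writing,
// so the output lines come back in the original order.
val indexed = sc.textFile(inputPath).zipWithIndex()                    // (line, index)
val processed = indexed.map { case (line, idx) => (idx, line.replace("foo", "bar")) }
processed.sortByKey()
  .values
  .saveAsTextFile(outputPath)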
On Fri, 27 Jan 2017 at 10:59 pm, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:
> Some operations like map, fil
Hi - thanks for the responses. You are right that I started by copying the
word-counting example. I assumed that this would help spread the load
evenly across the cluster, with each worker receiving a portion of the
stream data - corresponding to one shard's worth - and then keeping the
data local
I agree with the previous statements. You cannot expect any ordering guarantee.
This means you need to ensure the same ordering as in the original
file yourself. Internally Spark uses the Hadoop client libraries - even if you do
not have Hadoop installed, because it is a flexible transparen
Sorry, the message was not complete: the key is the file position, so if you
sort by key the lines will be in the same order as in the original file.
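A sketch of that idea (assuming a plain text input; paths and the transformation are placeholders):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Read through the Hadoop API so every line carries its byte offset in the file as the key,
// then sort by that key after processing to recover the original file order.
sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](inputPath)
  .map { case (offset, line) => (offset.get, line.toString.replace("foo", "bar")) }
  .sortByKey()
  .values
  .saveAsTextFile(outputPath)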
> On 27 Jan 2017, at 14:45, Jörn Franke wrote:
>
> I agree with the previous statements. You cannot expect any ordering
> guarantee. This means y
Hi Team,
When I add a column to my data frame using withColumn and assign some
value, it automatically creates the schema with this column as not
nullable.
My final Hive table schema, where I want to insert it, has this column as
nullable, and hence it throws an error when I try to save.
Is there
Hi All,
I am trying to cache a large dataset with memory storage level and serialization
with Kryo enabled. When I run my Spark job multiple times I get different
performance; at times, while caching the dataset, Spark hangs and takes forever.
What is wrong?
The best time I got is 20 mins and some times with
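For comparison, a hedged sketch of a serialized-cache setup (the dataset path and settings are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cache-with-kryo")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val ds = spark.read.parquet("/path/to/large/dataset")  // illustrative source
ds.persist(StorageLevel.MEMORY_ONLY_SER)               // serialized in-memory cache
ds.count()                                             // materialize the cache once, up front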
It should be nullable by default, except for certain primitives where it
defaults to non-nullable.
You can use Option for your return value to indicate nullability.
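One workaround I have seen (a sketch, not necessarily the cleanest way; the column name is illustrative) is to rebuild the DataFrame with a relaxed schema:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StructField, StructType}

// Add the column, then mark it nullable in a copy of the schema and rebuild the DataFrame.
val withCol = df.withColumn("new_col", lit(0))
val relaxedSchema = StructType(withCol.schema.map {
  case StructField(name, dataType, _, metadata) if name == "new_col" =>
    StructField(name, dataType, nullable = true, metadata)
  case other => other
})
val nullableDf = spark.createDataFrame(withCol.rdd, relaxedSchema)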
On Fri, Jan 27, 2017 at 10:32 AM, Ninad Shringarpure
wrote:
> HI Team,
>
> When I add a column to my data frame using withColumn and
Dear Spark Users,
Currently, is there a way to dynamically allocate resources to Spark on
Mesos? Within Spark we can specify the CPU cores and memory before running a
job. The way I understand it is that the Spark job will not run if the CPU/memory
requirement is not met. This may lead to a decrease in overall
I'm reading CSV with a timestamp clearly identified in the UTC timezone,
and I need to store this in a parquet format and eventually read it back
and convert to different timezones as needed.
Sounds straightforward, but this involves some crazy function calls and I'm
seeing strange results as I bu
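In case it helps, a hedged sketch of one way to keep the stored values in UTC and shift only on read (the column name, timestamp format, and paths are assumptions):

import org.apache.spark.sql.functions.{col, from_utc_timestamp}

// Parse the timestamp column with an explicit (assumed) UTC pattern while reading the CSV.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'")
  .csv("/path/to/input.csv")

// Write to Parquet as-is, then convert per-timezone only when reading back.
df.write.mode("overwrite").parquet("/path/to/events.parquet")

val pacific = spark.read.parquet("/path/to/events.parquet")
  .withColumn("ts_pacific", from_utc_timestamp(col("ts"), "America/Los_Angeles"))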
code:
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "somenode:9092")
  .option("subscribe", "wikipedia")
  .load
  .select(col("value") cast StringType)
  .writeStream
  .format("console")
  .outputMode(Out
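(The snippet above is cut off; a complete sketch of the same pipeline, assuming append output mode and a console sink, would look roughly like this:)

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StringType

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "somenode:9092")
  .option("subscribe", "wikipedia")
  .load()
  .select(col("value") cast StringType)
  .writeStream
  .format("console")
  .outputMode(OutputMode.Append())
  .start()

query.awaitTermination()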
i checked my topic. it has 5 partitions but all the data is written to a
single partition: wikipedia-2
i turned on debug logging and i see this:
2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Partitions assigned to
consumer: [wikipedia-0, wikipedia-4, wikipedia-3, wikipedia-2,
wikipedia-1]. Seeki
Thanks for reporting this. Which Spark version are you using? Could you
provide the full log, please?
On Fri, Jan 27, 2017 at 10:24 AM, Koert Kuipers wrote:
> i checked my topic. it has 5 partitions but all the data is written to a
> single partition: wikipedia-2
> i turned on debug logging and
+ DEV Mailing List
On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:
> Hi,
>
> I am trying to map a Dataset with rows which have a map attribute. When I
> try to create a Row with the map attribute I get cast errors. I am able to
> reproduce the issue with the
> The way I understand is that the Spark job will not run if the CPU/Mem
> requirement is not met.
Spark jobs will still run if they only have a subset of the requested
resources. Tasks begin scheduling as soon as the first executor comes up.
Dynamic allocation yields increased utilization by only
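For reference, a hedged sketch of the usual dynamic-allocation settings (the bounds are illustrative; on Mesos this also requires the external shuffle service to be running on the agents):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")       // external shuffle service must be available
  .config("spark.dynamicAllocation.minExecutors", "1")   // illustrative bounds
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()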
try
Row newRow = RowFactory.create(row.getString(0), row.getString(1),
row.getMap(2));
On Friday, January 27, 2017 10:52 AM, Ankur Srivastava
wrote:
+ DEV Mailing List
On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava
wrote:
Hi,
I am trying to map a Dataset with rows which have a ma
Thank you, Richard, for responding.
I am able to run it successfully by using row.getMap, but since I have to
update the map I wanted to use the HashMap API. Is there a way I can use
that? And I am surprised that it worked in the first case, where I am creating a
Dataset from a list of rows, but fails in the Map fu
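If it helps, one way to do that in Scala (a sketch; field positions follow the earlier snippet and the update itself is illustrative) is to copy the map, update the copy, and build a new Row from it:

import org.apache.spark.sql.Row

// Read the map column, derive an updated copy, and construct a new Row with it.
val oldMap = row.getMap[String, String](2).toMap
val newMap = oldMap + ("newKey" -> "newValue")  // illustrative update
val newRow = Row(row.getString(0), row.getString(1), newMap)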
in case anyone else runs into this:
the issue is that i was using kafka-clients 0.10.1.1
it works when i use kafka-clients 0.10.0.1 with spark structured streaming
my kafka server is 0.10.1.1
On Fri, Jan 27, 2017 at 1:24 PM, Koert Kuipers wrote:
> i checked my topic. it has 5 partitions but a
What about Spark on Kubernetes, is there a way to manage dynamic resource allocation?
Regards,
Mihai Iacob
Yeah, Kafka server/client compatibility can be pretty confusing and does
not give good errors in the case of mismatches. This should be addressed
in the next release of Kafka (they are adding an API to query the server's
capabilities).
On Fri, Jan 27, 2017 at 12:56 PM, Koert Kuipers wrote:
> in
Not sure what you mean by "a consistency layer on top." Any explanation would
be greatly appreciated!
Paul
_
Paul Tremblay
Analytics Specialist
THE BOSTON CONSULTING GROUP
In June, the 10th Spark Summit will take place in San Francisco at Moscone
West. We have expanded our CFP to include more topics and deep-dive
technical sessions.
Take center stage in front of your fellow Spark enthusiasts. Submit your
presentation and join us for the big ten. The CFP closes on Fe
Hi Team,
Right now our existing flow is:
Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (Hive
Context) --> destination Hive table --> Sqoop export to Oracle.
Half of the required Hive UDFs are developed as Java UDFs.
So now I want to know whether running the native Scala UDFs rather than the Hive
Java
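For what it's worth, registering a native Scala UDF for use from Spark SQL looks roughly like this (function, column, and table names are illustrative):

// A plain Scala function registered as a SQL UDF, with no Hive Java UDF wrapper involved.
val normalize = (s: String) => if (s == null) null else s.trim.toUpperCase
spark.udf.register("normalize_str", normalize)

spark.sql("SELECT normalize_str(customer_name) FROM source_table")  // illustrative query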
@Ted, I don't think so.
On Thu, Jan 26, 2017 at 6:35 AM, Ted Yu wrote:
> Does the storage handler provide bulk load capability ?
>
> Cheers
>
> On Jan 25, 2017, at 3:39 AM, Amrit Jangid
> wrote:
>
> Hi chetan,
>
> If you just need HBase Data into Hive, You can use Hive EXTERNAL TABLE
> with
>
>
On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu wrote:
> Hi Team,
>
> RIght now our existing flow is
>
> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive
> Context)-->Destination Hive table -->sqoop export to Oracle
>
> Half of the Hive UDFS required is developed in Java UDF..
>
> SO N
You can treat Oracle as a JDBC source (
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
and skip Sqoop, HiveTables and go straight to Queries. Then you can skip
hive on the way back out (see the same link) and write directly to Oracle.
I'll leave the performa
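A rough sketch of that JDBC round trip (host, service, credentials, and table names are placeholders):

// Read straight from Oracle over JDBC.
val source = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
  .option("dbtable", "SCHEMA.SOURCE_TABLE")
  .option("user", "user")
  .option("password", "password")
  .load()

// ... run the Spark SQL transformations / UDFs here ...

// Write the result back to Oracle, again over JDBC.
source.write.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
  .option("dbtable", "SCHEMA.TARGET_TABLE")
  .option("user", "user")
  .option("password", "password")
  .mode("append")
  .save()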
Storage handler bulk load:
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT … ;
But for now, you have to do some work and issue multiple Hive commands:
- Sample source data for range partitioning
- Save sampling results to a file
- Run CLUSTER BY query using HiveHFileOutputFormat and TotalOr