Hi!
In this doc
http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark
initialization is described with SparkContext. Do you think it is reasonable
to change it to SparkSession, or just mention it at the end? I can prepare
it and make a PR for this, but I want to know your opinion.
It is not outdated at all, because there are other methods that depend on
SparkContext, so you still have to create it.
For example,
https://gist.github.com/chetkhatri/f75c2b743e6cb2d7066188687448c5a1
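For reference, a minimal sketch of the SparkSession-based entry point (Spark 2.x); the session still exposes the underlying SparkContext for APIs that need it. The app name and master setting below are illustrative:

import org.apache.spark.sql.SparkSession

// Unified entry point since Spark 2.0; getOrCreate() reuses an existing session if there is one.
val spark = SparkSession.builder()
  .appName("example-app")
  .master("local[*]")  // illustrative; normally set via spark-submit
  .getOrCreate()

// Methods that still depend on SparkContext can obtain it from the session.
val sc = spark.sparkContext
val rdd = sc.parallelize(Seq(1, 2, 3))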
On Fri, Jan 27, 2017 at 2:06 PM, Wojciech Indyk
wrote:
> Hi!
> In this doc http://spark.apache.org/d
IIUC, once the references to the RDDs are gone, the related files (e.g.,
shuffled data) of these
RDDs are automatically removed by `ContextCleaner` (
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L178
).
Since Spark can recompute from datasources (
Hi,
Just a guess, but Kinesis shards sometimes have skewed data.
So, before you compute something from the Kinesis RDDs, you'd better
repartition them for better parallelism.
// maropu
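A hedged sketch of what that repartitioning could look like (the stream variable, record handling, and partition count below are illustrative, not from the original message):

import java.nio.charset.StandardCharsets

// Illustrative: spread records evenly before any heavy computation,
// so one skewed shard does not pin most of the work on a single executor.
val numPartitions = 32  // assumption: roughly 2-3x the total executor cores
kinesisStream
  .repartition(numPartitions)
  .map(bytes => new String(bytes, StandardCharsets.UTF_8))  // Kinesis records arrive as Array[Byte]
  .foreachRDD { rdd =>
    // process each micro-batch with even parallelism
    println(s"records in batch: ${rdd.count()}")
  }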
On Fri, Jan 27, 2017 at 2:54 PM, Graham Clark wrote:
> Hi everyone - I am building a small prototype in Sp
Maybe a naive question: why are you creating one DStream per shard? Shouldn't
it be one DStream corresponding to the Kinesis stream?
On Fri, Jan 27, 2017 at 8:09 PM, Takeshi Yamamuro
wrote:
> Hi,
>
> Just a guess though, Kinesis shards sometimes have skew data.
> So, before you compute somethin
Hi All,
I read a text file using sparkContext.textFile(filename), assign it to
an RDD, process the RDD (replace some words), and finally write it to
a text file using rdd.saveAsTextFile(output).
Is there any way to be sure the order of the sentences will not be changed?
I need to have the same
Probably, he referred to the word-couting example in kinesis here:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L114
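For context, the pattern in that example is roughly the following (a sketch; the application, stream, and endpoint names are placeholders):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// One receiver (and thus one DStream) per shard, then union them into a single DStream.
val kinesisStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(ssc, "appName", "streamName",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, checkpointInterval, StorageLevel.MEMORY_AND_DISK_2)
}
val unionStream = ssc.union(kinesisStreams)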
On Fri, Jan 27, 2017 at 6:41 PM, ayan guha wrote:
> Maybe a naive question: why a
OK
Nobody should be committing output directly to S3 without having something add
a consistency layer on top, not if you want reliable (as in "doesn't
lose/corrupt data" reliable) work.
On 26 Jan 2017, at 19:09, VND Tremblay, Paul
<tremblay.p...@bcg.com> wrote:
This seems to have done t
Some operations like map, filter, flatMap and coalesce (with shuffle=false)
usually preserve the order. However, sortBy, reduceByKey, partitionBy, join,
etc. do not.
Regards,
_
*Md. Rezaul Karim*, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National Unive
I would not count on the order-preserving nature of the operations, because it
is not guaranteed. I would assign some order to the sentences and sort at
the end before writing back.
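If it helps, a sketch of that approach (paths and the transformation are placeholders):

// Tag every line with its index, transform the values, then sort by the index before writing,
// so the output lines come back in the original order.
val indexed = sc.textFile(inputPath).zipWithIndex()                    // (line, index)
val processed = indexed.map { case (line, idx) => (idx, line.replace("foo", "bar")) }
processed.sortByKey()
  .values
  .saveAsTextFile(outputPath)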
On Fri, 27 Jan 2017 at 10:59 pm, Md. Rezaul Karim <
rezaul.ka...@insight-centre.org> wrote:
> Some operations like map, fil
Hi - thanks for the responses. You are right that I started by copying the
word-counting example. I assumed that this would help spread the load
evenly across the cluster, with each worker receiving a portion of the
stream data - corresponding to one shard's worth - and then keeping the
data local
I agree with the previous statements. You cannot expect any ordering guarantee.
This means you need to ensure the same ordering as in the original
file yourself. Internally Spark uses the Hadoop client libraries - even if you do
not have Hadoop installed, because it is a flexible transparen
Sorry, the message was not complete: the key is the file position, so if you
sort by key the lines will be in the same order as in the original file.
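A sketch of that idea (assuming a plain text input; paths and the transformation are placeholders):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Read through the Hadoop API so every line carries its byte offset in the file as the key,
// then sort by that key after processing to recover the original file order.
sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](inputPath)
  .map { case (offset, line) => (offset.get, line.toString.replace("foo", "bar")) }
  .sortByKey()
  .values
  .saveAsTextFile(outputPath)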
> On 27 Jan 2017, at 14:45, Jörn Franke wrote:
>
> I agree with the previous statements. You cannot expect any ordering
> guarantee. This means y
Hi Team,
When I add a column to my data frame using withColumn and assign some
value, it automatically creates the schema with this column as not
nullable.
My final Hive table schema, where I want to insert it, has this column as
nullable, and hence it throws an error when I try to save.
Is there
Hi All,
I am trying to cache a large dataset with memory storage level and serialization
with Kryo enabled. When I run my Spark job multiple times I get different
performance; at times, while caching the dataset, Spark hangs and takes forever.
What is wrong?
The best time I got is 20 mins and some times with
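For comparison, a hedged sketch of a serialized-cache setup (the dataset path and settings are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cache-with-kryo")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val ds = spark.read.parquet("/path/to/large/dataset")  // illustrative source
ds.persist(StorageLevel.MEMORY_ONLY_SER)               // serialized in-memory cache
ds.count()                                             // materialize the cache once, up front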
It should be nullable by default, except for certain primitives where it
defaults to non-nullable.
You can use Option for your return value to indicate nullability.
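One workaround I have seen (a sketch, not necessarily the cleanest way; the column name is illustrative) is to rebuild the DataFrame with a relaxed schema:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StructField, StructType}

// Add the column, then mark it nullable in a copy of the schema and rebuild the DataFrame.
val withCol = df.withColumn("new_col", lit(0))
val relaxedSchema = StructType(withCol.schema.map {
  case StructField(name, dataType, _, metadata) if name == "new_col" =>
    StructField(name, dataType, nullable = true, metadata)
  case other => other
})
val nullableDf = spark.createDataFrame(withCol.rdd, relaxedSchema)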
On Fri, Jan 27, 2017 at 10:32 AM, Ninad Shringarpure
wrote:
> HI Team,
>
> When I add a column to my data frame using withColumn and
Dear Spark Users,
Currently, is there a way to dynamically allocate resources to Spark on
Mesos? Within Spark we can specify the CPU cores and memory before running a
job. The way I understand it is that the Spark job will not run if the CPU/memory
requirement is not met. This may lead to a decrease in overall
I'm reading CSV with a timestamp clearly identified in the UTC timezone,
and I need to store this in a parquet format and eventually read it back
and convert to different timezones as needed.
Sounds straightforward, but this involves some crazy function calls and I'm
seeing strange results as I bu
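In case it helps, a hedged sketch of one way to keep the stored values in UTC and shift only on read (the column name, timestamp format, and paths are assumptions):

import org.apache.spark.sql.functions.{col, from_utc_timestamp}

// Parse the timestamp column with an explicit (assumed) UTC pattern while reading the CSV.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss'Z'")
  .csv("/path/to/input.csv")

// Write to Parquet as-is, then convert per-timezone only when reading back.
df.write.mode("overwrite").parquet("/path/to/events.parquet")

val pacific = spark.read.parquet("/path/to/events.parquet")
  .withColumn("ts_pacific", from_utc_timestamp(col("ts"), "America/Los_Angeles"))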
code:
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "somenode:9092")
  .option("subscribe", "wikipedia")
  .load
  .select(col("value") cast StringType)
  .writeStream
  .format("console")
  .outputMode(Out
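(The snippet above is cut off; a complete sketch of the same pipeline, assuming append output mode and a console sink, would look roughly like this:)

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StringType

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "somenode:9092")
  .option("subscribe", "wikipedia")
  .load()
  .select(col("value") cast StringType)
  .writeStream
  .format("console")
  .outputMode(OutputMode.Append())
  .start()

query.awaitTermination()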
i checked my topic. it has 5 partitions but all the data is written to a
single partition: wikipedia-2
i turned on debug logging and i see this:
2017-01-27 13:02:50 DEBUG kafka010.KafkaSource: Partitions assigned to
consumer: [wikipedia-0, wikipedia-4, wikipedia-3, wikipedia-2,
wikipedia-1]. Seeki
Thanks for reporting this. Which Spark version are you using? Could you
provide the full log, please?
On Fri, Jan 27, 2017 at 10:24 AM, Koert Kuipers wrote:
> i checked my topic. it has 5 partitions but all the data is written to a
> single partition: wikipedia-2
> i turned on debug logging and
+ DEV Mailing List
On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:
> Hi,
>
> I am trying to map a Dataset with rows which have a map attribute. When I
> try to create a Row with the map attribute I get cast errors. I am able to
> reproduce the issue with the
> The way I understand is that the Spark job will not run if the CPU/Mem
> requirement is not met.
Spark jobs will still run if they only have a subset of the requested
resources. Tasks begin scheduling as soon as the first executor comes up.
Dynamic allocation yields increased utilization by only
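For reference, a hedged sketch of the usual dynamic-allocation settings (the bounds are illustrative; on Mesos this also requires the external shuffle service to be running on the agents):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")       // external shuffle service must be available
  .config("spark.dynamicAllocation.minExecutors", "1")   // illustrative bounds
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()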
try
Row newRow = RowFactory.create(row.getString(0), row.getString(1),
row.getMap(2));
On Friday, January 27, 2017 10:52 AM, Ankur Srivastava
wrote:
+ DEV Mailing List
On Thu, Jan 26, 2017 at 5:12 PM, Ankur Srivastava
wrote:
Hi,
I am trying to map a Dataset with rows which have a ma
Thank you, Richard, for responding.
I am able to run it successfully by using row.getMap, but since I have to
update the map I wanted to use the HashMap API. Is there a way I can use
that? And I am surprised that it worked in the first case, where I am creating a
Dataset from a list of rows, but fails in the Map fu
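If it helps, one way to do that in Scala (a sketch; field positions follow the earlier snippet and the update itself is illustrative) is to copy the map, update the copy, and build a new Row from it:

import org.apache.spark.sql.Row

// Read the map column, derive an updated copy, and construct a new Row with it.
val oldMap = row.getMap[String, String](2).toMap
val newMap = oldMap + ("newKey" -> "newValue")  // illustrative update
val newRow = Row(row.getString(0), row.getString(1), newMap)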
in case anyone else runs into this:
the issue is that i was using kafka-clients 0.10.1.1
it works when i use kafka-clients 0.10.0.1 with spark structured streaming
my kafka server is 0.10.1.1
On Fri, Jan 27, 2017 at 1:24 PM, Koert Kuipers wrote:
> i checked my topic. it has 5 partitions but a
What about Spark on Kubernetes, is there a way to manage dynamic resource allocation?
Regards,
Mihai Iacob
Yeah, Kafka server/client compatibility can be pretty confusing and does
not give good errors in the case of mismatches. This should be addressed
in the next release of Kafka (they are adding an API to query the server's
capabilities).
On Fri, Jan 27, 2017 at 12:56 PM, Koert Kuipers wrote:
> in
Not sure what you mean by "a consistency layer on top." Any explanation would
be greatly appreciated!
Paul
_
Paul Tremblay
Analytics Specialist
THE BOSTON CONSULTING GROUP
In June, the 10th Spark Summit will take place in San Francisco at Moscone
West. We have expanded our CFP to include more topics and deep-dive
technical sessions.
Take center stage in front of your fellow Spark enthusiasts. Submit your
presentation and join us for the big ten. The CFP closes on Fe
Hi Team,
Right now our existing flow is:
Oracle --> Sqoop --> Hive --> Hive queries on Spark SQL (Hive
Context) --> destination Hive table --> Sqoop export to Oracle.
Half of the required Hive UDFs are developed as Java UDFs.
So now I want to know whether running the native Scala UDFs rather than the Hive
Java
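For what it's worth, registering a native Scala UDF for use from Spark SQL looks roughly like this (function, column, and table names are illustrative):

// A plain Scala function registered as a SQL UDF, with no Hive Java UDF wrapper involved.
val normalize = (s: String) => if (s == null) null else s.trim.toUpperCase
spark.udf.register("normalize_str", normalize)

spark.sql("SELECT normalize_str(customer_name) FROM source_table")  // illustrative query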
@Ted, I don't think so.
On Thu, Jan 26, 2017 at 6:35 AM, Ted Yu wrote:
> Does the storage handler provide bulk load capability ?
>
> Cheers
>
> On Jan 25, 2017, at 3:39 AM, Amrit Jangid
> wrote:
>
> Hi chetan,
>
> If you just need HBase Data into Hive, You can use Hive EXTERNAL TABLE
> with
>
>
On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu wrote:
> Hi Team,
>
> RIght now our existing flow is
>
> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive
> Context)-->Destination Hive table -->sqoop export to Oracle
>
> Half of the Hive UDFS required is developed in Java UDF..
>
> SO N
You can treat Oracle as a JDBC source (
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
and skip Sqoop, HiveTables and go straight to Queries. Then you can skip
hive on the way back out (see the same link) and write directly to Oracle.
I'll leave the performa
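A rough sketch of that JDBC round trip (host, service, credentials, and table names are placeholders):

// Read straight from Oracle over JDBC.
val source = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
  .option("dbtable", "SCHEMA.SOURCE_TABLE")
  .option("user", "user")
  .option("password", "password")
  .load()

// ... run the Spark SQL transformations / UDFs here ...

// Write the result back to Oracle, again over JDBC.
source.write.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
  .option("dbtable", "SCHEMA.TARGET_TABLE")
  .option("user", "user")
  .option("password", "password")
  .mode("append")
  .save()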
Storage handler bulk load:
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE users SELECT … ;
But for now, you have to do some work and issue multiple Hive commands:
- Sample source data for range partitioning
- Save sampling results to a file
- Run CLUSTER BY query using HiveHFileOutputFormat and TotalOr