Hi,
I took your code and ran it on Spark 2.4.5 and it works ok for me. My first
thought, like Sean, is that you have some Spark ML version mismatch somewhere.
Chris
> On 17 Aug 2020, at 16:18, Sean Owen wrote:
>
>
> Hm, next guess: you need a no-arg constructor this() on Fo
There’s also flint: https://github.com/twosigma/flint
> On 19 Sep 2018, at 17:55, Jörn Franke wrote:
>
> What functionality do you need ? Ie which methods?
>
>> On 19. Sep 2018, at 18:01, Mina Aslani wrote:
>>
>> Hi,
>> I have a question for you. Do we have any Time-Series Forecasting library
spark being able
to use the serializable versions.
That’s very much a last resort though!
Chris
> On 30 Nov 2018, at 05:08, Koert Kuipers wrote:
>
> if you only use it in the executors sometimes using lazy works
>
>> On Thu, Nov 29, 2018 at 9:45 AM James Starks
>>
information should allow us to figure out what.
Thanks,
Chris
> On 8 Apr 2019, at 09:21, neeraj bhadani wrote:
>
> Hi All,
> Can anyone help me here with my query?
>
> Regards,
> Neeraj
>
>> On Mon, Apr 1, 2019 at 9:44 AM neeraj bhadani
>> wrote:
achine-learning/
Visit us at Spark Summit in San Francisco next week and we would be happy to
provide more info on how we can help speed up your Spark ML applications and
at the same time reduce the TCO.
Chris Kachris
www.inaccel.com <http://www.inaccel.com>
Hi,
The solution we have is to make a generic spark-submit Python file (driver.py).
This is just a main method which takes a single parameter: the module
containing the app you want to run. The main method itself just dynamically
loads the module and executes some well-known method on it (we us
is cached for the
lifetime of the dataframe.
In the case of parquet files Spark solves this issue via FileScanRDD. For
Datasource V2 it’s not obvious how one would solve a similar problem. Does
anyone have any ideas or prior art here?
Thanks,
Chris
by providing an answer ;)
Any help will be greatly appreciated, because otherwise I'm stuck with Spark
1.1.0, as quadrupling runtime is not an option.
Sincerely,
Chris
2015-02-09T14:06:06.328+01:00 INFO org.apache.spark.executor.Executor
Running task 9.0 in stage 18.0 (TID 300)
See
https://issues.apache.org/jira/browse/SPARK-5715
I'm also curious if this is possible, so while I can't offer a solution
maybe you could try the following.
The driver and executor nodes need to have access to the same
(distributed) file system, so you could try to mount the file system to
your laptop, locally, and then try to submit jobs and/or
<https://stackoverflow.com/questions/67466878/can-spark-with-external-shuffle-service-use-saved-shuffle-files-in-the-event-of> open
on the same if you would rather answer directly there).
Kind regards,
Chris
One thing I would check is this line:
val fetchedRdd = rdd.map(r => (r.getGroup, r))
how many distinct groups do you end up with? If there's just one then I
think you might see the behaviour you observe.
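If it helps, here's a quick self-contained check (a sketch only; the record type
and group field are stand-ins for your crawl records):

import org.apache.spark.sql.SparkSession

// Build the keyed RDD the same way and count the distinct keys. If this prints 1,
// all records share one group key and grouping on it will put everything in one partition.
val spark = SparkSession.builder.appName("group-key-check").getOrCreate()
val sc = spark.sparkContext

case class Rec(group: String, payload: String)   // stand-in for the real records
val rdd = sc.parallelize(Seq(Rec("a", "x"), Rec("a", "y"), Rec("b", "z")))

val fetchedRdd = rdd.map(r => (r.group, r))
println(s"distinct group keys: ${fetchedRdd.keys.distinct().count()}")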
Chris
On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote:
> Also just to f
oving "spark.task.cpus":"16" or setting
spark.executor.cores to 1.
4. print out the group keys and see if there's any weird pattern to them.
5. See if the same thing happens in spark local.
If you have a reproducible example you can post publicly then I'm happy
to take a look.
Ch
10MB
spark.sql.files.minPartitionNum 1000
Unfortunately we still see a large number of empty partitions and a small
number containing the rest of the data (see median vs max number of input
records).
Any help would be much appreciated
Chris
r, I don't quite understand the link between the splitting settings,
row group configuration, and resulting number of records when reading from
a delta table.
For more specifics: we're running Spark 3.1.2 using ADLS as cloud storage.
Best,
Chris
On Fri, Feb 11, 2022 at 1:40 PM Adam Binford w
483
num_row_groups: 1
format_version: 1.0
serialized_size: 6364
Columns
...
Chris
On Fri, Feb 11, 2022 at 3:37 PM Sean Owen wrote:
> It should just be parquet.block.size indeed.
> spark.write.option("parquet.block.size", "16m").parquet(...)
> This is an issue
. We set
> spark.hadoop.parquet.block.size in our spark config for writing to Delta.
>
> Adam
>
> On Fri, Feb 11, 2022, 10:15 AM Chris Coutinho
> wrote:
>
>> I tried re-writing the table with the updated block size but it doesn't
>> appear to have an effect
es in the static table for that particular non-unique key. The key is a
single column.
Thanks,
Chris
On Fri, 2022-02-11 at 20:10 +, Gourav Sengupta wrote:
> Gourav Sengupta
blob/v2.2.8/gcs/CONFIGURATION.md#io-configuration
[3] https://issues.apache.org/jira/browse/HADOOP-13327
[4] https://issues.apache.org/jira/browse/HADOOP-17597
Chris Nauroth
On Wed, Nov 9, 2022 at 11:04 PM second_co...@yahoo.com.INVALID
wrote:
> when running spark job, i used
>
> &q
ou'll hit after this. Speaking just in terms of what the build
does though, this should be sufficient.
I hope this helps.
Chris Nauroth
On Thu, Nov 17, 2022 at 11:32 PM Gnana Kumar
wrote:
> I have maven built the spark-kubernetes jar (
> spark-kubernetes_2.12-3.3.2-SNAPSHOT ) but wh
ficiency. (Of course, we're not always in
complete control of the data formats we're given, so the support for bz2 is
there.)
[1]
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
[2]
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aw
re also would not be any way to produce bgzip-style output like in the
df.write.option code sample. To achieve either of those, it would require
writing a custom Hadoop compression codec to integrate more closely with
the data format.
Chris Nauroth
On Mon, Dec 5, 2022 at 2:08 PM Olive
d = (binary_test - binary_train.mean()) /
binary_train.std()
On a data set this small, the difference in models could also be the result of
how the training/test sets were split.
Have you tried running k-folds cross validation on a larger data set?
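If it helps, a rough sketch of what I mean (not your pipeline; `training` is an
assumed DataFrame with the usual "features"/"label" columns):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// k-fold CV averages the metric over k splits, which is a much steadier signal
// than a single small train/test split.
val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(5)

val cvModel = cv.fit(training)   // `training` is assumed here, not from the thread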
Chris
> On May 20, 2015, at 6:15 PM, DB Tsai wrote:
>
I’m trying to iterate through a list of Columns and create new Columns based on
a condition. However, the when method keeps giving me errors that don’t quite
make sense.
If I do `when(col === “abc”, 1).otherwise(0)` I get the following error at
compile time:
[error] not found: value when
Howe
That did it! Thanks!
From: Yin Huai
Date: Friday, June 12, 2015 at 10:31 AM
To: Chris Freeman
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Issues with `when` in Column class
Hi Chris,
Have you imported "org.apache.spark.sql.functions._"?
Th
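(For reference, a minimal sketch of the fix Yin is pointing at, using a recent
Spark entry point and made-up data:)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._   // brings `when`, `col`, etc. into scope

// Without the functions._ import above, `when` is unresolved and the compiler
// reports "not found: value when".
val spark = SparkSession.builder.appName("when-example").getOrCreate()
import spark.implicits._

val df = Seq("abc", "def").toDF("label")
df.withColumn("flag", when(col("label") === "abc", 1).otherwise(0)).show()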
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
This does, in fact, silence the logging for everything else, but the Parquet
config seems totally unchanged. Does anyone know how to do this?
Thanks!
-Chris Freeman
and reading a local Parquet file from my hard drive.
-Chris
From: Cheng Lian [lian.cs@gmail.com]
Sent: Saturday, June 13, 2015 6:56 PM
To: Chris Freeman; user@spark.apache.org
Subject: Re: How to silence Parquet logging?
Hi Chris,
Which Spark version were you
Hi Ravi,
Welcome, you probably want RDD.saveAsTextFile("hdfs:///my_file")
Chris
> On Jun 22, 2015, at 5:28 PM, ravi tella wrote:
>
>
> Hello All,
> I am new to Spark. I have a very basic question. How do I write the output of
> an action on an RDD to HDFS?
>
> Thanks
Hi Ravi,
For this case, you could simply do
sc.parallelize([rdd.first()]).saveAsTextFile("hdfs:///my_file") using pyspark
or sc.parallelize(Array(rdd.first())).saveAsTextFile("hdfs:///my_file") using
Scala
Chris
> On Jun 22, 2015, at 5:53 PM, ddpis...@gmail.com wrote:
>
> Hi Chris,
add the partitions manually so that I can specify a location.
For what it's worth, "ActionEnum" is the first field in my schema. This
same table and query structure works fine with Hive. When I try to run this
with SparkSQL, however, I get the above error.
Anyone have any idea what the problem is here? Thanks!
--
Chris Miller
.lang.Thread.run(Thread.java:745)
****
In addition to the above, I also tried putting the test Avro files on HDFS
instead of S3 -- the error is the same. I also tried querying from Scala
instead of using Zeppelin, and I get the same error.
Where should I begin with troubleshooting th
ool.run(DataFileWriteTool.java:99)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
****
Any other ideas?
--
Chris Miller
On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman wrote:
> your field name is
> *enum1_values*
>
> but
oding the same files work fine with Hive, and I
imagine the same deserializer code is used there too.
Thoughts?
--
Chris Miller
On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman wrote:
> your field name is
> *enum1_values*
>
> but you have data
> { "foo1": "test123&q
p26386p26392.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
ome learning overhead if we go the Scala route. What I want to
>>>>> know is: is the Scala version of Spark still far enough ahead of pyspark
>>>>> to
>>>>> be well worth any initial training overhead?
>>>>>
>>>>>
>>>>> If you are a very advanced Python shop and if you’ve in-house
>>>>> libraries that you have written in Python that don’t exist in Scala or
>>>>> some
>>>>> ML libs that don’t exist in the Scala version and will require fair amount
>>>>> of porting and gap is too large, then perhaps it makes sense to stay put
>>>>> with Python.
>>>>>
>>>>> However, I believe, investing (or having some members of your group)
>>>>> learn and invest in Scala is worthwhile for few reasons. One, you will get
>>>>> the performance gain, especially now with Tungsten (not sure how it
>>>>> relates
>>>>> to Python, but some other knowledgeable people on the list, please chime
>>>>> in). Two, since Spark is written in Scala, it gives you an enormous
>>>>> advantage to read sources (which are well documented and highly readable)
>>>>> should you have to consult or learn nuances of certain API method or
>>>>> action
>>>>> not covered comprehensively in the docs. And finally, there’s a long term
>>>>> benefit in learning Scala for reasons other than Spark. For example,
>>>>> writing other scalable and distributed applications.
>>>>>
>>>>>
>>>>> Particularly, we will be using Spark Streaming. I know a couple of
>>>>> years ago that practically forced the decision to use Scala. Is this
>>>>> still
>>>>> the case?
>>>>>
>>>>>
>>>>> You’ll notice that certain APIs call are not available, at least for
>>>>> now, in Python.
>>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>>>>>
>>>>>
>>>>> Cheers
>>>>> Jules
>>>>>
>>>>> --
>>>>> The Best Ideas Are Simple
>>>>> Jules S. Damji
>>>>> e-mail:dmat...@comcast.net
>>>>> e-mail:jules.da...@gmail.com
>>>>>
>>>>>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
instead of writing to the file from coalesce, sort that data structure,
then write your file.
--
Chris Miller
On Sat, Mar 5, 2016 at 5:24 AM, jelez wrote:
> My streaming job is creating files on S3.
> The problem is that those files end up very small if I just write them to
> S3
> dir
Guru: This is a really great response. Thanks for taking the time to explain
all of this. Helpful for me too.
--
Chris Miller
On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani wrote:
> Hi Lan,
>
> Streaming Means, Linear Regression and Logistic Regression support online
> machine lea
Gut instinct is no, Spark is overkill for your needs... you should be able
to accomplish all of that with a relational database or a column oriented
database (depending on the types of queries you most frequently run and the
performance requirements).
--
Chris Miller
On Mon, Mar 7, 2016 at 1:17
For anyone running into this same issue, it looks like Avro deserialization
is just broken when used with SparkSQL and partitioned schemas. I created
a bug report with details and a simplified example on how to reproduce:
https://issues.apache.org/jira/browse/SPARK-13709
--
Chris Miller
On Fri
I'm getting an exception when I try to submit a job (through prediction.io, if
you know it):
[INFO] [Runner$] Submission command:
/home/pio/PredictionIO/vendors/spark-1.5.1/bin/spark-submit --class
io.prediction.tools.imprt.FileToEvents --files
file:/home/pio/PredictionIO/conf/log4j.properties,
apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubs
the DAGScheduler, which
> will be apportioning the Tasks from those concurrent Jobs across the
> available Executor cores.
>
> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly wrote:
>
>> Good stuff, Evan. Looks like this is utilizing the in-memory
>> capabilities of FiloDB
bra operation
> >
> > Unfortunately, I'm fairly ignorant as to the internal mechanics of ALS
> > itself. Is what I'm asking possible?
> >
> > Thank you,
> > Colin
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
for analytical queries
> that the OP wants; and MySQL is great but not scalable. Probably
> something like VectorWise, HANA, Vertica would work well, but those
> are mostly not free solutions. Druid could work too if the use case
> is right.
>
> Anyways, great discussio
one the datum?
Seems I'm not the only one who ran into this problem:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/102. I can't
figure out how to fix it in my case without hacking away like the person in
the linked PR did.
Suggestions?
--
Chris Miller
What exactly are you trying to do? Zeppelin is for interactive analysis of
a dataset. What do you mean "realtime analytics" -- do you mean build a
report or dashboard that automatically updates as new data comes in?
--
Chris Miller
On Sat, Mar 12, 2016 at 3:13 PM, trung kien wrote:
tln(record.get("myValue"))
})
*
What am I doing wrong?
--
Chris Miller
On Sat, Mar 12, 2016 at 1:48 PM, Peyman Mohajerian
wrote:
> Here is the reason for the behavior:
> '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable
> objec
removed.
Finally, if I add rdd.persist(), then it doesn't work. I guess I would need
to do .map(_._1.datum) again before the map that does the real work.
--
Chris Miller
On Sat, Mar 12, 2016 at 4:15 PM, Chris Miller
wrote:
> Wow! That sure is buried in the documentation! But yeah, that
trigger your code to push out an updated value to any clients via the
websocket. You could use something like a Redis pub/sub channel to trigger
the web app to notify clients of an update.
There are about 5 million other ways you could design this, but I would
just keep it as simple as possible. I
Cool! Thanks for sharing.
--
Chris Miller
On Sun, Mar 13, 2016 at 12:53 AM, Todd Nist wrote:
> Below is a link to an example which Silvio Fiorito put together
> demonstrating how to link Zeppelin with Spark Stream for real-time charts.
> I think the original thread was pack in early
r you described.
--
Chris Miller
On Tue, Mar 15, 2016 at 11:22 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> There are many solutions to a problem.
>
> Also understand that sometimes your situation might be such. For ex what
> if you are accessing S3 from
Short answer: Nope
Less short answer: Spark is not designed to maintain sort order in this
case... it *may*, but there's no guarantee... generally, it would not be in
the same order unless you implement something to order by and then sort the
result based on that.
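A self-contained sketch of that "something to order by" idea (toy data; the
toUpperCase stands in for whatever processing you actually do):

import org.apache.spark.{SparkConf, SparkContext}

// Tag each element with its position up front, transform, then sort on the tag
// at the end instead of relying on implicit ordering.
val sc = new SparkContext(new SparkConf().setAppName("keep-order").setMaster("local[4]"))

val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 4)
val tagged = rdd.zipWithIndex().map { case (v, i) => (i, v.toUpperCase) }
val backInOrder = tagged.sortByKey().values
backInOrder.collect().foreach(println)   // A, B, C, D in the original order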
--
Chris Miller
On Wed, M
If you have lots of small files, distcp should handle that well -- it's
supposed to distribute the transfer of files across the nodes in your
cluster. Conductor looks interesting if you're trying to distribute the
transfer of single, large file(s)...
right?
--
Chris Miller
On Wed, Ma
With Avro you solve this by using a default value for the new field...
maybe Parquet is the same?
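For Parquet specifically, schema merging might be worth a try. A sketch (the
paths are made up; assumes the spark-shell sqlContext):

// Schema merging unions the two file schemas; rows from the files without the
// extra columns come back with nulls there, much like an Avro default.
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/data/events_v1", "/data/events_v2")
merged.printSchema()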
--
Chris Miller
On Tue, Mar 22, 2016 at 9:34 PM, gtinside wrote:
> Hi ,
>
> I have a table sourced from *2 parquet files* with a few extra columns in one
> of the parquet files. Simp
the processed logs to both elastic search and
> kafka. So that Spark Streaming can pick data from Kafka for the complex use
> cases, while logstash filters can be used for the simpler use cases.
>
> I was wondering if someone has already done this evaluation and could
> provide me
with production-ready Kafka Streams,
so I can try this out myself - and hopefully remove a lot of extra plumbing.
On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly wrote:
> this is a very common pattern, yes.
>
> note that in Netflix's case, they're currently pushing all of their logs
>
perhaps renaming to Spark ML would actually clear up code and documentation
confusion?
+1 for rename
> On Apr 5, 2016, at 7:00 PM, Reynold Xin wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
>> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley wrote:
>> +1 By the way, the JIRA for tracking
flatmap?
--
Chris Miller
On Thu, Apr 7, 2016 at 10:25 PM, greg huang wrote:
> Hi All,
>
> Can someone give me an example of code to get rid of the empty string in
> JavaRDD? I know there is a filter method in JavaRDD:
> https://spark.apache.org/docs/1.6.0/api/java/org/apache/spa
this took me a bit to get working, but I finally got it up and running with
the package that Burak pointed out.
here's some relevant links to my project that should give you some clues:
https://github.com/fluxcapacitor/pipeline/blob/master/myapps/spark/ml/src/main/scala/com/advancedspark/ml/n
Tue, May 17, 2016 at 1:36 AM, Todd wrote:
>>
>>> Hi,
>>> We have a requirement to do count(distinct) in a processing batch
>>> against all the streaming data (e.g., last 24 hours' data), that is, when we
>>> do count(distinct), we actually want to compute dis
ere the cluster manager is running.
>>>>>>>>>
>>>>>>>>> The Driver node runs on the same host that the cluster manager is
>>>>>>>>> running. The Driver requests the Cluster Manager for resources to run
>>>>>>>>> tasks. The worker is tasked to create the executor (in this case
>>>>>>>>> there is
>>>>>>>>> only one executor) for the Driver. The Executor runs tasks for the
>>>>>>>>> Driver.
>>>>>>>>> Only one executor can be allocated on each worker per application. In
>>>>>>>>> your
>>>>>>>>> case you only have
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The minimum you will need will be 2-4G of RAM and two cores. Well
>>>>>>>>> that is my experience. Yes you can submit more than one spark-submit
>>>>>>>>> (the
>>>>>>>>> driver) but they may queue up behind the running one if there is not
>>>>>>>>> enough
>>>>>>>>> resources.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You pointed out that you will be running few applications in
>>>>>>>>> parallel on the same host. The likelihood is that you are using a VM
>>>>>>>>> machine for this purpose and the best option is to try running the
>>>>>>>>> first
>>>>>>>>> one, Check Web GUI on 4040 to see the progress of this Job. If you
>>>>>>>>> start
>>>>>>>>> the next JVM then assuming it is working, it will be using port 4041
>>>>>>>>> and so
>>>>>>>>> forth.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In actual fact try the command "free" to see how much free memory
>>>>>>>>> you have.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> LinkedIn *
>>>>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 28 May 2016 at 16:42, sujeet jog wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have a question w.r.t production deployment mode of spark,
>>>>>>>>>>
>>>>>>>>>> I have 3 applications which i would like to run independently on
>>>>>>>>>> a single machine, i need to run the drivers in the same machine.
>>>>>>>>>>
>>>>>>>>>> The amount of resources i have is also limited, like 4- 5GB RAM ,
>>>>>>>>>> 3 - 4 cores.
>>>>>>>>>>
>>>>>>>>>> For deployment in standalone mode : i believe i need
>>>>>>>>>>
>>>>>>>>>> 1 Driver JVM, 1 worker node ( 1 executor )
>>>>>>>>>> 1 Driver JVM, 1 worker node ( 1 executor )
>>>>>>>>>> 1 Driver JVM, 1 worker node ( 1 executor )
>>>>>>>>>>
>>>>>>>>>> The issue here is i will require 6 JVM running in parallel, for
>>>>>>>>>> which i do not have sufficient CPU/MEM resources,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hence i was looking more towards a local mode deployment mode,
>>>>>>>>>> would like to know if anybody is using local mode where Driver +
>>>>>>>>>> Executor
>>>>>>>>>> run in a single JVM in production mode.
>>>>>>>>>>
>>>>>>>>>> Are there any inherent issues upfront using local mode for
>>>>>>>>>> production base systems.?..
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
*Chris Fregly*
Research Scientist @ Flux Capacitor AI
"Bringing AI Back to the Future!"
San Francisco, CA
http://fluxcapacitor.ai
>
> *Abhishek Kumar*
>
>
>
> This message (including any attachments) contains confidential information
> intended for a specific individual and purpose, and is protected by law. If
> you are not the intended recipient, you should delete this message and any
> disclos
0.0 in stage 0.0
>>>>>> (TID 0, localhost): java.lang.Error: Multiple ES-Hadoop versions detected
>>>>>> in the classpath; please use only one
>>>>>>
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar
>>>>>>
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar
>>>>>>
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar
>>>>>>
>>>>>> at org.elasticsearch.hadoop.util.Version.(Version.java:73)
>>>>>> at
>>>>>> org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
>>>>>> at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
>>>>>> at
>>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>> at
>>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>>> at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>> .. still tracking this down but was wondering if there is someting
>>>>>> obvious I'm dong wrong. I'm going to take out
>>>>>> elasticsearch-hadoop-2.3.2.jar and try again.
>>>>>>
>>>>>> Lots of trial and error here :-/
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>>> --
>>>>>>
>>>>>> We’re hiring if you know of any awesome Java Devops or Linux
>>>>>> Operations Engineers!
>>>>>>
>>>>>> Founder/CEO Spinn3r.com
>>>>>> Location: *San Francisco, CA*
>>>>>> blog: http://burtonator.wordpress.com
>>>>>> … or check out my Google+ profile
>>>>>> <https://plus.google.com/102718274791889610666/posts>
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>>>> Engineers!
>>>>
>>>> Founder/CEO Spinn3r.com
>>>> Location: *San Francisco, CA*
>>>> blog: http://burtonator.wordpress.com
>>>> … or check out my Google+ profile
>>>> <https://plus.google.com/102718274791889610666/posts>
>>>>
>>>>
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
--
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
http://pipeline.io
=spark+mllib
>>>>>
>>>>>
>>>>> https://www.amazon.com/Spark-Practical-Machine-Learning-Chinese/dp/7302420424/ref=sr_1_3?ie=UTF8&qid=1465657706&sr=8-3&keywords=spark+mllib
>>>>>
>>>>>
>>>>> https:
re are some doc regarding using kafka directly
>> from the
>> > reader.stream?
>> > Has it been integrated already (I mean the source)?
>> >
>> > Sorry if the answer is RTFM (but then I'd appreciate a pointer anyway^^)
>> >
>> > thanks,
>> > cheers
>> > andy
>> > --
>> > andy
>>
> --
> andy
>
--
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
http://pipeline.io
S) or coordinated distribution of the models. But I
>> wanted
>> > to know if there is any infrastructure in Spark that specifically
>> addresses
>> > such need.
>> >
>> > Thanks.
>> >
>> > Cheers,
>> >
>> > P.S
column of
unknown type? Or is my only alternative here to reduce this to BinaryType
and use whatever encoding/data structures I want under the covers there and
in subsequent UDFs?
Thanks,
Chris
2.345] or
{"name":"Chris","value":123}. Given the Spark SQL constraints that
ArrayType and MapType need explicit and consistent element types, I don't
see any way to support this in the current type system short of falling
back to binary data.
Open to other sugges
On 8/8/16, 2:03 AM, "matthias.du...@fiduciagad.de"
wrote:
>Hello,
>
>I write to you because I am not really sure whether I did everything right
>when registering and subscribing to the spark user list.
>
>I posted the appended question to Spark User list after subscribing and
>receiving t
>times:
>
>https://www.mail-archive.com/search?l=user%40spark.apache.org&q=dueckm&submit.x=0&submit.y=0
>
>On Mon, Aug 8, 2016 at 3:03 PM, Chris Mattmann wrote:
>>
>>
>>
>>
>>
>> On 8/8/16, 2:03 AM, "matthias.du...@fiduciagad.de&
enkins builds to figure this out as i'll likely get
it wrong.
please provide the relevant snapshot/preview/nightly/whatever repos (or
equivalent) that we need to include in our builds to have access to the
absolute latest build assets for every major and minor release.
thanks!
-chris
On Tue,
nditions placed on
>> the package. If you find that the general public are downloading such test
>> packages, then remove them.
>>
>
> On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly wrote:
>
>> this is a valid question. there are many people building products and
>&
Great work Luciano!
On 8/15/16, 2:19 PM, "Luciano Resende" wrote:
The Apache Bahir PMC is pleased to announce the release of Apache Bahir
2.0.0 which is our first major release and provides the following
extensions for Apache Spark 2.0.0 :
Akka Streaming
MQTT Streamin
separating out your code into separate streaming jobs - especially when there
are no dependencies between the jobs - is almost always the best route. it's
easier to combine atoms (fusion) than to split them (fission).
I recommend splitting out jobs along batch window, stream window, and
state-tr
hey Eran, I run into this all the time with Json.
the problem is likely that your Json is "too pretty" and extending beyond a
single line which trips up the Json reader.
my solution is usually to de-pretty the Json - either manually or through an
ETL step - by stripping all white space before p
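A sketch of that de-prettying step (the path is hypothetical; assumes one JSON
document per file and the spark-shell sc/sqlContext):

// Read each file whole, collapse it onto a single line, then hand the result to
// the JSON reader, which expects one document per line.
val oneDocPerLine = sc.wholeTextFiles("/data/pretty/*.json")
  .map { case (_, text) => text.replaceAll("\\s*\\n\\s*", " ") }

val df = sqlContext.read.json(oneDocPerLine)
df.printSchema()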
how does Spark SQL/DataFrame know that train_users_2.csv has a field named
"id" or anything else domain specific? is there a header? if so, does
sc.textFile() know about this header?
I'd suggest using the Databricks spark-csv package for reading csv data. there
is an option in there to spec
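Something along these lines (a sketch; assumes the spark-csv package is on the
classpath and the spark-shell sqlContext, with the file path taken from the
question):

// The header option is what maps the first line of the CSV to column names
// such as "id"; inferSchema optionally infers the column types as well.
val users = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("train_users_2.csv")
users.printSchema()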
hopping on a plane, but check the hive-site.xml that's in your spark/conf
directory (or should be, anyway). I believe you can change the root path thru
this mechanism.
if not, this should give you more info to google on.
let me know as this comes up a fair amount.
> On Dec 19, 2015, at 4:58 PM,
this type of broadcast should be handled by Spark SQL/DataFrames automatically.
this is the primary cost-based, physical-plan query optimization that the Spark
SQL Catalyst optimizer supports.
in Spark 1.5 and before, you can trigger this optimization by properly setting
the spark.sql.autobroad
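A hedged sketch of both knobs (table names and paths are made up; assumes the
spark-shell sqlContext):

import org.apache.spark.sql.functions.broadcast

// Raise spark.sql.autoBroadcastJoinThreshold so small dimension tables are
// broadcast automatically, or force it per join with the broadcast() hint.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

val facts  = sqlContext.read.parquet("/data/facts")
val dims   = sqlContext.read.parquet("/data/dims")
val joined = facts.join(broadcast(dims), "dim_id")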
M, Jakob Odersky
>>>> wrote:
>>>>
>>>>> It might be a good idea to see how many files are open and try
>>>>> increasing the open file limit (this is done on an os level). In some
>>>>> application use-cases it is actually a legitim
our PostgreSQL server, I get the following
>>>> error.
>>>>
>>>> Error: java.sql.SQLException: No suitable driver found for
>>>> jdbc:postgresql:///
>>>> (state=,code=0)
>>>>
>>>> Can someone help me understand why this is?
>>>>
>>>&g
> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
connect jar file.
>>>> >
>>>> > I've tried:
>>>> > • Using different versions of mysql-connector-java in my
>>>> build.sbt file
>>>> > • Copying the connector jar to my_project/src/main/lib
>>>> > • Copying the connector jar to my_project/lib <-- (this is
>>>> where I keep my build.sbt)
>>>> > Everything loads fine and works, except my call that does
>>>> "sqlContext.load("jdbc", myOptions)". I know this is a total newbie
>>>> question but in my defense, I'm fairly new to Scala, and this is my first
>>>> go at deploying a fat jar with sbt-assembly.
>>>> >
>>>> > Thanks for any advice!
>>>> >
>>>> > --
>>>> > David Yerrington
>>>> > yerrington.net
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>>
>> --
>> David Yerrington
>> yerrington.net
>>
>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
apache-spark-user-list.1001560.n3.nabble.com/Tips-for-Spark-s-Random-Forest-slow-performance-tp25766.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-un
;
> (fields:
> Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
> cannot be applied to (org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField)
>val customSchema = StructType( StructField("year", IntegerType,
> true), StructField("make", StringType, true) ,StructField("model",
> StringType, true) , StructField("comment", StringType, true) ,
> StructField("blank", StringType, true),StructField("blank", StringType,
> true))
> ^
>Would really appreciate if somebody share the example which works with
> Spark 1.4 or Spark 1.5.0
>
> Thanks,
> Divya
>
> ^
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
gt;>>
>>>> Hi,
>>>> I know it is a wide question but can you think of reasons why a pyspark
>>>> job which runs from server 1 using user 1 will run faster than the same
>>>> job when running on server 2 with user 1
>>>> Eran
>>>>
>>>
>>>
>>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
; PFB the code snippet
>
> val lr = new LinearRegression()
> lr.setMaxIter(10)
> .setRegParam(0.01)
> .setFitIntercept(true)
> val model= lr.fit(test)
> val estimates = model.summary
>
>
> --
> Thanks and Regards
> Arun
>
--
>>> "com.epam.parso.spark.ds.DefaultSource");
>>>>>> df.cache();
>>>>>> df.printSchema(); <-- prints the schema perfectly fine!
>>>>>>
>>>>>> df.show(); <-- Works perfectly fine (shows table
>>>>>> with 20 lines)!
>>>>>> df.registerTempTable("table");
>>>>>> df.select("select * from table limit 5").show(); <-- gives weird
>>>>>> exception
>>>>>>
>>>>>> Exception is:
>>>>>>
>>>>>> AnalysisException: cannot resolve 'select * from table limit 5' given
>>>>>> input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS
>>>>>>
>>>>>> I can do a collect on a dataframe, but cannot select any specific
>>>>>> columns either "select * from table" or "select VER, CREATED from table".
>>>>>>
>>>>>> I use spark 1.5.2.
>>>>>> The same code perfectly works through Zeppelin 0.5.5.
>>>>>>
>>>>>> Thanks.
>>>>>> --
>>>>>> Be well!
>>>>>> Jean Morozov
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
97
On Fri, Dec 25, 2015 at 2:17 PM, Chris Fregly wrote:
> I assume by "The same code perfectly works through Zeppelin 0.5.5" that
> you're using the %sql interpreter with your regular SQL SELECT statement,
> correct?
>
> If so, the Zeppelin interpreter is conver
ust say we didn't have this problem in the old mllib API so
> it might be something in the new ml that I'm missing.
> I will dig deeper into the problem after holidays.
>
> 2015-12-25 16:26 GMT+01:00 Chris Fregly :
> > so it looks like you're increasing num tr
which version of spark is this?
is there any chance that a single key - or set of keys - has a large number
of values relative to the other keys (aka skew)?
if so, spark 1.5 *should* fix this issue with the new tungsten stuff, although
I had some issues still with 1.5.1 in a similar situati
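a quick way to check for that kind of skew (a sketch, not from the thread):

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Count records per key and return the heaviest ones; one key holding most of
// the data explains a single straggler task even with the Tungsten improvements.
def heaviestKeys[K: ClassTag, V: ClassTag](kv: RDD[(K, V)], n: Int = 10): Array[(K, Long)] =
  kv.mapValues(_ => 1L)
    .reduceByKey(_ + _)
    .top(n)(Ordering.by(_._2))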
; expect it in the next few days. You will probably want to use the new API
>> once it's available.
>>
>>
>> On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot
>> wrote:
>>
>>> Hi,
>>> I am new bee to spark and a bit confused about RDDs and
results.col("labelIndex"),
>>
>>
>> results.col("prediction"),
>>
>>
>> results.col("words"));
>>
>> exploreDF.show(10);
>>
>>
>>
>> Yes I realize it's strange to switch styles, however this should not cause
>> memory problems
>>
>>
>> final String exploreTable = "exploreTable";
>>
>> exploreDF.registerTempTable(exploreTable);
>>
>> String fmt = "SELECT * FROM %s where binomialLabel = 'signal'";
>>
>> String stmt = String.format(fmt, exploreTable);
>>
>>
>> DataFrame subsetToSave = sqlContext.sql(stmt);// .show(100);
>>
>>
>> name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144
>>
>>
>> exploreDF.unpersist(true); does not resolve memory issue
>>
>>
>>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
on quick glance, it appears that you're calling collect() in there, which is
bringing a huge amount of data down to the single Driver. this is why,
when you allocated more memory to the Driver, a different error emerges - most
definitely related to stop-the-world GC causing the node to beco
ator.populateAndCacheStripeDetails(OrcInputFormat.java:927)
>> at
>> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:836)
>> at
>> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:702)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> I will be glad for any help on that matter.
>>
>> Regards
>> Dawid Wysakowicz
>>
>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
to the documentation. Has anyone used
> this approach yet and if so what has your experience been with using it? If
> it helps we’d be looking to implement it using Scala. Secondly, in general
> what has people's experience been with using experimental features in Spark?
>
>
>
&
sisted
>> to memory-only. I want to be able to run a count (actually
>> "countApproxDistinct") after filtering by an, at compile time, unknown
>> (specified by query) value.
>> >
>> > I've experimented with using (abusing) Spark Streaming, by str
@Jim-
I'm wondering if those docs are outdated, as it's my understanding (please
correct me if I'm wrong) that we should never be seeing OOMs: 1.5/Tungsten not
only improved (reduced) the memory footprint of our data, but also introduced
better task level - and even key level - external spilling
are the credentials visible from each Worker node to all the Executor JVMs on
each Worker?
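one common way to make them visible everywhere (a sketch; the env-var names are
my assumption) is to put the keys on the Hadoop configuration that ships with
the job rather than only on driver-side system properties:

// Assumes the usual sc/sqlContext from spark-shell; bucket and path are made up.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val df = sqlContext.read.parquet("s3a://my-bucket/some/path")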
> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev
> wrote:
>
> Dear Spark community,
>
> I faced the following issue with trying accessing data on S3a, my code is the
> following:
>
> val sparkCo
- and at some point, you will want to autoscale.
On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev <
kudryavtsev.konstan...@gmail.com> wrote:
> Chris,
>
> good question, as you can see from the code I set them up on the driver, so I
> expect they will be propagated t
ride
>
> public Function1, List> createTransformFunc() {
>
> //
> http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
>
> Function1, List> f = new
> AbstractFunction1, List>() {
>
> public