Hadoop version 2.7.3
On Tue, Jun 20, 2017 at 11:12 PM, yohann jardin wrote:
> Which version of Hadoop are you running on?
>
> *Yohann Jardin*
> On 6/21/2017 at 1:06 AM, N B wrote:
>
> Ok some more info about this issue to see if someone can shine a light on
> what could be going on. I turned o
https://spark.apache.org/docs/2.1.0/building-spark.html#specifying-the-hadoop-version
Hadoop 2.2.0 is only the default build version; other versions can still be
built. The package you downloaded is prebuilt for Hadoop 2.7, as stated on the
download page, so don't worry.
Yohann Jardin
Which version of Hadoop are you running on?
Yohann Jardin
On 6/21/2017 at 1:06 AM, N B wrote:
Ok some more info about this issue to see if someone can shine a light on what
could be going on. I turned on debug logging for
org.apache.spark.streaming.scheduler in the driver process and this is
If you do an action, most intermediate calculations would be gone for the next
iteration.
What I would do is persist every iteration, then after some (say 5) I would
write to disk and reload. At that point you should call unpersist to free the
memory as it is no longer relevant.
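A minimal sketch of that pattern, assuming the iterative job can be expressed as a function over a DataFrame (the names `step`, `tmpPath` and the 5-iteration interval are illustrative, not the poster's code):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

// Sketch: persist each iteration, and every 5 iterations truncate the lineage
// by writing to disk and reloading, then unpersist data that is no longer needed.
def iterate(spark: SparkSession, initial: DataFrame, step: DataFrame => DataFrame,
            iterations: Int, tmpPath: String): DataFrame = {
  var current = initial
  for (i <- 1 to iterations) {
    val next = step(current).persist(StorageLevel.MEMORY_AND_DISK)
    next.count()                      // action to materialize the persisted data
    current.unpersist()               // previous iteration is no longer relevant

    current = if (i % 5 == 0) {
      val path = s"$tmpPath/iter_$i"
      next.write.mode("overwrite").parquet(path)  // cut the lineage
      next.unpersist()
      spark.read.parquet(path)
    } else {
      next
    }
  }
  current
}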
Thanks,
I had downloaded the prebuilt package labeled "Spark 2.1.1 prebuilt with
Hadoop 2.7 or later" from the direct download link on spark.apache.org.
However, I am seeing compatibility errors running against a deployed HDFS
2.7.3. (See my earlier message about Flume DStream producing 0 records
after H
I have already seen one example where data is generated using Spark, so no reason
to think it's a bad idea as far as I know.
You can check the code here; I'm not very sure, but I think there is something
there which generates data for the TPC-DS benchmark and you can provide how much
data you want in
Unsubscribe
Thanks & Best Regards,
Engr. Palash Gupta
Consultant, OSS/CEM/Big Data
Skype: palash2494
https://www.linkedin.com/in/enggpalashgupta
You should make HBase a data source (it seems we already have an HBase connector?),
create a DataFrame from HBase, and do the join in Spark SQL.
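A minimal sketch of the join part, assuming the HBase table has already been loaded as a DataFrame by whatever connector you use (`hbaseDF`, the view names and the columns are placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: join an HBase-backed DataFrame with the result of the Hive/Carbon query.
// `hbaseDF` stands in for whatever your HBase connector returns; column names
// (row_key, extra_col) are placeholders.
def joinWithHbase(spark: SparkSession, df: DataFrame, hbaseDF: DataFrame): DataFrame = {
  df.createOrReplaceTempView("hive_result")
  hbaseDF.createOrReplaceTempView("hbase_table")
  spark.sql(
    """SELECT h.*, b.extra_col
      |FROM hive_result h
      |JOIN hbase_table b ON h.row_key = b.row_key""".stripMargin)
}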
> On 21 Jun 2017, at 10:17 AM, sunerhan1...@sina.com wrote:
>
> Hello,
> My scenario is like this:
> 1.val df=hivecontext/carboncontex.sql("sql")
>
After investigation, it looks like my Spark 2.1.1 jars got corrupted during
download - all good now... ;)
> On Jun 20, 2017, at 4:14 PM, Jean Georges Perrin wrote:
>
> Hey all,
>
> i was giving a run to 2.1.1 and got an error on one of my test program:
>
> package net.jgp.labs.spark.l000_ing
Ok some more info about this issue to see if someone can shine a light on
what could be going on. I turned on debug logging for
org.apache.spark.streaming.scheduler in the driver process and this is what
gets thrown in the logs and keeps throwing it even after the downed HDFS
node is restarted. Usi
never mind!
I had a space at the end of my data which was not showing up in manual testing.
thanks
From: jeff saremi
Sent: Tuesday, June 20, 2017 2:48:06 PM
To: user@spark.apache.org
Subject: Bizarre diff in behavior between Scala REPL and sparkSQL UDF
I have this function which does regex matching in Scala. When I test it in the
REPL I get the expected results.
When I use it as a UDF in Spark SQL I get completely incorrect results.
Function:
class UrlFilter (filters: Seq[String]) extends Serializable {
val regexFilters = filters.map(new Regex(_))
r
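The rest of the class is cut off in the digest; a plausible completion, purely as a sketch (the `matches` method and its semantics are my assumption, not the poster's original code):

import scala.util.matching.Regex

// Sketch of what the full class might look like; the `matches` method and
// its semantics are assumptions, not the original code.
class UrlFilter(filters: Seq[String]) extends Serializable {
  val regexFilters: Seq[Regex] = filters.map(new Regex(_))

  // True if the URL matches at least one of the configured patterns.
  def matches(url: String): Boolean =
    regexFilters.exists(_.findFirstIn(url).isDefined)
}

As the "never mind" follow-up earlier in this digest notes, the discrepancy turned out to be a trailing space in the data rather than the UDF logic itself.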
It's in the spark-catalyst_2.11-2.1.1.jar since the logical query plans and
optimization also need to know about types.
On Tue, Jun 20, 2017 at 1:14 PM, Jean Georges Perrin wrote:
> Hey all,
>
> i was giving a run to 2.1.1 and got an error on one of my test program:
>
> package net.jgp.labs.spar
Hi,
How do we bootstrap the streaming job with the previous state when we do a
code change and redeploy? We use updateStateByKey to maintain the state and
store session objects and LinkedHashMaps in the checkpoint.
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-list.
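For context, a minimal sketch of the checkpointed updateStateByKey setup being described (the input source, the state type, which is a running count here, and the checkpoint path are placeholders, not the original job):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of a checkpointed updateStateByKey job.
object StatefulJobSketch {

  def createContext(checkpointDir: String): StreamingContext = {
    val conf = new SparkConf().setAppName("stateful-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))
    val updateCount: (Seq[Long], Option[Long]) => Option[Long] =
      (newValues, state) => Some(newValues.sum + state.getOrElse(0L))
    events.updateStateByKey(updateCount).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // placeholder path
    // getOrCreate restores the job from the checkpoint if one exists, which is
    // exactly what becomes problematic when the code behind the checkpointed
    // state changes between deployments.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
    ssc.start()
    ssc.awaitTermination()
  }
}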
Thanks Vadim & Jörn... I will look into those.
jg
> On Jun 20, 2017, at 2:12 PM, Vadim Semenov wrote:
>
> You can launch one permanent spark context and then execute your jobs within
> the context. And since they'll be running in the same context, they can share
> data easily.
>
> These t
Hey all,
i was giving a run to 2.1.1 and got an error on one of my test program:
package net.jgp.labs.spark.l000_ingestion;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org
You can launch one permanent spark context and then execute your jobs
within the context. And since they'll be running in the same context, they
can share data easily.
These two projects provide the functionality that you need:
https://github.com/spark-jobserver/spark-jobserver#persistent-context-
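Stripped of the job-server machinery, the underlying idea is a single long-lived session whose cached tables later jobs can reuse; a minimal sketch (the "jobs", table names and logic are illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: one long-lived session; "jobs" are just functions submitted to it,
// sharing data through cached temp views. All names are illustrative.
object SharedContextSketch {

  // "Program A": produce a result and keep it cached for later jobs.
  def jobA(spark: SparkSession): Unit = {
    val resultA = spark.range(0, 1000000).selectExpr("id", "id * 2 AS doubled")
    resultA.cache().createOrReplaceTempView("result_a")
  }

  // "Program C": a later job in the same context reuses the cached data
  // without going through disk.
  def jobC(spark: SparkSession): DataFrame =
    spark.sql("SELECT sum(doubled) AS total FROM result_a")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shared-context").getOrCreate()
    jobA(spark)
    jobC(spark).show()
    spark.stop()
  }
}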
You could express it all in one program; alternatively, use the Ignite in-memory
file system or the Ignite shared RDD (not sure if DataFrame is supported).
> On 20. Jun 2017, at 19:46, Jean Georges Perrin wrote:
>
> Hey,
>
> Here is my need: program A does something on a set of data and produces
> resu
Hi Assaf,
Thanks for the suggestion on checkpointing - I'll need to read up more on
that.
My current implementation seems to be crashing with a GC memory limit
exceeded error if I'm keeping multiple persist calls for a large number of
files.
Thus, I was also thinking about the constant calls to p
Hey,
Here is my need: program A does something on a set of data and produces
results, program B does that on another set, and finally, program C combines
the data of A and B. Of course, the easy way is to dump all on disk after A and
B are done, but I wanted to avoid this.
I was thinking of c
BTW, this is running on Spark 2.1.1.
I have been trying to debug this issue and what I have found till now is
that it is somehow related to the Spark WAL. The directory named
/receivedBlockMetadata seems to stop getting
written to after the point of an HDFS node being killed and restarted. I
have
Unsubscribe
Sent from my iPhone
And we will be having a webinar on July 27 going into some more details. Stay
tuned.
Cheers
Jules
Sent from my iPhone
Pardon the dumb thumb typos :)
> On Jun 20, 2017, at 7:00 AM, Michael Mior wrote:
>
> It's still in the early stages, but check out Deep Learning Pipelines from
> Databricks
It is fine, but you have to design it so that generated rows are written in large
blocks for optimal performance.
The trickiest part of data generation is the conceptual part, such as the
probabilistic distributions etc.
You also have to check that you use a good random generator; for some cases
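A minimal sketch of generating synthetic data in parallel with Spark (the schema, distributions, sizes and output path are illustrative; the per-partition RNG seeding is the detail worth getting right):

import scala.util.Random
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Sketch: generate synthetic rows in parallel. Each partition seeds its own
// RNG from the partition id, so runs are reproducible and partitions uncorrelated.
object DataGeneratorSketch {
  case class Event(id: Long, value: Double, category: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("data-generator").getOrCreate()
    import spark.implicits._

    val rows = 100000000L   // illustrative size
    val partitions = 200

    val events = spark.range(0, rows, 1, partitions).as[Long]
      .mapPartitions { ids =>
        val rng = new Random(42L + TaskContext.getPartitionId())
        ids.map(id => Event(id, rng.nextGaussian(), s"cat_${rng.nextInt(10)}"))
      }

    // Few large files per partition means large blocks and better downstream reads.
    events.write.mode("overwrite").parquet("hdfs:///tmp/synthetic_events")  // placeholder path
    spark.stop()
  }
}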
Hi
Spark is a data analyzer, but would it be possible to use Spark as a data
generator or simulator?
My simulation can be very large, and I think a parallelized simulation using
Spark (in the cloud) could work.
Is that a good or a bad idea?
Regards
Esa Heikkinen
Hi,
I have seen that Databricks has higher-order functions
(https://docs.databricks.com/_static/notebooks/higher-order-functions.html,
https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html)
which basically allow doing generic ope
It's still in the early stages, but check out Deep Learning Pipelines from
Databricks
https://github.com/databricks/spark-deep-learning
--
Michael Mior
mm...@apache.org
2017-06-20 0:36 GMT-04:00 Gaurav1809 :
> Hi All,
>
> Similar to how we have machine learning library called ML, do we have
> a
Correction.
On Tue, Jun 20, 2017 at 5:27 PM, sujeet jog wrote:
> , Below is the query, looks like from physical plan, the query is same as
> that of cqlsh,
>
> val query = s"""(select * from model_data
> where TimeStamp > \'$timeStamp+\' and TimeStamp <=
> \'$startTS+\'
>
Below is the query; from the physical plan, it looks like the query is the same
as that of cqlsh:
val query = s"""(select * from model_data
where TimeStamp > \'$timeStamp+\' and TimeStamp <=
\'$startTS+\'
and MetricID = $metricID)"""
println("Model query" + query)
val df
Hi,
Personally, I would inspect how dates are managed. What does your Spark code
look like? What does the explain say? Does TimeStamp get parsed the same
way?
Best,
On Tue, Jun 20, 2017 at 12:52 PM, sujeet jog wrote:
> Hello,
>
> I have a table as below
>
> CREATE TABLE analytics_db.ml_forecas
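A quick sketch of the kind of inspection suggested above, assuming `df` is the DataFrame read from the Cassandra table (column names follow the thread; everything else is an assumption):

import java.sql.Timestamp
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: inspect how the timestamp predicate ends up in the query plan.
// `df` is assumed to be the DataFrame read from the table in question.
def inspectTimestampFilter(df: DataFrame, from: Timestamp, metricId: Int): Unit = {
  val filtered = df.filter(col("MetricID") === metricId && col("TimeStamp") > from)
  // explain(true) prints the parsed, analyzed, optimized and physical plans,
  // showing how the timestamp literal is parsed and whether the comparison
  // is pushed down to the source.
  filtered.explain(true)
}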
Hello,
I have a table as below
CREATE TABLE analytics_db.ml_forecast_tbl (
    "MetricID" int,
    "TimeStamp" timestamp,
    "ResourceID" timeuuid,
    "Value" double,
    PRIMARY KEY ("MetricID", "TimeStamp", "ResourceID")
)
select * from ml_forecast_tbl where "MetricID" = 1 and "TimeStamp" >
'20
Unsubscribe
Sent from Yahoo Mail on Android
Hi Edwin,
I have faced a similar issue as well and this behaviour is very abrupt. I
even created a question on StackOverflow but no solution yet.
https://stackoverflow.com/questions/43496205/spark-job-processing-time-increases-to-4s-without-explanation
For us, we sometimes had this constant delay
Hi all,
https://issues.apache.org/jira/browse/SPARK-19680
Is there any method to patch this issue? I met the same problem.
2017-06-20
lk_spark