If you take the time to actually learn Scala starting from its fundamental
concepts AND, quite importantly, get familiar with general functional
programming concepts, you'd immediately realize the things that you'd
really miss going back to Java (8).
On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła wr
So I have a very simple dataframe that looks like
df: [name: String, Place: String, time: timestamp]
I build this java.sql.Timestamp from a string and it works really well
except when I call saveAsTable("tableName") on this df. Without the
timestamp, it saves fine but with the timestamp, it th
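For reference, a minimal sketch of what I'm doing (the values, format, and row contents here are made up):
import java.sql.Timestamp
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// build the timestamp from a string ("yyyy-MM-dd HH:mm:ss" is just the format used here)
val ts = Timestamp.valueOf("2015-07-17 08:14:00")

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("Place", StringType),
  StructField("time", TimestampType)))

val rows = sc.parallelize(Seq(Row("alice", "somewhere", ts)))
val df = sqlContext.createDataFrame(rows, schema)

df.saveAsTable("tableName")   // this is the call that fails when the timestamp column is present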
Hi all,
our scenario is to generate lots of folders containing parquet files and
then use "add partition" to add these folder locations to a hive table;
when trying to read the hive table using Spark,
the following logs would show up, and reading them took a lot of time;
but this won't happen afte
Hi all,
I submit a spark-streaming job in yarn-cluster mode but an error happens
15/07/16 14:36:36 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with
error: org.jboss.netty.channel.ChannelException: Failed to bind to:
sparktest/192.168.1.17:3
15/07/16 14:36:36 ERROR executor.Ex
IMHO only Scala is an option. Once you're familiar with it you just can't
even look at Java code.
On Thu, Jul 16, 2015 at 07:20, spark user
wrote:
> I struggle lots in Scala, almost 10 days no improvement, but when I
> switch to Java 8, things are so smooth, and I used Data Frame with
Hi Ted,
Thanks for the information. The post seems a little different from my
requirement: suppose we defined different functions to do different
streaming work (e.g. 50 functions); I want to test these 50 functions in
the spark shell, and the shell will always throw OOM in the middle of the test
(yes,
Hi TD,
Yes, we do have the invertible function provided. However, I am not sure I
understood how to use the filterFunction. Is there an example somewhere
showing its usage?
The header comment on the function says :
* @param filterFunc function to filter expired key-value pairs;
*
This is because you did not set the parameter
"spark.sql.hive.metastore.version".
You can check other parameters that you have set; those will work fine.
Or you can first set this parameter, and then get it.
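For example:
sqlContext.setConf("spark.sql.hive.metastore.version", "0.13.1")
sqlContext.getConf("spark.sql.hive.metastore.version")   // now returns the value you set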
2015-07-17 11:53 GMT+08:00 RajG :
> I am using this version of Spark : *spark-1.4.0-bi
See this recent thread:
http://search-hadoop.com/m/q3RTtFW7iMDkrj61/Spark+shell+oom+&subj=java+lang+OutOfMemoryError+PermGen+space
> On Jul 16, 2015, at 8:51 PM, Terry Hole wrote:
>
> Hi,
>
> Background: The spark shell will get an out of memory error after doing lots
> of Spark work.
>
>
I am using this version of Spark: *spark-1.4.0-bin-hadoop2.6*. I want to
check a few default properties, so I gave the following statement in
spark-shell:
scala> sqlContext.getConf("spark.sql.hive.metastore.version")
I was expecting the call to the method getConf to return a value of 0.13.1 as
desrib
Hi,
Background: The spark shell will get an out of memory error after doing lots
of Spark work.
Is there any method which can reset the spark shell to the startup status?
I tried "*:reset*", but it does not seem to work: I cannot create a Spark
context anymore (some compile error as below) after the "*
No problem:) Glad to hear that!
On Thu, Jul 16, 2015 at 8:22 PM, Koert Kuipers wrote:
> that solved it, thanks!
>
> On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers wrote:
>
>> thanks i will try 1.4.1
>>
>> On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai wrote:
>>
>>> Hi Koert,
>>>
>>> For the classlo
that solved it, thanks!
On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers wrote:
> thanks i will try 1.4.1
>
> On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai wrote:
>
>> Hi Koert,
>>
>> For the classloader issue, you probably hit
>> https://issues.apache.org/jira/browse/SPARK-8365, which has been fixed
Hi,
I am using the Spark thrift server.
In my deployment, I need more memory for the driver, to be able to get
results back from the executors.
Currently a lot of the driver memory is spent on caching, but I would prefer
that the driver not use memory for that (only the executors).
Is tha
The same issue (a custom udf jar added through 'add jar' is not
recognized) is observed on Spark 1.4.1.
Instead of executing,
beeline>add jar udf.jar
My workaround is either
1) to pass the udf.jar by using "--jars" while starting ThriftServer
(This didn't work in AWS EMR's Spark 1.4.0.b).
or
2)
By default it is in ${SPARK_HOME}/work/${APP_ID}/${EXECUTOR_ID}
On Thu, Jul 16, 2015 at 3:43 PM, Tao Lu wrote:
> Hi, Guys,
>
> Where can I find the console log file of CoarseGrainedExecutorBackend
> process?
>
> Thanks!
>
> Tao
>
>
--
Best Regards
Jeff Zhang
Hi, Guys,
Where can I find the console log file of CoarseGrainedExecutorBackend
process?
Thanks!
Tao
thanks i will try 1.4.1
On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai wrote:
> Hi Koert,
>
> For the classloader issue, you probably hit
> https://issues.apache.org/jira/browse/SPARK-8365, which has been fixed in
> Spark 1.4.1. Can you try 1.4.1 and see if the exception disappear?
>
> Thanks,
>
> Yi
I have tested on another pc which has 8 CPU cores.
But it hangs when defaultParallelismLevel > 4, e.g.
sparkConf.setMaster("local[*]")
local[1] ~ local[3] work well.
4 is the mysterious boundary.
It seems that I am not the only one who has encountered this problem:
https://issues.apache.org/jira/browse/S
Moscow : http://www.meetup.com/Apache-Spark-in-Moscow/
Slovenija (Ljubljana) http://www.meetup.com/Apache-Spark-Ljubljana-Meetup/
Yes, that's most of the work, just getting the native libs into the
assembly. netlib can find them from there even if you don't have BLAS
libs on your OS, since it includes a reference implementation as a
fallback.
One common reason it won't load is not having libgfortran installed on
your OSes th
Hi Koert,
For the classloader issue, you probably hit
https://issues.apache.org/jira/browse/SPARK-8365, which has been fixed in
Spark 1.4.1. Can you try 1.4.1 and see if the exception disappear?
Thanks,
Yin
On Thu, Jul 16, 2015 at 2:12 PM, Koert Kuipers wrote:
> i am using scala 2.11
>
> spar
Hello all,
I am running the Spark recommendation algorithm in MLlib and I have been
studying its output with various model configurations. Ideally I would like to
be able to run one job that trains the recommendation model with many different
configurations to try to optimize for performance.
The snippet at the end worked for me. We run Spark 1.3.x, so
DataFrame.drop is not available to us.
As pointed out by Yana, DataFrame operations typically return a new
DataFrame, so use it as such:
import com.foo.sparkstuff.DataFrameOps._
...
val df = ...
val prunedDf = df.dropColumns("one_col",
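For anyone curious, the helper behind that import is roughly of this shape (my reconstruction; only the import path and the dropColumns name come from the snippet above, the body is a guess):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object DataFrameOps {
  // adds a dropColumns helper to DataFrame (useful pre-1.4, where DataFrame.drop is missing)
  implicit class RichDataFrame(df: DataFrame) {
    def dropColumns(names: String*): DataFrame =
      df.select(df.columns.filterNot(names.contains).map(col): _*)
  }
}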
I am using Scala 2.11.
The Spark jars are not in my assembly jar (they are "provided"), since I launch
with spark-submit.
On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers wrote:
> spark 1.4.0
>
> spark-csv is a normal dependency of my project and in the assembly jar
> that i use
>
> but i also tried ad
Spark 1.4.0.
spark-csv is a normal dependency of my project and is in the assembly jar that
I use,
but I also tried adding spark-csv with --packages for spark-submit, and got
the same error.
On Thu, Jul 16, 2015 at 4:31 PM, Yin Huai wrote:
> We do this in SparkILoop (
> https://github.com/apache/spa
We do this in SparkILoop (
https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037).
What is the version of Spark you are using? How did you add the spark-csv
jar?
On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers wrote:
> has a
Has anyone tried to create a HiveContext only if the class is available?
I tried this:
implicit lazy val sqlc: SQLContext = try {
Class.forName("org.apache.spark.sql.hive.HiveContext", true,
Thread.currentThread.getContextClassLoader)
.getConstructor(classOf[SparkContext]).newInstance(sc).asInst
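For reference, the full shape of what I'm attempting is roughly this (untested sketch; it falls back to a plain SQLContext when the Hive classes are missing):
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// reflective construction so there is no compile-time dependency on spark-hive
def makeSqlContext(sc: SparkContext): SQLContext =
  try {
    Class.forName("org.apache.spark.sql.hive.HiveContext", true,
        Thread.currentThread.getContextClassLoader)
      .getConstructor(classOf[SparkContext])
      .newInstance(sc)
      .asInstanceOf[SQLContext]
  } catch {
    case _: ClassNotFoundException | _: NoClassDefFoundError =>
      new SQLContext(sc)   // Hive classes not on the classpath: fall back
  }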
See also https://issues.apache.org/jira/browse/SPARK-8385
(apologies if someone already mentioned that -- just saw this thread)
On Thu, Jul 16, 2015 at 7:19 PM, Jerrick Hoang wrote:
> So, this has to do with the fact that 1.4 has a new way to interact with
> HiveMetastore, still investigating. W
So, this has to do with the fact that 1.4 has a new way of interacting with
the HiveMetastore; still investigating. Would really appreciate it if anybody
has any insights :)
On Tue, Jul 14, 2015 at 4:28 PM, Jerrick Hoang
wrote:
> Hi all,
>
> I'm upgrading from spark1.3 to spark1.4 and when trying to run s
Instead of using that RDD operation just use the native DataFrame function
approxCountDistinct
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
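Something along these lines (the column names are made up):
import org.apache.spark.sql.functions.approxCountDistinct

// per-key approximate distinct count, analogous to countApproxDistinctByKey on a pair RDD
val result = df.groupBy("key").agg(approxCountDistinct("value"))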
On Thu, Jul 16, 2015 at 6:58 AM, Yana Kadiyska
wrote:
> Hi, could someone point me to the recommended way of u
Thanks for reporting this, could you file a JIRA for it?
On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra wrote:
> Hi all,
>
> I am having some troubles when using a custom udf in dataframes with pyspark
> 1.4.
>
> I have rewritten the udf to simplify the problem and it gets even weirder.
> The udfs
Not exactly the same issue, but possibly related:
https://issues.apache.org/jira/browse/KAFKA-1196
On Thu, Jul 16, 2015 at 12:03 PM, Cody Koeninger wrote:
> Well, working backwards down the stack trace...
>
> at java.nio.Buffer.limit(Buffer.java:275)
>
> That exception gets thrown if the limit
Well, working backwards down the stack trace...
at java.nio.Buffer.limit(Buffer.java:275)
That exception gets thrown if the limit is negative or greater than
the buffer's capacity
at kafka.message.Message.sliceDelimited(Message.scala:236)
If size had been negative, it would have just returned
Hello,
When I'm reprocessing the data from Kafka (about 40 GB) with the new
Spark Streaming Kafka method createDirectStream, everything is fine
until a driver error happens (the driver is killed, connection lost...).
When the driver comes up again, it resumes the processing from the
checkpoint in HDFS.
Hi All ,
How can I broadcast a data change to all the executors every 10 min or
1 min?
Ashish
Have you tried to examine what clean_cols contains -- I'm suspicious of this
part: mkString(", ").
Try this:
val clean_cols: Seq[String] = df.columns...
If you get a type error you need to work on clean_cols (I suspect yours is
of type String at the moment and presents itself to Spark as a single
col
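i.e. roughly this (untested):
import org.apache.spark.sql.functions.col

// keep the names as a Seq[String] -- don't mkString them into one comma-separated string
val clean_cols: Seq[String] = df.columns.filterNot(_.startsWith("STATE_"))
val prunedDf = df.select(clean_cols.map(col): _*)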
Depending on what you do with them, they will get computed separately,
because you may have a long DAG in each branch. Spark tries to run all the
transformation functions together rather than trying to optimize things
across branches.
On Jul 16, 2015 1:40 PM, "Bin Wang" wrote:
> What if I would use bo
Cheng,
Yes, "select * from temp_table" was working. I was able to perform some
transformation+action on the dataframe and print it on console.
HiveThriftServer2.startWithContext was being run on the same session.
When you say "try --jars option", are you asking me to pass spark-csv jar?
I'm alrea
Hi all,
I am having some trouble when using a custom udf in dataframes with
pyspark 1.4.
I have rewritten the udf to simplify the problem and it gets even weirder.
The udfs I am using do absolutely nothing, they just receive some value and
output the same value with the same format.
I show you
Hi folks,
We have lots of Spark enthusiasts, and several organizations have held talk
events in Tokyo, Japan.
Now we're going to unify those events and have created our home page on
meetup.com.
http://www.meetup.com/Tokyo-Spark-Meetup/
Could you add this to the list?
Thanks.
- Kousuke Saruta
-
You ask an interesting question…
Let's set aside Spark and look at the overall ingestion pattern.
It's really an ingestion pattern where your input into the system comes from a
queue.
Are the events discrete or continuous? (This is kinda important.)
If the events are continuous then more than
Hi,
In a hundred-column dataframe, I wish to either select all of them except a
few, or drop the ones I don't want.
I am failing at such a simple task; I tried two ways:
val clean_cols = df.columns.filterNot(col_name =>
col_name.startWith("STATE_").mkString(", ")
df.select(clean_cols)
But this thro
Ok…
After having some off-line exchanges with Shashidhar Rao, I came up with an idea…
Apply machine learning to either implement or improve autoscaling up or down
within a Storm/Akka cluster.
While I don’t know what constitutes an acceptable PhD thesis, or senior project
for undergrads… this is
Thanks!
On Thu, Jul 16, 2015 at 1:59 PM Vetle Leinonen-Roeim
wrote:
> By the way - if you're going this route, see
> https://github.com/datastax/spark-cassandra-connector
>
> On Thu, Jul 16, 2015 at 2:40 PM Vetle Leinonen-Roeim
> wrote:
>
>> You'll probably have to install it separately.
>>
>>
Hi, could someone point me to the recommended way of using
countApproxDistinctByKey with DataFrames?
I know I can map to a pair RDD, but I'm wondering if there is a simpler
method? If someone knows whether this operation is expressible in SQL, that
information would be most appreciated as well.
Yes. Hive UDFs and distribute by are both supported by Spark SQL.
If you are using Spark 1.4, you can try the Hive analytics window functions
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics),
most of which are already supported in Spark 1.4, so you don't need the
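For example, something along these lines should work in 1.4, assuming sqlContext is a HiveContext (table and column names are made up):
// rank rows within each department by salary, highest first
val ranked = sqlContext.sql("""
  SELECT name, dept, salary,
         rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
  FROM employees
""")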
Did you take a look at the excellent write up by Yin Huai and Michael
Armbrust? It appears that rank is supported in the 1.4.x release.
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Snippet from above article for your convenience:
To answer the first ques
Hello,
I use Apache GraphX (version 1.1.0). Sometimes the stage which corresponds to
this line of code:
val graph = GraphLoader.edgeListFile(...)
takes too much time. Looking at the EVENT_LOG_1 file I found out that for some
tasks of this stage the 'Executor Deserialize Time' was too big (like ~102
By the way - if you're going this route, see
https://github.com/datastax/spark-cassandra-connector
On Thu, Jul 16, 2015 at 2:40 PM Vetle Leinonen-Roeim
wrote:
> You'll probably have to install it separately.
>
> On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker wrote:
>
>> Hi Vetle,
>>
>> IndexedRDD i
You'll probably have to install it separately.
On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker wrote:
> Hi Vetle,
>
> IndexedRDD is persisted in the same way RDDs are as far as I am aware. Are
> you aware if Cassandra can be built into my application or has to be a
> stand alone database which is ins
Hi Vetle,
IndexedRDD is persisted in the same way RDDs are, as far as I am aware. Do
you know if Cassandra can be built into my application, or does it have to be
a stand-alone database which is installed separately?
Thanks,
Jem
On Thu, Jul 16, 2015 at 12:59 PM Vetle Leinonen-Roeim
wrote:
> Hi,
>
> No
Does spark HiveContext support the rank() ... distribute by syntax (as in
the following article-
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive
)?
If not, how can it be achieved?
Thanks,
Lior
Hi,
Not sure how IndexedRDD is persisted, but perhaps you're better off using a
NoSQL database for lookups (perhaps Cassandra, with the Cassandra
connector)? That should give you good performance on lookups, but
persisting those billion records sounds like something that will take some
time
Hi Lian,
Thank you for the tip. Indeed, there were a lot of distinct values in my
result set (approximately 3000). As you suggested, I decided to partition
the data first on a column with much smaller cardinality.
Thanks
n
On Thu, Jul 16, 2015 at 2:09 PM, Cheng Lian wrote:
> Hi Nikos,
>
> Ho
Hi,
This is an ugly solution because it requires pulling out a row:
val rdd: RDD[Row] = ...
ctx.createDataFrame(rdd, rdd.first().schema)
Is there a better alternative to get a DataFrame from an RDD[Row], since
toDF won't work as Row is not a Product?
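The only alternative I can think of is spelling the schema out by hand, e.g. (column names made up):
import org.apache.spark.sql.types._

// spell out the schema explicitly instead of peeking at the first row
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

val df = ctx.createDataFrame(rdd, schema)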
Thanks,
Marius
Hi Nikos,
How many columns and distinct values of "some_column" are there in the
DataFrame? Parquet writer is known to be very memory consuming for wide
tables. And lots of distinct partition column values result in many
concurrent Parquet writers. One possible workaround is to first
repartit
Hi. I am trying to use Apache Spark in a RESTful web service in which I am
trying to query data from Hive tables using Spark SQL. This is my
Java class:
SparkConf sparkConf = new
SparkConf().setAppName("Hive").setMaster("local").setSparkHome("Path");
JavaSparkContext
Hi forum,
I am currently using Spark 1.4.0, and started using the ML pipeline
framework. I ran the example program
"ml.JavaSimpleTextClassificationPipeline", which uses LogisticRegression.
But I wanted to do multiclass classification, so I used the
DecisionTreeClassifier present in org.apache.spark
Given the following code which just reads from s3, then saves files to s3
val inputFileName: String = "s3n://input/file/path"
val outputFileName: String = "s3n://output/file/path"
val conf = new
SparkConf().setAppName(this.getClass.getName).setMaster("local[4]")
val
Hi Eugene,
thanks for your response!
Your recommendation makes sense, that's what I more or less tried.
The problem that I am facing is that inside foreachPartition() I cannot
create a new rdd and use saveAsTextFile.
It would probably make sense to write directly to HDFS using the Java API.
When I
Hi all,
I'm using spark-streaming with Spark. I configured Flume like this:
a1.channels = c1
a1.sinks = k1
a1.sources = r1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 3
a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1
Make sure you provide the filterFunction with the invertible
reduceByKeyAndWindow. Otherwise none of the keys will get removed, and the
key space will continue to increase. This is what is leading to the lag. So
use the filtering function to filter out the keys that are not needed any
more.
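Something like this (the pair type, durations, and expiry test below are just placeholders):
// assuming `pairs` is a DStream[(String, Int)] of per-key counts
import org.apache.spark.streaming.{Minutes, Seconds}

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,          // reduce: add values entering the window
  (a: Int, b: Int) => a - b,          // inverse reduce: subtract values leaving the window
  Minutes(30),                        // window duration
  Seconds(10),                        // slide duration
  numPartitions = 4,
  filterFunc = { case (_, count) => count > 0 })  // drop keys whose count has decayed to 0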
On Thu, J
Thanks Akhil. For doing reduceByKeyAndWindow, one has to have checkpointing
enabled. So, yes we do have it enabled. But not Write Ahead Log because we
don't have a need for recovery and we do not recover the process state on
restart.
I don't know if IO Wait fully explains the increasing processing
What if I use both rdd1 and rdd2 later?
Raghavendra Pandey wrote on Thu, Jul 16, 2015 at 4:08 PM:
> If you cache rdd it will save some operations. But anyway filter is a lazy
> operation. And it runs based on what you will do later on with rdd1 and
> rdd2...
>
> Raghavendra
> On Jul 16, 2015 1:33 PM, "Bin
If you cache the rdd it will save some operations. But anyway, filter is a lazy
operation, and it runs based on what you will do later on with rdd1 and
rdd2...
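e.g. simply (based on your snippet):
val rdd = input.map(_.value)
rdd.cache()               // materialize the map output once, reuse it in both branches
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)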
Raghavendra
On Jul 16, 2015 1:33 PM, "Bin Wang" wrote:
> If I write code like this:
>
> val rdd = input.map(_.value)
> val f1 = rdd.filter(_
If I write code like this:
val rdd = input.map(_.value)
val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)
...
Then the DAG of the execution may be this:
        -> Filter -> ...
Map
        -> Filter -> ...
But the two filters operate on the same RDD, which means it could be
done by
Hello,
I have been using IndexedRDD as a large lookup table (1 billion records) to join
with small tables (1 million rows). The performance of IndexedRDD is great
until it has to be persisted to disk. Are there any alternatives to
IndexedRDD, or any changes to how I use it, to improve performance with big
What is your data volume? Do you have checkpointing/WAL enabled? In that
case, make sure you have SSD disks, as this behavior is mainly due to
IO wait.
Thanks
Best Regards
On Thu, Jul 16, 2015 at 8:43 AM, N B wrote:
> Hello,
>
> We have a Spark streaming application and the problem t
Yes, you can do that; just make sure you rsync the same file to the same
location on every machine.
Thanks
Best Regards
On Thu, Jul 16, 2015 at 5:50 AM, Julien Beaudan
wrote:
> Hi all,
>
> Is it possible to use Spark to assign each machine in a cluster the same
> task, but on files in each machi
Did you try this?
val out = lines.filter(xx => {
  val y = xx
  val x = broadcastVar.value
  var flag: Boolean = false
  for (a <- x) {
    if (y.contains(a))
      flag = true
  }
  flag
})
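A slightly shorter way to write the same check (behaviour should be identical, assuming broadcastVar.value is a collection of strings):
val out = lines.filter(line => broadcastVar.value.exists(a => line.contains(a)))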
Thanks
Best Regards
On Wed, Jul 15, 2015 at 8:10 PM, Naveen Dabas wrote:
>
> I am using the
Which version of Spark are you using? insertIntoJDBC is deprecated (from
1.4.0); you may use write.jdbc() instead.
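Roughly like this (URL, table name, and credentials below are placeholders):
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "dbuser")        // placeholder credentials
props.setProperty("password", "dbpass")

// append the DataFrame's rows to an existing MySQL table
df.write.mode(SaveMode.Append).jdbc("jdbc:mysql://host:3306/mydb", "my_table", props)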
Thanks
Best Regards
On Wed, Jul 15, 2015 at 2:43 PM, Manohar753 wrote:
> Hi All,
>
> I am trying to add a few new rows to an existing table in MySQL using a
> DataFrame. But it is adding ne