it fails (possibly because the SparkContext is null?).
How do I address this issue? What needs to be done? Do I need to switch to
a synchronous architecture?
Thanks in advance.
Kind regards,
Marco
I don't have a reproducer, but I would say that it's enough to create one
Future and call the sparkContext from there.
Thanks again for the answers.
Kind regards,
Marco
2016-01-18 19:37 GMT+01:00 Shixiong(Ryan) Zhu :
> Hey Marco,
>
> Since the code in a Future runs asynchronously, you cannot call
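For context, a minimal sketch (not the original reproducer; names and values are made up) of running a Spark action from a Scala Future on the driver. The key point is that the Future uses the driver-side SparkContext, which must still be alive when the Future actually executes:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

import org.apache.spark.{SparkConf, SparkContext}

object AsyncCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("async-count").setMaster("local[2]"))

    // The action runs on another thread, but it still goes through the driver's
    // SparkContext, so the context must not be stopped before the Future completes.
    val count: Future[Long] = Future { sc.parallelize(1 to 1000).count() }

    println(Await.result(count, 1.minute))
    sc.stop() // stop only after all asynchronous work has finished
  }
}
```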
> An RDD is essentially a recipe for
> making data from other data. It is not literally computed by materializing
> every RDD completely. That is, a lot of the "copy" can be optimized away
> too.
I hope it answers your question.
Kind regards,
Marco
2016-01-19 13:14 GMT+01:00 ddav :
> Hi,
>
> Certain
It depends on what you mean by "write access". RDDs are immutable, so
you can't really change them. When you apply a map/filter/groupBy
transformation, you are creating a new RDD starting from the original one, as in the sketch below.
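A minimal sketch of that behaviour, assuming a live SparkContext named sc:

```scala
val original = sc.parallelize(Seq(1, 2, 3, 4))

// map and filter do not modify `original`; each call returns a brand-new RDD
// that only remembers how to derive its data from the parent.
val doubled  = original.map(_ * 2)
val filtered = doubled.filter(_ > 4)

println(original.collect().mkString(","))  // 1,2,3,4 (the source RDD is unchanged)
println(filtered.collect().mkString(","))  // 6,8
```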
Kind regards,
Marco
2016-01-19 13:27 GMT+01:00 Dave :
> Hi Ma
libraryDependencies += "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.6.0"
libraryDependencies += "org.apache.hbase" % "hbase-client" % "0.98.4-hadoop2"
libraryDependencies += "org.apache.hbase" % "hbase-server" % "0.9
I've switched to Maven and all the issues are gone now.
2015-01-23 12:07 GMT+01:00 Sean Owen :
> Use mvn dependency:tree or sbt dependency-tree to print all of the
> dependencies. You are probably bringing in more servlet API libs from
> some other source?
>
> On Fri, Jan 23, 201
cause of this issue?
Thanks,
Marco
<<<<<
15/01/28 10:25:06 INFO spark.SecurityManager: Changing modify acls to: user
15/01/28 10:25:06 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view
permissions: Set(user); users with modify perm
Hi,
I am referring to https://issues.apache.org/jira/browse/SPARK-4925 (Hive
Thriftserver Maven Artifact). Can somebody point me (via a URL) to the artifact
in a public repository? I have not found it on Maven Central.
Thanks,
Marco
Ok, so will it only be available in the next version (1.3.0)?
2015-02-16 15:24 GMT+01:00 Ted Yu :
> I searched for 'spark-hive-thriftserver_2.10' on this page:
> http://mvnrepository.com/artifact/org.apache.spark
>
> Looks like it is not published.
>
> On Mon, Fe
The correct way to do that would be for
combineByKey to call the setKeyOrdering function on the ShuffledRDD that it returns.
Am I wrong? Can it be done with a combination of other transformations at
the same efficiency?
Thanks,
Marco
Thx Ted for the info!
2015-04-27 23:51 GMT+02:00 Ted Yu :
> This is available for 1.3.1:
>
> http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10
>
> FYI
>
> On Mon, Feb 16, 2015 at 7:24 AM, Marco wrote:
>
>> Ok, so will it be only
On 04/27/2015 06:00 PM, Ganelin, Ilya wrote:
> Marco - why do you want data sorted both within and across partitions? If you
> need to take an ordered sequence across all your data you need to either
> aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an
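(The quoted advice is cut off here.) For illustration only, a sketch of getting a total ordering without collecting to the driver, assuming a SparkContext named sc: sortBy produces a range-partitioned, globally ordered RDD, and zipWithIndex can then attach a stable global position to each element:

```scala
val data = sc.parallelize(Seq(5, 3, 9, 1, 7), numSlices = 3)

// sortBy range-partitions the data: sorted within each partition, and every
// element of partition i sorts before every element of partition i+1.
val sorted = data.sortBy(identity)

// zipWithIndex then assigns each element its position in that global order.
sorted.zipWithIndex().collect().foreach { case (value, idx) =>
  println(s"$idx -> $value")
}
```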
Hi,
I'd like to know if Spark supports edge AI: can Spark run on physical device
such as mobile devices running Android/iOS?
Best regards,
Marco Sassarini
Marco Sassarini
Artificial Intelligence Department
office: +39 0434 562 978
www.over
RDD is [0, 2]
Result is [0, 2]
RDD is [0, 2]
Filtered RDD is [0, 1]
Result is [0, 1]
```
Thanks,
Marco
Hi, my name is Marco and I'm one of the developers behind
https://github.com/unicredit/hbase-rdd
a project we are currently reviewing for various reasons.
We were basically wondering if RDD "is still a thing" nowadays (we see lots of
usage of DataFrames or Datasets) and we
Hmm, I think I got what Jingnan means. The lambda function is x != i, and i
is not evaluated when the lambda function is defined. So the pipelined rdd
is rdd.filter(lambda x: x != i).filter(lambda x: x != i), rather than
having the values of i substituted. Does that make sense to you, Sean?
On Wed
Marco Costantini
I have two tables: {users, orders}. In this example, let's say that for
each 1 User in the users table, there are 10 Orders in the orders table.
I have to use pyspark to generate a statement of Orders for each User. So,
a single user will need his/her own list of Orders. Additionally, I need t
'description'))).alias('orders'))
```
(json is ultimately needed)
This actually achieves my goal by putting all of the 'orders' in a single
Array column. Now my worry is: will this column become too large if there
are a great many orders? Is there a limit? I have searched for
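For context only, a sketch in Scala (the thread itself uses pyspark, and the table/column names here are made up) of the kind of aggregation being described: each user's orders gathered into a single array-of-structs column. The array type has no declared element-count limit; the practical constraint is that the whole array lives in one row, so a user with a huge number of orders means a very large row and the memory pressure that comes with it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

val spark = SparkSession.builder.appName("orders-per-user").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the real users/orders tables.
val users  = Seq((1L, "alice"), (2L, "bob")).toDF("user_id", "name")
val orders = Seq((10L, 1L, "book"), (11L, 1L, "pen"), (12L, 2L, "lamp"))
  .toDF("order_id", "user_id", "description")

// One output row per user, with all of that user's orders in one array column.
// (to_json can be layered on top in recent Spark versions if JSON is needed.)
val statements = users.join(orders, "user_id")
  .groupBy($"user_id", $"name")
  .agg(collect_list(struct($"order_id", $"description")).alias("orders"))

statements.show(truncate = false)
```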
Thanks Mich,
Great idea. I have done it. Those files are attached. I'm interested to
know your thoughts. Let's imagine this same structure, but with huge
amounts of data as well.
Please and thank you,
Marco.
On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh
wrote:
> Hi Marco,
>
ther actions to each iteration (send email, send HTTP
request, etc).
Thanks Mich,
Marco.
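As a rough sketch only (none of this is from the thread, and the client and column names are made up), the usual pattern for per-record side effects such as emails or HTTP calls is foreachPartition, so that the client is created once per partition rather than once per row:

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical stand-in for whatever SMTP/HTTP client is actually used.
class NotificationClient {
  def send(recipient: String, body: String): Unit = println(s"sending to $recipient")
  def close(): Unit = ()
}

def notifyAll(statements: DataFrame): Unit = {
  statements.rdd.foreachPartition { rows =>
    val client = new NotificationClient() // one client per partition, not one per row
    try {
      rows.foreach(row => client.send(row.getAs[String]("email"), row.getAs[String]("body")))
    } finally {
      client.close()
    }
  }
}
```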
On Tue, Apr 25, 2023 at 6:06 PM Mich Talebzadeh
wrote:
> Hi Marco,
>
> First thoughts.
>
> foreach() is an action operation that is to iterate/loop over each
> element in the dataset, meanin
earch for them. Even late last night!
Thanks for your help team,
Marco.
On Wed, Apr 26, 2023 at 6:21 AM Mich Talebzadeh
wrote:
> Indeed very valid points by Ayan. How is email going to handle 1000s of
> records? As a solution architect I tend to replace Users by customers and
> for each or
ith other requirements (serializing other
things).
Any advice? Please and thank you,
Marco.
Hi Enrico,
What a great answer. Thank you. Seems like I need to get comfortable with
the 'struct' and then I will be golden. Thank you again, friend.
Marco.
On Thu, May 4, 2023 at 3:00 AM Enrico Minack wrote:
> Hi,
>
> You could rearrange the DataFrame so that writing
the partitions I need. However, the filenames are something
like: part-0-0e2e2096-6d32-458d-bcdf-dbf7d74d80fd.c000.json
Now, I understand Spark's need to include the partition number in the
filename. However, it sure would be nice to control the rest of the file
name.
Any advice? Please and thank you.
Marco.
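Just to illustrate the rename-after-write idea (Spark keeps its part-xxxxx naming; the paths below are made up and an active SparkSession named spark is assumed). Note that on S3 a rename is really a copy, which is one reason an external step such as a Lambda is a common choice:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumes a single JSON part file was written under each partition directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val outDir = new Path("/tmp/statements/user_id=1")   // hypothetical output directory
val part = fs.globStatus(new Path(outDir, "part-*.json")).head.getPath
fs.rename(part, new Path(outDir, "statement.json"))  // give the part file a friendly name
```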
ll
aware) makes sense. Question: what are some good methods, tools, for
combining the parts into a single, well-named file? I imagine that is
outside of the scope of PySpark, but any advice is welcome.
Thank you,
Marco.
On Thu, May 4, 2023 at 5:05 PM Mich Talebzadeh
wrote:
> AWS S3, or Go
Hi Mich,
Thank you. Ah, I want to avoid bringing all data to the driver node. That
is my understanding of what will happen in that case. Perhaps, I'll trigger
a Lambda to rename/combine the files after PySpark writes them.
Cheers,
Marco.
On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh
please feel free to contact me (marco.s...@gmail.com) with any issues, comments, or
questions.
Sincerely,
Marco Shaw
Spark MOOC TA
(This is being sent as an HTML formatted email. Some of the links have been
duplicated just in case.)
1. Install VirtualBox here
<https://www.virt
Hello everybody,
I'm running a two node spark cluster on ec2, created using the provided
scripts. I then ssh into the master and invoke
"PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook
--profile=pyspark' spark/bin/pyspark". This launches a spark notebook which
has been instructe
Have u tried df.saveAsParquetFile? I think that method is on the df Api
Hth
Marco
On 19 Mar 2016 7:18 pm, "Vincent Ohprecio" wrote:
>
> For some reason writing data from Spark shell to csv using the `csv
> package` takes almost an hour to dump to disk. Am I going crazy or did I
Hi
I'll try the same settings as you tomorrow to see if I can reproduce the same issues.
Will report back once done
Thanks
On 20 Mar 2016 3:50 pm, "Vincent Ohprecio" wrote:
> Thanks Mich and Marco for your help. I have created a ticket to look into
> it on dev channel.
> Here
sion. How would one access the persisted RDD in the new shell session ?
>>>
>>>
>>> Thanks,
>>>
>>> --
>>>
>>>Nick
>>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
--
Ing. Marco Colombo
that table
later, for example, via the thrift server or from my code.
If every time the DF is accessed the connections have been closed, it is a
performance penalty to reopen them each time.
Then, especially for Oracle, it is also costly...
Does anyone have a better understanding of this?
Thanks again!
selected (0.126 seconds)
> >
>
> It shows tables that are persisted in the hive metastore using saveAsTable.
> Temp tables (registerTempTable) can't be viewed
>
> Can any1 help me with this,
> Thanks
>
--
Ing. Marco Colombo
Hi
U can use SBT assembly to create an uber jar. U should set the spark libraries as
'provided' in ur SBT (a minimal sketch is below)
Hth
Marco
Ps apologies if by any chance I m telling u something u already know
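For illustration, a minimal sketch of that setup (the plugin and Spark versions are placeholders):

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

// build.sbt: mark the Spark artifacts as "provided" so they are left out of the
// uber jar, while the remaining dependencies get bundled by `sbt assembly`.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
```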
On 4 Apr 2016 2:36 pm, "Mich Talebzadeh" wrote:
> Hi,
>
>
> When one builds a proj
fetch remote files and
process them in ) in one go.
I want to avoid doing the first step (processing the million-row file) in
spark and the rest (fetching via FTP and processing the files) offline.
Does spark have anything that can help with the FTP fetch?
Thanks in advance and rgds
Marco
Many thanks for suggestion Andy!
Kr
Marco
On 5 Apr 2016 7:25 pm, "Andy Davidson"
wrote:
> Hi Marco
>
> You might consider setting up some sort of ELT pipe line. One of your
> stages might be to create a file of all the FTP URL. You could then write
> a spark app that
Have you tried the SBT eclipse plugin? Then u can run 'sbt eclipse' and have ur
spark project directly in eclipse
Pls Google it and u shud b able to find ur way.
If not, ping me and I'll send u the plugin (I m replying from my phone)
Hth
On 12 Apr 2016 4:53 pm, "ImMr.K" <875061...@qq.com> wrote:
But how to
" % "1.5.2" %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" %
"1.3.0" % "provided"
...
But compilation fails, mentioning that the classes StateSpec and State are not
found.
Could someone pls point me to the right packages to reference if i want to use
StateSpec?
kind regards
marco
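(For what it's worth: StateSpec and State live in the spark-streaming artifact itself, but only from Spark 1.6.0 onwards, so a mix of 1.3.x/1.5.x streaming dependencies as above would not resolve them. A sketch, with the version as an example only:)

```scala
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided"

// and in code:
import org.apache.spark.streaming.{State, StateSpec}
```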
any docs telling me which config file i have to modify?
can anyone assist?
kr
marco
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 April 2016 at 15:13, Marco Mistroni wrote:
>
>> HI all
>> i need to use spark-csv in my spark instance, and i want to avoid
>> launching spark-shell
>> by passing the package name every
If u r using the Scala api you can do
myRdd.zipWithIndex.filter(_._2 > 0).map(_._1)
Maybe a little bit complicated but it will do the trick (a fuller sketch is below).
As per spark-csv, you will get back a data frame which you can convert back to
an rdd.
Hth
Marco
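A slightly fuller sketch of that one-liner, assuming the goal is to drop the first record (e.g. a CSV header) and a live SparkContext named sc; the input path is made up:

```scala
val lines = sc.textFile("data.csv")

// zipWithIndex assigns a global position to every line; dropping index 0
// removes just the first record of the file.
val withoutHeader = lines.zipWithIndex()
  .filter { case (_, idx) => idx > 0 }
  .map { case (line, _) => line }
```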
On 27 Apr 2016 6:59 am, "nihed mbarek" wrote:
> You
; % "test"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1"
% "provided"
libraryDependencies += "
))
:28: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Boolean
df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else lit(0))
any suggestions?
kind regars
marco
> df.withColumn("AgeInt", when(df("age") > 29.0, 1).otherwise(0)).show
> +----+----+------+
> | age|name|AgeInt|
> +----+----+------+
> |25.0| foo|     0|
> |30.0| bar|     1|
> +----+----+------+
>
> On Thu, 28 Apr 2016 at 20:45 Marco Mistroni wrote:
>
>> HI all
>> i have a dataFrame with a col
marco
Hi
was wondering if anyone can assist here..
I am trying to create a spark cluster on AWS using the scripts located in the
spark-1.6.1/ec2 directory.
When the spark_ec2.py script tries to do an rsync to copy directories over
to the AWS
master node, it fails miserably with this stack trace
DEBUG:spark ecd
ys are in stream1 and stream2 in the same time interval. Do you guys
have any suggestion to implement this correctly with Spark? Thanks, Marco
Dear all,
Does Spark use data locality information from HDFS when running in standalone
mode? Or is running on YARN mandatory for that purpose? I can't find this
information in the docs, and on Google I am only finding contrasting opinions on
that.
Regards
Marco Capuccini
would be
enough.
Regards
Marco
On 05 Jun 2016, at 12:17, Mich Talebzadeh
mailto:mich.talebza...@gmail.com>> wrote:
Well in standalone mode you are running your spark code on one physical node so
the assumption would be that there is HDFS node running on the same host.
When you are r
HI Ashok
this is not really a spark-related question so i would not use this
mailing list.
Anyway, my 2 cents here
as outlined by earlier replies, if the class you are referencing is in a
different jar, at compile time you will need to add that dependency to your
build.sbt,
I'd personally lea
HI
have you tried to add this flag?
-Djsse.enableSNIExtension=false
i had similar issues in another standalone application when i switched to
java8 from java7
hth
marco
On Mon, Jun 6, 2016 at 9:58 PM, Koert Kuipers wrote:
> mhh i would not be very happy if the implication is that i have
Hi
how about
1. have a process that reads the data from your sqlserver and dumps it as a
file into a directory on your hd
2. use spark-streaming to read data from that directory and store it into
hdfs
perhaps there is some sort of spark 'connector' that allows you to read
data from a db direct
2105263157895)
((entropy,1,28),0.684210526315789
could anyone explain why?
kind regards
marco
Hi all,
I'm running a Spark Streaming application that uses reduceByKeyAndWindow(). The
window interval is 2 hours, while the slide interval is 1 hour. I have a
JavaPairRDD in which both keys and values are strings. Each time the
reduceByKeyAndWindow() function is called, it uses appendString(
HI all
which method shall i use to verify the accuracy of a
BinaryClassificationMetrics?
MulticlassMetrics has a precision() method, but that is missing
from BinaryClassificationMetrics
thanks
marco
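For reference, a sketch of what the two classes expose: BinaryClassificationMetrics gives threshold-based and area metrics rather than a single accuracy figure, while overall precision/accuracy can be obtained by wrapping the same (prediction, label) pairs in MulticlassMetrics. The RDD name is made up:

```scala
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.rdd.RDD

// predictionAndLabels: (predicted score or 0/1 label, true label)
def report(predictionAndLabels: RDD[(Double, Double)]): Unit = {
  val binary = new BinaryClassificationMetrics(predictionAndLabels)
  println(s"area under ROC = ${binary.areaUnderROC()}")
  println(s"area under PR  = ${binary.areaUnderPR()}")

  // Micro-averaged precision over hard predictions, which equals accuracy
  // (exposed as `accuracy` in Spark 2.x and later).
  val multi = new MulticlassMetrics(predictionAndLabels)
  println(s"precision/accuracy = ${multi.precision}")
}
```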
unsubscribe
too little info
it'll help if you can post the exception and show your sbt file (if you are
using sbt), and provide minimal details on what you are doing
kr
On Fri, Jun 17, 2016 at 10:08 AM, VG wrote:
> Failed to find data source: com.databricks.spark.xml
>
> Any suggestions to resolve this
>
>
ql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
> ... 4 more
>
> Code
> SQLContext sqlContext = new SQLContext(sc);
> DataFrame df = sqlContext.read()
> .format("org.apache.spark.xml")
>
>>>> scala.collection.GenTraversableOnce$class
>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>
Hi
Post the code. I code in python and Scala on spark.. I can give u help,
though the api for Scala and python are practically the same... the only difference is
in the python lambda vs Scala inline functions
Hth
On 18 Jun 2016 6:27 am, "Aakash Basu" wrote:
> I don't have a sound knowledge in Python and on
Dear admin,
I've tried to unsubscribe from this mailing list twice, but I'm still receiving
emails. Can you please fix this?
Thanks,
Marco
Hi Chen
pls post
1. a code snippet
2. exception
any particular reason why you need to load classes in other jars
programmatically?
Have you tried to build a fat jar with all the dependencies ?
hth
marco
On Thu, Jul 7, 2016 at 5:05 PM, Chen Song wrote:
> Sorry to spam people who are
Hi all, I cannot start the thrift server on spark 1.6.2.
I've configured the binding port and IP and left the default metastore.
In the logs I get:
16/07/11 22:51:46 INFO NettyBlockTransferService: Server created on 46717
16/07/11 22:51:46 INFO BlockManagerMaster: Trying to register BlockManager
16/07/11 22:51:46
Dr Mich
do you have any slides or videos available for the presentation you did
@Canary Wharf?
kindest regards
marco
On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh
wrote:
> Dear forum members
>
> I will be presenting on the topic of "Running Spark on Hive or Hive on
> Sp
I've pasted below the build.sbt i am using for my SparkExamples apps, hope this
helps.
kr
marco
name := "SparkExamples"
version := "1.0"
scalaVersion := "2.10.5"
// Add a single dependency
libraryDependencies += "junit" % "junit" % "4.8&qu
Hi
afaik yes (others pls override). Generally, in RandomForest and
DecisionTree you have a column which you are trying to 'predict' (the
label) and a set of features that are used to predict the outcome.
i would assume that if you specify the label column and the 'features'
columns, everything else
Spark version pre 1.4?
kr
marco
On Wed, Jul 20, 2016 at 6:13 PM, Sachin Mittal wrote:
> NoClassDefFound error was for spark classes like say SparkConext.
> When running a standalone spark application I was not passing external
> jars using --jars option.
>
> However I have fi
Hi,
have you tried to use spark-csv (https://github.com/databricks/spark-csv)?
After all, you can always convert an Excel file to CSV first.
hth.
On Thu, Jul 21, 2016 at 4:25 AM, Felix Cheung
wrote:
> From looking at be CLConnect package, its loadWorkbook() function only
> supports reading from local fi
Hi all, I have a spark application that was working in 1.5.2, but now has a
problem in 1.6.2.
Here is an example:
val conf = new SparkConf()
.setMaster("spark://10.0.2.15:7077")
.setMaster("local")
.set("spark.cassandra.connection.host", "10.0.2.15")
.setAppName("spark
Thanks.
That is just a typo. I'm using 'spark://10.0.2.15:7077' (standalone).
It is the same url used for --master in spark-submit
2016-07-21 16:08 GMT+02:00 Mich Talebzadeh :
> Hi Marco
>
> In your code
>
> val conf = new SparkConf()
> .setMaster("spark
but for me i work
> with pyspark and it's a very wonderful machine
>
> my question: we don't have tools for plotting data, so each time we have to
> switch and go back to python for using plot.
> but when you have a large result for a scatter plot or roc curve, you can't use
> collect to take the data.
>
> does someone have a suggestion for plotting?
>
> thanks
>
>
--
Ing. Marco Colombo
Hello Jean
you can take ur current DataFrame and send it to mllib (i was doing that
coz i didn't know the ml package), but the process is a little bit cumbersome
(a sketch is below):
1. go from a DataFrame to an RDD of LabeledPoint
2. run your ML model
i'd suggest you stick to DataFrame + ml package :)
hth
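A rough sketch of step 1, with made-up column names and assuming the label and the features are already Double columns of a DataFrame df:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// From a DataFrame of Rows to the RDD[LabeledPoint] that the old mllib API expects.
val labeled = df.rdd.map { row =>
  LabeledPoint(
    row.getAs[Double]("label"),
    Vectors.dense(row.getAs[Double]("f1"), row.getAs[Double]("f2")))
}
```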
using Cassandra. However, he says it is too slow
>>> and not user friendly/
>>> MongodDB as a doc databases is pretty neat but not fast enough
>>>
>>> May main concern is fast writes per second and good scaling.
>>>
>>>
>>> Hive on Spark or Tez?
>>>
>>> How about Hbase. or anything else
>>>
>>> Any expert advice warmly acknowledged..
>>>
>>> thanking you
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
--
Ing. Marco Colombo
Hi
So u have a data frame, then use zipWithIndex and create a tuple.
I m not sure if the df API has something useful for zip with index.
But u can:
- get a data frame
- convert it to an rdd (there's a .rdd method)
- do a zip with index
That will give u an rdd with 3 fields... (a sketch is below)
I don't think you can update a df column
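A sketch of that route, assuming a DataFrame df and a SparkSession spark (on 1.x, sqlContext.createDataFrame works the same way); the new column name is arbitrary:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// DataFrame -> RDD -> zipWithIndex -> back to a DataFrame with the index appended.
val indexedRows = df.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val newSchema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexedDf = spark.createDataFrame(indexedRows, newSchema)
```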
ou-must-build-spark-with-hive-exception-td27390.html
>
> plz help me.. I couldn't find any solution..plz
>
> On Fri, Jul 22, 2016 at 5:50 PM, Jean Georges Perrin wrote:
>
>> Thanks Marco - I like the idea of sticking with DataFrames ;)
>>
>>
>> On Jul
ive at step 7 without any issues (uhm i dont have
matplotlib so i skipped step 5, which i guess is irrelevant as it just
displays the data rather than doing any logic)
Pls let me know if this fixes your problems..
hth
marco
On Fri, Jul 22, 2016 at 6:34 PM, Inam Ur Rehman
wrote:
>
Hi vg, I believe the error msg is misleading. I had a similar one with
pyspark yesterday after calling a count on a data frame, where the real
error was with an incorrect user defined function being applied.
Pls send me some sample code with a trimmed down version of the data and I'll
see if i can rep
Hi, how about creating an auto-increment column in hbase?
Hth
On 24 Jul 2016 3:53 am, "yeshwanth kumar" wrote:
> Hi,
>
> i am doing bulk load to hbase using spark,
> in which i need to generate a sequential key for each record,
> the key should be sequential across all the executors.
>
> i tried z
Apologies, I misinterpreted. Could you post two use cases?
Kr
On 24 Jul 2016 3:41 pm, "janardhan shetty" wrote:
> Marco,
>
> Thanks for the response. It is indexed order and not ascending or
> descending order.
> On Jul 24, 2016 7:37 AM, "Marco Mistroni"
Hi
what is your source data? i am guessing a DataFrame of Integers, as you
are using an UDF
So your DataFrame is then a bunch of Row[Integer]?
below is a sample from one of my codebases to predict eurocup winners, going from
a DataFrame of Row[Double] to an RDD of LabeledPoint
I m not using UDF to con
ment of ID3
> next first 5 elements of ID1 to ID2. Similarly next 5 elements in that
> order until the end of number of elements.
> Let me know if this helps
>
>
> On Sun, Jul 24, 2016 at 7:45 AM, Marco Mistroni
> wrote:
>
>> Apologies I misinterpreted could you post
rewrite them as udaf?
Thanks!
--
Ing. Marco Colombo
must have
> a connection or a pool of connection per worker. Executors of the same
> worker can share connection pool.
>
> Best
> Ayan
> On 25 Jul 2016 16:48, "Marco Colombo" > wrote:
>
>> Hi all!
>> Among other use cases, I want to use spark as a dist
t 1:21 AM, janardhan shetty
wrote:
> Thanks Marco. This solved the order problem. Had another question which is
> prefix to this.
>
> As you can see below ID2,ID1 and ID3 are in order and I need to maintain
> this index order as well. But when we do groupByKey
> operation(*rd
Hi Kevin
you should not need to rebuild everything.
Instead, i believe you should launch spark-submit specifying the kafka
package in your --packages option... i had to follow the same approach when integrating spark
streaming with flume.
have you checked this link?
https://spark.apache.org/docs/latest/stream
Hi all,
I was using JdbcRDD, whose constructor signature accepts a
function to get a DB connection. This is very useful for providing my own
connection handler.
I'm evaluating a move to dataframes, but I cannot see how to provide such a
function and migrate my code. I want to use my own 'getConnection'
similar issue? any suggestion on how can i use a custom
AMI when creating a spark cluster?
kind regards
marco
From getConnection I'm handling a connection pool.
I see no option for that in the docs.
Regards
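For comparison, a sketch of the two APIs, assuming an existing sc/sqlContext; the URL, credentials, query and bounds are placeholders. JdbcRDD accepts a connection-factory function, which is where a pool can be plugged in, while the DataFrame reader only takes a URL plus properties and manages connections internally, which is exactly the gap being described:

```scala
import java.sql.DriverManager
import java.util.Properties

import org.apache.spark.rdd.JdbcRDD

// Old style: JdbcRDD takes () => Connection, so a pool-backed factory can be supplied.
val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/svc", "user", "pwd"),
  "SELECT id, name FROM t WHERE id >= ? AND id <= ?",
  1L,    // lower bound
  1000L, // upper bound
  4,     // number of partitions
  r => (r.getLong(1), r.getString(2)))

// DataFrame style: only url/table/properties; connection handling is internal.
val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "pwd")
val df = sqlContext.read.jdbc("jdbc:oracle:thin:@//dbhost:1521/svc", "t", props)
```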
On Monday, 25 July 2016, Mich Talebzadeh
wrote:
> Hi Marco,
>
> what is in your UDF getConnection and why not use DF itself?
>
> I guess it is all connecti
ardhan shetty
wrote:
> groupBy is a shuffle operation and the index is already lost in this process
> if I am not wrong, and I don't see a *sortWith* operation on RDD.
>
> Any suggestions or help ?
>
> On Mon, Jul 25, 2016 at 12:58 AM, Marco Mistroni
> wrote:
>
>> Hi
ering if Spark has
> the hooks to allow me to try ;-)
>
> Cheers,
> Tim
>
>
--
Ing. Marco Colombo
hi all
could anyone assist?
i need to create a udf function that returns a LabeledPoint
I read that in pyspark (1.6) LabeledPoint is not supported and i have to
create
a StructType
can anyone point me in the right direction?
kr
marco
Hi jg
+1 for the link. I'd add ML and graph examples if u can
-1 for the programming language choice :))
kr
On 31 Jul 2016 9:13 pm, "Jean Georges Perrin" wrote:
> Thanks Guys - I really appreciate :)... If you have any idea of something
> missing, I'll gladly add it.
>
> (and yeah, come on! Is
Hi all, I've a question on how hive+spark are handling data.
I've started a new HiveContext and I'm extracting data from cassandra.
I've configured spark.sql.shuffle.partitions=10.
Now, I have the following query:
select d.id, avg(d.avg) from v_points d where id=90 group by id;
I see that 10 tasks are
;
>
> On Thu, Aug 4, 2016 at 4:58 PM, Marco Colombo > wrote:
>
>> Hi all, I've a question on how hive+spark are handling data.
>>
>> I've started a new HiveContext and I'm extracting data from cassandra.
>> I've configured spark.sql.shuffle.part
gt;
>
> If your query will use partition keys in C*, always use them with either
> "=" or "in". If not, then you have to wait for the data transfer from C* to
> spark. Spark + C* allow to run any ad-hoc queries, but you need to know the
> underline price paid.
>
How about all the dependencies? Presumably they will all go in --jars?
What if I have 10 dependencies? Any best practices for packaging apps for
spark 2.0?
Kr
On 10 Aug 2016 6:46 pm, "Nick Pentreath" wrote:
> You're correct - Spark packaging has been shifted to not use the assembly
> jar.
>
> To buil
sqlContext.sql
I was wondering how i can configure hive to point to a different
directory where i have more permissions
kr
marco
Thanks Chris, will give it a go and report back.
Bizarrely, if I start the pyspark shell I don't see any issues
Kr
Marco
On 20 Dec 2015 5:02 pm, "Chris Fregly" wrote:
> hopping on a plane, but check the hive-site.xml that's in your spark/conf
> directory (or should be, a