Calling SparkContext methods in scala Future

2016-01-18 Thread Marco
it fails (possibly because the SparkContext is null?). How do I address this issue? What needs to be done? Do I need to switch to a synchronous architecture? Thanks in advance. Kind regards, Marco

Re: Calling SparkContext methods in scala Future

2016-01-19 Thread Marco
ave a reproducer, but I would say that it's enough to create one Future and call sparkContext from there. Thanks again for the answers. Kind regards, Marco 2016-01-18 19:37 GMT+01:00 Shixiong(Ryan) Zhu : > Hey Marco, > > Since the codes in Future is in an asynchronous way, you cannot call
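
The archived reply is cut off above. For context, a minimal sketch of the pattern that usually resolves this (names and timeouts are illustrative, not from the thread): run the action inside the Future, and make sure the driver does not stop the SparkContext while asynchronous work is still in flight.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.{SparkConf, SparkContext}

object FutureCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("future-count").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000)

    // SparkContext methods can be called from another thread, but the
    // context must still be alive when the Future body actually runs.
    val pending: Future[Long] = Future { rdd.count() }

    // Block until the asynchronous action finishes *before* stopping the
    // context; stopping it while the Future is in flight is a typical
    // cause of null/shut-down SparkContext failures.
    val n = Await.result(pending, 5.minutes)
    println(s"count = $n")
    sc.stop()
  }
}
```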

Re: RDD immutability

2016-01-19 Thread Marco
cipe for > making data from other data. It is not literally computed by materializing > every RDD completely. That is, a lot of the "copy" can be optimized away > too. I hope it answers your question. Kind regards, Marco 2016-01-19 13:14 GMT+01:00 ddav : > Hi, > > Certain

Re: RDD immutability

2016-01-19 Thread Marco
It depends on what you mean by "write access". The RDDs are immutable, so you can't really change them. When you apply a mapping/filter/groupBy function, you are creating a new RDD starting from the original one. Kind regards, Marco 2016-01-19 13:27 GMT+01:00 Dave : > Hi Ma
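
A short illustration of the point (made-up data):

```scala
val rdd = sc.parallelize(Seq(1, 2, 3))
val doubled = rdd.map(_ * 2)  // a brand-new RDD; `rdd` itself is untouched
rdd.collect()                 // Array(1, 2, 3) -- the original is unchanged
doubled.collect()             // Array(2, 4, 6)
```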

Re: Got java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s when running job from intellij Idea

2015-01-23 Thread Marco
yDependencies += "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.6.0" libraryDependencies += "org.apache.hbase" % "hbase-client" % "0.98.4-hadoop2" libraryDependencies += "org.apache.hbase" % "hbase-server" % "0.9

Re: Got java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s when running job from intellij Idea

2015-01-28 Thread Marco
I've switched to maven and all issues are gone now. 2015-01-23 12:07 GMT+01:00 Sean Owen : > Use mvn dependency:tree or sbt dependency-tree to print all of the > dependencies. You are probably bringing in more servlet API libs from > some other source? > > On Fri, Jan 23, 201
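
For anyone staying on sbt, a sketch of the usual cure for this signer-information clash, assuming the conflicting servlet API is dragged in by the Hadoop/HBase artifacts (the version is copied from the thread; the exclusions are the illustrative part):

```scala
// build.sbt
libraryDependencies += "org.apache.hbase" % "hbase-server" % "0.98.4-hadoop2" excludeAll(
  ExclusionRule(organization = "org.mortbay.jetty"), // older jetty bundles its own servlet classes
  ExclusionRule(organization = "javax.servlet")      // drop the unsigned duplicate servlet API
)
```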

Issue with SparkContext in cluster

2015-01-28 Thread Marco
cause of this issue? Thanks, Marco <<<<< 15/01/28 10:25:06 INFO spark.SecurityManager: Changing modify acls to: user 15/01/28 10:25:06 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify perm

hive-thriftserver maven artifact

2015-02-16 Thread Marco
Hi, I am referring to https://issues.apache.org/jira/browse/SPARK-4925 (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the artifact in a public repository? I have not found it @Maven Central. Thanks, Marco

Re: hive-thriftserver maven artifact

2015-02-16 Thread Marco
Ok, so will it be only available for the next version (1.3.0)? 2015-02-16 15:24 GMT+01:00 Ted Yu : > I searched for 'spark-hive-thriftserver_2.10' on this page: > http://mvnrepository.com/artifact/org.apache.spark > > Looks like it is not published. > > On Mon, Fe

ReduceByKey and sorting within partitions

2015-04-27 Thread Marco
he correct way to do that is that combineByKey calls the setKeyOrdering function on the ShuffledRDD that it returns. Am I wrong? Can it be done by a combination of other transformations with the same efficiency? Thanks, Marco
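
The thread is truncated, but a sketch of one transformation that does this in a single shuffle, repartitionAndSortWithinPartitions, may be useful for reference (example data invented):

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

// One shuffle: records are partitioned by key and each partition is
// sorted by key as part of the same shuffle, instead of reducing first
// and sorting in a separate pass afterwards.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))
```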

Re: hive-thriftserver maven artifact

2015-04-28 Thread Marco
Thx Ted for the info ! 2015-04-27 23:51 GMT+02:00 Ted Yu : > This is available for 1.3.1: > > http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10 > > FYI > > On Mon, Feb 16, 2015 at 7:24 AM, Marco wrote: > >> Ok, so will it be only

Re: ReduceByKey and sorting within partitions

2015-04-29 Thread Marco
On 04/27/2015 06:00 PM, Ganelin, Ilya wrote: > Marco - why do you want data sorted both within and across partitions? If you > need to take an ordered sequence across all your data you need to either > aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an

Edge AI with Spark

2020-09-24 Thread Marco Sassarini
Hi, I'd like to know if Spark supports edge AI: can Spark run on physical devices such as mobile devices running Android/iOS? Best regards, Marco Sassarini Artificial Intelligence Department office: +39 0434 562 978 www.over

RDD filter in for loop gave strange results

2021-01-20 Thread Marco Wong
DD is [0, 2] Result is [0, 2] RDD is [0, 2] Filtered RDD is [0, 1] Result is [0, 1] ``` Thanks, Marco

Spark RDD + HBase: adoption trend

2021-01-20 Thread Marco Firrincieli
Hi, my name is Marco and I'm one of the developers behind https://github.com/unicredit/hbase-rdd a project we are currently reviewing for various reasons. We were basically wondering if RDD "is still a thing" nowadays (we see lots of usage for DataFrames or Datasets) and we

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Marco Wong
Hmm, I think I got what Jingnan means. The lambda function is x != i and i is not evaluated when the lambda function was defined. So the pipelined rdd is rdd.filter(lambda x: x != i).filter(lambda x: x != i), rather than having the values of i substituted. Does that make sense to you, Sean? On Wed

What is the best way to organize a join within a foreach?

2023-04-24 Thread Marco Costantini
I have two tables: {users, orders}. In this example, let's say that for each 1 User in the users table, there are 10 Orders in the orders table. I have to use pyspark to generate a statement of Orders for each User. So, a single user will need his/her own list of Orders. Additionally, I need t

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
'description'))).alias('orders')) ``` (json is ultimately needed) This actually achieves my goal by putting all of the 'orders' in a single Array column. Now my worry is: will this column become too large if there are a great many orders? Is there a limit? I have searched for
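
The code is cut off above; for reference, a sketch of the whole expression with hypothetical column names (the thread used pyspark; the same functions exist there), producing one row per user with all orders gathered into a single array-of-struct column and then rendered as JSON:

```scala
import org.apache.spark.sql.functions._

// orders: hypothetical DataFrame with user_id, order_id, description
val statements = orders
  .groupBy("user_id")
  .agg(collect_list(struct(col("order_id"), col("description"))).alias("orders"))

// one JSON document per user, orders nested as an array
statements.toJSON.show(truncate = false)
```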

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, Great idea. I have done it. Those files are attached. I'm interested to know your thoughts. Let's imagine this same structure, but with huge amounts of data as well. Please and thank you, Marco. On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh wrote: > Hi Marco, >

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
ther actions to each iteration (send email, send HTTP request, etc). Thanks Mich, Marco. On Tue, Apr 25, 2023 at 6:06 PM Mich Talebzadeh wrote: > Hi Marco, > > First thoughts. > > foreach() is an action operation that is to iterate/loop over each > element in the dataset, meanin

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Marco Costantini
earch for them. Even late last night! Thanks for your help team, Marco. On Wed, Apr 26, 2023 at 6:21 AM Mich Talebzadeh wrote: > Indeed very valid points by Ayan. How email is going to handle 1000s of > records. As a solution architect I tend to replace. Users by customers and > for each or

Write custom JSON from DataFrame in PySpark

2023-05-03 Thread Marco Costantini
ith other requirements (serializing other things). Any advice? Please and thank you, Marco.

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Marco Costantini
Hi Enrico, What a great answer. Thank you. Seems like I need to get comfortable with the 'struct' and then I will be golden. Thank you again, friend. Marco. On Thu, May 4, 2023 at 3:00 AM Enrico Minack wrote: > Hi, > > You could rearrange the DataFrame so that writing

Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
the partitions I need. However, the filenames are something like: part-0-0e2e2096-6d32-458d-bcdf-dbf7d74d80fd.c000.json Now, I understand Spark's need to include the partition number in the filename. However, it sure would be nice to control the rest of the file name. Any advice? Please and thank you. Marco.

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
ll aware) makes sense. Question: what are some good methods, tools, for combining the parts into a single, well-named file? I imagine that is outside of the scope of PySpark, but any advice is welcome. Thank you, Marco. On Thu, May 4, 2023 at 5:05 PM Mich Talebzadeh wrote: > AWS S3, or Go
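
One approach, sketched below with hypothetical paths (and in Scala, though the thread is pyspark): write a single part file per partition directory, then rename it on the driver with the Hadoop FileSystem API, which is indeed outside PySpark proper.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// coalesce(1) leaves one part file in each partition directory
df.coalesce(1).write.partitionBy("user_id").json("s3a://bucket/statements")

val dir = new Path("s3a://bucket/statements/user_id=42")  // hypothetical
val fs: FileSystem = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)
val part = fs.globStatus(new Path(dir, "part-*.json")).head.getPath
fs.rename(part, new Path(dir, "statement.json"))          // the well-named file
```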

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-05 Thread Marco Costantini
Hi Mich, Thank you. Ah, I want to avoid bringing all data to the driver node. That is my understanding of what will happen in that case. Perhaps, I'll trigger a Lambda to rename/combine the files after PySpark writes them. Cheers, Marco. On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh

Spark MOOC - early access

2015-05-21 Thread Marco Shaw
please feel free to contact me (marco.s...@gmail.com ) with any issues, comments, or questions.Sincerely,Marco ShawSpark MOOC TA_(This is being sent as an HTML formatted email. Some of the links have been duplicated just in case.)1. Install VirtualBox here <https://www.virt

Ipython notebook, ec2 spark cluster and matplotlib

2015-07-10 Thread Marco Didonna
Hello everybody, I'm running a two node spark cluster on ec2, created using the provided scripts. I then ssh into the master and invoke "PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook --profile=pyspark' spark/bin/pyspark". This launches a spark notebook which has been instructe

Re: Fwd: Spark 2.0 Shell -csv package weirdness

2016-03-19 Thread Marco Mistroni
Have u tried df.saveAsParquetFile? I think that method is on the df API. Hth Marco On 19 Mar 2016 7:18 pm, "Vincent Ohprecio" wrote: > > For some reason writing data from Spark shell to csv using the `csv > package` takes almost an hour to dump to disk. Am I going crazy or did I

Re: Spark 2.0 Shell -csv package weirdness

2016-03-20 Thread Marco Mistroni
Hi, I'll try tomorrow with the same settings as you to see if I can reproduce the same issues. Will report back once done. Thanks On 20 Mar 2016 3:50 pm, "Vincent Ohprecio" wrote: > Thanks Mich and Marco for your help. I have created a ticket to look into > it on dev channel. > Here

Re: Reading Back a Cached RDD

2016-03-24 Thread Marco Colombo
sion. How would one access the persisted RDD in the new shell session ? >>> >>> >>> Thanks, >>> >>> -- >>> >>>Nick >>> >> > > > -- > Cell : 425-233-8271 > Twitter: https://twitter.com/holdenkarau > -- Ing. Marco Colombo

Re: Spark and DB connection pool

2016-03-25 Thread Marco Colombo
that table later, for example, via thrift server or from my code. If every time the DF is accessed the connections have been closed, it is a performance penalty to reopen them each time. Then, especially for Oracle, it is also costly... Does anyone have a better understanding of this? Thanks again!

Re: Exposing dataframe via thrift server

2016-03-30 Thread Marco Colombo
selected (0.126 seconds) > > > > It shows tables that are persisted in the hive metastore using saveAsTable. > Temp tables (registerTempTable) can't be viewed. > > Can anyone help me with this, > Thanks > -- Ing. Marco Colombo

Re: All inclusive uber-jar

2016-04-04 Thread Marco Mistroni
Hi U can use SBT assembly to create uber jar. U should set spark libraries as 'provided' in ur SBT Hth Marco Ps apologies if by any chances I m telling u something u already know On 4 Apr 2016 2:36 pm, "Mich Talebzadeh" wrote: > Hi, > > > When one builds a proj

Can spark somehow help with this usecase?

2016-04-05 Thread Marco Mistroni
fetch remote files and process them in ) in one go. I want to avoid doing the first step (processing the million row file) in spark and the rest (_fetching FTP and process files) offline. Does spark has anything that can help with the FTP fetch? Thanks in advance and rgds Marco

Re: Can spark somehow help with this usecase?

2016-04-05 Thread Marco Mistroni
Many thanks for suggestion Andy! Kr Marco On 5 Apr 2016 7:25 pm, "Andy Davidson" wrote: > Hi Marco > > You might consider setting up some sort of ELT pipe line. One of your > stages might be to create a file of all the FTP URL. You could then write > a spark app that

Re: Re:[spark] build/sbt gen-idea error

2016-04-12 Thread Marco Mistroni
Have you tried SBT eclipse plugin? Then u can run SBT eclipse and have ur spark project directly in eclipse Pls Google it and u shud b able to find ur way. If not ping me and I send u the plugin (I m replying from my phone) Hth On 12 Apr 2016 4:53 pm, "ImMr.K" <875061...@qq.com> wrote: But how to

Please assist: Spark 1.5.2 / cannot find StateSpec / State

2016-04-13 Thread Marco Mistroni
" % "1.5.2" % "provided" libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "1.3.0" % "provided" ... But compilations fail mentioning that class StateSpec and State are not found Could pls someone point me to the right packages to refer if i want to use StateSpec? kind regards marco

Pls assist: which conf file do i need to modify if i want spark-shell to include external packages?

2016-04-21 Thread Marco Mistroni
any docs telling me which config file i have to modify. Anyone can assist? kr marco

Re: Pls assist: which conf file do i need to modify if i want spark-shell to include external packages?

2016-04-21 Thread Marco Mistroni
> > > > http://talebzadehmich.wordpress.com > > > > On 21 April 2016 at 15:13, Marco Mistroni wrote: > >> HI all >> i need to use spark-csv in my spark instance, and i want to avoid >> launching spark-shell >> by passing the package name every

Re: removing header from csv file

2016-04-27 Thread Marco Mistroni
If u r using the Scala api you can do myRdd.zipWithIndex.filter(_._2 > 0).map(_._1) Maybe a little bit complicated but will do the trick. As per spark CSV, you will get back a data frame which you can convert back to an rdd. Hth Marco On 27 Apr 2016 6:59 am, "nihed mbarek" wrote: > You
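
An alternative sketch that avoids indexing every element: drop the first line of the first partition only, since that is where the header lives.

```scala
val noHeader = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter  // only partition 0 contains the header
}
```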

Re: n

2016-04-27 Thread Marco Mistroni
; % "test" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided" libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided" libraryDependencies +=

Addign a new column to a dataframe (based on value of existing column)

2016-04-28 Thread Marco Mistroni
)) :28: error: type mismatch; found : org.apache.spark.sql.Column required: Boolean df.withColumn("AgeInt", if (df("Age") > 29.0) lit(1) else lit(0)) any suggestions? kind regards marco

Re: Addign a new column to a dataframe (based on value of existing column)

2016-04-28 Thread Marco Mistroni
") > 29.0, > 1).otherwise(0)).show > +----+----+------+ > | age|name|AgeInt| > +----+----+------+ > |25.0| foo| 0| > |30.0| bar| 1| > +----+----+------+ > > On Thu, 28 Apr 2016 at 20:45 Marco Mistroni wrote: > >> HI all >> i have a dataFrame with a col
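
The reply above is truncated; reassembled as a complete expression, the technique is: withColumn needs a Column expression, not a Scala if on a Boolean, and when/otherwise builds that Column.

```scala
import org.apache.spark.sql.functions.{col, when}

// when/otherwise produces the conditional as a Column, which is what
// withColumn expects -- a plain if/else causes the type mismatch above
val withAgeInt = df.withColumn("AgeInt", when(col("Age") > 29.0, 1).otherwise(0))
```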

Issue with creation of EC2 cluster using spark scripts

2016-05-16 Thread Marco Mistroni
marco

Pls Assist: error when creating cluster on AWS using spark's ec2 scripts

2016-05-17 Thread Marco Mistroni
Hi was wondering if anyone can assist here.. I am trying to create a spark cluster on AWS using scripts located in the spark-1.6.1/ec2 directory. When the spark_ec2.py script tries to do a rsync to copy directories over to the AWS master node it fails miserably with this stack trace DEBUG:spark ecd

How to carry data streams over multiple batch intervals in Spark Streaming

2016-05-21 Thread Marco Platania
ys are in stream1 and stream2 in the same time interval. Do you guys have any suggestion to implement this correctly with Spark? Thanks, Marco

Does Spark use data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Marco Capuccini
Dear all, Does Spark use data locality information from HDFS when running in standalone mode? Or is running on YARN mandatory for that purpose? I can't find this information in the docs, and on Google I am only finding contrasting opinions on that. Regards Marco Capuccini

Re: Does Spark use data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Marco Capuccini
would be enough. Regards Marco On 05 Jun 2016, at 12:17, Mich Talebzadeh mailto:mich.talebza...@gmail.com>> wrote: Well in standalone mode you are running your spark code on one physical node so the assumption would be that there is HDFS node running on the same host. When you are r

Re: Fw: Basic question on using one's own classes in the Scala app

2016-06-06 Thread Marco Mistroni
HI Ashok this is not really a spark-related question so i would not use this mailing list. Anyway, my 2 cents here as outlined by earlier replies, if the class you are referencing is in a different jar, at compile time you will need to add that dependency to your build.sbt, I'd personally lea

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Marco Mistroni
HI have you tried to add this flag? -Djsse.enableSNIExtension=false i had similar issues in another standalone application when i switched to java8 from java7 hth marco On Mon, Jun 6, 2016 at 9:58 PM, Koert Kuipers wrote: > mhh i would not be very happy if the implication is that i have

Re: Spark_Usecase

2016-06-07 Thread Marco Mistroni
Hi how about 1. have a process that reads the data from your sqlserver and dumps it as a file into a directory on your hd 2. use spark-streaming to read data from that directory and store it into hdfs perhaps there is some sort of spark 'connector' that allows you to read data from a db direct

Pls assist: Spark DecisionTree question

2016-06-10 Thread Marco Mistroni
2105263157895) ((entropy,1,28),0.684210526315789 could anyone explain why? kind regards marco

Neither previous window has value for key, nor new values found

2016-06-10 Thread Marco Platania
Hi all,  I'm running a Spark Streaming application that uses reduceByKeyAndWindow(). The window interval is 2 hours, while the slide interval is 1 hour. I have a JavaPairRDD in which both keys and values are strings. Each time the reduceByKeyAndWindow() function is called, it uses appendString(
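
The post is cut off; for reference, a sketch of the invertible form of reduceByKeyAndWindow with counts (invented stream; checkpointing must be enabled for this variant). The filter argument matters: without it, keys whose aggregate has been fully "subtracted" linger in state, which is one way to provoke the "neither previous window has value for key" complaint.

```scala
import org.apache.spark.streaming.Minutes

// pairs: DStream[(String, Int)] -- hypothetical keyed counts
val windowed = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,                     // fold new values entering the window
  (a: Int, b: Int) => a - b,                     // subtract values sliding out
  Minutes(120),                                  // window interval
  Minutes(60),                                   // slide interval
  filterFunc = (kv: (String, Int)) => kv._2 > 0  // drop exhausted keys from state
)
```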

Accuracy of BinaryClassificationMetrics

2016-06-11 Thread Marco Mistroni
HI all which method shall i use to verify the accuracy of a BinaryClassificationMetrics? The MulticlassMetrics has a precision() method but that is missing on BinaryClassificationMetrics. thanks marco
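
For reference, a hedged sketch of the two evaluators, where predictionAndLabels is an assumed RDD[(Double, Double)] of prediction/label pairs: BinaryClassificationMetrics exposes threshold-based curves rather than a single accuracy figure, while MulticlassMetrics works on binary labels too.

```scala
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}

val bin = new BinaryClassificationMetrics(predictionAndLabels)
println(s"area under ROC = ${bin.areaUnderROC()}")
println(s"area under PR  = ${bin.areaUnderPR()}")

// plain precision-style numbers live on MulticlassMetrics (1.x API)
val mc = new MulticlassMetrics(predictionAndLabels)
println(s"precision = ${mc.precision}")
```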

unsubscribe

2016-06-16 Thread Marco Platania
unsubscribe

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Marco Mistroni
Too little info. It'll help if you can post the exception and show your sbt file (if you are using sbt), and provide minimal details on what you are doing. kr On Fri, Jun 17, 2016 at 10:08 AM, VG wrote: > Failed to find data source: com.databricks.spark.xml > > Any suggestions to resolve this > >

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Marco Mistroni
ql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62) > ... 4 more > > Code > SQLContext sqlContext = new SQLContext(sc); > DataFrame df = sqlContext.read() > .format("org.apache.spark.xml") >

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Marco Mistroni
>>>> scala.collection.GenTraversableOnce$class* >>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381) >>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424) >>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) >>>>

Re: Python to Scala

2016-06-18 Thread Marco Mistroni
Hi Post the code. I code in python and Scala on spark.. I can give u help, though the api for Scala and python are practically the same; the only difference is in the python lambda vs Scala inline functions. Hth On 18 Jun 2016 6:27 am, "Aakash Basu" wrote: > I don't have a sound knowledge in Python and on

unsubscribe error

2016-06-18 Thread Marco Platania
Dear admin, I've tried to unsubscribe from this mailing list twice, but I'm still receiving emails. Can you please fix this? Thanks, Marco

Re: spark classloader question

2016-07-07 Thread Marco Mistroni
Hi Chen pls post 1. snippet code 2. exception any particular reason why you need to load classes in other jars programmatically? Have you tried to build a fat jar with all the dependencies? hth marco On Thu, Jul 7, 2016 at 5:05 PM, Chen Song wrote: > Sorry to spam people who are

Error starting thrift server on Spark

2016-07-11 Thread Marco Colombo
Hi all, I cannot start thrift server on spark 1.6.2 I've configured binding port and IP and left default metastore. In logs I get: 16/07/11 22:51:46 INFO NettyBlockTransferService: Server created on 46717 16/07/11 22:51:46 INFO BlockManagerMaster: Trying to register BlockManager 16/07/11 22:51:46

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-15 Thread Marco Mistroni
Dr Mich do you have any slides or videos available for the presentation you did @Canary Wharf? kindest regards marco On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh wrote: > Dear forum members > > I will be presenting on the topic of "Running Spark on Hive or Hive on > Sp

Re: Building standalone spark application via sbt

2016-07-20 Thread Marco Mistroni
I paste below the build.sbt i am using for my SparkExamples apps, hope this helps. kr marco name := "SparkExamples" version := "1.0" scalaVersion := "2.10.5" // Add a single dependency libraryDependencies += "junit" % "junit" % "4.8"

Re: RandomForestClassifier

2016-07-20 Thread Marco Mistroni
Hi afaik yes (others pls override). Generally, in RandomForest and DecisionTree you have a column which you are trying to 'predict' (the label) and a set of features that are used to predict the outcome. i would assume that if you specify the label column and the 'features' columns, everything else

Re: Building standalone spark application via sbt

2016-07-20 Thread Marco Mistroni
Spark version pre 1.4? kr marco On Wed, Jul 20, 2016 at 6:13 PM, Sachin Mittal wrote: > NoClassDefFound error was for spark classes like say SparkConext. > When running a standalone spark application I was not passing external > jars using --jars option. > > However I have fi

Re: XLConnect in SparkR

2016-07-21 Thread Marco Mistroni
Hi, have you tried to use spark-csv (https://github.com/databricks/spark-csv)? After all you can convert an XL file to CSV. hth. On Thu, Jul 21, 2016 at 4:25 AM, Felix Cheung wrote: > From looking at the XLConnect package, its loadWorkbook() function only > supports reading from local fi
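
A sketch of the suggested spark-csv route (1.x reader API; file name hypothetical), assuming the workbook has first been exported to CSV:

```scala
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line holds column names
  .option("inferSchema", "true")  // sample the file to guess column types
  .load("workbook-export.csv")
```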

HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Marco Colombo
Hi all, I have a spark application that was working in 1.5.2, but now has a problem in 1.6.2. Here is an example: val conf = new SparkConf() .setMaster("spark://10.0.2.15:7077") .setMaster("local") .set("spark.cassandra.connection.host", "10.0.2.15") .setAppName("spark

Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Marco Colombo
Thanks. That is just a typo. I'm running on 'spark://10.0.2.15:7077' (standalone). Same url used in --master in spark-submit 2016-07-21 16:08 GMT+02:00 Mich Talebzadeh : > Hi Marco > > In your code > > val conf = new SparkConf() > .setMaster("spark
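
One plausible cause, sketched under that assumption: 1.6 switched the Thrift server to multi-session mode by default, so temp tables registered in the context that started the server are no longer visible to other sessions unless single-session mode is restored.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val conf = new SparkConf()
  .setMaster("spark://10.0.2.15:7077")
  .set("spark.sql.hive.thriftServer.singleSession", "true") // restore pre-1.6 behaviour

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

df.registerTempTable("my_table")                // df: hypothetical DataFrame
HiveThriftServer2.startWithContext(hiveContext) // temp table now visible to clients
```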

Re: spark and plot data

2016-07-22 Thread Marco Colombo
but for me i work > with pyspark and it is a very wonderful machine > > my question: we don't have tools for plotting data; each time we have to > switch and go back to python for using plot. > but when you have a large result, scatter plot or roc curve, you can't use > collect to take the data. > > someone have a proposition for plot? > > thanks > > -- Ing. Marco Colombo

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Marco Mistroni
Hello Jean you can take ur current DataFrame and send them to mllib (i was doing that coz i didn't know the ml package), but the process is a little bit cumbersome 1. go from DataFrame to an RDD of LabeledPoint 2. run your ML model i'd suggest you stick to DataFrame + ml package :) hth
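
A sketch of step 1, assuming the first column is the label and the rest are numeric features (column layout is an assumption, not from the thread):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val labeled = df.rdd.map { row =>
  val label = row.getDouble(0)                 // assumed: label in column 0
  val features = Vectors.dense((1 until row.length).map(row.getDouble).toArray)
  LabeledPoint(label, features)
}
```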

Re: Fast database with writes per second and horizontal scaling

2016-07-22 Thread Marco Colombo
using Cassandra. However, he says it is too slow >>> and not user friendly/ >>> MongodDB as a doc databases is pretty neat but not fast enough >>> >>> May main concern is fast writes per second and good scaling. >>> >>> >>> Hive on Spark or Tez? >>> >>> How about Hbase. or anything else >>> >>> Any expert advice warmly acknowledged.. >>> >>> thanking you >>> >>> >>> >> >> >> -- >> Best Regards, >> Ayan Guha >> > > -- Ing. Marco Colombo

Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread Marco Mistroni
Hi So u have a data frame, then use zipWithIndex and create a tuple. I am not sure if the df API has something useful for zip with index. But u can - get a data frame - convert it to an rdd (there's a .rdd method) - do a zip with index That will give u an rdd with 3 fields... I don't think you can update df col
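
Sketched out, with the index appended by rebuilding the DataFrame, since columns cannot be updated in place:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)  // tack the position onto each row
}
val schema = StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
val dfWithIndex = sqlContext.createDataFrame(indexedRows, schema)
```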

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Marco Mistroni
ou-must-build-spark-with-hive-exception-td27390.html > > plz help me.. I couldn't find any solution..plz > > On Fri, Jul 22, 2016 at 5:50 PM, Jean Georges Perrin wrote: > >> Thanks Marco - I like the idea of sticking with DataFrames ;) >> >> >> On Jul

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Marco Mistroni
ive at step 7 without any issues (uhm i don't have matplotlib so i skipped step 5, which i guess is irrelevant as it just displays the data rather than doing any logic). Pls let me know if this fixes your problems.. hth marco On Fri, Jul 22, 2016 at 6:34 PM, Inam Ur Rehman wrote: >

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Marco Mistroni
Hi vg I believe the error msg is misleading. I had a similar one with pyspark yesterday after calling a count on a data frame, where the real error was with an incorrect user defined function being applied. Pls send me some sample code with a trimmed down version of the data and I'll see if I can rep

Re: How to generate a sequential key in rdd across executors

2016-07-24 Thread Marco Mistroni
Hi how bout creating an auto increment column in hbase? Hth On 24 Jul 2016 3:53 am, "yeshwanth kumar" wrote: > Hi, > > i am doing bulk load to hbase using spark, > in which i need to generate a sequential key for each record, > the key should be sequential across all the executors. > > i tried z

Re: Maintaining order of pair rdd

2016-07-24 Thread Marco Mistroni
Apologies, I misinterpreted. Could you post two use cases? Kr On 24 Jul 2016 3:41 pm, "janardhan shetty" wrote: > Marco, > > Thanks for the response. It is indexed order and not ascending or > descending order. > On Jul 24, 2016 7:37 AM, "Marco Mistroni

Re: UDF to build a Vector?

2016-07-24 Thread Marco Mistroni
Hi what is your source data? i am guessing a DataFrame of Integers as you are using an UDF. So your DataFrame is then a bunch of Row[Integer]? below a sample from one of my codes to predict eurocup winners, going from a DataFrame of Row[Double] to an RDD of LabeledPoint. I'm not using UDF to con

Re: Maintaining order of pair rdd

2016-07-24 Thread Marco Mistroni
ment of ID3 > next first 5 elements of ID1 to ID2. Similarly next 5 elements in that > order until the end of number of elements. > Let me know if this helps > > > On Sun, Jul 24, 2016 at 7:45 AM, Marco Mistroni > wrote: > >> Apologies I misinterpreted could you post

Hive and distributed sql engine

2016-07-24 Thread Marco Colombo
rewrite them as udaf? Thanks! -- Ing. Marco Colombo

Re: Hive and distributed sql engine

2016-07-25 Thread Marco Colombo
must have > a connection or a pool of connection per worker. Executors of the same > worker can share connection pool. > > Best > Ayan > On 25 Jul 2016 16:48, "Marco Colombo" > wrote: > >> Hi all! >> Among other use cases, I want to use spark as a dist

Re: Maintaining order of pair rdd

2016-07-25 Thread Marco Mistroni
t 1:21 AM, janardhan shetty wrote: > Thanks Marco. This solved the order problem. Had another question which is > prefix to this. > > As you can see below ID2,ID1 and ID3 are in order and I need to maintain > this index order as well. But when we do groupByKey > operation(*rd

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Marco Mistroni
Hi Kevin you should not need to rebuild everything. Instead, i believe you should launch spark-submit by specifying the kafka jar file in your --packages... i had to follow same when integrating spark streaming with flume have you checked this link ? https://spark.apache.org/docs/latest/stream
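
For example (the coordinates below are the 0.8 connector built for Spark 2.0 and Scala 2.11; adjust to your build, as the exact artifact is an assumption here):

```
spark-submit \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 \
  my-streaming-app.jar
```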

jdbcRDD and dataframe

2016-07-25 Thread Marco Colombo
Hi all, I was using JdbcRDD, whose constructor was accepting a function to get a DB connection. This is very useful to provide my own connection handler. I'm evaluating a move to dataframe, but I cannot see how to provide such a function and migrate my code. I want to use my own 'getConnect

Pls assist: Creating Spark EC2 cluster using spark_ec2.py script and a custom AMI

2016-07-25 Thread Marco Mistroni
similar issue? any suggestion on how can i use a custom AMI when creating a spark cluster? kind regards marco

Re: jdbcRDD and dataframe

2016-07-25 Thread Marco Colombo
From getConnection I'm handling a connection pool. I see no option for that in the docs. Regards On Monday 25 July 2016, Mich Talebzadeh wrote: > Hi Marco, > > what is in your UDF getConnection and why not use DF itself? > > I guess it is all connecti
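
That matches the DataFrame reader's design: it accepts connection properties rather than a connection-factory function, so a custom pool cannot be plugged in the way JdbcRDD allowed. A sketch of the properties route (URL and credentials hypothetical):

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "app_user")
props.setProperty("password", "secret")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

// each executor opens its own connections from these properties;
// there is no hook here for a caller-supplied getConnection
val df = sqlContext.read.jdbc("jdbc:oracle:thin:@//dbhost:1521/SVC", "schema.table", props)
```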

Re: jdbcRDD and dataframe

2016-07-25 Thread Marco Colombo
. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss,

Re: Maintaining order of pair rdd

2016-07-26 Thread Marco Mistroni
ardhan shetty wrote: > groupBy is a shuffle operation and index is already lost in this process > if I am not wrong and don't see *sortWith* operation on RDD. > > Any suggestions or help ? > > On Mon, Jul 25, 2016 at 12:58 AM, Marco Mistroni > wrote: > >> Hi &g

Re: Possible to push sub-queries down into the DataSource impl?

2016-07-27 Thread Marco Colombo
ering if Spark has > the hooks to allow me to try ;-) > > Cheers, > Tim > > -- Ing. Marco Colombo

Pls assist: need to create an udf that returns a LabeledPoint in pyspark

2016-07-28 Thread Marco Mistroni
hi all could anyone assist? i need to create a udf function that returns a LabeledPoint. I read that in pyspark (1.6) LabeledPoint is not supported and i have to create a StructType. Can anyone point me in some direction? kr marco

Re: Java Recipes for Spark

2016-08-01 Thread Marco Mistroni
Hi jg +1 for link. I'd add ML and graph examples if u can. -1 for programming language choice :)) kr On 31 Jul 2016 9:13 pm, "Jean Georges Perrin" wrote: > Thanks Guys - I really appreciate :)... If you have any idea of something > missing, I'll gladly add it. > > (and yeah, come on! Is

Spark SQL and number of task

2016-08-04 Thread Marco Colombo
Hi all, I've a question on how hive+spark are handling data. I've started a new HiveContext and I'm extracting data from cassandra. I've configured spark.sql.shuffle.partitions=10. Now, I've the following query: select d.id, avg(d.avg) from v_points d where id=90 group by id; I see that 10 tasks are
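
For context: spark.sql.shuffle.partitions fixes the number of post-shuffle tasks for every SQL aggregation or join, regardless of how many distinct groups survive the WHERE clause, hence 10 tasks even for a single-id GROUP BY. It can be tuned per context:

```scala
// post-shuffle parallelism for SQL aggregations/joins; the default is 200
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
```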

Re: Spark SQL and number of task

2016-08-04 Thread Marco Colombo
; > > On Thu, Aug 4, 2016 at 4:58 PM, Marco Colombo > wrote: > >> Hi all, I've a question on how hive+spark are handling data. >> >> I've started a new HiveContext and I'm extracting data from cassandra. >> I've configured spark.sql.shuffle.part

Re: Spark SQL and number of task

2016-08-04 Thread Marco Colombo
gt; > > If your query will use partition keys in C*, always use them with either > "=" or "in". If not, then you have to wait for the data transfer from C* to > spark. Spark + C* allow to run any ad-hoc queries, but you need to know the > underline price paid. >

Re: Spark2 SBT Assembly

2016-08-10 Thread Marco Mistroni
How bout all dependencies? Presumably they will all go in --jars ? What if I have 10 dependencies? Any best practices in packaging apps for spark 2.0? Kr On 10 Aug 2016 6:46 pm, "Nick Pentreath" wrote: > You're correct - Spark packaging has been shifted to not use the assembly > jar. > > To buil
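
A minimal build.sbt sketch of that packaging convention (versions illustrative): Spark itself marked provided, since spark-submit supplies it at runtime, and everything else bundled by sbt-assembly into one fat jar.

```scala
// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided",
  "joda-time"        %  "joda-time"  % "2.9.4"  // example dep that *will* be bundled
)
// project/plugins.sbt: addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
// then `sbt assembly` produces a single jar to pass to spark-submit
```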

Hive error when starting up spark-shell in 1.5.2

2015-12-19 Thread Marco Mistroni
sqlContext.sql I was wondering how can i configure hive to point to a different directory where i have more permissions kr marco

Re: Hive error when starting up spark-shell in 1.5.2

2015-12-20 Thread Marco Mistroni
Thanks Chris will give it a go and report back. Bizarrely if I start the pyspark shell I don't see any issues Kr Marco On 20 Dec 2015 5:02 pm, "Chris Fregly" wrote: > hopping on a plane, but check the hive-site.xml that's in your spark/conf > directory (or should be, a
