Re: Agenda Announced for Spark Summit 2016 in San Francisco

2016-04-18 Thread Tim To
Does anyone know of a discount promo code for Spark Summit that still works? Thanks, Tim  On Wednesday, April 6, 2016 9:44 AM, Scott walent wrote: Spark Summit 2016 (www.spark-summit.org/2016) will be held from June 6-8 at the Union Square Hilton in San Francisco, and the recently relea

Re: How many disks for spark_local_dirs?

2016-04-18 Thread Luca Guerra
Hi Mich, I have only 32 cores, I have tested with 2 GB of memory per worker to force spills to disk. My application had 12 cores and 3 cores per executor. Thank you very much. Luca From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, 15 April 2016 18:56 To: Luca Guerra Cc: us

Re: Apache Flink

2016-04-18 Thread Mich Talebzadeh
Please forgive me for going off on a tangent. Well, there may be many SQL engines but only 5-6 are in the league. Oracle has Oracle and MySQL plus TimesTen from various acquisitions. SAP has Hana, SAP ASE, SAP IQ and a few others, again from acquiring Sybase. So very few big players. Cars, Fiat owns many g

Re: How many disks for spark_local_dirs?

2016-04-18 Thread Mich Talebzadeh
.. to force spills to disk ... That is pretty smart, as at some point you are inevitably going to run out of memory. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: How many disks for spark_local_dirs?

2016-04-18 Thread Luca Guerra
Hi Jan, It's a physical server, I have launched the application with: - "spark.cores.max": "12", - "spark.executor.cores": "3" - 2 GB RAM per worker Spark version is 1.6.0, I don't use Hadoop. Thanks, Luca -Original Message- From: Jan Rock [mailto:r...@do-hadoop.c
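A minimal sketch of roughly those settings expressed in code, with spark.local.dir additionally spread over several disks (the paths are placeholders; in standalone mode the SPARK_LOCAL_DIRS environment variable on the workers takes precedence over this property):

    import org.apache.spark.{SparkConf, SparkContext}

    // Settings taken from the thread, plus a comma-separated multi-disk local dir.
    val conf = new SparkConf()
      .setAppName("local-dirs-example")
      .set("spark.cores.max", "12")
      .set("spark.executor.cores", "3")
      .set("spark.executor.memory", "2g")
      .set("spark.local.dir", "/disk1/spark-tmp,/disk2/spark-tmp")  // placeholder paths
    val sc = new SparkContext(conf)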

Re: Apache Flink

2016-04-18 Thread Mich Talebzadeh
Thanks Todd I will have a look. Regards Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 18 April 2016

Re: Apache Flink

2016-04-18 Thread Jörn Franke
What is your exact set of requirements for algo trading? Is it reacting in real time or analysis over a longer time? In the first case, I do not think a framework such as Spark or Flink makes sense. They are generic, but in order to compete with other usually custom-developed, highly-specialized en

drools on spark, how to reload rule file?

2016-04-18 Thread yaoxiaohua
Hi bros, I am trying to use Drools on Spark to parse logs, do some rule matching and derive some fields. Now I refer to one blog on Cloudera: http://blog.cloudera.com/blog/2015/11/how-to-build-a-complex-event-processing-app-on-apache-spark-and-drools/

How to handle ActiveMQ message produce in case of spark speculation

2016-04-18 Thread aneesh.mohan
I understand that duplicate results due to speculation are handled at the driver side. But what if one of my Spark tasks (in a streaming scenario) involves sending a message to ActiveMQ? Here the driver has no control over it, and once the message is sent it's done. How is this handled? -- View this message

Re: Read Parquet in Java Spark

2016-04-18 Thread Ramkumar V
Hi, any idea on this? *Thanks*, On Mon, Apr 4, 2016 at 2:47 PM, Akhil Das wrote: > I wasn't aware you have a parquet file containing json data. > > Thanks > Best Regards > > On Mon, Apr 4, 2016 at 2:44 PM, Ramkumar V > wrote: > >> Hi Akhil, >> >>

Submitting applications to Mesos with cluster deploy mode

2016-04-18 Thread Joao Azevedo
Hi! I'm trying to submit Spark applications to Mesos using the 'cluster' deploy mode. I'm using Marathon as the container orchestration platform and launching the Mesos Cluster Dispatcher through it. I'm using Spark 1.6.1 with Scala 2.11. I'm able to successfully communicate with the cluster

How to start HDFS on Spark Standalone

2016-04-18 Thread My List
Hi, I am a newbie on Spark. I wanted to know how to start and verify if HDFS has started on Spark standalone. Env - Windows 7 64-bit, Spark 1.4.1 with Hadoop 2.6, using the Scala shell - spark-shell -- Thanks, Harry

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
Once you download Hadoop and format the namenode, you can use start-dfs.sh to start HDFS. Then use 'jps' to see if the datanode/namenode services are up and running. Thanks Deepak On Mon, Apr 18, 2016 at 5:18 PM, My List wrote: > Hi , > > I am a newbie on Spark.I wanted to know how to start and ve

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread My List
Deepak, The following could be very dumb questions, so pardon me for the same. 1) When I download the binary for Spark with a version of Hadoop (Hadoop 2.6), does it not come in the zip or tar file? 2) If it does not come along, is there an Apache Hadoop for Windows? Is it in binary format or will ha

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
The binary for Spark means it's Spark built against Hadoop 2.6. It will not have any Hadoop executables. You'll have to set up Hadoop separately. I have not used the Windows version yet, but there are some. Thanks Deepak On Mon, Apr 18, 2016 at 5:43 PM, My List wrote: > Deepak, > > The following could be a

Standard deviation on multiple columns

2016-04-18 Thread Naveen Kumar Pokala
Hi, I am using spark 1.6.0. I want to find the standard deviation of columns that will come dynamically. val stdDevOnAll = columnNames.map { x => stddev(x) } causalDf.groupBy(causalDf("A"),causalDf("B"),causalDf("C")) .agg(stdDevOnAll:_*) //error line I am trying to do as above. But it i
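A sketch of one way the aggregation might be made to compile on Spark 1.6, assuming columnNames and causalDf are as in the post; the head/tail split is needed because agg takes a first Column plus varargs:

    import org.apache.spark.sql.functions.stddev

    // Build one stddev expression per dynamic column name.
    val stdDevOnAll = columnNames.map(c => stddev(causalDf(c)))
    val result = causalDf
      .groupBy(causalDf("A"), causalDf("B"), causalDf("C"))
      .agg(stdDevOnAll.head, stdDevOnAll.tail: _*)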

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread My List
Deepak, Would you advise that I use Ubuntu or Red Hat? Windows support and issues are galore on Spark. Since I am starting afresh, what would you advise? On Mon, Apr 18, 2016 at 5:45 PM, Deepak Sharma wrote: > Binary for Spark means ts spark built against hadoop 2.6 > It will not have

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
It works well with all flavors of Linux. It all depends on your experience with these flavors. Thanks Deepak On Mon, Apr 18, 2016 at 5:51 PM, My List wrote: > Deepak, > > Would you advice that I use Ubuntu? or Redhat. Cause Windows support etc > and issues are Galore on Spark. > Since I am starting afr

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread My List
Deepak, I love the Unix flavors and have programmed on them. I just have a Windows laptop and PC, hence haven't moved to the Unix flavors. Was trying to run big data stuff on Windows. Have run into so many issues that I could just throw the Windows laptop out. Your view - Redhat, Ubuntu or Cent

Re: Submitting applications to Mesos with cluster deploy mode

2016-04-18 Thread Jacek Laskowski
Hi, Not that I might help much with deployment to Mesos, but can you describe your Mesos/Marathon setup? What's the Mesos Cluster Dispatcher? Jacek 18.04.2016 12:54 PM "Joao Azevedo" wrote: > Hi! > > I'm trying to submit Spark applications to Mesos using the 'cluster' > deploy mode. I'm using

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
I prefer using CentOS/SLES/Ubuntu personally. Thanks Deepak On Mon, Apr 18, 2016 at 5:57 PM, My List wrote: > Deepak, > > I love the unix flavors have been a programmed on them. Just have a > windows laptop and pc hence haven't moved to unix flavors. Was trying to > run big data stuff on window

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
Mike, We tried that. This map task is actually part of a larger set of operations. I pointed out this map task since it involves partitionBy() and we always use partitionBy() whenever partition-unaware shuffle operations are performed (such as distinct). We in fact do not notice a change in the di
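For readers following along, a bare sketch of the pattern being described, with the partition count as an example value only:

    import org.apache.spark.HashPartitioner

    val pairRdd = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "a")))
    val partitioner = new HashPartitioner(8)
    val partitioned = pairRdd.partitionBy(partitioner)
    // distinct() is partition-unaware, so the result is partitioned again afterwards,
    // as the thread describes.
    val deduped = partitioned.distinct().partitionBy(partitioner)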

Re: Logging in executors

2016-04-18 Thread Carlos Rojas Matas
Thanks Ted, already checked it but it is not the same. I'm working with standalone Spark; the examples refer to HDFS paths, therefore I assume the Hadoop 2 Resource Manager is used. I've tried all possible flavours. The only one that worked was changing the spark-defaults.conf on every machine. I'll go w

Re: Logging in executors

2016-04-18 Thread Ted Yu
Looking through this thread, I don't see the Spark version you use. Can you tell us the Spark release? Thanks On Mon, Apr 18, 2016 at 6:32 AM, Carlos Rojas Matas wrote: > Thanks Ted, already checked it but is not the same. I'm working with > StandAlone spark, the examples refers to HDFS paths, th

RE: Logging in executors

2016-04-18 Thread Ashic Mahtab
I spent ages on this recently, and here's what I found: --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///local/file/on.executor.properties" works. Alternatively, you can also do: --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=filename.properties" --files=

Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
The limitation is in the drools implementation. Changing a rule in a stateful KB is not possible, particularly if it leads to logical contradictions with the previous version or any other rule in the KB. When we ran into this, we worked around (part of) it by salting the rule name with a unique

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Sonal Goyal
Are you specifying your spark master in the scala program? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
No. We specify it as a configuration option to the spark-submit. Does that make a difference? Regards, Raghava. On Mon, Apr 18, 2016 at 9:56 AM, Sonal Goyal wrote: > Are you specifying your spark master in the scala program? > > Best Regards, > Sonal > Founder, Nube Technologies

RE: drools on spark, how to reload rule file?

2016-04-18 Thread yaoxiaohua
Thanks for your reply, Jason. I can use a stateless session in the Spark Streaming job. But now my question is: when the rule updates, how do I pass it to the RDD? We generate a ruleExecutor (stateless session) in the main method, then pass the ruleExecutor into the RDD. I am new to Drools, I am trying to read th
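Not from the thread, but a rough sketch of one common approach: broadcast the rule text plus a version number instead of the session object, and rebuild the stateless session lazily on the executors whenever the version changes. compileRules() is a hypothetical helper standing in for the usual Drools KieServices/KieContainer setup.

    import org.kie.api.runtime.StatelessKieSession

    case class Rules(version: Long, drl: String)

    object RuleEngine {
      @volatile private var cached: (Long, StatelessKieSession) = null
      // compileRules is a placeholder that builds a StatelessKieSession from DRL text.
      def sessionFor(rules: Rules, compileRules: String => StatelessKieSession): StatelessKieSession =
        synchronized {
          if (cached == null || cached._1 != rules.version)
            cached = (rules.version, compileRules(rules.drl))
          cached._2
        }
    }

    // On the driver: re-create rulesBroadcast (a Broadcast[Rules]) when the rule file
    // changes, then read the latest value inside each batch:
    // dstream.foreachRDD { rdd =>
    //   val rules = rulesBroadcast.value
    //   rdd.foreachPartition { events =>
    //     val session = RuleEngine.sessionFor(rules, compileRules)
    //     events.foreach(session.execute(_))
    //   }
    // }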

Re: Submitting applications to Mesos with cluster deploy mode

2016-04-18 Thread Joao Azevedo
On 18/04/16 13:30, Jacek Laskowski wrote: Not that I might help much with deployment to Mesos, but can you describe your Mesos/Marathon setup? The setup is a single master with multiple slaves cluster. The master is also running Zookeeper besides 'mesos-master'. There's a DNS record pointing

Re: drools on spark, how to reload rule file?

2016-04-18 Thread Jason Nerothin
Could you post some pseudo-code? val data = rdd.whatever(…) val facts: Array[DroolsCompatibleType] = convert(data) facts.map{ f => ksession.insert( f ) } > On Apr 18, 2016, at 9:20 AM, yaoxiaohua wrote: > > Thanks for your reply , Jason, > I can use stateless session in spark streaming job.

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Steve Loughran
1. You don't need to start HDFS or anything like that, just set up Spark so that it can use the Hadoop APIs for some things; on Windows this depends on some native libs. This means you don't need to worry about learning it. Focus on the Spark APIs, Python and/or Scala. 2. You should be able to
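A minimal sketch of that first point, assuming a plain local file (the path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("no-hdfs-needed").setMaster("local[*]"))
    // file:// URLs go through the Hadoop file APIs but need no running HDFS.
    val lines = sc.textFile("file:///C:/data/input.txt")
    println(lines.count())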

Re: Submitting applications to Mesos with cluster deploy mode

2016-04-18 Thread Joao Azevedo
On 18/04/16 11:54, Joao Azevedo wrote: I'm trying to submit Spark applications to Mesos using the 'cluster' deploy mode. I'm using Marathon as the container orchestration platform and launching the Mesos Cluster Dispatcher through it. I'm using Spark 1.6.1 with Scala 2.11. Using the current mas

Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Erwan ALLAIN
Hello, I'm currently designing a solution where 2 distinct clusters Spark (2 datacenters) share the same Kafka (Kafka rack aware or manual broker repartition). The aims are - preventing DC crash: using kafka resiliency and consumer group mechanism (or else ?) - keeping consistent offset among repl

Jdbc connection from sc.addJar failing

2016-04-18 Thread chaz2505
Hi all, I'm a bit stuck with a problem that I thought was solved in SPARK-6913 but can't seem to get it to work. I'm programmatically adding a jar (sc.addJar(pathToJar)) after the SparkContext is created, then using the driver from the jar to load a table through sqlContext.read.jdbc(). When I tr
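For context, a sketch of the kind of call being described (URL, table, credentials and driver class are placeholders); as the reply below notes, the JDBC driver jar generally also has to be visible to the driver JVM, not only added via sc.addJar:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")                       // placeholder credentials
    props.setProperty("password", "secret")
    props.setProperty("driver", "com.example.jdbc.Driver")    // placeholder driver class
    val df = sqlContext.read.jdbc("jdbc:example://dbhost:5432/mydb", "my_table", props)
    df.printSchema()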

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Jason Nerothin
Hi Erwan, You might consider InsightEdge: http://insightedge.io . It has the capability of doing WAN between data grids and would save you the work of having to re-invent the wheel. Additionally, RDDs can be shared between developers in the same DC. Thanks, Jason > On

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Cody Koeninger
The current direct stream only handles exactly the partitions specified at startup. You'd have to restart the job if you changed partitions. https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work towards using the Kafka 0.10 consumer, which would allow for dynamic topic partitions
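For reference, a sketch of how the direct stream is typically created against Kafka 0.8 in Spark 1.6 (broker list and topic are placeholders; ssc is an existing StreamingContext). The topic-partitions that exist when this call is made are the ones consumed until the job is restarted, which is the limitation described above:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my_topic"))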

Re: Jdbc connection from sc.addJar failing

2016-04-18 Thread Mich Talebzadeh
You can either add the jar to spark-submit as pointed out, like ${SPARK_HOME}/bin/spark-submit \ --packages com.databricks:spark-csv_2.11:1.3.0 \ --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \ OR Create/package a fat jar file that wil

Calling Python code from Scala

2016-04-18 Thread didmar
Hi, I have a Spark project in Scala and I would like to call some Python functions from within the program. Both parts are quite big, so re-coding everything in one language is not really an option. The workflow would be: - Creating a RDD with Scala code - Mapping a Python function over this RDD

Re: Calling Python code from Scala

2016-04-18 Thread Ndjido Ardo BAR
Hi Didier, I think with PySpark you can wrap your legacy Python functions into UDFs and use them in your DataFrames. But you have to use DataFrames instead of RDDs. cheers, Ardo On Mon, Apr 18, 2016 at 7:13 PM, didmar wrote: > Hi, > > I have a Spark project in Scala and I would like to call some

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Jörn Franke
I think the easiest would be to use a Hadoop Windows distribution, such as Hortonworks. However, the Linux version of Hortonworks is a little bit more advanced. > On 18 Apr 2016, at 14:13, My List wrote: > > Deepak, > > The following could be a very dumb questions so pardon me for the same. >

Re: Calling Python code from Scala

2016-04-18 Thread Holden Karau
So if there are just a few Python functions you're interested in accessing, you can also use the pipe interface (you'll have to manually serialize your data on both ends in ways that Python and Scala can respectively parse) - but it's a very generic approach and can work with many different languages.
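A rough sketch of that pipe approach; the script name and the tab-separated line format are just assumptions for illustration:

    // Manually serialize each record to a line, stream it through an external Python
    // script, and parse the lines that come back.
    val input = sc.parallelize(Seq(("a", 1.0), ("b", 2.0)))
    val piped = input
      .map { case (k, v) => s"$k\t$v" }
      .pipe("python process.py")
      .map { line =>
        val Array(k, v) = line.split("\t")
        (k, v.toDouble)
      }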

Re: Read Parquet in Java Spark

2016-04-18 Thread Jan Rock
Hi, The Parquet section in the documentation (it's a very nice document): http://spark.apache.org/docs/latest/sql-programming-guide.html Be sure that the parquet file is created properly: http://www.do-hadoop.com/scala/csv-parquet-scala-spark-snippet/ Kind Regards, Jan > On 18 Apr 2016, at 11:44, Ra

Re: Calling Python code from Scala

2016-04-18 Thread Mohit Jaggi
When faced with this issue I followed the approach taken by pyspark and used py4j. You have to: - ensure your code is Java compatible - use py4j to call the java (scala) code from python > On Apr 18, 2016, at 10:29 AM, Holden Karau wrote: > > So if there is just a few python functions your int

Re: Calling Python code from Scala

2016-04-18 Thread Didier Marin
Hi, Thank you for the quick answers! Ndjido Ardo: actually the Python part is not legacy code. I'm not familiar with UDFs in Spark, do you have some examples of Python UDFs and how to use them in Scala code? Holden Karau: the pipe interface seems like a good solution, but I'm a bit concerned ab

inter spark application communication

2016-04-18 Thread Soumitra Johri
Hi, I have two applications: App1 and App2. On a single cluster I have to spawn 5 instances of App1 and 1 instance of App2. What would be the best way to send data from the 5 App1 instances to the single App2 instance? Right now I am using Kafka to send data from one spark application to the s

large scheduler delay

2016-04-18 Thread Darshan Singh
Hi, I have an application which uses 3 parquet files, 2 of which are large and another one is small. These files are in HDFS and are partitioned by column "col1". Now I create 3 data frames, one for each parquet file, but I pass the col1 value so that it reads the relevant data. I always read from the
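A sketch of the kind of read being described, assuming the usual .../col1=<value>/ directory layout (paths and the literal value are placeholders) so that only the matching partition directories are scanned:

    import org.apache.spark.sql.functions.col

    val value = "2016-04-18"   // placeholder partition value
    val big1  = sqlContext.read.parquet("hdfs:///data/big1").filter(col("col1") === value)
    val big2  = sqlContext.read.parquet("hdfs:///data/big2").filter(col("col1") === value)
    val small = sqlContext.read.parquet("hdfs:///data/small").filter(col("col1") === value)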

Re: Spark support for Complex Event Processing (CEP)

2016-04-18 Thread Mario Ds Briggs
Hey Mich, Luciano Will provide links with docs by tomorrow thanks Mario - Message from Mich Talebzadeh on Sun, 17 Apr 2016 19:17:38 +0100 - To: Luciano Resende

Re: Spark support for Complex Event Processing (CEP)

2016-04-18 Thread Mich Talebzadeh
great stuff Mario. Much appreciated. Mich Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 18 April 20

How to decide the number of tasks in Spark?

2016-04-18 Thread Dogtail L
Hi, When launching a job in Spark, I have great trouble deciding the number of tasks. Someone says it is better to create a task per HDFS block, i.e., make sure one task processes 128MB of input data; others suggest that the number of tasks should be twice the total cores available to th
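A small sketch of the two usual knobs, with the numbers as examples only: the input split count fixes the initial number of tasks, and repartition() changes it later in the job.

    val totalCores = 16                                            // assumed cluster size
    // Hint at more input partitions up front (roughly 2 tasks per core)...
    val rdd = sc.textFile("hdfs:///data/input", totalCores * 2)
    // ...or reshuffle to an explicit count afterwards.
    val repartitioned = rdd.repartition(totalCores * 2)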

Re: inter spark application communication

2016-04-18 Thread Michael Segel
have you thought about Akka? What are you trying to send? Why do you want them to talk to one another? > On Apr 18, 2016, at 12:04 PM, Soumitra Johri > wrote: > > Hi, > > I have two applications : App1 and App2. > On a single cluster I have to spawn 5 instances os App1 and 1 instance of >

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Michael Segel
Perhaps this is a silly question on my part…. Why do you want to start up HDFS on a single node? You only mention one windows machine in your description of your cluster. If this is a learning experience, why not run Hadoop in a VM (MapR and I think the other vendors make linux images that ca

Re: inter spark application communication

2016-04-18 Thread Michael Segel
Hi, Putting this back in the User email list… Ok, so it's not inter-job communication, but taking the output of Job 1 and using it as input to Job 2 (chaining of jobs). So you don't need anything fancy… > On Apr 18, 2016, at 1:47 PM, Soumitra Johri > wrote: > > Yes , I want to chain spa
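A minimal sketch of that kind of chaining, assuming the two applications share a filesystem path (the path and resultDF are placeholders):

    // In App1: persist the result.
    resultDF.write.mode("overwrite").parquet("hdfs:///shared/app1-output")
    // In App2: read it back as the input.
    val input = sqlContext.read.parquet("hdfs:///shared/app1-output")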

Parse XML using Java Spark

2016-04-18 Thread Jinan Alhajjaj
I would like to parse an XML file using Java, not Scala. The XML files are Wikipedia dump files. If there is any example of how to parse XML using Java, of course with Apache Spark, I would like you to share it with me. Thank you

Re: spark-ec2 hitting yum install issues

2016-04-18 Thread Anusha S
Yes, it does not work manually. I am not able to really do 'yum search' to find exact package names to try others, but I tried python-pip and it gave same error. I will post this in the link you pointed out. Thanks! On Thu, Apr 14, 2016 at 6:11 PM, Nicholas Chammas < nicholas.cham...@gmail.com> w

Parse XML using java spark

2016-04-18 Thread Jinan Alhajjaj
Thank you for your help. I would like to parse the XML file using Java, not Scala. Can you please provide me with an example of how to parse XML via Java using Spark? My XML file is a Wikipedia dump file. Thank you

Re: How to decide the number of tasks in Spark?

2016-04-18 Thread Mich Talebzadeh
Try to have a look at this doc http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/ HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Parse XML using java spark

2016-04-18 Thread Mail.com
You might look at using JAXB or StAX. If it is simple enough, use the DataFrame auto-generated schema. Pradeep > On Apr 18, 2016, at 6:37 PM, Jinan Alhajjaj wrote: > > Thank you for your help. > I would like to parse the XML file using Java not scala . Can you please > provide me with exsample

spark-submit not adding application jar to classpath

2016-04-18 Thread Saif.A.Ellafi
Hi, I am submitting a jar file to spark-submit, which has some content inside src/main/resources. I am unable to access such resources, since the application jar is not being added to the classpath. This works fine if I include the application jar also in the --driver-class-path entry. Is this
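For illustration, a sketch of the kind of resource lookup that fails when the application jar is missing from the driver classpath (the resource name is a placeholder):

    // Works only if the jar containing src/main/resources is on the driver's classpath.
    val stream = getClass.getResourceAsStream("/application.conf")
    val text = scala.io.Source.fromInputStream(stream).mkString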

[Spark 1.5.2] Log4j Configuration for executors

2016-04-18 Thread Divya Gehlot
Hi, I tried configuring logs to be written to file for the Spark driver and executors. I have two separate log4j properties files for the Spark driver and executor respectively. It's writing logs for the Spark driver, but for the executor logs I am getting the below error: java.io.FileNotFoundException: /home/hdfs/spa

Re:RE: Why Spark having OutOfMemory Exception?

2016-04-18 Thread 李明伟
Hi Samaga, Thanks very much for your reply and sorry for the delayed reply. Cassandra or Hive is a good suggestion. However, in my situation I am not sure if it will make sense. My requirement is to get the most recent 24 hours of data to generate a report. The frequency is 5 minutes. So if I use cas

Is it possible that spark worker require more resource than the cluster?

2016-04-18 Thread kramer2...@126.com
I have a standalone cluster running on one node. The ps command shows that the Worker has 1 GB of memory and the Driver has 256m. root 23182 1 0 Apr01 ?00:19:30 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.

spark sql on hive

2016-04-18 Thread Jieliang Li
Hi everyone. I use Spark SQL, but it throws an exception: Retrying creating default database after error: Error creating transactional connection factory javax.jdo.JDOFatalInternalException: Error creating transactional connection factory at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionFor

Processing millions of messages in milliseconds -- Architecture guide required

2016-04-18 Thread Deepak Sharma
Hi all, I am looking for an architecture to ingest 10 million messages in micro batches of seconds. If anyone has worked on a similar kind of architecture, can you please point me to any documentation around the same, like what the architecture should be, which components/big data ecosystem

Re: Why Spark having OutOfMemory Exception?

2016-04-18 Thread Zhan Zhang
What kind of OOM? Driver or executor side? You can use a coredump to find what caused the OOM. Thanks. Zhan Zhang On Apr 18, 2016, at 9:44 PM, 李明伟 <kramer2...@126.com> wrote: Hi Samaga Thanks very much for your reply and sorry for the delay reply. Cassandra or Hive is a good suggestion.

Re: Read Parquet in Java Spark

2016-04-18 Thread Zhan Zhang
You can try something like below, if you only have one column. val rdd = parquetFile.javaRDD().map(row => row.getAs[String](0)) Thanks. Zhan Zhang On Apr 18, 2016, at 3:44 AM, Ramkumar V <ramkumar.c...@gmail.com> wrote: HI, Any idea on this ? Thanks, [http://www.mylivesignature.com/si

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-18 Thread Prashant Sharma
Hello Deepak, It is not clear what you want to do. Are you talking about Spark Streaming? It is possible to process historical data in Spark batch mode too. You can add a timestamp field in xml/json. Spark documentation is at spark.apache.org. Spark has good inbuilt features to process json and x

Re: spark sql on hive

2016-04-18 Thread Sea
It's a bug of Hive. Please use the Hive metastore service instead of visiting MySQL directly. Set hive.metastore.uris in hive-site.xml. -- Original Message -- From: "Jieliang Li"; Date: Tue, 19 Apr 2016 12:55; To: "user"; Subject: spark sql on hive hi

Re: [Spark 1.5.2] Log4j Configuration for executors

2016-04-18 Thread Prashant Sharma
Maybe you can try creating it before running the app.