Explanation regarding Spark Streaming

2016-08-03 Thread Saurav Sinha
Hi, I have query Q1. What will happen if a Spark Streaming job has a batchDuration of 60 sec and the processing time of the complete pipeline is greater than 60 sec? -- Thanks and Regards, Saurav Sinha Contact: 9742879062

How to connect Power BI to Apache Spark on local machine?

2016-08-03 Thread Devi P.V
Hi all, I am a newbie in Power BI. What configurations are needed to connect Power BI to Spark on my local machine? I found some documents that mention Spark over Azure's HDInsight, but didn't find any reference materials for connecting to Spark on a remote machine. Is it possible? following is the pr

Re: how to run local[k] threads on a single core

2016-08-03 Thread Sun Rui
I don't think it's possible, as Spark does not support thread-to-CPU affinity. > On Aug 4, 2016, at 14:27, sujeet jog wrote: > > Is there a way we can run multiple tasks concurrently on a single core in > local mode? > > For example: I have 5 partitions ~ 5 tasks and only a single core; I want these

how to run local[k] threads on a single core

2016-08-03 Thread sujeet jog
Is there a way we can run multiple tasks concurrently on a single core in local mode? For example: I have 5 partitions ~ 5 tasks and only a single core; I want these tasks to run concurrently and to specify that they use/run on a single core. The machine itself is, say, 4 cores, but I want to utilize on

Spark 2.0 - make-distribution fails while regular build succeeded

2016-08-03 Thread Richard Siebeling
Hi, Spark 2.0 with the MapR Hadoop libraries was successfully built using the following command: ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0-mapr-1602 -DskipTests clean package However, when I then try to build a runnable distribution using the following command ./dev/make-distribution.sh --

Re: PermGen space Error

2016-08-03 Thread $iddhe$h Divekar
I am running the spark job in yarn-client mode. On Wed, Aug 3, 2016 at 8:14 PM, $iddhe$h Divekar wrote: > Hi, > > I am running spark jobs using apache oozie. > My job.properties has sparkConf which gets used in workflow.xml. > > I have tried increasing MaxPermSize using sparkConf in job.properti

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-08-03 Thread Sun Rui
--num-executors does not work for Standalone mode. Try --total-executor-cores > On Jul 26, 2016, at 00:17, Mich Talebzadeh wrote: > > Hi, > > > I am doing some tests > > I have started Spark in Standalone mode. > > For simplicity I am using one node only with 8 workers and I have 12 cores > >
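For reference, in standalone mode --total-executor-cores maps to the spark.cores.max property; a minimal Scala sketch, with a hypothetical master URL:

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent of passing --total-executor-cores 8 to spark-submit:
// cap the total cores this application may take across the standalone cluster.
val conf = new SparkConf()
  .setMaster("spark://master:7077")   // hypothetical standalone master
  .setAppName("sts-core-cap")
  .set("spark.cores.max", "8")
  .set("spark.executor.cores", "2")   // per-executor cores => at most 4 executors
val sc = new SparkContext(conf)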

Re: 2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Kiran Chitturi
Thank you! On Wed, Aug 3, 2016 at 8:45 PM, Marcelo Vanzin wrote: > The Flume connector is still available from Spark: > > http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-streaming-flume-assembly_2.11%7C2.0.0%7Cjar > > Many of the others have indeed been removed from Spark, an

Re: 2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Marcelo Vanzin
The Flume connector is still available from Spark: http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-streaming-flume-assembly_2.11%7C2.0.0%7Cjar Many of the others have indeed been removed from Spark, and can be found at the Apache Bahir project: http://bahir.apache.org/ I don't
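For reference, an sbt sketch of pulling the Twitter connector from Bahir instead; the coordinates below are an assumption (org.apache.bahir / spark-streaming-twitter), so verify the exact artifact and version on Maven Central:

// build.sbt -- sketch, assuming Bahir's 2.0.0 release built for Scala 2.11
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"         % "2.0.0" % "provided",
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0"
)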

Re: 2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Sean Owen
You're looking for http://bahir.apache.org/ On Wed, Aug 3, 2016 at 8:40 PM, Kiran Chitturi wrote: > Hi, > > When Spark 2.0.0 was released, the 'spark-streaming-twitter' package and > several other packages were not released/published to Maven Central. It looks > like these packages were removed from

2.0.0 packages for twitter streaming, flume and other connectors

2016-08-03 Thread Kiran Chitturi
Hi, When Spark 2.0.0 was released, the 'spark-streaming-twitter' package and several other packages were not released/published to Maven Central. It looks like these packages were removed from the official Spark repo. I found the replacement git repos for these missing packages at https://github.

PermGen space Error

2016-08-03 Thread $iddhe$h Divekar
Hi, I am running spark jobs using apache oozie. My job.properties has sparkConf which gets used in workflow.xml. I have tried increasing MaxPermSize using sparkConf in job.properties but that is not resolving the issue. *sparkConf*=--verbose --driver-java-options '-XX:MaxPermSize=8192M' --conf s
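As an aside, a sketch of the two Spark properties that usually carry such JVM flags (PermGen only exists up to Java 7; in yarn-client mode the driver JVM is already running, so its flag has to go on the launch command rather than into SparkConf):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // picked up by executor JVMs at launch
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m")
  // driver-side equivalent -- only effective when set before the driver JVM starts,
  // e.g. via --driver-java-options on spark-submit:
  // spark.driver.extraJavaOptions=-XX:MaxPermSize=512m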

Re: Spark 2.0: Task never completes

2016-08-03 Thread Utkarsh Sengar
Not sure what caused it, but the partition size was 3 million there. The RDD was created from mongo-hadoop 1.5.1. Earlier (mongo-hadoop 1.3 and Spark 1.5) it worked just fine; not sure what changed. As a fix, I applied a repartition(40) (where 40 varies by my processing logic) before the cartesian an
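A minimal sketch of that workaround (RDD names hypothetical):

// Bring the input down to a manageable partition count before the cartesian
// product, which otherwise multiplies the partition counts of its inputs.
val slim  = mongoRdd.repartition(40)   // 40 chosen by the processing logic
val pairs = slim.cartesian(otherRdd)   // partitions = 40 * otherRdd.getNumPartitions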

Re: how to debug spark app?

2016-08-03 Thread Sumit Khanna
I'm not really sure of the best practices on this, but I either consult localhost:4040/jobs/ etc., or better, this: val customSparkListener: CustomSparkListener = new CustomSparkListener() sc.addSparkListener(customSparkListener) class CustomSparkListener extends SparkListener { override def o
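A self-contained sketch of that listener idea (the overridden events are just examples; assumes an existing SparkContext `sc`):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

class CustomSparkListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"task ${taskEnd.taskInfo.taskId} finished in ${taskEnd.taskInfo.duration} ms")
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"stage ${stage.stageInfo.stageId} done (${stage.stageInfo.numTasks} tasks)")
}

// registration:
// sc.addSparkListener(new CustomSparkListener)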

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try turning yarn.scheduler.capacity.resource-calculator on, then check again. On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao wrote: > Using the dominant resource calculator instead of the default resource calculator will > get the expected vcores you wanted. Basically, by default YARN does not > honor cpu c

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Mungeol Heo
Try to turn "yarn.scheduler.capacity.resource-calculator" on On Wed, Aug 3, 2016 at 4:53 PM, Saisai Shao wrote: > Use dominant resource calculator instead of default resource calculator will > get the expected vcores as you wanted. Basically by default yarn does not > honor cpu cores as resource,

Re: how to debug spark app?

2016-08-03 Thread Ted Yu
Have you looked at: https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application If you use Mesos: https://spark.apache.org/docs/latest/running-on-mesos.html#troubleshooting-and-debugging On Wed, Aug 3, 2016 at 6:13 PM, glen wrote: > Any tool like gdb? Which support brea

how to debug spark app?

2016-08-03 Thread glen
Is there any tool like gdb that supports breakpoints at some line or some function?

Re: standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-08-03 Thread Michael Gummelt
DC/OS was designed to reduce the operational cost of maintaining a cluster, and DC/OS Spark runs well on it. On Sat, Jul 16, 2016 at 11:11 AM, Teng Qiu wrote: > Hi Mark, thanks, we just want to keep our system as simple as > possible, using YARN means we need to maintain a full-size hadoop > clu

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-08-03 Thread Michael Gummelt
> but Spark on Mesos is certainly lagging behind Spark on YARN regarding the features Spark uses off the scheduler backends -- security, data locality, queues, etc. If by security you mean Kerberos, we'll be upstreaming that to Apache Spark soon. It's been in DC/OS Spark for a while: https://gith

Re: how to use spark.mesos.constraints

2016-08-03 Thread Michael Gummelt
If you run your jobs with debug logging on in Mesos, it should print why the offer is being declined: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L301 On Tue, Jul 26, 2016 at 6:38 PM, Rodrick Brow

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
On 3 Aug 2016, at 22:01, Mich Talebzadeh wrote: ok in other words the result set of joining two datasets ends up with an inconsistent result, as a header from one DS is joined with a row from another DS? I am not 100% sure I got this point. Let me check if I

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Mich Talebzadeh
OK, in other words the result set of joining two datasets ends up with an inconsistent result, as a header from one DS is joined with a row from another DS? You really need to get rid of the headers one way or another before joining, or try to register them as temp tables before the join to see where the fa

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
Hi Mich, Thanks again. My issue is not when I read the CSV from a file. It is when you have a Dataset that is the output of some join operations. Any help on that? Many Thanks, Best, Carlo On 3 Aug 2016, at 21:43, Mich Talebzadeh wrote: hm odd. Otherwise you ca

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Mich Talebzadeh
hm odd. Otherwise you can try using databricks to read the CSV file. This is a Scala example: //val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868") val df = sqlContext.read.format

Re: unsubscribe

2016-08-03 Thread Daniel Lopes
please send to user-unsubscr...@spark.apache.org *Daniel Lopes* Chief Data and Analytics Officer | OneMatch c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes www.onematch.com.br On Tue, Aug 2, 2016 at 10:11

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Sean Owen
Yeah, that's libsvm format, which is 1-indexed. On Wed, Aug 3, 2016 at 12:45 PM, Tony Lane wrote: > I guess the setup of the model and usage of the vector got to me. > Setup takes position 1 , 2 , 3 - like this in the build example - "1:0.0 > 2:0.0 3:0.0" > I thought I need to follow the same nu

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
I guess the setup of the model and the usage of the vector got to me. Setup takes positions 1, 2, 3 - like this in the build example - "1:0.0 2:0.0 3:0.0" I thought I needed to follow the same numbering while creating the vector too. Thanks a bunch On Thu, Aug 4, 2016 at 12:39 AM, Sean Owen wrote: > Y

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Sean Owen
You mean "new int[] {0,1,2}" because vectors are 0-indexed. On Wed, Aug 3, 2016 at 11:52 AM, Tony Lane wrote: > Hi Sean, > > I did not understand, > I created a KMeansModel with 3 dimensions and then I am calling predict > method on this model with a 3 dimension vector ? > I am not sre what is wr

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
One more: it seems that the steps == Step 1: transform the Dataset into JavaRDD JavaRDD dataPointsWithHeader = dataset1_Join_dataset2.toJavaRDD(); and List someRows = dataPointsWithHeader.collect(); someRows.forEach(System.out::println); do not print the header. So, could I assume

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
Hi Sean, I did not understand. I created a KMeansModel with 3 dimensions and then I am calling the predict method on this model with a 3-dimensional vector. I am not sure what is wrong with this approach. Am I missing a point? Tony On Wed, Aug 3, 2016 at 11:22 PM, Sean Owen wrote: > You declare that

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Flavio
Just for clarification, this is the full stack trace: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 16/08/03 18:18:44 INFO SparkContext: Running Spark version 2.0.0 16/08/03 18:18:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... u

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread flavio marchi
Hi Sean, thanks for the reply. I just omitted the real path, which points to my file system. I just posted the real one. On 3 Aug 2016 19:09, "Sean Owen" wrote: > file: "absolute directory" > does not sound like a valid URI > > On Wed, Aug 3, 2016 at 11:05 AM, Flavio wrote: > > Hello everyo

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread flavio marchi
Thanks for the reply. Well, I don't have Hadoop installed at all. I am just running locally as an Eclipse project, so I don't know where I can configure the path as suggested in the JIRA :( On 3 Aug 2016 19:07, "Ted Yu" wrote: > SPARK-15899 ? > > On W
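A commonly cited local workaround while the fix is pending is to set spark.sql.warehouse.dir to an explicit, well-formed URI when building the session; a sketch with a hypothetical Windows path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-eclipse-test")
  .master("local[*]")
  // an explicit file: URI avoids the "Relative path in absolute URI" parse error
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()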

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Sean Owen
file: "absolute directory" does not sound like a valid URI On Wed, Aug 3, 2016 at 11:05 AM, Flavio wrote: > Hello everyone, > > I am try to run a very easy example but unfortunately I am stuck on the > follow exception: > > Exception in thread "main" java.lang.IllegalArgumentException: > java.net

Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Ted Yu
SPARK-15899 ? On Wed, Aug 3, 2016 at 11:05 AM, Flavio wrote: > Hello everyone, > > I am try to run a very easy example but unfortunately I am stuck on the > follow exception: > > Exception in thread "main" java.lang.IllegalArgumentException: >

java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Flavio
Hello everyone, I am trying to run a very easy example but unfortunately I am stuck on the following exception: Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file: "absolute directory" I was wondering if anyone got this excep

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
Thanks Mich. Yes, I know both headers (categoryRankSchema, categorySchema ) as expressed below: this.dataset1 = d1_DFR.schema(categoryRankSchema).csv(categoryrankFilePath); this.dataset2 = d2_DFR.schema(categorySchema).csv(categoryFilePath); Can you use filter to get rid of the

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Sean Owen
You declare that the vector has 3 dimensions, but then refer to its 4th dimension (at index 3). That is the error. On Wed, Aug 3, 2016 at 10:43 AM, Tony Lane wrote: > I am using the following vector definition in java > > Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1 })) >

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Mich Talebzadeh
Do you know the headers? Can you use filter to get rid of the header from both CSV files before joining them? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
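The usual pattern for dropping a header line before any join, sketched against the RDD API (path hypothetical; assumes an existing SparkContext `sc` and that no data row repeats the header text):

val lines  = sc.textFile("hdfs:///data/d1.csv")   // hypothetical input
val header = lines.first()                        // the header line
val rows   = lines.filter(_ != header)            // everything except the header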

Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
I am using the following vector definition in java Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1 })) However when I run the predict method on this vector it leads to Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3 at org.apache.spark.mllib.linalg.BL

Re: Calling KmeansModel predict method

2016-08-03 Thread Tony Lane
use factory methods in Vectors On Wed, Aug 3, 2016 at 9:54 PM, Rohit Chaddha wrote: > The predict method takes a Vector object > I am unable to figure out how to make this spark vector object for getting > predictions from my model. > > Does anyone has some code in java for this ? > > Thanks > R

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
Hi Aseem, Thank you very much for your help. Please allow me to be more specific for my case (to some extent I already do what you suggested): Let us imagine that I have two CSV datasets d1 and d2. I generate the Dataset as in the following: == Reading d1: sparkSession=spark; options =

Re: Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Aseem Bansal
Hi, Depending on how you are reading the data in the first place, can you simply use the header as a header instead of a row? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq) See the header option On Wed, Aug 3, 2016 at 10:14 PM, Car
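A minimal sketch of that option on the Spark 2.0 CSV reader (path hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-header").master("local[*]").getOrCreate()

// header=true consumes the first line as column names, so it never appears as a
// data row and therefore never leaks into later joins.
val d1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/d1.csv")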

Re: issue with coalesce in Spark 2.0.0

2016-08-03 Thread Michael Armbrust
Spark 2.0 is not binary compatible with Spark 1.x, you'll need to recompile your jar. On Tue, Aug 2, 2016 at 2:57 AM, 陈宇航 wrote: > Hi all. > > > I'm testing on Spark 2.0.0 and found an issue when using coalesce in > my code. > > The procedure is simple doing a coalesce for a RDD[Stirng],

Spark 2.0: Task never completes

2016-08-03 Thread Utkarsh Sengar
After an upgrade from 1.5.1 to 2.0, one of the tasks never completes and keeps spilling data to disk over time. long count = resultRdd.count(); LOG.info("TOTAL in resultRdd: " + count); resultRdd has a rather complex structure: JavaPairRDD> resultRdd = myRdd

Dataset and JavaRDD: how to eliminate the header.

2016-08-03 Thread Carlo . Allocca
Hi All, I would like to apply a regression to my data. One of the workflow steps is to prepare my data as a JavaRDD starting from a Dataset with its header. So, what I did was the following: == Step 1: transform the Dataset into JavaRDD JavaRDD dataPointsWithHeader = modelDS.toJavaRDD();

Re: SparkSession for RDBMS

2016-08-03 Thread Takeshi Yamamuro
Hi, If these boundaries are not given, Spark tries to read all the data as a single partition. See: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala#L56 // maropu On Wed, Aug 3, 2016 at 11:19 PM, Selvam Raman w
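For reference, a sketch of the partitioned JDBC read, which only splits the scan when a partition column and its bounds are supplied (connection details hypothetical):

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

val props = new Properties()
props.setProperty("user", "dbuser")        // hypothetical credentials
props.setProperty("password", "dbpass")

// 8 partitions, each reading a slice of the numeric column `id` in [1, 1000000]
val df = spark.read.jdbc(
  "jdbc:postgresql://dbhost:5432/sales",   // hypothetical JDBC URL
  "orders",                                // table
  "id",                                    // partition column
  1L,                                      // lowerBound
  1000000L,                                // upperBound
  8,                                       // numPartitions
  props)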

Re: How to generate a sequential key in rdd across executors

2016-08-03 Thread yeshwanth kumar
Hi Andrew, HFileOutputFormat2 needs the HBase keys to be sorted lexicographically. As per your suggestion of timestamp + hashed key, I might end up doing a sort on the RDD, which I want to avoid. If I could generate a sequential key, I don't need to do a sort; I could just write after processin

Re: [2.0.0] mapPartitions on DataFrame unable to find encoder

2016-08-03 Thread Dragisa Krsmanovic
Perfect ! That's what I was looking for. Thanks Sun ! On Tue, Aug 2, 2016 at 6:58 PM, Sun Rui wrote: > import org.apache.spark.sql.catalyst.encoders.RowEncoder > implicit val encoder = RowEncoder(df.schema) > df.mapPartitions(_.take(1)) > > On Aug 3, 2016, at 04:55, Dragisa Krsmanovic > wrote:

Re: How does MapWithStateRDD distribute the data

2016-08-03 Thread Cody Koeninger
Are you using KafkaUtils.createDirectStream? On Wed, Aug 3, 2016 at 9:42 AM, Soumitra Johri wrote: > Hi, > > I am running a steaming job with 4 executors and 16 cores so that each > executor has two cores to work with. The input Kafka topic has 4 partitions. > With this given configuration I was

Re: How does MapWithStateRDD distribute the data

2016-08-03 Thread Ben Teeuwen
Did you check the executor logs to see whether the Kafka offsets were pulled in evenly over the 4 executors? I recall a similar situation with such uneven balancing from a Kafka stream, and ended up raising the amount of resources (RAM and cores). Then it nicely balanced out. I don't understand t

Re: [Thriftserver2] Controlling number of tasks

2016-08-03 Thread Takeshi Yamamuro
Hi, HiveThriftServer2 itself has no such functionality. Have you tried adaptive execution in Spark? https://issues.apache.org/jira/browse/SPARK-9850 I have not used it yet, but it seems this experimental feature tunes the number of tasks depending on partition size. // maropu On Thu, Aug 4, 2016 a

Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mich Talebzadeh
I gather you have an issue where predicate pushdown is not taking place and it takes time to get the data? 1. How many rows do you have? 2. What version of Hive? 3. Have you analyzed statistics for this Hive table? In summary, you expect the partition in the query to be read as opposed to e

Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Gourav Sengupta
From what I am observing, your path is s3a://buckets3/day=2016-07-25 and the partition keys are day, mba_id and partition_id. Are the sub-folders in the form s3://buckets3/day=2016-07-25/mba_id=1122/partition_id=111? Can you please include the ADD PARTITION statement as well for one single partition? The ot

Calling KmeansModel predict method

2016-08-03 Thread Rohit Chaddha
The predict method takes a Vector object. I am unable to figure out how to create this Spark Vector object for getting predictions from my model. Does anyone have some code in Java for this? Thanks Rohit

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
Spark 2.0 has been released. Mind giving it a try :-) ? On Wed, Aug 3, 2016 at 9:11 AM, Rychnovsky, Dusan < dusan.rychnov...@firma.seznam.cz> wrote: > OK, thank you. What do you suggest I do to get rid of the error? > > > -- > *From:* Ted Yu > *Sent:* Wednesday, Augu

Re: [Thriftserver2] Controlling number of tasks

2016-08-03 Thread Chanh Le
I believe there is no way to reduce the number of tasks on the Hive side using coalesce, because when it comes to Hive it just reads the files, and the task count depends on the number of files you put in. So the way I did it was to coalesce at the ETL layer, producing as few files as possible to reduce IO time for reading them. > On Aug 3, 201
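A sketch of that "coalesce at the ETL layer" idea, so the table the Thrift server reads is backed by a few large files rather than many small ones (paths and counts hypothetical; assumes an existing SparkSession `spark`):

// compact a half-hourly batch before it lands in the table location
val batch = spark.read.parquet("s3a://bucket/raw/2016-08-03/")
batch
  .coalesce(16)                         // few large files instead of thousands of tiny ones
  .write
  .mode("append")
  .parquet("s3a://bucket/compacted/")   // location the Hive table / Thrift server reads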

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
OK, thank you. What do you suggest I do to get rid of the error? From: Ted Yu Sent: Wednesday, August 3, 2016 6:10 PM To: Rychnovsky, Dusan Cc: user@spark.apache.org Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memor

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
The latest QA run was no longer accessible (error 404): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59141/consoleFull Looking at the comments on the PR, there is not enough confidence in pulling in the fix into 1.6 On Wed, Aug 3, 2016 at 9:05 AM, Rychnovsky, Dusan < dusan.r

Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mehdi Meziane
Hi Mich, The data is stored as parquet. The table definition looks like : CREATE EXTERNAL TABLE nadata ( extract_date TIMESTAMP, date_formatted STRING, day_of_week INT, hour_of_day INT, entity_label STRING, entity_currency_id INT, entity_currency_label STRING, entity_margin_percenta

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
I am confused. I tried to look for a Spark release that would have this issue fixed, i.e. https://github.com/apache/spark/pull/13027/ merged in, but it looks like the patch has not been merged for 1.6. How do I get a fixed 1.6 version? Thanks, Dusan

Re: Stop Spark Streaming Jobs

2016-08-03 Thread Tony Lane
SparkSession exposes a stop() method. On Wed, Aug 3, 2016 at 8:53 AM, Pradeep wrote: > Thanks Park. I am doing the same. Was trying to understand if there are > other ways. > > Thanks, > Pradeep > > > On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee > wrote: > > > > So sorry. Your name was Pradeep !!
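For a streaming job specifically, StreamingContext has its own stop with a graceful option; a minimal sketch, assuming an existing StreamingContext `ssc`:

// finish in-flight batches, then shut down the streaming scheduler and the SparkContext
ssc.stop(stopSparkContext = true, stopGracefully = true)

// or keep the underlying SparkContext alive for other work:
// ssc.stop(stopSparkContext = false)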

Re: Error in building spark core on windows - any suggestions please

2016-08-03 Thread Sean Owen
Hm, all of the Jenkins builds are OK, but none of them run on Windows. It could be a Windows-specific thing. It means the launcher process exited abnormally. I don't see any log output from the process, so maybe it failed entirely to launch. Anyone else on Windows seeing this fail or work? On We

Re: Error in building spark core on windows - any suggestions please

2016-08-03 Thread Tony Lane
Compiling without running tests... and this is going fine .. On Wed, Aug 3, 2016 at 8:00 PM, Tony Lane wrote: > I am trying to build spark in windows, and getting the following test > failures and consequent build failures. > > [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ > spar

RE: Python : StreamingContext isn't created properly

2016-08-03 Thread Paolo Patierno
Sorry, I missed the "return" on the last line ... coming from Scala :-) Paolo Patierno, Senior Software Engineer (IoT) @ Red Hat, Microsoft MVP on Windows Embedded & IoT, Microsoft Azure Advisor Twitter: @ppatierno Linkedin: paolopatierno Blog: DevExperience From: ppatie...@live.com To: user@spark.

Re: [Thriftserver2] Controlling number of tasks

2016-08-03 Thread ayan guha
What I understand is that you have a source location where files are dropped and never removed? If that is the case, you may want to keep track of which files are already processed by your program and read only the "new" files. On 3 Aug 2016 22:03, "Yana Kadiyska" wrote: > Hi folks, I have an ETL pi

Re: [SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mich Talebzadeh
Hi, Do you have a schema definition for this Hive table? What format is this table stored in? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Change nullable property in Dataset schema

2016-08-03 Thread Kazuaki Ishizaki
Dear all, Would it be possible to let me know how to change the nullable property in a Dataset? When I looked for how to change the nullable property in a DataFrame schema, I found the following approaches. http://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe htt
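The DataFrame technique from those links carries over; a sketch that rebuilds the schema with the desired nullability and re-applies it (assumes an existing SparkSession `spark` and Dataset[Row] `df`):

import org.apache.spark.sql.types.StructType

val relaxed = StructType(df.schema.map(_.copy(nullable = true)))   // or nullable = false
val withNewSchema = spark.createDataFrame(df.rdd, relaxed)
withNewSchema.printSchema()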

How does MapWithStateRDD distribute the data

2016-08-03 Thread Soumitra Johri
Hi, I am running a streaming job with 4 executors and 16 cores so that each executor has two cores to work with. The input Kafka topic has 4 partitions. With this given configuration I was expecting MapWithStateRDD to be evenly distributed across all executors; however, I see that it uses only two

Error in building spark core on windows - any suggestions please

2016-08-03 Thread Tony Lane
I am trying to build Spark on Windows and getting the following test failures and consequent build failures. [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ spark-core_2.11 --- --- T E S T S --

Python : StreamingContext isn't created properly

2016-08-03 Thread Paolo Patierno
Hi all, I wrote following function in Python : def createStreamingContext(): conf = (SparkConf().setMaster("local[2]").setAppName("my_app")) conf.set("spark.streaming.receiver.writeAheadLog.enable", "true") sc = SparkContext(conf=conf) ssc = StreamingContext(sc, 1) ssc.checkp

SparkSession for RDBMS

2016-08-03 Thread Selvam Raman
Hi All, I would like to read data from an RDBMS into Spark (2.0) using SparkSession. How can I decide the upper boundary, lower boundary and number of partitions? Is there any specific approach available? How does Sqoop2 decide the number of partitions and the upper and lower boundaries if we are not specifying anything? -- S

[SQL] Reading from hive table is listing all files in S3

2016-08-03 Thread Mehdi Meziane
Hi all, We have a Hive table stored in S3 and registered in a Hive metastore. This table is partitioned with a key "day". So we access this table through the Spark DataFrame API as: sqlContext.read() .table("tablename") .where(col("day").between("2016-08-01","2016-08-02")) When the

OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-03 Thread Ben Teeuwen
Hi, I want to one-hot encode a column containing 56 million distinct values. My dataset is 800m rows + 17 columns. I first apply a StringIndexer, but it already breaks there, giving an OOM java heap space error. I launch my app on YARN with: /opt/spark/2.0.0/bin/spark-shell --executor-memory 10G

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
Yes, I believe I'm using Spark 1.6.0. > spark-submit --version [Spark version banner] version 1.6.0 I don't understand the ticket. It says "Fixed in 1.6.0". I have 1.6.0 and therefo

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
Are you using Spark 1.6+ ? See SPARK-11293 On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan < dusan.rychnov...@firma.seznam.cz> wrote: > Hi, > > > I have a Spark workflow that when run on a relatively small portion of > data works fine, but when run on big data fails with strange errors. In the

Re: How to get recommendation results for users in a Kafka SparkStreaming application

2016-08-03 Thread Cody Koeninger
MatrixFactorizationModel is serializable. Instantiate it on the driver, not on the executors. On Wed, Aug 3, 2016 at 2:01 AM, wrote: > hello guys: > I have an app which consumes json messages from kafka and recommend > movies for the users in those messages ,the code like this : > > >

Sparkstreaming not consistently picking files from streaming directory

2016-08-03 Thread ravi.gawai
I am using Spark Streaming for text files. I have a feeder program which moves files to the Spark streaming directory. While Spark is processing a particular file, if the feeder puts another file into the streaming directory at the same time, sometimes Spark does not pick up the file for processing. We are using sparking

RE: Spark 2.0 error: Wrong FS: file://spark-warehouse, expected: file:///

2016-08-03 Thread Ulanov, Alexander
Hi Sean, I updated the issue, could you check the changes? Best regards, Alexander -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, August 03, 2016 2:49 AM To: Utkarsh Sengar Cc: User Subject: Re: Spark 2.0 error: Wrong FS: file://spark-warehouse, expect

Spark 2.0 - Case sensitive column names while reading csv

2016-08-03 Thread Aseem Bansal
While reading CSV via DataFrameReader, how can I make column names case sensitive? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html None of the specified options mention case sensitivity http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFr
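Case handling is governed by a SQL conf rather than a reader option; a sketch, with behaviour worth verifying on your exact version and a hypothetical file path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-case").master("local[*]").getOrCreate()

// make column resolution case sensitive before querying the frame
spark.conf.set("spark.sql.caseSensitive", "true")

val df = spark.read.option("header", "true").csv("data/input.csv")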

Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
Hi, I have a Spark workflow that when run on a relatively small portion of data works fine, but when run on big data fails with strange errors. In the log files of failed executors I found the following errors: Firstly > Managed memory leak detected; size = 263403077 bytes, TID = 6524 And

[Thriftserver2] Controlling number of tasks

2016-08-03 Thread Yana Kadiyska
Hi folks, I have an ETL pipeline that drops a file every 1/2 hour. When Spark reads these files, I end up with 315K tasks for a dataframe reading a few days' worth of data. I know that with a regular Spark job I can use coalesce to come down to a lower number of tasks. Is there a way to tell HiveThriftserver

Re: converting a Dataset into JavaRDD

2016-08-03 Thread Carlo . Allocca
Problem solved. The import org.apache.spark.api.java.function.Function was missing. Thanks. Carlo On 3 Aug 2016, at 12:14, Carlo.Allocca wrote: Hi All, I am trying to convert a Dataset into JavaRDD in order to apply a linear regression. I am using spark-co

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-03 Thread Jacek Laskowski
Hi Jestin, I need to expand on this in the Spark notes. Spark can handle data locality itself, but if the Spark nodes run on separate machines from the HDFS nodes, there's always the network between them, which makes performance worse compared to co-locating the Spark and HDFS nodes. These are mere details

converting a Dataset into JavaRDD

2016-08-03 Thread Carlo . Allocca
Hi All, I am trying to convert a Dataset into JavaRDD in order to apply a linear regression. I am using spark-core_2.10, version 2.0.0 with Java 1.8. My current approach is: == Step 1: convert the Dataset into JavaRDD JavaRDD dataPoints = modelDS.toJavaRDD(); == Step 2: convert JavaRDD int

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-03 Thread Yong Zhang
Data locality is part of the job/task scheduling responsibility. So both links you specified originally are correct: one is for the standalone mode that comes with Spark, the other is for YARN. Both have this ability. But YARN, as a very popular scheduling component, comes with MUCH, MUCH more featur

Spark streaming with Flume jobs failing

2016-08-03 Thread Bhupendra Mishra
Hi team, I have integrated Spark Streaming with Flume, and both my Flume and Spark jobs fail with the following error. Your help will be highly appreciated. Many Thanks my flume configuration is as follows flume.conf *** agent1.sources = source1 agent1.sinks = sink1 agent1.channels = ch

Spark 2.0 empty result in some tpc-h queries

2016-08-03 Thread eviekas
Hello, I recently upgraded from Spark 1.6 to Spark 2.0 with Hadoop 2.7. I have some Scala code which I updated for Spark 2.0 that creates parquet tables from textfiles and runs tpc-h queries with the spark.sql module on a 1TB dataset. Most of the queries run as expected with correct results but th

Re: Extracting key word from a textual column

2016-08-03 Thread Mich Talebzadeh
Guys, We are moving off on a tangent here. The question was: what is the easiest way of finding keywords in a string column such as transactiondescription? I am aware of functions like regexp, instr, patindex etc. How is this done in general, not necessarily in Spark? For example, a naive question.

Re: Relative path in absolute URI

2016-08-03 Thread Sean Owen
https://issues.apache.org/jira/browse/SPARK-15899 On Tue, Aug 2, 2016 at 11:54 PM, Abhishek Ranjan wrote: > Hi All, > > I am trying to use spark 2.0 with hadoop-hdfs 2.7.2 with scala 2.11. I am > not able to use below API to load file from my local hdfs. > > spark.sqlContext.read > .format("co

Re: How to partition a SparkDataFrame using all distinct column values in sparkR

2016-08-03 Thread Sun Rui
SparkDataFrame.repartition() uses hash partitioning; it can guarantee that all rows with the same column value go to the same partition, but it does not guarantee that each partition contains only a single column value. Fortunately, Spark 2.0 comes with gapply() in SparkR. You can apply an R functio

Re: spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread Mich Talebzadeh
Lev, Mine worked with this dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided" [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 11.084 s] [INFO] Spark Project Tags ..

Re: spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread Saisai Shao
I guess you're referring to the Spark assembly uber jar. In Spark 2.0, there's no uber jar; instead there's a jars folder which contains all the jars required at run time. For the end user it is transparent; the way to submit a Spark application is still the same. On Wed, Aug 3, 2016 at 4:51 PM, Mic

Re: spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread Mich Talebzadeh
Just to clarify, are you building Spark 2 from downloaded source? Or are you referring to building an uber jar file from your code with mvn, to submit with spark-submit etc.? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

spark 2.0.0 - how to build an uber-jar?

2016-08-03 Thread lev
Hi, In Spark 1.5, to build an uber-jar, I would just compile the code with: mvn ... package and that would create one big jar with all the dependencies. When trying to do the same with Spark 2.0, I'm getting a tar.gz file instead. This is the full command I'm using: mvn -Pyarn -Phive -Phadoop-2.6

"object cannot be cast to Double" using pipline with pyspark

2016-08-03 Thread colin
My colleagues use Scala and I use Python. They saved a Hive table, which has DoubleType columns. However, there's no double in Python. When I use pipeline.fit(dataframe), an error occurred: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.lang.Double. I guess i

Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Saisai Shao
Using the dominant resource calculator instead of the default resource calculator will get the expected vcores you wanted. Basically, by default YARN does not honor cpu cores as a resource, so you will always see vcores = 1 no matter what number of cores you set in Spark. On Wed, Aug 3, 2016 at 12:11 PM, sa

Re: How to get recommendation results for users in a Kafka SparkStreaming application

2016-08-03 Thread luohui20001
PS: I am using Spark 1.6.1, Kafka 0.10.0.0 Thanks & Best regards! San.Luo - Original Message - From: To: "user" Subject: How to get recommendation results for users in a Kafka SparkStreaming application Date: 2016-08-03 15:01 hello guys: I have an app which consumes json m

How to get recommendation results for users in a Kafka SparkStreaming application

2016-08-03 Thread luohui20001
Hello guys: I have an app which consumes JSON messages from Kafka and recommends movies for the users in those messages. The code is like this: conf.setAppName("KafkaStreaming") val storageLevel = StorageLevel.DISK_ONLY val ssc = new StreamingContext(conf, Seconds(batchInterval.toIn
