Re: Joining data using Latitude, Longitude

2015-03-09 Thread Akhil Das
Are you using SparkSQL for the join? In that case I'm not quite sure you have a lot of options to join on the nearest coordinate. If you are using the normal Spark code (by creating key pairs on lat,lon) you can apply certain logic like trimming the lat,lon etc. If you want more specific computing
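
A minimal sketch of the trimming idea in plain Spark: key both datasets by a rounded (lat, lon) grid cell and join on the cell. The 0.1-degree grid, record types, and field names below are assumptions for illustration, not from the thread.

    import org.apache.spark.SparkContext

    // Hypothetical record types; the thread does not specify them.
    case class City(name: String, lat: Double, lon: Double)
    case class User(id: String, lat: Double, lon: Double)

    // Round coordinates to a coarse grid cell (0.1 degree, an assumed value)
    // so that nearby points share the same join key.
    def cell(lat: Double, lon: Double): (Long, Long) =
      (math.round(lat * 10), math.round(lon * 10))

    def nearestCityPerUser(sc: SparkContext, cities: Seq[City], users: Seq[User]) = {
      val cityByCell = sc.parallelize(cities).map(c => (cell(c.lat, c.lon), c))
      val userByCell = sc.parallelize(users).map(u => (cell(u.lat, u.lon), u))
      // Join on the grid cell, then keep the closest city per user.
      userByCell.join(cityByCell)
        .map { case (_, (u, c)) =>
          (u.id, (c.name, math.hypot(u.lat - c.lat, u.lon - c.lon)))
        }
        .reduceByKey((a, b) => if (a._2 < b._2) a else b)
    }

Note that users near a cell boundary can miss their true nearest city; a fuller version would also probe the neighboring cells.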

Re: ANSI Standard Supported by the Spark-SQL

2015-03-09 Thread Ravindra
From the archives in this user list, it seems that Spark-SQL is yet to achieve the SQL-92 level. But a few things are still not clear. 1. This is from an old post dated Aug 09, 2014. 2. It clearly says that it doesn't support DDL and DML operations. Does that mean all reads (select) are sql 9

Re: Spark with Spring

2015-03-09 Thread Akhil Das
It will be good if you can explain the entire use case, like what kind of requests, what sort of processing, etc. Thanks Best Regards On Mon, Mar 9, 2015 at 11:18 PM, Tarun Garg wrote: > Hi, > > I have an existing web-based system which receives requests and processes > them. This framework uses Sp

Re: saveAsTextFile extremely slow near finish

2015-03-09 Thread Akhil Das
Don't you think 1000 is too few for 160GB of data? Also, you could try using KryoSerializer and enabling RDD compression. Thanks Best Regards On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x wrote: > I'm basically running a sort using Spark. The spark program will read > from > HDFS, sort on compos
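
A sketch of the two suggested settings, applied on the SparkConf before the context is created:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("sort-job") // placeholder name
      // Kryo is usually faster and more compact than Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Compress serialized RDD partitions to trade CPU for memory/IO.
      .set("spark.rdd.compress", "true")
    val sc = new SparkContext(conf)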

Re: Read Parquet file from scala directly

2015-03-09 Thread Akhil Das
Here's a Java version https://github.com/cloudera/parquet-examples/tree/master/MapReduce It won't be that hard to port that to Scala. Thanks Best Regards On Mon, Mar 9, 2015 at 9:55 PM, Shuai Zheng wrote: > Hi All, > > > > I have a lot of parquet files, and I try to open them directly instead o
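
A rough Scala sketch of reading records directly with parquet-mr, assuming the pre-Apache package names (parquet.*) that the linked example uses; newer releases moved these classes under org.apache.parquet:

    import org.apache.hadoop.fs.Path
    import parquet.example.data.Group
    import parquet.hadoop.ParquetReader
    import parquet.hadoop.example.GroupReadSupport

    // Read records one at a time, without going through an RDD.
    val reader = new ParquetReader[Group](new Path("/data/file.parq"), new GroupReadSupport())
    try {
      var group = reader.read()
      while (group != null) {
        println(group) // a Group exposes typed per-column getters
        group = reader.read()
      }
    } finally reader.close()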

ANSI Standard Supported by the Spark-SQL

2015-03-09 Thread Ravindra
Hi All, I am new to Spark and trying to understand what SQL standard is supported by Spark. I googled around a lot but didn't get a clear answer. Somewhere I saw that Spark supports SQL-92, and at another location I found that Spark is not fully compliant with SQL-92. I also noticed that using H

Top rows per group

2015-03-09 Thread Moss
I have a SchemaRDD where I want to group by a given field F1, but want the result to be multiple rows per group, keeping only the rows that have the N top F2 field values. The issue is that the groupBy operation is an aggregation of multiple rows to a singl
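
A minimal plain-RDD workaround, assuming (F1, F2) pairs with a numeric F2 (types are assumed for illustration):

    import org.apache.spark.rdd.RDD

    // Keep the n largest F2 values for each F1 key.
    def topNPerGroup(rows: RDD[(String, Double)], n: Int): RDD[(String, Seq[Double])] =
      rows.groupByKey().mapValues(_.toSeq.sortBy(-_).take(n))

Since groupByKey materializes each whole group, heavily skewed groups may need an aggregateByKey with a bounded priority queue instead.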

Spark History server default conf values

2015-03-09 Thread Srini Karri
Hi All, What are the default values for the following conf properties if we don't set them in the conf file? # spark.history.fs.updateInterval 10 # spark.history.retainedApplications 500 Regards, Srini.
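
For reference, the documented defaults around Spark 1.2 were as follows (worth confirming against the docs for your exact version):

    spark.history.fs.updateInterval      10   (seconds between log directory checks)
    spark.history.retainedApplications   50   (application UIs retained in memory)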

Re: GraphX Snapshot Partitioning

2015-03-09 Thread Takeshi Yamamuro
Hi, Vertices are simply hash-partitioned by their 64-bit IDs, so they are evenly spread over partitions. As for edges, GraphLoader#edgeList builds edge partitions through hadoopFile(), so the initial partitions depend on InputFormat#getSplits implementations (e.g., partitions are mostly equal to 64M
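
A small sketch of loading an edge list and re-partitioning it explicitly with one of GraphX's built-in strategies (the file path is a placeholder):

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    def load(sc: SparkContext) = {
      // Initial edge partitions follow the input splits of the file.
      val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
      // Re-partition edges; EdgePartition2D bounds vertex replication.
      graph.partitionBy(PartitionStrategy.EdgePartition2D)
    }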

RE: sc.textFile() on windows cannot access UNC path

2015-03-09 Thread Wang, Ningjun (LNG-NPV)
Hi Yong, Thanks for the reply. Yes, it works with a local drive letter. But I really need to use a UNC path because the path is input at runtime. I cannot dynamically assign a drive letter to an arbitrary UNC path at runtime. Is there any workaround so that I can use a UNC path for sc.textFile(...)?

Re: Spark Streaming input data source list

2015-03-09 Thread Tathagata Das
Link to custom receiver guide https://spark.apache.org/docs/latest/streaming-custom-receivers.html On Mon, Mar 9, 2015 at 5:55 PM, Shao, Saisai wrote: > Hi Lin, > > > > AFAIK, currently there's no built-in receiver API for RDBMSs, but you can > customize your own receiver to get data from an RDBMS,

RE: Spark Streaming input data source list

2015-03-09 Thread Shao, Saisai
Hi Lin, AFAIK, currently there's no built-in receiver API for RDBMSs, but you can customize your own receiver to get data from an RDBMS; for the details you can refer to the docs. Thanks Jerry From: Cui Lin [mailto:cui@hds.com] Sent: Tuesday, March 10, 2015 8:36 AM To: Tathagata Das Cc: user@s
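
A minimal custom-receiver skeleton for polling a database, along the lines of the custom receiver guide; the JDBC URL, query, single string column, and 10-second poll interval are all placeholders:

    import java.sql.DriverManager
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class JdbcReceiver(url: String, query: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Poll in a background thread so onStart() returns immediately.
        new Thread("jdbc-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              val conn = DriverManager.getConnection(url)
              try {
                val rs = conn.createStatement().executeQuery(query)
                while (rs.next()) store(rs.getString(1)) // hand rows to Spark
              } finally conn.close()
              Thread.sleep(10000) // assumed poll interval
            }
          }
        }.start()
      }

      def onStop(): Unit = {} // the polling loop exits via isStopped()
    }

It would then be wired in with ssc.receiverStream(new JdbcReceiver(...)).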

Re: Process time series RDD after sortByKey

2015-03-09 Thread Zhan Zhang
Does a code flow similar to the following work for you, which processes each partition of an RDD sequentially? while (iterPartition < RDD.partitions.length) { val res = sc.runJob(this, (it: Iterator[T]) => someFunc, iterPartition, allowLocal = true) Some other function after processing
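
Cleaned up into a runnable shape (the per-partition function and result handling are placeholders; the allowLocal flag existed in the Spark 1.x runJob API):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def processSequentially[T](sc: SparkContext, rdd: RDD[T])(someFunc: Iterator[T] => Long): Unit = {
      var i = 0
      while (i < rdd.partitions.length) {
        // Run a job on exactly one partition before moving to the next.
        val res = sc.runJob(rdd, someFunc, Seq(i), allowLocal = true)
        println(s"partition $i -> ${res.head}")
        i += 1
      }
    }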

Re: Spark Streaming input data source list

2015-03-09 Thread Cui Lin
Tathagata, Thanks for your quick response. The link is helpful to me. Do you know any API for streaming data from an RDBMS? Best regards, Cui Lin From: Tathagata Das (t...@databricks.com) Date: Monday, March 9, 2015 at 11:28 AM To: Cui Lin (cui@hds.com) Cc: "user@spark.apache.or

sparse vector operations in Python

2015-03-09 Thread Daniel, Ronald (ELS-SDG)
Hi, Sorry to ask this, but how do I compute the sum of 2 (or more) mllib SparseVectors in Python? Thanks, Ron

Process time series RDD after sortByKey

2015-03-09 Thread Shuai Zheng
Hi All, I am processing some time series data. One day might have 500GB, so each hour is around 20GB of data. I need to sort the data before I start processing. Assume I can sort them successfully with dayRDD.sortByKey, but after that, I might have thousands of partitions (to m

Top, takeOrdered, sortByKey

2015-03-09 Thread Saba Sehrish
From: Saba Sehrish (ssehr...@fnal.gov) Date: March 9, 2015 at 4:11:07 PM CDT To: user-...@spark.apache.org Subject: Using top, takeOrdered, sortByKey I am using spark for a template matching problem. We have 77 million events in the template library, and we compare energy of eac

RE: sc.textFile() on windows cannot access UNC path

2015-03-09 Thread java8964
This is a Java problem, not really Spark. From this page: http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u You can see that using java.nio.* on JDK 7 will fix this issue. But the Path class in Hadoop uses java.io.* instead o
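
A sketch of that java.nio workaround on Windows/JDK 7, converting the UNC path to a file: URI before handing it to sc.textFile; whether Hadoop's Path accepts the resulting URI form may vary by version:

    import java.nio.file.Paths

    // java.nio parses UNC paths natively on Windows.
    val unc = """\\10.196.119.230\folder1\abc.txt"""
    val uri = Paths.get(unc).toUri.toString
    sc.textFile(uri, 4).count()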

yarn + spark deployment issues (high memory consumption and task hung)

2015-03-09 Thread pranavkrs
Yarn + Spark: I am running my spark job (on yarn) on a 6-data-node cluster with 512GB each. I was having a tough time configuring it since the job would hang in one or more tasks on any of the executors for an indefinite time. The stage can be as simple as an rdd count. And the bottleneck point is not always th

sc.textFile() on windows cannot access UNC path

2015-03-09 Thread Wang, Ningjun (LNG-NPV)
I am running Spark on Windows 2008 R2. I use sc.textFile() to load a text file using a UNC path, but it does not work. sc.textFile(raw"file:10.196.119.230/folder1/abc.txt", 4).count() Input path does not exist: file:/10.196.119.230/folder1/abc.txt org.apache.hadoop.mapred.InvalidInputException: Inp

Joining data using Latitude, Longitude

2015-03-09 Thread Ankur Srivastava
Hi, I am trying to join data based on the latitude and longitude. I have reference data which has city information with their latitude and longitude. I have a data source with user information with their latitude and longitude. I want to find the nearest city to the user's latitude and longitude.

error on training with logistic regression sgd

2015-03-09 Thread Peng Xia
Hi, I was launching a spark cluster with 4 worker nodes; each worker node contains 8 cores and 56GB RAM, and I was testing my logistic regression problem. The training set is around 1.2 million records. When I was using 2**10 (1024) features, the whole program works fine, but when I use 2**14 features

Re: Spark Streaming input data source list

2015-03-09 Thread Tathagata Das
Spark Streaming has StreamingContext.socketStream() http://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/StreamingContext.html#socketStream(java.lang.String, int, scala.Function1, org.apache.spark.storage.StorageLevel, scala.reflect.ClassTag) TD On Mon, Mar 9, 2015 at 11:37 AM,
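
For the common case of newline-delimited text over a socket there is also the simpler variant (host, port, and batch interval are examples):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    // Connects to host:port and yields one String per line received.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()
    ssc.start()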

From Spark web ui, how to prove the parquet column pruning working

2015-03-09 Thread java8964
Hi, Currently most of the data in our production is using Avro + Snappy. I want to test the benefits if we store the data in Parquet format. I changed our ETL to generate the Parquet format instead of Avro, and want to test a simple sql in Spark SQL, to verify the benefits from Parquet. I g

Re: Can't cache RDD of collaborative filtering on MLlib

2015-03-09 Thread Xiangrui Meng
cache() is lazy. The data is stored into memory after the first time it gets materialized. So the first time you call `predict` after you load the model back from HDFS, it still takes time to load the actual data. The second time will be much faster. Or you can call `userJavaRDD.count()` and `produ
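
A small illustration of that laziness, for any cached RDD (path is a placeholder):

    val data = sc.textFile("hdfs:///ratings").cache() // lazy: nothing cached yet
    data.count() // first action materializes and caches the partitions
    data.count() // now served from memory, much faster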

Re: MLlib/kmeans newbie question(s)

2015-03-09 Thread Xiangrui Meng
You need to change `== 1` to `== i`. `println(t)` happens on the workers, which may not be what you want. Try the following: noSets.filter(t => model.predict(Utils.featurize(t)) == i).collect().foreach(println) -Xiangrui On Sat, Mar 7, 2015 at 3:20 PM, Pierce Lamb wrote: > Hi all, > > I'm very
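
Put together, the corrected loop would look like this (model, noSets, and Utils.featurize come from the original thread; numClusters is assumed):

    for (i <- 0 until numClusters) {
      println(s"Cluster $i:")
      // collect() brings the matches to the driver so println is visible there.
      noSets.filter(t => model.predict(Utils.featurize(t)) == i)
        .collect()
        .foreach(println)
    }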

Spark Streaming input data source list

2015-03-09 Thread Cui Lin
Dear all, Could you send me a list of input data sources that spark streaming could support? My list is HDFS, Kafka, text files… I am wondering if spark streaming could directly read data from a certain port (e.g., 443) that my devices directly send to? Best regards, Cui Lin

distcp problems on ec2 standalone spark cluster

2015-03-09 Thread roni
I got past the issue with the cluster not starting by setting mapreduce.framework.name to yarn. But when I try to distcp, if I use a URI with s3://path to my bucket, I get invalid path even though the bucket exists. If I use s3n:// it just hangs. Did anyone else face anything like that

Spark with Spring

2015-03-09 Thread Tarun Garg
Hi, I have an existing web-based system which receives requests and processes them. This framework uses the Spring framework. Now I am planning to separate out this business logic and put it in Spark Streaming. I am not sure how valuable using the Spring framework in streaming would be. Any suggestion is we

saveAsTextFile extremely slow near finish

2015-03-09 Thread mingweili0x
I'm basically running a sort using Spark. The spark program will read from HDFS, sort on composite keys, and then save the partitioned result back to HDFS. Pseudo code is like this: input = sc.textFile pairs = input.mapToPair sorted = pairs.sortByKey values = sorted.values values.saveAsTextFile

GraphX Snapshot Partitioning

2015-03-09 Thread Matthew Bucci
Hello, I am working on a project where we want to split graphs of data into snapshots across partitions and I was wondering what would happen if one of the snapshots we had was too large to fit into a single partition. Would the snapshot be split over the two partitions equally, for example, and h

java.lang.RuntimeException: Couldn't find function Some

2015-03-09 Thread Patcharee Thongtra
Hi, In my spark application I queried a Hive table and tried to take only one record, but got java.lang.RuntimeException: Couldn't find function Some val rddCoOrd = sql("SELECT date, x, y FROM coordinate where order by date limit 1") val resultCoOrd = rddCoOrd.take(1)(0) Any ideas? I

Read Parquet file from scala directly

2015-03-09 Thread Shuai Zheng
Hi All, I have a lot of parquet files, and I try to open them directly instead of loading them into an RDD in the driver (so I can optimize some performance through special logic). But I did some research online and can't find any example of accessing parquet directly from Scala; has anyone done this befor

Re: failure to display logs on YARN UI with log aggregation on

2015-03-09 Thread Ted Yu
See http://search-hadoop.com/m/JW1q5AneoE1 Cheers On Mon, Mar 9, 2015 at 7:29 AM, rok wrote: > I'm using log aggregation on YARN with Spark and I am not able to see the > logs through the YARN web UI after the application completes: > > Failed redirect for container_1425390894284_0066_01_01
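
That redirect failure usually means yarn.log.server.url is unset; a typical yarn-site.xml entry pointing at the MapReduce JobHistory server would be (hostname is a placeholder):

    <property>
      <name>yarn.log.server.url</name>
      <value>http://historyserver.example.com:19888/jobhistory/logs</value>
    </property>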

How to preserve/preset partition information when load time series data?

2015-03-09 Thread Shuai Zheng
Hi All, I have a set of time series data files; they are in parquet format and the data for each day are stored following a naming convention, but I will not know how many files there are for one day. 20150101a.parq 20150101b.parq 20150102a.parq 20150102b.parq 20150102c.parq . 201501010a.parq . N

Re: Solve least square problem of the form min norm(A x - b)^2 + lambda * n * norm(x)^2 ?

2015-03-09 Thread Burak Yavuz
Hi Jaonary, The RowPartitionedMatrix is a special case of the BlockMatrix, where the colsPerBlock = nCols. I hope that helps. Burak On Mar 6, 2015 9:13 AM, "Jaonary Rabarisoa" wrote: > Hi Shivaram, > > Thank you for the link. I'm trying to figure out how can I port this to > mllib. May you can
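
For reference, the problem in the subject line has the standard ridge-regression closed form (a textbook derivation, not from the thread):

    \min_x \|Ax - b\|_2^2 + \lambda n \|x\|_2^2
    \quad\Rightarrow\quad (A^\top A + \lambda n I)\, x^\star = A^\top b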

Re: what are the types of tasks when running ALS iterations

2015-03-09 Thread Burak Yavuz
+user On Mar 9, 2015 8:47 AM, "Burak Yavuz" wrote: > Hi, > In the web UI, you don't see every single task. You see the name of the > last task before the stage boundary (which is a shuffle like a groupByKey), > which in your case is a flatMap. Therefore you only see flatMap in the UI. > The group

failure to display logs on YARN UI with log aggregation on

2015-03-09 Thread rok
I'm using log aggregation on YARN with Spark and I am not able to see the logs through the YARN web UI after the application completes: Failed redirect for container_1425390894284_0066_01_01 Failed while trying to construct the redirect url to the log server. Log Server url may not b

Re: issue creating spark context with CDH 5.3.1

2015-03-09 Thread Sean Owen
This one is CDH-specific and is already answered in the forums, so I'd go there instead. Ex: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-sql-and-Hive-tables/td-p/22051 On Mon, Mar 9, 2015 at 12:33 PM, sachin Singh wrote: > Hi, > I am using CDH5.3.1 > I am getting bello

Re: issue creating spark context with CDH 5.3.1

2015-03-09 Thread sachin Singh
I have copied hive-site.xml to spark conf folder "cp /etc/hive/conf/hive-site.xml /usr/lib/spark/conf" -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issue-creating-spark-context-with-CDH-5-3-1-tp21968p21969.html Sent from the Apache Spark User List mailin

issue creating spark context with CDH 5.3.1

2015-03-09 Thread sachin Singh
Hi, I am using CDH5.3.1. I am getting the below error; even the spark context is not getting created. I am submitting my job like this - submit command: spark-submit --jars ./analiticlibs/utils-common-1.0.0.jar,./analiticlibs/mysql-connector-java-5.1.17.jar,./analiticlibs/log4j-1.2.17.jar,./analiti

Re: How to use the TF-IDF model?

2015-03-09 Thread Jeffrey Jedele
Hi, well, it really depends on what you want to do ;) TF-IDF is a measure that originates in the information retrieval context and that can be used to judge the relevancy of a document in context of a given search term. It's also often used for text-related machine learning tasks. E.g. have a loo
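
A minimal MLlib sketch of producing TF-IDF vectors (Spark 1.2-era API; the input path and whitespace tokenization are placeholders):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val docs: RDD[Seq[String]] = sc.textFile("hdfs:///docs").map(_.split(" ").toSeq)
    val tf = new HashingTF().transform(docs) // hashed term frequencies
    tf.cache()                               // IDF makes two passes over tf
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)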

Re: Optimizing SQL Query

2015-03-09 Thread anamika gupta
Please find the query plan scala> sqlContext.sql("SELECT dw.DAY_OF_WEEK, dw.HOUR, avg(dw.SDP_USAGE) AS AVG_SDP_USAGE FROM (SELECT sdp.WID, DAY_OF_WEEK, HOUR, SUM(INTERVAL_VALUE) AS SDP_USAGE FROM (SELECT * FROM date_d AS dd JOIN interval_f AS intf ON intf.DATE_WID = dd.WID WHERE intf.DATE_WID >= 2

Is there any problem in having a long opened connection to spark sql thrift server

2015-03-09 Thread fanooos
I have some applications developed using PHP, and currently we have a problem connecting these applications to the Spark SQL Thrift server. ( Here is the problem I am talking about.

How to build Spark and run examples using Intellij ?

2015-03-09 Thread MEETHU MATHEW
Hi, I am trying to run examples of Spark (master branch from git) from IntelliJ (14.0.2) but facing errors. These are the steps I followed: 1. git clone the master branch of apache spark. 2. Build it using mvn -DskipTests clean install 3. In IntelliJ select Import Projects and choose the POM.xml

RE: A strange problem in spark sql join

2015-03-09 Thread Dai, Kevin
No, I don't have two master instances. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: March 9, 2015 15:03 To: Dai, Kevin Cc: user@spark.apache.org Subject: Re: A strange problem in spark sql join Make sure you don't have two master instances running on the same machine. It could happen li

Ensuring data locality when opening files

2015-03-09 Thread Daniel Haviv
Hi, We wrote a Spark Streaming app that receives file names on HDFS from Kafka and opens them using Hadoop's libraries. The problem with this method is that I'm not utilizing data locality, because any worker might open any file with no preference for local blocks. I can't open the files using

Re: A way to share RDD directly using Tachyon?

2015-03-09 Thread Akhil Das
Did you try something like: myRDD.saveAsObjectFile("tachyon://localhost:19998/Y") val newRDD = sc.objectFile[MyObject]("tachyon://localhost:19998/Y") Thanks Best Regards On Sun, Mar 8, 2015 at 3:59 PM, Yijie Shen wrote: > Hi, > > I would like to share a RDD in several Spark Applications, > i.

Re: No executors allocated on yarn with latest master branch

2015-03-09 Thread Sandy Ryza
You would have needed to configure it by setting yarn.scheduler.capacity.resource-calculator to something ending in DominantResourceCalculator. If you haven't configured it, there's a high probability that the recently committed https://issues.apache.org/jira/browse/SPARK-6050 will fix your proble

Re: A strange problem in spark sql join

2015-03-09 Thread Akhil Das
Make sure you don't have two master instances running on the same machine. It could happen that you were running the job and in the middle you tried to stop the cluster, which didn't completely stop it, and you did a start-all again, which will eventually end up having 2 master instances running, a

Re: How to load my ML model?

2015-03-09 Thread Akhil Das
It seems you are doing some wrong conversions; it would be good if you could paste the piece of code. Thanks Best Regards On Mon, Mar 9, 2015 at 12:24 PM, Xi Shen wrote: > Hi, > > I used the method on this > http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classi