Re: pyspark and hdfs file name

2014-11-14 Thread Oleg Ruchovets
Hi Davies. Thank you for the quick answer. I have code like this: sc = SparkContext(appName="TAD") lines = sc.textFile(sys.argv[1], 1) result = lines.map(doSplit).groupByKey().map(lambda (k,vc): traffic_process_model(k,vc)) result.saveAsTextFile(sys.argv[2]) Can you please give short e

Re: toLocalIterator in Spark 1.0.0

2014-11-14 Thread Deep Pradhan
val iter = toLocalIterator (rdd) This is what I am doing and it says error: not found On Fri, Nov 14, 2014 at 12:34 PM, Patrick Wendell wrote: > It looks like you are trying to directly import the toLocalIterator > function. You can't import functions, it should just appear as a > method of an

Re: pyspark and hdfs file name

2014-11-14 Thread Davies Liu
On Fri, Nov 14, 2014 at 12:14 AM, Oleg Ruchovets wrote: > Hi Davies. > Thank you for the quick answer. > > I have code like this: > > > > sc = SparkContext(appName="TAD") > lines = sc.textFile(sys.argv[1], 1) > result = lines.map(doSplit).groupByKey().map(lambda (k,vc): > traffic_process_mo

Re: toLocalIterator in Spark 1.0.0

2014-11-14 Thread Andrew Ash
Deep, toLocalIterator is a method on the RDD class. So try this instead: rdd.toLocalIterator() On Fri, Nov 14, 2014 at 12:21 AM, Deep Pradhan wrote: > val iter = toLocalIterator (rdd) > > This is what I am doing and it says error: not found > > On Fri, Nov 14, 2014 at 12:34 PM, Patrick Wend
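
For reference, a minimal sketch of the usage described above, assuming a SparkContext named sc is already available (e.g. in the shell):

    // toLocalIterator is a method on RDD, not a standalone function.
    // It returns the elements to the driver one partition at a time.
    val rdd = sc.parallelize(1 to 10, 2)
    val iter: Iterator[Int] = rdd.toLocalIterator
    iter.foreach(println)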

same error of SPARK-1977 while using trainImplicit in mllib 1.0.2

2014-11-14 Thread aaronlin
Hi folks, Although SPARK-1977 says this problem is resolved in 1.0.2, I still hit it while running the script on AWS EC2 via spark-ec2.py. I checked SPARK-1977 and found that twitter.chill resolves the problem in v0.4.0, not v0.3.6, but Spark depends on twitter.chill v0.3.6 ba

Spark Memory Hungry?

2014-11-14 Thread TJ Klein
Hi, I am using PySpark (1.1) and I am using it for some image processing tasks. The images (RDD) are on the order of several MB to low/mid two-digit MB. However, when using the data and running operations on it using Spark, memory usage blows up. Is there anything I can do about it? I

Re: Spark Memory Hungry?

2014-11-14 Thread Andrew Ash
TJ, what was your expansion factor between image size on disk and in memory in pyspark? I'd expect in memory to be larger due to Java object overhead, but don't know the exact amounts you should expect. On Fri, Nov 14, 2014 at 12:50 AM, TJ Klein wrote: > Hi, > > I am using PySpark (1.1) and I a

EmptyRDD

2014-11-14 Thread Deep Pradhan
How to create an empty RDD in Spark? Thank You

Re: EmptyRDD

2014-11-14 Thread Ted Yu
See http://spark.apache.org/docs/0.8.1/api/core/org/apache/spark/rdd/EmptyRDD.html On Nov 14, 2014, at 2:09 AM, Deep Pradhan wrote: > How to create an empty RDD in Spark? > > Thank You

Re: EmptyRDD

2014-11-14 Thread Deep Pradhan
To get an empty RDD, I did this: I have an rdd with one element. I created another rdd using filter so that the second rdd does not contain anything. I achieved what I wanted but I want to know whether there is an efficient way to achieve this. This is a very crude way of creating an empty RDD. Is

Re: EmptyRDD

2014-11-14 Thread Gerard Maas
If I remember correctly, EmptyRDD is private[spark]. You can create an empty RDD using the Spark context: val emptyRdd = sc.emptyRDD -kr, Gerard. On Fri, Nov 14, 2014 at 11:22 AM, Deep Pradhan wrote: > To get an empty RDD, I did this: > > I have an rdd with one element. I created another rd

saveAsParquetFile throwing exception

2014-11-14 Thread vdiwakar.malladi
Hi, I'm trying to load JSON file and store the same as parquet file from a standalone program. But at the time of saving parquet file, I'm getting the following exception. Can anyone help me on this. Exception in thread "main" java.lang.RuntimeException: Unsupported dataType: StructType(ArrayBuff

Spark streaming fault tolerance question

2014-11-14 Thread François Garillot
Hi guys, I have a question about how the basics of D-Streams, accumulators, failure and speculative execution interact. Let's say I have a streaming app that takes a stream of strings, formats them (let's say it converts each to Unicode), and prints them (e.g. on a news ticker). I know print() by

Re: EmptyRDD

2014-11-14 Thread Gerard Maas
It looks like a Scala issue. It seems the implicit conversion to ArrayOps does not apply if the type is Array[Nothing]. Try giving a type to the empty RDD: val emptyRdd: RDD[Any] = sc.emptyRDD emptyRdd.collect.foreach(println) // prints a line return -kr, Gerard. On Fri, Nov 14, 2014 at 1
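
A minimal sketch combining the two suggestions in this thread, assuming an existing SparkContext sc (as in the shell):

    import org.apache.spark.rdd.RDD

    // sc.emptyRDD produces an RDD[Nothing] by default; giving it an explicit
    // element type avoids surprises like the Array[Nothing] issue above.
    val emptyRdd: RDD[String] = sc.emptyRDD[String]
    println(emptyRdd.count())           // 0
    emptyRdd.collect().foreach(println) // prints nothing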

1gb file processing...task doesn't launch on all the nodes...Unseen exception

2014-11-14 Thread Priya Ch
Hi All, We have set up a 2-node cluster (NODE-DSRV05 and NODE-DSRV02), each having 32 GB RAM, 1 TB hard disk capacity, and 8 CPU cores. We have set up HDFS, which has 2 TB capacity, and the block size is 256 MB. When we try to process a 1 GB file on Spark, we see the following exception: 14/11/

Re: saveAsParquetFile throwing exception

2014-11-14 Thread Cheng Lian
Which version are you using? You probably hit this bug https://issues.apache.org/jira/browse/SPARK-3421 if some field name in the JSON contains characters other than [a-zA-Z0-9_]. This has been fixed in https://github.com/apache/spark/pull/2563 On 11/14/14 6:35 PM, vdiwakar.malladi wrote: Hi,

Re: saveAsParquetFile throwing exception

2014-11-14 Thread vdiwakar.malladi
Thanks for your response. I'm using Spark 1.1.0 Currently I have the spark setup which comes with Hadoop CDH (using cloudera manager). Could you please suggest how I can make use of the patch? Thanks in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nab

Re: 1gb file processing...task doesn't launch on all the nodes...Unseen exception

2014-11-14 Thread Akhil Das
It shows a NullPointerException; could your data be corrupted? Try putting a try/catch inside the operation that you are doing. Are you running the worker process on the master node also? If not, then only 1 node will be doing the processing. If yes, then try setting the level of parallelism and numb
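
As a hedged illustration of the "try/catch inside the operation" suggestion (the input path, field index, and partition count below are placeholders, and sc is an existing SparkContext):

    import scala.util.Try

    // Wrap the per-record work in Try so a corrupt line is skipped
    // instead of failing the task with a NullPointerException.
    val parsed = sc.textFile("hdfs://namenode:8020/data/input-1gb.txt", 64)
      .flatMap(line => Try(line.split(",")(1).trim.toDouble).toOption)

    println(parsed.count())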

Re: saveAsParquetFile throwing exception

2014-11-14 Thread Cheng Lian
Hm, I'm not sure whether this is the official way to upgrade CDH Spark, maybe you can checkout https://github.com/cloudera/spark, apply required patches, and then compile your own version. On 11/14/14 8:46 PM, vdiwakar.malladi wrote: Thanks for your response. I'm using Spark 1.1.0 Currently I

Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
Hi, I am trying to read a HDFS file from Spark "scheduler code". I could find how to write hdfs read/writes in java. But I need to access hdfs from spark using scala. Can someone please help me in this regard.

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Akhil Das
like this? val file = sc.textFile("hdfs://localhost:9000/sigmoid/input.txt") Thanks Best Regards On Fri, Nov 14, 2014 at 9:02 PM, rapelly kartheek wrote: > Hi, > I am trying to read a HDFS file from Spark "scheduler code". I could find > how to write hdfs read/writes in java. > > But I need t

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
No. I am not accessing HDFS from either the shell or a Spark application. I want to access it from the Spark "scheduler code". I get an error when I use sc.textFile(), as the SparkContext wouldn't have been created yet. So the error says: "sc not found". On Fri, Nov 14, 2014 at 9:07 PM, Akhil Das wrote: > like t

Re: Skipping Bad Records in Spark

2014-11-14 Thread Gerard Maas
You can combine map and filter in one operation using collect(PartialFunction) [1] val cleanData = rawData.collect{case x if (condition(x)) f(x) } [1] **Not to be confused with the parameterless rdd.collect() that triggers computations and delivers the results to the driver! ** PS: use the use
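
A small self-contained sketch of that pattern; condition and f below are placeholders for the reader's own predicate and transform, and sc is an existing SparkContext:

    // RDD.collect(PartialFunction) keeps only the elements the partial
    // function is defined for and transforms them in the same pass.
    val rawData = sc.parallelize(Seq("1", "2", "oops", "42"))

    def condition(x: String): Boolean = x.nonEmpty && x.forall(_.isDigit)
    def f(x: String): Int = x.toInt * 10

    val cleanData = rawData.collect { case x if condition(x) => f(x) }
    cleanData.foreach(println) // 10, 20, 420; not the same as rdd.collect()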

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Akhil Das
Can you not create a SparkContext inside the scheduler code? If you are just looking to access HDFS, then you can use the following object; with it, you can create/read/write files. val hdfs = org.apache.hadoop.fs.FileSystem.get(new URI("hdfs://localhost:9000"), hadoopConf) Thanks Best Regards On

RE: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Bui, Tri
It should be val file = sc.textFile("hdfs:///localhost:9000/sigmoid/input.txt") 3 “///” Thanks Tri From: rapelly kartheek [mailto:kartheek.m...@gmail.com] Sent: Friday, November 14, 2014 9:42 AM To: Akhil Das; user@spark.apache.org Subject: Re: Read a HDFS file from Spark using HDFS API No. I

User Authn and Authz in Spark missing ?

2014-11-14 Thread Zeeshan Ali Shah
Hi, I am facing an issue as a cloud sysadmin: when the Spark master is launched on a public IP, anyone who knows the Spark URL can submit jobs to it. Is there any way/hack to get authn and authz in Spark? I tried to look into it but could not find anything. Any hint? -- Regards Zeeshan Ali Shah System

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Akhil Das
[image: Inline image 1] Thanks Best Regards On Fri, Nov 14, 2014 at 9:18 PM, Bui, Tri < tri@verizonwireless.com.invalid> wrote: > It should be > > > > val file = sc.textFile("hdfs:///localhost:9000/sigmoid/input.txt") > > > > 3 “///” > > > > Thanks > > Tri > > > > *From:* rapelly kartheek [

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
I'll just try out the object Akhil provided. There was no problem working in the shell with sc.textFile. Thank you Akhil and Tri. On Fri, Nov 14, 2014 at 9:21 PM, Akhil Das wrote: > [image: Inline image 1] > > > Thanks > Best Regards > > On Fri, Nov 14, 2014 at 9:18 PM, Bui, Tri < > tri@verizo

Set worker log configuration when running "local[n]"

2014-11-14 Thread Jim Carroll
How do I set the log level when running "local[n]"? It ignores the log4j.properties file on my classpath. I also tried to set the spark home dir on the SparkConfig using setSparkHome and made sure an appropriate log4j.properties file was in a "conf" subdirectory and that didn't work either. I'm r

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
Hi Akhil, I get the error: "not found: value URI" On Fri, Nov 14, 2014 at 9:29 PM, rapelly kartheek wrote: > I'll just try out the object Akhil provided. > There was no problem working in the shell with sc.textFile. > > Thank you Akhil and Tri. > > On Fri, Nov 14, 2014 at 9:21 PM, Akhil Das > wro
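
Putting the suggestion together as a self-contained sketch (the "not found: value URI" error just means java.net.URI has to be imported; the namenode address and path below are placeholders):

    import java.net.URI

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Plain Hadoop FileSystem API; works without a SparkContext.
    val hadoopConf = new Configuration()
    val hdfs = FileSystem.get(new URI("hdfs://localhost:9000"), hadoopConf)

    val in = hdfs.open(new Path("/sigmoid/input.txt"))
    val contents = scala.io.Source.fromInputStream(in).mkString
    in.close()
    println(contents)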

Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Simone Franzini
Let's say I have to apply a complex sequence of operations to a certain RDD. In order to make code more modular/readable, I would typically have something like this: object myObject { def main(args: Array[String]) { val rdd1 = function1(myRdd) val rdd2 = function2(rdd1) val rdd3 = fu

Re: Set worker log configuration when running "local[n]"

2014-11-14 Thread Jim Carroll
Actually, it looks like it's Parquet logging that I don't have control over. For some reason the parquet project decided to use java.util logging with its own logging configuration. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-worker-log-configuratio

Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2

2014-11-14 Thread Xiangrui Meng
If you use the Kryo serializer, you need to register mutable.BitSet and Rating: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102 The JIRA was marked resolved because chill resolved the problem in v0.4.0 and we have this workaro
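
For reference, a sketch of that registration along the lines of the linked MovieLensALS example (the class and variable names here are illustrative, not the exact example code):

    import scala.collection.mutable

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.recommendation.Rating
    import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}

    // Register the classes ALS shuffles so Kryo can serialize them.
    class ALSRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Rating])
        kryo.register(classOf[mutable.BitSet])
      }
    }

    val conf = new SparkConf()
      .setAppName("MovieLensALS")
      .set("spark.serializer", classOf[KryoSerializer].getName)
      .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)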

Re: flatMap followed by mapPartitions

2014-11-14 Thread Debasish Das
mapPartitions tried to hold the data in memory, which did not work for me... I am doing flatMap followed by groupByKey now with a HashPartitioner and the number of blocks is 60 (based on the 120 cores I am running the job on)... Now when the shuffle size is < 100 GB it works fine...as flatMap shuffle goes to 200 GB
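
A rough sketch of the flatMap-then-groupByKey shape described above, with an explicit HashPartitioner controlling the number of shuffle blocks (the input path and both partition counts are placeholders, and sc is an existing SparkContext):

    import org.apache.spark.HashPartitioner

    val pairs = sc.textFile("hdfs://namenode:8020/data/events", 120)
      .flatMap(line => line.split(" ").map(token => (token, 1L)))

    // 60 reduce-side blocks, as in the description above.
    val grouped = pairs.groupByKey(new HashPartitioner(60))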

How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
I'm running a local spark master ("local[n]"). I cannot seem to turn off the parquet logging. I tried: 1) Setting a log4j.properties on the classpath. 2) Setting a log4j.properties file in a spark install conf directory and pointing to the install using setSparkHome 3) Editing the log4j-default.

Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Rishi Yadav
How about using the fluent style of Scala programming? On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini wrote: > Let's say I have to apply a complex sequence of operations to a certain > RDD. > In order to make code more modular/readable, I would typically have > something like this: > > object myO

Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Sean Owen
This code executes on the driver, and an "RDD" here is really just a handle on all the distributed data out there. It's a local bookkeeping object. So, manipulation of these objects themselves in the local driver code has virtually no performance impact. These two versions would be about identical*
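
As a concrete illustration of this point (function1/function2/function3 below are placeholder transformations and sc is an existing SparkContext), both styles build exactly the same lineage:

    import org.apache.spark.rdd.RDD

    def function1(rdd: RDD[String]): RDD[String] = rdd.map(_.trim)
    def function2(rdd: RDD[String]): RDD[String] = rdd.filter(_.nonEmpty)
    def function3(rdd: RDD[String]): RDD[Int] = rdd.map(_.length)

    val myRdd = sc.parallelize(Seq(" a ", "", "bbb"))

    // Named intermediates: each val is just a cheap local handle.
    val rdd1 = function1(myRdd)
    val rdd2 = function2(rdd1)
    val rdd3 = function3(rdd2)

    // Fluent chaining: the same lineage, without intermediate names.
    val chained = function3(function2(function1(myRdd)))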

Given multiple .filter()'s, is there a way to set the order?

2014-11-14 Thread YaoPau
I have an RDD "x" of millions of STRINGs, each of which I want to pass through a set of filters. My filtering code looks like this: x.filter(filter#1, which will filter out 40% of data). filter(filter#2, which will filter out 20% of data). filter(filter#3, which will filter out 2% of data).

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
This is a problem because of a bug in Spark in the current master (beyond the fact that Parquet uses java.util.logging). ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with

Re: Set worker log configuration when running "local[n]"

2014-11-14 Thread Jim Carroll
Just to be complete, this is a problem in Spark that I worked around and detailed here: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-turn-off-Parquet-logging-in-a-worker-td18955.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-worker-

saveAsTextFile error

2014-11-14 Thread Niko Gamulin
Hi, I tried to modify NetworkWordCount example in order to save the output to a file. In NetworkWordCount.scala I replaced the line wordCounts.print() with wordCounts.saveAsTextFile("/home/bart/rest_services/output.txt") When I ran sbt/sbt package it returned the following error: [error] /home

Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Hi, I was wondering if the adaptive stream processing and dynamic batch processing are available to use in Spark Streaming? Could someone help point me in the right direction? Thanks, Josh

Re: No module named pyspark - latest built

2014-11-14 Thread Andrew Or
I see. The general known constraints on building your assembly jar for pyspark on YARN are: Java 6, NOT RedHat, and Maven. Some of these are documented here (bottom). Maybe we should make it more explicit. 2014-11-13 2:31 GMT-08:00 jamborta

Re: saveAsTextFile error

2014-11-14 Thread Harold Nguyen
Hi Niko, It looks like you are calling a method on DStream, which does not exist. Check out: https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams for the method "saveAsTextFiles" Harold On Fri, Nov 14, 2014 at 10:39 AM, Niko Gamulin wrote: > Hi,
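
A minimal sketch of the corrected call in context (host, port, batch interval, and output path are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // DStream has saveAsTextFiles (plural): one output directory per batch,
    // named "<prefix>-<batch time>.<suffix>".
    wordCounts.saveAsTextFiles("/home/bart/rest_services/output", "txt")

    ssc.start()
    ssc.awaitTermination()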

Re: Given multiple .filter()'s, is there a way to set the order?

2014-11-14 Thread Aaron Davidson
In the situation you show, Spark will pipeline each filter together, and will apply each filter one at a time to each row, effectively constructing an "&&" statement. You would only see a performance difference if the filter code itself is somewhat expensive, then you would want to only execute it
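
A small sketch of the trade-off (the three predicates are placeholders standing in for filter#1, #2, and #3, and sc is an existing SparkContext):

    val x = sc.parallelize((1 to 1000).map(_.toString))

    val filter1 = (s: String) => s.nonEmpty        // stands in for filter#1 (~40%)
    val filter2 = (s: String) => s.length < 4      // stands in for filter#2 (~20%)
    val filter3 = (s: String) => !s.endsWith("0")  // stands in for filter#3 (~2%)

    // Chained filters are pipelined row by row, effectively one && chain.
    val chained = x.filter(filter1).filter(filter2).filter(filter3)

    // If the predicates themselves are expensive, fuse them and order the
    // cheapest or most selective check first so the later ones short-circuit.
    val fused = x.filter(s => filter1(s) && filter2(s) && filter3(s))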

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Michael Armbrust
> > Anyone want a PR? > Yes please.

Compiling Spark master HEAD failed.

2014-11-14 Thread Jianshi Huang
The mvn build command is mvn clean install -Pyarn -Phive -Phive-0.13.1 -Phadoop-2.4 -Djava.version=1.7 -DskipTests I'm getting this error message: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-hive_2.10: wrap: scala.

Cancelled Key Exceptions on Massive Join

2014-11-14 Thread Ganelin, Ilya
Hello all. I have been running a Spark Job that eventually needs to do a large join. 24 million x 150 million A broadcast join is infeasible in this instance clearly, so I am instead attempting to do it with Hash Partitioning by defining a custom partitioner as: class RDD2Partitioner(partition
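
A hedged sketch of the co-partitioned join being described; since the original code is truncated, the key type, partition count, and input RDDs below are assumptions:

    import org.apache.spark.{HashPartitioner, Partitioner}

    // Custom partitioner that delegates to HashPartitioner.
    class RDD2Partitioner(partitions: Int) extends Partitioner {
      private val delegate = new HashPartitioner(partitions)
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = delegate.getPartition(key)
    }

    // Placeholder key/value RDDs standing in for the 24M- and 150M-row inputs.
    val rddA = sc.parallelize(1 to 1000).map(k => (k, "left"))
    val rddB = sc.parallelize(1 to 1000).map(k => (k, "right"))

    // Partition both sides identically so matching keys land on the same node.
    val partitioner = new RDD2Partitioner(600)
    val joined = rddA.partitionBy(partitioner).join(rddB.partitionBy(partitioner))
    println(joined.count())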

How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Steve Lewis
I have instrumented word count to track how many machines the code runs on. I use an accumulator to maintain a Set of MacAddresses. I find that everything is done on a single machine. This is probably optimal for word count but not for the larger problems I am working on. How do I force processing to

Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Egor Pahomov
Hi. I execute ipython notebook + pyspark with spark.dynamicAllocation.enabled = true. Task never ends. Code: import sys from random import random from operator import add partitions = 10 n = 10 * partitions def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Sandy Ryza
Hi Egor, Is it successful without dynamic allocation? From your log, it looks like the job is unable to acquire resources from YARN, which could be because other jobs are using up all the resources. -Sandy On Fri, Nov 14, 2014 at 11:32 AM, Egor Pahomov wrote: > Hi. > I execute ipython notebook

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Daniel Siegmann
Most of the information you're asking for can be found on the Spark web UI (see here ). You can see which tasks are being processed by which nodes. If you're using HDFS and your file size is smaller than the HDFS block size you will only have one

Submitting Python Applications from Remote to Master

2014-11-14 Thread Benjamin Zaitlen
Hi All, I'm not quite clear on whether submitting a python application to spark standalone on ec2 is possible. Am I reading this correctly: *A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Egor Pahomov
It's successful without dynamic allocation. I can provide spark log for that scenario if it can help. 2014-11-14 21:36 GMT+02:00 Sandy Ryza : > Hi Egor, > > Is it successful without dynamic allocation? From your log, it looks like > the job is unable to acquire resources from YARN, which could be

Kryo serialization in examples.streaming.TwitterAlgebirdCMS/HLL

2014-11-14 Thread Debasish Das
Hi, If I look inside the algebird Monoid implementation, it uses java.io.Serializable... But when we use CMS/HLL in examples.streaming.TwitterAlgebirdCMS, I don't see a KryoRegistrator for the CMS and HLL monoids... In these examples, will we run with Kryo serialization on CMS and HLL, or will they be java

Re: Using data in RDD to specify HDFS directory to write to

2014-11-14 Thread jschindler
I reworked my app using your idea of throwing the data in a map. It looks like it should work but I'm getting some strange errors and my job gets terminated. I get a "WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered

Multiple Spark Context

2014-11-14 Thread Charles
I need to continuously run multiple calculations concurrently on a cluster. They are not sharing RDDs. Each of the calculations needs a different number of cores and a different amount of memory. Also, some of them are long-running calculations and others are short-running calculations. They all need to be run on a regular basis and

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
Jira: https://issues.apache.org/jira/browse/SPARK-4412 PR: https://github.com/apache/spark/pull/3271 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-turn-off-Parquet-logging-in-a-worker-tp18955p18977.html Sent from the Apache Spark User List maili

SparkSQL exception on cached parquet table

2014-11-14 Thread Sadhan Sood
While testing SparkSQL on a bunch of parquet files (basically used to be a partition for one of our hive tables), I encountered this error: import org.apache.spark.sql.SchemaRDD import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path;

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Sadhan Sood
Thanks Cheng, that was helpful. I noticed from UI that only half of the memory per executor was being used for caching, is that true? We have a 2 TB sequence file dataset that we wanted to cache in our cluster with ~ 5TB memory but caching still failed and what looked like from the UI was that it u

Re: Multiple Spark Context

2014-11-14 Thread Daniil Osipov
It's not recommended to have multiple Spark contexts in one JVM, but you could launch a separate JVM per context. How resources get allocated is probably outside the scope of Spark, and more of a task for the cluster manager. On Fri, Nov 14, 2014 at 12:58 PM, Charles wrote: > I need to continuously

Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
We have a bunch of data in RedShift tables that we'd like to pull in during job runs to Spark. What is the path/url format one uses to pull data from there? (This is in reference to using the https://github.com/mengxr/redshift-input-format)

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Sandy Ryza
That would be helpful as well. Can you confirm that when you try it with dynamic allocation the cluster has free resources? On Fri, Nov 14, 2014 at 12:17 PM, Egor Pahomov wrote: > It's successful without dynamic allocation. I can provide spark log for > that scenario if it can help. > > 2014-11

RE: Multiple Spark Context

2014-11-14 Thread Bui, Tri
Does this also apply to StreamingContext? What issues would I have if I have 1000s of StreamingContexts? Thanks Tri From: Daniil Osipov [mailto:daniil.osi...@shazam.com] Sent: Friday, November 14, 2014 3:47 PM To: Charles Cc: u...@spark.incubator.apache.org Subject: Re: Multiple Spark Context It

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Steve Lewis
The cluster runs Mesos and I can see the tasks in the Mesos UI but most are not doing much - any hints about that UI On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann wrote: > Most of the information you're asking for can be found on the Spark web UI > (see here

RE: Multiple Spark Context

2014-11-14 Thread Charles
Thanks for your reply! Can you be more specific about the JVM? Is the JVM referring to the driver application? If I want to create multiple SparkContexts, will I need to start a driver application instance for each SparkContext? -- View this message in context: http://apache-spark-user-list.1001560.n

Client application that calls Spark and receives an MLlib model Scala Object and then predicts without Spark installed on hadoop

2014-11-14 Thread xiaoyan yu
I had the same need as those documented back to July archived at http://qnalist.com/questions/5013193/client-application-that-calls-spark-and-receives-an-mllib-model-scala-object-not-just-result . I wonder if anyone would like to share any successful stories. Thanks, Xiaoyan

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2014-11-14 Thread Andrew Or
Hey Egor, Have you checked the AM logs? My guess is that it threw an exception or something such that no executors (not even the initial set) have registered with your driver. You may already know this, but you can go to the http://:8088 page and click into the application to access this. Alternat

Re: Sourcing data from RedShift

2014-11-14 Thread Michael Armbrust
I'd guess that its an s3n://key:secret_key@bucket/path from the UNLOAD command used to produce the data. Xiangrui can correct me if I'm wrong though. On Fri, Nov 14, 2014 at 2:19 PM, Gary Malouf wrote: > We have a bunch of data in RedShift tables that we'd like to pull in > during job runs to S
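
A hedged illustration of that URL shape with plain sc.textFile (the access key, secret key, bucket, and prefix are all placeholders; this shows only the path format, not the redshift-input-format API itself):

    // Reading UNLOAD output directly from S3 by embedding the credentials
    // in the s3n URL. All names below are placeholders.
    val unloaded = sc.textFile("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/redshift-unload/part-*")
    println(unloaded.count())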

Re: Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Referring to this paper. On Fri, Nov 14, 2014 at 10:42 AM, Josh J wrote: > Hi, > > I was wondering if the adaptive stream processing and dynamic batch > processing was available to use in spark streaming? If someone could help > point me in the right d

Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
Hmm, we actually read the CSV data in S3 now and were looking to avoid that. Unfortunately, we've experienced dreadful performance reading 100GB of text data for a job directly from S3 - our hope had been connecting directly to Redshift would provide some boost. We had been using 12 m3.xlarges, b

Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
I'll try this out and follow up with what I find. On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng wrote: > For each node, if the CSV reader is implemented efficiently, you should be > able to hit at least half of the theoretical network bandwidth, which is > about 60MB/second/node. So if you just

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Cheng Lian
Hm… Have you tuned spark.storage.memoryFraction? By default, 60% of memory is used for caching. You may refer to the details here: http://spark.apache.org/docs/latest/configuration.html On 11/15/14 5:43 AM, Sadhan Sood wrote: Thanks Cheng, that was helpful. I noticed from UI that only half o
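
For reference, a minimal sketch of setting that property programmatically (the 0.7 value is only an example; the same key can go in spark-defaults.conf instead):

    import org.apache.spark.SparkConf

    // Fraction of executor heap reserved for the block-manager cache;
    // the default in Spark 1.x is 0.6.
    val conf = new SparkConf()
      .setAppName("CachedTables")
      .set("spark.storage.memoryFraction", "0.7")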

filtering a SchemaRDD

2014-11-14 Thread Daniel, Ronald (ELS-SDG)
Hi all, I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields, one of which holds the text for a sentence from a document. # Load sentence data table sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing') sentenceRDD.take(3) Out[20]: [Row(annotID=118, annotSet=u'

Re: filtering a SchemaRDD

2014-11-14 Thread Vikas Agarwal
Hi, did you try using single quotes instead of double quotes around the column name? I faced a similar situation with Apache Phoenix. On Saturday, November 15, 2014, Daniel, Ronald (ELS-SDG) < r.dan...@elsevier.com> wrote: > Hi all, > > > > I have a SchemaRDD that is loaded from a file. Each Row contains 7 fi

Re: filtering a SchemaRDD

2014-11-14 Thread Michael Armbrust
> > > If I use row[6] instead of row["text"] I get what I am looking for. > However, finding the right numeric index could be a pain. > > > > Can I access the fields in a Row of a SchemaRDD by name, so that I can > map, filter, etc. without a trial and error process of finding the right > int for t