Re: How to remove empty strings from JavaRDD

2016-04-07 Thread Chris Miller
flatMap? -- Chris Miller On Thu, Apr 7, 2016 at 10:25 PM, greg huang wrote: > Hi All, > > Can someone give me an example code to get rid of the empty strings in > JavaRDD? I know there is a filter method in JavaRDD: > https://spark.apache.org/docs/1.6.0/api/java/org/apache/spa
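A minimal sketch of the predicate you would pass to `JavaRDD.filter(...)` for this (`filter` is the right call here, not `flatMap`). This runs on a plain list so it needs no cluster; with Spark it would be `rdd.filter(s -> s != null && !s.isEmpty())`. Class and variable names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RemoveEmpty {
    // Keep only non-empty, non-null strings -- the same predicate works
    // unchanged as the lambda passed to JavaRDD.filter(...).
    public static List<String> removeEmpty(List<String> in) {
        return in.stream()
                 .filter(s -> s != null && !s.isEmpty())
                 .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeEmpty(Arrays.asList("a", "", "b", "")));  // [a, b]
    }
}
```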

Re: Spark schema evolution

2016-03-22 Thread Chris Miller
With Avro you solve this by using a default value for the new field... maybe Parquet is the same? -- Chris Miller On Tue, Mar 22, 2016 at 9:34 PM, gtinside wrote: > Hi , > > I have a table sourced from* 2 parquet files* with a few extra columns in one > of the parquet files. Simp
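For the Avro side, the default value makes old records readable under the new (reader's) schema: when the field is missing from older data, Avro's schema resolution fills in the default. A minimal schema fragment (record and field names are hypothetical) might look like:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

Note the field's type union lists `"null"` first so the `null` default is valid against the first branch, which is what the Avro spec requires for union defaults.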

Re: newbie HDFS S3 best practices

2016-03-16 Thread Chris Miller
If you have lots of small files, distcp should handle that well -- it's supposed to distribute the transfer of files across the nodes in your cluster. Conductor looks interesting if you're trying to distribute the transfer of single, large file(s)... right? -- Chris Miller On Wed, Ma

Re: Does parallelize and collect preserve the original order of list?

2016-03-16 Thread Chris Miller
Short answer: Nope. Less short answer: Spark is not designed to maintain sort order in this case... it *may*, but there's no guarantee... generally, it would not be in the same order unless you implement something to order by and then sort the result based on that. -- Chris Miller On Wed, M
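The "implement something to order by" idea above can be sketched without a cluster: pair each element with its original index (what `rdd.zipWithIndex()` gives you in Spark), then sort on that index before reading the results back. The class name and data are invented for illustration; with Spark the equivalent is roughly `rdd.zipWithIndex().sortBy(pair -> pair._2, ...)`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OrderedCollect {
    // Given elements keyed by their original position (index -> value),
    // restore the original order by sorting on the index.
    public static List<String> restoreOrder(Map<Long, String> indexed) {
        return indexed.entrySet().stream()
                      .sorted(Map.Entry.comparingByKey())
                      .map(Map.Entry::getValue)
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Long, String> shuffled = new HashMap<>();
        shuffled.put(2L, "z");
        shuffled.put(0L, "x");
        shuffled.put(1L, "y");
        System.out.println(restoreOrder(shuffled));  // [x, y, z]
    }
}
```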

Re: reading file from S3

2016-03-16 Thread Chris Miller
r you described. -- Chris Miller On Tue, Mar 15, 2016 at 11:22 PM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > There are many solutions to a problem. > > Also understand that sometimes your situation might be such. For ex what > if you are accessing S3 from

Re: Correct way to use spark streaming with apache zeppelin

2016-03-13 Thread Chris Miller
Cool! Thanks for sharing. -- Chris Miller On Sun, Mar 13, 2016 at 12:53 AM, Todd Nist wrote: > Below is a link to an example which Silvio Fiorito put together > demonstrating how to link Zeppelin with Spark Stream for real-time charts. > I think the original thread was back in early

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Chris Miller
for maintain. > > I'm just wondering, what's the best way to store the Stats table (a database or > parquet file?) > What exactly are you trying to do? Zeppelin is for interactive analysis of > a dataset. What do you mean "realtime analytics" -- do you mean build a > re

Re: Repeating Records w/ Spark + Avro?

2016-03-12 Thread Chris Miller
removed. Finally, if I add rdd.persist(), then it doesn't work. I guess I would need to do .map(_._1.datum) again before the map that does the real work. -- Chris Miller On Sat, Mar 12, 2016 at 4:15 PM, Chris Miller wrote: > Wow! That sure is buried in the documentation! But yeah, that

Re: Repeating Records w/ Spark + Avro?

2016-03-12 Thread Chris Miller
tln(record.get("myValue")) }) * What am I doing wrong? -- Chris Miller On Sat, Mar 12, 2016 at 1:48 PM, Peyman Mohajerian wrote: > Here is the reason for the behavior: > '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable > objec
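The object-reuse behavior quoted above (Hadoop's RecordReader handing back the same Writable for every record) can be reproduced without Spark or Avro at all. This self-contained sketch mimics a reader that mutates one shared record, showing why buffering the record itself repeats the last value, while copying the field out immediately (the `.map(_._1.datum)`-style fix from this thread, generalized) does not. Class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReuseDemo {
    // Stand-in for a Hadoop Writable: ONE object, mutated for each record.
    static class MutableRecord { String value; }

    // BUG: buffering the reused object stores N references to the SAME
    // record, so every entry reflects the last value written into it.
    static List<String> buffered(List<String> data) {
        MutableRecord shared = new MutableRecord();
        List<MutableRecord> buf = new ArrayList<>();
        for (String v : data) { shared.value = v; buf.add(shared); }
        List<String> out = new ArrayList<>();
        for (MutableRecord r : buf) out.add(r.value);
        return out;
    }

    // FIX: copy the field out of the record before buffering anything.
    static List<String> copied(List<String> data) {
        MutableRecord shared = new MutableRecord();
        List<String> out = new ArrayList<>();
        for (String v : data) { shared.value = v; out.add(shared.value); }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c");
        System.out.println(buffered(data));  // [c, c, c]
        System.out.println(copied(data));    // [a, b, c]
    }
}
```

The same reasoning explains why `rdd.persist()` on the raw records "doesn't work": the cache ends up holding many references to one reused object, so the copy has to happen before caching.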

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Chris Miller
What exactly are you trying to do? Zeppelin is for interactive analysis of a dataset. What do you mean "realtime analytics" -- do you mean build a report or dashboard that automatically updates as new data comes in? -- Chris Miller On Sat, Mar 12, 2016 at 3:13 PM, trung kien wrote:

Repeating Records w/ Spark + Avro?

2016-03-11 Thread Chris Miller
one the datum? Seems I'm not the only one who ran into this problem: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/102. I can't figure out how to fix it in my case without hacking away like the person in the linked PR did. Suggestions? -- Chris Miller

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-06 Thread Chris Miller
For anyone running into this same issue, it looks like Avro deserialization is just broken when used with SparkSQL and partitioned schemas. I created a bug report with details and a simplified example of how to reproduce: https://issues.apache.org/jira/browse/SPARK-13709 -- Chris Miller On Fri

Re: Is Spark right for us?

2016-03-06 Thread Chris Miller
Gut instinct is no, Spark is overkill for your needs... you should be able to accomplish all of that with a relational database or a column-oriented database (depending on the types of queries you most frequently run and the performance requirements). -- Chris Miller On Mon, Mar 7, 2016 at 1:17

Re: MLLib + Streaming

2016-03-06 Thread Chris Miller
Guru: This is a really great response. Thanks for taking the time to explain all of this. Helpful for me too. -- Chris Miller On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani wrote: > Hi Lan, > > Streaming Means, Linear Regression and Logistic Regression support online > machine lea

Re: Best way to merge files from streaming jobs‏ on S3

2016-03-04 Thread Chris Miller
instead of writing to the file from coalesce, sort that data structure, then write your file. -- Chris Miller On Sat, Mar 5, 2016 at 5:24 AM, jelez wrote: > My streaming job is creating files on S3. > The problem is that those files end up very small if I just write them to > S3 > dir
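A sketch of the suggestion above, runnable without a cluster: sort the records first, then write one merged file instead of many tiny ones. With Spark the rough equivalent would be `rdd.sortBy(...)` followed by `coalesce(1)` and `saveAsTextFile(...)` (output path and record format here are hypothetical).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MergeSmallFiles {
    // Sort the lines, then join them into the contents of a single output
    // file -- the local analogue of sortBy(...).coalesce(1).saveAsTextFile(...).
    public static String mergeSorted(List<String> lines) {
        return lines.stream().sorted().collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList(
            "2016-03-05 b", "2016-03-04 a", "2016-03-06 c");
        System.out.println(mergeSorted(batch));
    }
}
```

Coalescing to one partition funnels all writes through a single task, so for large outputs a small fixed number of partitions (still far fewer than one-per-batch) is a common compromise.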

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
oding the same files work fine with Hive, and I imagine the same deserializer code is used there too. Thoughts? -- Chris Miller On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman wrote: > your field name is > *enum1_values* > > but you have data > { "foo1": "test123

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
ool.run(DataFileWriteTool.java:99) at org.apache.avro.tool.Main.run(Main.java:84) at org.apache.avro.tool.Main.main(Main.java:73) ******** Any other ideas? -- Chris Miller On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman wrote: > your field name is > *enum1_values* > > but

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Chris Miller
.lang.Thread.run(Thread.java:745) ******** In addition to the above, I also tried putting the test Avro files on HDFS instead of S3 -- the error is the same. I also tried querying from Scala instead of using Zeppelin, and I get the same error. Where should I begin with troubleshooting th

Avro SerDe Issue w/ Manual Partitions?

2016-03-02 Thread Chris Miller
add the partitions manually so that I can specify a location. For what it's worth, "ActionEnum" is the first field in my schema. This same table and query structure works fine with Hive. When I try to run this with SparkSQL, however, I get the above error. Anyone have any idea what the problem is here? Thanks! -- Chris Miller