Given an Avro Schema object is there a way to get StructType in Java?

2017-12-15 Thread kant kodali
Hi All, Given an Avro Schema object, is there a way to get the StructType that represents the schema in Java? Thanks!
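One possible answer, assuming the Databricks spark-avro package (`com.databricks:spark-avro`) is on the classpath: its `SchemaConverters` class converts an Avro `Schema` to a Spark SQL type. This is a hedged sketch, not the only route; in later Spark versions the equivalent class lives in `org.apache.spark.sql.avro`.

```java
import org.apache.avro.Schema;
import org.apache.spark.sql.types.StructType;
import com.databricks.spark.avro.SchemaConverters;

public class AvroToStruct {
    public static StructType toStructType(Schema avroSchema) {
        // toSqlType returns a SchemaType wrapper; dataType() is the Spark SQL
        // type, which for an Avro record schema is a StructType.
        return (StructType) SchemaConverters.toSqlType(avroSchema).dataType();
    }
}
```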

NASA CDF files in Spark

2017-12-15 Thread Christopher Piggott
I'm looking to run a job that involves a zillion files in a format called CDF, a NASA standard. There are a number of libraries out there that can read CDFs, but most of them are not high quality compared to the official NASA one, which has Java bindings (via JNI). It's a little clumsy but I have
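One way to fan such files out to executors is `binaryFiles`, which hands each whole file to a task as a byte stream. A rough sketch follows; `CdfReader` is a hypothetical wrapper around the NASA JNI bindings (not a real class), and the native library would have to be installed on every worker node.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public class CdfJob {
    public static void process(JavaSparkContext sc, String path) {
        // binaryFiles yields (filename, stream) pairs, one per file -- useful
        // when the format is not line-oriented and must be parsed whole.
        JavaPairRDD<String, PortableDataStream> files = sc.binaryFiles(path);
        files.foreach(pair -> {
            byte[] bytes = pair._2().toArray();
            // Hypothetical JNI-backed parse call would go here:
            // CdfReader.parse(bytes);
        });
    }
}
```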

Please Help with DecisionTree/FeatureIndexer

2017-12-15 Thread Marco Mistroni
Hi all, I am trying to run a sample decision tree, following the examples here (for MLlib): https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier The example seems to use a VectorIndexer, however I am missing something. How does the featureIndexer know which
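On the likely sticking point: `VectorIndexer` decides for itself which features are categorical. Any feature with at most `maxCategories` distinct values is treated as categorical and re-indexed; the rest are passed through as continuous. A sketch along the lines of the linked docs, assuming a DataFrame with a vector column named "features":

```java
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class IndexerExample {
    public static Dataset<Row> index(Dataset<Row> data) {
        VectorIndexer indexer = new VectorIndexer()
            .setInputCol("features")
            .setOutputCol("indexedFeatures")
            // Features with <= 4 distinct values are treated as categorical.
            .setMaxCategories(4);
        VectorIndexerModel model = indexer.fit(data);
        return model.transform(data);
    }
}
```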

Re: Several Aggregations on a window function

2017-12-15 Thread Julien CHAMP
Maybe I should consider something like Impala? On Fri, Dec 15, 2017 at 11:32 AM, Julien CHAMP wrote: > Hi Spark Community members ! > > I want to do several (from 1 to 10) aggregate functions using window > functions on something like 100 columns. > > Instead of doing several passes on the data

Re: kinesis throughput problems

2017-12-15 Thread Gourav Sengupta
Hi Jeremy, just out of curiosity - you do know that this is a SPARK user group? Regards, Gourav On Thu, Dec 14, 2017 at 7:03 PM, Jeremy Kelley wrote: > We have a largeish kinesis stream with about 25k events per second and > each record is around 142k. I have tried multiple cluster sizes, mu

Using UDF compiled with Janino in Spark

2017-12-15 Thread Michael Shtelma
Hi all, I am trying to compile my UDF with the Janino compiler and then register it in Spark and use it afterwards. Here is the code: String s = " \n" + "public class MyUDF implements org.apache.spark.sql.api.java.UDF1 {\n" + "@Override\n" + "public St
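A sketch of the pattern being attempted, assuming Janino's `SimpleCompiler` and the class name `MyUDF` from the snippet above; the UDF's type parameters and return type are assumptions for illustration:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.codehaus.janino.SimpleCompiler;

public class JaninoUdf {
    public static void register(SparkSession spark, String source) throws Exception {
        SimpleCompiler compiler = new SimpleCompiler();
        compiler.cook(source);  // compile the source string in memory
        @SuppressWarnings("unchecked")
        UDF1<String, String> udf = (UDF1<String, String>)
            compiler.getClassLoader().loadClass("MyUDF").newInstance();
        // Register under a SQL-visible name with an explicit return type.
        spark.udf().register("myUdf", udf, DataTypes.StringType);
    }
}
```

Note that on a cluster the executors also need to be able to load the compiled class, which is where this approach tends to get tricky.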

Several Aggregations on a window function

2017-12-15 Thread Julien CHAMP
Hi Spark Community members ! I want to do several (from 1 to 10) aggregate functions using window functions on something like 100 columns. Instead of doing several passes on the data to compute each aggregate function, is there a way to do this efficiently ? Currently it seems that doing val
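One point worth noting: aggregates that share the same `WindowSpec` can be computed together in a single `select`, so defining the window once and reusing it avoids a separate pass per aggregate. A hedged sketch, with the column names ("key", "ts", "value") and frame bounds as assumptions:

```java
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class WindowAggs {
    public static Dataset<Row> aggregate(Dataset<Row> df) {
        // One WindowSpec, shared by every aggregate over it.
        WindowSpec w = Window.partitionBy("key").orderBy("ts").rowsBetween(-10, 0);
        return df.select(
            col("key"),
            sum("value").over(w).alias("sum_10"),
            avg("value").over(w).alias("avg_10"),
            max("value").over(w).alias("max_10"));
    }
}
```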

Recompute Spark outputs intelligently

2017-12-15 Thread Ashwin Raju
Hi, We have a batch processing application that reads log files over multiple days, does transformations and aggregations on them using Spark and saves various intermediate outputs to Parquet. These jobs take many hours to run. This pipeline is deployed at many customer sites with some site speci
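One common pattern for this kind of incremental pipeline, sketched here with plain JDK file APIs as an assumption about the setup: before recomputing a stage, check for the `_SUCCESS` marker that Spark writes into an output directory when a job completes, and skip the stage when it is present.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class SkipIfDone {
    // Recompute a stage only when its output is missing its _SUCCESS marker.
    static boolean needsRecompute(Path outputDir) {
        return !Files.exists(outputDir.resolve("_SUCCESS"));
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempDirectory("stage1.parquet");
        System.out.println(needsRecompute(out));  // no marker yet -> true
        Files.createFile(out.resolve("_SUCCESS"));
        System.out.println(needsRecompute(out));  // marker present -> false
    }
}
```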