log traces

2016-07-04 Thread Jean Georges Perrin
Hi, I have installed Apache Spark via Maven. How can I control the volume of logs it displays on my system? I tried different locations for a log4j.properties file, but none seems to work for me. Thanks for the help...
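For reference, a minimal way to tame Spark's logging from the application itself, assuming log4j 1.x (the logging backend bundled with Spark 1.x); this is a sketch, not the only route:

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    // Raise the threshold on Spark's chattiest loggers before creating the context.
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
    Logger.getLogger("akka").setLevel(Level.WARN);

Alternatively, a log4j.properties file placed on the application classpath (e.g. src/main/resources in a Maven project) should be picked up.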

Re: log traces

2016-07-04 Thread Jean Georges Perrin
y and all responsibility for any loss, > damage or destruction of data or any other property which may arise from > relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or

Re: log traces

2016-07-04 Thread Jean Georges Perrin
> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext> > > Hope this helps, > Anupam > > > > > On Mon, Jul 4, 2016 at 2:18 PM, Jean Georges Perrin > wrote: > Thanks Mich, but what is SPARK_HO

Re: log traces

2016-07-04 Thread Jean Georges Perrin
sing the > $SPARK_HOME/bin/spark-submit script. It might be helpful to provide us more > details on how you are running your application. > > Regards, > Luis > > On 4 July 2016 at 16:57, Jean Georges Perrin > wrote: > Hey Anupam, >

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Jean Georges Perrin
What are you doing it on right now? > On Jul 6, 2016, at 3:25 PM, dabuki wrote: > > I was thinking about replacing a legacy batch job with Spark, but I'm not > sure if Spark is suited for this use case. Before I start the proof of > concept, I wanted to ask for opinions. > > The legacy job wor

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Jean Georges Perrin
or free using a > cloud infrastructure. > > > > > On 6. Juli 2016 um 21:29:53 MESZ, Jean Georges Perrin wrote: >> What are you doing it on right now? >> >> > On Jul 6, 2016, at 3:25 PM, dabuki wrote: >> > >> > I was thinking about to replace a l

Re: Processing json document

2016-07-06 Thread Jean Georges Perrin
Do you want id1, id2, id3 to be processed similarly? The Java code I use is: df = df.withColumn(K.NAME, df.col("fields.premise_name")); the original structure is something like {"fields":{"premise_name":"ccc"}}. Hope it helps. > On Jul 7, 2016, at 1:48 AM, Lan Jiang wrote: > > H

Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
Hi, So far I have been using Spark "embedded" in my app. Now, I'd like to run it on a dedicated server. Here is how far I am: - fresh Ubuntu 16, server name is mocha / ip 10.0.100.120, installed Scala 2.10, installed Spark 1.6.2, recompiled - Pi test works - UI on port 8080 works Log says: Spark Comm

Re: Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address. And still connection refused... no luck. > On Jul 10, 2016, at 1:26 PM, Jean Georges Perrin wrote: > > Hi, > > So far I have been using Spark "embedded" in my app. Now, I'd like to run it >

Re: Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
It appears I had issues in my /etc/hosts... it seems OK now > On Jul 10, 2016, at 2:13 PM, Jean Georges Perrin wrote: > > I tested that: > > I set: > > _JAVA_OPTIONS=-Djava.net.preferIPv4Stack=true > SPARK_LOCAL_IP=10.0.100.120 > I still have the warning in t

"client / server" config

2016-07-10 Thread Jean Georges Perrin
I have my dev environment on my Mac. I have a dev Spark server on a freshly installed physical Ubuntu box. I had some connection issues, but it is now all fine. In my code, running on the Mac, I have: SparkConf conf = new SparkConf().setAppName("myapp").setMaster("spark://10.0
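For context, a sketch of such a client-to-standalone-server setup in Java; the master URL is the server from the thread, while the jar path is a hypothetical placeholder:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf()
        .setAppName("myapp")
        .setMaster("spark://10.0.100.120:7077")
        // Ship the application classes to the executors (hypothetical path):
        .setJars(new String[] { "target/myapp-1.0.jar" });
    JavaSparkContext sc = new JavaSparkContext(conf);

Note that any file the job reads must be reachable from the workers, not only from the driver, which is what the reply below points out.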

Re: "client / server" config

2016-07-10 Thread Jean Georges Perrin
river (ie your mac) @ > location: file:/Users/jgp/Documents/Data/restaurants-data.json > >> On Mon, Jul 11, 2016 at 12:33 PM, Jean Georges Perrin wrote: >> >> I have my dev environment on my Mac. I have a dev Spark server on a freshly >> installed physical Ubuntu

Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
Hi, I have a Java memory issue with Spark. The same application that works on my 8GB Mac crashes on my 72GB Ubuntu server... I have changed things in the conf file, but it looks like Spark does not care, so I wonder if my issues are with the driver or the executor. I set: spark.driver.memory

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
I have added: SparkConf conf = new SparkConf().setAppName("app").setExecutorEnv("spark.executor.memory", "8g") .setMaster("spark://10.0.100.120:7077"); but it did not change a thing > On Jul 13, 2

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
Looks like replacing setExecutorEnv() with set() did the trick... let's see how fast it'll process my 50 x 10^15 data points... > On Jul 13, 2016, at 9:24 PM, Jean Georges Perrin wrote: > > I have added: > > SparkConf conf = new > SparkConf().setA
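The fix spelled out, as a sketch: the property goes on the SparkConf itself through set(), not through setExecutorEnv():

    SparkConf conf = new SparkConf()
        .setAppName("app")
        .set("spark.executor.memory", "8g")
        .setMaster("spark://10.0.100.120:7077");

One related caveat: spark.driver.memory cannot be set this way once the driver JVM is already running (the embedded case); it has to be supplied before startup, e.g. via spark-submit --driver-memory or spark-defaults.conf.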

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
emory > because in local mode your setting about executor and driver didn't work the way > you expected. > > > > >> On Jul 14, 2016, at 8:43 AM, Jean Georges Perrin > wrote: >> >> Looks like replacing the setExecutorEnv() by s

spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
Hi, Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores. My process uses all the cores of my server (good), but I am trying to limit it so I can actually submit a second job. I tried SparkConf conf = new SparkConf().setAppName("NC Eatery app").set("spark.executor.mem

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
park.executor.cores", "2"); > } > JavaSparkContext javaSparkContext = new JavaSparkContext(conf); > > On Fri, Jul 15, 2016 at 2:31 PM, Jean Georges Perrin > wrote: > Hi, > > Configuration: standalone clus
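On a standalone cluster, two settings work together here; a sketch, assuming the goal stated above (leave room for a second job on a 24-core box):

    SparkConf conf = new SparkConf()
        .setAppName("NC Eatery app")
        .set("spark.executor.cores", "2")   // cores per executor
        .set("spark.cores.max", "12");      // cap on the total cores this app may take
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);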

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
> relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > > On 15 July 2016 at 13:48, Jean Georges Perrin > wrote:

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
wrote: > > Mich's invocation is for starting a Spark application against an already > running Spark standalone cluster. It will not start the cluster for you. > > We used to not use "spark-submit", but we started using it when it solved > some problem

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
risk. Any and all responsibility for any loss, > damage or destruction of data or any other property which may arise from > relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss,

Re: Spark streaming takes longer time to read json into dataframes

2016-07-15 Thread Jean Georges Perrin
Do you need it on disk, or can you just push it to memory? Can you try to increase memory or the # of cores (I know it sounds basic)? > On Jul 15, 2016, at 11:43 PM, Diwakar Dhanuskodi > wrote: > > Hello, > > I have 400K json messages pulled from Kafka into spark streaming using > DirectStream approach.

Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Jean Georges Perrin
Hey, I love when questions are numbered, it's easier :) 1) Yes (but I am not an expert) 2) You don't control... One of my processes is going to 8k tasks, so... 3) Yes, if you have HT, it doubles. My servers have 12 cores, but with HT, it makes 24. 4) From my understanding: Slave is the logical comput

MLlib, Java, and DataFrame

2016-07-21 Thread Jean Georges Perrin
? Do you know/have any example like that? Thanks! jg Jean Georges Perrin j...@jgp.net / @jgperrin

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
ames-in-spark-for-large-scale-data-science.html> > > Cheers > Jules > > Sent from my iPhone > Pardon the dumb thumb typos :) > On Jul 21, 2016, at 8:41 PM, Jean Georges Perrin

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
cNetExample.java> > > This example uses a Dataset, which is type equivalent to a DataFrame. > > > On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin > wrote: > Hi, > > I am looking for some really super basic examples of MLlib (like a

Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
rsome > > > 1. go from DataFrame to Rdd of Rdd of [LabeledVectorPoint] > 2. run your ML model > > i'd suggest you stick to DataFrame + ml package :) > > hth > > > > On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin > wr

Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin
I am trying to build a DataFrame from a list, here is the code: private void start() { SparkConf conf = new SparkConf().setAppName("Data Set from Array").setMaster("local"); SparkContext sc = new SparkContext(conf); SQLContext sqlContext =

Re: Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin
ateStructType(Collections.singletonList( > DataTypes.createStructField("int_field", DataTypes.IntegerType, true))); > > DataFrame intDataFrame = sqlContext.createDataFrame(rows, schema); > > > > On Fri, Jul 22, 2016 at 7:53 AM, Jean Georges Perrin
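Putting the quoted pieces together, a runnable sketch against the Spark 1.x Java API, assuming the SQLContext built in the original message:

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // Build rows from a plain Java list, describe them with a schema,
    // and hand both to the SQLContext.
    List<Row> rows = Arrays.asList(
        RowFactory.create(1), RowFactory.create(2), RowFactory.create(3));
    StructType schema = DataTypes.createStructType(Collections.singletonList(
        DataTypes.createStructField("int_field", DataTypes.IntegerType, true)));
    DataFrame intDataFrame = sqlContext.createDataFrame(rows, schema);
    intDataFrame.show();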

UDF to build a Vector?

2016-07-24 Thread Jean Georges Perrin
Hi, Here is my UDF that should build a VectorUDT. How do I actually ensure that the value ends up in the vector? package net.jgp.labs.spark.udf; import org.apache.spark.mllib.linalg.VectorUDT; import org.apache.spark.sql.api.java.UDF1; public class VectorBuilder implements UDF1 { private sta
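A sketch of one way to complete it, typing the UDF so the incoming value is wrapped in a one-element dense vector; the Double input type is an assumption:

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.VectorUDT;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.sql.api.java.UDF1;

    public class VectorBuilder implements UDF1<Double, Vector> {
        private static final long serialVersionUID = 1L;

        @Override
        public Vector call(Double x) throws Exception {
            return Vectors.dense(x); // the value becomes the vector's only element
        }
    }

    // Registration, assuming a SQLContext named sqlContext:
    // sqlContext.udf().register("vec", new VectorBuilder(), new VectorUDT());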

java.lang.RuntimeException: Unsupported type: vector

2016-07-24 Thread Jean Georges Perrin
I try to build a simple DataFrame that can be used for ML SparkConf conf = new SparkConf().setAppName("Simple prediction from Text File").setMaster("local"); SparkContext sc = new SparkContext(conf); SQLContext sqlContext = new SQLContext(sc);

Java Recipes for Spark

2016-07-29 Thread Jean Georges Perrin
Sorry if this looks like shameless self-promotion, but some of you asked me to say when I would have my Java recipes for Apache Spark updated. It's done here: http://jgp.net/2016/07/22/spark-java-recipes/ and in the GitHub repo. Enjoy / have a gre

Re: Java Recipes for Spark

2016-07-31 Thread Jean Georges Perrin
> On Jul 30, 2016, at 12:19 AM, Shiva Ramagopal wrote: > > +1 for the Java love :-) > > > On 30-Jul-2016 4:39 AM, "Renato Perini" > wrote: > Not only very useful, but finally some Java love :-) > > Thank you. >

Raleigh, Durham, and around...

2016-08-04 Thread Jean Georges Perrin
Hi, With some friends, we are trying to develop the Apache Spark community in the Triangle area of North Carolina, USA. If you are from there, feel free to join our Slack team: http://oplo.io/td. Danny Siegle has also organized a lot of meetups around the edX courses (see https://www.meetup.com/Tria

Parsing XML

2016-10-04 Thread Jean Georges Perrin
Spark 2.0.0 XML parser 0.4.0 Java Hi, I am trying to create a new column in my data frame, based on the value of a sub-element. I have done that several times with JSON, but have not been very successful with XML. (I know a world with fewer formats would be easier :)) Here is the code: df.withColumn("Fulfill

Re: Parsing XML

2016-10-04 Thread Jean Georges Perrin
; constructing a UDF which applies your xpath query, and give that as the > second argument to withColumn. > >> On Tue, Oct 4, 2016 at 4:35 PM, Jean Georges Perrin wrote: >> Spark 2.0.0 >> XML parser 0.4.0 >> Java >> >> Hi, >> >> I am try
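A sketch of the UDF approach the reply suggests, leaning on the JDK's built-in XPath; the XPath expression and the column names are hypothetical:

    import java.io.StringReader;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.xml.sax.InputSource;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;
    import static org.apache.spark.sql.functions.callUDF;

    spark.udf().register("xpathString", (UDF1<String, String>) xml -> {
        // Evaluate the XPath query against the raw XML of the row.
        XPath xp = XPathFactory.newInstance().newXPath();
        return xp.evaluate("/order/fulfillment/text()",
            new InputSource(new StringReader(xml)));
    }, DataTypes.StringType);

    df = df.withColumn("Fulfillment", callUDF("xpathString", df.col("rawXml")));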

Support for uniVocity in Spark 2.x

2016-10-06 Thread Jean Georges Perrin
The original CSV parser from Databricks had support for uniVocity in Spark 1.x. Can someone confirm it has disappeared (per: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html#csv(java.lang.String)

Re: Inserting New Primary Keys

2016-10-10 Thread Jean Georges Perrin
Is there only one process adding rows? Because this seems a little risky if you have multiple threads doing that… > On Oct 8, 2016, at 1:43 PM, Benjamin Kim wrote: > > Mich, > > After much searching, I found and am trying to use “SELECT ROW_NUMBER() > OVER() + b.id_max AS id, a.* FROM source

JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Hi folks, I am trying to parse JSON arrays and it’s getting a little crazy (for me at least)… 1) If my JSON is: {"vals":[100,500,600,700,800,200,900,300]} I get:

+--------------------+
|                vals|
+--------------------+
|[100, 500, 600, 7...|
+--------------------+

root
 |-- vals: a
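One way to get at the elements is to explode the array into one row per value; a sketch, assuming Spark 2.x, a SparkSession named spark, and a hypothetical file path:

    import static org.apache.spark.sql.functions.explode;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> df = spark.read().json("vals.json");
    Dataset<Row> vals = df.select(explode(df.col("vals")).as("val"));
    vals.show(); // one row per array element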

Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
to see what it does not like? The JSON parser has been pretty good to me until recently. > On Oct 10, 2016, at 12:59 PM, Sudhanshu Janghel wrote: > > As far as my experience goes spark can parse only certain types of Json > correctly not all and has strict Parsing rules unlike

Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
t often fail. > > > > On Mon, Oct 10, 2016 at 9:57 AM, Jean Georges Perrin > wrote: > Hi folks, > > I am trying to parse JSON arrays and it’s getting a little crazy (for me at > least)… >

eager? in dataframe's checkpoint

2017-01-26 Thread Jean Georges Perrin
Hey Sparkers, Trying to understand the Dataframe's checkpoint (not in the context of streaming): https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)
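For reference, the boolean picks between eager and lazy checkpointing; a minimal sketch, with a hypothetical checkpoint directory and a Dataset<Row> named df:

    // The checkpoint directory must be set on the context first.
    spark.sparkContext().setCheckpointDir("/tmp/checkpoints");

    Dataset<Row> eager = df.checkpoint(true);  // cuts the lineage and materializes now
    Dataset<Row> lazy  = df.checkpoint(false); // cuts the lineage at the next action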

Spark 2 + Java + UDF + unknown return type...

2017-02-02 Thread Jean Georges Perrin
Hi fellow Sparkans, I am building a UDF (in Java) that can return various data types; basically, the signature of the function itself is: public Object call(String a, Object b, String c, Object d, String e) throws Exception. When I register my function, I need to provide a type, e.g.:
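One commonly suggested workaround, an assumption on my part rather than the thread's resolution, is to register the same implementation once per return type, under different names, and let the caller pick the matching variant:

    import org.apache.spark.sql.types.DataTypes;

    // myUdf is the five-argument UDF described above (hypothetical variable name).
    spark.udf().register("f_string", myUdf, DataTypes.StringType);
    spark.udf().register("f_int", myUdf, DataTypes.IntegerType);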

Re: eager? in dataframe's checkpoint

2017-02-02 Thread Jean Georges Perrin
If you don't checkpoint, you continue to build the lineage, > therefore while that lineage is being resolved, you may hit the > StackOverflowException. > > HTH, > Burak > > On Thu, Jan 26, 2017 at 10:36 AM, Jean Georges Perrin > wrote:

Custom Spark data source in Java

2017-03-22 Thread Jean Georges Perrin
Hi, I am trying to build a custom file data source for Spark, in Java. I have found numerous examples in Scala (including the CSV and XML data sources from Databricks), but I cannot bring Scala into this project. We already have the parser itself written in Java; I just need to build the "g

Re: Custom Spark data source in Java

2017-03-22 Thread Jean Georges Perrin
then also post > the code here without disclosing confidential stuff. > Or you try directly in Java a data source that returns always a row with one > column containing a String. I fear in any case you need to import some Scala > classes in Java and/or have some wrappers in Scala. > If

Re: checkpoint

2017-04-14 Thread Jean Georges Perrin
Sorry - can't help with PySpark, but here is some Java code which you may be able to adapt to Python: http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/ jg > On Apr 14, 2017, at 07:18, issues solution wrote: > > Hi > someone can give me a complete example to work with c

Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-07 Thread Jean Georges Perrin
Do you have some other security in place like Kerberos or impersonation? It may affect your access. jg > On Jun 7, 2017, at 02:15, Patrik Medvedev wrote: > > Hello guys, > > I need to execute hive queries on remote hive server from spark, but for some > reason I receive only column names(

"Sharing" dataframes...

2017-06-20 Thread Jean Georges Perrin
Hey, Here is my need: program A does something on a set of data and produces results, program B does that on another set, and finally, program C combines the data of A and B. Of course, the easy way is to dump it all to disk after A and B are done, but I wanted to avoid this. I was thinking of c

org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Jean Georges Perrin
Hey all, I was giving 2.1.1 a run and got an error on one of my test programs: package net.jgp.labs.spark.l000_ingestion; import java.util.Arrays; import java.util.List; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; import org

Re: "Sharing" dataframes...

2017-06-20 Thread Jean Georges Perrin
---faster--required-for-related-jobs> > https://github.com/cloudera/livy#post-sessions > > On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin > wrote: > Hey, > > Here is my need: progr

Re: org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Jean Georges Perrin
After investigation, it looks like my Spark 2.1.1 jars got corrupted during download - all good now... ;) > On Jun 20, 2017, at 4:14 PM, Jean Georges Perrin wrote: > > Hey all, > > i was giving a run to 2.1.1 and got an error on one of my test program

Re: "Sharing" dataframes...

2017-06-21 Thread Jean Georges Perrin
s in general a bad idea, since it breaks the DAG > and thus prevents some potential push-down optimizations. > > On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin > wrote: > Thanks Vadim & Jörn... I will look into those. > > jg > >

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Jean Georges Perrin
Awesome! Congrats! Can't wait!! jg > On Jul 11, 2017, at 18:48, Michael Armbrust wrote: > > Hi all, > > Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release > removes the experimental tag from Structured Streaming. In addition, this > release focuses on usability, sta

Re: Is there a way to run Spark SQL through REST?

2017-07-22 Thread Jean Georges Perrin
There's Livy, but it's pretty resource intensive. I know it's not helpful, but my company has developed its own and I am trying to open-source it. Looks like there are quite a few companies who had the need and custom-built one. jg > On Jul 22, 2017, at 04:01, kant kodali wrote: > > Is there a

Quick one on evaluation

2017-08-02 Thread Jean Georges Perrin
Hi Sparkians, I understand the lazy evaluation mechanism with transformations and actions. My question is simpler: 1) are show() and/or printSchema() actions? I would assume so... And an optional question: 2) is there a way to know if there are transformations "pending"? Thanks! jg

Re: Quick one on evaluation

2017-08-02 Thread Jean Georges Perrin
pending ? You can see the status of the job in the UI. > >> On 2. Aug 2017, at 14:16, Jean Georges Perrin wrote: >> >> Hi Sparkians, >> >> I understand the lazy evaluation mechanism with transformations and actions. >> My question is simpler: 1) are sho

Re: SPARK Issue in Standalone cluster

2017-08-04 Thread Jean Georges Perrin
I use CIFS and it works reasonably well: easily cross-platform and well documented... > On Aug 4, 2017, at 6:50 AM, Steve Loughran wrote: > > >> On 3 Aug 2017, at 19:59, Marco Mistroni wrote: >> >> Hello >> my 2 cents here, hope it helps >> If you want to just to play around with Spark, i'd

Re: Quick one on evaluation

2017-08-04 Thread Jean Georges Perrin
wrote: > > > On Wed, Aug 2, 2017 at 2:16 PM, Jean Georges Perrin > wrote: > Hi Sparkians, > > I understand the lazy evaluation mechanism with transformations and actions. > My question is simpler: 1) are show() and/or printSchema() actions? I

Spark 2 | Java | Dataset

2017-08-17 Thread Jean Georges Perrin
Hey, I was wondering if it would make sense to have a Dataset of something other than Row? Does anyone have an example (in Java) or a use case? My use case would be to use Spark on existing objects we have and benefit from the distributed processing on those objects. jg ---
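That is what bean encoders are for; a sketch, assuming a hypothetical Person JavaBean (getters, setters, and a no-arg constructor):

    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    Person p = new Person();
    p.setName("Jean");
    p.setAge(42);

    // Spark derives the schema from the bean's properties.
    Dataset<Person> people = spark.createDataset(
        Arrays.asList(p), Encoders.bean(Person.class));
    people.show();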

Re: Spark 2.0.0 and Hive metastore

2017-08-29 Thread Jean Georges Perrin
Sorry if my comment is not helping, but... why do you need Hive? Can't you save your aggregation using parquet for example? jg > On Aug 29, 2017, at 08:34, Andrés Ivaldi wrote: > > Hello, I'm using Spark API and with Hive support, I dont have a Hive > instance, just using Hive for some aggre

Re: How to convert Row to JSON in Java?

2017-09-10 Thread Jean Georges Perrin
Hey, I have a few examples at https://github.com/jgperrin/net.jgp.labs.spark. I recently worked on such problems, so there's definitely a solution there, or I'll be happy to write one for you. Look in l250 map... jg > On Sep 10, 2017, at 20:51, ayan guha wrote: > > Sorry for side-line questi

Re: How to convert Row to JSON in Java?

2017-09-10 Thread Jean Georges Perrin
Sorry - more likely l700 save. jg > On Sep 10, 2017, at 20:56, Jean Georges Perrin wrote: > > Hey, > > I have a few examples https://github.com/jgperrin/net.jgp.labs.spark. I > recently worked on such problems, so there's definitely a solution there or > I'l
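Worth noting alongside the repo pointer: the Dataset API has a direct route, toJSON(), which turns each Row into a JSON string; a one-line sketch, assuming a Dataset<Row> named df:

    Dataset<String> json = df.toJSON();
    json.show(5, false); // print a few rows without truncating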

Re: Nested RDD operation

2017-09-15 Thread Jean Georges Perrin
Hey Daniel, not sure this will help, but... I had a similar need where I wanted the content of a dataframe to become a "cell" or a row in the parent dataframe. I grouped the child dataframe, then collected it as a list in the parent dataframe after a join operation. As I said, not sure it match

[Timer-0:WARN] Logging$class: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-09-18 Thread Jean Georges Perrin
Hi, I am trying to connect to a new cluster I just set up. And I get... [Timer-0:WARN] Logging$class: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources I must have forgotten something really super obvious. My

Re: Nested RDD operation

2017-09-19 Thread Jean Georges Perrin
ool.toString)).toDF("event_name")).select("eventIndex").first().getDouble(0)) > }) > }) > > Wondering if there is any better/faster way to do this ? > > Thanks. > > > > On Fri, 15 Sep 2017 at 13:31 Jean Georges Perrin

Re: Is there a SparkILoop for Java?

2017-09-20 Thread Jean Georges Perrin
It all depends on what you want to do :) - we've implemented some dynamic interpretation of Java code through dynamic class loading; not as powerful and it requires some work on your end, but you can build a Java-based interpreter. jg > On Sep 20, 2017, at 08:06, kant kodali wrote: > > Got it! Th

Re: Spark code to get select firelds from ES

2017-09-20 Thread Jean Georges Perrin
Same issue with RDBMS ingestion (I think). I solved it with views. Can you do views on ES? jg > On Sep 20, 2017, at 09:22, Kedarnath Dixit > wrote: > > Hi, > > I want to get only select fields from ES using Spark ES connector. > > I have done some code which is fetching all the documents m

Re: Quick one... AWS SDK version?

2017-10-07 Thread Jean Georges Perrin
Hey Marco, I am actually reading from S3 and I use 2.7.3, but I inherited the project and they use some AWS API from the Amazon SDK, whose version is like from yesterday :) so it's confusing, and AMZ is changing its version like crazy, so it's a little difficult to follow. Right now I went back to 2.

Re: What is the equivalent of forearchRDD in DataFrames?

2017-10-26 Thread Jean Georges Perrin
Just hints: Repartition into 10? Get the RDD from the dataframe? What about a forEach on the rows, sending every 100? (I just did that, actually) > On Oct 26, 2017, at 13:37, Noorul Islam Kamal Malmiyoda > wrote: > > Hi all, > > I have a Dataframe with 1000 records. I want to split them into 100 >
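A sketch of the "send every 100" idea on a DataFrame, using foreachPartition; the println stands in for whatever the real per-batch work is:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.function.ForeachPartitionFunction;
    import org.apache.spark.sql.Row;

    df.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
        List<Row> batch = new ArrayList<>();
        while (rows.hasNext()) {
            batch.add(rows.next());
            if (batch.size() == 100) {
                System.out.println("sending " + batch.size() + " rows"); // replace with the real send
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            System.out.println("sending " + batch.size() + " rows"); // flush the remainder
        }
    });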

Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Jean Georges Perrin
May I ask what the use case is? Although it is a very interesting question, I would be concerned about going further than a proof of concept. A lot of the enterprises I see and visit are barely on Java 8, so starting to talk about JDK 9 might be slight overkill, but if you have a good story, I’m a

Re: Hi all,

2017-11-03 Thread Jean Georges Perrin
Hi Oren, Why don’t you want to use a GroupBy? You can cache or checkpoint the result and use it in your process, keeping everything in Spark and avoiding save/ingestion... > On Oct 31, 2017, at 08:17, אורן שמון <oren.sha...@gmail.com> wrote: > > I have 2 Spark jobs, one is pre-process and

Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread Jean Georges Perrin
Write a UDF? > On Oct 31, 2017, at 11:48, Aakash Basu > wrote: > > Hey all, > > Any help in the below please? > > Thanks, > Aakash. > > > -- Forwarded message -- > From: Aakash Basu > > Date: Tue, Oct 31,
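For example, a minimal UDF registration in Java; the column names and the id-building logic are hypothetical:

    import org.apache.spark.sql.api.java.UDF2;
    import org.apache.spark.sql.types.DataTypes;
    import static org.apache.spark.sql.functions.callUDF;

    // Combine a level and a name into one hierarchical id.
    spark.udf().register("levelId",
        (UDF2<Integer, String, String>) (level, name) -> level + "/" + name,
        DataTypes.StringType);
    df = df.withColumn("levelId", callUDF("levelId", df.col("level"), df.col("name")));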

Re: How to get the data url

2017-11-03 Thread Jean Georges Perrin
I am a little confused by your question… Are you trying to ingest a file from S3? If so… look for net.jgp.labs.spark on GitHub and look for net.jgp.labs.spark.l000_ingestion.l001_csv_in_progress.S3CsvToDataset You can modify the file as the keys are yours… If you want to download first: look

Re: learning Spark

2017-12-05 Thread Jean Georges Perrin
When you pick a book, make sure it covers the version of Spark you want to deploy. There are a lot of books out there that focus a lot on Spark 1.x. Spark 2.x generalizes the dataframe API, introduces Tungsten, etc. All might not be relevant to a pure “sys admin” learning, but it is good to know

Storage at node or executor level

2017-12-22 Thread Jean Georges Perrin
Hi all, This is more of a general architecture question; I have my idea, but wanted to confirm or refute it... When your executor is accessing data, where is it stored: at the executor level or at the worker level? jg

Re: Custom Data Source for getting data from Rest based services

2017-12-24 Thread Jean Georges Perrin
If you need Java code, you can have a look @: https://github.com/jgperrin/net.jgp.labs.spark.datasources and: https://databricks.com/session/extending-apache-sparks-ingestion-building-your-own-java-data-source

Re: S3 token times out during data frame "write.csv"

2018-01-25 Thread Jean Georges Perrin
Are you writing from an Amazon instance or from an on-premise install to S3? How many partitions are you writing from? Maybe you can try to “play” with repartitioning to see how it behaves? > On Jan 23, 2018, at 17:09, Vasyl Harasymiv wrote: > > It is about 400 million rows. S3 automatically chu

Schema - DataTypes.NullType

2018-01-29 Thread Jean Georges Perrin
Hi Sparkians, Can someone tell me what is the purpose of DataTypes.NullType, especially as you are building a schema? Thanks jg

Re: Type Casting Error in Spark Data Frame

2018-01-29 Thread Jean Georges Perrin
You can try to create new columns with the nested value, > On Jan 29, 2018, at 15:26, Arnav kumar wrote: > > Hello Experts, > > I would need your advice in resolving the below issue when I am trying to > retrieving the data from a dataframe. > > Can you please let me know where I am going wr

Re: is there a way to create new column with timeuuid using raw spark sql ?

2018-02-01 Thread Jean Georges Perrin
Sure, use withColumn()... jg > On Feb 1, 2018, at 05:50, kant kodali wrote: > > Hi All, > > Is there any way to create a new timeuuid column of a existing dataframe > using raw sql? you can assume that there is a timeuuid udf function if that > helps. > > Thanks!
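Concretely, assuming the timeuuid UDF the question stipulates is already registered, both routes work; a sketch:

    import static org.apache.spark.sql.functions.callUDF;

    // DataFrame API:
    Dataset<Row> withId = df.withColumn("id", callUDF("timeuuid"));

    // Raw SQL, after registering df as a temp view named "t":
    // df.createOrReplaceTempView("t");
    // Dataset<Row> withId2 = spark.sql("SELECT timeuuid() AS id, * FROM t");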

Re: Schema - DataTypes.NullType

2018-02-04 Thread Jean Georges Perrin
Any takers on this one? ;) > On Jan 29, 2018, at 16:05, Jean Georges Perrin wrote: > > Hi Sparkians, > > Can someone tell me what is the purpose of DataTypes.NullType, especially as > you are building a schema? >

Re: Schema - DataTypes.NullType

2018-02-11 Thread Jean Georges Perrin
What is the purpose of DataTypes.NullType, especially as you are building a schema? Has anyone used it or seen it as part of a schema auto-generation? (If I keep asking long enough, I may get an answer, no? :) ) > On Feb 4, 2018, at 13:15, Jean Georges Perrin wrote: > > Any take

Re: Schema - DataTypes.NullType

2018-02-12 Thread Jean Georges Perrin
long (nullable = true) > |-- val: null (nullable = true) > > > Nicholas Szandor Hakobian, Ph.D. > Staff Data Scientist > Rally Health > nicholas.hakob...@rallyhealth.com > > > On Sun, Feb 11, 2018 at 5:40 AM, Jean Georges Perri
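To make the quoted printout concrete, a sketch of a hand-built schema carrying a NullType column; such a column can only ever hold nulls, which is how Spark types a bare functions.lit(null) until a cast gives it a real type:

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    StructType schema = new StructType()
        .add("id", DataTypes.LongType, true)
        .add("val", DataTypes.NullType, true); // prints as "val: null (nullable = true)"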

Re: What's the best way to have Spark a service?

2018-03-15 Thread Jean Georges Perrin
Hi David, I ended up building my own. Livy sounded great on paper, but was heavy to manipulate. I found out about Jobserver too late. We did not find it too complicated to build ours, with a small Spring Boot app that was holding the session (we did not need more than one session). jg > On Mar 15,

A code example of Catalyst optimization

2018-06-04 Thread Jean Georges Perrin
Hi there, I am looking for an example of optimization through Catalyst that you can demonstrate via code. Typically, you load some data into a dataframe, do something, do the opposite operation, and, when you collect, it’s super fast because nothing really happened to the data. Hopefully
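One demo along those lines, my sketch rather than a canonical example: add an expensive column and drop it again; Catalyst's projection collapsing and column pruning remove the computation from the optimized plan.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.exp;

    Dataset<Row> df = spark.range(10000000L).toDF("id");
    Dataset<Row> roundTrip = df
        .withColumn("expensive", exp(col("id"))) // never actually computed...
        .drop("expensive");                      // ...because it is pruned away
    roundTrip.explain(true); // the optimized plan no longer mentions exp(id)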

Re: submitting dependencies

2018-06-27 Thread Jean Georges Perrin
Have you tried to build an uber jar to bundle all your classes together? > On Jun 27, 2018, at 01:27, amin mohebbi > wrote: > > Could you please help me to understand how I should submit my spark > application ? > > I have used this connector (pygmalios/react

Re: spark sql data skew

2018-07-13 Thread Jean Georges Perrin
Just thinking out loud… Repartition by key? Create a composite key based on company and userId? How big is your dataset? > On Jul 13, 2018, at 06:20, 崔苗 wrote: > > Hi, > when I want to count(distinct userId) by company, I met data skew and the > task takes too long, how to count disti
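Two sketches of what that could look like, assuming columns named company and userId:

    import static org.apache.spark.sql.functions.countDistinct;

    // 1) The direct aggregation:
    df.groupBy("company").agg(countDistinct("userId")).show();

    // 2) De-duplicate first, so the wide shuffle only carries unique
    //    (company, userId) pairs; often friendlier when one company dominates:
    df.select("company", "userId").distinct()
      .groupBy("company").count().show();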

Re: How to merge multiple rows

2018-08-22 Thread Jean Georges Perrin
How do you do it now? You could use a withColumn(“newDetails”, ) jg > On Aug 22, 2018, at 16:04, msbreuer wrote: > > A dataframe with following contents is given: > > ID PART DETAILS > 11 A1 > 12 A2 > 13 A3 > 21 B1 > 31 C1 > > Target format should be as following: > >
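If the goal is one row per ID with the details gathered together, a groupBy with collect_list is one route; my suggestion, assuming the column names from the question:

    import static org.apache.spark.sql.functions.collect_list;

    Dataset<Row> merged = df.groupBy("ID")
        .agg(collect_list("DETAILS").as("newDetails"));
    merged.show();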

Where is the DAG stored before catalyst gets it?

2018-10-04 Thread Jean Georges Perrin
Hi, I am assuming it is still in the master, and when Catalyst is finished, it sends the tasks to the workers. Correct? tia jg

Triangle Apache Spark Meetup

2018-10-10 Thread Jean Georges Perrin
Hi, Just a small plug for the Triangle Apache Spark Meetup (TASM), which covers Raleigh, Durham, and Chapel Hill in North Carolina, USA. The group started back in July 2015. More details here: https://www.meetup.com/Triangle-Apache-Spark-Meetup/ . Ca

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread Jean Georges Perrin
did not see anything, but curious if you find something. I think one of the big benefits of using Java, for data engineering in the context of Spark, is that you do not have to train a lot of your team in Scala. Now if you want to do data science, Java is probably not the best tool yet... > On

Re: Is there any Spark source in Java

2018-11-03 Thread Jean Georges Perrin
This one is very close to my heart :) Look at: https://github.com/jgperrin/net.jgp.labs.spark And if the examples are too weird, have a look at my book, published at Manning: http://jgp.net/book Feedback appreciated! jg > On Nov 3, 2018, at 12:30, Jeyhun Karimov wrote: > > Hi Soheil, >

Re: OData compliant API for Spark

2018-12-05 Thread Jean Georges Perrin
I was involved in a project like that and we decided to deploy the data in https://ckan.org/. We used Spark for the data pipeline and transformation. Hth. jg > On Dec 4, 2018, at 21:14, Affan Syed wrote: > > All, > > We have been thinking about exposing our platform for analytics an OData

Re: how to generate a larg dataset paralleled

2018-12-13 Thread Jean Georges Perrin
Do you just want to generate some data in Spark, or ingest a large dataset from outside of Spark? What’s the ultimate goal you’re pursuing? jg > On Dec 13, 2018, at 21:38, lk_spark wrote: > > hi, all: > I want to generate some test data, which contains about one hundred > million rows. >
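If it is the former, Spark can generate the rows in parallel instead of building them on the driver; a sketch for one hundred million rows with a random value column, assuming a SparkSession named spark:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.rand;

    Dataset<Row> df = spark.range(100000000L).toDF("id")
        .withColumn("value", rand()); // generated on the executors, not the driver
    df.show(5);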

Multiple sessions in one application?

2018-12-19 Thread Jean Georges Perrin
Hi there, I was curious about what use cases would drive the use of newSession() (as in https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html#newSession-- ). I underst
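For reference, a small sketch of what newSession() isolates versus shares: the underlying SparkContext and the cached data are common, while the SQL configuration, temp views, and registered functions are per session.

    SparkSession s2 = spark.newSession();
    spark.range(5).createOrReplaceTempView("t");
    spark.sql("SELECT count(*) FROM t").show();   // fine in the original session
    // s2.sql("SELECT count(*) FROM t")           // would fail: temp views are per session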

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-14 Thread Jean Georges Perrin
There is a little bit more to it than the list you specified; nevertheless, some data types are not directly compatible between Scala and Java and require conversion, so it’s good not to pollute your code with plenty of conversions and to focus on using the straight API. I don’t remember from the top

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-15 Thread Jean-Georges Perrin
java class to extend this class to return a JavaDStream. > This is my real problem. > Tell me if the above description is not clear, because English is > not my native language. > > Thanks in advance > Gary > > On Tue, May 14, 2019 at 11:06 PM Jean Georges Perri

Checkpointing and accessing the checkpoint data

2019-06-27 Thread Jean-Georges Perrin
any performance comparison between the two? On small datasets, caching seems more performant, but I can imagine that there is a sweet spot… Thanks! jgp Jean-Georges Perrin j...@jgp.net
