Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Alex
iveServer2. --Alex On 3/8/2016 3:13 PM, Mich Talebzadeh wrote: Hi, What do you mean by Hive Metastore Client? Are you referring to Hive server login much like beeline? Spark uses hive-site.xml to get the details of Hive metastore and the login to the metastore which could be any database. Mine

Re: Hive Context: Hive Metastore Client

2016-03-08 Thread Alex
find a solution in the meantime. Thanks, Alex On 3/8/2016 4:00 PM, Mich Talebzadeh wrote: The current scenario resembles a three tier architecture but without the security of second tier. In a typical three-tier you have users connecting to the application server (read Hive server2) are

Re: spark architecture question -- Please Read

2017-01-29 Thread Alex
Hi All, Thanks for your response.. Please find the flow diagram below. Please help me simplify this architecture using Spark. 1) Can I skip step 1 to step 4 and directly store it in Spark? If I am storing it in Spark, where actually does it get stored? Do I need to retain Hadoop to store data o

Re: spark architecture question -- Please Read

2017-01-29 Thread Alex
Spark supports through the Hadoop apis a wide range of file > systems, but does not need HDFS for persistence. You can have local > filesystem (ie any file system mounted to a node, so also distributed ones, > such as zfs), cloud file systems (s3, azure blob etc). > > > > On 2

Re: help!!!----issue with spark-sql type cast from long to LongWritable

2017-01-30 Thread Alex
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error: java.lang.Double cannot be cast to org.apache.hadoop.hive.serde2.io.DoubleWritable] Getting below error while running hive UDF on spark but the UDF is working perfectly fine in Hive.. public Object get(Object name) {
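
A minimal sketch of the usual workaround for this failure mode (not the thread's own fix): Hive hands a UDF writable wrappers while Spark SQL often hands it plain java.lang values for the same column, so unwrap defensively (Scala, names assumed):

    import org.apache.hadoop.hive.serde2.io.DoubleWritable

    def asDouble(obj: Any): java.lang.Double = obj match {
      case null                => null
      case w: DoubleWritable   => w.get()   // Hive code path: writable wrapper
      case d: java.lang.Double => d         // Spark SQL code path: plain value
      case other               => java.lang.Double.valueOf(other.toString)
    }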

how to compare two avro format hive tables

2017-01-30 Thread Alex
Hi Team, how to compare two Avro-format Hive tables to check whether they hold the same data? If I give limit 5 it gives different results
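
A small sketch of one way to diff the tables, assuming they share a schema; note that "limit 5" without an "order by" legitimately returns an arbitrary five rows, so it cannot be used as a comparison:

    val a = sqlContext.table("db.table_a")   // table names assumed
    val b = sqlContext.table("db.table_b")
    println(s"only in a: ${a.except(b).count()}, only in b: ${b.except(a).count()}")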

Re: help!!!----issue with spark-sql type cast from long to LongWritable

2017-01-30 Thread Alex
How to debug Hive UDFs?! On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu" wrote: > Hi Team, > > I am trying to keep the below code in a get method and calling that get method in > another hive UDF > and running the hive UDF using the HiveContext.sql procedure.. > > > switch (f) { > case "double" : return (

Re: help!!!----issue with spark-sql type cast from long to LongWritable

2017-01-30 Thread Alex
Hi All, If I modify the code as below, the Hive UDF works in spark-sql but gives different results.. Please let me know the difference between the two versions below.. 1) public Object get(Object name) { int pos = getPos((String)name); if(pos<0) return null; Str

do both pieces of code below do the same thing? I had to refactor code to fit in spark-sql

2017-01-30 Thread Alex
public Object get(Object name) { int pos = getPos((String) name); if (pos < 0) return null; String f = "string"; Object obj = list.get(pos); Object result = null; if (obj == null)

alternatives for long to LongWritable typecasting in spark sql

2017-01-30 Thread Alex
Hi Guys, please let me know any other ways to typecast, as the code below throws an error: unable to typecast java.lang.Long to LongWritable (and the same for Double, and for Text) in spark-sql. The piece of code below is from the Hive UDF which I am trying to run in spark-sql: public Object get(Object name) {

Roadblock -- stuck for 10 days :( how come the same Hive UDF gives different results in Spark and Hive

2017-01-31 Thread Alex
Hi All, I am trying to run a Hive UDF in spark-sql and it gives different rows as results in Hive and Spark.. My UDF query looks something like this: select col1,col2,col3, sum(col4) col4, sum(col5) col5, Group_name from (select inline(myudf('cons1',record)) from table1) test group by col1,c

Re: do both pieces of code below do the same thing? I had to refactor code to fit in spark-sql

2017-01-31 Thread Alex
Guys! Please Reply On Tue, Jan 31, 2017 at 12:31 PM, Alex wrote: > public Object get(Object name) { > int pos = getPos((String) name); > if (pos < 0) > return null; > String f = "string&quo

Hive Java UDF running on spark-sql issue

2017-01-31 Thread Alex
Hi, we have Java Hive UDFs which are working perfectly fine in Hive, so for better performance we are migrating the same to Spark-sql. We are giving these jar files to spark-sql with the --jars argument and defining temporary functions to make them run on spark-sql. There is this particular Java UDF

Re: Hive Java UDF running on spark-sql issue

2017-02-01 Thread Alex
ther type depending on what is the type of > the original value? > Kr > > > > On 1 Feb 2017 5:56 am, "Alex" wrote: > > Hi , > > > we have Java Hive UDFS which are working perfectly fine in Hive > > SO for Better performance we are migrating the sam

Is it okay to run Hive Java UDFs in Spark-sql? Anybody still doing it?

2017-02-02 Thread Alex
uld you run the same Java UDF using Spark-sql, or would you recode all Java UDFs to Scala UDFs and then run? Regards, Alex

Surprised!!!!! Spark-shell showing inconsistent results

2017-02-02 Thread Alex
Hi, as shown below, the same query run back to back shows inconsistent results.. testtable1 is an Avro Serde table... [image: Inline image 1] hc.sql("select * from testtable1 order by col1 limit 1").collect; res14: Array[org.apache.spark.sql.Row] = Array([1570,3364,201607,Y,APJ,PHILIPPINES,8518

Re: Surprised!!!!! Spark-shell showing inconsistent results

2017-02-03 Thread Alex
: Inline image 1] On Thu, Feb 2, 2017 at 3:33 PM, Alex wrote: > Hi As shown below same query when ran back to back showing inconsistent > results.. > > testtable1 is Avro Serde table... > > [image: Inline image 1] > > > > hc.sql("select * from testtable1 order
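
One hedged explanation with a sketch: when several rows tie on col1, "order by col1 limit 1" may return any of the tied rows on each run. Adding tiebreaker columns until the ordering is total makes the result deterministic (column names assumed):

    hc.sql("select * from testtable1 order by col1, col2, col3 limit 1").collect()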

Are DoubleWritable and DoubleObjectInspector doing the same thing in a Hive UDF?

2017-02-03 Thread Alex
Hi, can you guys tell me if the two pieces of code below return the same thing? (((DoubleObjectInspector) ins2).get(obj)); and ((DoubleWritable)obj).get(); from the two codes below: code 1) public Object get(Object name) { int pos = getPos((String)name); if(pos<0) return null; Stri

Re: Are DoubleWritable and DoubleObjectInspector doing the same thing in a Hive UDF?

2017-02-04 Thread Alex
Hi, please reply? On Fri, Feb 3, 2017 at 8:19 PM, Alex wrote: > Hi, > > can You guys tell me if below peice of two codes are returning the same > thing? > > (((DoubleObjectInspector) ins2).get(obj)); and (DoubleWritable)obj).get() > ; from below two codes > >
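
A sketch of the distinction, not a reply from the thread: the two calls agree only when obj physically is a DoubleWritable; asking the ObjectInspector decouples the UDF from the physical representation, which is what survives the move between Hive and Spark SQL:

    import org.apache.hadoop.hive.serde2.io.DoubleWritable
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

    // The writable-backed inspector's get() expects a DoubleWritable...
    val writableOI = PrimitiveObjectInspectorFactory.writableDoubleObjectInspector
    println(writableOI.get(new DoubleWritable(1.5)))    // 1.5

    // ...while the java-backed inspector's get() expects a plain java.lang.Double.
    val javaOI = PrimitiveObjectInspectorFactory.javaDoubleObjectInspector
    println(javaOI.get(java.lang.Double.valueOf(1.5)))  // 1.5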

Re: spark architecture question -- Please Read

2017-02-07 Thread Alex
iew?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this emai

Re: keep or remove sc.stop() coz of RpcEnv already stopped error

2017-03-13 Thread Alex
Hi, I am using spark-1.6. How to ignore this warning? Because of this IllegalStateException my production jobs, which are scheduled, are showing completed abnormally... I can't even handle the exception, as after sc.stop if I try to execute any code again this exception will come from the catch block.. so I re
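
A minimal sketch of the shape that avoids the race, under the assumption that nothing may touch the context after stop() and that a second stop (e.g. from a shutdown hook) is what raises the IllegalStateException:

    try {
      runJob(sc)   // assumed job body
    } finally {
      sc.stop()    // the very last statement; execute nothing after this
    }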

[PySpark Profiler]: Does empty profile mean no execution in Python Interpreter?

2018-11-01 Thread Alex
b.com/AlexHagerman/pyspark-profiling Thanks, Alex from pyspark.sql import SparkSession from pyspark import SparkContext from pyspark.sql.types import ArrayType from pyspark.sql.functions import broadcast, udf from pyspark.ml.feature import Word2Vec, Word2VecModel from pyspark.ml.linalg import Vector, Vect

RE: is there anyway to enforce Spark to cache data in all worker nodes(almost equally) ?

2015-04-30 Thread Alex
482 MB should be small enough to be distributed as a set of broadcast variables. Then you can use local features of spark to process. -Original Message- From: "shahab" Sent: ‎4/‎30/‎2015 9:42 AM To: "user@spark.apache.org" Subject: is there anyway to enforce Spark to cache data in all w
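
A sketch of the suggestion (lookup structure and names assumed): ship the data once per executor as a broadcast variable and look it up locally:

    val lookup: Map[String, Int] = loadLookupTable()   // assumed helper, ~482 MB
    val bc = sc.broadcast(lookup)                      // sent to each worker once
    val enriched = rdd.map(x => (x, bc.value.getOrElse(x, -1)))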

Re: Cassandra raw deletion

2020-07-04 Thread Alex Ott
ra AS> connector. AS> Thanks AS> Amit -- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian) - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

[Datasource API V2] Creating datasource - no step for final cleanup on read

2021-01-18 Thread Alex Rehnby
nning something at the end of the read operation using the current API? If not, I would ask if this might be a useful addition, or if there are design reasons for not including such a step. Thanks, Alex

[PySpark] how to use a JAR model to score a dataset?

2021-08-13 Thread Alex Martishius
Hello, This question has been addressed on Stack Overflow using the spark shell, but not PySpark. I found within the Spark SQL documentation where in PySpark SQL I can load a JAR into my SparkSession config such as: *spark = SparkSession\* *.builder\* *.appName("appname")\* *.config(

Re: Spark Structured Streaming Continuous Trigger on multiple sinks

2021-09-12 Thread Alex Ott
he second one does not. S> Is there any solution to the problem of being able to write to multiple sinks in Continuous Trigger Mode using Structured Streaming? -- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian) ---

Re: Unable to use WriteStream to write to delta file.

2021-12-19 Thread Alex Ott
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:286)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:209)
obj.test_ingest_incremental_data_batch1()
File "C:\Users\agundapaneni\Development\ModernDataEstate\tests\test_mdefbasic.py", line 56, in test_ingest_incremental_data_batch1
mdef.ingest_incremental_data('example', entity, self.schemas['studentattendance'], 'school_year')
File "C:\Users\agundapaneni\Development\ModernDataEstate/src\MDEFBasic.py", line 109, in ingest_incremental_data
query.awaitTermination()  # block until query is terminated, with stop() or with error
File "C:\Users\agundapaneni\Development\ModernDataEstate\.tox\default\lib\site-packages\pyspark\sql\streaming.py", line 101, in awaitTermination
return self._jsq.awaitTermination()
File "C:\Users\agundapaneni\Development\ModernDataEstate\.tox\default\lib\site-packages\py4j\java_gateway.py", line 1309, in __call__
return_value = get_return_value(
File "C:\Users\agundapaneni\Development\ModernDataEstate\.tox\default\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.StreamingQueryException: org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldNames(Lscala/collection/Seq;)V
=== Streaming Query ===
-- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian)

Re: Spark 3.2.0 upgrade

2022-01-22 Thread Alex Ott
dClass(ClassLoaders.java:178) AS> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) AS> Thanks AS> AS> Amit -- With best wishes, Alex Ott http://alexott.net/ Twitter: alexott_en (English), alexott (Russian) - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: [Spark SQL] Structured Streaming in pyhton can connect to cassandra ?

2022-03-25 Thread Alex Ott
'_thread.RLock' object gf> Can you please tell me how to do this? gf> Or at least give me some advice? gf> Sincerely, gf> FARCY Guillaume. gf> - gf> To unsubscribe e-mail: user-unsubscr...@spark.

spark ETL and spark thrift server running together

2022-03-30 Thread Alex Kosberg
Hi, Some details: * Spark SQL (version 3.2.1) * Driver: Hive JDBC (version 2.3.9) * ThriftCLIService: Starting ThriftBinaryCLIService on port 1 with 5...500 worker threads * BI tool is connected via ODBC driver. After activating Spark Thrift Server I'm unable to ru

RE: [EXTERNAL] Re: spark ETL and spark thrift server running together

2022-03-30 Thread Alex Kosberg
Hi Christophe, Thank you for the explanation! Regards, Alex From: Christophe Préaud Sent: Wednesday, March 30, 2022 3:43 PM To: Alex Kosberg ; user@spark.apache.org Subject: [EXTERNAL] Re: spark ETL and spark thrift server running together Hi Alex, As stated in the Hive documentation (https

[Spark thread pool configurations]: I would like to configure all ThreadPoolExecutor parameters for each thread pool started in Spark

2022-07-27 Thread Alex Peelman
Hi everyone, My name is Alex and I've been using Spark for the past 4 years to solve most, if not all, of my data processing challenges. From time to time I go a bit left field with this :). Like embedding Spark in my JVM based application running only in `local` mode and using it as a

Unsubscribe

2023-08-01 Thread Alex Landa
Unsubscribe

Re: Using Spark like a search engine

2015-05-25 Thread Alex Chavez
ent pair is dominated by computing ~50 dot products of 100-dimensional vectors. Best, Alex On Mon, May 25, 2015 at 2:59 AM, Сергей Мелехин wrote: > Hi, ankur! > Thanks for your reply! > CVs are a just bunch of IDs, each ID represents some object of some class > (eg. class=JOB, object

Re: Implementing custom RDD in Java

2015-05-26 Thread Alex Robbins
s, and then load them from S3. > > On Mon, May 25, 2015 at 8:19 PM, Alex Robbins < > alexander.j.robb...@gmail.com> wrote: > >> If a Hadoop InputFormat already exists for your data source, you can load >> it from there. Otherwise, maybe you can dump your data source out

Re: Recommended Scala version

2015-05-29 Thread Alex Nakos
Hi- I’ve just built the latest spark RC from source (1.4.0 RC3) and can confirm that the spark shell is still NOT working properly on 2.11. No classes in the jar I've specified with the —jars argument on the command line are available in the REPL. Cheers Alex On Thu, May 28, 2015 at 8:

Re: Spark1.3.1 build issue with CDH5.4.0 getUnknownFields

2015-05-29 Thread Alex Robbins
I've gotten that error when something is trying to use a different version of protobuf than you want. Maybe check out a `mvn dependency:tree` to see if someone is trying to use something other than libproto 2.5.0. (At least, 2.5.0 was current when I was having the problem) On Fri, May 29, 2015 at

Re: Recommended Scala version

2015-05-31 Thread Alex Nakos
Hi- Yup, I’ve already done so here: https://issues.apache.org/jira/browse/SPARK-7944 Please let me know if this requires any more information - more than happy to provide whatever I can. Thanks Alex On Sun, May 31, 2015 at 8:45 AM, Tathagata Das wrote: > Can you file a JIRA with the detai

[OFFTOPIC] Big Data Application Meetup

2015-06-02 Thread Alex Baranau
at Hadoop Summit and Spark Summit in the following weeks. Thank you, Alex Baranau

Re: --jars not working?

2015-06-12 Thread Alex Nakos
Mesos. Thanks Alex On Fri, Jun 12, 2015 at 8:45 PM, Akhil Das wrote: > You can verify if the jars are shipped properly by looking at the driver > UI (running on 4040) Environment tab. > > Thanks > Best Regards > > On Sat, Jun 13, 2015 at 12:43 AM, Jonathan Coveney > wrot

different schemas per row with DataFrames

2015-06-18 Thread Alex Nastetsky
nd the order in which they occur, it may be possible to get the RDD from the DataFrame and build my own DataFrame with createDataFrame and passing it my fabricated super-schema. However, this is brittle, as the super-schema is not in my control and may change in the future. Thanks for any suggestions, Alex.

Calling rdd() on a DataFrame causes stage boundary

2015-06-22 Thread Alex Nastetsky
When I call rdd() on a DataFrame, it ends the current stage and starts a new one that just maps the DataFrame to rdd and nothing else. It doesn't seem to do a shuffle (which is good and expected), but then why is there a separate stage? I also thought that stages only end when there's a s

Re: breeze.linalg.DenseMatrix not found

2015-07-01 Thread Alex Gittens
jar to the classpath. Thanks for your help! Alex On Tue, Jun 30, 2015 at 9:11 AM, Burak Yavuz wrote: > How does your build file look? Are you possibly using wrong Scala > versions? Have you added Breeze as a dependency to your project? If so > which version? > > Thanks, > Burak

Re: Need clarification on spark on cluster set up instruction

2015-07-01 Thread Alex Gittens
I have a similar use case, so I wrote a python script to fix the cluster configuration that spark-ec2 uses when you use Hadoop 2. Start a cluster with enough machines that the hdfs system can hold 1Tb (so use instance types that have SSDs), then follow the instructions at http://thousandfold.net/cz

Aggregate to array (or 'slice by key') with DataFrames

2015-07-05 Thread Alex Beatson
Hello, I'm migrating some RDD-based code to using DataFrames. We've seen massive speedups so far! One of the operations in the old code creates an array of the values for each key, as follows: val collatedRDD = valuesRDD.mapValues(value=>Array(value)).reduceByKey((array1,array2) => array1++array
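
For later readers: newer Spark exposes this directly as an aggregate function; a sketch of the DataFrame analogue of the reduceByKey code (column names assumed):

    import org.apache.spark.sql.functions.collect_list

    val collatedDF = valuesDF.groupBy("key")
      .agg(collect_list("value").as("values"))   // per-key array, like Array ++ Array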

Re: Sorting the RDD

2016-03-03 Thread Alex Dzhagriev
turns something meaningful. Cheers, Alex. On Thu, Mar 3, 2016 at 8:39 AM, Angel Angel wrote: > Hello Sir/Madam, > > I am try to sort the RDD using *sortByKey* function but i am getting the > following error. > > > My code is > 1) convert the rdd array into key value pair. &

Re: Spark on RAID

2016-03-08 Thread Alex Kozlov
; > My question is why not raid? What is the argument\reason for not using > Raid? > > Thanks! > -Eddie > -- Alex Kozlov

Hive Context: Hive Metastore Client

2016-03-08 Thread Alex F
As of Spark 1.6.0 it is now possible to create new Hive Context sessions sharing various components but right now the Hive Metastore Client is shared amongst each new Hive Context Session. Are there any plans to create individual Metastore Clients for each Hive Context? Related to the question ab
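
A sketch of the 1.6 session API the question refers to: each session gets its own SQL conf, UDFs and temporary tables, while (as noted) the metastore client stays shared:

    val session2 = hiveContext.newSession()   // shares SparkContext and metastore client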

Re: sparkR issues ?

2016-03-14 Thread Alex Kozlov
error > > > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~ > condition) > Error in DataFrame(colData, row.names = rownames(colData)) : > cannot coerce class "data.frame" to a DataFrame > > I am really stumped. I am not using any spark fun

Re: sparkR issues ?

2016-03-15 Thread Alex Kozlov
; > On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui wrote: > >> It seems as.data.frame() defined in SparkR convers the versions in R base >> package. >> >> We can try to see if we can change the implementation of as.data.frame() >> in SparkR to avoid such covering. &g

Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Alex Dzhagriev
Hi Vinay, I believe it's not possible as the spark-shuffle code should run in the same JVM process as the Node Manager. I haven't heard anything about on the fly bytecode loading in the Node Manger. Thanks, Alex. On Wed, Mar 16, 2016 at 10:12 AM, Vinay Kashyap wrote: > Hi all, &

Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-19 Thread Alex Kozlov
t;> I thought about using data cache as well for serving the data >> The data cache should have the capability to serve the historical data >> in milliseconds (may be upto 30 days of data) >> -- >> Thanks >> Deepak >> www.bigdatabig.com >> >> -- Alex Kozlov ale...@gmail.com

--jars for mesos cluster

2016-05-03 Thread Alex Dzhagriev
to specify the --jars correctly? Thanks, Alex.

streaming on yarn

2016-06-24 Thread Alex Dzhagriev
matic scaling (not blocking the resources if there is no data in the stream) and the ui to manage the running jobs. Thanks, Alex.

spark sql aggregate function "Nth"

2016-07-26 Thread Alex Nastetsky
Spark SQL has a "first" function that returns the first item in a group. Is there a similar function, perhaps in a third party lib, that allows you to return an arbitrary (e.g. 3rd) item from the group? Was thinking of writing a UDAF for it, but didn't want to reinvent the wheel. My endgoal is to b

Re: spark sql aggregate function "Nth"

2016-07-26 Thread Alex Nastetsky
ut writing a UDF is much simpler than a UDAF. On Tue, Jul 26, 2016 at 11:48 AM, ayan guha wrote: > You can use rank with window function. Rank=1 is same as calling first(). > > Not sure how you would randomly pick records though, if there is no Nth > record. In your example, what
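
A sketch of the window-function approach from the reply, picking the 3rd row per group; the orderBy column ("ts", assumed) defines which row counts as Nth, and groups with fewer than three rows simply drop out:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val w = Window.partitionBy("group").orderBy("ts")
    val third = df.withColumn("rn", row_number().over(w))
      .where(col("rn") === 3)
      .drop("rn")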

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Alex Nastetsky
Ran into this need myself. Does Spark have an equivalent of "mapreduce.input.fileinputformat.list-status.num-threads"? Thanks. On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park wrote: > Hi, > > I am wondering if anyone has successfully enabled > "mapreduce.input.fileinputformat.list-status.num-t

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Alex Nastetsky
Thanks. I was actually able to get mapreduce.input.fileinputformat.list-status.num-threads working in Spark against a regular fileset in S3, in Spark 1.5.2 ... looks like the issue is isolated to Hive. On Tue, Jan 12, 2016 at 6:48 PM, Cheolsoo Park wrote: > Alex, see this jira- >
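
A sketch of the setting that worked here: it is a Hadoop conf, so it goes on the configuration Spark hands to its input formats (thread count assumed):

    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.list-status.num-threads", "20")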

Re: failure to parallelize an RDD

2016-01-12 Thread Alex Gittens
I'm using Spark 1.5.1 When I turned on DEBUG, I don't see anything that looks useful. Other than the INFO outputs, there is a ton of RPC message related logs, and this bit: 16/01/13 05:53:43 DEBUG ClosureCleaner: +++ Cleaning closure (org.apache.spark.rdd.RDD$$anonfun$count$1) +++ 16/01/13 05:53

Re: PCA OutOfMemoryError

2016-01-13 Thread Alex Gittens
nutes to compute the top 20 PCs of a 46.7K-by-6.3M dense matrix of doubles (~2 Tb), with most of the time spent on the distributed matrix-vector multiplies. Best, Alex On Tue, Jan 12, 2016 at 6:39 PM, Bharath Ravi Kumar wrote: > Any suggestion/opinion? > On 12-Jan-2016 2:06 pm, &

Databricks Cloud vs AWS EMR

2016-01-26 Thread Alex Nastetsky
As a user of AWS EMR (running Spark and MapReduce), I am interested in potential benefits that I may gain from Databricks Cloud. I was wondering if anyone has used both and done comparison / contrast between the two services. In general, which resource manager(s) does Databricks Cloud use for Spar

Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread Alex Kozlov
;d like Spark cores just be available in total and the first >>> app who needs it, takes as much as required from the available at the >>> moment. Is it possible? I believe Mesos is able to set resources free if >>> they're not in use. Is it possible with YARN? >>> >>> I'd appreciate if you could share your thoughts or experience on the >>> subject. >>> >>> Thanks. >>> -- >>> Be well! >>> Jean Morozov >>> >> -- Alex Kozlov ale...@gmail.com

Re: How to query a hive table from inside a map in Spark

2016-02-14 Thread Alex Kozlov
nt from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Alex Kozlov (408) 507-4987 (650) 887-2135 efax ale...@gmail.com

cartesian with Dataset

2016-02-17 Thread Alex Dzhagriev
Hello all, Is anybody aware of any plans to support cartesian for Datasets? Are there any ways to work around this issue without switching to RDDs? Thanks, Alex.

Re: Importing csv files into Hive ORC target table

2016-02-17 Thread Alex Dzhagriev
eDataFrame(resultRdd).write.orc("..path..") Please, note that resultRdd should contain Products (e.g. case classes) Cheers, Alex. On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Hi, > > We put csv files that are z

Re: Importing csv files into Hive ORC target table

2016-02-18 Thread Alex Dzhagriev
Hi Mich, Try to use a regexp to parse your string instead of the split. Thanks, Alex. On Thu, Feb 18, 2016 at 6:35 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > > > thanks, > > > > I have an issue here. > > define rdd to rea

an OOM while persist as DISK_ONLY

2016-02-22 Thread Alex Dzhagriev
please, explain what the overhead is that consumes that much memory during persist to disk, and how can I estimate how much extra memory I should give the executors in order to make it not fail? Thanks, Alex.
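
A hedged note with a sketch: even DISK_ONLY blocks are serialized through executor memory before hitting disk, and on YARN the container's off-heap overhead slice is a common culprit; raising it is a usual first step (value assumed):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "2048")   // MB of extra headroom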

Re: Can we load csv partitioned data into one DF?

2016-02-22 Thread Alex Dzhagriev
Hi Saif, You can put your files into one directory and read it as text. Another option is to read them separately and then union the datasets. Thanks, Alex. On Mon, Feb 22, 2016 at 4:25 PM, wrote: > Hello all, I am facing a silly data question. > > If I have +100 csv files which ar
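
A sketch of both options (Spark 1.x API, paths and names assumed):

    val allText = sc.textFile("/data/landing/*.csv")   // one RDD over every file
    val combined = df1.unionAll(df2)                   // or union per-file DataFrames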

reasonable number of executors

2016-02-23 Thread Alex Dzhagriev
m map-side join with bigger table. What other considerations should I keep in mind in order to choose the right configuration? Thanks, Alex.

Re: reasonable number of executors

2016-02-24 Thread Alex Dzhagriev
Hi Igor, That's a great talk and an exact answer to my question. Thank you. Cheers, Alex. On Tue, Feb 23, 2016 at 8:27 PM, Igor Berman wrote: > > http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications > > there is a section that is

Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
Hi Moshir, I think you can use the rest api provided with Spark: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala Unfortunately, I haven't found any documentation, but it looks fine. Thanks, Alex. On Sun, Feb 28, 2016 at

Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
Hi Moshir, Regarding the streaming, you can take a look at the spark streaming, the micro-batching framework. If it satisfies your needs it has a bunch of integrations. Thus, the source for the jobs could be Kafka, Flume or Akka. Cheers, Alex. On Mon, Feb 29, 2016 at 2:48 PM, moshir mikael

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Alex Rovner
he command line you are using to submit your jobs for further troubleshooting. *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * <http://www.magnetic.com/>* On Sat, Oct 3, 2015 at 6:19 AM, unk1102 wrote: > Hi I have couple of Spark jobs which uses group by query which

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Alex Rovner
Can you send over your yarn logs along with the command you are using to submit your job? *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * <http://www.magnetic.com/>* On Sat, Oct 3, 2015 at 9:07 AM, Umesh Kacha wrote: > Hi Alex thanks much for the reply. Please

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-04 Thread Alex Rovner
Can you at least copy paste the error(s) you are seeing when the job fails? Without the error message(s), it's hard to even suggest anything. *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * <http://www.magnetic.com/>* On Sat, Oct 3, 2015 at 9:50 AM, Umesh Kacha w

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Alex Rovner
I have the same question about the history server. We are trying to run multiple versions of Spark and are wondering if the history server is backwards compatible. *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * <http://www.magnetic.com/>* On Mon, Oct 5, 2015 at 9:22 AM, A

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Alex Rovner
Hey Steve, Are you referring to the 1.5 version of the history server? *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * <http://www.magnetic.com/>* On Mon, Oct 5, 2015 at 10:18 AM, Steve Loughran wrote: > > > On 5 Oct 2015, at 15:59, Alex Rovner wrote: > >

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Alex Rovner
configure multiple versions to use the same shuffling service. *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * <http://www.magnetic.com/>* On Mon, Oct 5, 2015 at 11:06 AM, Andreas Fritzler < andreas.fritz...@gmail.com> wrote: > Hi Steve, Alex, > >

Re: How can I disable logging when running local[*]?

2015-10-05 Thread Alex Kozlov
; indicated by the sender. If you are not a designated recipient, you may > not review, use, > copy or distribute this message. If you received this in error, please > notify the sender by > reply e-mail and delete this message. > -- Alex Kozlov (408) 507-4987 (408) 830-9982 fax (650) 887-2135 efax ale...@gmail.com

Re: How can I disable logging when running local[*]?

2015-10-06 Thread Alex Kozlov
rred. Program will exit. > > > I tried a bunch of different quoting but nothing produced a good result. I > also tried passing it directly to activator using –jvm but it still > produces the same results with verbose logging. Is there a way I can tell > if it’s picking up my file? &g

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-06 Thread Alex Rovner
Thank you all for your help. *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * <http://www.magnetic.com/>* On Tue, Oct 6, 2015 at 11:17 AM, Steve Loughran wrote: > > On 6 Oct 2015, at 01:23, Andrew Or wrote: > > Both the history server and the shuffle se

Re: How can I disable logging when running local[*]?

2015-10-07 Thread Alex Kozlov
> # Change this to set Spark log level > > log4j.logger.org.apache.spark=WARN > > > # Silence akka remoting > > log4j.logger.Remoting=WARN > > > # Ignore messages below warning level from Jetty, because it's a bit > verbose > > log4j.logger.org.eclipse.jetty

Re: Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-10 Thread Alex Rovner
-- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * <http://www.magnetic.com/>*

spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Alex Nastetsky
I save my dataframe to avro with spark-avro 1.0.0 and it looks like this (using avro-tools tojson): {"field1":"value1","field2":976200} {"field1":"value2","field2":976200} {"field1":"value3","field2":614100} But when I use spark-avro 2.0.1, it looks like this: {"field1":{"string":"value1"},"fiel
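
A hedged reading with a sketch: spark-avro 2.x derives the Avro schema from the DataFrame schema, and a nullable Spark field becomes an Avro union like ["string","null"], which avro-tools' tojson renders in the wrapped {"field1":{"string":"value1"}} form. Marking fields non-nullable before writing (assumed approach, untested here) restores the flat rendering:

    import org.apache.spark.sql.types.StructType

    val nonNullSchema = StructType(df.schema.map(_.copy(nullable = false)))
    val strictDF = df.sqlContext.createDataFrame(df.rdd, nonNullSchema)
    strictDF.write.format("com.databricks.spark.avro").save(outputPath)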

Re: spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Alex Nastetsky
Here you go: https://github.com/databricks/spark-avro/issues/92 Thanks. On Wed, Oct 14, 2015 at 4:41 PM, Josh Rosen wrote: > Can you report this as an issue at > https://github.com/databricks/spark-avro/issues so that it's easier to > track? Thanks! > > On Wed, Oct 14, 2

dataframes and numPartitions

2015-10-14 Thread Alex Nastetsky
A lot of RDD methods take a numPartitions parameter that lets you specify the number of partitions in the result. For example, groupByKey. The DataFrame counterparts don't have a numPartitions parameter, e.g. groupBy only takes a bunch of Columns as params. I understand that the DataFrame API is
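
A sketch of where the knob moved: DataFrame shuffles size their output from a session conf rather than a per-operator numPartitions argument, with repartition() for explicit control (values assumed):

    sqlContext.setConf("spark.sql.shuffle.partitions", "400")
    val counts = df.groupBy("key").count()   // this shuffle now yields 400 partitions
    val narrowed = counts.repartition(64)    // explicit repartitioning between operators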

writing avro parquet

2015-10-19 Thread Alex Nastetsky
Using Spark 1.5.1, Parquet 1.7.0. I'm trying to write Avro/Parquet files. I have this code: sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS, classOf[AvroWriteSupport].getName) AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$) myDF.write.parquet(outputPath) Th

Re: writing avro parquet

2015-10-19 Thread Alex Nastetsky
Figured it out ... needed to use saveAsNewAPIHadoopFile, but was trying to use it on myDF.rdd instead of converting it to a PairRDD first. On Mon, Oct 19, 2015 at 2:14 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > Using Spark 1.5.1, Parquet 1.7.0. > > I'm
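
A sketch of the fix described, with the record class and row-mapping helper assumed; saveAsNewAPIHadoopFile wants a pair RDD, hence the (Void, record) mapping:

    import org.apache.parquet.avro.AvroParquetOutputFormat

    val pairs = myDF.rdd.map(row => (null.asInstanceOf[Void], toMyRecord(row)))  // toMyRecord assumed
    pairs.saveAsNewAPIHadoopFile(
      outputPath,
      classOf[Void],
      classOf[MyRecord],              // assumed Avro-generated class
      classOf[AvroParquetOutputFormat],
      sc.hadoopConfiguration)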

doc building process hangs on Failed to load class “org.slf4j.impl.StaticLoggerBinder”

2015-10-27 Thread Alex Luya
followed this https://github.com/apache/spark/blob/master/docs/README.md to build the spark docs, but it hangs on: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBin

foreachPartition

2015-10-30 Thread Alex Nastetsky
I'm just trying to do some operation inside foreachPartition, but I can't even get a simple println to work. Nothing gets printed. scala> val a = sc.parallelize(List(1,2,3)) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at :21 scala> a.foreachPartition(p => println("f

Re: foreachPartition

2015-10-30 Thread Alex Nastetsky
Ahh, makes sense. Knew it was going to be something simple. Thanks. On Fri, Oct 30, 2015 at 7:45 PM, Mark Hamstra wrote: > The closure is sent to and executed an Executor, so you need to be looking > at the stdout of the Executors, not on the Driver. > > On Fri, Oct 30, 2015 at 4
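
A sketch restating the answer: the println does run, but inside each executor JVM; on a cluster the output lands in the executors' stdout logs (visible via the web UI), while in local mode executor and driver share one JVM, so it prints in the shell:

    a.foreachPartition(iter => println(s"partition size: ${iter.size}"))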

CompositeInputFormat in Spark

2015-10-30 Thread Alex Nastetsky
Does Spark have an implementation similar to CompositeInputFormat in MapReduce? CompositeInputFormat joins multiple datasets prior to the mapper, that are partitioned the same way with the same number of partitions, using the "part" number in the file name in each dataset to figure out which file
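
A sketch of the closest Spark analogue (variable names assumed): co-partition both pair RDDs with the same partitioner, after which join() is a narrow dependency with no shuffle, much like CompositeInputFormat's map-side join. The partitionBy calls themselves shuffle once; the saving shows up when the partitioned RDDs are reused across joins:

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(64)
    val left   = leftRaw.partitionBy(p).persist()
    val right  = rightRaw.partitionBy(p).persist()
    val joined = left.join(right)   // narrow dependency: both sides co-partitioned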

Sort Merge Join

2015-11-01 Thread Alex Nastetsky
Hi, I'm trying to understand SortMergeJoin (SPARK-2213). 1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For example, in the code below, the two datasets have different number of partitions, but it still does a SortMerge join after a "hashpartitioning". CODE: val sparkCo

Re: Sort Merge Join

2015-11-02 Thread Alex Nastetsky
join keys will be loaded by the > same node/task , since lots of factors need to be considered, like task > pool size, cluster size, source format, storage, data locality etc.,. > > I’ll agree it’s worth to optimize it for performance concerns, and > actually in Hive, it is calle

Re: Spark 1.5 UDAF ArrayType

2015-11-10 Thread Alex Nastetsky
Hi, I believe I ran into the same bug in 1.5.0, although my error looks like this: Caused by: java.lang.ClassCastException: [Lcom.verve.spark.sql.ElementWithCount; cannot be cast to org.apache.spark.sql.types.ArrayData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getA

Spark SQL UDAF works fine locally, OutOfMemory on YARN

2015-11-16 Thread Alex Nastetsky
ingle core to help with debugging, but I have the same issue with more executors/nodes. I am running this on EMR on AWS, so this is unlikely to be a hardware issue (different hardware each time I launch a cluster). I've also isolated the issue to this UDAF, as removing it from my Spark SQL makes the issue go away. Any ideas would be appreciated. Thanks, Alex.

Spark Powered By Page

2015-11-16 Thread Alex Rovner
I would like to list our organization on the Powered by Page. Company: Magnetic Description: We are leveraging Spark Core, Streaming and YARN to process our massive datasets. *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * <http://www.magnetic.com/>*

YARN Labels

2015-11-16 Thread Alex Rovner
ion master not to run on spot nodes. For whatever reason, the application master is not able to recover in cases where the node it was running on suddenly disappears, which is the case with spot nodes. Any guidance on this topic is appreciated. *Alex Rovner* *Director, Data Engineering * *o:* 646.759.005
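
A sketch of the YARN-label route (label name assumed; exposed in later Spark versions as a YARN-mode conf):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.yarn.am.nodeLabelExpression", "on-demand")   // keep the AM off spot nodes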

subscribe

2015-11-18 Thread Alex Luya
subscribe
