iveServer2.
--Alex
On 3/8/2016 3:13 PM, Mich Talebzadeh wrote:
Hi,
What do you mean by Hive Metastore Client? Are you referring to the Hive
server login, much like beeline?
Spark uses hive-site.xml to get the details of the Hive metastore and the
login to the metastore, which could be any database. Mine
find a solution in the meantime.
Thanks,
Alex
On 3/8/2016 4:00 PM, Mich Talebzadeh wrote:
The current scenario resembles a three-tier architecture, but without
the security of the second tier. In a typical three-tier setup you have users
connecting to the application server (read HiveServer2)
are
Hi All,
Thanks for your response. Please find the flow diagram below.
Please help me simplify this architecture using Spark.
1) Can I skip steps 1 to 4 and directly store the data in Spark?
If I am storing it in Spark, where is it actually stored?
Do I need to retain Hadoop to store the data?
Through the Hadoop APIs, Spark supports a wide range of file
> systems, but it does not need HDFS for persistence. You can use a local
> filesystem (i.e. any file system mounted to a node, including distributed ones,
> such as ZFS) or cloud file systems (S3, Azure Blob, etc.).
>
>
>
> On 2
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error:
java.lang.Double cannot be cast to
org.apache.hadoop.hive.serde2.io.DoubleWritable]
Getting the below error while running a Hive UDF on Spark, but the UDF works
perfectly fine in Hive:
public Object get(Object name) {
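The cast error typically appears because Hive hands a UDF its own Writable wrappers (DoubleWritable, LongWritable, Text), while Spark's Hive integration may hand it plain Java objects (java.lang.Double, java.lang.Long, String), so a hard cast to a Writable fails under spark-sql. A minimal Scala sketch of a defensive extraction that accepts either representation (the helper names toDouble/toLong are illustrative, not from the original UDF):

import org.apache.hadoop.hive.serde2.io.DoubleWritable
import org.apache.hadoop.io.{LongWritable, Text}

// Unwrap a value whether it arrives as a Hive Writable (as in Hive)
// or as a plain Java object (as Spark may pass it).
def toDouble(obj: Any): Option[Double] = obj match {
  case null                => None
  case w: DoubleWritable   => Some(w.get())
  case d: java.lang.Double => Some(d.doubleValue())
  case t: Text             => Some(t.toString.toDouble)
  case other               => Some(other.toString.toDouble) // last resort; may throw
}

def toLong(obj: Any): Option[Long] = obj match {
  case null               => None
  case w: LongWritable    => Some(w.get())
  case l: java.lang.Long  => Some(l.longValue())
  case other              => Some(other.toString.toLong)
}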
Hi Team,
How can I compare two Avro-format Hive tables to check whether they contain the same data?
If I use limit 5, it gives different results each time.
How do I debug Hive UDFs?
On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu" wrote:
> Hi Team,
>
> I am trying to keep the below code in the get method, calling that get method in
> another Hive UDF,
> and running the Hive UDF using the HiveContext.sql procedure.
>
>
> switch (f) {
> case "double" : return (
Hi All,
If I modify the code as below, the Hive UDF works in spark-sql but it
gives different results. Please let me know the difference between the
two pieces of code below:
1) public Object get(Object name) {
int pos = getPos((String)name);
if(pos<0) return null;
Str
public Object get(Object name) {
int pos = getPos((String) name);
if (pos < 0)
return null;
String f = "string";
Object obj = list.get(pos);
Object result = null;
if (obj == null)
Hi Guys,
Please let me know if there are any other ways to typecast, as the code below throws an error:
unable to typecast java.lang.Long to LongWritable (and the same for Double and
Text) in spark-sql. The piece of code below is from a Hive UDF which I am
trying to run in spark-sql.
public Object get(Object name) {
Hi All,
I am trying to run a Hive UDF in spark-sql, and it returns different rows as
results in Hive and Spark.
My UDF query looks something like this:
select col1,col2,col3, sum(col4) col4, sum(col5) col5,Group_name
from
(select inline(myudf('cons1',record))
from table1) test group by col1,c
Guys! Please Reply
On Tue, Jan 31, 2017 at 12:31 PM, Alex wrote:
> public Object get(Object name) {
> int pos = getPos((String) name);
> if (pos < 0)
> return null;
> String f = "string&quo
Hi,
we have Java Hive UDFs which work perfectly fine in Hive.
For better performance we are migrating the same to spark-sql,
so we pass these jar files with the --jars argument to spark-sql
and define temporary functions to make them run on spark-sql.
There is this particular Java UDF
ther type depending on what is the type of
> the original value?
> Kr
>
>
>
> On 1 Feb 2017 5:56 am, "Alex" wrote:
>
> Hi ,
>
>
> we have Java Hive UDFS which are working perfectly fine in Hive
>
> SO for Better performance we are migrating the sam
Would you run the same Java UDF using spark-sql,
or
would you recode all the Java UDFs as Scala UDFs and then run them?
Regards,
Alex
Hi, as shown below, the same query run back to back shows inconsistent
results.
testtable1 is an Avro SerDe table.
hc.sql("select * from testtable1 order by col1 limit 1").collect;
res14: Array[org.apache.spark.sql.Row] =
Array([1570,3364,201607,Y,APJ,PHILIPPINES,8518
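If several rows tie on col1, "order by col1 limit 1" is not fully deterministic, and each run may legitimately return a different tied row; adding a tiebreaker column makes the top row stable across runs. A small sketch (col2 is an illustrative tiebreaker, not from the original table):

// Break ties so the "top 1" row is deterministic across runs.
hc.sql("select * from testtable1 order by col1, col2 limit 1").collect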
On Thu, Feb 2, 2017 at 3:33 PM, Alex wrote:
> Hi As shown below same query when ran back to back showing inconsistent
> results..
>
> testtable1 is Avro Serde table...
>
>
>
>
> hc.sql("select * from testtable1 order
Hi,
Can you guys tell me whether the two pieces of code below return the same
thing?
(((DoubleObjectInspector) ins2).get(obj)) and ((DoubleWritable) obj).get(), from
the two codes below
code 1)
public Object get(Object name) {
int pos = getPos((String)name);
if(pos<0) return null;
Stri
Hi,
Please reply?
On Fri, Feb 3, 2017 at 8:19 PM, Alex wrote:
> Hi,
>
> Can you guys tell me whether the two pieces of code below return the same
> thing?
>
> (((DoubleObjectInspector) ins2).get(obj)) and ((DoubleWritable) obj).get(),
> from the two codes below
>
>
Hi,
I am using Spark 1.6. How can I ignore this warning? Because of this
IllegalStateException, my production jobs which are scheduled show as completed
abnormally. I can't even handle the exception, because after sc.stop, if I try to
execute any code again, this exception is thrown from the catch block, so I
re
b.com/AlexHagerman/pyspark-profiling
Thanks,
Alex
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import broadcast, udf
from pyspark.ml.feature import Word2Vec, Word2VecModel
from pyspark.ml.linalg import Vector, Vect
482 MB should be small enough to be distributed as a set of broadcast
variables. Then you can use local features of Spark to process it.
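A minimal Scala sketch of that broadcast approach, assuming the ~482 MB side can be collected to the driver as a lookup map (smallDf, bigRdd and the column positions are illustrative):

// Collect the small side once on the driver and broadcast it to every executor.
val lookup: Map[String, Double] =
  smallDf.rdd.map(r => r.getString(0) -> r.getDouble(1)).collectAsMap().toMap
val lookupBc = sc.broadcast(lookup)

// Each task reads its local broadcast copy instead of shuffling the small side.
val joined = bigRdd.map { case (key, value) =>
  (key, value, lookupBc.value.getOrElse(key, 0.0))
}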
-Original Message-
From: "shahab"
Sent: 4/30/2015 9:42 AM
To: "user@spark.apache.org"
Subject: is there anyway to enforce Spark to cache data in all w
AS> connector.
AS> Thanks
AS> Amit
--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
nning something at the end of the read
operation using the current API? If not, I would ask if this might be a
useful addition, or if there are design reasons for not including such a
step.
Thanks,
Alex
Hello,
This question has been addressed on Stack Overflow using the spark shell,
but not PySpark.
I found in the Spark SQL documentation that in PySpark SQL I can load
a JAR into my SparkSession config, such as:
spark = SparkSession\
    .builder\
    .appName("appname")\
    .config(
he second one does not.
S> Is there any solution to the problem of being able to write to multiple
sinks in Continuous Trigger Mode using Structured Streaming?
--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
---
at
>> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>>
>> at
>> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>>
>> at org.apache.spark.sql.execution.streaming.StreamExecution.org
>> $apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:286)
>>
>> at
>> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:209)
>>
>> obj.test_ingest_incremental_data_batch1()
>>
>> File
>> "C:\Users\agundapaneni\Development\ModernDataEstate\tests\test_mdefbasic.py",
>> line 56, in test_ingest_incremental_data_batch1
>>
>> mdef.ingest_incremental_data('example', entity,
>> self.schemas['studentattendance'], 'school_year')
>>
>> File
>> "C:\Users\agundapaneni\Development\ModernDataEstate/src\MDEFBasic.py", line
>> 109, in ingest_incremental_data
>>
>> query.awaitTermination() # block until query is terminated, with
>> stop() or with error; A StreamingQueryException will be thrown if an
>> exception occurs.
>>
>> File
>> "C:\Users\agundapaneni\Development\ModernDataEstate\.tox\default\lib\site-packages\pyspark\sql\streaming.py",
>> line 101, in awaitTermination
>>
>> return self._jsq.awaitTermination()
>>
>> File
>> "C:\Users\agundapaneni\Development\ModernDataEstate\.tox\default\lib\site-packages\py4j\java_gateway.py",
>> line 1309, in __call__
>>
>> return_value = get_return_value(
>>
>> File
>> "C:\Users\agundapaneni\Development\ModernDataEstate\.tox\default\lib\site-packages\pyspark\sql\utils.py",
>> line 117, in deco
>>
>> raise converted from None
>>
>> pyspark.sql.utils.StreamingQueryException:
>> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldNames(Lscala/collection/Seq;)V
>>
>> === Streaming Query ===
--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
dClass(ClassLoaders.java:178)
AS> at
java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
AS> Thanks
AS>
AS> Amit
--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
'_thread.RLock' object
gf> Can you please tell me how to do this?
gf> Or at least give me some advice?
gf> Sincerely,
gf> FARCY Guillaume.
Hi,
Some details:
* Spark SQL (version 3.2.1)
* Driver: Hive JDBC (version 2.3.9)
* ThriftCLIService: Starting ThriftBinaryCLIService on port 1 with
5...500 worker threads
* BI tool is connected via an ODBC driver
After activating Spark Thrift Server I'm unable to ru
Hi Christophe,
Thank you for the explanation!
Regards,
Alex
From: Christophe Préaud
Sent: Wednesday, March 30, 2022 3:43 PM
To: Alex Kosberg ; user@spark.apache.org
Subject: [EXTERNAL] Re: spark ETL and spark thrift server running together
Hi Alex,
As stated in the Hive documentation
(https
Hi everyone,
My name is Alex and I've been using Spark for the past 4 years to solve
most, if not all, of my data processing challenges. From time to time I go
a bit left field with this :). Like embedding Spark in my JVM based
application running only in `local` mode and using it as a
Unsubscribe
ent pair is dominated by computing ~50 dot
products of 100-dimensional vectors.
Best,
Alex
On Mon, May 25, 2015 at 2:59 AM, Сергей Мелехин wrote:
> Hi, ankur!
> Thanks for your reply!
> CVs are just a bunch of IDs; each ID represents some object of some class
> (eg. class=JOB, object
s, and then load them from S3.
>
> On Mon, May 25, 2015 at 8:19 PM, Alex Robbins <
> alexander.j.robb...@gmail.com> wrote:
>
>> If a Hadoop InputFormat already exists for your data source, you can load
>> it from there. Otherwise, maybe you can dump your data source out
Hi-
I’ve just built the latest spark RC from source (1.4.0 RC3) and can confirm
that the spark shell is still NOT working properly on 2.11. No classes in
the jar I've specified with the --jars argument on the command line are
available in the REPL.
Cheers
Alex
On Thu, May 28, 2015 at 8:
I've gotten that error when something is trying to use a different version
of protobuf than you want. Maybe check out a `mvn dependency:tree` to see
if someone is trying to use something other than libproto 2.5.0. (At least,
2.5.0 was current when I was having the problem)
On Fri, May 29, 2015 at
Hi-
Yup, I’ve already done so here:
https://issues.apache.org/jira/browse/SPARK-7944
Please let me know if this requires any more information - more than happy
to provide whatever I can.
Thanks
Alex
On Sun, May 31, 2015 at 8:45 AM, Tathagata Das wrote:
> Can you file a JIRA with the detai
at Hadoop Summit and Spark
Summit in the following weeks.
Thank you,
Alex Baranau
Mesos.
Thanks
Alex
On Fri, Jun 12, 2015 at 8:45 PM, Akhil Das
wrote:
> You can verify if the jars are shipped properly by looking at the driver
> UI (running on 4040) Environment tab.
>
> Thanks
> Best Regards
>
> On Sat, Jun 13, 2015 at 12:43 AM, Jonathan Coveney
> wrot
nd the order in which they
occur, it may be possible to get the RDD from the DataFrame and build my
own DataFrame with createDataFrame, passing it my fabricated
super-schema. However, this is brittle, as the super-schema is not in my
control and may change in the future.
Thanks for any suggestions,
Alex.
When I call rdd() on a DataFrame, it ends the current stage and starts a
new one that just maps the DataFrame to rdd and nothing else. It doesn't
seem to do a shuffle (which is good and expected), but then why is
there a separate stage?
I also thought that stages only end when there's a s
jar to the classpath. Thanks for your help!
Alex
On Tue, Jun 30, 2015 at 9:11 AM, Burak Yavuz wrote:
> How does your build file look? Are you possibly using wrong Scala
> versions? Have you added Breeze as a dependency to your project? If so
> which version?
>
> Thanks,
> Burak
I have a similar use case, so I wrote a Python script to fix the cluster
configuration that spark-ec2 uses when you use Hadoop 2. Start a cluster
with enough machines that the HDFS system can hold 1 TB (so use instance
types that have SSDs), then follow the instructions at
http://thousandfold.net/cz
Hello,
I'm migrating some RDD-based code to using DataFrames. We've seen massive
speedups so far!
One of the operations in the old code creates an array of the values for
each key, as follows:
val collatedRDD =
valuesRDD.mapValues(value=>Array(value)).reduceByKey((array1,array2) =>
array1++array
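A rough DataFrame equivalent of that collate step, assuming a Spark version where collect_list is available in org.apache.spark.sql.functions (1.6+); valuesDF and the column names are illustrative:

import org.apache.spark.sql.functions.collect_list

// Equivalent of mapValues(Array(_)).reduceByKey(_ ++ _):
// gather all values for each key into a single array column.
val collatedDF = valuesDF
  .groupBy("key")
  .agg(collect_list("value").as("values"))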
turns something
meaningful.
Cheers, Alex.
On Thu, Mar 3, 2016 at 8:39 AM, Angel Angel wrote:
> Hello Sir/Madam,
>
> I am trying to sort the RDD using the *sortByKey* function but I am getting the
> following error.
>
>
> My code is
> 1) convert the rdd array into key value pair.
> My question is: why not RAID? What is the argument/reason for not using
> RAID?
>
> Thanks!
> -Eddie
>
--
Alex Kozlov
As of Spark 1.6.0 it is now possible to create new Hive Context sessions
sharing various components, but right now the Hive Metastore Client is
shared across all new Hive Context sessions.
Are there any plans to create individual Metastore Clients for each Hive
Context?
Related to the question ab
error
>
> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
> condition)
> Error in DataFrame(colData, row.names = rownames(colData)) :
> cannot coerce class "data.frame" to a DataFrame
>
> I am really stumped. I am not using any spark fun
> On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui wrote:
>
>> It seems as.data.frame() defined in SparkR covers the version in the R base
>> package.
>>
>> We can try to see if we can change the implementation of as.data.frame()
>> in SparkR to avoid such covering.
Hi Vinay,
I believe it's not possible, as the spark-shuffle code should run in the
same JVM process as the Node Manager. I haven't heard anything about
on-the-fly bytecode loading in the Node Manager.
Thanks, Alex.
On Wed, Mar 16, 2016 at 10:12 AM, Vinay Kashyap wrote:
> Hi all,
>> I thought about using a data cache as well for serving the data
>> The data cache should have the capability to serve the historical data
>> in milliseconds (may be upto 30 days of data)
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>>
>>
--
Alex Kozlov
ale...@gmail.com
to specify the --jars correctly?
Thanks, Alex.
matic
scaling (not blocking the resources if there is no data in the stream) and
the UI to manage the running jobs.
Thanks, Alex.
Spark SQL has a "first" function that returns the first item in a group. Is
there a similar function, perhaps in a third-party lib, that allows you to
return an arbitrary (e.g. 3rd) item from the group? I was thinking of writing
a UDAF for it, but didn't want to reinvent the wheel. My end goal is to b
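A sketch of the window-function approach suggested in the reply below: number the rows within each group and keep the Nth (the 3rd here; the column names are illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// The ordering column decides which row counts as the "3rd" in each group.
val w = Window.partitionBy("groupCol").orderBy("orderCol")
val thirdPerGroup = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 3)
  .drop("rn")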
ut writing a UDF is much simpler
than a UDAF.
On Tue, Jul 26, 2016 at 11:48 AM, ayan guha wrote:
> You can use rank with a window function. Rank=1 is the same as calling first().
>
> Not sure how you would randomly pick records though, if there is no Nth
> record. In your example, what
Ran into this need myself. Does Spark have an equivalent of "mapreduce.
input.fileinputformat.list-status.num-threads"?
Thanks.
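For what it's worth, that Hadoop setting can be passed straight through Spark's Hadoop configuration (or as a spark.hadoop.* property on spark-submit); a sketch, with the thread count picked arbitrarily:

// Parallelize input-split listing when reading many files/partitions.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.list-status.num-threads", "20")

// or equivalently at submission time:
// --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=20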
On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park wrote:
> Hi,
>
> I am wondering if anyone has successfully enabled
> "mapreduce.input.fileinputformat.list-status.num-t
Thanks. I was actually able to get
mapreduce.input.fileinputformat.list-status.num-threads working in Spark against a regular
fileset in S3, in Spark 1.5.2 ... looks like the issue is isolated to Hive.
On Tue, Jan 12, 2016 at 6:48 PM, Cheolsoo Park wrote:
> Alex, see this jira-
>
I'm using Spark 1.5.1
When I turn on DEBUG, I don't see anything that looks useful. Other than
the INFO outputs, there is a ton of RPC message related logs, and this bit:
16/01/13 05:53:43 DEBUG ClosureCleaner: +++ Cleaning closure
(org.apache.spark.rdd.RDD$$anonfun$count$1) +++
16/01/13 05:53
nutes to compute the top 20 PCs of a 46.7K-by-6.3M
dense matrix of doubles (~2 Tb), with most of the time spent on the
distributed matrix-vector multiplies.
Best,
Alex
On Tue, Jan 12, 2016 at 6:39 PM, Bharath Ravi Kumar
wrote:
> Any suggestion/opinion?
> On 12-Jan-2016 2:06 pm, &
As a user of AWS EMR (running Spark and MapReduce), I am interested in
potential benefits that I may gain from Databricks Cloud. I was wondering
if anyone has used both and done a comparison/contrast between the two
services.
In general, which resource manager(s) does Databricks Cloud use for Spar
;d like Spark cores just be available in total and the first
>>> app who needs it, takes as much as required from the available at the
>>> moment. Is it possible? I believe Mesos is able to set resources free if
>>> they're not in use. Is it possible with YARN?
>>>
>>> I'd appreciate if you could share your thoughts or experience on the
>>> subject.
>>>
>>> Thanks.
>>> --
>>> Be well!
>>> Jean Morozov
>>>
>>
--
Alex Kozlov
ale...@gmail.com
--
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com
Hello all,
Is anybody aware of any plans to support cartesian for Datasets? Are there
any ways to work around this issue without switching to RDDs?
Thanks, Alex.
eDataFrame(resultRdd).write.orc("..path..")
Please note that resultRdd should contain Products (e.g. case classes).
Cheers, Alex.
On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> Hi,
>
> We put csv files that are z
Hi Mich,
Try to use a regexp to parse your string instead of the split.
Thanks, Alex.
On Thu, Feb 18, 2016 at 6:35 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
>
> thanks,
>
>
>
> I have an issue here.
>
> define rdd to rea
Please explain:
what is the overhead that consumes so much memory during persist to
disk, and how can I estimate how much extra memory I should give the
executors in order to keep it from failing?
Thanks, Alex.
Hi Saif,
You can put your files into one directory and read them as text. Another
option is to read them separately and then union the datasets.
Thanks, Alex.
On Mon, Feb 22, 2016 at 4:25 PM, wrote:
> Hello all, I am facing a silly data question.
>
> If I have +100 csv files which ar
m map-side
join with bigger table. What other considerations should I keep in mind in
order to choose the right configuration?
Thanks, Alex.
Hi Igor,
That's a great talk and an exact answer to my question. Thank you.
Cheers, Alex.
On Tue, Feb 23, 2016 at 8:27 PM, Igor Berman wrote:
>
> http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
>
> there is a section that is
Hi Moshir,
I think you can use the rest api provided with Spark:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala
Unfortunately, I haven't found any documentation, but it looks fine.
Thanks, Alex.
On Sun, Feb 28, 2016 at
Hi Moshir,
Regarding streaming, you can take a look at Spark Streaming, the
micro-batching framework. If it satisfies your needs, it has a bunch of
integrations. Thus, the source for the jobs could be Kafka, Flume or Akka.
Cheers, Alex.
On Mon, Feb 29, 2016 at 2:48 PM, moshir mikael
he
command line you are using to submit your jobs for further troubleshooting.
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052
* <http://www.magnetic.com/>*
On Sat, Oct 3, 2015 at 6:19 AM, unk1102 wrote:
> Hi I have couple of Spark jobs which uses group by query which
Can you send over your yarn logs along with the command you are using to
submit your job?
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052
* <http://www.magnetic.com/>*
On Sat, Oct 3, 2015 at 9:07 AM, Umesh Kacha wrote:
> Hi Alex thanks much for the reply. Please
Can you at least copy paste the error(s) you are seeing when the job fails?
Without the error message(s), it's hard to even suggest anything.
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052
* <http://www.magnetic.com/>*
On Sat, Oct 3, 2015 at 9:50 AM, Umesh Kacha w
I have the same question about the history server. We are trying to run
multiple versions of Spark and are wondering if the history server is
backwards compatible.
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052
* <http://www.magnetic.com/>*
On Mon, Oct 5, 2015 at 9:22 AM, A
Hey Steve,
Are you referring to the 1.5 version of the history server?
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052
* <http://www.magnetic.com/>*
On Mon, Oct 5, 2015 at 10:18 AM, Steve Loughran
wrote:
>
> > On 5 Oct 2015, at 15:59, Alex Rovner wrote:
> >
configure multiple versions to use the same shuffling service.
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052
* <http://www.magnetic.com/>*
On Mon, Oct 5, 2015 at 11:06 AM, Andreas Fritzler <
andreas.fritz...@gmail.com> wrote:
> Hi Steve, Alex,
>
>
--
Alex Kozlov
(408) 507-4987
(408) 830-9982 fax
(650) 887-2135 efax
ale...@gmail.com
rred. Program will exit.
>
>
> I tried a bunch of different quoting but nothing produced a good result. I
> also tried passing it directly to activator using –jvm but it still
> produces the same results with verbose logging. Is there a way I can tell
> if it’s picking up my file?
&g
Thank you all for your help.
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052
* <http://www.magnetic.com/>*
On Tue, Oct 6, 2015 at 11:17 AM, Steve Loughran
wrote:
>
> On 6 Oct 2015, at 01:23, Andrew Or wrote:
>
> Both the history server and the shuffle se
> # Change this to set Spark log level
>
> log4j.logger.org.apache.spark=WARN
>
>
> # Silence akka remoting
>
> log4j.logger.Remoting=WARN
>
>
> # Ignore messages below warning level from Jetty, because it's a bit
> verbose
>
> log4j.logger.org.eclipse.jetty
--
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052
* <http://www.magnetic.com/>*
I save my dataframe to avro with spark-avro 1.0.0 and it looks like this
(using avro-tools tojson):
{"field1":"value1","field2":976200}
{"field1":"value2","field2":976200}
{"field1":"value3","field2":614100}
But when I use spark-avro 2.0.1, it looks like this:
{"field1":{"string":"value1"},"fiel
Here you go: https://github.com/databricks/spark-avro/issues/92
Thanks.
On Wed, Oct 14, 2015 at 4:41 PM, Josh Rosen wrote:
> Can you report this as an issue at
> https://github.com/databricks/spark-avro/issues so that it's easier to
> track? Thanks!
>
> On Wed, Oct 14, 2
A lot of RDD methods take a numPartitions parameter that lets you specify
the number of partitions in the result. For example, groupByKey.
The DataFrame counterparts don't have a numPartitions parameter, e.g.
groupBy only takes a bunch of Columns as params.
I understand that the DataFrame API is
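Two usual workarounds, sketched below, since DataFrame aggregations take their partition count from the shuffle setting rather than from a per-call parameter (the numbers and column name are illustrative):

// 1) Control the number of post-shuffle partitions globally for DataFrame ops.
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

// 2) Or explicitly repartition the result afterwards.
val grouped = df.groupBy("key").count().repartition(50)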
Using Spark 1.5.1, Parquet 1.7.0.
I'm trying to write Avro/Parquet files. I have this code:
sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS,
classOf[AvroWriteSupport].getName)
AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)
myDF.write.parquet(outputPath)
Th
Figured it out ... needed to use saveAsNewAPIHadoopFile, but was trying to
use it on myDF.rdd instead of converting it to a PairRDD first.
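A rough sketch of that pattern, assuming parquet-avro's AvroParquetOutputFormat (as in Parquet 1.7.0) and an Avro-generated class MyClass with a SCHEMA$ field; rowToMyClass is a placeholder for the Row-to-Avro conversion, which is not shown:

import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroParquetOutputFormat

val job = Job.getInstance(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, MyClass.SCHEMA$)

myDF.rdd
  .map(row => (null: Void, rowToMyClass(row))) // key must be Void; value is the Avro record
  .saveAsNewAPIHadoopFile(
    outputPath,
    classOf[Void],
    classOf[MyClass],
    classOf[AvroParquetOutputFormat],
    job.getConfiguration)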
On Mon, Oct 19, 2015 at 2:14 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:
> Using Spark 1.5.1, Parquet 1.7.0.
>
> I'm
followed this
https://github.com/apache/spark/blob/master/docs/README.md
to build the Spark docs, but it hangs on:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBin
I'm just trying to do some operation inside foreachPartition, but I can't
even get a simple println to work. Nothing gets printed.
scala> val a = sc.parallelize(List(1,2,3))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize
at :21
scala> a.foreachPartition(p => println("f
Ahh, makes sense. Knew it was going to be something simple. Thanks.
On Fri, Oct 30, 2015 at 7:45 PM, Mark Hamstra
wrote:
> The closure is sent to and executed on an Executor, so you need to be looking
> at the stdout of the Executors, not on the Driver.
>
> On Fri, Oct 30, 2015 at 4
Does Spark have an implementation similar to CompositeInputFormat in
MapReduce?
CompositeInputFormat joins multiple datasets prior to the mapper. The datasets are
partitioned the same way with the same number of partitions, and the
"part" number in the file name in each dataset is used to figure out which file
Hi,
I'm trying to understand SortMergeJoin (SPARK-2213).
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For
example, in the code below, the two datasets have different numbers of
partitions, but it still does a sort-merge join after a "hashpartitioning".
CODE:
val sparkCo
join keys will be loaded by the
> same node/task , since lots of factors need to be considered, like task
> pool size, cluster size, source format, storage, data locality, etc.
>
> I’ll agree it’s worth to optimize it for performance concerns, and
> actually in Hive, it is calle
Hi,
I believe I ran into the same bug in 1.5.0, although my error looks like
this:
Caused by: java.lang.ClassCastException:
[Lcom.verve.spark.sql.ElementWithCount; cannot be cast to
org.apache.spark.sql.types.ArrayData
at
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getA
ingle core to help with
debugging, but I have the same issue with more executors/nodes.
I am running this on EMR on AWS, so this is unlikely to be a hardware issue
(different hardware each time I launch a cluster).
I've also isolated the issue to this UDAF, as removing it from my Spark SQL
makes the issue go away.
Any ideas would be appreciated.
Thanks,
Alex.
I would like to list our organization on the Powered by Page.
Company: Magnetic
Description: We are leveraging Spark Core, Streaming and YARN to process
our massive datasets.
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052
* <http://www.magnetic.com/>*
ion master not to run on spot
nodes. For whatever reason, the application master is not able to recover in
cases where the node it was running on suddenly disappears, which is the case
with spot nodes.
Any guidance on this topic is appreciated.
*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.005
subscribe