Re: Inquiry: Extending Spark ML Support via Spark Connect to Scala/Java APIs (SPARK-50812 Analogue)

2025-06-04 Thread Daniel Filev
consideration. Best wishes, Daniel On Fri, May 30, 2025 at 11:40 AM Daniel Filev wrote: > Dear Apache Spark Community/Development Team, > > I hope this message finds you well. > > I am writing to inquire about the roadmap and future plans for extending > Spark ML support through Sp

Inquiry: Extending Spark ML Support via Spark Connect to Scala/Java APIs (SPARK-50812 Analogue)

2025-05-30 Thread Daniel Filev
atly assist us in planning our integration strategy and, if feasible, contributing to the effort. Thank you for your time and for your ongoing work on Apache Spark. I look forward to your guidance. -- Daniel Filev Software Engineer Ontotext doing business as Graphwise | Making sense of data one triple

Re: Kafka Connector: producer throttling

2025-04-17 Thread daniel williams
aging broadcast variables to control your KafkaProducers. On Thu, Apr 17, 2025 at 4:27 PM Abhishek Singla wrote: > @daniel williams > It's batch and not streaming. I still don't understand why " > kafka.linger.ms" and "kafka.batch.size" are not being hono

Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread daniel williams
13 PM Ángel Álvarez Pascua < angel.alvarez.pas...@gmail.com> wrote: > Have you used the new equality functions introduced in Spark 3.5? > > > https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertDataFrameEqual.html > > El jue, 17 abr 2025,

Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread daniel williams
lation was enabled in their case, by the way.) > > I still need to dig deeper into the root cause, but just a heads-up — > under certain conditions, Spark might trigger multiple executions. We also > used the explode function in our project, but didn’t observe any > duplicate calls... so f

Re: Kafka Connector: producer throttling

2025-04-16 Thread daniel williams
"bootstrap.servers": "localhost:9092", >"acks": "all", >"linger.ms": "1000", >"batch.size": "10", >"key.serializer": > "org.apache.kafka.common.serialization.ByteArraySerializer", >&qu

Re: Checkpointing in foreachPartition in Spark batch

2025-04-16 Thread daniel williams
given the producers built in backoff. This would allow for you to retry n times and then upon final (hopefully not) failure update your dataset for further processing. -dan On Wed, Apr 16, 2025 at 5:04 PM Abhishek Singla wrote: > @daniel williams > > > Operate your producer in transa

Re: Checkpointing in foreachPartition in Spark batch

2025-04-16 Thread daniel williams
Yes. Operate your producer in transactional mode. Checkpointing is an abstract concept only applicable to streaming. -dan On Wed, Apr 16, 2025 at 7:02 AM Abhishek Singla wrote: > Hi Team, > > We are using foreachPartition to send dataset row data to third system via > HTTP client. The operatio
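A minimal Scala sketch of what "transactional mode" inside foreachPartition can look like for a batch write; the broker, topic, data, and row-to-record mapping are placeholders, and it assumes the kafka-clients jar is on the executor classpath with a broker that supports transactions:

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.TaskContext
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("txn-producer-sketch").getOrCreate()
val batchDf = spark.range(100).selectExpr("CAST(id AS STRING) AS value")   // stand-in data

def sendPartitionTransactionally(rows: Iterator[Row]): Unit = {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")                          // placeholder broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  // transactional.id must differ per task so concurrent partitions do not fence each other
  props.put("transactional.id", s"batch-writer-${TaskContext.getPartitionId()}")

  val producer = new KafkaProducer[String, String](props)
  producer.initTransactions()
  try {
    producer.beginTransaction()
    rows.foreach(r => producer.send(new ProducerRecord[String, String]("events", r.mkString(","))))
    producer.commitTransaction()     // the partition's records become visible atomically
  } catch {
    case e: Exception =>
      producer.abortTransaction()    // nothing from this partition is exposed to consumers
      throw e                        // let Spark retry the task
  } finally {
    producer.close()
  }
}

batchDf.foreachPartition(sendPartitionTransactionally _)
```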

Re: Kafka Connector: producer throttling

2025-04-16 Thread daniel williams
If you are building a broadcast to construct a producer with a set of options then the producer is merely going to operate how it’s going to be configured - it has nothing to do with spark save that the foreachPartition is constructing it via the broadcast. A strategy I’ve used in the past is to *

Re: Kafka Connector: producer throttling

2025-04-16 Thread daniel williams
> ssl.keystore.type = JKS > ssl.protocol = TLSv1.3 > ssl.provider = null > ssl.secure.random.implementation = null > ssl.trustmanager.algorithm = PKIX > ssl.truststore.location = null > ssl.truststore.password = null > ssl.truststore.type = JKS > transaction.timeout.ms = 6 > transaction

Re: Kafka Connector: producer throttling

2025-03-26 Thread daniel williams
If you're using structured streaming you can pass in options as kafka. into options as documented. If you're using Spark in batch form you'll want to do a foreach on a KafkaProducer via a Broadcast. All KafkaProducer specific options
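A sketch of the structured-streaming path described here: producer settings prefixed with `kafka.` are handed straight to the underlying producer by the built-in sink. Broker, topic, and checkpoint path are made-up values, and the spark-sql-kafka-0-10 package is assumed to be on the classpath; the rate source just provides something to write.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-options-sketch").getOrCreate()

val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

stream
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("kafka.linger.ms", "1000")    // any producer setting prefixed with "kafka."
  .option("kafka.batch.size", "16384")  // is passed through to the producer unchanged
  .option("topic", "events")
  .option("checkpointLocation", "/tmp/checkpoints/events")
  .start()
  .awaitTermination()
```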

Re: Kafka Connector: producer throttling

2025-02-24 Thread daniel williams
I think you should be using a foreachPartition and a broadcast to build your producer. From there you will have full control of all options and serialization needed via direct access to the KafkaProducer, as well as all options therein associated (e.g. callbacks, interceptors, etc). -dan On Mon,
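A hedged sketch of the broadcast-plus-foreachPartition pattern suggested here; the configuration values, topic, stand-in data, and row-to-string mapping are illustrative rather than anything prescribed in the thread.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("broadcast-producer-sketch").getOrCreate()

// Broadcast only the small, serializable configuration map; each partition on the
// executors builds its own KafkaProducer from it.
val producerConf = spark.sparkContext.broadcast(Map(
  "bootstrap.servers" -> "localhost:9092",
  "acks"              -> "all",
  "linger.ms"         -> "1000",
  "batch.size"        -> "16384",
  "key.serializer"    -> "org.apache.kafka.common.serialization.StringSerializer",
  "value.serializer"  -> "org.apache.kafka.common.serialization.StringSerializer"
))

def sendPartition(rows: Iterator[Row]): Unit = {
  val props = new Properties()
  producerConf.value.foreach { case (k, v) => props.put(k, v) }
  val producer = new KafkaProducer[String, String](props)
  val callback = new Callback {        // direct access to callbacks, interceptors, etc.
    override def onCompletion(md: RecordMetadata, e: Exception): Unit =
      if (e != null) e.printStackTrace()
  }
  rows.foreach(r => producer.send(new ProducerRecord[String, String]("events", r.mkString(",")), callback))
  producer.flush()
  producer.close()
}

val df = spark.range(100).selectExpr("CAST(id AS STRING) AS value")   // stand-in data
df.foreachPartition(sendPartition _)
```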

Unsubscribe

2024-10-06 Thread Daniel Maangi

Re: Help - Learning/Understanding spark web UI

2024-09-26 Thread Daniel Aronovic
much of the information in a more straightforward and digestible way. You can check it out here: https://github.com/dataflint/spark I'm one of the contributors to the project. Cheers! *Daniel Aronovich, MSc* On Thu, Sep 26, 2024 at 3:30 PM Ilango wrote: > > > Hi Karthick, &

How to provide a Zstd "training mode" dictionary object

2024-05-15 Thread Saha, Daniel
results? Thanks, Daniel

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Saha, Daniel
`spark.shuffle.sort.bypassMergeThreshold` to above the max reducer partitions significantly slowed the rate of disk usage 5. Daniel [1] https://github.com/apache/spark/blob/8f5a647b0bbb6e83ee484091d3422b24baea5a80/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L369 [2] https://github.com

unsubscribe

2024-01-10 Thread Daniel Maangi

Unsubscribe

2023-12-12 Thread Daniel Maangi

unsubscribe

2023-08-08 Thread Daniel Tavares de Santana
unsubscribe

Re: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance

2023-08-08 Thread Daniel Tavares de Santana
unsubscribe From: Mich Talebzadeh Sent: Tuesday, August 8, 2023 4:43 PM To: user @spark Subject: [EXTERNAL] Use of ML in certain aspects of Spark to improve the performance I am currently pondering and sharing my thoughts openly. Given our reliance on gathered

Re: [EXTERNAL] Spark Not Connecting

2023-07-12 Thread Daniel Tavares de Santana
unsubscribe From: timi ayoade Sent: Wednesday, July 12, 2023 6:11 AM To: user@spark.apache.org Subject: [EXTERNAL] Spark Not Connecting Hi Apache spark community, I am a Data Engineer. I have been using Apache spark for some time now. I recently tried to use it bu

Getting SparkRuntimeException: Unexpected value for length in function slice: length must be greater than or equal to 0

2023-06-06 Thread Bariudin, Daniel
o') > 0.6)` the code executes perfectly without any exceptions. **Note: The code works if I add the following code:** spark = SparkSession.builder \ .master("local") \ .appName("test-app") \ .config("spark.driver.bindAddress", "127.0.0.1") \

unsubscribe

2023-03-30 Thread Daniel Tavares de Santana
unsubscribe

Re: Writing Custom Spark Readers and Writers

2022-04-06 Thread daniel queiroz
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/read/index.html https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/connector/write/index.html https://developer.here.com/documentation/java-scala-dev/dev_guide/spark-connector/index.html Grato, Daniel

Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
'. If you execute it as is, > it fails for circularity. Both are bad, so it's just disallowed. > Just fix your code? > > On Mon, Dec 13, 2021 at 11:27 AM Daniel de Oliveira Mantovani < > daniel.oliveira.mantov...@gmail.com> wrote: > >> I've reduced the code

Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
a good > reason to do that other than it's what you do now. > I'm not clear if it's coming across that this _can't_ work in the general > case. > > On Mon, Dec 13, 2021 at 11:03 AM Daniel de Oliveira Mantovani < > daniel.oliveira.mantov...@gmail.com> wrot

Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
probably easy to do that, so you don't want to do that. You want > different names for different temp views, or else ensure you aren't doing > the kind of thing shown in the SO post. You get the problem right? > > On Mon, Dec 13, 2021 at 10:43 AM Daniel de Oliveira Mantova

Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
ly explicitly disallowed in all cases now, but, you > should not be depending on this anyway - why can't this just be avoided? > > On Mon, Dec 13, 2021 at 10:06 AM Daniel de Oliveira Mantovani < > daniel.oliveira.mantov...@gmail.com> wrote: > >> Sean, >> >

Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
ean Owen wrote: > ... but the error is not "because that already exists". See your stack > trace. It's because the definition is recursive. You define temp view > test1, create a second DF from it, and then redefine test1 as that result. > test1 depends on test1. > > On
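A small reconstruction of the situation Sean describes, with illustrative names and a `spark` session assumed; the commented-out line is the redefinition that Spark 3.2 rejects, and the obvious fix is to give the derived result its own view name.

```scala
val df1 = spark.range(10).toDF("id")
df1.createOrReplaceTempView("test1")

val df2 = spark.sql("SELECT id * 2 AS id FROM test1")

// df2.createOrReplaceTempView("test1")   // rejected in Spark 3.2: test1 would now depend on itself

df2.createOrReplaceTempView("test1_doubled")
spark.sql("SELECT * FROM test1_doubled").show()
```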

Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
ooks 'valid' - you define a temp view in terms of its own > previous version, which doesn't quite make sense - somewhere the new > definition depends on the old definition. I think it just correctly > surfaces as an error now,. > > On Mon, Dec 13, 2021 at 9:41 AM Daniel

spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-13 Thread Daniel de Oliveira Mantovani
fo] at scala.collection.IterableLike.foreach(IterableLike.scala:74) [info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73) [info] at scala.collection.AbstractIterable.foreach(Iterable.scala:56) -- -- Daniel Mantovani

Re: Spark salesforce connector

2021-11-25 Thread daniel queiroz
< timeout) done } /* * wait until the job is completed */ def awaitJobCompleted(jobId: String, token: String): Boolean = { val timeoutDuration = FiniteDuration(60L, MILLISECONDS) val initSleepIntervalDuration = FiniteDuration(200L, MILLISECONDS) val maxSleepInte

Re: How to Flatten JSON/XML/Parquet/etc in Apache Spark With Quenya-DSL

2021-11-22 Thread Daniel de Oliveira Mantovani
Hi Mich, Unfortunately it doesn't support PySpark, just Scala/Java. Wouldn't be a big deal to implement the Quenya DSL for PySpark as well, I will add to the roadmap. Thank you On Mon, Nov 22, 2021 at 1:24 PM Mich Talebzadeh wrote: > Ok interesting Daniel, > > I did

How to Flatten JSON/XML/Parquet/etc in Apache Spark With Quenya-DSL

2021-11-22 Thread Daniel de Oliveira Mantovani
ps://medium.com/@danielmantovani/flattening-json-in-apache-spark-with-quenya-dsl-b3af6bd2442d Project Page: https://github.com/modakanalytics/quenya-dsl For all data engineers who won't spend time anymore flattening nested data structures XOXO -- -- Daniel Mantovani

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
ver transforms the queries emitted by applications and converts them into an equivalent form in HiveQL. Try to change the "NativeQuery" parameter and see if it works :) On Tue, Jul 20, 2021 at 1:26 PM Daniel de Oliveira Mantovani < daniel.oliveira.mantov...@gmail.com> wrote: > Ins

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Tue, 20 Jul 2021 at 17:05, Daniel de Oliveira Mantovani < >

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Daniel de Oliveira Mantovani
oke:HiveSessionProxy.java:78, >>> org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36, >>> org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63, >>> java.security.AccessController:doPrivileged:AccessController.java:-2, >>> javax.security.auth.Subject:doAs:Subject.java:422, >>> org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1875, >>> org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59, >>> com.sun.proxy.$Proxy35:executeStatementAsync::-1, >>> org.apache.hive.service.cli.CLIService:executeStatementAsync:CLIService.java:295, >>> org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:507, >>> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437, >>> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422, >>> org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39, >>> org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39, >>> org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56, >>> org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286, >>> java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149, >>> java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624, >>> java.lang.Thread:run:Thread.java:748, >>> *org.apache.hadoop.hive.ql.parse.ParseException:line 1:39 cannot recognize >>> input near '"first_name"' 'TEXT' ',' in column name or primary key or >>> foreign key:33:6, >>> org.apache.hadoop.hive.ql.parse.ParseDriver:parse:ParseDriver.java:221, >>> org.apache.hadoop.hive.ql.parse.ParseUtils:parse:ParseUtils.java:75, >>> org.apache.hadoop.hive.ql.parse.ParseUtils:parse:ParseUtils.java:68, >>> org.apache.hadoop.hive.ql.Driver:compile:Driver.java:564, >>> org.apache.hadoop.hive.ql.Driver:compileInternal:Driver.java:1425, >>> org.apache.hadoop.hive.ql.Driver:compileAndRespond:Driver.java:1398, >>> org.apache.hive.service.cli.operation.SQLOperation:prepare:SQLOperation.java:205], >>> sqlState:42000, errorCode:4, errorMessage:Error while compiling >>> statement: FAILED: ParseException line 1:39 cannot recognize input near >>> '"first_name"' 'TEXT' ',' in column name or primary key or foreign key), >>> Query: CREATE TABLE profile_test.person_test ("first_name" TEXT , >>> "last_name" TEXT , "country" TEXT ). >>> ... 77 more >>> >>> >>> Found similar issue ion Jira : >>> >>> https://issues.apache.org/jira/browse/SPARK-31614 >>> >>> There no comments in that and the resolution is Incomplete, is there >>> any way we can do in spark to write data into the hive as JDBC mode. >>> >>> Thanks for any help. >>> >>> >>> Thanks, >>> Badrinath. >>> >> > -- -- Daniel Mantovani

Re: [apache spark] Does Spark 2.4.8 have issues with ServletContextHandler

2021-06-14 Thread Daniel de Oliveira Mantovani
r test >>>> cases- >>>> >>>> java.lang.NoClassDefFoundError: Could not initialize class >>>> org.spark_object.jetty.servlet.ServletContextHandler >>>> at >>>> org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:143) >>>> >>>> When I checked in Maven dependencies, I could find the >>>> ServletContextHandlerclass under spark-core_2.11-2.4.8.jar in given package >>>> hierarchy. I have compiled code, dependencies have been resolved, classpath >>>> is updated. >>>> >>>> Any hint regarding this error would help. >>>> >>>> >>>> Thank you >>>> >>>> Kanchan >>>> >>> -- -- Daniel Mantovani

Re: How to submit a job via REST API?

2020-12-18 Thread Daniel de Oliveira Mantovani
>>> Sent from my iPhone >>> >>> On 24.11.2020 at 03:34, Zhou Yang wrote: >>> >>>  >>> Dear experts, >>> >>> I found a convenient way to submit job via Rest API at >>> https://gist.github.com/arturmkrtchyan/5d8559b2911ac951d34a#file-submit_job-sh >>> . >>> But I did not know whether can I append `—conf` parameter like what I >>> did in spark-submit. Can someone can help me with this issue? >>> >>> *Regards, Yang* >>> >>> > > -- > Regards, > Vaquar Khan > +1 -224-436-0783 > Greater Chicago > > > -- -- Daniel Mantovani

Purpose of type in pandas_udf

2020-11-12 Thread Daniel Stojanov
Hi, Note "double" in the function decorator. Is this specifying the type of the data that goes into pandas_mean, or the type returned by that function? Regards, @pandas_udf("double", PandasUDFType.GROUPED_AGG) def pandas_mean(v):     return v.sum() --

Pyspark application hangs (no error messages) on Python RDD .map

2020-11-10 Thread Daniel Stojanov
Hi, This code will hang indefinitely at the last line (the .map()). Interestingly, if I run the same code at the beginning of my application (removing the .write step) it executes as expected. Otherwise, the code appears further along in my application which is where it hangs. The debugging messag

DStreams stop consuming from Kafka

2020-11-10 Thread Razvan-Daniel Mihai
Hello, I have a usecase where I have to stream events from Kafka to a JDBC sink. Kafka producers write events in bursts of hourly batches. I started with a structured streaming approach, but it turns out that structured streaming has no JDBC sink. I found an implementation in Apache Bahir, but it

Re: Confuse on Spark to_date function

2020-11-05 Thread Daniel Stojanov
On 5/11/20 2:48 pm, 杨仲鲍 wrote: Code ```scala object Suit{ case class Data(node:String,root:String) def apply[A](xs:A *):List[A] = xs.toList def main(args: Array[String]): Unit ={ val spark = SparkSession.builder() .master("local") .appName("MoneyBackTest") .getOrCreate() import spark.imp

How does order work in Row objects when .toDF() is called?

2020-11-05 Thread Daniel Stojanov
>>> row_1 = psq.Row(first=1, second=2) >>> row_2 = psq.Row(second=22, first=11) >>> spark.sparkContext.parallelize([row_1, row_2]).toDF().collect() [Row(first=1, second=2), Row(first=22, second=11)] (Spark 3.0.1) What is happening in the above? When .toDF() is called it appears that order is m

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-30 Thread Daniel Chalef
ate instead of a pivot, and assembling the vector using > a UDF. > > On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef > wrote: > >> Hello, >> >> I have a very large long-format dataframe (several billion rows) that I'd >> like to pivot and vectorize (using th
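A sketch of the aggregate-and-assemble idea (not necessarily the code the thread settled on): long-format rows are grouped per entity and folded into a sparse vector by a UDF. Column names, the dimensionality, and the toy data are illustrative; a `spark` session is assumed.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, collect_list, udf}

import spark.implicits._

val dim = 1000000   // illustrative attribute-space size

// Index and value lists built in the same aggregation see rows in the same order,
// so the two lists stay aligned; the UDF sorts by index before building the vector.
val toSparse = udf { (indices: Seq[Int], values: Seq[Double]) =>
  val sorted = indices.zip(values).sortBy(_._1)
  Vectors.sparse(dim, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
}

val long = Seq((1L, 3, 0.5), (1L, 7, 1.0), (2L, 2, 2.0)).toDF("entity", "idx", "value")

val vectors = long
  .groupBy("entity")
  .agg(collect_list(col("idx")).as("indices"), collect_list(col("value")).as("values"))
  .select(col("entity"), toSparse(col("indices"), col("values")).as("features"))

vectors.show(false)
```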

[Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-10-29 Thread Daniel Chalef
hat I should attempt in order to achieve my end goal of a vector of attributes for every entity? Thanks, Daniel

Re: MongoDB plugin to Spark - too many open cursors

2020-10-26 Thread Daniel Stojanov
eds of the hardware itself. Regards, On 26/10/20 1:52 pm, lec ssmi wrote: Is the connection pool configured by mongodb full? Daniel Stojanov <mailto:m...@danielstojanov.com>> wrote on Mon, Oct 26, 2020 at 10:28 AM: Hi, I receive an error message from the MongoDB server if there are

MongoDB plugin to Spark - too many open cursors

2020-10-25 Thread Daniel Stojanov
Hi, I receive an error message from the MongoDB server if there are too many Spark applications trying to access the database at the same time (about 3 or 4), "Cannot open a new cursor since too many cursors are already opened." I am not too sure of how to remedy this. I am not sure how the

Re: reading a csv.gz file from sagemaker using pyspark kernel mode

2020-10-08 Thread Daniel Jankovic
ecause you didn't fill all the required information. BR, Daniel On Wed, Oct 7, 2020 at 3:44 PM cloudytech43 < cloudytechi.intellip...@gmail.com> wrote: > I am trying to read a compressed CSV file in pyspark. but I am unable to > read > in pyspark kernel mode in sagemaker. >

Re: S3 read/write from PySpark

2020-08-06 Thread Daniel Stojanov
Traceback (most recent call last): File "", line 1, in File "/home/daniel/packages/spark-3.0.0-bin-hadoop3.2/python/pyspark/sql/readwriter.py", line 535, in csv return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) File "/home/daniel

Understanding Spark execution plans

2020-08-05 Thread Daniel Stojanov
Hi, When an execution plan is printed it lists the tree of operations that will be completed when the job is run. The tasks have somewhat cryptic names of the sort: BroadcastHashJoin, Project, Filter, etc. These do not appear to map directly to functions that are performed on an RDD. 1) Is there

S3 read/write from PySpark

2020-08-05 Thread Daniel Stojanov
", "true").csv(f"s3a://{aws_bucket}/{aws_filename}") Leads to this error message: Traceback (most recent call last): File "", line 1, in File "/home/daniel/.local/share/Trash/files/spark-3.3.0.0-bin-hadoop3.2/python/pyspark/sql/readwriter.py", l

Is possible to give options when reading semistructured files using SQL Syntax?

2020-07-28 Thread Daniel de Oliveira Mantovani
Is possible to give options when reading semistructured files using SQL Syntax like in the example below: "SELECT * FROM csv.`file.csv` For example, if I want to have header=true. Is it possible ? Thanks -- -- Daniel Mantovani
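For reference, one workaround that covers the same need without the path-literal syntax is a temporary view declared with USING/OPTIONS, which accepts the same reader options as spark.read; this is a sketch, not an answer given in the thread, and the file path is a placeholder.

```scala
spark.sql("""
  CREATE TEMPORARY VIEW people
  USING csv
  OPTIONS (path 'file.csv', header 'true', inferSchema 'true')
""")

spark.sql("SELECT * FROM people").show()
```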

Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-17 Thread Daniel de Oliveira Mantovani
d these useful. > > > https://ask.streamsets.com/question/7/how-do-you-configure-a-hive-impala-jdbc-driver-for-data-collector/?answer=8#post-id-8 > > On Thu, Jul 9, 2020 at 11:28 AM Daniel de Oliveira Mantovani < > daniel.oliveira.mantov...@gmail.com> wrote: > >> One of my

Why can window functions only have fixed window sizes?

2020-07-15 Thread Daniel Stojanov
Hi, My understanding of window functions is that they can only operate on fixed window sizes. For example, I can create a window like the following:     Window.partitionBy("group_identifier").orderBy("sequencial_counter").rowsBetween(-4, 5) or even:     Window.partitionBy("group_identifier").o
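The frame from the message written out in Scala, with toy data standing in for the real columns; `rowsBetween` offsets are relative to the current row, which is why the frame has a fixed extent (the built-in alternatives are unbounded markers such as `Window.unboundedPreceding` or value-based `rangeBetween` frames). A `spark` session is assumed.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

import spark.implicits._

val df = Seq(("g1", 1, 10.0), ("g1", 2, 12.0), ("g1", 3, 9.0))
  .toDF("group_identifier", "sequencial_counter", "value")

val w = Window
  .partitionBy("group_identifier")
  .orderBy("sequencial_counter")
  .rowsBetween(-4, 5)                 // always 4 rows back through 5 rows ahead

val withRollingAvg = df.withColumn("rolling_avg", avg(col("value")).over(w))
withRollingAvg.show()
```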

Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-09 Thread Daniel de Oliveira Mantovani
download the driver from Cloudera here: https://www.cloudera.com/downloads/connectors/hive/jdbc/2-6-1.html On Tue, Jul 7, 2020 at 12:03 AM Daniel de Oliveira Mantovani < daniel.oliveira.mantov...@gmail.com> wrote: > Hello Gabor, > > I meant, third-party connector* not "connectio

Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Daniel de Oliveira Mantovani
oint me in that direction. > > > > > > > > -- > > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > > > - > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- -- Daniel Mantovani

Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Daniel de Oliveira Mantovani
Hello Gabor, I meant, third-party connector* not "connection". Thank you so much! On Mon, Jul 6, 2020 at 1:09 PM Gabor Somogyi wrote: > Hi Daniel, > > I'm just working on the developer API where any custom JDBC connection > provider(including Hive) can be added.

How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Daniel de Oliveira Mantovani
5 Do you know if there's a workaround for this ? Maybe using a third-party connection ? Thank you so much -- -- Daniel Mantovani

LynxKite is now open-source

2020-06-24 Thread Daniel Darabos
f you work with graphs. But even if you only use SparkSQL you may find it a nice (graphical) environment for data exploration. We're going to do development completely in the open and with the community from now on, so any bug reports and pull requests are extremely welcome. Cheers, Daniel

Spark 2.3 and Kafka client library version

2020-04-28 Thread Ahn, Daniel
I have a keberized HDFS cluster. When I use structured streaming with Kafka (with SASL_SSL/PLAINTEXT), I believe I’m blocked by Kafka-5294. It seems like fix version in 0.11.0.0 Kafka client library. I have a Spark 2.3 cluster, and it’s using 0.10.0.1 kafka client library. Do you know if I can

Spark Mongodb connector hangs indefinitely, not working on Amazon EMR

2020-04-21 Thread Daniel Stojanov
When running a Pyspark application on my local machine I am able to save and retrieve from the Mongodb server using the Mongodb Spark connector. All works properly. When submitting the exact same application on my Amazon EMR cluster I can see that the package for the Spark driver is being properly

[Structured Streaming] Checkpoint file compact file grows big

2020-04-15 Thread Ahn, Daniel
Are Spark Structured Streaming checkpoint files expected to grow over time indefinitely? Is there a recommended way to safely age-off old checkpoint data? Currently we have a Spark Structured Streaming process reading from Kafka and writing to an HDFS sink, with checkpointing enabled and writing

[Pyspark] - Spark uses all available memory; unrelated to size of dataframe

2020-04-08 Thread Daniel Stojanov
My setup: using Pyspark; Mongodb to retrieve and store final results; Spark is in standalone cluster mode, on a single desktop. Spark v.2.4.4. Openjdk 8. My spark application (using pyspark) uses all available system memory. This seems to be unrelated to the data being processed. I tested with 32G

Spark Dataset API for secondary sorting

2019-12-03 Thread Daniel Zhang
Hi, Spark Users: I have a question related to the way I use the spark Dataset API for my case. If the "ds_old" dataset is having 100 records, with 10 unique $"col1", and for the following pseudo-code: val ds_new = ds_old.repartition(5, $"col1").sortWithinPartitions($"col2").mapPartitions(new M
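A compacted sketch of the pattern in the question (types simplified, `spark` assumed). One detail worth noting: sorting only on col2 interleaves the different col1 keys inside a partition, so the secondary-sort variant orders by (col1, col2) before mapPartitions walks the rows.

```scala
import spark.implicits._

case class Rec(col1: String, col2: Int, payload: String)

val dsOld = Seq(Rec("a", 2, "x"), Rec("a", 1, "y"), Rec("b", 3, "z")).toDS()

val dsNew = dsOld
  .repartition(5, $"col1")                  // every col1 value lands in exactly one partition
  .sortWithinPartitions($"col1", $"col2")   // rows of each key are contiguous, ordered by col2
  .mapPartitions(rows => rows.map(r => r.copy(payload = r.payload.toUpperCase)))

dsNew.show()
```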

[Spark SS] Spark-23541 Backward Compatibility on 2.3.2

2019-09-26 Thread Ahn, Daniel
Is it tested whether this fix is backward compatible (https://issues.apache.org/jira/browse/SPARK-23541) for 2.3.2? I see that fix version is 2.4.0 in Jira. But quickly reviewing pull request (https://github.com/apache/spark/pull/20698), it looks like all the code change is limited to spark-sql

EMR Spark 2.4.3 executor hang

2019-08-30 Thread Daniel Zhang
Hi, All: We are testing the EMR and compare with our on-premise HDP solution. We use one application as the test: EMR (5.21.1) with Hadoop 2.8.5 + Spark 2.4.3 vs HDP (2.6.3) with Hadoop 2.7.3 + Spark 2.2.0 The application is very simple, just read Parquet raw file, then do a DS.repartition(id_co

Creating Spark buckets that Presto / Athena / Hive can leverage

2019-06-15 Thread Daniel Mateus Pires
n use so that Presto/Athena JOINs leverage the special layout of the data? e.g. CREATE EXTERNAL TABLE ...(on Presto/Athena) df.write.bucketBy(...).partitionBy(...). (in spark) then copy this data to S3 with s3-dist-cp then MSCK REPAIR TABLE (on Presto/Athena) Daniel
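A sketch of the write being described, with illustrative table and column names and stand-in data; note that `bucketBy` currently requires `saveAsTable`, and whether Presto/Athena can actually exploit Spark's bucket layout is exactly the open question of the thread.

```scala
val df = spark.range(1000).selectExpr("id AS user_id", "current_date() AS event_date")

df.write
  .partitionBy("event_date")
  .bucketBy(32, "user_id")        // number of buckets is illustrative
  .sortBy("user_id")
  .format("parquet")
  .saveAsTable("events_bucketed") // bucketBy only works with saveAsTable, not save()
```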

Blog post: DataFrame.transform -- Spark function composition

2019-06-05 Thread Daniel Mateus Pires
largely around issues / performance, this post only concerns itself with code readability, hopefully not off-topic! I welcome any feedback I can get :) Daniel
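The composition style the post is about, in miniature (illustrative names, `spark` session assumed): small DataFrame => DataFrame functions chained with `.transform` instead of nested calls.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower}

import spark.implicits._

def normalizeNames(df: DataFrame): DataFrame = df.withColumn("name", lower(col("name")))
def dropUnknown(df: DataFrame): DataFrame   = df.filter(col("name") =!= "unknown")

val rawDf = Seq("Alice", "UNKNOWN", "Bob").toDF("name")

val cleaned = rawDf
  .transform(normalizeNames)
  .transform(dropUnknown)

cleaned.show()   // alice and bob remain; the row that lower-cases to "unknown" is dropped
```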

[no subject]

2019-03-28 Thread Daniel Sierra
unsubscribe

java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT on EMR

2019-03-18 Thread Daniel Zhang
Hi, I know the JIRA of this error (https://issues.apache.org/jira/browse/SPARK-18112), and I read all the comments and even PR for it. But I am facing this issue on AWS EMR, and only in Oozie Spark Action. I am looking for someone can give me a hint or direction, so I can see if I can overco

unsubscribe

2019-01-31 Thread Daniel O' Shaughnessy

[no subject]

2019-01-31 Thread Daniel O' Shaughnessy
unsubscribe

[no subject]

2019-01-30 Thread Daniel O' Shaughnessy
Unsubscribe

unsubscribe

2019-01-29 Thread Daniel O' Shaughnessy
unsubscribe

[no subject]

2018-12-19 Thread Daniel O' Shaughnessy
unsubscribe

Re: how to use cluster sparkSession like localSession

2018-11-01 Thread Daniel de Oliveira Mantovani
> Signature customized by NetEase Mail Master (网易邮箱大师) <https://mail.163.com/dashi/dlpro.html?from=mail81> > - To > unsubscribe e-mail: user-unsubscr...@spark.apache.org -- -- Daniel de Oliveira Mantovani Perl Evangelist/Data Hacker +1 786 459 1341

[SPARK-SQL] Reading JSON column as a DataFrame and keeping partitioning information

2018-07-20 Thread Daniel Mateus Pires
of the brand! spark.read.json(df2.select("value").as[String]).show /* +---+---+ | a| c| +---+---+ | b| d| +---+---+ */ ``` Ideally I'd like something similar to spark.read.json that would keep the partitioning values and merge it with
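One way to get at what is being asked for, though not necessarily the thread's conclusion: parse the JSON column in place with `from_json`, so the surrounding columns (a stand-in `brand` column here plays the role of the partition value) are kept instead of re-reading only the JSON strings. A `spark` session is assumed.

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

import spark.implicits._

val df2 = Seq(("acme", """{"a":"b","c":"d"}""")).toDF("brand", "value")

val jsonSchema = StructType(Seq(StructField("a", StringType), StructField("c", StringType)))

val parsed = df2
  .withColumn("parsed", from_json(col("value"), jsonSchema))
  .select(col("brand"), col("parsed.*"))   // keeps brand alongside the parsed fields a, c

parsed.show()
```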

Setting log level to DEBUG while keeping httpclient.wire on WARN

2018-06-29 Thread Daniel Haviv
wire: log4j.rootCategory=DEBUG,console ... log4j.logger.org.apache=WARN log4j.logger.httpclient.wire.header=WARN log4j.logger.httpclient.wire.content=WARN log4j.logger.org.apache.commons.httpclient=WARN log4j.logger.httpclient=WARN log4j.logger.httpclient.wire.header=WARN Thank you. Daniel

Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread Daniel Pires
Thanks for coming back with the solution! Sorry my suggestion did not help Daniel On Wed, 20 Jun 2018, 21:46 mattl156, wrote: > Alright so I figured it out. > > When reading from and writing to Hive metastore Parquet tables, Spark SQL > will try to use its own Parquet support ins

Re: [Spark SQL]: How to read Hive tables with Sub directories - is this supported?

2018-06-20 Thread Daniel Pires
) It will bring you the data with the schema inferred from the JSON + 3 fields : extract_year, extract_month, extract_day I was looking for the documentation that described this way of partitioning but could not find it, I can reply with the link once I do Hope that helps, --- Daniel Mateus
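A sketch of the partition-discovery behaviour being described, with illustrative paths and a `spark` session assumed: key=value directory names become columns, and those columns can be pruned when used in filters.

```scala
import org.apache.spark.sql.functions.col

// Layout assumed for illustration:
//   s3a://bucket/events/extract_year=2018/extract_month=06/extract_day=20/part-00000.json
val events = spark.read.json("s3a://bucket/events/")

events.printSchema()   // the inferred JSON fields plus extract_year, extract_month, extract_day

events
  .where(col("extract_year") === 2018 && col("extract_month") === 6)   // prunes directories
  .show()
```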

[Spark-sql Dataset] .as[SomeClass] not modifying Physical Plan

2018-06-17 Thread Daniel Pires
g the wrong things from `.as[SomeClass]` ? Why wouldn’t it by default add a Project to the Query Plan ? Best regards, Daniel Mateus Pires Data Engineer

[PySpark Pipeline XGboost] How to use XGboost in PySpark Pipeline

2018-05-31 Thread Daniel Du
xgboost import XGBClassifier ... model = XGBClassifier() model.fit(X_train, y_train) pipeline = Pipeline(stages=[..., model, ...]) It is convenient to use the pipeline api, so can anybody give some advices? Thank you! Daniel -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com

Thrift server not exposing temp tables (spark.sql.hive.thriftServer.singleSession=true)

2018-05-30 Thread Daniel Haviv
e import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 import org.apache.spark.sql.hive.thriftserver._ val df = spark.read.orc( "s3://sparkdemoday/crimes_orc") df.createOrReplaceTempView("cached_df") df.cache df.count val sql = spark.sqlContext HiveThriftServer2.startWithContext(sql) Thank you. Daniel
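The same in-process start as in the snippet, as a sketch with the setting from the subject line applied when the session is built, so JDBC clients share the session that owns the temp view; the ORC path is taken from the snippet and otherwise illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder()
  .appName("thrift-temp-view-sketch")
  .config("spark.sql.hive.thriftServer.singleSession", "true")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.orc("s3://sparkdemoday/crimes_orc")
df.createOrReplaceTempView("cached_df")
df.cache()
df.count()

HiveThriftServer2.startWithContext(spark.sqlContext)
```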

Interest in adding ability to request GPU's to the spark client?

2018-05-15 Thread Daniel Galvez
here have advice on why this is a bad idea. Unfortunately, I am not familiar enough with Mesos and Kubernetes right now to know how they schedule gpu resources and whether adding support for requesting GPU's from them to the spark-submit client would be simple. Daniel -- Daniel Galvez http:

flatMapGroupsWithState not timing out (spark 2.2.1)

2018-01-12 Thread daniel williams
Hi, I’m attempting to leverage flatMapGroupsWithState to handle some arbitrary aggregations and am noticing a couple of things: - *ProcessingTimeTimeout* + *setTimeoutDuration* timeout not being honored - *EventTimeTimeout* + watermark value not being honored. - *EventTimeTimeout* + *
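For context, a minimal sketch of the API in question in its processing-time variant (field names, the rate source, and the timeout duration are illustrative); whether the timeout actually fires was exactly what this thread was probing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, value: Long)
case class Agg(key: String, count: Long)

val spark = SparkSession.builder().appName("fmgws-sketch").getOrCreate()
import spark.implicits._

// Stand-in streaming source for illustration only.
val eventStream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .selectExpr("CAST(value % 10 AS STRING) AS key", "value")
  .as[Event]

def update(key: String, events: Iterator[Event], state: GroupState[Agg]): Iterator[Agg] = {
  if (state.hasTimedOut) {
    val result = state.get          // no new data arrived within the timeout window
    state.remove()
    Iterator(result)                // emit the final aggregate and clear the state
  } else {
    val current = state.getOption.getOrElse(Agg(key, 0L))
    state.update(current.copy(count = current.count + events.size))
    state.setTimeoutDuration("30 seconds")
    Iterator.empty
  }
}

val aggregated = eventStream
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(update)
```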

Writing a UDF that works with an Interval in PySpark

2017-12-11 Thread Daniel Haviv
'Could not parse datatype: calendarinterval',), , (u'{"type":"struct","fields":[{"name":"","type":"date","nullable":true,"metadata":{}},{"name":"","type":"calendarinterval","nullable":true,"metadata":{}}]}',)) Any ideas how can I achieve this (or even better, has someone already done this)? Thank you. Daniel

Re: UDF issues with spark

2017-12-10 Thread Daniel Haviv
Some code would help to debug the issue On Fri, 8 Dec 2017 at 21:54 Afshin, Bardia < bardia.afs...@changehealthcare.com> wrote: > Using pyspark cli on spark 2.1.1 I’m getting out of memory issues when > running the udf function on a recordset count of 10 with a mapping of the > same value (arbirt

Re: Getting Message From Structured Streaming Format Kafka

2017-12-01 Thread Daniel de Oliveira Mantovani
gs. I Couldn't find anything useful there. Thank you very much! On Thu, Nov 2, 2017 at 12:04 PM, Burak Yavuz wrote: > Hi Daniel, > > Several things: > 1) Your error seems to suggest you're using a different version of Spark > and a different version of the sql-kafka

Getting Message From Structured Streaming Format Kafka

2017-11-02 Thread Daniel de Oliveira Mantovani
s.select("key") noAggDF .writeStream .format("console") .start() But I'm having the error: http://paste.scsys.co.uk/565658 How do I get my messages using kafka as format from Structured Streaming ? Thank you -- -- Daniel de Oliveira Mantovani Perl Evangelist/Data Hacker +1 786 459 1341

Writing custom Structured Streaming receiver

2017-11-01 Thread Daniel Haviv
Hi, Is there a guide to writing a custom Structured Streaming receiver? Thank you. Daniel

Getting RabbitMQ Message Delivery Tag (Stratio/spark-rabbitmq)

2017-10-30 Thread Daniel de Oliveira Mantovani
the message it self and not an object with the RabbitMQ message structure, which would include the delivery tag. I really need the delivery tag to write an efficient and safe reader. Someone knows how to get the delivery tag ? Or should I use other library to read from RabbitMQ ? Thank you --

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
ons there are, you will need to coalesce or repartition. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com wrote: > Thanks Daniel! > > I've been wondering that f

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
-configuration-options I have no idea why it defaults to a fixed 200 (while default parallelism defaults to a number scaled to your number of cores), or why there are two separate configuration properties. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New
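The two settings discussed here, side by side with illustrative values; the SQL one can also be changed per session at runtime.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-config-sketch")
  .config("spark.sql.shuffle.partitions", "64")  // DataFrame/SQL shuffles; fixed default of 200
  .config("spark.default.parallelism", "64")     // RDD operations such as join and reduceByKey
  .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", "64")   // runtime override of the SQL setting
```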

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
On Thu, Sep 28, 2017 at 7:23 AM, Gourav Sengupta wrote: > > I will be very surprised if someone tells me that a 1 GB CSV text file is > automatically split and read by multiple executors in SPARK. It does not > matter whether it stays in HDFS, S3 or any other system. > I can't speak to *any* sys

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> Can you kindly explain how Spark uses parallelism for bigger (say 1GB) > text file? Does it use InputFormat do create multiple splits and creates 1 > partition per split? Also, in case of S3 or NFS, how does the input split > work? I understand for HDFS files are already pre-split so Spark can us

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> no matter what you do and how many nodes you start, in case you have a > single text file, it will not use parallelism. > This is not true, unless the file is small or is gzipped (gzipped files cannot be split).
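A short illustration of the consequence and the usual mitigation, with an illustrative path and partition count and a `spark` session assumed.

```scala
// A single gzipped file arrives as one partition because gzip is not splittable;
// repartitioning right after the read restores parallelism for later stages.
val df = spark.read.option("header", "true").csv("s3a://bucket/data/big_file.csv.gz")
println(df.rdd.getNumPartitions)    // 1 for a single gzipped file

val parallel = df.repartition(64)   // one shuffle up front, then 64-way parallelism
```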

Re: Nested RDD operation

2017-09-19 Thread Daniel O' Shaughnessy
"event_name")).select("eventIndex").first().getDouble( 0)) }) }) Wondering if there is any better/faster way to do this ? Thanks. On Fri, 15 Sep 2017 at 13:31 Jean Georges Perrin wrote: > Hey Daniel, not sure this will help, but... I had a similar need where i > wanted th

Nested RDD operation

2017-09-15 Thread Daniel O' Shaughnessy
Hi guys, I'm having trouble implementing this scenario: I have a column with a typical entry being : ['apple', 'orange', 'apple', 'pear', 'pear'] I need to use a StringIndexer to transform this to : [0, 2, 0, 1, 1] I'm attempting to do this but because of the nested operation on another RDD I g
