Is it in any case appropriate to use log4j 1.x, which is no longer maintained
and has other security vulnerabilities that will never be fixed?
> Am 13.12.2021 um 06:06 schrieb Sean Owen :
>
>
> Check the CVE - the log4j vulnerability appears to affect log4j 2, not 1.x.
> There was menti
Do you use the HiveContext in Spark? Do you configure the same options there?
Can you share some code?
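As a minimal sketch (metastore URI and table name are placeholders), enabling Hive support on the SparkSession - the Spark 2.x replacement for the old HiveContext - and passing the same options could look roughly like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-read")
  .enableHiveSupport()  // Spark 2.x equivalent of the old HiveContext
  .config("hive.metastore.uris", "thrift://metastore-host:9083") // placeholder
  .getOrCreate()

spark.sql("SELECT count(*) FROM mydb.mytable").show()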
> Am 07.08.2019 um 08:50 schrieb Rishikesh Gawade :
>
> Hi.
> I am using Spark 2.3.2 and Hive 3.1.0.
> Even if i use parquet files the result would be same, because after all
> sparkSQL isn't
I would remove all the GC tuning and add it back later, once you have found the
underlying root cause. Usually more GC means you need to provide more memory,
because something has changed (your application, Spark version, etc.).
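As a rough sketch of that baseline (the values below are placeholders, not recommendations): drop the custom GC flags and simply give the executors more memory first.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("baseline-without-gc-tuning")
  .config("spark.executor.memory", "8g")          // placeholder value
  .config("spark.executor.memoryOverhead", "1g")  // placeholder value
  .getOrCreate()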
We don’t have your full code to give exact advice, but you may want to rethink
What does your data source structure look like?
Can’t you release it at the end of the build scan method?
What technology is used in the transactional data endpoint?
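To illustrate the point about releasing the resource at the end of buildScan, here is a hedged sketch of a DataSource v1 relation; EndpointClient and its methods are hypothetical stand-ins for the actual transactional endpoint client.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{DoubleType, LongType, StructField, StructType}

// Hypothetical client for the transactional endpoint.
class EndpointClient {
  def fetch(columns: Array[String], filters: Array[Filter]): Seq[Row] = Seq.empty
  def close(): Unit = ()
}

class TransactionalRelation(@transient override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", LongType),
    StructField("amount", DoubleType)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val client = new EndpointClient()   // acquire the resource per scan
    try {
      sqlContext.sparkContext.parallelize(client.fetch(requiredColumns, filters))
    } finally {
      client.close()                    // release it at the end of buildScan
    }
  }
}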
> Am 24.05.2019 um 15:36 schrieb Abhishek Somani :
>
> Hi experts,
>
> I am trying to create a custom Spark Datasource(v1) to r
Also on AWS and probably some more cloud providers
> Am 19.03.2019 um 19:45 schrieb Steve Loughran :
>
>
> you might want to look at the work on FPGA resources; again it should just be
> a resource available by a scheduler. Key thing is probably just to keep the
> docs generic
>
> https://ha
lmost exactly 100ms to
> process 1 result (as seen by the consecutive TID’s below) or any logging I
> may be able to turn on to narrow the search.
>
> There are no errors or warnings in the logs.
>
>
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: Mon
Well, it is a little bit difficult to say, because a lot of things are mixed up
here. What function is calculated? Does it need a lot of memory? Could it be
that you run out of memory, some spill-over happens, and you have a lot of disk
I/O that is blocking?
Related to that could be 1 exec
Maybe it is better to introduce a new data type that supports negative scale;
otherwise the migration and testing effort for organizations running Spark
applications becomes too large. Of course the current decimal type will be kept
as it is.
> Am 07.01.2019 um 15:08 schrieb Marco Gaido :
>
> In gen
I don’t know your exact underlying business problem, but maybe a graph
solution, such as Spark GraphX, meets your requirements better. Usually
self-joins are done to address some kind of graph problem (even if you would
not describe it as such), and for these kinds of problems it is much more efficient.
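As a hedged illustration (the vertices and edges are hypothetical), the kind of relationship that is usually expressed via self-joins maps directly onto a GraphX graph:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

// Hypothetical relationship data that would otherwise be self-joined.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "rel"), Edge(2L, 3L, "rel")))

val graph = Graph(vertices, edges)
// e.g. transitive relationships via connected components instead of repeated self-joins
val components = graph.connectedComponents().vertices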
not re-apply pushed filters. If data source lies, many things can
> go wrong...
>
>> On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke wrote:
>> Well even if it has to apply it again, if pushdown is activated then it will
>> be much less cost for spark to see if the filter has b
rice for that).
>
> Is there any other option I am not considering?
>
> Best regards,
> Alessandro
>
> Il giorno Sab 8 Dic 2018, 12:32 Jörn Franke ha scritto:
>> BTW. Even for json a pushdown can make sense to avoid that data is
>> unnecessary ending in Spark (
BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily
ending up in Spark (because it would cause unnecessary overhead).
In the DataSource v2 API you need to implement SupportsPushDownFilters
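A minimal sketch against the Spark 2.4-era DataSourceV2 reader API (the interface names and signatures changed between releases, so treat this as illustrative only): keep the filters the source can evaluate, and hand the rest back to Spark.

import java.util.Collections

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{Filter, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

class JsonLikeReader extends DataSourceReader with SupportsPushDownFilters {

  private var accepted: Array[Filter] = Array.empty

  override def readSchema(): StructType =
    StructType(Seq(StructField("id", LongType)))

  // Keep what we can evaluate at the source, return the rest for Spark to re-apply.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (ours, rest) = filters.partition {
      case _: IsNotNull => true   // pretend we can only handle IsNotNull
      case _            => false
    }
    accepted = ours
    rest
  }

  override def pushedFilters(): Array[Filter] = accepted

  // Partition planning omitted in this sketch.
  override def planInputPartitions(): java.util.List[InputPartition[InternalRow]] =
    Collections.emptyList()
}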
> Am 08.12.2018 um 10:50 schrieb Noritaka Sekiyama :
>
> Hi,
>
> I'm a support
It was already available before DataSourceV2, but I think it might have been an
internal/semi-official API (e.g., JSON has been an internal data source for some
time now). The filters were provided to the data source, but you would never know
whether the data source had indeed leveraged them or if for other rea
Is the original file indeed UTF-8? Windows environments in particular tend to mess
up files (e.g., Java on Windows does not use UTF-8 by default). However, the
software that processed the data beforehand could also have modified it.
> Am 10.11.2018 um 02:17 schrieb lsn24 :
>
> Hello,
>
> Per the d
This is not fully correct. If you have fewer files, then you need to move some
data to other nodes, because not all of the data is in place for writing (this is
even the case for the same node, but then it is easier from a network perspective).
Hence a shuffle is needed.
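As a small sketch (paths and partition counts are arbitrary): coalesce merges partitions with as little data movement as possible, while repartition always triggers a full shuffle.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(1000000).toDF("id")   // placeholder data

df.coalesce(10).write.parquet("/tmp/out_coalesced")         // fewer files, minimal movement
df.repartition(10).write.parquet("/tmp/out_repartitioned")  // fewer files, full shuffle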
> Am 15.10.2018 um 05:04 schrie
I think it makes sense to remove it.
If it is not too much effort and the architecture of the Flume source is not
considered too strange, one could extract it as a separate project and put it
on GitHub in a dedicated, unsupported repository. This would enable
distributors and other companies t
What is the ultimate goal of this algorithm? There may already be algorithms
within Spark that can do this. You could also put a message on Kafka (or
another broker) and have Spark applications listen to it to trigger further
computation. This would also be more controlled and can be done a
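As a hedged sketch of the Kafka variant (broker address and topic are hypothetical, and it needs the spark-sql-kafka package on the classpath): downstream Spark applications subscribe to a topic and react to incoming trigger messages.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("trigger-listener").getOrCreate()

val triggers = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "computation-triggers")        // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS payload")

triggers.writeStream
  .format("console")   // stand-in for the actual follow-up computation
  .start()
  .awaitTermination()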
Can’t you remove the dependency on the Databricks CSV data source? Spark has
had it integrated for several versions now, so it is not needed.
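For instance (path and options are placeholders), the built-in reader replaces the external package:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Instead of .format("com.databricks.spark.csv"):
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/file.csv")   // placeholder path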
> On 31. Aug 2018, at 05:52, Srabasti Banerjee
> wrote:
>
> Hi,
>
> I am trying to run below code to read file as a dataframe onto a Stream (for
> Spark S
> programs).
>
>
>> On Mon, May 7, 2018 at 10:05 PM Jörn Franke wrote:
>> Hadoop / Yarn 3.1 added GPU scheduling. 3.2 is planned to add FPGA
>> scheduling, so it might be worth to have the last point generic that not
>> only the Spark scheduler, but all supported
Hadoop/YARN 3.1 added GPU scheduling; 3.2 is planned to add FPGA scheduling.
So it might be worth making the last point generic, so that not only the Spark
scheduler but all supported schedulers can use GPUs.
For the other two points I just wonder if it makes sense to address this in the
ml framew
ry that most likely have to be reimplemented twice in Python…
>> Or there might be a way to force our lib execution in the same JVM as Spark
>> uses. To be seen… Again the most elegant way would be the datasource.
>>
>> Cheers,
>> Jakub
>>
>>
>>
A note on the internal API - it used to change with each release, which was
quite annoying because other data sources (Avro, HadoopOffice, etc.) had to
follow up on this. In the end it is an internal API and thus does not guarantee
to be stable. If you want to have something stable you have to
Spark at some point used, for the formats shipped with Spark (e.g., Parquet),
an internal API that is not the data source API. You can look at how this is
implemented for Parquet and co. in the Spark source code.
Maybe this is the issue you are facing?
Have you tried to put your encapsulation
And the usual hint when migrating - do not only migrate, but also optimize the
ETL process design - this brings the most benefits.
> On 5. Apr 2018, at 08:18, Jörn Franke wrote:
>
> Ok this is not much detail, but you are probably best off if you migrate them
> to SparkSQL.
>
such as the cost-based optimizer.
> On 5. Apr 2018, at 08:02, Pralabh Kumar wrote:
>
> Hi
>
> I have lot of ETL jobs (complex ones) , since they are SLA critical , I am
> planning them to migrate to spark.
>
>> On Thu, Apr 5, 2018 at 10:46 AM, Jörn Franke wrote:
You need to provide more context on what you currently do in Hive and what you
expect from the migration.
> On 5. Apr 2018, at 05:43, Pralabh Kumar wrote:
>
> Hi Spark group
>
> What's the best way to Migrate Hive to Spark
>
> 1) Use HiveContext of Spark
> 2) Use Hive on Spark
> (https://
I think most of the Scala development in Spark happens with sbt - in the open
source world.
However, you can do it with Gradle and Maven as well. It depends on your
organization etc. what your standard is.
Some things might be more cumbersome to reach in non-sbt Scala scenarios, but
this is
>>> difference. Probably the simplest argument for a lot of time being spent
>>> sorting (in some use cases) is the fact it's still one of the standard
>>> benchmarks.
>>>
>>> On Mon, Dec 4, 2017 at 1:55 AM, Jörn Franke >> <mailto:jornfra.
I do not think that the data source API exposes such a thing. You could,
however, propose that it be included in data source API v2.
However, there are some caveats, because "sorted" can mean two different things
(weak vs. strict order).
Then, is a lot of time really lost because of sorting? The best t
Or ByteType, depending on the use case.
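A tiny hypothetical schema sketch of the two choices - free-form text as StringType, a small numeric code as ByteType:

import org.apache.spark.sql.types.{ByteType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("comment", StringType, nullable = true),     // arbitrary text
  StructField("statusCode", ByteType, nullable = false)))  // small integral values only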
> On 23. Nov 2017, at 10:18, Herman van Hövell tot Westerflier
> wrote:
>
> You need to use a StringType. The CharType and VarCharType are there to
> ensure compatibility with Hive and ORC; they should not be used anywhere else.
>
>> On Thu, Nov 23, 2017
uster running somewhere.
>
>
>> On Sun, 12 Nov 2017 at 17:17 Jörn Franke wrote:
>> Why do you even mind?
>>
>> > On 11. Nov 2017, at 18:42, Cristian Lorenzetto
>> > wrote:
>> >
>> > Considering the case i neednt hdfs, it there a way for
Why do you even mind?
> On 11. Nov 2017, at 18:42, Cristian Lorenzetto
> wrote:
>
> Considering the case i neednt hdfs, it there a way for removing completely
> hadoop from spark?
> Is YARN the unique dependency in spark?
> is there no java or scala (jdk langs)YARN-like lib to embed in a proj
alongside DataNodes, so the
> DataNode process would get some resources.
>
> The other thing you can do is to increase `dfs.client.socket-timeout` in
> hadoopConf,
> I see that it's set to 12 in your case right now
>
>> On Thu, Nov 9, 2017 at 4:28 PM, Jan-Hendrik Za
Maybe contact Oracle support?
Have you perhaps accidentally configured some firewall rules? Routing issues?
Maybe only on one of the nodes...
> On 9. Nov 2017, at 20:04, Jan-Hendrik Zab wrote:
>
>
> Hello!
>
> This might not be the perfect list for the issue, but I tried user@
> previously
> Writing every table to parquet and reading it could be very much time
> consuming, currently entire job could take ~8 hours on 8 node of 100 Gig ram
> 20 core cluster, not only used utilized by me but by larger team.
>
> Thanks
>
>
>> On Fri, Nov 3, 2017 at 1:31 AM,
Hi,
Do you have a more detailed log/error message?
Also, can you please provide details on the tables (number of rows, columns,
size, etc.)?
Is this a one-time thing or something regular?
If it is a one-time thing, then I would tend more towards putting each table in
HDFS (Parquet or ORC) and
Scala 2.12 is not yet supported by Spark - which also means no JDK 9:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-14220
If you look at Oracle support, JDK 9 is anyway only supported for 6 months.
JDK 8 is LTS (5 years), JDK 18.3 will be only 6 months, and JDK 18.9 is
l
I would raise this issue with Databricks - it is their repository.
> On 26. Oct 2017, at 18:43, comtef wrote:
>
> I've used spark for a couple of years and I found a way to contribute to the
> cause :).
> I've found a blocker in Spark XML extension
> (https://github.com/databricks/spark-xml).
Not sure I fully understand the issue (source code is always helpful ;-) but why
don't you override the toString method of IPAddress? The IP address could still
be stored as bytes, but when it is displayed, toString converts the byte address
into something human-readable.
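A sketch of that idea (IPAddress here is a hypothetical wrapper, not an existing Spark or library class): keep the raw bytes, render them human-readable only on display.

final case class IPAddress(bytes: Array[Byte]) {
  require(bytes.length == 4 || bytes.length == 16, "expected an IPv4 or IPv6 address")

  override def toString: String =
    if (bytes.length == 4) bytes.map(_ & 0xFF).mkString(".")      // e.g. "192.168.0.1"
    else java.net.InetAddress.getByAddress(bytes).getHostAddress  // IPv6 via the JDK
}

// IPAddress(Array(192, 168, 0, 1).map(_.toByte)).toString == "192.168.0.1"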
> On 15. Aug 2017, at
Try sparksession.conf().set
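As a hedged sketch of that suggestion, raising the Hive limits that produce the error quoted below (the values are placeholders; pick limits that match the table layout):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("hive.exec.max.dynamic.partitions", "2000")          // placeholder
spark.conf.set("hive.exec.max.dynamic.partitions.pernode", "2000")  // placeholder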
> On 28. Jul 2017, at 12:19, Chetan Khatri wrote:
>
> Hey Dev/ USer,
>
> I am working with Spark 2.0.1 and with dynamic partitioning with Hive facing
> below issue:
>
> org.apache.hadoop.hive.ql.metadata.HiveException:
> Number of dynamic partitions created is 1344
I do think this is the right way; you will have to test with test data,
verifying that the expected output of the calculation is the actual output.
Even if the logical plan is correct, your calculation might not be. E.g., there
can be bugs in Spark, in the UI, or (as is very often the case) the client descr
I think your example relates to scheduling; e.g., it makes sense to use Oozie or
similar to fetch the data at specific points in time.
I am also not a big fan of caching everything. In a multi-user cluster with a
lot of applications you waste a lot of resources, making everybody less
efficient.
I think this is a rather simplistic view. All of these tools do their computation
in-memory in the end. For certain types of computation and usage patterns it
makes sense to keep the data in memory. For example, most machine learning
approaches require including the same data in several iterative calculations.
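A minimal example of that access pattern (path and "iteration" are placeholders): load once, cache, and reuse across iterations instead of re-reading from disk each time.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val training = spark.read.parquet("/data/training").cache()  // placeholder path

(1 to 10).foreach { _ =>
  // each pass hits the in-memory copy rather than the file system
  training.count()   // stand-in for a real iterative step
}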
Which Spark version are you using? What exactly are you trying to do, and what
is the input data? As far as I know, Akka has been dropped in recent Spark
versions.
> On 30 Jan 2017, at 00:44, aravasai wrote:
>
> I have a spark job running on 2 terabytes of data which creates more than
> 30,000
I also agree with Joseph and Sean.
With respect to spark-packages, I think the issue is that you have to add it
manually, although it basically fetches the package from Maven Central (or a
custom upload).
From an organizational perspective there are other issues. E.g., you have to
download it from
Hi,
What about YARN or Mesos used in combination with Spark? They also have cgroups.
Or a Kubernetes etc. deployment.
> On 15 Dec 2016, at 17:37, Hegner, Travis wrote:
>
> Hello Spark Devs,
>
>
> I have finally completed a mostly working proof of concept. I do not want to
> create a pull requ
Maybe TitanDB? It uses HBase to store graphs and Solr (on HDFS) to index
them. I am not 100% sure it supports this, but probably.
It can also integrate with Spark, but only for analytics on a given graph.
Otherwise you need to go for a dedicated graph system.
> On 24 Oct 2016, at 16:41, Marco wrote:
>
>
You should also take into account that Spark has different options to represent
data in-memory, such as Java serialized objects, Kryo serialized, Tungsten
(columnar, optionally compressed), etc. The Tungsten representation depends
heavily on the underlying data and sorting, especially if compressed.
Then, yo
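As a small sketch of choosing the serializer explicitly (the Kryo setting must be in place before the context starts; the values are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "false")   // illustrative choice

val spark = SparkSession.builder().config(conf).getOrCreate()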
Is it traditional bitmap indexing? I would not recommend it for big data.
You could use bloom filters and min/max indexes in-memory, which look to be more
appropriate. However, if you want to use bitmap indexes then you would have to
do it as you say. Bitmap indexes may, however, consume a lo
You should see it at both levels: there is one bloom filter for the ORC data and
one for the data in-memory.
It is already a good step towards an integration of the format and the in-memory
representation for columnar data.
> On 22 Jun 2016, at 14:01, BaiRan wrote:
>
> After building bloom filter on existi
It is based on the underlying Hadoop FileFormat, which does it mostly based on
block size. You can change this, though.
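One knob that can change this split size for file-based sources is sketched below (assuming it also applies to the streaming file source in the version in use; the 64 MB value and path are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder()
  .config("spark.sql.files.maxPartitionBytes", "67108864")  // 64 MB, example only
  .getOrCreate()

// File streams need an explicit schema; one placeholder column shown.
val schema = StructType(Seq(StructField("line", StringType)))
val stream = spark.readStream.schema(schema).text("/data/in")  // placeholder path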
> On 21 Jun 2016, at 12:19, Sachin Aggarwal wrote:
>
>
> when we use readStream to read data as Stream, how spark decides the no of
> RDD and partition within each RDD with respe
I am not sure what you are comparing here. You would need to provide additional
details, such as the algorithms and functionality supported by your framework.
For instance, Spark has built-in fault tolerance and is a generic framework, which
has advantages with respect to development and operations, but ma
How did you configure YARN queues? What scheduler? Preemption?
> On 19 Feb 2016, at 06:51, Prabhu Joseph wrote:
>
> Hi All,
>
>When running concurrent Spark Jobs on YARN (Spark-1.5.2) which share a
> single Spark Context, the jobs take more time to complete comparing with when
> they ran
Probably a newer Hive version makes a lot of sense here - at least 1.2.1. What
storage format are you using?
I think the old Hive version had a bug where it always scanned all partitions
unless you limited it to a certain partition in the ON clause of the query (e.g.
ON date=20201119).
> On 28 Jan 2
Is there any distributor supporting these software components in combination?
If not, and your core business is not software, then you may want to look for
something else, because it might not make sense to build up internal know-how
in all of these areas.
In any case - it all depends highly on y
Would it be possible to use views to address some of your requirements?
Alternatively it might be better to parse it yourself; there are open source
libraries for that, if you really need a complete SQL parser. Do you want to do
it on subqueries?
> On 05 Nov 2015, at 23:34, Yana Kadiyska wrote:
I am not sure what you are trying to achieve here. Have you thought about
using Flume? Additionally, maybe something like rsync?
On Sat, 12 Sep 2015 at 00:02, Varadhan, Jawahar
wrote:
> Hi all,
>I have a coded a custom receiver which receives kafka messages. These
> Kafka messages have FTP
Well, what do you do in case of failure?
I think one should use a professional ingestion tool that ideally does not
need to reload everything in case of failure and that verifies, via checksums,
that the file has been transferred correctly.
I am not sure if Flume supports FTP, but SSH/SCP should be supported