Hi spark users,
A few years back I created a Java implementation of the HNSW algorithm in
my spare time. HNSW is an algorithm for k-nearest neighbour search, or as
people tend to refer to it now: vector search.
It can be used to implement things like recommendation systems, image
search,
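For anyone new to the term: k-nearest neighbour search finds the k items whose vectors are closest to a query vector. A tiny brute-force sketch (toy data, cosine similarity) is below; HNSW answers the same query approximately, without scanning every vector, which is what makes it fast on large collections.

// Brute-force k-nearest-neighbour search over toy vectors; an HNSW index
// returns (approximately) the same answer without comparing against every item.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot  = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  dot / norm
}

def kNearest(query: Array[Double], items: Seq[(String, Array[Double])], k: Int): Seq[(String, Double)] =
  items.map { case (id, vec) => (id, cosine(query, vec)) }
       .sortBy { case (_, score) => -score }
       .take(k)

val items = Seq("a" -> Array(1.0, 0.0), "b" -> Array(0.9, 0.1), "c" -> Array(0.0, 1.0))
println(kNearest(Array(1.0, 0.05), items, 2))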
The only way I ever got it to work with Spark standalone is via WebHDFS.
See
https://issues.apache.org/jira/browse/SPARK-5158?focusedCommentId=16516856&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16516856
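For context, using WebHDFS just means reading through the namenode's HTTP endpoint with the webhdfs:// URI scheme instead of hdfs://. A minimal sketch, with a made-up host, port and path:

// Minimal sketch, not from the original thread: reading through the WebHDFS
// (HTTP) endpoint of the namenode instead of the hdfs:// RPC endpoint.
// Host, port and path are placeholders; use whatever port serves WebHDFS.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("webhdfs-read").getOrCreate()
val lines = spark.read.textFile("webhdfs://namenode.example.com:9870/data/input.txt")
println(lines.count())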
On Fri, 8 Jan 2021 at 18:49, Sudhir Babu Pothineni wrote:
pache/spark/rdd/RDD.scala#L298
But since we're using an old version, that does not really help.
On Fri, 22 Jan 2021 at 15:34, Sean Owen wrote:
> RDDs are immutable, and Spark itself is thread-safe. This should be fine.
> Something else is going on in your code.
>
> On Fri, Jan 2
copy of the RDD the job will complete fine.
I suspect it's a bad idea to use the same rdd from two threads but I could
not find any documentation on the subject.
Should it be possible to do this, and if not, can anyone point me to
documentation pointing out that this is not on the table?
--jelmer
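For what it's worth, the pattern under discussion looks roughly like the sketch below (toy data, made-up transformations): two threads each running their own action against the same cached RDD. Per Sean's reply this is supported, since RDDs are immutable and job submission through the SparkContext is thread-safe.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

// Toy data; the real job would of course be much bigger.
val spark = SparkSession.builder().appName("shared-rdd").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000000).cache()

// Two independent actions on the same RDD, submitted from two threads.
val jobA = Future { rdd.map(_ * 2).count() }
val jobB = Future { rdd.filter(_ % 2 == 0).count() }

println(Await.result(jobA, Duration.Inf))
println(Await.result(jobB, Duration.Inf))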
I am writing something that partitions a data set and then trains a machine
learning model on the data in each partition.
The resulting model is very big and right now I am storing it in an RDD as
a pair of:
partition_id and very_big_model_that_is_hundreds_of_megabytes_big
but it is becoming in
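The rest of that mail is cut off, but a minimal sketch of the pattern it describes, training one model per partition and keeping an RDD of (partition_id, model) pairs, could look like this (the input path, training function and model type are all placeholders):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("per-partition-models").master("local[*]").getOrCreate()

// Stand-in for the real training code; the actual model is hundreds of MB.
def trainModel(rows: Array[String]): Array[Double] = Array(rows.length.toDouble)

val data: RDD[String] = spark.sparkContext.textFile("hdfs:///path/to/input")  // placeholder path

// One (partition_id, model) pair per partition.
val models: RDD[(Int, Array[Double])] =
  data.mapPartitionsWithIndex { (partitionId, rows) =>
    Iterator((partitionId, trainModel(rows.toArray)))
  }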
aneous tasks per executor.
>
> On Sun, 8 Dec 2019, 8:16 pm jelmer, wrote:
>
>> I have a job, running on yarn, that uses multithreading inside of a
>> mapPartitions transformation
>>
>> Ideally I would like to have a small number of partitions but have a high
>>
I have a job, running on YARN, that uses multithreading inside of a
mapPartitions transformation.
Ideally I would like to have a small number of partitions but a high
number of YARN vcores allocated to the task (which I can take advantage of
because of the multithreading).
Is this possible?
I tri
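The rest of that mail is cut off here. One common way to combine multithreaded mapPartitions code with the scheduler, not necessarily what the reply above suggested, is to raise spark.task.cpus so each task is handed several cores and then run a thread pool of the same size inside the partition. A sketch with illustrative values and placeholder data:

// Illustrative only: reserve several cores per task so the pool below does not
// oversubscribe the executor, e.g. submitted with
//   spark-submit --conf spark.executor.cores=4 --conf spark.task.cpus=4 ...
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mt-mappartitions").getOrCreate()

def expensiveWork(s: String): Int = s.length   // stand-in for the real per-row work

val data: RDD[String] = spark.sparkContext.textFile("hdfs:///path/to/input")  // placeholder
val result = data.mapPartitions { rows =>
  val pool = Executors.newFixedThreadPool(4)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  val futures = rows.map(row => Future(expensiveWork(row))).toList
  val out = futures.map(f => Await.result(f, Duration.Inf))
  pool.shutdown()
  out.iterator
}
result.count()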
I have two dataframes, let's call them A and B.
A is made up of [unique_id, field1]
B is made up of [unique_id, field2]
They have the exact same number of rows, and every id in A is also present
in B.
If I execute a join like this A.join(B,
Seq("unique_id")).select($"unique_id", $"field1") t
Hi,
I am using an org.apache.spark.sql.Encoder to serialize a custom object.
I now want to pass this column to a UDF so it can do some operations on it,
but this gives me the error:
Caused by: java.lang.ClassCastException: [B cannot be cast to
The code included at the bottom demonstrates the issue.
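That code isn't visible in this excerpt, but a minimal sketch of the situation (made-up class name, assuming a Kryo-based Encoder, since "[B" is Java's notation for a byte array) and one way around it looks like this:

// Sketch, not the original code: MyThing stands in for the custom object.
import org.apache.spark.sql.{Encoders, SparkSession}

case class MyThing(values: Seq[Double])

val spark = SparkSession.builder().appName("encoder-udf").master("local[*]").getOrCreate()

// Encoders.kryo stores the whole object in a single binary column named "value";
// that byte array is what a SQL udf would receive, hence the ClassCastException.
val ds = spark.createDataset(Seq(MyThing(Seq(1.0, 2.0))))(Encoders.kryo[MyThing])
ds.printSchema()   // root |-- value: binary

// Staying on the typed Dataset API avoids the problem, because the Encoder
// deserializes MyThing before the function runs.
val sizes = ds.map(t => t.values.length)(Encoders.scalaInt)
sizes.show()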
evant items by doing a groupBy
On Sun, 30 Jun 2019 at 01:45, Chris Teoh wrote:
> The closest thing I can think of here is if you have both dataframes
> written out using buckets. Hive uses this technique for join optimisation
> such that both datasets of the same bucket are read
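A rough sketch of the bucketed-write approach Chris describes (bucket count, table and column names are just placeholders; both sides must be bucketed on the join key with the same number of buckets):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketed-join").master("local[*]").getOrCreate()
import spark.implicits._

val a = Seq((1L, "x"), (2L, "y")).toDF("unique_id", "field1")
val b = Seq((1L, "p"), (2L, "q")).toDF("unique_id", "field2")

// Write both sides bucketed (and sorted) on the join key.
a.write.bucketBy(8, "unique_id").sortBy("unique_id").saveAsTable("a_bucketed")
b.write.bucketBy(8, "unique_id").sortBy("unique_id").saveAsTable("b_bucketed")

// With matching buckets Spark can read corresponding buckets together and
// avoid shuffling either side for the join.
val joined = spark.table("a_bucketed").join(spark.table("b_bucketed"), "unique_id")
joined.explain()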
I have two dataframes:
Dataframe A, which contains one element per partition that is gigabytes big
(an index).
Dataframe B, which is made up of millions of small rows.
I want to join B on A, but I want all the work to be done on the executors
holding the partitions of dataframe A.
Is there a way to
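The rest of this mail is cut off, but one commonly used pattern for this shape of problem (assuming B is small enough in total to broadcast, which it may not be) is sketched below; it keeps the gigabytes-big index objects where they are and ships only B's rows to the executors:

// Hedged sketch with stand-in data: one "index" object per partition of A,
// and B's small probe rows broadcast to every executor so each partition's
// index can be probed locally inside mapPartitions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("local-index-probe").master("local[*]").getOrCreate()

val indexes = spark.sparkContext.parallelize(
  Seq((0, Map("a" -> 1, "b" -> 2)), (1, Map("c" -> 3))), numSlices = 2)
val probes = Seq("a", "c", "z")

val probesBc = spark.sparkContext.broadcast(probes)

val results = indexes.mapPartitions { iter =>
  iter.flatMap { case (partitionId, index) =>
    probesBc.value.flatMap(p => index.get(p).map(hit => (partitionId, p, hit)))
  }
}
results.collect().foreach(println)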