[ANNOUNCE] Version 2.0.0-beta1 of hnswlib spark released

2025-03-12 Thread jelmer
Hi spark users, A few years back I created a Java implementation of the HNSW algorithm in my spare time. HNSW is an algorithm for k-nearest-neighbour search, or as people tend to refer to it now: vector search. It can be used to implement things like recommendation systems, image search,
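For reference, here is what exact (brute-force) k-nearest-neighbour search computes; HNSW-style indexes such as hnswlib approximate this result much faster on large datasets. A minimal pure-Python sketch (the toy vectors and query are made up):

```python
import math

def knn(points, query, k):
    """Exact k-nearest-neighbour search by brute force.

    HNSW approximates this result without scanning every point.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    # Score every point against the query, keep the k closest.
    return sorted(points, key=lambda p: dist(p, query))[:k]

# Toy 2-d "embedding" vectors.
vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.2)]
print(knn(vectors, (0.1, 0.1), k=2))  # [(0.0, 0.0), (0.5, 0.2)]
```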

Re: Spark standalone - reading kerberos hdfs

2021-01-24 Thread jelmer
The only way I ever got it to work with Spark standalone is via WebHDFS. See https://issues.apache.org/jira/browse/SPARK-5158?focusedCommentId=16516856&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16516856 On Fri, 8 Jan 2021 at 18:49, Sudhir Babu Pothineni wro
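The workaround referenced above amounts to reading through the WebHDFS/HttpFS REST gateway instead of the native HDFS RPC protocol, which sidesteps the Kerberos delegation-token problem in standalone mode. A hypothetical config-style fragment (hostname, port and path are placeholders):

```python
# Read via the webhdfs:// filesystem scheme rather than hdfs://.
df = spark.read.text("webhdfs://httpfs-host:14000/user/someuser/input.txt")
```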

Re: Using same rdd from two threads

2021-01-24 Thread jelmer
pache/spark/rdd/RDD.scala#L298 But since we're using an old version that does not really help On Fri, 22 Jan 2021 at 15:34, Sean Owen wrote: > RDDs are immutable, and Spark itself is thread-safe. This should be fine. > Something else is going on in your code. > > On Fri, Jan 2

Using same rdd from two threads

2021-01-22 Thread jelmer
copy of the rdd the job will complete fine. I suspect it's a bad idea to use the same rdd from two threads, but I could not find any documentation on the subject. Should it be possible to do this, and if not, can anyone point me to documentation pointing out that this is not on the table --jelmer

Cleanup hook for temporary files produced as part of a spark job

2020-05-24 Thread jelmer
I am writing something that partitions a data set and then trains a machine learning model on the data in each partition. The resulting model is very big, and right now I am storing it in an rdd as a pair of partition_id and very_big_model_that_is_hundreds_of_megabytes_big, but it is becoming in
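A common workaround for very large per-partition artifacts is to write each model to storage from inside the partition function and keep only a small (partition_id, path) handle in the RDD, registering a cleanup hook for the scratch copies. A pure-Python sketch of that save-and-clean-up pattern (the model bytes and paths are made up; in a real job the destination would be HDFS or object storage):

```python
import atexit
import os
import shutil
import tempfile

# Local scratch directory; removed when the process exits.
scratch = tempfile.mkdtemp(prefix="partition-models-")
atexit.register(shutil.rmtree, scratch, ignore_errors=True)

def save_partition_model(partition_id, model_bytes):
    """Persist a (possibly huge) model blob and return a small handle.

    Instead of keeping hundreds of megabytes per partition inside an
    RDD, store the blob externally and keep only its path.
    """
    path = os.path.join(scratch, f"model-{partition_id}.bin")
    with open(path, "wb") as f:
        f.write(model_bytes)
    return partition_id, path

pid, path = save_partition_model(7, b"\x00" * 1024)
print(pid, os.path.getsize(path))  # 7 1024
```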

Re: Request more yarn vcores than executors

2019-12-08 Thread jelmer
aneous tasks per executor. > > On Sun, 8 Dec 2019, 8:16 pm jelmer, wrote: > >> I have a job, running on yarn, that uses multithreading inside of a >> mapPartitions transformation >> >> Ideally I would like to have a small number of partitions but have a high >>

Request more yarn vcores than executors

2019-12-08 Thread jelmer
I have a job, running on yarn, that uses multithreading inside of a mapPartitions transformation. Ideally I would like to have a small number of partitions but a high number of yarn vcores allocated to the task (which I can take advantage of because of multithreading). Is this possible? I tri
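The usual knob for this is Spark's `spark.task.cpus` setting: raising it toward `spark.executor.cores` reserves several cores per task, so fewer tasks run per executor and each task's own threads have real cores to use. A sketch of the kind of function one might pass to `mapPartitions` (pure Python so it can run standalone here; the per-record work is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(records, workers=8):
    """Fan the records of one partition out over a local thread pool.

    Intended to be passed to mapPartitions; with spark.task.cpus set
    to match the pool size, the extra threads have cores to run on.
    """
    def expensive(x):
        return x * x  # stand-in for I/O- or CPU-heavy per-record work
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Materialise results so the pool can be shut down cleanly.
        return list(pool.map(expensive, records))

print(process_partition(range(5)))  # [0, 1, 4, 9, 16]
```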

Any way to make catalyst optimise away join

2019-11-29 Thread jelmer
I have 2 dataframes, let's call them A and B. A is made up of [unique_id, field1]; B is made up of [unique_id, field2]. They have the exact same number of rows, and every id in A is also present in B. If I execute a join like this A.join(B, Seq("unique_id")).select($"unique_id", $"field1") t

Custom encoders and udf's

2019-09-10 Thread jelmer
Hi, I am using an org.apache.spark.sql.Encoder to serialize a custom object. I now want to pass this column to a udf so it can do some operations on it, but this gives me the error: Caused by: java.lang.ClassCastException: [B cannot be cast to The code included at the problem demonstrates the is
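The [B in the error is a JVM byte array: with a custom Encoder the column's internal representation is the serialized bytes, and a udf receives those bytes rather than the reconstructed object, hence the ClassCastException. The usual fix is to deserialize explicitly inside the udf. A pure-Python analogy using pickle (the class and payload are made up; in Scala this corresponds to decoding the binary column inside the udf body):

```python
import pickle
from dataclasses import dataclass

@dataclass
class Model:
    weight: float

# What a custom-encoded column actually holds: serialized bytes.
stored = pickle.dumps(Model(weight=0.5))

def udf_body(cell):
    # Deserialize first; treating `cell` as a Model directly would
    # fail, just like casting [B to the class fails in Spark.
    model = pickle.loads(cell)
    return model.weight * 2

print(udf_body(stored))  # 1.0
```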

Re: Map side join without broadcast

2019-06-30 Thread jelmer
evant items by doing a groupBy On Sun, 30 Jun 2019 at 01:45, Chris Teoh wrote: > The closest thing I can think of here is if you have both dataframes > written out using buckets. Hive uses this technique for join optimisation > such that both datasets of the same bucket are read
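The bucketing idea above can be sketched without Spark: if both sides are split by the same hash of the join key into the same number of buckets, matching keys are guaranteed to land in the same bucket, so each bucket pair can be joined independently (which is what lets Spark skip the shuffle for bucketed tables). A toy pure-Python illustration (data made up):

```python
def bucketize(rows, key_index, n_buckets):
    """Split rows into n_buckets by a hash of the join key."""
    buckets = [[] for _ in range(n_buckets)]
    for row in rows:
        buckets[hash(row[key_index]) % n_buckets].append(row)
    return buckets

def bucketed_join(left, right, n_buckets=4):
    """Join two datasets bucket-by-bucket on their first column."""
    lb = bucketize(left, 0, n_buckets)
    rb = bucketize(right, 0, n_buckets)
    out = []
    # Only matching bucket pairs need to be compared, because equal
    # keys hash to the same bucket on both sides.
    for lbucket, rbucket in zip(lb, rb):
        for k1, v1 in lbucket:
            for k2, v2 in rbucket:
                if k1 == k2:
                    out.append((k1, v1, v2))
    return out

left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (4, "z")]
print(sorted(bucketed_join(left, right)))  # [(2, 'b', 'x'), (3, 'c', 'y')]
```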

Map side join without broadcast

2019-06-29 Thread jelmer
I have 2 dataframes: Dataframe A, which contains 1 element per partition that is gigabytes big (an index), and Dataframe B, which is made up of millions of small rows. I want to join B on A but I want all the work to be done on the executors holding the partitions of dataframe A. Is there a way to