Dear Nick,
Thanks for your quick reply.
I quickly implemented your proposal, but I do not see any
improvement. In fact, the test data set of around 3 GB occupies a
total of 10 GB in worker memory, and query execution is roughly 4
times slower than before.
Nick wrote:

You will need to use PySpark vectors to store NumPy arrays in a
DataFrame. They can be created from NumPy arrays as follows:

from pyspark.ml.linalg import Vectors
import numpy as np

# Wrap the NumPy array in a DenseVector so Spark can serialize it
df = spark.createDataFrame(
    [("src1", "pkey1", 1, Vectors.dense(np.array([0, 1, 2])))])
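
If you want named columns rather than the default _1, _2, ..., you
can pass the names explicitly. A quick sketch (the column names are
just illustrative, and spark is the usual SparkSession):

from pyspark.ml.linalg import Vectors
import numpy as np

# Illustrative column names; the vector is stored as a VectorUDT column
df = spark.createDataFrame(
    [("src1", "pkey1", 1, Vectors.dense(np.array([0, 1, 2])))],
    ["source", "pkey", "run", "values"])
df.printSchema()  # values: vector (nullable = true)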
On Wed, 28 Jun 2017 at 12:23 Judit Planas wrote:
Dear all,
I am trying to store a NumPy array (loaded from an HDF5 dataset)
into one cell of a DataFrame, but I am having problems.
In short, my data layout is similar to a database, where I have a
few columns with metadata (source of information, primary key, etc.).
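
To make this concrete, here is a minimal sketch of what I am trying,
with placeholder file and dataset names (spark is my SparkSession):

import h5py
import numpy as np

# Placeholder names standing in for the real HDF5 file and dataset
with h5py.File("data.h5", "r") as f:
    arr = np.asarray(f["dataset"])  # load the dataset into a NumPy array

# Storing the raw array in one cell is where I run into problems:
# with my Spark version, schema inference fails for numpy.ndarray
df = spark.createDataFrame(
    [("src1", "pkey1", arr)], ["source", "pkey", "data"])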