Hello all,
We noticed a huge difference in run time between the same test written in PySpark and in Spark with Scala.
The test runs:
* on my work computer with PySpark in about 350 seconds
* on my home computer with PySpark in about 130 seconds (Windows Defender enabled)
* on my home computer with PySpark in about 105 seconds (Windows Defender disabled)
* on my home computer as Scala code in about 7 seconds
The script:
def setUp(self):
    self.left = self.parallelize([
        ('Wim', 46),
        ('Klaas', 18)
    ]).toDF('name: string, age: int')
    self.right = self.parallelize([
        ('Jiri', 25),
        ('Tomasz', 23)
    ]).toDF('name: string, age: int')

def test_simple_union(self):
    sut = self.left.union(self.right)
    self.assertDatasetEquals(sut, self.parallelize([
        ('Wim', 46),
        ('Klaas', 18),
        ('Jiri', 25),
        ('Tomasz', 23)
    ]).toDF('name: string, age: int')
    )
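
To narrow down where the time goes, it might help to time each phase of the test separately (session start-up, building the DataFrames, the union and comparison). The following is only a minimal sketch using the Python standard library. The names `timed` and `timings` are hypothetical and not part of our test base class, and the two workloads are stand-ins for the real Spark calls.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per label (hypothetical helper, not part of the test base class).
timings = {}

@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)

# Stand-in workloads. In the real test, these blocks would wrap the body of
# setUp (the toDF calls) and the union plus assertDatasetEquals call.
with timed("setup"):
    sum(range(1000))
with timed("union_and_compare"):
    sorted(range(1000), reverse=True)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

If most of the time lands in the set-up phase rather than in the union itself, that would point at start-up or data transfer overhead rather than at the query.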