Cool! Now I understand how to approach this problem. At my last
position, I don't think we did it quite efficiently. Maybe a blog post
by me?
Henry
On 02/28/2017 01:22 AM, 颜发才(Yan Facai) wrote:
Google is your friend, Henry.
http://stackoverflow.com/questions/21185092/apache-spark-map-vs-mapp
Thanks! That works:
from pyspark.sql import Row

def process_file(my_iter):
    # my_iter yields (path, whole-file content) pairs from wholeTextFiles
    the_id = "init"
    final = []
    for chunk in my_iter:
        lines = chunk[1].split("\n")
        for line in lines:
            if line[0:15] == 'WARC-Record-ID:':
                the_id = line[15:]
            final.append(Row(the_id=the_id, line=line))
    return final
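For reference, this is roughly how it gets wired up (the path is just a placeholder). Because process_file runs once per partition via mapPartitions, the WARC-Record-ID state carries across all the lines of a file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# wholeTextFiles yields (path, whole-file content) pairs
rdd = sc.wholeTextFiles("/path/to/crawl/*.warc.gz")

# process_file receives an iterator over the pairs in one partition
rows = rdd.mapPartitions(process_file)
rows.take(1)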
This won't work:
>>> rdd2 = rdd.flatMap(splitf)
>>> rdd2.take(1)
[u'WARC/1.0\r']
>>> rdd2.count()
508310
If I then try to apply a map to rdd2, the map only works on each individual line. I need a state machine as in my second function. That is, I need to apply a key to each line, but the key comes from an earlier line (the WARC-Record-ID), so a plain per-line map has no way to carry that state forward.
Hi, Henry
In the first example the dict d always contains only one value because the_id stays the same; in the second case the dict grows very quickly.
So I would suggest first applying a map function to split your whole-file string into rows, then repartitioning, and then applying the custom logic.
Example:
def splitf(chunk):
    return chunk[1].split("\n")  # chunk is a (path, whole-file text) pair from wholeTextFiles
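In code it would look roughly like this (64 partitions is only an example):

lines = rdd.flatMap(splitf)     # flatMap so each line becomes its own RDD element
lines = lines.repartition(64)   # spread the lines over more partitions
# ... then apply your custom per-line logic, e.g. lines.map(your_parse_function)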
Not sure where you want me to put the yield. My first try caused an error in Spark saying that it could not pickle generator objects.
On 02/26/2017 03:25 PM, ayan guha wrote:
Hi
We are doing similar stuff, but with a large number of small-ish files. What we do is write a function to parse a complete file, similar to your parse-file function, but we use yield instead of return, and flatMap on top of it. Can you give it a try and let us know if it works?
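Roughly like this (parse_full_file and the field names are only illustrative):

from pyspark.sql import Row

def parse_full_file(kv):
    # kv is one (path, whole-file text) pair from wholeTextFiles
    the_id = "init"
    for line in kv[1].split("\n"):
        if line.startswith('WARC-Record-ID:'):
            the_id = line[15:]
        yield Row(the_id=the_id, line=line)

rows = rdd.flatMap(parse_full_file)

Spark serializes only the function itself; the generator objects are created on the executors, so nothing ever tries to pickle a generator.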
On Mon, Feb 27, 2017 at 9:
Using wholeTextFiles to process formats that cannot be split per line is not
"old",
and there are plenty of problems for which RDD is still better suited than
Dataset or DataFrame currently (this might change in the near future when
Dataset gets some crucial optimizations fixed).
On Sun, Feb 26, 2017 at
I am actually using Spark 2.1 and trying to solve a real-life problem.
Unfortunately, some of the discussion of my problem went offline, and
then I started a new thread.
Here is my problem. I am parsing crawl data which exists in a flat file
format. It looks like this:
u'WARC/1.0',
u'WARC-
Hi Henry,
Those guys in Databricks training are nuts and still use Spark 1.x for
their exams. "Learning Spark" is a VERY VERY VERY old way of solving problems
using Spark.
The core engine of Spark, which even I understand, has gone through several
fundamental changes.
Just try reading the file usi
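If the suggestion is to read the file with the Spark 2.x DataFrame API, a minimal sketch (placeholder path) would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.text("/path/to/crawl.warc.gz")   # one row per line, in a column named "value"
df.show(5, truncate=False)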
The file is so small that a stand-alone Python script, independent of
Spark, can process the file in under a second.
Also, the following fails (sketched below):
1. Read the whole file in with wholeTextFiles
2. Use flatMap to get 50,000 rows that look like: Row(id="path",
line="line")
3. Save the results as CSV
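Roughly what those three steps look like (paths are placeholders):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# 1. read each file whole: (path, full text) pairs
rdd = sc.wholeTextFiles("/path/to/crawl/*.warc.gz")

# 2. one Row per line, keyed by the file path
rows = rdd.flatMap(lambda kv: [Row(id=kv[0], line=l) for l in kv[1].split("\n")])

# 3. save as CSV
spark.createDataFrame(rows).write.csv("/path/to/output_csv")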
Hi, Tremblay.
Your file is in .gz format, which is not splittable for Hadoop. Perhaps the
file is loaded by only one executor.
How many executors do you start?
Perhaps the repartition method could solve it, I guess.
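For example, something like this to check and then spread the work (the path is a placeholder and 16 is only an example):

rdd = sc.wholeTextFiles("/path/to/crawl.warc.gz")
print(rdd.getNumPartitions())   # a single .gz file usually arrives as one partition
rdd = rdd.repartition(16)       # redistribute before the heavy per-line work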
On Sun, Feb 26, 2017 at 3:33 AM, Henry Tremblay
wrote:
> I am reading in a single smal