Yes, using a map() function that transforms the input lines into a
String array.
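A minimal sketch of that approach (the file path and the comma delimiter
are assumptions, not from this thread):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class WideCsv {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // Read the file as plain lines and split each line into a String[],
    // which avoids declaring a 32-field tuple type.
    DataSet<String[]> rows = env
        .readTextFile("/tmp/wide.csv") // hypothetical path
        .map(new MapFunction<String, String[]>() {
          @Override
          public String[] map(String line) {
            return line.split(",", -1); // -1 keeps empty trailing fields
          }
        });
    rows.print();
  }
}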
On Fri, Apr 17, 2015 at 9:03 PM, Flavio Pompermaier
wrote:
> My csv has 32 columns, is there a way to create a DataSet?
> On Apr 17, 2015 7:53 PM, "Stephan Ewen" wrote:
>
>> I don't think there is any built-i
My csv has 32 columns, is there a way to create a DataSet?
On Apr 17, 2015 7:53 PM, "Stephan Ewen" wrote:
> I don't think there is any built-in functionality for that.
>
> You can probably read the header with some custom code (in the driver
> program) and use the CSV reader (skipping the header)
I don't think there is any built-in functionality for that.
You can probably read the header with some custom code (in the driver
program) and use the CSV reader (skipping the header) as a regular data
source.
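A minimal sketch of that idea, assuming a locally readable file and a
CsvReader that offers ignoreFirstLine() (for HDFS, the first line could be
read via Flink's FileSystem API instead):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class CsvWithHeader {
  public static void main(String[] args) throws Exception {
    String path = "/tmp/input.csv"; // hypothetical path

    // Read the header in the driver program, before the job runs.
    String header;
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
      header = reader.readLine();
    }

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<String, String>> data = env
        .readCsvFile(path)
        .ignoreFirstLine() // skip the header in the actual data source
        .types(String.class, String.class);

    System.out.println("Header: " + header);
    data.print();
  }
}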
On Fri, Apr 17, 2015 at 5:03 PM, Flavio Pompermaier
wrote:
> Hi guys,
> how can I rea
Hi guys,
how can I read a CSV but keep the header in some variable without
throwing it away?
Thanks in advance,
Flavio
Hi!
After a quick look over the code, this seems like a bug. One corner case of
the overflow handling code does not check for the "running out of memory"
condition.
I'd like to wait and see if Robert Waury has some ideas about that; he is
the one most familiar with the code.
I would guess, though, th
Hi Squirrels,
I have some trouble with a delta-iteration transitive closure program [1].
When I run the program, I get the following error:
java.io.EOFException
at
org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333)
at
org.apache.flink.runti
I just set the -Xmx in the VM parameters to 10g and I have 8 virtual cores
(4 physical).
On Fri, Apr 17, 2015 at 2:48 PM, Robert Metzger wrote:
> How much memory are you giving to each Flink TaskManager?
>
> On Fri, Apr 17, 2015 at 2:05 PM, Flavio Pompermaier
> wrote:
>
>> Hi Robert,
>> I forgo
Hi Flavio!
The cause is usually what the exception message says: too many duplicate
keys.
The side that builds the hash table has one key occurring so often that not
all records with that key fit into memory together, even after multiple
out-of-core recursions.
Here is a list of things to check:
-
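One common remedy (a sketch of one option, not the cut-off checklist
itself): forcing a sort-merge join avoids building an in-memory hash table,
so a heavily skewed key cannot exhaust the join's memory. The
join(..., JoinHint) variant is assumed to be available in your version:

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public class SkewedJoin {
  // Joins two (key, value) data sets on field 0 without a hash table,
  // so a heavily repeated key cannot trigger the recursion limit.
  public static DataSet<Tuple2<Tuple2<String, String>, Tuple2<String, String>>>
      joinSkewed(DataSet<Tuple2<String, String>> left,
                 DataSet<Tuple2<String, String>> right) {
    return left.join(right, JoinHint.REPARTITION_SORT_MERGE)
               .where(0)
               .equalTo(0);
  }
}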
Hi,
the log4j.properties file looks nice. The issue is that the resources
folder is not marked as a source/resource folder; that's why Eclipse is not
adding the file to the classpath.
Have a look here:
http://stackoverflow.com/questions/5081316/where-is-the-correct-location-to-put-log4j-propertie
How much memory are you giving to each Flink TaskManager?
On Fri, Apr 17, 2015 at 2:05 PM, Flavio Pompermaier
wrote:
> Hi Robert,
> I forgot to update about this error.
> The root cause was an OOM caused by Jena RDF serialization that was
> causing the entire job to fail.
> I also
> https:
Hi Robert,
I forgot to update about this error.
The root cause was an OOM caused by Jena RDF serialization that was causing
the entire job to fail.
I also
https://stackoverflow.com/questions/29660894/jena-thrift-serialization-oom-due-to-gc-overhead
I created a thread on StackOverflow, let's s
Hi to all,
I have this strange exception in my program, do you know what could be the
cause of it?
java.lang.RuntimeException: Hash join exceeded maximum number of
recursions, without reducing partitions enough to be memory resident.
Probably cause: Too many duplicate keys.
at
org.apache.flink.run
There is no caching mechanism.
To do the left outer join as in Till's implementation, you need to collect
all elements of one! iterator in memory. If you know, that one of the two
iterators contains at most 1 element, you should collect that in memory and
stream the elements of the other iterator.
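A minimal sketch of that pattern; the Tuple2<String, String> inputs and the
field semantics are made up for illustration:

import java.util.Iterator;
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Left outer join via coGroup: the right group is known to hold at most
// one element, so only that element is materialized; the left group is
// streamed.
public class LeftOuterCoGroup implements
    CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>,
                    Tuple2<String, String>> {
  @Override
  public void coGroup(Iterable<Tuple2<String, String>> left,
                      Iterable<Tuple2<String, String>> right,
                      Collector<Tuple2<String, String>> out) {
    // Collect the side with at most one element...
    Iterator<Tuple2<String, String>> it = right.iterator();
    Tuple2<String, String> rightElem = it.hasNext() ? it.next() : null;
    // ...and stream the potentially large side.
    for (Tuple2<String, String> l : left) {
      out.collect(new Tuple2<>(l.f0,
          rightElem == null ? null : rightElem.f1));
    }
  }
}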
Could you explain this caching mechanism in a little more detail, with a
simple code snippet...?
Thanks,
Flavio
On Apr 17, 2015 1:12 PM, "Fabian Hueske" wrote:
> If you know that the group cardinality of one input is always 1 (or 0) you
> can make that input the one to cache in memory and stream
Here are some rough corner points for serialization efficiency in Flink:
- Tuples are a bit more efficient than POJOs, because they do not support
(and encode) possible subclasses and they do not involve any reflection
code at all.
- Arrays are more efficient than collections (collections go in
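As a made-up illustration of these points, contrasting the two shapes:

import java.util.List;
import org.apache.flink.api.java.tuple.Tuple2;

public class SerializationShapes {
  // Generally cheaper to serialize: a tuple holding primitive and
  // String arrays.
  static Tuple2<long[], String[]> efficientShape() {
    return new Tuple2<>(new long[] {1L, 2L, 3L}, new String[] {"a", "b"});
  }

  // Generally more expensive: a POJO holding boxed values in collections.
  public static class RecordPojo {
    public List<Long> ids;
    public List<String> names;
  }
}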
If you know that the group cardinality of one input is always 1 (or 0) you
can make that input the one to cache in memory and stream the other input
with potentially more group elements.
2015-04-17 4:09 GMT-05:00 Flavio Pompermaier :
> That would be very helpful...
>
> Thanks for the support,
> F
I think running the program multiple times is a reasonable way to start
working on this.
I would try and see whether this can be rewritten as a non-nested
iteration. Nested iteration algorithms may have much more overhead
to start with.
Stephan
On Tue, Apr 14, 2015 at 3:53 PM, Benoît Ha
Cool :-)
We should add this tip to the documentation of the project operator.
I'll open a JIRA for that.
2015-04-17 5:41 GMT-05:00 Flavio Pompermaier :
> That worked like a charm!
>
> Thanks a lot Fabian!
>
> On Fri, Apr 17, 2015 at 12:37 PM, Fabian Hueske wrote:
>
>> The problem is caused by th
You could try to work around this using a custom Partitioner [1].
myData.partitionCustom(new MyPartitioner(), "myPartitionField")
    .sortPartition("myPartitionField", Order.ASCENDING).writeAsCsv(...);
In that case, you need to implement the Partitioner yourself. To do
that "right" you need to know the value dis
That worked like a charm!
Thanks a lot Fabian!
On Fri, Apr 17, 2015 at 12:37 PM, Fabian Hueske wrote:
> The problem is caused by the project() operator.
> The Java compiler cannot infer its return type and defaults to Tuple.
>
> You can help the compiler like this:
>
> DataSet<Tuple1<String>> ds2 = ds.<Tuple1<String>>project(0)
The problem is caused by the project() operator.
The Java compiler cannot infer its return type and defaults to Tuple.
You can help the compiler like this:
DataSet<Tuple1<String>> ds2 = ds.<Tuple1<String>>project(0).distinct(0);
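A compilable sketch of that fix; the concrete element types here are
assumptions:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple2;

public class ProjectDistinct {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<String, Integer>> ds =
        env.fromElements(new Tuple2<>("a", 1), new Tuple2<>("a", 2));

    // The explicit type argument on project() tells the compiler the
    // result type; without it, project(0) yields a raw DataSet<Tuple>.
    DataSet<Tuple1<String>> ds2 = ds.<Tuple1<String>>project(0).distinct(0);

    ds2.print();
  }
}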
2015-04-17 4:33 GMT-05:00 Flavio Pompermaier :
> I have errors in Eclipse doing something like:
>
>
I have errors in Eclipse doing something like:
DataSet<Tuple2<String, String>> ds = ...
DataSet<Tuple1<String>> ds2 = ds.project(0).distinct(0);
It says that I have to declare ds2 as a DataSet<Tuple>
On Fri, Apr 17, 2015 at 11:15 AM, Maximilian Michels wrote:
> Hi Flavio,
>
> Do you have an example? The DistinctOperator should return
That would be very helpful...
Thanks for the support,
Flavio
On Fri, Apr 17, 2015 at 10:04 AM, Till Rohrmann
wrote:
> No it's not, but at the moment there is AFAIK no other way around it. There
> is an issue for proper outer join support [1]
>
> [1] https://issues.apache.org/jira/browse/FLINK-687
Hi Flavio,
Do you have an example? The DistinctOperator should return a typed output
just like all the other operators do.
Best,
Max
On Fri, Apr 17, 2015 at 10:07 AM, Flavio Pompermaier
wrote:
> Hi guys,
>
> I'm trying to make (in Java) a project().distinct() but then I cannot
> create the ge
No it's not, but at the moment there is AFAIK no other way around it. There
is an issue for proper outer join support [1]
[1] https://issues.apache.org/jira/browse/FLINK-687
On Fri, Apr 17, 2015 at 10:01 AM, Flavio Pompermaier
wrote:
> That could resolve the problem, but is it safe to accumulate stuff i
Hi guys,
I'm trying to make (in Java) a project().distinct() but then I cannot
create the generated dataset with a typed tuple because the distinct
operator returns just an untyped Tuple.
Is this an error in the APIs or am I doing something wrong?
Best,
Flavio
That could resolve the problem, but is it safe to accumulate stuff in a
local variable if the datasets are huge?
On Fri, Apr 17, 2015 at 9:54 AM, Till Rohrmann
wrote:
> If it's fine when you have null string values in the cases where
> D1.f1!="a1" or D1.f2!="a2" then a possible solution could l
If it's fine when you have null string values in the cases where
D1.f1!="a1" or D1.f2!="a2" then a possible solution could look like (with
Scala API):
val ds1: DataSet[(String, String, String)] = getDS1
val ds2: DataSet[(String, String, String)] = getDS2
ds1.coGroup(ds2).where(2).equalTo(0) {
(
Hi Fabian,
thanks for your reply, my question was exactly about that problem, range
partitioning.
As I have to process a large dataset of values and apply a data-mining
algorithm on each partition, an important point for me is that the final
result is ordered, so as not to lose the meaning of the data.
Hi Till,
thanks for the reply.
What I'd like to do is to merge D1 and D2 if there's a ref from D1 to D2
(D1.f2==D2.f0).
If this condition is true, I would like to produce a set of tuples with the
matching elements
at the first two places (D1.*f2*, D2.*f0*) and the other two values (if
present) of th
Hello everyone,
I was just wondering, which class would be most efficient to store collections
of primitive elements, and which one to store objects, within POJOs and tuples
from a serialization point of view. And would it make any difference if such a
collection is not embedded within a POJO/t
Hi Flavio,
I don't really understand what you are trying to do. What does
D1.f2(D1.f1==p1) mean? What happens if the condition in D1.f2(if D1.f1==p2)
is false? Where do the values a1 and a2 in (A, X, a1, a2) come from when you
join
[(A, p3, X), (X, s, V)] and [(A, p3, X), (X, r, 2)]? Maybe you can