Jamie, Suneel, thanks a lot; your replies have been very helpful.
I will definitely take a look at XMLInputFormat.
In any case, the files are not very big: on average 100–200 kB, up to a max of
a couple of MB.
On 8 June 2016 at 04:23, Suneel Marthi wrote:
> You can use Mahout XMLInputFormat with…
What is the recommended practice for using a dedicated ExecutionContext
inside Flink code?
We are making some external network calls using Futures. Currently, all of
them run on the global execution context (import
scala.concurrent.ExecutionContext.Implicits.global).
Thanks
-Soumya
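One common pattern (a sketch only; the class name, pool size, and callExternalService below are illustrative, not from this thread) is to give each operator instance its own pool, created in open() and torn down in close(), instead of sharing the global context:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Sketch: a dedicated thread pool per operator instance for external calls.
    public class ExternalCallFunction extends RichFlatMapFunction<String, String> {

        private transient ExecutorService pool;

        @Override
        public void open(Configuration parameters) {
            pool = Executors.newFixedThreadPool(4); // sized per task instance, not per JVM
        }

        @Override
        public void flatMap(String value, Collector<String> out) throws Exception {
            // Run the network call on the dedicated pool; block for the result,
            // since the Collector must not be used from other threads.
            String result = pool.submit(() -> callExternalService(value)).get();
            out.collect(result);
        }

        @Override
        public void close() {
            if (pool != null) {
                pool.shutdown();
            }
        }

        private String callExternalService(String value) {
            return value; // placeholder for the real request
        }
    }

In Scala, the same pool can back a dedicated context via ExecutionContext.fromExecutorService(pool).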
Hi,
When the web monitor restarts, the uploaded jars disappear; in fact, the
upload directory is different every time it restarts.
This seems intentional:
https://github.com/apache/flink/blob/master/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/WebRuntimeMonitor.java#L162
You can use the Mahout XMLInputFormat with Flink's HadoopInputFormat
definitions. See:
http://stackoverflow.com/questions/29429428/xmlinputformat-for-apache-flink
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Read-XML-from-HDFS-td7023.html
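For illustration, a minimal sketch of that wiring, assuming Mahout's XmlInputFormat (its package has moved between Mahout versions; org.apache.mahout.text.wikipedia is used here) and its xmlinput.start/xmlinput.end configuration keys; the <trk> tags and the S3 path are made up:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.mahout.text.wikipedia.XmlInputFormat;

    public class GpxRead {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            Job job = Job.getInstance();
            // XmlInputFormat emits everything between the start and end tag as one record.
            job.getConfiguration().set("xmlinput.start", "<trk>");
            job.getConfiguration().set("xmlinput.end", "</trk>");
            FileInputFormat.addInputPath(job, new Path("s3://my-bucket/gpx/"));

            DataSet<Tuple2<LongWritable, Text>> records = env.createInput(
                    new HadoopInputFormat<>(new XmlInputFormat(), LongWritable.class, Text.class, job));

            records.print();
        }
    }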
On Tue, Jun 7, 2016 at 10:11 PM, Jamie Grier wrote:
Hi Andrea,
How large are these data files? The implementation you've mentioned here
is only usable if they are very small. If so, you're fine. If not, read
on...
Processing XML input files in parallel is tricky. It's not a great format
for this type of processing, as you've seen. They are tricky…
Chesnay: Just want to thank you. I might have one or two related questions later
on, but for now, just thanks.
On Tuesday, June 7, 2016 8:18 AM, Greg Hogan wrote:
"The question is how to encapsulate numerous transformations into one object
or may be a function in Apache Flink Java setting
I'm assuming what you're trying to do is essentially sum over two different
fields of your data. I would do this with my own ReduceFunction.
stream
    .keyBy("someKey")
    .reduce(new CustomReduceFunction()) // sum whatever fields you want and return the result
I think it does make sense that Flink could…
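For illustration, one possible CustomReduceFunction, assuming records are Tuple3<String, Long, Long> with the key in field 0 (for tuples you would key by position, e.g. keyBy(0), rather than by name):

    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.tuple.Tuple3;

    // Sums fields 1 and 2 pairwise; reduce runs per key after keyBy, so the
    // key field can simply be carried over.
    public class CustomReduceFunction implements ReduceFunction<Tuple3<String, Long, Long>> {
        @Override
        public Tuple3<String, Long, Long> reduce(Tuple3<String, Long, Long> a,
                                                 Tuple3<String, Long, Long> b) {
            return new Tuple3<>(a.f0, a.f1 + b.f1, a.f2 + b.f2);
        }
    }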
You can handle this in multiple ways. If there is a natural timestamp in
StreamB, you can use it directly by doing this:
streamB
.assignTimestamps(...) // your assigner
.connect(streamA)
.flatMap(...) // your CoFlatMapFunction
.timeWindow(...)
.whatever()
Here the event…
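As a sketch of the assigner part, assuming a hypothetical EventB type with an epoch-millisecond timestamp field, and using the periodic-watermark variant (assignTimestampsAndWatermarks) with one second of allowed lateness:

    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
    import org.apache.flink.streaming.api.watermark.Watermark;

    // EventB is hypothetical; any type carrying a millisecond timestamp works.
    public class EventBTimestampAssigner implements AssignerWithPeriodicWatermarks<EventB> {

        private long maxSeenTimestamp = Long.MIN_VALUE;

        @Override
        public long extractTimestamp(EventB event, long previousElementTimestamp) {
            maxSeenTimestamp = Math.max(maxSeenTimestamp, event.timestamp);
            return event.timestamp;
        }

        @Override
        public Watermark getCurrentWatermark() {
            // Trail the largest timestamp seen so far by one second of lateness.
            return maxSeenTimestamp == Long.MIN_VALUE
                    ? null
                    : new Watermark(maxSeenTimestamp - 1000);
        }
    }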
Suggestions in-line below...
On Mon, Jun 6, 2016 at 7:26 PM, Yukun Guo wrote:
> Hi,
>
> I'm working on a project which uses Flink to compute hourly log statistics
> like top-K. The logs are fetched from Kafka by a FlinkKafkaConsumer and
> packed into a DataStream.
>
> The problem is, I find…
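As a sketch of the hourly counting step that a top-K computation would build on (the key extraction is hypothetical; logs is the DataStream<String> coming from Kafka):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Count log lines per key in one-hour windows.
    DataStream<Tuple2<String, Long>> hourlyCounts = logs
            .map(new MapFunction<String, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> map(String log) {
                    return new Tuple2<>(extractKey(log), 1L); // extractKey is hypothetical
                }
            })
            .keyBy(0)
            .timeWindow(Time.hours(1))
            .sum(1);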
I have a question about event timestamps after a flatMap transformation; I
am using the event-time characteristic. I have two streams entering a
CoFlatMap. Stream A simply updates state in that CoFlatMap and does not
output any events. Stream B inserts events of type B, which then output a
thi…
If you use OpenVPN to access AWS, then you can use the private IP of the EC2
machine from your laptop.
Thanks
Ashutosh
On Tue, Jun 7, 2016 at 11:00 PM, Shannon Carey wrote:
> We're also starting to look at automating job deployment/start to Flink
> running on EMR. There are a few options:
>
> - …
Thanks for the clarification.
On Tue, Jun 7, 2016 at 9:15 PM, Aljoscha Krettek wrote:
> Hi,
> I'm afraid you're running into a bug in the special processing-time
> window operator. A suggested workaround would be to switch to the
> IngestionTime characteristic and use TumblingEventTimeWindows.
Hi,
I'm afraid you're running into a bug in the special processing-time
window operator. A suggested workaround would be to switch to the
IngestionTime characteristic and use TumblingEventTimeWindows.
I also opened a Jira issue for the bug so that we can keep track of it:
https://issues.apache.org/jir
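The workaround, as a sketch (the key and field names are illustrative):

    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Each record gets its ingestion time as the event timestamp, so the
    // event-time window below behaves much like a processing-time window.
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

    stream
            .keyBy("someKey")
            .window(TumblingEventTimeWindows.of(Time.seconds(60)))
            .sum("someField");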
On Tue, Jun 7, 2016 at 4:52 AM, Stephan Ewen wrote:
> The concern you raised about the sink being synchronous is exactly what my
> last suggestion should address:
>
> The internal state backends can return a handle that can do the sync in a
> background thread. The sink would continue processing…
Ah, sorry, you are right. You could also call keyBy again before the
second sum, but maybe someone else has a better idea.
Best,
Gábor
2016-06-07 16:18 GMT+02:00 Al-Isawi Rami :
> Thanks Gábor, but the first sum call will return
>
> SingleOutputStreamOperator
>
> I could not do another sum call on that…
Hi all,
I am evaluating Apache Flink for processing large sets of Geospatial data.
The use case I am working on will involve reading a certain number of GPX
files stored on Amazon S3.
GPX files are actually XML files and therefore cannot be read on a
line-by-line basis.
One GPX file will produce…
Thanks Gábor, but the first sum call will return
SingleOutputStreamOperator
I could not do another sum call on that. Would you tell me how you managed to do
stream.sum().sum()?
Regards,
-Rami
On 7 Jun 2016, at 16:13, Gábor Gévay <gga...@gmail.com> wrote:
Hello,
In the case of "sum", yo
After a second look at KryoSerializer, I fear that Input and Output are
never closed. Am I right?
On Tue, Jun 7, 2016 at 3:06 PM, Flavio Pompermaier wrote:
> Hi Aljoscha,
> of course I can :)
> Thanks for helping me. Do you think it is the right thing to do to call
> reset()?
> Actually, I don't…
Hello,
In the case of "sum", you can just specify them one after the other, like:
stream.sum(1).sum(2)
This works because the sums over the two fields are independent of
each other. However, in the case of "keyBy", the information from both
fields is needed at the same time to produce the key.
Best,
Gábor
2016…
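To spell out the difference on a tuple stream (the fields are illustrative):

    // Two independent running sums, applied one after the other:
    stream.sum(1).sum(2);

    // A composite key needs both fields at once, so keyBy takes them together:
    stream.keyBy(0, 1).sum(2);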
Hi Aljoscha,
of course I can :)
Thanks for helping me. Do you think it is the right thing to do to call
reset()?
Actually, I don't know whether this is meaningful or not, but I already ran
the job successfully once on the cluster (a second attempt is currently
running) after my accidental modification…
The problem is: why is the window end time in the future?
For example, if my window size is 60 seconds and my window is being evaluated
at 3:00 pm, then why is the window end time 3:01 pm and not 3:00 pm, even when
the data being evaluated falls in the window 2:59–3:00?
Sent from my iPhone
That's nice. Can you try it on your cluster with an added "reset" call on
the buffer?
On Tue, 7 Jun 2016 at 14:35 Flavio Pompermaier wrote:
> After "some" digging into this problem I'm quite convinced that the
> problem is caused by a missing reset of the buffer during the Kryo
> deserialization…
Hi,
Is there any reason why "keyBy" accepts multiple fields while, for example,
"sum" does not?
-Rami
Disclaimer: This message and any attachments thereto are intended solely for
the addressed recipient(s) and may contain confidential information. If you are
not the intended recipient, please notify…
After "some" digging into this problem I'm quite convinced that the problem
is caused by a missing reset of the buffer during the Kryo deserialization,
likewise to what has been fixed by FLINK-2800 (
https://github.com/apache/flink/pull/1308/files).
That fix added an output.clear() in theKryoExcept
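The pattern of that fix, sketched here from the description above rather than copied from the Flink source:

    import java.io.IOException;

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.KryoException;
    import com.esotericsoftware.kryo.io.Output;

    // If serialization fails midway, clear the Output buffer so the stale
    // bytes are not flushed together with the next record.
    static void serializeRecord(Kryo kryo, Output output, Object record) throws IOException {
        try {
            kryo.writeClassAndObject(output, record);
            output.flush();
        } catch (KryoException e) {
            output.clear();
            throw new IOException("Kryo serialization failed", e);
        }
    }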
"The question is how to encapsulate numerous transformations into one
object or may be a function in Apache Flink Java setting."
Implement CustomUnaryOperation. This can then be applied to a DataSet by
calling `DataSet result = dataSet.runOperation(new MyOperation<>(...));`.
On Mon, Jun 6, 2016 at…
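A sketch of that pattern for a DataSet<String> (the transformations inside createResult are illustrative):

    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.operators.CustomUnaryOperation;

    // Bundles several transformations behind one reusable operation.
    public class MyOperation implements CustomUnaryOperation<String, String> {

        private DataSet<String> input;

        @Override
        public void setInput(DataSet<String> inputData) {
            this.input = inputData;
        }

        @Override
        public DataSet<String> createResult() {
            return input
                    .filter(new FilterFunction<String>() {
                        @Override
                        public boolean filter(String s) {
                            return !s.isEmpty();
                        }
                    })
                    .map(new MapFunction<String, String>() {
                        @Override
                        public String map(String s) {
                            return s.toLowerCase();
                        }
                    });
        }
    }

    // Usage: DataSet<String> result = dataSet.runOperation(new MyOperation());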
1a. Ah, yeah, I see how it could work, but I wouldn't count on it in a
cluster. You would (most likely) run the sub-job (calculating pi) only on a
single node.
1b. Different execution environments generally imply different Flink
programs.
2. Sure it does, since it's a normal Flink job. You…
Hi Elias!
The concern you raised about the sink being synchronous is exactly what my
last suggestion should address:
The internal state backends can return a handle that can do the sync in a
background thread. The sink would continue processing messages, and the
checkpoint would only be acknowledged…
Chesnay:
1a. The code actually works; that is the point.
1b. What restricts a Flink program from having several execution environments?
2. I am not sure that your modification allows for parallelism. Does it?
3. This code is a simple example of writing/organizing large and complicated
programs, wh…
Could you state a specific problem?
On 07.06.2016 06:40, Soumya Simanta wrote:
I've a simple program which takes some inputs from a command line
(socket stream) and then aggregates based on the key.
When running this program on my local machine, I see some output that
is counterintuitive to me…
From what I can tell from your code, you are trying to execute a job
within a job. This just doesn't work.
Your main method should look like this:
    public static void main(String[] args) throws Exception {
        double pi = new classPI().compute();
        System.out.println("We estimate Pi to be: " + pi);
    }
On 06.0…
Hi Robert,
Thank you for checking the issue. That INFO message is the only information
the Flink workers log.
I agree with your point of view. It looks like it closes the connections to
all other topics that are not used (idle), although the message is a bit
misleading.
Ref: https://github.com/edenhill/librdkafka/issues/437
Hi Jack,
right now this is not possible except by writing a custom operator. We
are working on support for a time-to-live setting on state; this should
solve your problem.
For writing a custom operator, check out DataStream.transform() and
StreamMap, which is the operator implementation for Map
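A bare-bones sketch of such an operator, modeled on StreamMap; the actual time-to-live bookkeeping is omitted and the names are illustrative:

    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.api.watermark.Watermark;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

    // Pass-through skeleton: a real TTL operator would record insertion times
    // in processElement and expire state accordingly.
    public class PassThroughOperator<T> extends AbstractStreamOperator<T>
            implements OneInputStreamOperator<T, T> {

        @Override
        public void processElement(StreamRecord<T> element) throws Exception {
            output.collect(element);
        }

        @Override
        public void processWatermark(Watermark mark) throws Exception {
            output.emitWatermark(mark);
        }
    }

    // Usage: stream.transform("pass-through", stream.getType(), new PassThroughOperator<T>());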
Thanks for your answer Ufuk.
However, I have been reading about KeySelector and I don't completely
understand how it works with my idea.
I am using an algorithm that gives me a score between some different
strings. My idea is: if the score is higher than 0.80, for example, then
those two strings…
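For reference, a KeySelector sees one record at a time, so it cannot compare two strings directly; at best it can map similar strings to the same key. A sketch with a made-up normalization:

    import org.apache.flink.api.java.functions.KeySelector;

    // Records whose normalized form matches end up in the same group.
    public class NormalizedKey implements KeySelector<String, String> {
        @Override
        public String getKey(String value) throws Exception {
            return value.toLowerCase().replaceAll("\\s+", " ").trim();
        }
    }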