On Sun, 16 Jun 2013, Radu Gheorghe wrote:
2013/6/14 David Lang <[email protected]>
On Fri, 14 Jun 2013, Radu Gheorghe wrote:
Hi Mahesh,
If you don't need MySQL for a specific reason, I'd suggest you try throwing
your logs in Elasticsearch. Here's a tutorial:
http://wiki.rsyslog.com/index.php/HOWTO:_rsyslog_%2B_elasticsearch
I assume you'll get far better insert and query performance than you can
with MySQL (i.e. with bulk inserts, I get 10-20K logs indexed per second on my
$500 laptop, and I can then query across 100M-200M logs within a second,
depending on settings). Plus, it's super easy to scale Elasticsearch by
adding new nodes.
For querying, there are several tools, the most popular being Kibana:
http://three.kibana.org/
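For what it's worth, the bulk throughput Radu mentions comes from Elasticsearch's `_bulk` API: instead of one HTTP request per log line, you send many action/document pairs in one newline-delimited body. A minimal sketch of building such a payload in Python (the index name, type, and field names here are made up for illustration):

```python
import json

def build_bulk_body(logs, index="logs", doc_type="syslog"):
    """Build a newline-delimited _bulk request body.

    Each document is preceded by an action line telling
    Elasticsearch which index/type it should go into.
    """
    lines = []
    for log in logs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(log))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"

logs = [
    {"host": "web01", "severity": "info", "msg": "request served"},
    {"host": "web02", "severity": "err", "msg": "disk full"},
]
body = build_bulk_body(logs)
print(body)
```

The resulting string would be POSTed to the `_bulk` endpoint in a single request, which is what amortizes the per-request overhead.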
Just to note, one of the things that makes MySQL so slow for Mahesh is its
safety features. After each insert, MySQL makes sure the data is safe on
disk before it considers the insert complete.
By that, you mean it does a fsync after every transaction? I thought it
doesn't do this (at least not by default, with neither MyISAM nor InnoDB).
But then again, at least InnoDB does it more often than ES does.
I don't remember the table types, but the newer of the two does do fsync after
each transaction, which is how it actually properly supports transactions. This
is why it was such a big deal when MySQL changed the default.
If the system crashes, the data will be there. There are config options to
override this in MySQL.
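(In InnoDB's case, the relevant knob is `innodb_flush_log_at_trx_commit`.) To make the cost of that guarantee concrete, here is a small sketch -- not MySQL itself, just the underlying syscall pattern -- comparing an fsync after every record, which is roughly what a transaction-safe engine does by default, with a single fsync at the end of a batch; the filenames are arbitrary:

```python
import os
import tempfile
import time

def write_records(path, records, fsync_each):
    """Write records to path, fsyncing either per record or once at the end."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.time()
    for rec in records:
        os.write(fd, rec)
        if fsync_each:
            os.fsync(fd)   # wait for this record to reach stable storage
    if not fsync_each:
        os.fsync(fd)       # one flush for the whole batch
    os.close(fd)
    return time.time() - start

records = [b"log line %d\n" % i for i in range(1000)]
with tempfile.TemporaryDirectory() as d:
    per_record = write_records(os.path.join(d, "safe.log"), records, True)
    batched = write_records(os.path.join(d, "fast.log"), records, False)
    with open(os.path.join(d, "safe.log"), "rb") as f1, \
         open(os.path.join(d, "fast.log"), "rb") as f2:
        same_content = f1.read() == f2.read()
    print("per-record fsync: %.3fs, batched fsync: %.3fs" % (per_record, batched))
```

On spinning disks the per-record variant is typically orders of magnitude slower, yet both files end up with identical contents if nothing crashes; the difference is only in what survives a power loss mid-run.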
To get the numbers that Elasticsearch is getting on your laptop, it's
almost certainly not doing this.
I assume you lose some data if the whole system suddenly goes down. But if
just ES does (i.e. kill -9 the JVM), you shouldn't lose any data.
I think ES writes stuff in a very different way than MySQL does. When you
index something in ES, it does the indexing in memory and writes the raw
data in the transaction log
<http://www.elasticsearch.org/guide/reference/index-modules/translog/>.
Only after this is done do you get a reply from ES.
The transaction log is replayed on startup in case something goes wrong and
you lose the data you had in memory. Every once in a while, it writes what
it has to disk in the actual Lucene index
<http://www.elasticsearch.org/guide/reference/glossary/#shard>, where
it stores data "permanently".
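The translog-plus-replay idea can be sketched in a few lines of Python (a toy model only, not the actual ES implementation; the class and method names are invented):

```python
import json
import os
import tempfile

class ToyIndex:
    """In-memory index backed by an append-only transaction log."""

    def __init__(self, translog_path):
        self.translog_path = translog_path
        self.docs = []                       # the "in-memory" part of the index
        if os.path.exists(translog_path):    # startup: replay the translog
            with open(translog_path) as f:
                self.docs = [json.loads(line) for line in f]
        self._log = open(translog_path, "a")

    def index(self, doc):
        # Append to the translog first; only then is the write acknowledged.
        self._log.write(json.dumps(doc) + "\n")
        self._log.flush()
        self.docs.append(doc)

path = os.path.join(tempfile.mkdtemp(), "translog.jsonl")
idx = ToyIndex(path)
idx.index({"msg": "hello"})
idx.index({"msg": "world"})

# Simulate losing the in-memory state (e.g. kill -9 of the process) and
# recovering on the next startup by replaying the translog:
recovered = ToyIndex(path)
print([d["msg"] for d in recovered.docs])   # both documents are recovered
```

Note that `flush()` here only pushes the data into the kernel, not onto the platter -- which is exactly the distinction David raises below about fsync.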
These chunks of data that it writes are segments
<https://lucene.apache.org/core/3_6_2/fileformats.html#Segments>,
which consist of multiple files. The thing about segments is that they're
immutable. And to make sure that you don't end up with a gazillion
segments, they are asynchronously merged
<http://www.elasticsearch.org/guide/reference/index-modules/merge/> from
time to time.
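The immutable-segments-plus-merge scheme can also be sketched in miniature (again a toy model; Lucene's real segments are inverted-index files, not JSON blobs):

```python
import json
import os
import tempfile

class SegmentStore:
    """Documents are flushed into immutable, numbered segment files;
    a merge rewrites several small segments into one larger one."""

    def __init__(self, directory):
        self.directory = directory
        self.counter = 0

    def flush(self, docs):
        # A segment is written once and never modified afterwards.
        path = os.path.join(self.directory, "segment_%d.json" % self.counter)
        with open(path, "w") as f:
            json.dump(docs, f)
        self.counter += 1

    def segments(self):
        return sorted(p for p in os.listdir(self.directory)
                      if p.startswith("segment_"))

    def merge(self):
        # Read every segment, write one merged segment, drop the old ones.
        merged, old = [], self.segments()
        for name in old:
            with open(os.path.join(self.directory, name)) as f:
                merged.extend(json.load(f))
        for name in old:
            os.remove(os.path.join(self.directory, name))
        self.flush(merged)

store = SegmentStore(tempfile.mkdtemp())
store.flush([{"msg": "a"}, {"msg": "b"}])
store.flush([{"msg": "c"}])
store.merge()
print(store.segments())   # a single merged segment remains
```

Because segments are never modified in place, readers can keep using the old files while a merge runs, which is part of why this design suits append-heavy workloads like logs.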
the thing is that if it doesn't do a fsync, you have no guarantee that the data
is on the disk. And it's very possible for later data to make it to the disk
before earlier data does.
doing a kill -9 isn't the same as a system crash.
when you do a kill -9 the kernel and filesystem code contain all the data that
the application wrote, and will present that data if asked, and will eventually
get it to disk.
But if the system loses power, any data not actually written to disk is lost.
And (depending on lots of implementation details) it's possible to end up with
holes in files, or files created that have no content, or even files created,
with space allocated for them, but stray data from the drive in that space, not
what the application wrote.
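The kill -9 point is easy to demonstrate: data handed to the kernel with write() survives a SIGKILL of the writer, because it sits in the page cache even before any fsync. A small Unix-only sketch (the file path is arbitrary):

```python
import os
import subprocess
import sys
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.log")

# The child write()s one line -- no fsync, no buffering to flush, since
# os.write is a direct syscall -- and then SIGKILLs itself, like kill -9.
child = (
    "import os, signal;"
    "fd = os.open(%r, os.O_WRONLY | os.O_CREAT);"
    "os.write(fd, b'survived kill -9\\n');"
    "os.kill(os.getpid(), signal.SIGKILL)"
) % path
proc = subprocess.run([sys.executable, "-c", child])

# The process died with no chance to clean up...
print(proc.returncode)            # -9 on Unix
# ...yet the data is readable: the kernel kept it in the page cache.
with open(path, "rb") as f:
    print(f.read())               # b'survived kill -9\n'
```

Pull the plug instead of sending SIGKILL, though, and that page-cache copy is gone -- which is exactly the distinction being made here.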
I suspect that what ES does is that it writes the data in long sequential
writes, and tries to make it so that if there is power loss, logs will be lost
but not corrupted. It can do that at the data rates that you are describing.
It's writing hundreds, if not thousands of logs per 'transaction'
this is probably acceptable, but you do need to be aware of the tradeoff.
Right, there are always trade-offs. I'm sorry if I came across as the
"you're using the wrong technology" guy. I hate it when people do that.
In this particular case, I understand it's only about aggregating logs and
searching them afterwards instead of doing that with straight files. And
this is exactly what ES is about, so I thought it would be easier/better to
give it a shot. And I don't see write speed as being its strong point,
either - that would be the search speed.
I think that you are correct in saying that ES is better than MySQL for this,
but I was wanting to point out that the reason why MySQL is as slow as he was
seeing is because it's making sure that each transaction is safe before
proceeding.
Relaxing this guarantee is the sort of thing that all the NoSQL databases do,
and most of their performance wins are possible only because they do not provide
the same guarantees that the traditional SQL databases provide.
David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE
THAT.