On Wed, May 14, 2014 at 3:31 PM, Mark Gaiser <[email protected]> wrote: > On Wed, May 14, 2014 at 12:50 PM, Aaron J. Seigo <[email protected]> wrote: >> On Sunday, May 11, 2014 21:57:58 Mark Gaiser wrote: >>> Note: my recent incremental cleanups already made it about a second >>> faster then it used to be. It used to be around 5.500ms. However, the >>> current speed is still ~7x slower then using raw C++ and Qt. My goal >>> is to get it within 2x slower then raw C++/Qt. >> >> i did some benchmarking and profiling of this in the past and threading won't >> help that much. i did actually have a pretty simple threading model >> implemented which provided a fairly significantly speed up that was >> noticeable >> (e.g. 20%+) in the large (>50k files) folder case, but that was really just >> masking the problem: it offloaded the reaaaaaally slow part so that i/o could >> continue, which is classic latency vs throughput. the real solution is to cut >> latency here, by which i mean get rid of UDSEntry and stop using Qt's slow >> collection classes for things like kio_file. and because it was offoading a >> really slow part that came in large chunks (collections of dirents) it didn't >> parallelize all that nicely: on large dirs i/o would finish quickly and the >> thread pool would starve as each thread pushed its way through the UDSEntry >> dance as jobs piled up from the now unencumbered i/o path. the latency of >> threading killed throughput as a result. > > Yeah, i gave up on the parallel approach. I only did one test with it > (as can be read in my previous message here) just to (stubbornly) try > it out and observe the slowdown and because it's fairly simple with > QtConcurrent. It's not worth it in this code path. >> >> with kio_file, the reaaaaaaally slow bit, and i mean REAAAAALY slow, as in >> 90% >> of wall clock time, is populating and serializing UDSEntrys. in particular, >> adding fields is very expensive. each field is added one at a time and the >> mechanism for storing those fields is horrendously slow. honestly, i don't >> even >> know why UDSEntry is used for the IOSlave side: it means creating huge >> numbers >> of small objects which each have their own collection objects holding a >> further inefficient structure (the Field struct in the FieldHash). each of >> those >> objects has a private data member that is a QSharedData ... > > Yes, i observed the exact same thing. > - Creating UDSEntry objects is expensive > - Putting them in a QDataStream is even more expensive > > Those two alone cause certainly 75% of all time spend while listing a folder. > >> >> UDSEntry makes sense for the client side (though these days i'd rather just >> have a more efficient private data structure with a QAIM in front of that, >> but >> compatibility must be preserved) ... but for the ioslave side, especially >> ones >> like kio_file, it's completely needless overhead as a result of an >> overgeneralization in the code. >> >> for kio_file, the ioslave side basically does: >> >> * list entries via the protocol (e.g. readdir in the case of kio-file) >> * turn each entry into a UDSEntry >> * add the UDSEntry to a cold store >> * pump each UDSEntry out to the client once the cold store is full >> >> what would make so much more sense imho is having kio_file write directly to >> a >> stream ... which is exactly what those UDSEntry objects end up doing anyways! >> they are created one at a time, and then when there are N of them (or N >> seconds have passed) they serialize out to a QDataStream. for kio_file, the >> UDSEntry fields are NEVER changed and they are NEVER queried individually on >> the ioslave side. they are bulked up and then drained, and for no particular >> reason other than it lets one re-use UDSEntry and have a symmetry on both >> sides of the code base. >> >> it would be fairly trivial to have a class that has an API very much like >> UDSEntry but instead has a QByteArray internally that it just writes >> everything to. the contract with its users would be: >> >> * one entry at a time, serialized >> * no changing data once put there >> * no getting the data back out >> >> the class would then just batch up data into its QByteArray and then shunt it >> across the wire when needed. no middle men, no 100s of 1000s of objects and >> QLists and << operators. just an API sth like: >> >> startEntry >> addField >> endEntry >> >> internally it could either batch up the fields and then stream them out when >> endEntry is called, or (and this is what i would do personally) on startEntry >> it would set up the header but with bad data (e.g. field count of 0) and then >> start appending the fields as they come in and increment a field count >> variable. >> when endEntry is called, update the header with the right field count. this >> would be much faster, though it would require a strictly serial approach to >> creating items. all 'safeties' such as "if you set the 'is a symlink' field >> twice, it gets updated" and would rely much more on the ioslave being well >> behaved. batching up the fields internally (so the equivalent of one >> UDSEntry's >> internals) and then writing them out would be enough to return these >> safeties, >> but for kio_file i don't think that would *ever* be useful. > > I am very happy with your feedback here :) > > The current UDSEntry (as it is on git now) cannot be used directly for > your suggestion. However, the UDSEntry i have here is already very > close to your suggestion. The data it stores looks like this: > > struct Field { > inline Field() : m_long(0) { } > QString m_str; > long long m_long; > }; > typedef QHash<uint, Field> FieldHash; > FieldHash fields; > QString m_name; // UDS_NAME > qint64 m_size; // UDS_SIZE > quint64 m_device_id; // UDS_DEVICE_ID > quint64 m_inode; // UDS_INODE > quint32 m_mode; // UDS_ACCESS > qint64 m_mtime; // UDS_MODIFICATION_TIME > qint64 m_atime; // UDS_ACCESS_TIME > quint32 m_user; // UDS_USER_ID > quint32 m_group; // UDS_GROUP_ID > > In other words, i already store the needed stat fields directly in > UDSEntry. The fieldhash is only being used when custom fields are > injected. > > The serializing/unserializing now looks like this: > > s = QDataStream > > s << a.d->m_name.toUtf8().constData(); > s << a.d->m_size; > s << a.d->m_device_id; > s << a.d->m_atime; > s << a.d->m_mtime; > s << a.d->m_mode; > s << a.d->m_user; > s << a.d->m_group; > s << a.d->fields.size(); > > And if the fiels.size() count is > 0 then it starts injecting those > custom fields to the stream. > > Thus far i've been trying to make UDSEntry more efficient and trying > to put stuff into QDataStream as quickly as possible. An approach i > didn't even think about is leaving UDSEntry out of it completely from > the kio_file slave. > > All your idea requires is: > 1. kio_file to do the batching internally (not via SlaveBase::listEntry) > 2. kio_file to never call SlaveBase::listEntry but call > SlaveBase::listEntries instead with a QByteArray > 3. A new SlaveBase::listEntries which takes a QByteArray and basically > calls send(MSG_LIST_ENTRIES, QByteArray) directly. Or make the send > function public in which case kio_file can directly send it's data to > the socket.. It's going to be an API change either way. Or a second > listEntries function or a public send function. > > I'm implementing this approach now. I'm very curious how this will turn out :) > >> >> as for determinate order of fields (to take advantage of the static list of >> QStrings to share QString data trick), that would be preserved as the order >> of >> addField calls in kio_file is fully deterministic: it is the exact same order >> every time. >> >> so while i disagree with some of dfaure's arguments, in particular that we >> should care about single CPU systems (in practice , they simply do not exist >> anymore for KDE software; even mobile is nearly all multicore now) and that >> the intermediate data structures have a lot of overhead (UDSEntry is already >> the definition of "intermediate data structure" and "overhead"; using >> readdir_r >> instead of readdir has exceptionally little impact next to that) ... he is >> essentially correct: the answer is not threading. it is decreasing the >> operations. such as by kicking UDSEntry to the curb on the ioslave side. >> >> UDSEntry could still use optimization on the client side of ioslave >> relationship, of course ... >> >> -- >> Aaron J. Seigo >> _______________________________________________ >> Kde-frameworks-devel mailing list >> [email protected] >> https://mail.kde.org/mailman/listinfo/kde-frameworks-devel >>
Small update. I've implemented the "aaron approach" and you where very right! Listing 500.000 files: - Before: ~3.500 ms - After: ~1.500 ms These are the number as they arrive in the application that requests the data. So yes, this is with socket and multi process overhead included. Note: the "before" is with quite a few optimizations in UDSEntry and some other code paths that are used while listing a directory. If you would compare this against KIO as it it's now in git then the before number would be ~4.500 ms. If you look at it from a week ago (some patches are already in KIO) then it would be around ~5.500 which would be KDE 4.xx. Raw speed is ~700ms and i want to be - at most - twice as slow while keeping all of KIO's features which means that i'm just 0.1 ms away from my minimal goal :D _______________________________________________ Kde-frameworks-devel mailing list [email protected] https://mail.kde.org/mailman/listinfo/kde-frameworks-devel
