On 2015-12-11 07:56, Kay C Lan wrote:
On Thu, Dec 10, 2015 at 4:38 PM, Mark Waddingham <m...@livecode.com> wrote:


The "word" chunk is not loosely implemented - it does precisely what it is
meant to do.

Which of course is the reason why the sort container command has no
problem if you sort by word on the right side of the equation - 'by word x
of each'

Indeed, but remember that the 'right hand side' is the 'sort key' - it allows the parts which are to be sorted to be mapped to something else for the purposes of the sort. The point at issue here is how to split and then recombine the parts which are sorted, not what is actually used to perform the sort.
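
To make the distinction concrete, here is a minimal sketch (the sample data in tVar is made up purely for illustration). The lines are the things being split, reordered and recombined; 'word 2 of each' only supplies the value used to compare them:

```livecode
-- Hypothetical sample data: each line is "<letter> <number>".
put "b 2" & return & "a 3" & return & "c 1" into tVar

-- Lines are split and recombined; 'word 2 of each' is only the sort key.
sort lines of tVar by word 2 of each

-- tVar now holds the lines "c 1", "b 2", "a 3", in that order.
```

Note that the lines themselves come back unchanged - only their order differs.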

Well, LC's definition of what a word is isn't exactly universally accepted, but once you understand it, it's extremely powerful and saves a huge amount of effort. As LC has its own definition of what a word is, surely it could define exactly how it's going to output the final combine. More on this below.

Actually LC's definition of a word is a well-defined concept - it is essentially what you might call a 'shell token' as it is the same definition that (UNIX) shells use to process arguments:

   ls foo -- list directory foo
   ls "foo bar" -- list directory "foo bar"

Perhaps it shouldn't really have been called a 'word', which is why we added a 'segment' synonym for it in 7 where we introduced 'trueWord' (which is closer to what people might actually consider to be a word, albeit still algorithmically defined).
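
As a small sketch of the shell-token behaviour (the variable names here are just for illustration), the same string as the second 'ls' example above counts as two words, with the quoted phrase kept together:

```livecode
-- Build the string: ls "foo bar"
put "ls" && quote & "foo bar" & quote into tCommand

put the number of words in tCommand into tCount   -- 2
put word 2 of tCommand into tArg                  -- "foo bar", quotes included
put segment 2 of tCommand into tSameArg           -- same chunk; 'segment' needs LiveCode 7 or later
```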

This is not quite true, is it:

put "the,quick,brown," into tVar
put the number of items in tVar into msg -- 3
sort items of tVar
put the number of items in tVar into msg -- 4

Now I don't wish to discuss why this is, I understand why it is, I'm OK as
to why it is. As with LC's definition of what a word is, when you
understand what is happening under the hood you can work around it or work
it to your advantage. Of note in the above, the number of chars has
remained the same.

Hehe - perhaps best not to open that particular can of worms. For what it's worth, I'd actually class that behavior as an anomaly as it breaks the logic of string lists - if there is a trailing delimiter, the trailing delimiter should be ignored but preserved: sorting "the,quick,brown," should result in "brown,quick,the," and not ",brown,quick,the". (Indeed, I noticed this particular case hadn't been added to BZ - it has now: http://quality.livecode.com/show_bug.cgi?id=16588).
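
Until that is addressed, one possible workaround is to set the trailing delimiter aside before sorting and put it back afterwards. This is only a sketch: the handler name is hypothetical and it assumes the default itemDelimiter (comma).

```livecode
-- Hypothetical helper, not a built-in. Assumes the itemDelimiter is comma.
function sortItemsKeepingTrailingDelimiter pList
   local tHadTrailing
   -- Remember whether the list ended with a delimiter.
   put (the last char of pList is comma) into tHadTrailing
   if tHadTrailing then delete the last char of pList
   sort items of pList
   -- Restore the trailing delimiter so the item count is unchanged.
   if tHadTrailing then put comma after pList
   return pList
end sortItemsKeepingTrailingDelimiter
```

With this, sortItemsKeepingTrailingDelimiter("the,quick,brown,") gives "brown,quick,the,", which still counts as 3 items.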

Now if you don't agree, and think it should come out some other way, that's
OK, all that matters is whatever the output, it is consistent and
published. LC could convert all tabs to spaces, it could remove all
instances of multiple whitespace and replace it with a single space, I
don't care, just as long as whatever it does is consistent and published.
Just as some people don't think "New York" is one word, LC does, it's
published that quoted phrases are counted as one word, and that's a very
powerful thing.

I don't think I do agree with 'trying to do something sensible with the whitespace' as I don't really see why that would be useful. If you break down a string into a sequence of segments (I'll stop using word since it perhaps obfuscates the issue slightly ;)) then what use is the whitespace after that? Particularly if it has been reordered in some 'arbitrary' way. (Here I mean 'arbitrary' in the sense that there are a great many choices one could make as to how one might 'do something with' the whitespace here and as such any one choice can be seen as arbitrary - I don't think there are any particularly logical arguments one could make as to why to favour one choice over another beyond personal taste and explicit specific use-case).

So again, as LC can already sort words on the right side of the equation - sort xxxxx of tVar by word y of each - it seems only a minor step to make it possible on the left side of the equation - sort words of tVar. Obviously the sorting mechanism is in place; it's just the actual presentation that needs a little thought - surely not that hard.

Yes - there's nothing particularly 'hard' about making sort act on segments, although it has nothing to do with the sort key part (as I've said before).

If we break sort down into the steps which are actually taken the choices become more clear:

1) split the things you want to sort into a list (numerically keyed array)

2) sort the elements of the list (via a sortKey if specified)

3) combine the list back into a string

Clearly (1) is well defined for segments - you can iterate over a string using 'segment x of' and construct a list of all the segments within it. Similarly (2) is well defined as at this point there is no need to ponder segments or any other 'text chunk' structure, since the things you want to sort have been neatly listed as separate entities. It is (3) which is where there is some freedom of choice.

Basically, the choice one makes at (3) doesn't really matter as long as:

repeat for each segment x in tMyWords
  add 1 to tMyWordCount[x]
end repeat

sort segments of tMyWords

repeat for each segment y in tMyWords
  subtract 1 from tMyWordCount[y]
end repeat

ends up with tMyWordCount being an array where all elements are zero. In other words, you can iterate over the segments before the sort, and after the sort, and end up seeing exactly the same segments in exactly the same multiplicities (just in a different order).

In the vein of 'KISS' (keep it simple, stupid) it therefore seems sensible to make the simplest choice for how to recombine the string after sorting - and I think that is to use a single space.
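
For illustration, the three steps with that choice at (3) can be sketched in script today. This is not how the engine does (or would do) it: the handler name is hypothetical, and it uses a return-delimited list as the intermediate form, so it assumes no segment contains a return character.

```livecode
-- Hypothetical helper, not a built-in: sorts the segments of pText and
-- rejoins them with a single space.
-- Assumption: no segment contains a return character.
function sortSegments pText
   local tSegment, tList

   -- (1) split: collect the segments, one per line
   repeat for each segment tSegment in pText
      put tSegment & return after tList
   end repeat
   delete the last char of tList

   -- (2) sort the listed segments (a sort key could be added here)
   sort lines of tList

   -- (3) combine: join the sorted segments with a single space
   replace return with space in tList
   return tList
end sortSegments
```

For example, sortSegments("the   quick" & tab & "brown fox") would give "brown fox quick the": the original runs of spaces and the tab are discarded, but the same segments appear in the same multiplicities, which is the property described above.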

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
