Sheesh! No one should let me reply to email without sufficient
caffeine! Now that I have actually read and grokked your question...
I haven't tested this. I also included a line that removes an
apostrophe and anything after it (e.g., "steve's" becomes "steve"). You
can comment that line out if it doesn't suit your needs. The function returns an
array of all words that meet or exceed your threshold count, sorted
by decreasing frequency (you can remove the sort section too, if you
don't need it).
Sorry about the previous two messages...
function your_dream_function($phrase, $threshold) {
    $phrase = strtolower(trim($phrase));
    $WordList = $Temp = array();
    $Bits = split('[[:space:]]+', $phrase);
    for ($i=0; $i<count($Bits); $i++) {
        $Bits[$i] = ereg_replace("'.*", '', $Bits[$i]); # Apostrophe removal
        $Temp[$Bits[$i]]++;
    }
    arsort($Temp); # Sorts in descending order, keeping the word keys
    reset($Temp);
    # $Temp is keyed by word, so walk it with each(), not a numeric index
    while (list($word, $count) = each($Temp)) {
        if ($count >= $threshold) { $WordList[] = $word; }
    }
    return ($WordList);
}
Some notes:
(1) I can't recall if PHP3's arsort() would treat the array value as
a string or as a numeric. I think it would always treat the value as
a string (meaning that '2' would come before '12' in a descending
sort), so in that case you would have to modify the function above in
one place (only if you wanted the sort):
Replace
$Temp[$Bits[$i]]++;
in the first for loop with
$x = $Temp[$Bits[$i]] + 1; # new count; the padded string still adds as a number
$Temp[$Bits[$i]] = substr(10000+$x, 1);
This should left-pad the number with zeros (for up to 9999
occurrences found), so that the ASCII sort sequence is in numeric
order. Using PHP4's arsort($Temp, SORT_NUMERIC) would be much handier!
(2) Iff (if and ONLY if) you are using the sort, you could get a
slight speed improvement by stopping at the first word that falls
below the threshold. Replace the second loop with
    while (list($word, $count) = each($Temp)) {
        if ($count >= $threshold) {
            $WordList[] = $word;
        } else {
            break; # all subsequent words will be below the threshold
        }
    }
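To see what the padding in note (1) buys you, here is a small sketch that forces a string sort and compares the two orderings. It assumes PHP 4 or later (for foreach and the SORT_STRING flag), and pad_count() is just a name I made up for the illustration, not something in the function above:

```php
<?php
// Hypothetical helper (an assumption for this example): left-pad a count
// to four digits, e.g. 12 becomes "0012". Only valid for counts up to 9999.
function pad_count($n) {
    return substr((string) (10000 + $n), 1);
}

$raw    = array('the' => '2', 'coffe' => '12', 'milk' => '7');
$padded = array();
foreach ($raw as $word => $n) {
    $padded[$word] = pad_count($n);
}

arsort($raw, SORT_STRING);     // compares '7', '2', '12' as text
arsort($padded, SORT_STRING);  // compares '0007', '0002', '0012' as text

print_r(array_keys($raw));     // milk, the, coffe
print_r(array_keys($padded));  // coffe, milk, the
```

The unpadded counts come out in the wrong order because '2' sorts above '12' as text; the padded ones match the numeric order.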
Again, this is all untested code.
-steve
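P.S. For anyone who is not tied to PHP3, here is a minimal sketch of the same idea using foreach, the preg_ functions and arsort() with SORT_NUMERIC. It assumes PHP 4 or later, so it does not meet the PHP3 requirement, and the name frequent_words() is my own invention for the example:

```php
<?php
// Hypothetical PHP 4+ variant of the function above; the name is an assumption.
function frequent_words($phrase, $threshold) {
    $counts = array();
    // Split on runs of whitespace, dropping empty pieces.
    $bits = preg_split('/\s+/', strtolower(trim($phrase)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($bits as $bit) {
        $word = preg_replace("/'.*/", '', $bit);  // apostrophe removal
        if ($word === '') { continue; }           // nothing left, skip it
        if (!isset($counts[$word])) { $counts[$word] = 0; }
        $counts[$word]++;
    }
    arsort($counts, SORT_NUMERIC);                // highest count first
    $words = array();
    foreach ($counts as $word => $count) {
        if ($count < $threshold) { break; }       // sorted, so the rest are lower too
        $words[] = $word;
    }
    return $words;
}

$var = "I love coffe ... Coffe is from Brazil and coffe with milk ..";
print_r(frequent_words($var, 2));                 // a one-element array: coffe
```

Like the PHP3 version, this splits on whitespace only, so "..." and ".." count as (rare) words of their own; they simply fall below the threshold here.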
At 3:43 PM -0700 4/14/01, Steve Edberg wrote:
>OK, this is less elegant (to me, anyway), and probably a bit slower, but:
>
>function word_occurrence($word,$phrase) {
> $word = strtolower($word); # this way,
> $phrase = strtolower($phrase); # case is irrelevant
>
> $Bits = split('[^[:alnum:]]+', $phrase);
> $count = 0;
> for ($i=0; $i<count($Bits); $i++) {
> if ($Bits[$i] == $word) { $count++; }
> }
>
> return ($count);
>}
>
>It should also handle hyphenated & apostrophied (is that a word?)
>words correctly, such as
>
> coffe's a drag
>or
> coffe-heads are strange
>
>If you want to count words that INCLUDE dashes or apostrophes, you'd
>have to use "[^[:alnum:]'-]+" in the split() function. Or, just
>break the string up by whitespace, and use '[[:space:]]+'.
>
> -steve
>
>
>
>---Original Message ---
>At 11:07 PM +0200 4/14/01, n e t b r a i n wrote:
>>Hi all,
>>anyone have or know where I can find a small function in order to extract
>>from a string the most relevant words in it?
>>
>>something like this:
>>
>>$var="I love coffe ... Coffe is from Brazil and coffe with milk ..";
>>$occurence=2;
>>//$occurence means words that are repeated 2 or more times
>>my_dream_funct($var,$occurence);
>>//the funct now returns the word _ coffe _
>>
>>many thanks in advance
>>max
>>
>>ps.plz note: I need that it works on php3
>>
>
>
>Well, just offthetopofmyhead:
>
>function word_occurrence($word,$phrase) {
> $word = strtolower($word); # this way,
> $phrase = strtolower($phrase); # case is irrelevant
>
> $Bits = split($word.'[^[:alnum:]]*', $phrase);
>
> return (count($Bits)-1);
>}
>
>I tested this, and it works fine (php 3.0.12) EXCEPT it counts
>'coffecoffe' as TWO words, not zero. If that's the behavior you
>want, then it's fine. Now I'm intrigued...I want to find a single
>regular expression that will NOT match 'coffecoffe'. Perhaps the preg_
>functions (available on PHP >= 3.0.9) could do it.
>
>And, I tried things like
>
> split('[^[:alnum:]]*'.$word.'[^[:alnum:]]*', " $phrase ")
>
>...didn't work.
>
>-steve
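Following up on the 'coffecoffe' question just above: one way the preg_ functions might handle it is with lookarounds that refuse a letter or digit on either side of the word. This is only a sketch, assuming a PCRE build that supports lookbehind; count_whole_word() is a made-up name, not a replacement I have run against PHP3:

```php
<?php
// Hypothetical helper (name is an assumption): count whole-word,
// case-insensitive occurrences of $word in $phrase.
function count_whole_word($word, $phrase) {
    // preg_quote() escapes any regex metacharacters in $word.
    // (?<![[:alnum:]]) = not preceded by a letter or digit
    // (?![[:alnum:]])  = not followed by a letter or digit
    $pattern = '/(?<![[:alnum:]])' . preg_quote($word, '/') . '(?![[:alnum:]])/i';
    // preg_match_all() returns the number of full matches found.
    return preg_match_all($pattern, $phrase, $matches);
}

echo count_whole_word('coffe', 'coffecoffe'), "\n";               // 0
echo count_whole_word('coffe', "coffe's a drag"), "\n";           // 1
echo count_whole_word('coffe', 'coffe-heads are strange'), "\n";  // 1
```

Because the lookarounds consume no characters, two occurrences separated by a single space or dash are both counted.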
>
>--
>+----------- 12 April 2001: Forty years of manned spaceflight -----------+
>| Steve Edberg University of California, Davis |
>| [EMAIL PROTECTED] Computer Consultant |
>| http://aesric.ucdavis.edu/ http://pgfsun.ucdavis.edu/ |
>+-------------------------- www.yurisnight.net --------------------------+
>
>--
>PHP General Mailing List (http://www.php.net/)
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
>To contact the list administrators, e-mail: [EMAIL PROTECTED]