Michael Enke recently asked in pgsql-bugs about VARDATA and C strings
(BUG #2574: C function: arg TEXT data corrupt). Since that's not a bug,
I've moved this follow-up to pgsql-general.

On Mon, 2006-08-14 at 11:27 -0400, Tom Lane wrote:
> The usual way to get a C string from a TEXT datum is to call textout,
> eg
>     str = DatumGetCString(DirectFunctionCall1(textout, datumval));

Yikes! I've been accessing VARDATA text data like Michael for years
(code below). I account for length and don't expect null-termination,
but I don't use anything like Tom's suggestion above. (I always try to
do what Tom says because that usually hurts less.)

I have three questions:

1) I based everything I did on examples lifted nearly verbatim from a
7.x manual, and I bet Michael did similarly. I've never heard of
DatumGetCString, DirectFunctionCall1, or textout. Are these and other
treasures documented somewhere?

2) Does DatumGetCString(DirectFunctionCall1(textout, datumval)) do
something other than null terminate a string? All of the strings are
from [-A-Z0-1*]; server_encoding has been either SQL_ASCII or UTF8 in
case that's relevant.

3) Is there any reason to believe that the code below is problematic?
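
To make question 2 concrete, here is a minimal standalone sketch (no
PostgreSQL headers) of what I assume "null terminate a string" amounts
to in the simplest case: copy a length-counted payload into a fresh
buffer and append '\0'. The names mock_text and mock_text_to_cstring
are made up for illustration; this is not PostgreSQL's varlena layout
or API, and whether textout does more than this is exactly what I'm
asking.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a length-counted text value.
 * NOT PostgreSQL's varlena layout; just enough to show the idea. */
typedef struct {
    size_t      len;   /* number of payload bytes, no terminator */
    const char *data;  /* payload, not NUL-terminated */
} mock_text;

/* Copy the payload into a malloc'd buffer and append '\0'.
 * Returns NULL on allocation failure; the caller frees the result. */
char *mock_text_to_cstring(const mock_text *t)
{
    char *s;

    assert(t != NULL);
    s = malloc(t->len + 1);
    if (s == NULL)
        return NULL;
    if (t->len > 0)
        memcpy(s, t->data, t->len);  /* copy exactly len bytes */
    s[t->len] = '\0';                /* terminator goes after them */
    return s;
}
```

The point of the sketch is only that the length comes from the value
itself, never from strlen() on the raw payload.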
Thanks,
Reece

#include <postgres.h>
#include <fmgr.h>
#include <ctype.h>
#include <string.h>

static char* clean_sequence(const char* in, int32 n);

PG_FUNCTION_INFO_V1(pg_clean_sequence);
Datum pg_clean_sequence(PG_FUNCTION_ARGS)
{
  text* t0;      /* in */
  text* t1;      /* out */
  char* tmp;
  int32 tmpl;

  if ( PG_ARGISNULL(0) )
    { PG_RETURN_NULL(); }

  t0 = PG_GETARG_TEXT_P(0);
  tmp = clean_sequence( VARDATA(t0), VARSIZE(t0)-VARHDRSZ );
  tmpl = (int32) strlen(tmp);

  /* copy temp sequence into new pg variable */
  t1 = (text*) palloc( tmpl + VARHDRSZ );
  if (!t1)
    { elog( ERROR, "couldn't palloc (%d bytes)", tmpl+VARHDRSZ ); }
  memcpy(VARDATA(t1),tmp,tmpl);
  VARATT_SIZEP(t1) = tmpl + VARHDRSZ;

  pfree(tmp);
  PG_RETURN_TEXT_P(t1);
}

/* clean_sequence -- strip non-IUPAC symbols
   The intent is to strip non-sequence data which might result from
   copy-pasting a fasta file or some such.

   in:  char*, length
   out: char*, |out|<=length, NULL-TERMINATED

   out is palloc'd memory; caller must free

   allow chars from IUPAC std 20
   + selenocysteine (U) + ambiguity (BZX) + gap (-) + stop (*)
*/

#define isseq(c) ( ((c)>='A' && (c)<='Z' && (c)!='J' && (c)!='O') \
                   || ((c)=='-') \
                   || ((c)=='*') )

char* clean_sequence(const char* in, int32 n) {
  char* out;
  char* oi;
  int32 i;

  out = palloc( n + 1 );   /* w/null */
  if (!out)
    { elog( ERROR, "couldn't palloc (%d bytes)", n+1 ); }

  for( i=0, oi=out; i<=n-1; i++ ) {
    char c = toupper(in[i]);
    if ( isseq(c) ) {
      *oi++ = c;
    }
  }
  *oi = '\0';
  return(out);
}
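
For anyone who wants to try the filter outside the server, a rough
standalone equivalent follows. It is a sketch, not the code I run in
production: it swaps palloc for malloc, replaces the isseq macro with
a function, and casts to unsigned char before calling toupper. The
names clean_sequence_standalone and is_seq_char are invented for this
example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Same character set as the isseq macro: A-Z except J and O,
 * plus gap (-) and stop (*). */
static int is_seq_char(int c)
{
    return (c >= 'A' && c <= 'Z' && c != 'J' && c != 'O')
        || c == '-'
        || c == '*';
}

/* Upper-case the first n bytes of in and keep only sequence symbols.
 * Returns a malloc'd, NUL-terminated string (NULL on allocation
 * failure); the caller frees it. */
char *clean_sequence_standalone(const char *in, size_t n)
{
    char  *out;
    char  *oi;
    size_t i;

    assert(in != NULL);
    out = malloc(n + 1);             /* worst case: every byte kept */
    if (out == NULL)
        return NULL;

    oi = out;
    for (i = 0; i < n; i++) {
        /* toupper() wants an unsigned char value or EOF */
        int c = toupper((unsigned char) in[i]);
        if (is_seq_char(c))
            *oi++ = (char) c;
    }
    *oi = '\0';
    return out;
}
```

Called with a buffer and its byte count, for example
clean_sequence_standalone(buf, strlen(buf)), it reads exactly that
many bytes and never relies on the input being NUL-terminated.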
--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0