That is indeed a serious error. Besides being free to copy in either direction, memcpy doesn't have to copy one byte at a time. It could use 16-byte vector registers. It could use something like PowerPC's dcbz instruction, which allocates a cache line of the destination in an all-zero state rather than fetching it from RAM. With overlapping buffers, any of these strategies can clobber source bytes before they have been read.
Use memmove() if the source and destination might overlap.
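
To make the overlap case concrete, here is a minimal sketch. The helper name shift_right and its parameters are made up for illustration and assume the caller has sized the buffer correctly; the only library call that matters is memmove(), which the C standard defines to behave as if the data went through a temporary buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from any library): slide the first `len`
   bytes of `buf` toward the end by `gap` bytes, opening a gap at the
   front. The caller must guarantee buf holds at least len + gap bytes. */
static void shift_right(char *buf, size_t len, size_t gap)
{
    /* Source [buf, buf+len) and destination [buf+gap, buf+gap+len)
       overlap whenever 0 < gap < len, so memcpy() would be undefined
       behavior here. memmove() is specified to handle the overlap. */
    memmove(buf + gap, buf, len);
}
```

For example, with char buf[16] = "abcdef", calling shift_right(buf, 7, 2) moves the six letters and the terminating NUL two bytes to the right. The first two bytes are left as they were, so the buffer then holds "ababcdef".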