The advice you've given seems fine.
My colleagues and I have done a little research and here's what we believe is going on:
FCP editing happens at video frame rates. Many audio-only file types have no video frame rate (understandably), yet their rate and duration are still specified in video frames.
At media import time, FCP converts everything's duration into a number of video frames, using the current easy-setup default sequence preset rate.
After that, if the user edits a clip onto a sequence with a different frame rate, there is a rate conversion from the 'original' to the destination (even though the source may be audio and not actually have a video frame rate).
This rate conversion means the media may be sped up or slowed down, as if a speed effect had been applied even though none was explicitly applied. The problem becomes more obvious as the media gets longer. (No samples are lost, but the media will be resampled.)
To avoid the media duration getting modified as if you were mixing video frame rates when importing and editing audio media, users and developers alike can make sure that the rate for the sequence, enclosed clip items, and their files are all the same. This can be done with careful setting of the easy setup before importing media to match the output format (what you'll set the sequence rate to be), as you've noted.
For the developer this could be done with a modification of the XML later, though frame durations and edit timing will most likely also have to be changed appropriately if the rate is changed via XML.
Also, XML import allows an XML sequence setting to be specified as an import option. This controls how audio-only media is brought in via XML, overriding the easy-setup sequence preset in that case.
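For anyone taking the XML-modification route, here is a rough sketch of the frame-count adjustment described above. It is not FCP's own code: the function names effective_fps and rescale_frames are made up for illustration, and it assumes that an ntsc value of TRUE means the timebase is scaled by 1000/1001 (which gives ~29.97 for a timebase of 30).

```python
from fractions import Fraction


def effective_fps(timebase: int, ntsc: bool = False) -> Fraction:
    """Real frame rate for a timebase/ntsc pair (hypothetical helper).

    Assumption: ntsc=True scales the timebase by 1000/1001.
    """
    return Fraction(timebase * 1000, 1001) if ntsc else Fraction(timebase)


def rescale_frames(frames: int, src_fps: Fraction, dst_fps: Fraction) -> int:
    """Frame count that covers the same wall-clock time at dst_fps,
    rounded to the nearest whole frame."""
    return round(frames * dst_fps / src_fps)


# A 30-second clip expressed at 25 fps, moved to a 30 fps sequence.
print(rescale_frames(750, effective_fps(25), effective_fps(30)))  # 900
```

The same scaling would have to be applied consistently to every duration and in/out/start/end value that is expressed in frames, otherwise the edit timing will no longer line up.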
The difference between 29.97 fps and 30 fps is subtle enough that it is not obviously wrong, just mysteriously out of sync. Over a 1-hour period, it will drift out of sync by about 3.6 seconds.
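The drift figure can be checked with a little arithmetic. This is only a back-of-the-envelope sketch: drift_seconds is a hypothetical helper, and it assumes 29.97 fps means exactly 30000/1001 fps.

```python
from fractions import Fraction


def drift_seconds(duration_s, assumed_fps, actual_fps):
    """Drift when a duration counted in frames at assumed_fps
    is played back at actual_fps (hypothetical helper)."""
    frames = duration_s * assumed_fps
    return frames / actual_fps - duration_s


ntsc_30 = Fraction(30000, 1001)  # assumption: exact value of "29.97"
drift = drift_seconds(3600, Fraction(30), ntsc_30)
print(float(drift))  # 3.6
```

That is about 0.1% of the running time, so a 10-minute clip would be off by roughly 0.6 seconds.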
By the way,
<rate>
<timebase>30</timebase>
<ntsc>TRUE</ntsc>
</rate>
is interpreted as ~29.97 fps while
<rate>
<timebase>30</timebase>
</rate>
or
<rate>
<timebase>30</timebase>
<ntsc>FALSE</ntsc>
</rate>
is interpreted as 30 fps.
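For developers reading these values from XML, the interpretation above could be sketched roughly as follows. The helper name rate_to_fps is made up, and the 1000/1001 factor for ntsc TRUE is an assumption that matches the ~29.97 figure; a missing ntsc element is treated as FALSE, as described above.

```python
import xml.etree.ElementTree as ET


def rate_to_fps(rate_elem):
    """Effective frames per second for a <rate> element (hypothetical helper)."""
    timebase = int(rate_elem.findtext("timebase"))
    # A missing or empty <ntsc> element is treated as FALSE.
    ntsc_text = rate_elem.findtext("ntsc") or "FALSE"
    is_ntsc = ntsc_text.strip().upper() == "TRUE"
    # Assumption: NTSC rates are the timebase scaled by 1000/1001.
    return timebase * 1000 / 1001 if is_ntsc else float(timebase)


ntsc_rate = ET.fromstring(
    "<rate><timebase>30</timebase><ntsc>TRUE</ntsc></rate>"
)
plain_rate = ET.fromstring("<rate><timebase>30</timebase></rate>")

print(round(rate_to_fps(ntsc_rate), 2))  # 29.97
print(rate_to_fps(plain_rate))           # 30.0
```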
NOTE: This aberration also occurs when you import audio with your easy setups at another rate, say 25 fps, and then edit the media on a 30 fps sequence.
Thanks,
Helena & Kelly.