This is a rather long message, but I've tried to describe my problem in detail so that it's easier to provide an explanation and/or solution.
I’m trying to optimise file-copy performance in a file manager application. The copy engine developed for this application is based on the copyfile() API. The previous engine was based on the File Manager FSCopyObjectAsync() API, but we ditched it because it has been deprecated since OS X 10.8 and we wanted to implement deep folder merging, which was practically impossible with the old API. We chose copyfile() because it provides a good foundation for folder merging, it has callbacks (to report copying progress), and it is the replacement recommended by the Apple documentation (File Manager Reference, recommended replacements for deprecated functions).
Since the new engine went into use, some users have reported slower copying of big files in certain situations (depending on the source and target disks and similar), so I decided to see what could be improved. Since copying a single big file is in question, it was clear the engine spent most of its time in the copyfile() function itself, and in particular in the part dealing with the file read/write. Fortunately, copyfile() is part of Darwin and hence open source, so I could peek into it. This is the part dealing with the file read/write (I re-indented it myself to make it more readable):
while ((nread = read(s->src_fd, bp, blen)) > 0)
{
    ssize_t nwritten;
    size_t left = nread;
    void *ptr = bp;
    int loop = 0;

    while (left > 0)
    {
        nwritten = write(s->dst_fd, ptr, MIN(left, oBlocksize));
        switch (nwritten)
        {
            case 0:
                if (++loop > 5)
                {
                    copyfile_warn("writing to output %d times resulted in 0 bytes written", loop);
                    ret = -1;
                    s->err = EAGAIN;
                    goto exit;
                }
                break;
            case -1:
                copyfile_warn("writing to output file got error");
                if (status)
                {
                    int rv = (*status)(COPYFILE_COPY_DATA, COPYFILE_ERR, s, s->src, s->dst, s->ctx);
                    if (rv == COPYFILE_SKIP)
                    {
                        ret = 0;
                        goto exit;
                    }
                    if (rv == COPYFILE_CONTINUE)
                    {
                        errno = 0;
                        continue;
                    }
                }
                ret = -1;
                goto exit;
            default:
                left -= nwritten;
                ptr = ((char*)ptr) + nwritten;
                loop = 0;
                break;
        }
        s->totalCopied += nwritten;
        if (status)
        {
            int rv = (*status)(COPYFILE_COPY_DATA, COPYFILE_PROGRESS, s, s->src, s->dst, s->ctx);
            if (rv == COPYFILE_QUIT)
            {
                ret = -1;
                s->err = errno = ECANCELED;
                goto exit;
            }
        }
    }
}
It’s immediately obvious this is not optimal: reading the source file and writing the destination file are done sequentially in a single thread, so every write() has to wait for its read() counterpart. For example, take a file which needs 10 read/write cycles to copy (assuming the read and write buffers are the same size), with source and destination on the same disk, and suppose reading is the faster operation, taking 1 second per cycle, while writing takes 2 seconds. The whole file is then copied in 30 seconds (10 cycles, 1 s read plus 2 s write in each cycle). If reading and writing are separated into threads, only the first write() has to wait for the first read() to complete (1 second); every other write cycle has its data ready and starts immediately after the previous write has finished. The file would (theoretically) be copied in 21 seconds, i.e. 30% faster.
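To make that arithmetic concrete, here is a tiny helper (illustrative only; the cycle count and per-phase times are the hypothetical figures from the example above, not measurements):

```c
// Models the copy duration for n read/write cycles with per-cycle read
// and write times t_r and t_w. In the sequential case every cycle pays
// t_r + t_w; in the pipelined (two-thread) case only the first cycle of
// the faster phase is exposed, after which the slower phase dominates.
double copy_time(int n, double t_r, double t_w, int pipelined)
{
    if (!pipelined)
        return n * (t_r + t_w);
    double faster = (t_r < t_w) ? t_r : t_w;
    double slower = (t_r < t_w) ? t_w : t_r;
    return faster + n * slower;
}
```

With the example’s figures (10 cycles, 1 s reads, 2 s writes) this gives 30 s sequential versus 21 s pipelined.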
Without doing any profiling or measuring, I implemented the threaded approach, only to discover there weren’t any speed gains. On the contrary, in some situations it even got slower. So I looked deeper into the threads’ execution, only to discover that reading and writing were still effectively sequential (just in different threads): each write() still had to wait for its companion read() to complete, implying that read() is slower than write() on the same disk. I was very surprised by that and still believed there was a problem in my implementation, so this time I did some profiling.
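For context, the threaded approach I mean is the classic reader/writer handoff. A minimal sketch (hypothetical, not our engine’s actual code; uses two ping-pong buffers of 64 KB for brevity, where the engine used 16 MB):

```c
#include <pthread.h>
#include <stdio.h>

#define BUFSZ (1 << 16)           /* 64 KB chunks for the sketch */

struct slot {
    char data[BUFSZ];
    size_t len;                   /* bytes valid in data; 0 marks EOF */
    int full;                     /* 1 = ready for the writer */
    pthread_mutex_t mu;
    pthread_cond_t cv;
};

static struct slot slots[2];
static FILE *src, *dst;

/* Reader thread: alternately fills the two slots as they drain. */
static void *reader(void *arg)
{
    (void)arg;
    for (int i = 0; ; i ^= 1) {
        struct slot *s = &slots[i];
        pthread_mutex_lock(&s->mu);
        while (s->full)                       /* wait until writer drained it */
            pthread_cond_wait(&s->cv, &s->mu);
        s->len = fread(s->data, 1, BUFSZ, src);
        s->full = 1;
        pthread_cond_signal(&s->cv);
        pthread_mutex_unlock(&s->mu);
        if (s->len == 0)                      /* EOF sentinel handed over */
            return NULL;
    }
}

/* Writer thread: drains the slots in the same order the reader fills them. */
static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; ; i ^= 1) {
        struct slot *s = &slots[i];
        pthread_mutex_lock(&s->mu);
        while (!s->full)
            pthread_cond_wait(&s->cv, &s->mu);
        size_t len = s->len;
        if (len > 0)
            fwrite(s->data, 1, len, dst);
        s->full = 0;
        pthread_cond_signal(&s->cv);
        pthread_mutex_unlock(&s->mu);
        if (len == 0)
            return NULL;
    }
}

int threaded_copy(const char *from, const char *to)
{
    src = fopen(from, "rb");
    dst = fopen(to, "wb");
    if (!src || !dst)
        return -1;
    for (int i = 0; i < 2; i++) {
        slots[i].full = 0;
        pthread_mutex_init(&slots[i].mu, NULL);
        pthread_cond_init(&slots[i].cv, NULL);
    }
    pthread_t r, w;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(r, NULL);
    pthread_join(w, NULL);
    fclose(src);
    fclose(dst);
    return 0;
}
```

The point of the design is that a read into one buffer can overlap a write from the other; it only helps if the two phases actually take comparable time, which is exactly what turned out not to hold.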
I created a very simple command-line utility. Its complete main() is as follows:
#include <stdio.h>
#include <copyfile.h>

int main(int argc, const char * argv[])
{
    const char *src = "/Users/milke/Downloads/smalltest.zip";
    const char *dst = "/Users/milke/Desktop/smalltest.zip";
    copyfile_state_t state = copyfile_state_alloc();
    copyfile_flags_t flags = COPYFILE_ALL | COPYFILE_NOFOLLOW;
    return copyfile(src, dst, state, flags);
}
The test file “smalltest.zip” was exactly 256 MB (268,435,456 bytes), filled with random bytes (thus avoiding any read/write caching the system may be doing that I don’t even know about). The source disk/volume is also the destination one: it’s Apple’s SATA SSD, model APPLE SSD TS512C, built into a 2011 MacBook Pro. I ran the utility under Instruments > Time Profiler, and it actually showed reading taking longer than writing! The percentages were different on every run, ranging from read = 52% against write = 42% up to one extreme of read = 65% against write = 30%.
After this, I took copyfile.c and modified it to measure and print the duration of each read() and write() call. I calculated average values, rejecting the minimums and maximums (but recording them). I also used the same 16 MB buffer size for both reads and writes (the original copyfile.c consults statfs() to get the optimal read and write sizes for a volume), so my test file is copied in 16 read/write cycles. The results differ on every run (understandably), but in general they look like the following list:
1: read 16777216 bytes in 0.087021 seconds
1: written 16777216 bytes in 0.073020 seconds
2: read 16777216 bytes in 0.081063 seconds
2: written 16777216 bytes in 0.074134 seconds
3: read 16777216 bytes in 0.082879 seconds
3: written 16777216 bytes in 0.076724 seconds
4: read 16777216 bytes in 0.188362 seconds
4: written 16777216 bytes in 0.075116 seconds
5: read 16777216 bytes in 0.193575 seconds
5: written 16777216 bytes in 0.074075 seconds
6: read 16777216 bytes in 0.084868 seconds
6: written 16777216 bytes in 0.077992 seconds
7: read 16777216 bytes in 0.277805 seconds
7: written 16777216 bytes in 0.076839 seconds
8: read 16777216 bytes in 0.087295 seconds
8: written 16777216 bytes in 0.071497 seconds
9: read 16777216 bytes in 0.270342 seconds
9: written 16777216 bytes in 0.072851 seconds
10: read 16777216 bytes in 0.089179 seconds
10: written 16777216 bytes in 0.076859 seconds
11: read 16777216 bytes in 0.280369 seconds
11: written 16777216 bytes in 0.077752 seconds
12: read 16777216 bytes in 0.089682 seconds
12: written 16777216 bytes in 0.073565 seconds
13: read 16777216 bytes in 0.266644 seconds
13: written 16777216 bytes in 0.071825 seconds
14: read 16777216 bytes in 0.092464 seconds
14: written 16777216 bytes in 0.076359 seconds
15: read 16777216 bytes in 0.281460 seconds
15: written 16777216 bytes in 0.074397 seconds
16: read 16777216 bytes in 0.090654 seconds
16: written 16777216 bytes in 0.072744 seconds
+++!!!+++ average read cycle time = 0.155796 seconds (minimal = 0.081063, maximal = 0.281460)
+++!!!+++ average write cycle time = 0.074733 seconds (minimal = 0.071497, maximal = 0.077992)
This shows the average read being almost twice as slow as the average write on the same disk. Also, all the write times are similar, while the reads show much more variance. If the destination is another disk/volume, the read times drop by about 50%.
One more note: I repeated this test many (hundreds of) times, restarting the utility or copying a single source file into 10 different copies, and the results were pretty consistent. But on a few occasions I got surprisingly different results, like in the following list:
1: read 16777216 bytes in 0.013782 seconds
1: written 16777216 bytes in 0.069654 seconds
2: read 16777216 bytes in 0.003336 seconds
2: written 16777216 bytes in 0.072545 seconds
3: read 16777216 bytes in 0.003111 seconds
3: written 16777216 bytes in 0.075811 seconds
4: read 16777216 bytes in 0.003093 seconds
4: written 16777216 bytes in 0.077601 seconds
5: read 16777216 bytes in 0.003052 seconds
5: written 16777216 bytes in 0.092845 seconds
6: read 16777216 bytes in 0.003200 seconds
6: written 16777216 bytes in 0.078246 seconds
7: read 16777216 bytes in 0.003082 seconds
7: written 16777216 bytes in 0.093462 seconds
8: read 16777216 bytes in 0.003148 seconds
8: written 16777216 bytes in 0.074416 seconds
9: read 16777216 bytes in 0.003856 seconds
9: written 16777216 bytes in 0.088351 seconds
10: read 16777216 bytes in 0.003113 seconds
10: written 16777216 bytes in 0.076054 seconds
11: read 16777216 bytes in 0.003131 seconds
11: written 16777216 bytes in 0.082186 seconds
12: read 16777216 bytes in 0.003122 seconds
12: written 16777216 bytes in 0.077315 seconds
13: read 16777216 bytes in 0.003094 seconds
13: written 16777216 bytes in 0.079402 seconds
14: read 16777216 bytes in 0.003137 seconds
14: written 16777216 bytes in 0.076901 seconds
15: read 16777216 bytes in 0.004836 seconds
15: written 16777216 bytes in 0.103139 seconds
16: read 16777216 bytes in 0.003169 seconds
16: written 16777216 bytes in 0.073829 seconds
+++!!!+++ average read cycle time = 0.003316 seconds (minimal = 0.003052, maximal = 0.013782)
+++!!!+++ average write cycle time = 0.079926 seconds (minimal = 0.069654, maximal = 0.103139)
I haven’t the slightest idea what happened in those few cases, or whether I did anything special to trigger them (I don’t think so; I was just repeating the test over and over), but they basically match what I expected before doing all this testing and measuring: reading significantly faster than writing, and not much variance between different read() calls.
As I’m not a file-system expert, I’d really appreciate it if anyone could shed some light on the results I got. I’d also appreciate a recommendation for a more efficient API for reading and writing files, especially in a threaded environment. Would the Dispatch I/O channel API be better in this situation?
Thanks a lot for any help.