In my case, the data is not changing. Small files should have no impact, since I'm assuming we generate a list of disk blocks to move; whether those blocks make up many small files or one huge file makes little difference. Granted, depending on where the block pointers live, small files might require more writes to the index, but we're talking a difference of a factor of 2 or 3 at most. In my case, I believe there are few files smaller than 1 MB.
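To make that factor-of-2-or-3 claim concrete, here's a back-of-envelope sketch. All the sizes are hypothetical, not measured from any real filesystem; the point is only that the index overhead is a small constant factor on top of the data-block writes, not an order of magnitude:

```python
# Back-of-envelope sketch (hypothetical sizes). Moving a given set of
# data blocks costs the same number of data writes whether they span
# one file or thousands; only the index (block-pointer) updates differ,
# bounded here by a pessimistic "one index write per data write" case.

BLOCK_SIZE = 4096                     # assumed 4 KiB blocks
data_bytes = 40 * 1024**3             # assume 40 GiB to relocate
data_writes = data_bytes // BLOCK_SIZE

best_case = data_writes * 1.0         # one huge file: index cost ~ negligible
worst_case = data_writes * 2.0        # tiny files: one index write per block

print(worst_case / best_case)         # 2.0: a small constant, nowhere near 10x
```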
If you can generate the entire list of block moves in stage 2, then you also know where those blocks are, and can easily compensate for the ones that sit on the slower regions of the disk. Even without any such compensation, the total error would be far less than an order of magnitude.
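Here is a minimal sketch of what I mean by compensation, assuming you have the full move list up front. The zone boundaries and throughput figures below are made-up illustrative numbers (real drives would need measured values), but the idea is just to weight each move by the speed of the region it touches, so the estimate tracks time rather than raw block count:

```python
# Weight each pending block move by the (assumed) sequential throughput
# of the disk zone it sits in. Zone table and speeds are hypothetical.

ZONES = [  # (first_block, throughput in MB/s), sorted by first_block
    (0,           180.0),   # outer tracks: fastest
    (50_000_000,  140.0),
    (150_000_000, 100.0),   # inner tracks: slowest
]

BLOCK_SIZE_MB = 4096 / 1_000_000  # assumed 4 KiB blocks

def zone_speed(block: int) -> float:
    """Throughput of the zone containing this block number."""
    speed = ZONES[0][1]
    for start, mbps in ZONES:
        if block >= start:
            speed = mbps
    return speed

def estimated_seconds(moves: list[tuple[int, int]]) -> float:
    """Sum per-move cost: each (src, dst) pair is read at the source
    zone's speed and written at the destination zone's speed."""
    total = 0.0
    for src, dst in moves:
        total += BLOCK_SIZE_MB / zone_speed(src)   # read cost
        total += BLOCK_SIZE_MB / zone_speed(dst)   # write cost
    return total

# Example: two moves crossing between fast and slow zones.
moves = [(10_000, 160_000_000), (200_000_000, 5_000)]
print(f"{estimated_seconds(moves) * 1000:.4f} ms for {len(moves)} moves")
```

Progress reported as time-weighted work done over this total stays within a small constant of reality even without per-drive calibration.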
Apparently the algorithm used does not create the entire list of required moves in stage 2. I'd be pleased to have my ignorance relieved as to why the task is more complex than I've suggested. My guess is that it has to do with the disk being mounted, which arguably requires a more dynamic approach.