Estimating time of completion of a data migration process

tufkap

I've just finished writing a utility to migrate data from optical libraries to a NAS box, and now I'm trying to come up with a formula to estimate the time of completion given the following data:

1) Total amount of data
2) Number of drives in the library
3) Average read speed of the drives
4) Total number of files
5) Fixed overhead for each file
6) Average write speed of the NAS box (this also takes into account the network write speed)

The formula that I am using now looks like this:

(Total data / (Number of drives * Average read speed)) + (Total files * Fixed overhead) + (Total data / Average write speed)

I don't think this is right in all cases. The utility launches one thread for each drive in the library, so there is some parallelisation of the copy process. But I think the above formula would only work if the copying were done sequentially. Does anyone have a better idea of how to do this, taking into account that the reads and writes happen in parallel?

Please note that in the program itself, I just use the number of files processed so far and the time taken to process them to guesstimate the time remaining. This formula is for an Excel file where the user can enter the data listed above and get an approximate time of completion before actually starting the migration.

Any help is greatly appreciated.

The user formerly known as pkam.

Tim Craig

I think the gating factor is the slowest of the reading, writing, and data transfer rates. It doesn't matter how fast you can read the data if writing is slower. Conversely, if reading can't keep up with writing, then reading is the limiting process. Of course, this analysis is based on aggregate rates, which may be hard to judge, but it seems you have some average values to work with.
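A minimal sketch of that bottleneck idea, next to the sequential formula from the question, might look like the following. The function and parameter names are hypothetical, and two points are assumptions rather than anything stated in the thread: that the per-file overhead is paid inside each drive's thread (so it is divided by the number of drives), and that it does not overlap with the transfer itself.

```python
def sequential_estimate_seconds(total_bytes, drives, read_bps, write_bps,
                                total_files, per_file_overhead_s):
    """The formula from the question: read time + overhead + write time."""
    read_time = total_bytes / (drives * read_bps)
    overhead_time = total_files * per_file_overhead_s
    write_time = total_bytes / write_bps
    return read_time + overhead_time + write_time


def bottleneck_estimate_seconds(total_bytes, drives, read_bps, write_bps,
                                total_files, per_file_overhead_s):
    """Estimate based on the slower of aggregate read rate and write rate."""
    aggregate_read_bps = drives * read_bps  # all drives read in parallel
    effective_bps = min(aggregate_read_bps, write_bps)  # slowest stage gates
    transfer_time = total_bytes / effective_bps
    # Assumption: each drive thread pays the overhead for its own files,
    # so the total overhead is spread across the drives.
    overhead_time = total_files * per_file_overhead_s / drives
    return transfer_time + overhead_time


# Illustrative inputs only (not measurements): 1 TB, 4 drives at 10 MB/s each,
# NAS write speed 60 MB/s, 100,000 files, 0.5 s overhead per file.
params = dict(total_bytes=10**12, drives=4, read_bps=10_000_000,
              write_bps=60_000_000, total_files=100_000,
              per_file_overhead_s=0.5)
print("sequential:", sequential_estimate_seconds(**params) / 3600, "hours")
print("bottleneck:", bottleneck_estimate_seconds(**params) / 3600, "hours")
```

In a spreadsheet, the same estimate reduces to a MIN over the two rates, followed by the overhead term.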

If you don't have the data, you're just another asshole with an opinion.

tufkap

Thank you for taking the time to respond to my question, Tim.

The user formerly known as pkam.

Alan Balkany

There are complexities and interactions you can't anticipate, so a more reliable approach is to make completion-time measurements for different parameter combinations. Looking at the graphs of times for different values of a single parameter will give you insight as to how it really affects completion time. Multiple regression will give you formulas that estimate completion time based on the values of multiple parameters.
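As a rough illustration of the regression step, the sketch below fits a linear model of completion time against two parameters using ordinary least squares in plain Python. Everything specific here is an assumption: the choice of parameters (data size in GB, file count in thousands), the linear form of the model, and the helper name `fit_linear`. The sample values are synthetic, generated from a known formula purely to show that the fit recovers it; they are not measurements from any real migration.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    rows: list of parameter tuples; targets: measured completion times.
    Returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]  # intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Synthetic runs: (data in GB, files in thousands). The times come from a
# made-up rule, time = 30 + 25 * GB + 500 * kfiles, standing in for real
# measurements.
runs = [(10, 1), (10, 5), (50, 1), (50, 5), (100, 2), (100, 20)]
times = [30 + 25 * gb + 500 * kfiles for gb, kfiles in runs]

intercept, per_gb, per_kfile = fit_linear(runs, times)
predicted = intercept + per_gb * 200 + per_kfile * 10  # 200 GB, 10k files
print(intercept, per_gb, per_kfile, predicted)
```

With real measurements the fit will not be exact, and plotting the residuals against each parameter is one way to see whether a linear model is adequate.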

tufkap

Thank you for your response, Alan. I just needed a rough estimate, so for now I'm using the method suggested by Tim. Also, the utility has so far been tested only on a small test configuration. When we do further testing, I'll try out your method.

The user formerly known as pkam.