Introduction
Copying large files can produce errors, as described in the original article, especially on so-called cranky hardware. I ran into this with one particular USB drive, and it drove me crazy because I wanted to use that drive for backups of large containers. If you need to recover a large backup file, or want to verify the integrity of a backup, the code discussed in this article may interest you.
In this version I concentrate on file comparison only. The source code repository is available on GitHub.
Background
The original article provided two tools, one for error-tolerant copying and one for verifying a copy. It explains not only how to compare such files but also how to achieve certain UI effects with WPF on Windows. If you run Windows and have the .NET Framework installed, you may simply want to check out the original article.
If you are interested in a native C++ version (or you just love reading alternative versions), read on. This article focuses on C++ code that compares two files and reports whether they appear to be identical.
I will not discuss performance benefits of native code, as performance is not the focus of this implementation and I believe any difference is negligible here. The native C++ version does have some other advantages:
- C++ code is more portable, so you can probably use this code wherever a C++ compiler is available. At least I suspect that a C++ compiler and common libraries like Boost are more widespread on *nix systems than Mono, and there may still be systems where Mono is not available at all.
- Such a system utility might be an important tool when everything else goes wrong. Imagine you have just reinstalled your OS and have no way to download and install .NET or Mono, but you desperately need to verify or copy your backup with this tool.
- A decent UI is most definitely something I prefer, but in some situations you might want (or even be forced) to use a shell utility.
- A statically linked build of this program runs without any external library dependencies.
The code to compare file contents
This is an excerpt of my version of the file comparison code: the part that does the actual work. c_buffer_bytes is set to 1024 * 1024 * 10 (10 MiB). It should be configurable, which I plan to address in an upcoming version. m_max_retries is set to 3 by default and can be set via the command line.
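bool CSequenceComparer::Compare(const std::string& first_file, const std::string& second_file)
{
    // Not shown in this excerpt: opening input_stream (first_file) and
    // compare_stream (second_file) and allocating both buffers.
    char* p_buffer;
    char* p_compare_buffer;
    short retries = 0;
    while ( ! input_stream.eof() )
    {
        // Remember where the current chunk starts in each file.
        streamoff input_offset = input_stream.tellg();
        input_stream.read(p_buffer, c_buffer_bytes);
        const streamsize num_of_input_bytes = input_stream.gcount();
        streamoff compare_offset = compare_stream.tellg();
        compare_stream.read(p_compare_buffer, c_buffer_bytes);
        const streamsize num_of_compare_bytes = compare_stream.gcount();
        // The chunks are equal only if both reads returned the same number
        // of bytes and those bytes match.
        bool equal = false;
        if ( num_of_input_bytes == num_of_compare_bytes )
        {
            if ( memcmp(p_buffer, p_compare_buffer, (size_t)num_of_input_bytes) == 0 )
            {
                equal = true;
            }
        }
        // A successful comparison resets the retry counter.
        retries = equal ? 0 : (retries + 1);
        if (retries > m_max_retries)
        {
            break;
        }
        if (retries > 0)
        {
            // Not shown in this excerpt: seek both streams back to
            // input_offset and compare_offset so the chunk is read again.
        }
    }
    delete [] p_buffer;
    delete [] p_compare_buffer;
    return retries == 0;
}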
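The excerpt above leaves out the stream setup and the rewind on retry. As a rough, self-contained sketch of the same idea (read both files chunk by chunk, compare, and re-read a chunk a limited number of times if it differs), something like the following could work. The names compare_files and read_chunk and the 1 MiB default buffer are assumptions made for illustration; they are not taken from the repository.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads up to `size` bytes starting at `offset` and
// returns the number of bytes actually read. clear() comes first so that an
// earlier EOF or failed read does not prevent seekg() from repositioning.
static std::streamsize read_chunk(std::ifstream& stream, std::streamoff offset,
                                  char* buffer, std::streamsize size)
{
    stream.clear();
    stream.seekg(offset, std::ios::beg);
    stream.read(buffer, size);
    return stream.gcount();
}

// Hypothetical free function, not the article's CSequenceComparer::Compare.
bool compare_files(const std::string& first_file, const std::string& second_file,
                   std::size_t buffer_bytes = 1024 * 1024, int max_retries = 3)
{
    std::ifstream input(first_file.c_str(), std::ios::binary);
    std::ifstream compare(second_file.c_str(), std::ios::binary);
    if (!input || !compare || buffer_bytes == 0)
        return false;

    std::vector<char> buffer(buffer_bytes);
    std::vector<char> compare_buffer(buffer_bytes);
    const std::streamsize chunk = static_cast<std::streamsize>(buffer_bytes);

    std::streamoff offset = 0;
    int retries = 0;
    for (;;)
    {
        const std::streamsize n1 = read_chunk(input, offset, buffer.data(), chunk);
        const std::streamsize n2 = read_chunk(compare, offset, compare_buffer.data(), chunk);

        // A short read only counts as the end of the data if both streams hit EOF.
        const bool clean = (n1 == chunk) || (input.eof() && compare.eof());
        const bool equal = clean && n1 == n2 &&
            std::memcmp(buffer.data(), compare_buffer.data(),
                        static_cast<std::size_t>(n1)) == 0;

        if (!equal)
        {
            if (++retries > max_retries)
                return false;   // the chunk kept failing: report a mismatch
            continue;           // read the same offset again
        }

        retries = 0;
        if (n1 < chunk)
            return true;        // both files ended with matching content
        offset += n1;
    }
}
```

Called as compare_files("a.img", "b.img") with two hypothetical paths, the sketch returns true only if every chunk eventually compares equal and both files end at the same offset.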
Points of Interest
If you are paranoid, you will notice that the code above treats a file, or a part of it, as equal as soon as one comparison succeeds. You might therefore want to add code that double-checks a successful comparison, because the smaller the compare buffer is, the higher the probability of a false positive.
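One way to approach such a double check, sketched here under my own assumptions, is to read the same region of both files several times and accept it only if every pass matches. The helper name chunk_matches_repeatedly and its passes parameter are hypothetical and not part of the article's code.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <vector>

// Hypothetical helper: reads `length` bytes at `offset` from both streams
// `passes` times and reports a match only if every pass compares equal.
bool chunk_matches_repeatedly(std::ifstream& first, std::ifstream& second,
                              std::streamoff offset, std::size_t length,
                              int passes = 2)
{
    if (length == 0 || passes < 1)
        return false;

    std::vector<char> a(length);
    std::vector<char> b(length);
    for (int pass = 0; pass < passes; ++pass)
    {
        // Reset the state bits so that seekg() works after an earlier EOF.
        first.clear();
        second.clear();
        first.seekg(offset, std::ios::beg);
        second.seekg(offset, std::ios::beg);

        first.read(a.data(), static_cast<std::streamsize>(length));
        second.read(b.data(), static_cast<std::streamsize>(length));
        const std::streamsize n1 = first.gcount();
        const std::streamsize n2 = second.gcount();

        if (n1 != n2 ||
            std::memcmp(a.data(), b.data(), static_cast<std::size_t>(n1)) != 0)
            return false;   // a single failed pass rejects the chunk
    }
    return true;
}
```

Keep in mind that the operating system may serve the repeated reads from its cache rather than from the device, so this kind of check is only as strong as the platform's caching behaviour allows.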
Since I wanted to provide a portable version, I included the Boost Filesystem library for the sake of readable code. In my first version, created on Windows, I used some Windows-only functions to check whether the files were available. This is the part that should work on both Linux and Windows (it is left out of the excerpt above):
path filePath1(first_file);
path filePath2(second_file);
if ( !exists(filePath1) || !is_regular_file(filePath1)
    || !exists(filePath2) || !is_regular_file(filePath2) )
{
    throw std::invalid_argument("Please provide regular file paths as arguments");
}
I also discovered that the ifstream::seekg() method in MSVC resets some status bits. On Linux I have to call ifstream::clear() myself before seeking and continuing to read.
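The reason is that a read which runs into the end of the file sets both eofbit and failbit, and as far as the C++ standard specifies, seekg() does not reposition a stream while failbit is set. A minimal sketch of a portable rewind, using the hypothetical helper name rewind_to, could look like this:

```cpp
#include <fstream>

// Hypothetical helper: clears eofbit/failbit, then seeks to `offset`.
// Returns true if the stream is usable for reading afterwards.
bool rewind_to(std::ifstream& stream, std::streamoff offset)
{
    stream.clear();                        // reset the state bits first
    stream.seekg(offset, std::ios::beg);   // now the seek can take effect
    return stream.good();
}
```

Calling clear() unconditionally is harmless on platforms where seekg() already resets the bits, so the same code can be used on both Windows and Linux.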
History
Submitted to CodeProject 17 October, 2012.