
Fault tolerant file comparison

17 Oct 2012 | CPOL | 3 min read
This is an alternative to "Fault Tolerance for Large Files on Cranky Hardware".

Introduction

Copying large files may result in errors, as described in the original article, especially on so-called cranky hardware. I experienced this with one particular USB drive, and it drove me crazy because I wanted to use the drive for backups of large containers. If you need to recover a large backup file or want to verify the integrity of a backup, the code discussed in this article may be of interest.

In this version I will concentrate on file comparison only. The source code repository is available on GitHub.

Background

The original article provided two tools, one for error-tolerant copying and one for verifying a copy. It explains not only how to compare such files but also how to achieve certain UI effects with WPF and Windows. If you run Windows and have the .NET Framework installed, you might want to check out the original article.

If you are interested in a native C++ version (or you just love reading alternative versions), you are welcome to read on. This article focuses on C++ code that compares two files and reports whether they are believed to be identical.

The native C++ version has some advantages. I will not discuss any performance benefits of native code as this is not the focus of the C++ implementation and I believe they can be neglected here. 

  • C++ code is more portable, so you can probably use this code wherever a C++ compiler is available. I suspect that a C++ compiler and common libraries such as Boost are more widely available on *nix systems than Mono, and there may still be systems where Mono is not available at all.
  • Such a system utility might be an important tool when everything else goes wrong. Imagine you just reinstalled your OS and have no possibility to download and install .NET or Mono to run this program, but you are in desperate need of verifying/copying your backup with this tool.  
  • A decent UI is most definitely something I prefer, but in some situations you might want (or are even forced) to use a shell utility. 
  • A static build of this program runs with no dependencies at all.

The code to compare file contents 

This is an excerpt of my version of the file comparison code: the part that does the actual work.

c_buffer_bytes is set to 1024 * 1024 * 10 (10 MB). It should be configurable, and I plan to add that in an upcoming version.

m_max_retries is set to 3 by default, and may be set via command line.

C++
bool CSequenceComparer::Compare(const std::string& first_file, const std::string& second_file)
{
    // [...] checking that files exist and are regular files (not directories)
    // [...] open files

    char* p_buffer;
    char* p_compare_buffer;
    // [...] allocating c_buffer_bytes of memory for each buffer
    
    short retries = 0;
    while ( ! input_stream.eof() )
    {
        streamoff input_offset = input_stream.tellg();
        input_stream.read(p_buffer, c_buffer_bytes);
        const streamsize num_of_input_bytes = input_stream.gcount();

        streamoff compare_offset = compare_stream.tellg();
        compare_stream.read(p_compare_buffer, c_buffer_bytes);
        const streamsize num_of_compare_bytes = compare_stream.gcount();

        bool equal = false;
        if ( num_of_input_bytes == num_of_compare_bytes )
        {
            if ( memcmp(p_buffer, p_compare_buffer, (size_t)num_of_input_bytes) == 0 )
            {
                equal = true;
            }
        }

        retries = equal ? 0 : (retries + 1);
        if (retries > m_max_retries)
        {
            break;
        }

        if (retries > 0)
        {
            // [...] seek to previously stored offsets to retry read
        }
    }

    delete [] p_buffer;
    delete [] p_compare_buffer;
    
    return retries == 0;
}
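
For readers who want something they can compile directly, the loop above can be sketched as a free function with the elided parts filled in. This is only one possible reading of the excerpt, not the repository code: the name compare_with_retries and its parameters are made up for illustration, std::vector replaces the manually allocated buffers, a read error other than end of file is treated as a mismatch, and clear() is always called before seeking back.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for CSequenceComparer::Compare (assumed names and defaults).
bool compare_with_retries(const std::string& first_file,
                          const std::string& second_file,
                          std::size_t buffer_bytes = 10 * 1024 * 1024,
                          int max_retries = 3)
{
    if (buffer_bytes == 0) return false;  // a zero-sized buffer could never reach EOF

    std::ifstream input(first_file, std::ios::binary);
    std::ifstream compare(second_file, std::ios::binary);
    if (!input || !compare) return false;  // at least one file could not be opened

    std::vector<char> buf_a(buffer_bytes), buf_b(buffer_bytes);
    const std::streamsize want = static_cast<std::streamsize>(buffer_bytes);
    int retries = 0;

    for (;;)
    {
        // Remember where this chunk starts so that it can be re-read.
        const std::streampos pos_a = input.tellg();
        const std::streampos pos_b = compare.tellg();

        input.read(buf_a.data(), want);
        compare.read(buf_b.data(), want);
        const std::streamsize n_a = input.gcount();
        const std::streamsize n_b = compare.gcount();

        // failbit without eofbit means a real read error, not the end of the file.
        const bool read_error = (input.fail() && !input.eof())
                             || (compare.fail() && !compare.eof());
        const bool equal = !read_error && n_a == n_b
            && std::memcmp(buf_a.data(), buf_b.data(),
                           static_cast<std::size_t>(n_a)) == 0;

        if (equal)
        {
            retries = 0;
            if (input.eof()) return true;  // last chunk matched, sizes agree
        }
        else
        {
            if (++retries > max_retries) return false;
            input.clear();                 // drop eofbit/failbit before seeking
            compare.clear();
            input.seekg(pos_a);
            compare.seekg(pos_b);
        }
    }
}
```

A call such as compare_with_retries("backup.img", "copy.img") would then return true only if every chunk eventually compared equal; the two file names are placeholders.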

Points of Interest  

If you are paranoid, you will notice that the code above regards a file, or a part of it, as equal as soon as one comparison succeeds. You might therefore want to add code that double-checks a successful comparison, because the smaller the compare buffer is, the higher the probability of a false positive.
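
One way such a double check could look is sketched below. It is not part of the published code: read_chunk and chunk_matches_confirmed are hypothetical helpers that re-read the same chunk from both streams a given number of times and accept it only if every pass matches. Bear in mind that the operating system may serve the repeated reads from its cache rather than from the device, so a real tool would probably need unbuffered I/O (for example O_DIRECT on Linux or FILE_FLAG_NO_BUFFERING on Windows) for the second pass to mean much.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <vector>

// Hypothetical helper: fills `out` from `in`, starting at `offset`.
// Returns the number of bytes actually read.
std::streamsize read_chunk(std::ifstream& in, std::streamoff offset,
                           std::vector<char>& out)
{
    in.clear();                        // forget eofbit/failbit from earlier reads
    in.seekg(offset, std::ios::beg);
    in.read(out.data(), static_cast<std::streamsize>(out.size()));
    return in.gcount();
}

// Hypothetical helper: the chunk counts as equal only if it matches on
// every one of `confirmations` independent reads.
bool chunk_matches_confirmed(std::ifstream& a, std::ifstream& b,
                             std::streamoff offset, std::size_t length,
                             int confirmations = 2)
{
    if (length == 0 || confirmations <= 0) return false;

    std::vector<char> buf_a(length), buf_b(length);
    for (int pass = 0; pass < confirmations; ++pass)
    {
        const std::streamsize n_a = read_chunk(a, offset, buf_a);
        const std::streamsize n_b = read_chunk(b, offset, buf_b);
        if (n_a != n_b) return false;
        if (std::memcmp(buf_a.data(), buf_b.data(),
                        static_cast<std::size_t>(n_a)) != 0) return false;
    }
    return true;
}
```

The trade-off is read time: every confirmed chunk is read at least twice from each file, which roughly doubles the duration of a successful comparison.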

Since I wanted to provide a portable version, I used the Boost.Filesystem library for the sake of readable code. My first version, created under Windows, used some Windows-only functions to check whether the files were available. This is the part which should work on both Linux and Windows (it is elided in the excerpt above):

C++
path filePath1(first_file);
path filePath2(second_file);
if ( !exists(filePath1) || !is_regular_file(filePath1) 
    || !exists(filePath2) || !is_regular_file(filePath2) )
{
    throw std::invalid_argument("Please provide regular file paths as arguments");
}

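I also discovered that the ifstream::seekg() method in MSVC resets some of the stream's status bits, whereas on Linux I have to invoke ifstream::clear() before continuing. A read that hits the end of the file sets both eofbit and failbit, and a seek on a stream in the failed state is not guaranteed to do anything, so calling clear() before seekg() is the portable choice.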

History 

Submitted to CodeProject 17 October, 2012.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)