BlogAll Blog Posts | Next Post | Previous Post
Friday, May 20, 2011If there is one thing I do a lot is playing "find the differences" with xls and xlsx files.
I do it when reverse engineering a feature: If I want to find some undocumented byte that changes some feature, I save the file in Excel with and without the feature, open both files and find what changed.
I do it when testing: We have a really huge set of "reference files" that our test suite must create in order to verify FlexCel is working correctly. If some code change makes a file different from the reference file, I need to find the differences to know if we introduced a bug, or if it is ok and the reference file must be updated.
And I do it as an Excel user, when I need to find what really changed between two data files.
Differencing the raw filesFor reverse engineering and testing, I just look at the raw data in a diff program (I currently use the excellent Beyond Compare). Many years of doing so almost every day give you some Matrix-like powers where you just look at a bunch of Hexadecimal codes and you can see how that means that a formula changed. Let's see it with a real world example of a file I had to deal a week ago:
As you can easily see from the image, somehow the formula that was at row 1 is now at row 2, and the formula that was at row 2 moved to row 3. This is clearly a 1-off error, so we need to look at the code and fix it, we must have forgotten some "-1" somewhere.
But while this works fine for a controlled environment, it breaks when you want to compare arbitrary files. An xls file has thousands of "reserved" bytes that will change whenever you change the Excel version, and records might change a lot, without the file actually changing. What is even worse, all strings in an xls or xlsx file are stored inside an unsorted string table, and referenced by number. So one file might have the string number 1 at cell A1 and string number 2 at cell A2, while the other has the string number 2 at A1 and 1 at A2, and both files would be exactly the same because the shared string table is ordered differently. This is true also for xlsx files. Unless changes are trivial and controlled, diffing raw files can only go so far.
Time for a second example. I took the file I had just used in the images above, opened it in Office 2010 and saved it. This is now how the diff looks like:
As you can see at the left column, there is too much "red stuff" for this to be useful.
Differencing using ApiMateSo, for general differencing we need a better solution. Normally, all diff tools come with some way to diff xls files, but in both WinMerge and Beyond Compare they just convert the files to csv and diff that. All styling is removed. And if you look at the title of this post, you will see this is unacceptable, we want to know changes in styles too. Maybe the data is the same but now the title is 24 points instead of 22? We want to know.
What we really need is some kind of tool that could "convert" an xls or xlsx file to some kind of text representation, so we can diff that text later. Creating a tool like that was in my perpetual todo list of things-that-I-should-investigate-someday-but-never-will-because-there-are-more-urgent-things-to-do-right-now. Of course, until the day I woke up and realized that I already had such a tool and it is already distributed with FlexCel .NET.
It is called APIMate, and it will convert an xls or xlsx file to C#, Delphi or VB.NET code. It was designed as a learning tool to answer all those "how to" questions ("How to change the color of a cell?", "How to insert an image?"), and it does a great job at that, to the point that I use it myself all the time instead of looking at the docs. But there is nothing forbidding using that text representation in a diff tool. I just needed to add some changes so it can be called from the command line and so it will diff all the sheets instead of just the active sheet.
Setting it up is easy: You need the APIMate shipped with FlexCel 5.5 as older versions won't run from the command line. Then you need to go to your diff application and setup a filter to import xls and xlsx files. While this is different in all diff applications, most of them have the option. In particular, in Beyond Compare you would go to "Tools->File Formats", select "xls" and write ApiMate.exe as the filter app. When used as a console application, you need to pass two parameters to APIMate.exe: the source xls/x file and the target text file that will be generated.
Once you have done that, you can right click any two Excel files in the Windows explorer and diff them. Going back to our original example, this is how it would look now:
While APIMate won't tell you every difference in the file (for example it won't show charts yet) it will do a much better job than the bundled diff filters. Myself, I still do mostly raw diffs when developing since I need to know about every bit that changed, but for general use "ApiMate diff" has proven to be surprisingly useful.
As a final example, let's look at the following files:
What changed? This is what APIMate will show: (note that we are showing only changed lines)
As you can see the range C6:C7 has been merged. A cell changed the formatting (you can't see that in the screenshot because of the filtered lines, but if we saw the full diff it would be simple to see it was cell C4). Also the image was moved and the checkbox changed from checked to unchecked. Oh, and the selected cell changed too.
One last thing.We have made it so you don't need to be a registered FlexCel user to use this. If you just want to diff Excel files and don't need FlexCel, you can install the trial, and keep APIMate.exe and FlexCel.dll in some folder. It will work fine and won't expire.
This blog post has received 3 comments.
All Blog Posts | Next Post | Previous Post