Today’s random project is de-duplicating a set of files from a 2008 backup. This is an old archive of five ZIP files built from five DVD backups of my computer, made back in 2008. These ZIP files contain personal files, work files, and programming projects from all my jobs up to that point. Some files go back to 1993, when floppies were the primary storage medium. The starting point for the effort: 20.6 GB in 40,286 files in 4,072 folders.
That sounds like a pretty daunting task, and it’s not the first time I’ve considered cleaning it up. I know it needs it because these ZIP files have ZIP files inside of them, which have even more ZIPs inside of them. It’s a Russian doll of redundancy.
Step one is to expand the first level of ZIP files into one folder. Then I will use the freeware tool SMF (Search My Files) to find duplicates. Eliminate the dupes, which hopefully include some ZIP files, then expand the next level of ZIP files inside the folders and repeat.
On the first run, it found 28k potential dupes – I assume that’s based on filename. In 7 minutes it created hashes for all those files, then quickly identified 16k dupes. I worked my way through the biggest files, getting down to files around 1.5 MB, and after that first trim, the cleanup folder was: 19.6 GB in 40,105 files in 4,072 folders.
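I don’t know exactly how SMF works under the hood, but the general idea (group candidate files cheaply first, then confirm matches with a content hash) can be sketched in a few lines of Python. The `extracted_backup` folder name below is just a stand-in for wherever the ZIPs were expanded:

```python
import hashlib
import os
from collections import defaultdict


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_files(root):
    """Group files under root by size first, then confirm dupes by content hash."""
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        for path in paths:
            by_hash[hash_file(path)].append(path)

    # Keep only the groups where two or more files share identical contents.
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}


# "extracted_backup" is a placeholder for the folder the ZIPs were expanded into.
for digest, paths in find_duplicate_files("extracted_backup").items():
    print(digest[:12], *paths, sep="\n  ")
```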
So that effort saved me about a gig of space. Worth the effort? Probably not. Am I going to work through the other 15k files? Absolutely not. What I discovered I needed was a duplicate folder finder, something that would check whether all the files in two different folders were the same. That would involve creating a checksum at the folder level as well as at the file level. By deleting some files from one folder and not others, I wasn’t helping the duplication problem; I was actually making it worse, because now I had two folders, each with incomplete contents.
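As far as I can tell, SMF doesn’t do this, but here’s a rough sketch of what a folder-level checksum could look like, assuming a folder’s digest is built from the sorted relative paths and content hashes of everything under it. The folder names in the usage example are made up:

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def folder_checksum(folder):
    """Digest a folder from the sorted relative paths and hashes of its files.

    Two folders get the same digest only when they hold the same file names
    with the same contents, including anything in subfolders."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(folder):
        dirnames.sort()  # walk subfolders in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, folder).encode("utf-8"))
            h.update(hash_file(path).encode("ascii"))
    return h.hexdigest()


# Usage: compare a work archive against a copy brought home.
# These folder names are hypothetical, just for the example.
if folder_checksum("work_archive/projects") == folder_checksum("home_copy/projects"):
    print("Folders are identical -- safe to delete one.")
```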
Ok. That was a total bust. The more I dug into ZIPs, the worse things got. That was about 90 minutes of effort for no good results, so I deleted it all and extracted the original ZIP files again. This time, I’m going to break the files out into their different sources. I have two or three different work archives, plus my personal stuff. One of the problems this may solve is the case where I copied work files home and ended up with two archives of the same stuff. Then at some point, I’ll have to reconcile archives from different time periods; I would probably want to keep the newest version.
An hour or so into the organization process, I’m feeling pretty good about this attempt. I manually identify some dupes and immediately wipe them out. Then I use the dupe checker utility to look at smaller folders, so I don’t get hit with tens of thousands of dupes. The result of this effort? 19.3 GB.
At this point, I’m pretty satisfied with where the archive is. More importantly, most of the files have been unearthed from their nested ZIPs, so I can find the dupes and delete them. So this was more of a cleaning exercise than anything. There’s some talk on the internet about being a hoarder of digital data and how easy it is to do that because it all seems so lightweight. But if you open up your “archive” folder and immediately close it because it’s too overwhelming, that should be a warning sign. Physical or digital, stuff serves no purpose if it can’t be found and accessed with minimal effort. That’s been the biggest satisfaction for me from this task: at least things are a little more in order, even if they’re not perfect yet.