This is one of the projects I created to test a number of things and
to deal with an annoyance when working with backups. The central idea
is that we work with read-only files which we copy and want to store
somewhere. On the one hand this is a typical backup situation; on the
other hand the same concept appears in functional programming
languages, and the idea here was to speed up calculations
substantially by distinguishing truly different inputs to functions.
That is to say: if a calculation was already performed on a file with
the same content, it should not be redone. Of course, this is only the
pretext. The small tests below do not come close to solving such a
memoization problem efficiently. In fact, the problem I discovered
with this approach is that comparing file contents takes a long time
once the store grows, so much so that it is better to just compare
files at the proper place instead of comparing them at every
opportunity we have.
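To make the memoization idea concrete, here is a minimal sketch in Python. It is not the code of this project: the decorator `memoized_on_content`, the in-memory `_cache` and the example function `count_lines` are hypothetical names, and the sketch assumes that a SHA-256 digest is an acceptable stand-in for a byte-by-byte comparison.

```python
import functools
import hashlib

# Hypothetical in-memory cache: (function name, content digest) -> result.
_cache = {}

def content_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def memoized_on_content(func):
    """Cache func(path) by file content, so equal content is computed once."""
    @functools.wraps(func)
    def wrapper(path):
        key = (func.__name__, content_digest(path))
        if key not in _cache:
            _cache[key] = func(path)
        return _cache[key]
    return wrapper

@memoized_on_content
def count_lines(path):
    """Example calculation: count the lines in a file."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)
```

Note that every call still reads the whole file to compute the digest, which is exactly the cost that becomes a problem once the store grows.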

Hard linking duplicate files

Anyway, I'm getting sidetracked here. Historically, I first took the
program fdupes, written by Adrian Lopez, and since I liked it (I
actually still like it), I modified it somewhat so that it would make
hard links between duplicate files. This makes it possible to have an
efficient store on disk without any alteration to the directory
content.
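The following Python sketch illustrates the general technique of replacing duplicates by hard links. It is not the modified fdupes (which is a C program) and makes its own assumptions: candidates are grouped by size and then by SHA-256 digest, the function names are hypothetical, and all files are assumed to live on one filesystem, since hard links cannot cross filesystems.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Return lists of paths under root that have identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)
    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            groups[(size, file_digest(path))].append(path)
    return [sorted(g) for g in groups.values() if len(g) > 1]

def hardlink_duplicates(root):
    """Replace duplicates by hard links; return the number of new links."""
    linked = 0
    for group in find_duplicates(root):
        keep = group[0]
        for dup in group[1:]:
            if os.path.samefile(keep, dup):
                continue  # already the same inode
            tmp = dup + ".hardlink-tmp"  # hypothetical temporary name
            os.link(keep, tmp)           # new name for the kept file
            os.replace(tmp, dup)         # atomically swap it in
            linked += 1
    return linked
```

Calling `hardlink_duplicates("some/directory")` leaves every path in place, so the directory content looks unchanged, but duplicate files now share one copy of the data on disk. Each run hashes every candidate again, which is the cost discussed next.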
An observation with larger stores is that this program does its job
nicely, but it takes ever more time as the store grows. In the end
this little program needed 6 hours to go through 13 GB in around
500,000 files, which is too long, especially when one realizes that
most of those files had already been compared against one another and
that there is no good reason to compare them _again_ at a later stage.
So I set out to create a tool that would sort files into a store and
work incrementally. Each new file would be 'imported' into the store.
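As an illustration of incremental import, here is a minimal Python sketch. It is not the source of the actual tool: the class `Store`, the `objects` directory, the `index.json` file and the sequential integer ids are all assumptions made for this example, as is the use of a SHA-256 digest to recognize content that is already present. The point is only that a new file costs one hash and one lookup, instead of a comparison against every file already stored.

```python
import hashlib
import json
import os

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

class Store:
    """Hypothetical store: objects/<id> hard-linked to imported files."""

    def __init__(self, root):
        self.objects = os.path.join(root, "objects")
        self.index_path = os.path.join(root, "index.json")
        os.makedirs(self.objects, exist_ok=True)
        self.index = {}  # content digest -> integer id
        if os.path.exists(self.index_path):
            with open(self.index_path) as f:
                self.index = json.load(f)

    def import_file(self, path):
        """Return the id for path's content, linking it in if it is new."""
        digest = file_digest(path)
        file_id = self.index.get(digest)
        if file_id is None:
            file_id = len(self.index) + 1  # ids are never reused
            # Hard link from within the store to the imported file.
            os.link(path, os.path.join(self.objects, str(file_id)))
            self.index[digest] = file_id
        return file_id

    def import_tree(self, directory):
        """Import every file under directory; return {path: id}."""
        ids = {}
        for dirpath, _dirs, files in os.walk(directory):
            for name in sorted(files):
                path = os.path.join(dirpath, name)
                ids[path] = self.import_file(path)
        self.save()
        return ids

    def save(self):
        """Persist the index so that later runs stay incremental."""
        with open(self.index_path, "w") as f:
            json.dump(self.index, f)
```

A call such as `Store("store").import_tree("incoming")` would import a directory; the store and the imported directory must be on the same filesystem for the hard links to work, and the imported files must not be modified afterwards, because the store shares their data.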

Importing files into a Read-Only Store

The source below contains a program that takes a directory and imports
the full content of that directory into a store by linking from within
the store to the files in that directory. This means that none of the
files in the directory to import can be writable. They should all be
treated as read-only. The advantage of this system is that it is
faster than finding duplicate files afterwards. The disadvantage is
that those files _should not change_. So if you are undisciplined and
think you can just start modifying these files: it will screw up your
entire store forever. A change to one file is disastrous and cannot be
repaired. So be careful.
Running the program on a directory named Test will import each of the
files directly or indirectly under Test and store them in the current
directory, which is considered the target store. Each file is assigned
a unique id which can then be used further in relational databases or
the like. Below is the output of such a run: