From ed47c1557915bb2472f6959e723cd76155312a98 Mon Sep 17 00:00:00 2001 From: Chris Xiong Date: Mon, 6 Apr 2020 00:50:58 +0800 Subject: Add deduper (unfinished tool for finding image duplicates). --- deduper/libpuzzle/README | 202 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 202 insertions(+) create mode 100644 deduper/libpuzzle/README (limited to 'deduper/libpuzzle/README') diff --git a/deduper/libpuzzle/README b/deduper/libpuzzle/README new file mode 100644 index 0000000..502a1c0 --- /dev/null +++ b/deduper/libpuzzle/README @@ -0,0 +1,202 @@ + + .:. LIBPUZZLE .:. + + http://libpuzzle.pureftpd.org + + + ------------------------ BLURB ------------------------ + + +The Puzzle library is designed to quickly find visually similar images (gif, +png, jpg), even if they have been resized, recompressed, recolored or slightly +modified. + +The library is free, lightweight yet very fast, configurable, easy to use and +it has been designed with security in mind. This is a C library, but it also +comes with a command-line tool and PHP bindings. + + + ------------------------ REFERENCE ------------------------ + + +The Puzzle library is a implementation of "An image signature for any kind of +image", by H. CHI WONG, Marschall BERN and David GOLDBERG. + + + ------------------------ COMPILATION ------------------------ + + +In order to load images, the library relies on the GD2 library. +You need to install gdlib2 and its development headers before compiling +libpuzzle. +The GD2 library is available as a pre-built package for most operating systems. +Debian and Ubuntu users should install the "libgd2-dev" or the "libgd2-xpm-dev" +package. +Gentoo users should install "media-libs/gd". +OpenBSD, NetBSD and DragonflyBSD users should install the "gd" package. +MacPorts users should install the "gd2" package. +X11 support is not required for the Puzzle library. + +Once GD2 has been installed, configure the Puzzle library as usual: + +./configure + +This is a standard autoconf script, if you're not familiar with it, please +have a look at the INSTALL file. + +Compile the beast: + +make + +Try the built-in tests: + +make check + +If everything looks fine, install the software: + +make install + +If anything goes wrong, please submit a bug report to: + libpuzzle [at] pureftpd [dot] org + + + ------------------------ USAGE ------------------------ + + +The API is documented in the libpuzzle(3) and puzzle_set(3) man pages. +You can also play with the puzzle-diff test application. +See puzzle-diff(8) for more info about the puzzle-diff application. + +In order to be thread-safe, every exported function of the library requires a +PuzzleContext object. That object stores various run-time tunables. + +Out of a bitmap picture, the Puzzle library can fill a PuzzleCVec object : + + PuzzleContext context; + PuzzleCVec cvec; + + puzzle_init_context(&context); + puzzle_init_cvec(&context, &cvec); + puzzle_fill_cvec_from_file(&context, &cvec, "directory/filename.jpg"); + +The PuzzleCvec structure holds two fields: + signed char *vec: a pointer to the first element of the vector + size_t sizeof_vec: the number of elements + +The size depends on the "lambdas" value (see puzzle_set(3)). + +PuzzleCvec structures can be compared: + + d = puzzle_vector_normalized_distance(&context, &cvec1, &cvec2, 1); + +d is the normalized distance between both vectors. If d is below 0.6, pictures +are probably similar. + +If you need further help, feel free to subscribe to the mailing-list (see +below). + + + ------------------------ INDEXING ------------------------ + + +How to quickly find similar pictures, if they are millions of records? + +The original paper has a simple, yet efficient answer. + +Cut the vector in fixed-length words. For instance, let's consider the +following vector: + +[ a b c d e f g h i j k l m n o p q r s t u v w x y z ] + +With a word length (K) of 10, you can get the following words: + +[ a b c d e f g h i j ] found at position 0 +[ b c d e f g h i j k ] found at position 1 +[ c d e f g h i j k l ] found at position 2 +etc. until position N-1 + +Then, index your vector with a compound index of (word + position). + +Even with millions of images, K = 10 and N = 100 should be enough to have very +little entries sharing the same index. + +Here's a very basic sample database schema: + ++-----------------------------+ +| signatures | ++-----------------------------+ +| sig_id | signature | pic_id | ++--------+-----------+--------+ + ++--------------------------+ +| words | ++--------------------------+ +| pos_and_word | fk_sig_id | ++--------------+-----------+ + +I'd recommend splitting at least the "words" table into multiple tables and/or +servers. + +By default (lambas=9) signatures are 544 bytes long. In order to save storage +space, they can be compressed to 1/third of their original size through the +puzzle_compress_cvec() function. Before use, they must be uncompressed with +puzzle_uncompress_cvec(). + + + ------------------------ PUZZLE-DIFF ------------------------ + + +A command-line tool is also available for scripting or testing. + +It is installed as "puzzle-diff" and comes with a man page. + +Sample usage: + +- Output distance between two images: + +$ puzzle-diff pic-a-0.jpg pics-a-1.jpg +0.102286 + +- Compare two images, exit with 10 if they look the same, exit with 20 if +they don't (may be useful for scripts): + +$ puzzle-diff -e pic-a-0.jpg pics-a-1.jpg +$ echo $? +10 + +- Compute distance, without cropping and with computing the average intensity +of the whole blocks: + +$ puzzle-diff -p 1.0 -c pic-a-0.jpg pic-a-1.jpg +0.0523151 + + + ------------------------ COMPARING IMAGES WITH PHP ------------------------ + + +A PHP extension is bundled with the Libpuzzle package, and it provides PHP +bindings to most functions of the library. + +Documentation for the Libpuzzle PHP extension is available in the README-PHP +file. + + + ------------------------ APPS USING LIBPUZZLE ------------------------ + + +Here are third-party projects using libpuzzle: + +* ftwin - http://jok.is-a-geek.net/ftwin.php + ftwin is a tool useful to find duplicate files according to their content on +your file system. + +* Python bindings for libpuzzle: PyPuzzle + https://github.com/ArchangelSDY/PyPuzzle + + + ------------------------ STATUS ------------------------ + + +This project is unfortunately not maintained any more. Pull requests are +always welcome, but I don't use this library any more and I don't have enough +spare time to actively work on it. -- cgit v1.2.3