Lynn Quam | 3 Jun 22:30
Favicon

DIRECTORY calling DELETE-DUPLICATES is slow

I waited about 20 minutes for DIRECTORY to finish getting the
pathnames of a large directory containing about 27000 files.  The
problem is that DIRECTORY calls DELETE-DUPLICATES which is implemented
using an order n-squared algorithm!  For large sequences, the use of a
hash-table would make sense, but would cause consing.

I better change would be to the DIRECTORY function itself.  Note that
the result of DELETE-DUPLICATES is passed to SORT.  If the list were
sorted first, then it is easy (order n) to remove duplicates.

Carl Shapiro | 4 Jun 01:52

Re: DIRECTORY calling DELETE-DUPLICATES is slow

On Tue, Jun 3, 2008 at 1:33 PM, Lynn Quam <quam <at> ai.sri.com> wrote:
I better change would be to the DIRECTORY function itself.  Note that
the result of DELETE-DUPLICATES is passed to SORT.  If the list were
sorted first, then it is easy (order n) to remove duplicates.

That would seem like a very reasonable change.  Would you like to write a patch?
Lynn Quam | 5 Jun 18:24
Favicon

Re: DIRECTORY calling DELETE-DUPLICATES is slow

Here is a simple hack to DIRECTORY that avoids doing the n-squared
call to delete-duplicates, by sorting the RESULTS first, then scanning
the sorted list for duplicates.

(defun directory (pathname &key (all t) (check-for-subdirs t)
		  (truenamep t) (follow-links t))
  "Returns a list of pathnames, one for each file that matches the given
   pathname.  Supplying :ALL as nil causes this to ignore Unix dot files.  This
   never includes Unix dot and dot-dot in the result.  If :TRUENAMEP is NIL,
   then symbolic links in the result are not expanded, which is not the
   default because TRUENAME does follow links and the result pathnames are
   defined to be the TRUENAME of the pathname (the truename of a link may well
   be in another directory).  If FOLLOW-LINKS is NIL then symbolic links are
   not followed."
  (flet ((ordered-strings-remove-duplicates (list)
	   (loop for prev = nil then elem
		 for elem in list
		 when (or (null prev) (not (string= elem prev)))
		   collect elem)))
    (let ((results nil))
      (enumerate-search-list
	  (pathname (merge-pathnames pathname
				     (make-pathname :name :wild
						    :type :wild
						    :version :wild
						    :defaults *default-pathname-defaults*)
				     :wild))
	(enumerate-matches (name pathname nil :follow-links follow-links)
	  (when (or all
		    (let ((slash (position #\/ name :from-end t)))
		      (or (null slash)
			  (= (1+ slash) (length name))
			  (char/= (schar name (1+ slash)) #\.))))
	    (push name results))))
      (let ((*ignore-wildcards* t))
	(mapcar #'(lambda (name)
		    (let ((name (if (and check-for-subdirs
					 (eq (unix:unix-file-kind name)
					     :directory))
				    (concatenate 'string name "/")
				    name)))
		      (if truenamep (truename name) (pathname name))))
		(ordered-strings-remove-duplicates (sort results #'string<)))))))

Carl Shapiro | 5 Jun 21:33

Re: DIRECTORY calling DELETE-DUPLICATES is slow

On Thu, Jun 5, 2008 at 9:24 AM, Lynn Quam <quam <at> ai.sri.com> wrote:
Here is a simple hack to DIRECTORY that avoids doing the n-squared
call to delete-duplicates, by sorting the RESULTS first, then scanning
the sorted list for duplicates.

Thanks.  Is it safe to assume that with this change directory performing adequately for you?

Gmane