URL as file specifier - details & scheme code.

Harvey J. Stein abel@netvision.net.il
Sun, 4 May 1997 21:19:05 +0300

My previous post regarding URLs was a little inaccurate.  It's actually 9
parts, and the filename isn't separated from the path.  Here're the
details in case anyone decides to follow this, uh, path.

Note, this still doesn't cover:

   -Handling %xy as an escape for hex xy.
   -Details of relative URL parsing.
   -Proxy handling.

Here's my description, as well as the code.  It's a little rough
because I first wrote it for STk, then ported it to SCM, and when I
last touched it, I was in the process of porting it to Guile.  (Another
argument for a standard scheme library.)

;;; url.scm - URL Parsing library (used by the www package).
;;;           Version 0.5.
;;; Copyright (c) 1995 Harvey J. Stein (hjstein@math.huji.ac.il)
;;; This code is freely usable and distributable as long as this
;;; heading remains.

;;; Usage:
;;; (url:parse-url url &optional parent)
;;;   Takes a string (url), and returns a parsed url.  One may apply
;;;   The general url form handled is:
;;;      service://user:password@host:port/path;parameters?query#anchor
;;;   This function takes an optional second argument - the parent url
;;;   (a parsed url).  When the 2nd argument is given, the url is
;;;   interpreted relative to the parent url.
;;;   If a proxy environment variable is defined for the service of
;;;   the given url (i.e. - HTTP_PROXY, FTP_PROXY, etc), then the url
;;;   is parsed for being retrieved via the specified proxy.  That is,
;;;   the parsed url is the same as the parsed url of the
;;;   corresponding environment variable, with the filename component
;;;   replaced by the url being parsed (in string form).  Said *_PROXY
;;;   environment variables should be URLs.
;;;   The following functions may be applied to a parsed url:
;;;   (url:unparse-url url)
;;;         Returns a string (a fully qualified url) which would parse
;;;         into url.
;;;   (url:service  parsed-url)
;;;         Returns the service (i.e. - protocol) (as a symbol), or #f
;;;         if none was given.
;;;   (url:user parsed-url)
;;;         Returns user name, or #f if none was supplied.
;;;   (url:password parsed-url)
;;;         Returns the password, or #f if none was supplied.
;;;   (url:host parsed-url)
;;;         Returns the host name, or #f if none was supplied.
;;;         Note - the host name can be "" (as in file:///foo/bar).
;;;   (url:port-number parsed-url)
;;;         The port number in the url, or #f if none was supplied.
;;;   (url:filename parsed-url)
;;;         The file name (i.e. - path name) of the url, or #f if
;;;         none was supplied.
;;;   (url:anchor parsed-url)
;;;         The anchor in the url, or #f if none was supplied.  It can
;;;         be "" - as in http://foo.bar.com/file#
;;;   (url:parameters parsed-url)
;;;         The parameters (#f if not supplied, empty string if
;;;         parameter delimiter is supplied, but no parameters are
;;;         supplied.
;;;   (url:query parsed-url)
;;;         The query parameters (#f if not supplied, empty string if
;;;         query delimiter is supplied, but no query parameters are
;;;         supplied.
;;;   (url:through-proxy? parsed-url)
;;;         #t iff url has been parsed to be passed through a proxy.
;;;         Proxying is handled by parsing the proxy address, and
;;;         passing the unparsed url through as the file name.

;;; Overview:

;;; Internet RFC 1808 discusses how to interpret relative URLs.  In
;;; doing so, it gives algorithms both for parsing URLs and for
;;; computing relative URLs.
;;; To parse a URL, they say to follow the following procedure:
;;;   1. Everything incl & after 1st "#" is the anchor.  Of the rest,
;;;   2. Everything incl & before 1st ":" is the scheme, assuming at
;;;      least 1 char before ":" & all chars are scheme allowable
;;;      [a-zA-Z0-9+.-].  Of the rest,
;;;   3. If it starts with "//", everything up to (but not incl) next
;;;      ocurrence of "/" or until end is network location.  Of the rest,
;;;   4. everything from 1st "?" until end is the query info.  Of the
;;;      rest,
;;;   5. everything from 1st ";" until end is the parameters field.  Of
;;;      the rest,
;;;   6. everything remaining is the path.
;;; After all this is done, one must remove the "#" from the beginning
;;; of the anchor, the ":" from the end of the scheme, the "//" from
;;; the beginning of the net location, the "?" from the beginning of
;;; the query part, and the ";" from the beginning of the params part.
;;; One typically leaves the "/" on the beginning of the path part,
;;; because it shows up iff the URL is not relative.
;;; This means that the following regexp should be able to separate
;;; out these 6 basic parts of the url:
;;; "^([a-zA-Z0-9+.-]+:|)(//[^#/]*|)([^#?;]*|)(;[^?#]*|)(\\?[^#]*|)(#.*|)$"
;;; Once this is done, the net address must be parsed.  I believe this
;;; can be done as follows:
;;;   1. Everything before & including the 1st "@" is the
;;;      username/password part.  Of the rest,
;;;   2. Everything after & including the last ":" is the port number,
;;;      provided the ":" is only followed by digits.
;;;   3. What's left is the host name.
;;; Bugs
;;;   -We don't handle the standard URL escaping mechanism for passing
;;;    special characters, namely, converting %xy to the character
;;;    whose ascii code in hex is xy.
;;;   -The parsing of http://foo.bar.com is problematic.  On the one
;;;    hand, there's no file name 
;;;    become /.
;;;   -Proxy junk barely tested...

(load "misc")
;;(require 'string-case)

(define url:*proxy-env-vars*
  '((http   "HTTP_PROXY")
    (ftp    "FTP_PROXY")
    (wais   "WAIS_PROXY")
    (gopher "GOPHER_PROXY")))

(define url:parse-url   #f)
(define url:unparse-url #f)
(define url:service     #f)
(define url:user        #f)
(define url:password    #f)
(define url:host        #f)
(define url:port-number #f)
(define url:filename    #f)
(define url:anchor      #f)
(define url:parameters  #f)
(define url:query       #f)
(define url:through-proxy? #f)

(let ()

(define *proxy-servers* ())
;;; Takes a URL as an argument and returns a list containing the 
;;; protocol, the host name, and the file name.

(define (parse-url url . parent)
  (proxitize (apply relativize (basic-parse-url url) parent)))

(define (basic-parse-url url)
  (let* ((base (split url-regexp url))
	 (up-hp (split up-hp-regexp (safe-list-ref base 1)))
	 (u-p   (split u-p-regexp (safe-list-ref up-hp 0)))
	 (h-p   (split-hp-part (safe-list-ref up-hp 1))))
;;    (format #t "basic-parse-url:\n   base=~s\n   up-hp=~s\n   u-p=~s\n   h-p=~s\n"
;;	    base up-hp u-p h-p)
    (if (not (and base up-hp u-p h-p))
      (let* ((dirty-url (map (lambda (x) (if (string=? x "") #f x))
			     (append (list (car base))
				     (cddr base))))
	     (srv  (maybe-chop-end (list-ref dirty-url 0)))
	     (user (maybe-chop-end (list-ref dirty-url 1)))
	     (pass (maybe-chop-end (list-ref dirty-url 2)))
	     (host (list-ref dirty-url 3))
	     (port (maybe-chop-beg (list-ref dirty-url 4)))
	     (path (list-ref dirty-url 5))
	     (parm (maybe-chop-beg (list-ref dirty-url 6)))
	     (quer (maybe-chop-beg (list-ref dirty-url 7)))
	     (anch (maybe-chop-beg (list-ref dirty-url 8))))
	(if (and (not host)
		 (string? (safe-list-ref base 1))
		 (> (string-length (safe-list-ref base 1)) 0))
	    (set! host ""))
	(list (if srv (string->symbol (string-downcase srv))
	      (if port (string->number port)

(define (relativize url . parent)
  (define (inherit-service)
    (set! url (smerge-lists url parent 0)))

  (define (inherit-netloc)
    (set! url (smerge-lists url parent 1 2 3 4)))

  (define (inherit-path)
    (set! url (smerge-lists url parent 5)))

  (define (inherit-parameters)
    (set! url (smerge-lists url parent 6)))

  (define (inherit-query)
    (set! url (smerge-lists url parent 7)))

  (define (smerge-lists url parent . positions)
    (define (smerge-aux url parent positions ref)
      (cond ((null? positions) url)
	    ((null? url) parent)
	    ((null? parent) url)
	    ((= (car positions) ref)
	     (cons (car parent)
		   (smerge-aux (cdr url) (cdr parent) (cdr positions) (+ 1 ref))))
	     (cons (car url)
		   (smerge-aux (cdr url) (cdr parent) positions (+ 1 ref))))))
    (smerge-aux url parent positions 0))

  (define (merge-paths)
    (let* ((base (string-append (dirname (filename parent))
			       (filename url)))
	   (newpath (expand-file-name base)))
;;      (format #t "merge-paths: base=~s, newpath=~s\n" base newpath)
      (set! url (smerge-lists url `(serv user pass host port ,newpath) 5))))
  (if (not (null? parent)) (set! parent (car parent)))
  (cond ((null? parent) url)
	((string=? "" (unparse-url parent)) url)
	((string=? "" (unparse-url url)) parent)
	((service url) url)
	((host url) (inherit-service) url)
	((and (filename url)
	      (> (string-length (filename url)) 0)
	      (char=? (string-ref (filename url) 0) #\/))
	 (inherit-service) (inherit-netloc) url)
	((and (not (filename url))
	      (parameters url))
	 (inherit-service) (inherit-netloc) (inherit-path) url)
	((and (not (filename url))
	      (query url))
	 (inherit-service) (inherit-netloc) (inherit-path) (inherit-parameters)
	((not (filename url))
	 (inherit-service) (inherit-netloc) (inherit-path) (inherit-parameters)
	 (inherit-service) (inherit-netloc) (merge-paths) url)))

(define (proxitize parsed-url)
  (let ((proxy (assoc (service parsed-url) *proxy-servers*)))
    (cond (proxy
	   (set! proxy (cadr proxy))
	   (list (service proxy)
		 (user proxy)
		 (password proxy)
		 (host proxy)
		 (port-number proxy)
		 (unparse-url parsed-url)
		 (parameters proxy)
		 (query proxy)
		 (anchor proxy)
		 #t))			; Is through proxy.
(define (unparse-url url)
  (cond ((through-proxy? url)
	 (filename url))
	 (let ((srv (service url)))
	   (if srv (set! srv (symbol->string srv)))
	   (string-append (maybe-append srv ":")
			  (string-netaddr url)
			  (if (filename url) (filename url) "")
			  (maybe-append ";" (parameters url))
			  (maybe-append "?" (query url))
			  (maybe-append "#" (anchor url)))))))

;;;  --------- Regexps for parsing ---------------

(define string->regexp regcomp)

;; String for guile:
;;   "^\\([a-zA-Z0-9+.-]\\+:\\|\\)\\(//[^#/]*\\|\\)\\([^#?;]*\\|\\)\\(;[^?#]*\\|\\)\\(\\?[^#]*\\|\\)\\(#.*\\|\\)$"))

;; String for stk & scm:
;;   "^([a-zA-Z0-9+.-]+:|)(//[^#/]*|)([^#?;]*|)(;[^?#]*|)(\\?[^#]*|)(#.*|)$"))

(define url-regexp 
       (group (one-or-more (set "a-zA-Z0-9+.-"))
       (group ("//" (zero-or-more (set "^#/")))
       (group (zero-or-more (set "^#?;"))
       (group (";" (zero-or-more (set "^#?")))
       (group ("\\?" (zero-or-more (set "^#")))
       (group (("#") (zero-or-more "."))

;;   "^/?/?([^@]*@|)(.*)$"))		; Strips off // automatically.
(define up-hp-regexp
       (zero-or-one "/")
       (zero-or-one "/")
       (group (zero-or-more (set "^@"))
       (group (zero-or-more any-char))

;;      "^([^@:]*[:@]|)(.*)$"))
(define u-p-regexp
       (group (zero-or-more (set "^@:"))
	      (set ":@")
       (group (zero-or-more any-char))

;;    "^(.*)(:[0-9]+)$"))
(define h-p-test-regexp
    '(begin (group (zero-or-more any-char))
	    (group ":" (one-or-more (set "0-9")))

;;; ----------------- Support routines for basic-parse-url ---------------
(define (string-netaddr url)
  (let ((prt (port-number url)))
    (if prt (set! prt (number->string prt)))
    (cond ((password url)
	   (string-append "//" (user url) ":" (password url)
			  "@" (host url)
			  (maybe-append ":" prt)))
	   ((user url)
	    (string-append "//" (user url) "@" (host url)
			   (maybe-append ":" prt)))
	   ((host url)
	    (string-append "//" (host url)
			   (maybe-append ":" prt)))
	   (else ""))))

(define (split-hp-part h-p)
  (or (split h-p-test-regexp h-p)
      (list h-p "")))

(define (maybe-append s1 s2)
  (if (and s1 s2) (string-append s1 s2)

(define (safe-list-ref maybe-list index)
  (if (and (list? maybe-list)
	   (< index (length maybe-list)))
      (list-ref maybe-list index)

(define (maybe-chop-end maybe-string)
  (if (and (string? maybe-string)
	   (> (string-length maybe-string) 0))
      (substring maybe-string 0 (- (string-length maybe-string) 1))

(define (maybe-chop-beg maybe-string)
  (if (and (string? maybe-string)
	   (> (string-length maybe-string) 0))
      (substring maybe-string 1 (string-length maybe-string))

;;; ---------- Url access routines -------------

(define (service url)
  (list-ref url 0))

(define (user url)
  (list-ref url 1))

(define (password url)
  (list-ref url 2))

(define (host url)
  (list-ref url 3))

(define (port-number url)
  (list-ref url 4))

(define (filename url)
  (list-ref url 5))

(define (parameters url)
  (list-ref url 6))

(define (query url)
  (list-ref url 7))

(define (anchor url)
  (list-ref url 8))

(define (through-proxy? url)
  (list-ref url 9))

;;; ------------ Exports -------------------------
(set! url:parse-url      parse-url)
(set! url:unparse-url    unparse-url)
(set! url:service        service)
(set! url:user           user)
(set! url:password       password)
(set! url:host           host)
(set! url:port-number    port-number)
(set! url:filename       filename)
(set! url:anchor         anchor)
(set! url:through-proxy? through-proxy?)

(define (url:getenv evar)
  (false-if-exception (getenv evar)))

;;; -------------- Set up proxy list -----------------
(define (get-proxy-evar evar)
  (let ((e (url:getenv evar)))
    (if e
	(basic-parse-url e)

(define *proxy-servers*
  (let loop ((l url:*proxy-env-vars*))
       (cond ((null? l) ())
	      (let ((p (get-proxy-evar (cadar l))))
		(if p
		    (cons (list (caar l) p)
			  (loop (cdr l)))
		  (loop (cdr l))))))))

(provide "url")

;;; misc.scm - Miscellaneous utility routines
;;;           Version 0.2.
;;; Copyright (c) 1995 Harvey J. Stein (abel@netvision.net.il)
;;; This code is freely usable and distributable as long as this
;;; heading remains.

;;; Most importantly includes list->regexp-string to simplify writing
;;; regular expressions.  I was going blind trying to convert regexps
;;; first from stk format to scm format, and then from scm format to
;;; guile format.

;;(require 'regex)

(define (split regexp maybe-str)
  (if (string? maybe-str)
      (apply-matches (regexec regexp maybe-str) maybe-str)

(define (apply-matches matches string)
  (if matches
      (map (lambda (m) (substring string (car m) (cdr m)))
	   (cdr (vector->list matches)))

(define (dirname f)
  (define r (regcomp "^(.*/|)([^/]*)$"))
  (car (split r f)))

(define (mkdirpath path perm)
  (define (bitset? num bitmask)
    (odd? (quotient num bitmask)))

  (define (readwrite-dir? path)
    (let ((s (stat path)))
      (and s
	   (bitset? (vector-ref (stat path) 2)
	   (access path "rwx"))))
  (let ((dir (dirname path)))
    (cond ((readwrite-dir? path)
	  ((readwrite-dir? dir)
	   (mkdir path perm))
	   (if (mkdirpath (substring dir 0 (- (string-length dir) 1))
	       (mkdir path perm)

;; Replaces all "/./" with "/", and "foo/../bar" with "bar"
(define (expand-file-name fn)
  (define sd$ (regcomp "/\\.$"))
  (define sdd$ 
    (regcomp "[^/.]+/\\.\\.$|[^/.]+\\./\\.\\.$|[^/.]+\\.\\./\\.\\.$"))
  (define sds (regcomp "/\\./"))
  (define sdds (regcomp "[^/]+/\\.\\./"))
  (let ((nfn
	 (let loop ((reduced (string-edit sds
	   (let ((full-reduced (string-edit sdds
	     (if (string=? full-reduced reduced)
		 (loop full-reduced))))))
    (string-edit sdd$
		 (string-edit sd$ "/" nfn #t)

(define (expand-file-name fn)
  (define delim (regcomp "/+"))
  (define (do-dots a l)
    (cond ((null? l) (reverse a))
	  ((and (string=? (car l) "..")
		(not (null? a))
		(not (string=? (car a) "..")))
	   (do-dots (cdr a) (cdr l)))
	   (do-dots (cons (car l) a)
		    (cdr l)))))
  (define (do-dot l)
    (cond ((null? l) ())
	  ((string=? (car l) ".")
	   (do-dot (cdr l)))
	   (cons (car l) (do-dot (cdr l))))))

  (define (rebuild l)
    (cond ((null? l) ())
	  ((null? (cdr l)) l)
	   (cons (car l)
		 (cons "/"
		       (rebuild (cdr l)))))))
  (let* ((f (vector->list (string-split delim fn)))
	 (f' (if (and (not (null? f))
		      (string=? (car f) ""))
		 (cdr f)
	 (efn (rebuild (do-dots () (do-dot f')))))
;;    (format #t "  f = ~s\n f' = ~s\nefn = ~s\n"
;;	    f f' efn)
    (cond ((and (null? f')
		(not (null? f)))
	  ((and (null? f')
		(null? f))
	   (if (string=? (car f) "")
	       (set! efn (cons "/" efn)))
	   (let ((last (car (reverse f'))))
	     (if (and (or (char=? (string-ref fn (- (string-length fn) 1))
			  (string=? last "..")
			  (string=? last "."))
		      (not (and (null? (cdr efn))
				(string=? (car efn) "/"))))
		 (set! efn (append efn '("/")))))))

    (apply string-append efn)))

(define (string-downcase s)
  (list->string (map char-downcase (string->list s))))

;;; Set these up for your particular scheme regexp package...
(define regexp-start-group "\\(")
(define regexp-end-group "\\)")
(define regexp-start-set "[")
(define regexp-end-set "]")
(define regexp-one-or-more "\\+")
(define regexp-zero-or-more "*")
(define regexp-zero-or-one "\\?")
(define regexp-or "\\|")
(define regexp-begin "^")
(define regexp-end "$")
(define regexp-any-char ".")

(define (list->regexp-string l)
  (cond ((null? l) "")
	((and (list? l)
	      (symbol? (car l)))
	 (case (car l)
	   ((group) (string-append regexp-start-group
				   (list->regexp-string (cdr l))
	   ((set)   (string-append regexp-start-set
				   (list->regexp-string (cdr l))
	   ((one-or-more) (string-append (list->regexp-string (cdr l))
	   ((zero-or-more) (string-append (list->regexp-string (cdr l))
	   ((zero-or-one) (string-append (list->regexp-string (cdr l))
	   ((begin) (string-append regexp-begin
				   (list->regexp-string (cdr l))))
	   ((end) (string-append regexp-end
				 (list->regexp-string (cdr l))))
	   ((any-char) (string-append regexp-any-char
				      (list->regexp-string (cdr l))))
	   ((or) (string-append regexp-or
				(list->regexp-string (cdr l))))))
	((list? l)
	 (string-append (list->regexp-string (car l))
			(list->regexp-string (cdr l))))

;;(provide 'misc)