URL as file specifier - details & scheme code.
Harvey J. Stein
abel@netvision.net.il
Sun, 4 May 1997 21:19:05 +0300
My previous post regarding URLs was a little inaccurate. It's actually 9
parts, and the filename isn't separated from the path. Here're the
details in case anyone decides to follow this, uh, path.
Note, this still doesn't cover:
-Handling %xy as an escape for hex xy.
-Details of relative URL parsing.
-Proxy handling.
Here's my description, as well as the code. It's a little rough
because I first wrote it for STk, then ported it to SCM, and when I
last touched it, I was in the process of porting it to Guile. (Another
argument for a standard scheme library.)
;;; url.scm - URL Parsing library (used by the www package).
;;; Version 0.5.
;;;
;;; Copyright (c) 1995 Harvey J. Stein (hjstein@math.huji.ac.il)
;;; This code is freely usable and distributable as long as this
;;; heading remains.
;;; Usage:
;;; (url:parse-url url &optional parent)
;;; Takes a string (url), and returns a parsed url. One may apply
;;; The general url form handled is:
;;;
;;; service://user:password@host:port/path;parameters?query#anchor
;;;
;;; This function takes an optional second argument - the parent url
;;; (a parsed url). When the 2nd argument is given, the url is
;;; interpreted relative to the parent url.
;;;
;;; If a proxy environment variable is defined for the service of
;;; the given url (i.e. - HTTP_PROXY, FTP_PROXY, etc), then the url
;;; is parsed for being retrieved via the specified proxy. That is,
;;; the parsed url is the same as the parsed url of the
;;; corresponding environment variable, with the filename component
;;; replaced by the url being parsed (in string form). Said *_PROXY
;;; environment variables should be URLs.
;;;
;;; The following functions may be applied to a parsed url:
;;;
;;; (url:unparse-url url)
;;; Returns a string (a fully qualified url) which would parse
;;; into url.
;;; (url:service parsed-url)
;;; Returns the service (i.e. - protocol) (as a symbol), or #f
;;; if none was given.
;;; (url:user parsed-url)
;;; Returns user name, or #f if none was supplied.
;;; (url:password parsed-url)
;;; Returns the password, or #f if none was supplied.
;;; (url:host parsed-url)
;;; Returns the host name, or #f if none was supplied.
;;; Note - the host name can be "" (as in file:///foo/bar).
;;; (url:port-number parsed-url)
;;; The port number in the url, or #f if none was supplied.
;;; (url:filename parsed-url)
;;; The file name (i.e. - path name) of the url, or #f if
;;; none was supplied.
;;; (url:anchor parsed-url)
;;; The anchor in the url, or #f if none was supplied. It can
;;; be "" - as in http://foo.bar.com/file#
;;; (url:parameters parsed-url)
;;; The parameters (#f if not supplied, empty string if
;;; parameter delimiter is supplied, but no parameters are
;;; supplied.
;;; (url:query parsed-url)
;;; The query parameters (#f if not supplied, empty string if
;;; query delimiter is supplied, but no query parameters are
;;; supplied.
;;; (url:through-proxy? parsed-url)
;;; #t iff url has been parsed to be passed through a proxy.
;;; Proxying is handled by parsing the proxy address, and
;;; passing the unparsed url through as the file name.
;;; Overview:
;;; Internet RFC 1808 discusses how to interpret relative URLs. In
;;; doing so, it gives algorithms both for parsing URLs and for
;;; computing relative URLs.
;;;
;;; To parse a URL, they say to follow the following procedure:
;;;
;;; 1. Everything incl & after 1st "#" is the anchor. Of the rest,
;;; 2. Everything incl & before 1st ":" is the scheme, assuming at
;;; least 1 char before ":" & all chars are scheme allowable
;;; [a-zA-Z0-9+.-]. Of the rest,
;;; 3. If it starts with "//", everything up to (but not incl) next
;;; ocurrence of "/" or until end is network location. Of the rest,
;;; 4. everything from 1st "?" until end is the query info. Of the
;;; rest,
;;; 5. everything from 1st ";" until end is the parameters field. Of
;;; the rest,
;;; 6. everything remaining is the path.
;;;
;;; After all this is done, one must remove the "#" from the beginning
;;; of the anchor, the ":" from the end of the scheme, the "//" from
;;; the beginning of the net location, the "?" from the beginning of
;;; the query part, and the ";" from the beginning of the params part.
;;; One typically leaves the "/" on the beginning of the path part,
;;; because it shows up iff the URL is not relative.
;;;
;;; This means that the following regexp should be able to separate
;;; out these 6 basic parts of the url:
;;;
;;; "^([a-zA-Z0-9+.-]+:|)(//[^#/]*|)([^#?;]*|)(;[^?#]*|)(\\?[^#]*|)(#.*|)$"
;;;
;;; Once this is done, the net address must be parsed. I believe this
;;; can be done as follows:
;;;
;;; 1. Everything before & including the 1st "@" is the
;;; username/password part. Of the rest,
;;; 2. Everything after & including the last ":" is the port number,
;;; provided the ":" is only followed by digits.
;;; 3. What's left is the host name.
;;;
;;; Bugs
;;; -We don't handle the standard URL escaping mechanism for passing
;;; special characters, namely, converting %xy to the character
;;; whose ascii code in hex is xy.
;;; -The parsing of http://foo.bar.com is problematic. On the one
;;; hand, there's no file name
;;; become /.
;;; -Proxy junk barely tested...
(load "misc")
;;(require 'string-case)
(define url:*proxy-env-vars*
'((http "HTTP_PROXY")
(ftp "FTP_PROXY")
(wais "WAIS_PROXY")
(gopher "GOPHER_PROXY")))
(define url:parse-url #f)
(define url:unparse-url #f)
(define url:service #f)
(define url:user #f)
(define url:password #f)
(define url:host #f)
(define url:port-number #f)
(define url:filename #f)
(define url:anchor #f)
(define url:parameters #f)
(define url:query #f)
(define url:through-proxy? #f)
(let ()
(define *proxy-servers* ())
;;; Takes a URL as an argument and returns a list containing the
;;; protocol, the host name, and the file name.
(define (parse-url url . parent)
(proxitize (apply relativize (basic-parse-url url) parent)))
(define (basic-parse-url url)
(let* ((base (split url-regexp url))
(up-hp (split up-hp-regexp (safe-list-ref base 1)))
(u-p (split u-p-regexp (safe-list-ref up-hp 0)))
(h-p (split-hp-part (safe-list-ref up-hp 1))))
;; (format #t "basic-parse-url:\n base=~s\n up-hp=~s\n u-p=~s\n h-p=~s\n"
;; base up-hp u-p h-p)
(if (not (and base up-hp u-p h-p))
#f
(let* ((dirty-url (map (lambda (x) (if (string=? x "") #f x))
(append (list (car base))
u-p
h-p
(cddr base))))
(srv (maybe-chop-end (list-ref dirty-url 0)))
(user (maybe-chop-end (list-ref dirty-url 1)))
(pass (maybe-chop-end (list-ref dirty-url 2)))
(host (list-ref dirty-url 3))
(port (maybe-chop-beg (list-ref dirty-url 4)))
(path (list-ref dirty-url 5))
(parm (maybe-chop-beg (list-ref dirty-url 6)))
(quer (maybe-chop-beg (list-ref dirty-url 7)))
(anch (maybe-chop-beg (list-ref dirty-url 8))))
(if (and (not host)
(string? (safe-list-ref base 1))
(> (string-length (safe-list-ref base 1)) 0))
(set! host ""))
(list (if srv (string->symbol (string-downcase srv))
#f)
user
pass
host
(if port (string->number port)
#f)
path
parm
quer
anch
#f)))))
(define (relativize url . parent)
(define (inherit-service)
(set! url (smerge-lists url parent 0)))
(define (inherit-netloc)
(set! url (smerge-lists url parent 1 2 3 4)))
(define (inherit-path)
(set! url (smerge-lists url parent 5)))
(define (inherit-parameters)
(set! url (smerge-lists url parent 6)))
(define (inherit-query)
(set! url (smerge-lists url parent 7)))
(define (smerge-lists url parent . positions)
(define (smerge-aux url parent positions ref)
(cond ((null? positions) url)
((null? url) parent)
((null? parent) url)
((= (car positions) ref)
(cons (car parent)
(smerge-aux (cdr url) (cdr parent) (cdr positions) (+ 1 ref))))
(else
(cons (car url)
(smerge-aux (cdr url) (cdr parent) positions (+ 1 ref))))))
(smerge-aux url parent positions 0))
(define (merge-paths)
(let* ((base (string-append (dirname (filename parent))
(filename url)))
(newpath (expand-file-name base)))
;; (format #t "merge-paths: base=~s, newpath=~s\n" base newpath)
(set! url (smerge-lists url `(serv user pass host port ,newpath) 5))))
(if (not (null? parent)) (set! parent (car parent)))
(cond ((null? parent) url)
((string=? "" (unparse-url parent)) url)
((string=? "" (unparse-url url)) parent)
((service url) url)
((host url) (inherit-service) url)
((and (filename url)
(> (string-length (filename url)) 0)
(char=? (string-ref (filename url) 0) #\/))
(inherit-service) (inherit-netloc) url)
((and (not (filename url))
(parameters url))
(inherit-service) (inherit-netloc) (inherit-path) url)
((and (not (filename url))
(query url))
(inherit-service) (inherit-netloc) (inherit-path) (inherit-parameters)
url)
((not (filename url))
(inherit-service) (inherit-netloc) (inherit-path) (inherit-parameters)
(inherit-query)
url)
(else
(inherit-service) (inherit-netloc) (merge-paths) url)))
(define (proxitize parsed-url)
(let ((proxy (assoc (service parsed-url) *proxy-servers*)))
(cond (proxy
(set! proxy (cadr proxy))
(list (service proxy)
(user proxy)
(password proxy)
(host proxy)
(port-number proxy)
(unparse-url parsed-url)
(parameters proxy)
(query proxy)
(anchor proxy)
#t)) ; Is through proxy.
(else
parsed-url))))
(define (unparse-url url)
(cond ((through-proxy? url)
(filename url))
(else
(let ((srv (service url)))
(if srv (set! srv (symbol->string srv)))
(string-append (maybe-append srv ":")
(string-netaddr url)
(if (filename url) (filename url) "")
(maybe-append ";" (parameters url))
(maybe-append "?" (query url))
(maybe-append "#" (anchor url)))))))
;;; --------- Regexps for parsing ---------------
(define string->regexp regcomp)
;; String for guile:
;; "^\\([a-zA-Z0-9+.-]\\+:\\|\\)\\(//[^#/]*\\|\\)\\([^#?;]*\\|\\)\\(;[^?#]*\\|\\)\\(\\?[^#]*\\|\\)\\(#.*\\|\\)$"))
;; String for stk & scm:
;; "^([a-zA-Z0-9+.-]+:|)(//[^#/]*|)([^#?;]*|)(;[^?#]*|)(\\?[^#]*|)(#.*|)$"))
(define url-regexp
(string->regexp
(list->regexp-string
'(begin
(group (one-or-more (set "a-zA-Z0-9+.-"))
":"
or)
(group ("//" (zero-or-more (set "^#/")))
or)
(group (zero-or-more (set "^#?;"))
or)
(group (";" (zero-or-more (set "^#?")))
or)
(group ("\\?" (zero-or-more (set "^#")))
or)
(group (("#") (zero-or-more "."))
or)
end))))
;; "^/?/?([^@]*@|)(.*)$")) ; Strips off // automatically.
(define up-hp-regexp
(string->regexp
(list->regexp-string
'(begin
(zero-or-one "/")
(zero-or-one "/")
(group (zero-or-more (set "^@"))
"@"
or)
(group (zero-or-more any-char))
end))))
;; "^([^@:]*[:@]|)(.*)$"))
(define u-p-regexp
(string->regexp
(list->regexp-string
'(begin
(group (zero-or-more (set "^@:"))
(set ":@")
or)
(group (zero-or-more any-char))
end))))
;; "^(.*)(:[0-9]+)$"))
(define h-p-test-regexp
(string->regexp
(list->regexp-string
'(begin (group (zero-or-more any-char))
(group ":" (one-or-more (set "0-9")))
end))))
;;; ----------------- Support routines for basic-parse-url ---------------
(define (string-netaddr url)
(let ((prt (port-number url)))
(if prt (set! prt (number->string prt)))
(cond ((password url)
(string-append "//" (user url) ":" (password url)
"@" (host url)
(maybe-append ":" prt)))
((user url)
(string-append "//" (user url) "@" (host url)
(maybe-append ":" prt)))
((host url)
(string-append "//" (host url)
(maybe-append ":" prt)))
(else ""))))
(define (split-hp-part h-p)
(or (split h-p-test-regexp h-p)
(list h-p "")))
(define (maybe-append s1 s2)
(if (and s1 s2) (string-append s1 s2)
""))
(define (safe-list-ref maybe-list index)
(if (and (list? maybe-list)
(< index (length maybe-list)))
(list-ref maybe-list index)
#f))
(define (maybe-chop-end maybe-string)
(if (and (string? maybe-string)
(> (string-length maybe-string) 0))
(substring maybe-string 0 (- (string-length maybe-string) 1))
#f))
(define (maybe-chop-beg maybe-string)
(if (and (string? maybe-string)
(> (string-length maybe-string) 0))
(substring maybe-string 1 (string-length maybe-string))
#f))
;;; ---------- Url access routines -------------
(define (service url)
(list-ref url 0))
(define (user url)
(list-ref url 1))
(define (password url)
(list-ref url 2))
(define (host url)
(list-ref url 3))
(define (port-number url)
(list-ref url 4))
(define (filename url)
(list-ref url 5))
(define (parameters url)
(list-ref url 6))
(define (query url)
(list-ref url 7))
(define (anchor url)
(list-ref url 8))
(define (through-proxy? url)
(list-ref url 9))
;;; ------------ Exports -------------------------
(set! url:parse-url parse-url)
(set! url:unparse-url unparse-url)
(set! url:service service)
(set! url:user user)
(set! url:password password)
(set! url:host host)
(set! url:port-number port-number)
(set! url:filename filename)
(set! url:anchor anchor)
(set! url:through-proxy? through-proxy?)
(define (url:getenv evar)
(false-if-exception (getenv evar)))
;;; -------------- Set up proxy list -----------------
(define (get-proxy-evar evar)
(let ((e (url:getenv evar)))
(if e
(basic-parse-url e)
#f)))
(define *proxy-servers*
(let loop ((l url:*proxy-env-vars*))
(cond ((null? l) ())
(else
(let ((p (get-proxy-evar (cadar l))))
(if p
(cons (list (caar l) p)
(loop (cdr l)))
(loop (cdr l))))))))
)
(provide "url")
;;; misc.scm - Miscellaneous utility routines
;;; Version 0.2.
;;;
;;; Copyright (c) 1995 Harvey J. Stein (abel@netvision.net.il)
;;; This code is freely usable and distributable as long as this
;;; heading remains.
;;; Most importantly includes list->regexp-string to simplify writing
;;; regular expressions. I was going blind trying to convert regexps
;;; first from stk format to scm format, and then from scm format to
;;; guile format.
;;(require 'regex)
(define (split regexp maybe-str)
(if (string? maybe-str)
(apply-matches (regexec regexp maybe-str) maybe-str)
#f))
(define (apply-matches matches string)
(if matches
(map (lambda (m) (substring string (car m) (cdr m)))
(cdr (vector->list matches)))
#f))
(define (dirname f)
(define r (regcomp "^(.*/|)([^/]*)$"))
(car (split r f)))
(define (mkdirpath path perm)
(define (bitset? num bitmask)
(odd? (quotient num bitmask)))
(define (readwrite-dir? path)
(let ((s (stat path)))
(and s
(bitset? (vector-ref (stat path) 2)
#o40000)
(access path "rwx"))))
(let ((dir (dirname path)))
(cond ((readwrite-dir? path)
#t)
((readwrite-dir? dir)
(mkdir path perm))
(else
(if (mkdirpath (substring dir 0 (- (string-length dir) 1))
perm)
(mkdir path perm)
#f)))))
;; Replaces all "/./" with "/", and "foo/../bar" with "bar"
(define (expand-file-name fn)
(define sd$ (regcomp "/\\.$"))
(define sdd$
(regcomp "[^/.]+/\\.\\.$|[^/.]+\\./\\.\\.$|[^/.]+\\.\\./\\.\\.$"))
(define sds (regcomp "/\\./"))
(define sdds (regcomp "[^/]+/\\.\\./"))
(let ((nfn
(let loop ((reduced (string-edit sds
"/"
fn
#t)))
(let ((full-reduced (string-edit sdds
""
reduced
#t)))
(if (string=? full-reduced reduced)
reduced
(loop full-reduced))))))
(string-edit sdd$
""
(string-edit sd$ "/" nfn #t)
#t)))
(define (expand-file-name fn)
(define delim (regcomp "/+"))
(define (do-dots a l)
(cond ((null? l) (reverse a))
((and (string=? (car l) "..")
(not (null? a))
(not (string=? (car a) "..")))
(do-dots (cdr a) (cdr l)))
(else
(do-dots (cons (car l) a)
(cdr l)))))
(define (do-dot l)
(cond ((null? l) ())
((string=? (car l) ".")
(do-dot (cdr l)))
(else
(cons (car l) (do-dot (cdr l))))))
(define (rebuild l)
(cond ((null? l) ())
((null? (cdr l)) l)
(else
(cons (car l)
(cons "/"
(rebuild (cdr l)))))))
(let* ((f (vector->list (string-split delim fn)))
(f' (if (and (not (null? f))
(string=? (car f) ""))
(cdr f)
f))
(efn (rebuild (do-dots () (do-dot f')))))
;; (format #t " f = ~s\n f' = ~s\nefn = ~s\n"
;; f f' efn)
(cond ((and (null? f')
(not (null? f)))
"/")
((and (null? f')
(null? f))
"")
(else
(if (string=? (car f) "")
(set! efn (cons "/" efn)))
(let ((last (car (reverse f'))))
(if (and (or (char=? (string-ref fn (- (string-length fn) 1))
#\/)
(string=? last "..")
(string=? last "."))
(not (and (null? (cdr efn))
(string=? (car efn) "/"))))
(set! efn (append efn '("/")))))))
(apply string-append efn)))
(define (string-downcase s)
(list->string (map char-downcase (string->list s))))
;;; Set these up for your particular scheme regexp package...
(define regexp-start-group "\\(")
(define regexp-end-group "\\)")
(define regexp-start-set "[")
(define regexp-end-set "]")
(define regexp-one-or-more "\\+")
(define regexp-zero-or-more "*")
(define regexp-zero-or-one "\\?")
(define regexp-or "\\|")
(define regexp-begin "^")
(define regexp-end "$")
(define regexp-any-char ".")
(define (list->regexp-string l)
(cond ((null? l) "")
((and (list? l)
(symbol? (car l)))
(case (car l)
((group) (string-append regexp-start-group
(list->regexp-string (cdr l))
regexp-end-group))
((set) (string-append regexp-start-set
(list->regexp-string (cdr l))
regexp-end-set))
((one-or-more) (string-append (list->regexp-string (cdr l))
regexp-one-or-more))
((zero-or-more) (string-append (list->regexp-string (cdr l))
regexp-zero-or-more))
((zero-or-one) (string-append (list->regexp-string (cdr l))
regexp-zero-or-one))
((begin) (string-append regexp-begin
(list->regexp-string (cdr l))))
((end) (string-append regexp-end
(list->regexp-string (cdr l))))
((any-char) (string-append regexp-any-char
(list->regexp-string (cdr l))))
((or) (string-append regexp-or
(list->regexp-string (cdr l))))))
((list? l)
(string-append (list->regexp-string (car l))
(list->regexp-string (cdr l))))
(else
l)))
;;(provide 'misc)