File   : TODO
Author : Richard A. O'Keefe
SCCS   : "@(#)97/11/20 TODO    1.5"
About  : Known problems and ideas for the future.


PRIORITY ONE:

    Replace 'long int' with a suitable typedef to make the program
    32-bit ready.  The program will _work_ on 64-bit systems; that's
    not a problem, but it is _advertised_ as doing 32-bit arithmetic,
    and currently on 64-bit systems it doesn't.  The change was
    backed out, because perhaps it is the advertisement that should
    change; maybe it should say 'arithmetic is done using the native
    `long' type from C'?
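
    A minimal sketch of the typedef approach, assuming a C99 compiler
    with <stdint.h> (something pdm4 does not currently require).  The
    names m4int, wrap32, m4_add and m4_mul are hypothetical; they are
    not part of pdm4.

```c
#include <stdint.h>

/* Hypothetical name for the arithmetic type used by eval(). */
typedef int32_t m4int;

/* Map a 32-bit pattern to its two's-complement signed value
   without relying on implementation-defined conversions. */
static m4int wrap32(uint32_t u)
{
    if (u <= (uint32_t)INT32_MAX)
        return (m4int)u;
    return (m4int)(-1 - (int32_t)(UINT32_MAX - u));
}

/* Addition that wraps at 32 bits on any host. */
m4int m4_add(m4int a, m4int b)
{
    return wrap32((uint32_t)a + (uint32_t)b);
}

/* Multiplication that wraps at 32 bits on any host; the product is
   formed in 64 bits and then truncated. */
m4int m4_mul(m4int a, m4int b)
{
    return wrap32((uint32_t)((uint64_t)(uint32_t)a * (uint32_t)b));
}
```

    With something like this, eval() would give the same answers on
    32-bit and 64-bit hosts, which is what the advertisement promises.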


Good things from GNU m4:

    -G         (instead of compile-time EXTENDED option)

    -I, $M4PATH        (but how to do this portably across all of unix, msdos,
                macos, vms, cms, mvs?  Hmmm.  We don't _have_ to do it
                on all of them, after all...)

    indir(macro,arg1,...,argn)
               indirect macro call when the macro name is not an identifier.
               This is actually programmable in pdm4 as it stands,
               given one spare identifier:
               define(indir, `define(`_indir',`$1')_indir($[2@$])')
               You'll find this in indir.m4.

    Suppress expansion for built-in macros if they really need an
    argument and there isn't one.  The list should include
       define          undefine
       pushdef         popdef
       include         sinclude
       paste           spaste
       substr          index
       incr            decr
       eval            ifdef
       divert          undivert
    and perhaps some others.
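
    One way the check could look, as a sketch only.  The table holds
    exactly the names listed above; needs_argument() and
    should_expand() are hypothetical names, and how pdm4 would learn
    whether a '(' follows is assumed rather than shown.

```c
#include <string.h>

/* Built-ins from the list above that really need an argument. */
static const char *const need_arg[] = {
    "define", "undefine", "pushdef", "popdef", "include", "sinclude",
    "paste", "spaste", "substr", "index", "incr", "decr",
    "eval", "ifdef", "divert", "undivert"
};

/* Return 1 if `name` is one of the built-ins in the table, else 0. */
int needs_argument(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof need_arg / sizeof need_arg[0]; i++)
        if (strcmp(name, need_arg[i]) == 0)
            return 1;
    return 0;
}

/* Decide whether to expand a built-in: has_paren is nonzero when
   the name is immediately followed by '('. */
int should_expand(const char *name, int has_paren)
{
    return has_paren || !needs_argument(name);
}
```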

    debugfile([Filename]) is basically an freopen() on stderr;
    it's doable and might be useful.

Things from System V M4:

       -s      emit #line directives to maintain line sync

               (a) This really needs to be parameterised so that some
                   other string can be emitted.
               (b) It would take a major rewrite to make this easy.

       -Bn     set the pushback buffer to n bytes AND
               set the argument string area to n bytes.
               System V M4 does _not_ let them be set independently,
               which is a pity, because the pushback buffer probably
               wants to be bigger than the argument string area.
               The System V M4 default is 4096.

       -Hn     set the symbol table to n buckets; default 199.
               System V M4 requires the size to be prime; we have
               no reason to care what the size is.

       -Sn     set the call stack to n slots.  You need 1 per
               argument and (pdm4: 4, svr4: 3) per call.
               System V M4 has a default of 100.

       -Tn     set the token buffer to n bytes;
               System V M4 has a default of 512.  In this M4,
               it only affects the possible size of a macro name,
               and has no other uses.

               PDM4 sets all these areas at compile time, when you
               select -DSMALL (suitable for a PDP-11/60),
               -DMEDIUM (suitable for "small" model on an old PC),
               or neither (suitable for "large" model on an old PC).
               It wouldn't be enormously hard to make these settable,
               but it would also not be hard to do the _right_ thing,
               which is to set an over-all memory limit and just let
               these areas expand automatically.
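
               A sketch of the "over-all limit" idea, under stated
               assumptions: the Area type, area_reserve(), the 1 MiB
               default and the doubling policy are all hypothetical
               and only illustrate how several areas could share one
               budget.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping: one budget shared by every area. */
static size_t mem_limit = 1024 * 1024;  /* assumed over-all limit */
static size_t mem_used  = 0;            /* bytes handed out so far */

typedef struct {
    char  *base;   /* start of the area, NULL until first use */
    size_t size;   /* bytes currently allocated */
} Area;

/* Ensure `a` holds at least `need` bytes.  Returns 0 on success,
   -1 if the over-all limit would be exceeded or realloc() fails;
   on failure the area is left unchanged. */
int area_reserve(Area *a, size_t need)
{
    size_t newsize;
    char *p;

    if (need <= a->size)
        return 0;
    if (need > mem_limit)
        return -1;
    newsize = a->size ? a->size : 64;
    while (newsize < need)
        newsize *= 2;                   /* grow geometrically */
    if (newsize - a->size > mem_limit - mem_used)
        return -1;
    p = realloc(a->base, newsize);
    if (p == NULL)
        return -1;
    mem_used += newsize - a->size;
    a->base = p;
    a->size = newsize;
    return 0;
}
```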

       traceon traceon(m1,...,mn) should switch tracing on for the
               macros named in its arguments; traceon() should
               switch tracing on for all macros.

       traceoff like traceon(), but switches tracing off.

               The main issue here is to figure out what tracing should
               _do_.  There are four obvious "ports":
               1. after name recognition, before argument gathering
               2. after argument gathering
               3. after expansion, before rescanning
               4. after rescanning
               With pdm4's internal architecture, it would take a major
               rewrite to catch port 4.  Port 1 is not of much interest.
               So I suggest that ports 2 and _maybe_ 3 should be caught.
               Port 3 can deal with a very large amount of text, so just
               listing arrival at port 2 would probably be a useful step.

Cleanup still to be done:

    use sysexits.h
       I am no longer sure that this is useful.

    lock macro definitions so that
       define(a, 1)
       a(undefine(`a'))
    won't misbehave.  This is the only known major bug remaining.
    Oddly enough, System V M4 and GNU M4 seem to have the same problem.
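
    A sketch of one possible locking scheme, assuming a definition
    record that can carry a lock count.  The Def type and the three
    functions are hypothetical; pdm4's real symbol table is laid out
    differently, and unlinking the name from the table is not shown.

```c
#include <stdlib.h>

/* Hypothetical definition record. */
typedef struct Def {
    char *text;      /* replacement text (heap allocated) */
    int   locks;     /* number of calls currently using this record */
    int   doomed;    /* set when undefine() arrives while locked */
} Def;

/* Called when a macro name has been recognised and its arguments
   are about to be gathered. */
void def_lock(Def *d)
{
    d->locks++;
}

/* undefine(): free at once if idle, otherwise only mark the record.
   Returns 1 if the record was freed, 0 if freeing was deferred. */
int def_undefine(Def *d)
{
    if (d->locks > 0) {
        d->doomed = 1;
        return 0;
    }
    free(d->text);
    free(d);
    return 1;
}

/* Called when the call has been expanded; frees a doomed record
   once its last user is gone.  Returns 1 if the record was freed. */
int def_unlock(Def *d)
{
    if (--d->locks == 0 && d->doomed) {
        free(d->text);
        free(d);
        return 1;
    }
    return 0;
}
```

    In the a(undefine(`a')) example, the record for a would be locked
    before its argument is gathered, so the inner undefine only marks
    it, and the text is still there when a is finally expanded.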


    There was a week-long maintenance blitz on pdm4 in November 1997.
    This has resulted in a lot of changes, which means that the other
    ports (macos, msdos, vms, os2, ibmvm) need retesting.


Other things I want:

    New command line options:

       -U-     Remove all built in definitions.
               Since we can do -D'id=$!nn', this is useful; we can
               put back exactly the macros we want.

       -w {i|w|e}
               Warnings: ignore, warn, or error.  Of course, for this
               to be useful, we need to check for more warnings first!

       -N n    Set the number of diversions to be n instead of 9 (n is
               the new maximum diversion number).  This will never be
               redundant because, to retain backwards compatibility,
               out-of-range divnums need to be mapped to the bit-bucket.

       -k {o|v|g|p}
               Kompatibility:  with old (v7) m4, with UNIX system V,
               with GNU m4, and with PD m4 respectively.  Perhaps a
               better way to go would be to have compatibility files,
               e.g. v7compat.m4.

    16-bit cleanliness

    Merge the chrsave stack and the pushback buffer so there is only
    one character array:
       +---------....-------------+
       |stack-->       <--pushback|
       +---------....-------------+
    This would also eliminate the `token' array, so there would be one
    array (with one size to worry about) instead of three.  We'd even
    save on copying, because a macro name would be stored (in inspect)
    in _exactly_ the right place for a strsave().
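
    The picture above could be realised roughly as follows.  This is
    a sketch only: AREA_SIZE, chr_save(), push_back() and
    next_pushed() are hypothetical names, and the fixed size is just
    for illustration.

```c
#include <stddef.h>

#define AREA_SIZE 4096          /* assumed size, for illustration only */

static char   area[AREA_SIZE];
static size_t sp = 0;           /* next free slot of the stack (grows up) */
static size_t bp = AREA_SIZE;   /* first used pushback slot (grows down) */

/* Save one character on the stack end; -1 if the two ends have met. */
int chr_save(int c)
{
    if (sp == bp)
        return -1;
    area[sp++] = (char)c;
    return 0;
}

/* Push one character back onto the input; -1 if the two ends have met. */
int push_back(int c)
{
    if (sp == bp)
        return -1;
    area[--bp] = (char)c;
    return 0;
}

/* Read the most recently pushed-back character, or -1 if none. */
int next_pushed(void)
{
    if (bp == AREA_SIZE)
        return -1;
    return (unsigned char)area[bp++];
}
```

    Overflow is then a single test (the two indices meeting) rather
    than one test per array.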


    exponentiation in eval(), maybe?  x**n, x^^n raise x to the nth power.
    Not really very useful when only integers are available.  sqr(),
    copied from Pascal, _is_ available.
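
    If exponentiation were added, repeated squaring would do.  A
    sketch, assuming n >= 0 and that silently wrapping on overflow is
    acceptable; ipow() is a hypothetical name.

```c
/* Raise x to the nth power by repeated squaring.
   Assumptions: n >= 0; overflow is not reported.  The arithmetic is
   done in unsigned long so that overflow wraps instead of being
   undefined; converting an out-of-range result back to long is
   implementation-defined. */
long ipow(long x, long n)
{
    unsigned long result = 1;
    unsigned long base = (unsigned long)x;

    while (n > 0) {
        if (n & 1)
            result *= base;     /* this bit of n contributes base */
        base *= base;           /* base becomes x^(2^k) */
        n >>= 1;
    }
    return (long)result;
}
```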

    m4expand() can be done more efficiently than m4expand.m4 does it.
    This could go with built-in support for `indir' and `builtin'.

    m4xargs(n, body, e11, ..., e1n, ..., em1, ..., emn)
       evaluates body m times, the ith time with $1..$n bound to ei1..ein.

    defmatch(name, regexp1, body1, ... regexpn, bodyn, default)
       compiles a matching procedure name(string) which will match
       the string against each regexpi in turn, and when one matches,
       will evaluate bodyi with $1...$n bound to the substrings you'd
       expect to be in \1...\n and with $0 bound to the substring
       matched by the whole pattern.  An important difference between
       this and what's in GNU m4 is that this can _compile_ the
       regexps and the defined macro uses the compiled versions.

    defstring(Left[,Body[,Right[,Escape[,...]]]])
       There is a major problem with changecom and changequote:
       there can only be ONE kind of comment at a time;
       there can only be ONE kind of quote at a time.
       Now, when processing C++ or C9x, you want THREE kinds of
       comments in the same language:
               /*...*/
               //...\n
               #...\n
       If you are processing an XBASE language, you want even more.
       And it can be handy occasionally to have more than one set of
       quotes.  Just as defquote() was based on my qtclib, so this
       would be based on my esclib.  Left would be the left quote,
       Right, defaulting to Left, would be the corresponding right
       quote, Escape would be the escape character (if any), and so
       on.  When Left was found, the body of the string would be
       built up and then Body would be expanded.  Within Body,
       $0 would be Left (like define)
       $1 would be the processed string
       $2 would be Right.
       To strip PL/I-style comments:
           defstring(/*,`',*/)
       To pass BCPL-style comments to the current output stream:
           defstring(//,`$"$0$1$2$"',`
           ')
       To process quotes a la standard m4,
           defstring(<,`$:$1$:',>)
       except that the $: hack isn't there yet to support it.

       To make this work, we need to change the main loop of macro()
       to search a trie.  That's tedious rather than difficult.
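
       A sketch of the kind of trie that could be searched, assuming
       one node per byte and a small integer id per registered Left
       string.  Trie, trie_new(), trie_add() and trie_match() are
       hypothetical names; nothing like this exists in pdm4 yet.

```c
#include <stdlib.h>

/* One trie node per prefix of a registered Left delimiter. */
typedef struct Trie {
    struct Trie *next[256];   /* child for each possible byte */
    int          id;          /* delimiter id, or 0 if none ends here */
} Trie;

/* Allocate an empty node (assumes all-bits-zero is a null pointer,
   which holds on the usual hosts).  Returns NULL if out of memory. */
Trie *trie_new(void)
{
    return calloc(1, sizeof(Trie));
}

/* Register delimiter `s` under a nonzero id.
   Returns 0 on success, -1 if out of memory. */
int trie_add(Trie *t, const char *s, int id)
{
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;

        if (t->next[c] == NULL && (t->next[c] = trie_new()) == NULL)
            return -1;
        t = t->next[c];
    }
    t->id = id;
    return 0;
}

/* Find the longest registered delimiter that is a prefix of `input`.
   Returns its id and stores its length in *len, or returns 0 with
   *len set to 0 when nothing matches. */
int trie_match(const Trie *t, const char *input, size_t *len)
{
    int best = 0;
    size_t i;

    *len = 0;
    for (i = 0; input[i] != '\0'; i++) {
        t = t->next[(unsigned char)input[i]];
        if (t == NULL)
            break;
        if (t->id != 0) {
            best = t->id;
            *len = i + 1;
        }
    }
    return best;
}
```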

    m4time()

       We need a good way to handle time-stamps.  (If we had one, we could
       pretty much build make() in m4, and we could provide time-stamps
       in output files, which is a really good idea.)
       I have designed an interface, but not an implementation.

       In the rules given below, words will be accepted in either case;
       only the letters shown in capitals will actually be checked.

       m4time(<base> <offset>* [<format>])

       <base> ::=
           Built                       # when m4 was built
           Atime,<file>                # UNIX atime of File
           Ctime,<file>                # UNIX ctime of File
           Mtime,<file>                # UNIX mtime of File
           Now                         # Right this second
           Today                       # Midday today
           Given,<8601 time>           # A time in restricted ISO 8601 form

           <8601 time> ::= yyyy [mm [dd [hh [mm [ss [.ssssssss]]]]]]

           Non-digit characters may separate groups but are otherwise
           completely ignored.  A group may be short if and only if
           there is a non-digit character separating it from other
           groups.  One normally expects a timestamp like
               1997.11.19T18:48:34.043
           If a group is omitted, it defaults to the smallest legal
           value (mm dd => 1, hh mm ss => 0).
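
           The grouping and default rules above might be coded along
           these lines.  This is a sketch under assumptions: Stamp
           and parse_8601() are hypothetical names, the fractional
           seconds are simply ignored, and no range checking is done.

```c
#include <ctype.h>

/* Fields of a restricted ISO 8601 time (fraction not kept). */
typedef struct {
    int year, month, day, hour, minute, second;
} Stamp;

/* Parse "yyyy [mm [dd [hh [mm [ss]]]]]".  Non-digits between groups
   are skipped; a group ends early at a non-digit.  Omitted groups
   take their smallest legal value.  Returns 0 on success, -1 if
   there is no year at all. */
int parse_8601(const char *s, Stamp *t)
{
    static const int width[6] = { 4, 2, 2, 2, 2, 2 };
    int field[6] = { 0, 1, 1, 0, 0, 0 };     /* defaults */
    int i, n;

    for (i = 0; i < 6; i++) {
        while (*s != '\0' && !isdigit((unsigned char)*s))
            s++;                             /* separators are ignored */
        if (*s == '\0')
            break;                           /* remaining groups default */
        field[i] = 0;
        for (n = 0; n < width[i] && isdigit((unsigned char)*s); n++)
            field[i] = field[i] * 10 + (*s++ - '0');
    }
    if (i == 0)
        return -1;
    t->year   = field[0];
    t->month  = field[1];
    t->day    = field[2];
    t->hour   = field[3];
    t->minute = field[4];
    t->second = field[5];
    return 0;
}
```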

           Some thought needs to be given to mapping Atime, Ctime, Mtime
           to other operating systems before this is implemented.
           Given accepts [

       <offset> ::=
           + <expr>, <unit>
           - <expr>, <unit>
           Next,<when>                 # the earliest <when> > now
           Previous,<when>             # the latest <when> < now

           <unit> ::=
               Year[s]
               mOnth[s]
               MOnth[s]
               Fortnight[s]
               Week[s]
               Day[s]
               Hour[s]
               mInute[s]
               MInute[s]
               Second[s]
               MSec

           <when> ::= {mm|*} [{dd|*} [{hh|*} [{ss [.ssssssss] | *}]]]
                   |  {SUnday|Monday|Tuesday|Wednesday|THursday
                      |Friday|SAturday} [{hh|*} [{ss [.ssssssss] | *}]]
                   |  <unit>

           If a group is omitted, it defaults to the smallest legal
           value (dd => 1, hh mm ss => 0).  * means any value.  Think
           of advancing or reversing a clock until a time matching
           the given pattern is reached.

       <format> ::=
           <C strftime string>

           If the format is omitted (indicated by there being an odd
           number of arguments) the default is `%Y.%m.%dT%H:%M:%S'.
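
           Applying the default is a thin wrapper around strftime().
           A sketch, in which format_stamp() is a hypothetical name
           and a NULL format stands for "omitted":

```c
#include <stddef.h>
#include <time.h>

/* Format *tm with `format`, or with the default given in the text
   when format is NULL.  Returns the number of characters written
   (not counting the NUL), or 0 if the result does not fit. */
size_t format_stamp(char *out, size_t size, const char *format,
                    const struct tm *tm)
{
    if (format == NULL)
        format = "%Y.%m.%dT%H:%M:%S";
    return strftime(out, size, format, tm);
}
```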

       For example, if we wanted to say something like
       ``midnight after a file has existed for a full 99 days'',
       m4time(ctime,File,+99,days,next,*.*t00:00)
       would do the trick.

    m4file()

    We need a portable way of dealing with files.  This involves
    dealing with directories, making temporary files, parsing and
    unparsing files, &c.  I did something similar for Quintus Prolog,
    based on Common Lisp.  This is a bit spiffed up, dealing better
    with VMS, CMS, and URLs.

    Part      UNIX  MSDOS VMS   MacOS CMS   MVS   URLs
    Protocol  N/A   N/A   N/A   N/A   N/A   N/A   pro:
    Host      //h   //h   h::   N/A   N/A   N/A   //h
    Device    N/A   dev:  dev:  dev:  ..fm  ???   :port
    Abs a     /dir  /dir  [dir] dir:  ???   ???   /dir
    Abs r0    dir   dir   [.dir :dir  ???   ???   N/A
    Abs r1    ../   ../   [-.   ::dir ???   ???   N/A
    Abs uUser ~User ???   N/A   N/A   ???   ???   ~User
    Path      d/d   d/d   [d.d] d:d   ???   ???   d/d
    Name      14+   8     31    31    fn..  ???   18
    Extension .e    .e    .e    .e    .ft.  ???   .e
    Version   N/A   N/A   ;v    N/A   N/A   ???   N/A
    Member    (m)   N/A   (m)   N/A   (m)   ???   #ref

       Note: VMS used to allow <d.d.d> as well as [d.d.d]
             MSDOS allows /d/d/ as well as \d\d\
             MacOS doesn't really have the `extension' concept,
             but things like C compilers usually use it anyway.
             Treating a CMS `file mode' as a device isn't quite
             right (it's that and more) but it's not too far
             wrong.  MVS has generation and cycle numbers (version
             numbers) and members; should the dots in a file name
             be treated as directory-related?

       m4file(Append-ok?,<file>)
               => 0 (no) | 1 (yes)

       m4file(Backup,<file>)
               => <file> modified so that renaming <file> to the result
                   will give it a backup name.  Could be .xxx -> .BAK,
                  or twiddle a version number, or use emacs convention,
                  or ...
       m4file(Change,<file>,to,<file>)
               => empty; the file has its name changed.

       m4file(Delete,<file>)
               => empty; the file is deleted

       m4file(Exists?,<file>)
               => 0 (no) | 1 (yes)

       m4file(Format,<file>)
               => F<n>         Fixed length records of n bytes each
                | V<n>         Variable length records, up to n bytes each
                | CR           Stream-CR (Mac) format
                | CRLF         Stream-CRLF (DOS) format
                | LF           Stream-LF (UNIX) format
                | DIR          Directory
                | U            Unknown

       m4file(Group,<file>)
               => file's group as a word if possible, as a decimal number
                  if not (a UNIX-ism; makes sense on VMS)

       m4file(Input-ok?,<file>)
               => 0 (no) | 1 (yes)

       m4file(Kopy,<file>,as,<file>)
               => empty; the file is copied.  Sorry about the spelling.

       m4file(Link,<file>,to,<file>{,Hard|Soft})
               => empty; a new link is created.  Hard is the UNIX
               default.  Soft is the MacOS default, and I believe the
               Windows default.  Error if it can't be done, of course.

       m4file(Mkdir,<dir>)
               => empty; a directory is created.  It is not an error
               if the directory already exists.

       m4file(Name,<file> {, <modification>}* [, <part>])
               => a revised file name; see below.

       m4file(Output-ok?,<file>)
               => 0 (no) | 1 (yes)

       m4file(Paste,<file>)
               => copies the contents of the file to the current output
               stream
               define(spaste,
                `ifelse(m4file(input-ok?,`$1'),1,
                         `m4file(`paste',`$1')')')

       m4file(Rmdir,<dir>)
               => empty; the directory is removed.  Only allowed if there
               are no entries in the directory.

       m4file(Size,<file>[,{Characters|Lines|Records|Blocks|Kilobytes}])
               => the size of the file, in decimal, measured in the
               given units.  The default is Characters.

       m4file(Tty?,<file>)
               => 0 (no) | 1 (yes, the file is a terminal)

       m4file(User,<file>)
               => file's owner as a word if possible, as a decimal number
                  if not (a UNIX-ism; makes sense on VMS)

       m4file(Void,<file>)
               => 0 (no) | 1 (yes, the file is empty)
               NB: this may be cheaper than eval(m4file(Size,<file>)==0)

       m4file(Wild,<pattern>)
               => a quoted comma separated list of file names matching
               <pattern>.  <pattern> should be matched appropriately
               for the current system, I guess.  Modern UNIX systems
               have a glob() function, in POSIX.2, that can be used.
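
               On a POSIX system the Wild operation could lean on
               glob() roughly as follows.  This is a sketch only:
               wild_list() is a hypothetical name, the output is
               quoted with the default `...' quotes, and file names
               that themselves contain quote characters are not
               handled.

```c
#include <glob.h>
#include <stdio.h>

/* Write the names matching `pattern` to `out` as `a',`b',...
   Returns the number of matches (0 if none), or -1 on a glob()
   error or if `out` is too small. */
int wild_list(const char *pattern, char *out, size_t size)
{
    glob_t g;
    size_t i, used = 0;
    int rc, count;

    if (size == 0)
        return -1;
    out[0] = '\0';
    rc = glob(pattern, 0, NULL, &g);
    if (rc != 0) {
        globfree(&g);
        return rc == GLOB_NOMATCH ? 0 : -1;
    }
    for (i = 0; i < g.gl_pathc; i++) {
        int n = snprintf(out + used, size - used, "%s`%s'",
                         i > 0 ? "," : "", g.gl_pathv[i]);

        if (n < 0 || (size_t)n >= size - used) {
            globfree(&g);
            return -1;                  /* output would be truncated */
        }
        used += (size_t)n;
    }
    count = (int)g.gl_pathc;
    globfree(&g);
    return count;
}
```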

    Name processing is really based on Interlisp-D, with a dash of
    Common Lisp, modified to cope with a few operating systems neither
    was apparently concerned with.

       <modification> ::= <part>,<replacement>
                       |  Parent
                       |  Child,<subdir>

       <part> ::= <simple part>
                |  <derived part>
                |  <part>+<part>

       <simple part> ::=
               Host
               Device
               Tree    (a | r<n> | u<user>) [/d/d | \d\d | .d.d | :d:d &c]
                       covers both Abs and Path
               Name
               Extension
               Version
               Member

       <derived part> ::=
               Origin          = host+device
               Site            = host+device+tree
               Base            = host+device+tree+name
               File            = name+extension
               Root            = host+device+(a/)
               All             = everything
               Item            = Member if the base has a member,
                               = Name otherwise.

       In a modification, Parent moves up a level in the directory
       tree, Child,s moves down a level to subdirectory s (it must
       be a subdirectory, not a file; use File for that), and a
       <part> implies replacing the corresponding components of
       the file name so far with the parts to be found in <replacement>.

       At the end, if a <part> is listed without a <replacement>,
       just that information is returned.  The default is everything.
       Of course, there are obvious nasty problems about empty
       components -vs- missing components.  In UNIX, 'a.' and 'a'
       are different file names, although oddly enough
       '///x' at the start of a file _is_ equivalent to '/x'.
       Some of this still needs sorting out.  The most pressing
       cases are the extension, where we can rule that `' means no
       extension, while any extension that is actually present will
       begin with `.', and similarly the version, which if present
       must begin with ';'.  The details need spelling out, but
       then, what is there about m4 that doesn't?
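
       For the UNIX case, the extension rule could be sketched like
       this.  split_extension() is a hypothetical name, and treating
       a leading dot (as in '.profile') as part of the name rather
       than as an extension is an assumption of the sketch, not
       something settled above.

```c
#include <string.h>

/* Split a UNIX file name (no directory part) into name and
   extension.  Returns a pointer to the extension, which begins with
   '.', or to the terminating NUL (that is, "") when there is no
   extension.  *namelen is set to the length of the name part. */
const char *split_extension(const char *file, size_t *namelen)
{
    const char *dot = strrchr(file, '.');

    if (dot == NULL || dot == file)     /* "a" or ".profile" */
        dot = file + strlen(file);
    *namelen = (size_t)(dot - file);
    return dot;
}
```

       Under this rule 'a' has the extension `' (missing) while 'a.'
       has the extension `.' (present but empty), so the two names
       stay distinct.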

       For example, suppose we have an M4 library.  On a UNIX system,
       library element X might be the file '/opt/local/m4/lib/X.m4'.
       On a VM/CMS system, it might be 'LIBRARY M4 A (X)'.  So a
       single file could do
       define(libfile,
           ifdef(unix,  /opt/local/m4/lib/XXXX.m4,
                 msdos, C:/M4/LIB/xxxx.M4,
                 vms,   M4LIB$:xxxx.M4,
                 macos, m4disc:library:XXXX.m4,
                 ibmvm, LIBRARY M4 (xxxxx) A,
                        XXXX.m4))dnl
       Then we could load a particular file by doing
           include(m4file(name,libfile,item,fred))

       A file that wants to include the fred.m4 file in the jim
       subdirectory of the directory containing the first file
       could do
           include(m4file(name,m4file(name,__file__,site),
                               child,jim,file,fred.m4))
       At first sight, this is a little clumsy, but compared with what?
       How would _you_ do it?

       My knowledge of CMS just predates the release when they added
       directories.  I did read the manuals for the next release, but
       never used it, so have forgotten what directories look like in
       CMS.  I would appreciate advice about the
           Windows NT file system,
           OS/2 file system,
           OpenVMS file system (is it still like VMS 5.x?)
           CMS file system
           MVS file system

    The weird thing is that m4time and m4file would probably dwarf the
    rest of m4.

