Up for discussion: User-provided commands for reading shell input #178


Draft: jpco wants to merge 28 commits into master from the palloc branch

Conversation

@jpco (Collaborator) commented Mar 27, 2025

Another wacky change from the chicken ranch. I think this one is pretty exciting, though.

Here's what's included:

  1. This PR adds a $&readline primitive.

The primitive itself isn't terribly interesting; it does essentially what you'd expect: it takes a single optional prompt as its argument, and its return semantics are the same as $&read's. It also interacts with the readline history primitives and $&resetterminal as you would expect. However, it also (sort of) supports user-programmable file completion -- by uncommenting readline.c:218 and defining the following function, you can try it out:

fn %complete in {
  let (f = $in^*) if {access $f} {
    result $f
  } {
    result ()
  }
}

That's novel and interesting, but not quite enough on its own to justify a whole new primitive. So what's the point of this primitive?

  2. This PR modifies $&parse to take a "reader command" which it calls to fetch input.

So instead of $&parse $prompt, we do something like $&parse {$&readline $prompt} (what it really looks like is just below). $&parse calls the reader command at least once in order to fetch input to parse, but the exact number of calls depends on the input itself, which is why it needs to be modeled this way. (In theory we could build a push-style $&parse, where we fetch input, feed it to $&parse, and then repeat depending on its output, but this is a more straightforward change from the prior state of things.) When $&parse gets an empty-list return value from its reader command, it interprets that as EOF and throws an eof exception.

So that's why we added the $&readline primitive: it's now how we use readline in the shell!

Doing this enables some impressive flexibility. For example, a %parse function which behaves like the existing $&parse $prompt primitive looks something like:

fn %parse prompt {
	if %is-interactive {
		let (p = $prompt(1); input = ())
		let (code = <={$&parse {
			let (r = <={$&readline $p}) {
				input = $input $r
				p = $prompt(2)
				result $r
			}
		}}) {
			if {!~ $#fn-%write-history 0} {
				%write-history <={%flatten \n $input}
			}
			result $code
		}
	} {
		$&parse $&read
	}
}

But this is much more flexible and explicit than the existing setup. Some ways you could change this to suit your fancy:

  • %write-history each line individually rather than buffering the input and writing it all at the end (this would be like the behavior before #65, "Minor overhaul of history writing").

  • Use only a single $prompt for every input line, or use three different $prompt elements, or even dynamically generate $prompt for each individual line.

  • Use something other than $&readline or $&read. Currently those are the only two reasonable options, but with this setup, a whole other line editing library is just a matter of calling a different reader command.

  • Do pre-processing on the read-in text before returning it. This can be used for lots of read-time behaviors, such as (in increasing order of bad-idea-ness):

    • history expansion
    • the equivalent of zsh's $KEYBOARD_HACK
    • precedence-hacking to allow things like time foo | bar => time {foo | bar}
    • "real" aliases which let you do weird things like alias fonk='echo blah |'
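
As a toy sketch of the pre-processing idea -- where %rewrite-line is a hypothetical user-defined hook, not part of this PR -- a reader command can transform each line before the parser ever sees it, passing the empty list through unchanged so that EOF detection still works:

```
fn %parse prompt {
	$&parse {
		let (r = <={$&readline $prompt(1)}) {
			if {~ $#r 0} {
				result ()	# empty list: $&parse sees EOF
			} {
				# hypothetical pre-processing hook
				result <={%rewrite-line $r}
			}
		}
	}
}
```

Note that this keeps the same shape as the %parse definition above; only the body of the reader command changes.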

Note that the reader command is called with stdin set to the current shell command input (which, unlike in other shells, is a bit different from the user's idea of stdin). Given a $&runinput primitive like the %run-file from #79, the combination of $&runinput, $&parse, and the reader command forms a coherent "system" of functions, in which an extremely minimal -- but just about complete, modulo handling the eof exception -- REPL would look like:

fn minimal-loop script {
	$&runinput $script {
		forever {
			<={$&parse $&read}
		}
	}
}

where: (a) $&runinput would set up a new command input and track certain values (line number, etc.); (b) $&parse would interpret the input as shell commands, calling (c) $&read to actually read the input from the file set up by $&runinput. It's not entirely "orthogonal" to have the $&runinput and $&parse behaviors depend on each other this much (you can't properly read from the command input without $&parse, and you can't read an input to $&parse without a context set up by $&runinput), but given the complexities at play here, the design seems fairly minimal -- and should extend fairly cleanly to other behaviors a user might want in a more complex input system.

Knowledgeable readers might be asking by now: how is this setup possible at all? Shell commands during parsing are inherently memory-buggy given the way the parser and GC interact -- this was noted very early on in the old mailing list (https://wryun.github.io/es-shell/mail-archive/msg00039.html). That's been solved here with the last (actually first, in implementation order) piece:

  3. This PR introduces a new, separate memory space, which I call "pspace" in the code, for building parse trees.

Originally, based in part on some comments in the mailing list, I thought I would need to replace the parser in order to play nice with GC'd memory (see both of my previous attempts, as well as #91). The problem is that replacing the parser is a fair amount of work, and parsing has so many tiny, fiddly behaviors (including a lot of allocations) that it's easy to introduce a heap of bugs just replacing the parser, before even getting to fixing the memory situation or developing the new input behaviors. So that effort went off into the weeds each time.

To bypass all that, allocations in (or used by) the parser now call palloc instead of gcalloc. palloc allocates from "pspace", which is like normal GC space except that it is (1) disjoint -- no normal GC touches pspace -- and (2) collected explicitly at the end of a $&parse call, when the generated parse tree is retrieved. At that point the parse tree is treated as the only root of pspace, and it is copied into GC space for normal use.

This was surprisingly simple to implement, and works shockingly well, given it's only ~200 new lines of code.


Not all of this PR is set in stone, but there are two particular aspects I think are very good and important:

First, the pspace change unblocks the entire notion of extensible "reading", in the sense that a shell is a "R"EPL. This is the foundational technical change in this PR, and it lets us play a bit of catch-up with the fancy interactive abilities of basically every other shell these days. By itself, the palloc work is a no-op for the user, and as far as I can tell it has essentially no impact on either performance or memory usage.

Second, I was able to pull all the readline logic out of the rest of the shell and collect it in readline.c. Localizing all the extra complexity of line editing libraries like this should make it far easier to manage. A possible direction to follow up here could be the "pluggable primitives" idea that has been floated off and on in the past, or at least some other way to more flexibly choose the line editing library that we like (rc has some precedent here, though it isn't based on primitives.)

In general, follow-ups I see to this would be:

  • Follow through on programmable completion and key binding in readline

  • Explore other line editing libraries and commands such as libedit/editline, linenoise/replxx, or even linecook (hello @injinj :). Or what about alternative frontends to es entirely? As part of this, figure out how to make it easier to "plug in" these different libraries at the C level.

  • Performance improvements -- using $&read for shell input makes a "seeking read" much more useful. (Without the seeking read, this PR slows down shell tests by ~9%; with it, that slowdown disappears.)

  • Support separate reading logic for heredocs (ever want to have a separate prompt or history file just for heredocs? I do).

  • Explore getting $&runinput or something similar into the shell, and explore how the whole input system is changed/simplified by using these new primitives together. I think we could probably shorten input.c by another ~200 lines and greatly simplify the control flow. This PR together with #79 ("Move the 'main runtime' of es out of C and into es script") almost completely removes the need for run_interactive in the shell -- I think that's quite interesting.

For completeness, I want to make sure to mention a couple bugs/limitations of the current implementation:

  • Calling %parse messes with the input's state, which can cause a surprising shell exit ($&parse {result ()}) or slightly wrong line numbers. I don't know if this is a "bug" or just a sharp edge of a powerful tool. If $&runinput were present, $&parse calls could be better insulated from parent REPLs. I would want to fix this by presenting a "virtual file" to the reader command: when the reader command reads from stdin, the virtual-file interface would read from the input file while doing its own line tracking. But I don't think C provides that kind of interface -- at least not conveniently.

  • $&read can't handle \0 bytes. This was already found in #93 ("$&reading a NUL byte crashes the shell"), but in this new setting it's more of a problem -- I had to comment out a trip.es test case to account for it. This would be an easy fix to implement, but it requires a Design Choice.

What do folks think of all this? Promising?

@jpco (Collaborator, Author) commented Mar 28, 2025

These test segfaults sure are disconcerting; I can't repro them with the same compilation flags on my computer. I'll keep poking.

@jpco jpco force-pushed the palloc branch 4 times, most recently from 159d11f to b8110e9 Compare March 29, 2025 19:11
@jpco (Collaborator, Author) commented Apr 12, 2025

Under the assumption that people definitely want:

  • More read-time extensibility of some kind (if not necessarily this user-visible design), and
  • More flexibility for different libraries to be included in the shell,

I'm going to merge some of the non-user-visible pieces of this separately: in particular, the core pspace work, and moving the LOCAL_GETENV code to var.c, where it makes much more sense. I also merged %write-history, since people were already on board with that change conceptually and I'd gotten it to a place where I like it standalone while it also composes well with these reader commands. And I've re-opened the "fast-$&read" PR, #107.

jpco and others added 2 commits April 14, 2025 07:40
I now (too late) understand what it was doing in the first place.
jpco added 8 commits May 6, 2025 11:45
This works for either bison or byacc, though it isn't POSIX-compliant.
Not sure it's the right call for the real thing, but it's enough to
allow us to make sure that es interacts with the parser properly.
This branch is only a proof of concept, so let's not be shy about
including potential changes that help it stand up on its own.
Still todo: pspaces themselves, and then a good bit of refactoring to
make the code more straightforward given all the changes so far.
This doesn't work with GCPROTECT, so that's a TODO.  It's also somewhat
messy.
Still doesn't work with GCPROTECT.
It's not "per-Input", but it's largely equivalent, and I don't want to
make Input a GC'd object just for a single field.
The interaction between pspaces and GCPROTECT is still weak after this
commit, which may be a to-do to fix, but it isn't actively broken
anymore.
@jpco (Collaborator, Author) commented May 20, 2025

(Modulo a small memory bug I'm tracking down,) I have gotten things to a place where multiple Inputs can have parsing going on at once -- essentially, you can now parse while you parse, as long as the parses are for different Inputs. This is necessary to allow es expressions to be parsed from strings while running an input command, which happens (for example) if you're calling %write-history per line and you have a %write-history defined in the environment. This resolves what should be the last major surprising behavior for people doing "normal" things; the remaining edge cases involve reading NUL bytes in scripts and the side effects of manually mucking with $&parse.

Part of making this work has been making the generated parser itself able to run multiple instances at once. For expediency for this proof-of-concept, I did this with %define api.pure full. This is supported by both bison and byacc, but certainly isn't POSIX. There are a few options for resolving this potential portability pitfall:

  • Get rid of the %define api.pure full, and do something Clever to manage the memory of multiple parsers running at once. This is, plainly, difficult.
  • Get rid of the %define api.pure full and try to work around it -- for example, maybe in ./configure we try to test if the parser can run multiple instances at once, and if not, fall back to a more limited multi-parsing setup where we have a separate parser for "normal" inputs and one for "string" inputs, and we can run the two parsers separately but each parser only one-at-a-time... This is more straightforward than the previous option, but still complicated, and produces a less flexible runtime.
  • Keep the %define api.pure full, but check y.tab.c and y.tab.h into the git repo so that they don't have to be generated while building es on this-or-that machine. This would let us keep the yacc-based grammar, keep the convenient "pure" API surface, and also keep the shell's build portable to lots of different machines (more so, even, because we'd remove yacc as a build-time dependency). This is the route rc took, and they didn't even need a pure parser to justify it :)
  • Abandon the yacc-based parser entirely, replacing it with either another parser generator or a hand-rolled parser. This is much more straightforward than working around yacc limitations, but it's a lot of effort, has a lot of potential for new parsing bugs, and isn't directly useful for demonstrating the actual goal of customizable reading.

So, clearly, my vote is option 3, at least in the short term, until we decide to really replace the parser with something more custom.
