Up for discussion: User-provided commands for reading shell input #178


Draft: jpco wants to merge 28 commits into master from the palloc branch

Conversation

@jpco (Collaborator) commented Mar 27, 2025

Another wacky change from the chicken ranch. I think this one is pretty exciting, though.

Here's what's included:

  1. This PR adds a $&readline primitive.

The primitive itself isn't terribly interesting; it does essentially what you'd expect: it takes a single optional prompt as its argument, and its return semantics are the same as $&read's. It also interacts with the readline history primitives and $&resetterminal as you would expect. However, it also (sort of) supports user-programmable file completion -- by uncommenting readline.c:218 and defining the following function, you can try it out:

fn %complete in {
  let (f = $in^*) if {access $f} {
    result $f
  } {
    result ()
  }
}

That's novel and interesting, but not quite enough on its own to justify a whole new primitive. So what's the point of this primitive?

  2. This PR modifies $&parse to take a "reader command" which it calls to fetch input.

So instead of $&parse $prompt, we do something like $&parse {$&readline $prompt} (what it really looks like is just below). $&parse calls the reader command at least once in order to fetch input to parse, but the exact number of calls depends on the input itself, which is why it needs to be modeled this way. (In theory we could build a push-style $&parse, where we fetch input, feed it to $&parse, and then repeat depending on its output, but this is a more straightforward change from the prior state of things.) When $&parse gets an empty-list return value from its reader command, it interprets that as EOF and throws an eof exception.

So that's why we added the $&readline primitive: it's now how we use readline in the shell!

Doing this enables some impressive flexibility. For example, a %parse function which behaves like the existing $&parse $prompt primitive looks something like:

fn %parse prompt {
	if %is-interactive {
		let (p = $prompt(1); input = ())
		let (code = <={$&parse {
			let (r = <={$&readline $p}) {
				input = $input $r
				p = $prompt(2)
				result $r
			}
		}}) {
			if {!~ $#fn-%write-history 0} {
				%write-history <={%flatten \n $input}
			}
			result $code
		}
	} {
		$&parse $&read
	}
}

But this is much more flexible and explicit than the existing setup. Some ways you could change this to suit your fancy:

  • %write-history each line individually rather than buffering the input and writing it all at the end (this would be like the behavior before #65, "Minor overhaul of history writing").

  • Use only a single $prompt for every input line, or use three different $prompt elements, or even dynamically generate $prompt for each individual line.

  • Use something other than $&readline or $&read. Currently those are the only two reasonable options, but with this setup, a whole other line editing library is just a matter of calling a different reader command.

  • Do pre-processing on the read-in text before returning it. This can be used for lots of read-time behaviors, such as (in increasing order of bad-idea-ness):

    • history expansion
    • the equivalent of zsh's $KEYBOARD_HACK
    • precedence-hacking to allow things like time foo | bar => time {foo | bar}
    • "real" aliases which let you do weird things like alias fonk='echo blah |'
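
As a toy sketch of the pre-processing idea -- where %rewrite-line is a hypothetical user-defined hook, not part of this PR -- a reader command can transform each line before the parser ever sees it, passing the empty list through unchanged so that EOF detection still works:

```
fn %parse prompt {
	$&parse {
		let (r = <={$&readline $prompt(1)}) {
			if {~ $#r 0} {
				result ()	# empty list: $&parse sees EOF
			} {
				# hypothetical pre-processing hook
				result <={%rewrite-line $r}
			}
		}
	}
}
```

Note that this keeps the same shape as the %parse definition above; only the body of the reader command changes.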

Note that the reader command is called with stdin set to the current shell command input (which, unlike in other shells, is a bit different from the user's idea of stdin). Given a $&runinput primitive like the %run-file from #79, the combination of $&runinput, $&parse, and the reader command forms a coherent "system" of functions, in which an extremely minimal -- but just about complete, modulo handling the eof exception -- REPL would look like:

fn minimal-loop script {
	$&runinput $script {
		forever {
			<={$&parse $&read}
		}
	}
}

where: (a) $&runinput would set up a new command input and track certain values (line number, etc.); (b) $&parse would interpret the input as shell commands, calling (c) $&read to actually read the input from the file set up by $&runinput. It's not entirely "orthogonal" to have the $&runinput and $&parse behaviors depend on each other this much (you can't properly read from the command input without $&parse, and you can't read an input to $&parse without a context set up by $&runinput), but given the complexities at play here, the design seems fairly minimal -- and should extend fairly cleanly to other behaviors a user might want in a more complex input system.

Knowledgeable readers might be asking by now: how is this setup possible at all? Shell commands during parsing are inherently memory-buggy given the way the parser and GC interact -- this was noted very early on in the old mailing list (https://wryun.github.io/es-shell/mail-archive/msg00039.html). That's been solved here with the last (actually first, in implementation order) piece:

  3. This PR introduces a new, separate memory space, which I call "pspace" in the code, for building parse trees.

Originally, based in part on some comments in the mailing list, I thought I would need to replace the parser in order to play nice with GC'd memory (see both of my previous attempts, as well as #91). The problem is that replacing the parser is a fair amount of work, and parsing has so many tiny, fiddly behaviors (including a lot of allocations) that it's easy to introduce a heap of bugs just replacing the parser, before even getting to fixing the memory situation or developing the new input behaviors. So that effort went off into the weeds each time.

To bypass all that, allocations in (or used by) the parser now call palloc instead of gcalloc. palloc allocates from "pspace", which is like normal GC space except that it is (1) disjoint -- no normal GC touches pspace -- and (2) collected explicitly at the end of a $&parse call, when the generated parse tree is retrieved. At that point the parse tree is treated as the only root of pspace, and it is copied into GC space for normal use.

This was surprisingly simple to implement, and works shockingly well, given it's only ~200 new lines of code.


Not all of this PR is set in stone, but there are two particular aspects I think are very good and important:

First, the pspace change unblocks the entire notion of extensible "reading", in the sense that a shell is a "R"EPL. This is the foundational technical change in this PR, and it lets us play a bit of catch-up with the fancy interactive abilities of basically every other shell these days. By itself, the palloc work is a no-op for the user, and as far as I can tell it has essentially no impact on either performance or memory usage.

Second, I was able to pull all the readline logic out of the rest of the shell and collect it in readline.c. Localizing all the extra complexity of line editing libraries like this should make it far easier to manage. A possible direction to follow up here could be the "pluggable primitives" idea that has been floated off and on in the past, or at least some other way to more flexibly choose the line editing library that we like (rc has some precedent here, though it isn't based on primitives.)

In general, follow-ups I see to this would be:

  • Follow through on programmable completion and key binding in readline

  • Explore other line editing libraries and commands such as libedit/editline, linenoise/replxx, or even linecook (hello @injinj :). Or what about alternative frontends to es entirely? As part of this, figure out how to make it easier to "plug in" these different libraries at the C level.

  • Performance improvements -- using $&read for shell input makes a "seeking read" much more useful. (Without the seeking read, this PR slows down shell tests by ~9%; with it, that slowdown disappears.)

  • Support separate reading logic for heredocs (ever want to have a separate prompt or history file just for heredocs? I do).

  • Explore getting $&runinput or something similar into the shell, and explore how the whole input system is changed/simplified by using these new primitives together. I think we could probably shorten input.c by another ~200 lines and greatly simplify the control flow. This PR together with #79 ("Move the 'main runtime' of es out of C and into es script") almost completely removes the need for run_interactive in the shell -- I think that's quite interesting.

For completeness, I want to make sure to mention a couple bugs/limitations of the current implementation:

  • Calling %parse messes with the input's state, which can cause a surprising shell exit ($&parse {result ()}) or slightly wrong line numbers. I don't know if this is a "bug" or just a sharp edge of a powerful tool. If $&runinput were present, $&parse calls could be better insulated from parent REPLs. I would want to fix this by presenting a "virtual file" to the reader command: when the reader command reads from stdin, the virtual-file interface would read from the input file while doing its own line tracking. But I don't think C provides that kind of interface -- at least not conveniently.

  • $&read can't handle \0 bytes. This was already found in #93 ("$&reading a NUL byte crashes the shell"), but in this new setting it's more of a problem -- I had to comment out a trip.es test case to account for it. This would be an easy fix to implement, but it requires a Design Choice.

What do folks think of all this? Promising?

@jpco (Collaborator, Author) commented Mar 28, 2025

These test segfaults sure are disconcerting; I can't repro them with the same compilation flags on my computer. I'll keep poking.

@jpco jpco force-pushed the palloc branch 4 times, most recently from 159d11f to b8110e9 Compare March 29, 2025 19:11
@jpco (Collaborator, Author) commented Apr 12, 2025

Under the assumption that people definitely want:

  • More read-time extensibility of some kind (if not necessarily this user-visible design), and
  • More flexibility for different libraries to be included in the shell,

I'm going to merge some of the non-user-visible pieces of this separately: in particular, the core pspace work, and moving the LOCAL_GETENV code to var.c, where it makes much more sense. I also merged %write-history, since people were already on board with that change conceptually and I'd gotten it to a place where I like it standalone while it also composes well with these reader commands. And I've re-opened the "fast-$&read" PR, #107.

jpco and others added 2 commits April 14, 2025 07:40
I now (too late) understand what it was doing in the first place.
jpco added 8 commits May 6, 2025 11:45
This works for either bison or byacc, though it isn't POSIX-compliant.
Not sure it's the right call for the real thing, but it's enough to
allow us to make sure that es interacts with the parser properly.
This branch is only a proof of concept, so let's not be shy about
including potential changes that help it stand up on its own.
Still todo: pspaces themselves, and then a good bit of refactoring to
make the code more straightforward given all the changes so far.
This doesn't work with GCPROTECT, so that's a TODO.  It's also somewhat
messy.
Still doesn't work with GCPROTECT.
It's not "per-Input", but it's largely equivalent, and I don't want to
make Input a GC'd object just for a single field.
The interaction between pspaces and GCPROTECT is still weak after this
commit, which may be a to-do to fix, but it isn't actively broken
anymore.
@jpco (Collaborator, Author) commented May 20, 2025

(Modulo a small memory bug I'm tracking down,) I have gotten things to a place where multiple Inputs can have parsing going on at once -- essentially, you can now parse while you parse, as long as the parses are for different Inputs. This is necessary to allow es expressions to be parsed from strings while running an input command, which happens (for example) if you're calling %write-history per line and you have a %write-history defined in the environment. This resolves what should be the last major surprising behavior for people doing "normal" things; the remaining edge cases involve reading NUL bytes in scripts and the side effects of manually mucking with $&parse.

Part of making this work has been making the generated parser itself able to run multiple instances at once. For expediency for this proof-of-concept, I did this with %define api.pure full. This is supported by both bison and byacc, but certainly isn't POSIX. There are a few options for resolving this potential portability pitfall:

  • Get rid of the %define api.pure full, and do something Clever to manage the memory of multiple parsers running at once. This is, plainly, difficult.
  • Get rid of the %define api.pure full and try to work around it -- for example, maybe in ./configure we try to test if the parser can run multiple instances at once, and if not, fall back to a more limited multi-parsing setup where we have a separate parser for "normal" inputs and one for "string" inputs, and we can run the two parsers separately but each parser only one-at-a-time... This is more straightforward than the previous option, but still complicated, and produces a less flexible runtime.
  • Keep the %define api.pure full, but check y.tab.c and y.tab.h into the git repo so that they don't have to be generated while building es on this-or-that machine. This would let us keep the yacc-based grammar, keep the convenient "pure" API surface, and also keep the shell's build portable to lots of different machines (more so, even, because we'd remove yacc as a build-time dependency). This is the route rc took, and they didn't even need a pure parser to justify it :)
  • Abandon the yacc-based parser entirely, replacing it with either another parser generator or a hand-rolled parser. This is much more straightforward than working around yacc limitations, but it's a lot of effort, has a lot of potential for new parsing bugs, and isn't directly useful for demonstrating the actual goal of customizable reading.

So, clearly, my vote is option 3, at least in the short term, until we decide to really replace the parser with something more custom.
