Description
PRs 70289 and 2714 introduced changes to auto-generate markdown documentation from source for format settings and core settings.
We would like to do the same for:
- Global Server Settings
- Consolidate "restrictions on query complexity" page
- Core Settings
- Functions
- MergeTree Settings
- System tables
Additional to do:
- Add badges to global server settings
- Consistent formatting between all pages
- Improve autogenerate script by having each query in its own file
Activity
Blargian commented on Oct 25, 2024
@rschu1ze I would value your input on how we will approach auto-generating markdown from source for functions in particular. Global server settings and system tables appear to be manageable using the approach so far used for format and core settings, but markdown generation for functions is not as straightforward.
Starting with the obvious, most functions are missing C++ source documentation like this one has:
https://github.com/ClickHouse/ClickHouse/blob/aaca3b61567e9f22c69c441aef629f39b3d8b406/src/Functions/sin.cpp#L17-L23
Ideally we automate as much of updating documentation in source as possible... the thought of working through all the functions again one by one to update documentation in source is not one I am particularly fond of (neither for you as reviewer I suspect). It seems to me that unless we standardise the current function docs to a high enough level this in itself might be a challenge. Take for instance arithmetic-functions - some have only a syntax section, some have syntax and examples, and more recently updated ones have syntax, arguments, returned value, examples. In other places we have heading 'parameters' instead of 'arguments' etc. I think that scripting something to modify C++ source documentation from the markdown will be tricky if there is too little uniformity in the structure of each markdown page.
Once that is done it seems plausible to script something to update C++ source similar to what was done in Alexey's PR, maybe in batches or per category of functions on the docs page, and once all source files have documentation we can extend functionality of `FunctionFactory` to generate markdown from .cpp.
Does that approach sound reasonable?
rschu1ze commented on Oct 28, 2024
@Blargian I was afraid that day would come :-)
Let's start with server settings docs. I agree that these are manageable with the approach in #2714. There are two hurdles:
1. The existing public docs (server settings) and the internal docs in src/Core/ServerSettings.cpp may not be in sync. So Step 1 would be to go through each setting individually and make sure the more correct/sophisticated/verbose version appears in ServerSettings.cpp.
2. For the normal and session settings, corresponding files Core/Settings.cpp and Core/FormatFactorySettings.h contain all relevant settings. As a result, system table `system.settings` shows all normal settings (and I guess it contains the format settings as well ... didn't check but it also doesn't matter for the sake of the argument). The same is not true for ServerSettings.cpp, i.e. there are server settings outside that file. An example for which I am guilty are the query cache settings. They are documented publicly here but they are not part of ServerSettings.cpp or `system.server_settings`. The main reason for this is that server settings can be nested, e.g. for query_cache settings you could have this XML documentation:
ClickHouse ships with a template configuration file (here) ... as you can see, the majority of server settings are nested. (Note that even the template configuration file contains only a subset of all server settings).
There is no reason in principle why the nesting itself could not be represented in ServerSettings.cpp. The reason it is not done is that nesting can come with additional constraints, depending on the setting. E.g. in the query cache example above, each sub-tag (e.g. `<max_size_in_bytes>`) may occur at most once below `<query_cache>`. Such constraints can be different for other nested structures, for example the logging configuration (public docs) may contain this XML:
Note how `<logger>` occurs arbitrarily many times below `<levels>`. Such constraints are too complex to express in ServerSettings.cpp.
Long story short: The best we can do is to auto-generate the settings in ServerSettings.cpp (which still contains 155 settings as of now) and ignore everything else.
(thoughts about system table and function docs in the next comments).
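As a rough illustration of what "auto-generate the settings in ServerSettings.cpp" could look like on the docs side, here is a minimal Python sketch. It assumes the flat (non-nested) settings have already been exported as rows with a name, type, default and description, for example from `system.server_settings`. The `SettingRow` class, the function name and the heading layout are hypothetical and not taken from the script in #2714.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SettingRow:
    """One flat server setting (hypothetical shape of an exported row)."""
    name: str
    type: str
    default: str
    description: str


def render_settings_markdown(rows):
    """Render flat settings as markdown, one section per setting, sorted by name."""
    parts = []
    for row in sorted(rows, key=lambda r: r.name):
        parts.append(f"## {row.name}\n")
        parts.append(f"Type: `{row.type}`\n")
        parts.append(f"Default: `{row.default}`\n")
        # The description is emitted verbatim, so markdown embedded in the
        # in-source text passes through unchanged.
        parts.append(row.description.strip() + "\n")
    return "\n".join(parts)
```

Nested settings such as the query cache or logger configuration are deliberately out of scope of this sketch, in line with the conclusion above.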
rschu1ze commented on Oct 28, 2024
Before discussing system table and function docs:
The public docs contain a page on restricting query complexity via settings. The content of this page largely overlaps with the (auto-generated) setting docs.
It would be cool to consolidate both and
Related to that, there is this weird doc page for which we should probably apply the same steps.
rschu1ze commented on Oct 28, 2024
Aaaaand it doesn't stop there ...
The publicly documented merge tree settings are at the moment also not auto-generated. They are conceptually similar to normal settings (no nesting), so this will be straightforward:
Consolidate the public docs and MergeTreeSettings.cpp (code).
Auto-generate docs using the same approach as Autogenerate settings #2714.
rschu1ze commented on Oct 28, 2024
About auto-generating docs for system views:
Instead of a single doc page, each system table has its own page (see here) - 100 in total if I counted correctly (*).
Each system table is created ("attached") at startup, this happens in this file. The file contains table-level comments, e.g.
This table contains a single row with a single dummy UInt8 column containing the value 0. Used when the table is not specified explicitly, for example in queries like SELECT 1.
This is what is shown by
I'd say, the first step is to make sure that the internal comment string is in sync with the publicly documented per-table comment string (e.g. here for `system.users`).
Later, we can auto-generate the public table comment string from the internal table comment string.
The next thing to consider are, for each system table, the docs of the respective column names, their data types, and their per-column comment string. Every system table is implemented by its own C++ file, e.g. this file for `system.users`. This is what is shown by
Some of the existing system table docs have additional sections like `See also` and `Examples`, e.g. system.backup_log. We better don't auto-generate these sections, for three reasons:
1. They typically make heavy use of markdown, e.g. to create markdown tables and link other pages. Including additional formatting in C++ is rather ugly.
2. We'd need to expose these fields from the database via standard system views. This either means adding new `SEE_ALSO` and `EXAMPLES` columns to `system.tables` (which would be awful) or encoding/squeezing more information into the existing field `system.tables.COMMENT` and parsing the individual fields back similar to https://github.com/ClickHouse/clickhouse-docs/pull/2714/files#diff-0ed5a36db0e1c468253084936c7502d77f11a46a0c3b64038d09b1ffd676dcf5R54 (and that would be ugly as well).
3. The number of system tables where we need to maintain "See also" and "Examples" sections manually is a lot smaller than the number of functions that have similar sections. I'd say the extra overhead is okay.
The good thing about (*) above is that we don't need to do a big bang PR for system table docs. We can iterate table-by-table, check the rendered docs and improve incrementally.
EDIT: #706 is also relevant.
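To make the split between auto-generated and hand-maintained parts concrete, here is a minimal Python sketch of a per-table page generator. It assumes the table comment and the per-column (name, type, comment) triples have already been read from the server, and that the "See also" / "Examples" sections live in a separate, manually maintained markdown snippet. The function name, its parameters and the page layout are hypothetical.

```python
def render_system_table_page(table, comment, columns, manual_sections=""):
    """Build one markdown page for a system table.

    table:           table name without the "system." prefix
    comment:         table-level comment string taken from the source
    columns:         iterable of (name, type, comment) triples
    manual_sections: hand-written markdown (e.g. "See also", "Examples")
                     appended verbatim; it is never generated
    """
    lines = [f"# system.{table}", "", comment.strip(), "", "Columns:", ""]
    for name, type_, col_comment in columns:
        lines.append(f"- `{name}` ({type_}): {col_comment.strip()}")
    if manual_sections.strip():
        lines += ["", manual_sections.strip()]
    return "\n".join(lines) + "\n"
```

Because each page is produced independently, this fits the table-by-table iteration described above: one table can be converted and its rendered page checked without touching the others.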
rschu1ze commented on Oct 28, 2024
Function docs will be the most challenging.
Newly added functions are nowadays required to have in-source docs (example). Unfortunately, the majority of functions still only comes with public docs, so ... as usual ..., the first step would be to synchronize the public docs back into the in-source docs (*). This will be a lot of "fun", I promise!
In-source docs are specified for each function in the form of a `FunctionDocumentation` object (header, source). The exact fields of this class (e.g. `description`, `syntax`, `arguments`) and the formatting they expect were invented ca. a year ago and they are not set in stone. I.e. if we think we should change them, we can do so. System table `system.functions` gobbles everything up (code):
The public function docs are grouped into categories, e.g. "Arithmetic", "Arrays", "arrayJoin", .... Keeping this grouping when we auto-generate docs makes sense, IMHO, otherwise newbies will have a much harder time finding the correct function among 1000+ functions. We should use the `categories` field in `FunctionDocumentation`, respectively in `system.functions`, to group functions. There are currently three problems:
1. `FunctionDocumentation::Category` is defined as `std::set`. For simplicity, we should allow only a single category.
2. Within groups (in the public docs), the functions are loosely sorted by descending popularity / ascending obscurity. See e.g. the string functions for an example. To maintain this sorting in auto-generated docs, we'd need a relative order between functions. E.g., with the previous string function example, function 'empty' could have order = 1, 'notEmpty' could have order = 2, 'left' could have order = 5, etc. The order could be made a new field within `FunctionDocumentation`. BUT: As you can imagine, maintaining the existing order within a group when a new function is added or deleted will be a nightmare. For example, adding a function "in the middle" means all subsequent functions will need to have their order updated. I am therefore willing to lift the popularity sorting and to sort all functions in auto-generated docs alphabetically. That should be okay assuming that groups contain sufficiently few functions each.
3. The fields of `FunctionDocumentation` may contain markdown, e.g.
One could argue if there are more C++-ish ways to encode things like text formatting, links, tables (or if that should even be possible at all). In the end, I think embedded markdown is the least bad option and after all, markdown is designed to be human-readable.
The previous example and almost all other places that define in-source docs use designated initializer syntax. It is compact but with longer strings it becomes hard to read and edit. It is one of the reasons why people hate writing in-source docs. My proposal is that you use this as a template instead:
Since some of the fields will contain linebreaks, we can make the template even more readable by using raw strings:
Note: It makes sense to split the `Arguments` and `ReturnedValue` types (see "src/Common/FunctionDocumentation.h") into sub-fields `name`, `type`, `description`, respectively `name`, `type`. Having to specify the type separately makes it harder to forget it.
So with all the fields filled out for the example, we'll get:
Note how all useful text is neatly left-aligned and uses no linebreaks. Pretty readable, if you ask me.
To sum up, I'd propose to proceed group-by-group, take care of the things I mentioned above, and then make adjustments as you go since there are probably plenty of things I forgot.
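The grouping and ordering proposed above (exactly one category per function, alphabetical order within each category) can be sketched in a few lines of Python. This assumes the function names and categories have already been exported as (name, category) pairs, for example from `system.functions`. The function name and the input shape are hypothetical, and the case-insensitive comparison is an assumption of this sketch rather than something decided in this thread.

```python
from collections import defaultdict


def group_functions(functions):
    """Group (name, category) pairs by category.

    Categories are ordered alphabetically, and so are the function names
    within each category (case-insensitively), so adding or removing a
    function never forces other entries to be renumbered.
    """
    groups = defaultdict(list)
    for name, category in functions:
        groups[category].append(name)
    return {
        category: sorted(names, key=str.lower)
        for category, names in sorted(groups.items())
    }
```

With this ordering, the per-function `order` field discussed above is not needed at all.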
rschu1ze commented on Oct 28, 2024
@Blargian And to reply to your original thoughts:
For any given function, the in-source docs should ideally be the more verbose/exhaustive/complete version of sections "syntax", "arguments", "examples" etc. of the in-source docs and the public docs. The idea is to make this step as mechanical as possible (it will still need to be done by hand). There is no need to come up with new sections "syntax", "arguments", "examples" if neither the in-source nor the public docs contain it.
Yes. Let's do it category-by-category. The day when we can finally delete the public function docs will be so glorious.
rschu1ze commented on Oct 28, 2024
@justindeguzman FYI ^^
alexey-milovidov commented on Oct 29, 2024
It is ok to include Markdown directly in .cpp - while it is ugly, we have no better way to do it.
We should include the most detailed documentation in .cpp, because it will be the single source of this documentation.
rschu1ze commented on Oct 29, 2024
Agree about Markdown.
Re "most detailed documentation": This is relevant for the `Comment` field of each system table (I actually see no alternative to that). Advantages: 1. Thematically, examples and references are "comments". 2. During parsing, we can easily split these sections again. 3. Field `comment` would become quite long but only very few people will look at field `system.columns.comment` and bother.
`system.function` though as far as I see, the existing fields cover all relevant structures already. Anyways, function docs will require some experimentation due to their complexity so we may need to come up with some other approaches if needed.
Blargian commented on Nov 15, 2024
@rschu1ze regarding system tables - possibly a silly question, I'm not sure if it's feasible, but could we not maybe create a new `system.docs.settings` table which would have the setting name and a field for the example in markdown? The markdown in `system.settings` is a real eye-sore now that we have put markdown in the `description` field:
At least that way we keep `system.settings` respectable looking and 'hide' the ugly bit.
rschu1ze commented on Nov 15, 2024
Agree that is an eye-sore. But don't worry too much:
`SELECT name, value, changed` on `system.settings` instead of a `SELECT *`. Alternatively, a different output format also helps, e.g. `SELECT * FROM system.settings FORMAT Vertical`.
`SELECT * FROM system.tables` below.
It is not a difficult thing to add new "technical" system tables which are only useful in the context of generating docs from the sources. I'd say we can decide if we like to do that after all docs are auto-generated, depending on how painful the system tables are to look at then.