
Es.write.operation documentation is deceptive on default values when used via spark #2206

Open
@robwithhair

Description


What kind of an issue is this?

  • Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
    The easier it is to track down the bug, the faster it is solved.
  • Feature Request. Start by telling us what problem you’re trying to solve.
    Often a solution already exists! Don’t send pull requests to implement new features without
    first getting our support. Sometimes we leave features out on purpose to keep the project small.

Issue description

The documentation says the default for es.write.operation is index, but when writing via Spark with output mode "update", the default is actually upsert. This information is only discoverable by reading the code.

The documentation is deceptive because it implies that in Spark update mode the default value of index will be used, when in fact the default is overridden to "upsert". This is confirmed both by testing and by reviewing the code.
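For context, the scenario looks roughly like the following structured-streaming write. This is a sketch, not a verified reproduction; the index name, rate source, and checkpoint path are placeholders, not from the original report:

```python
# Sketch only: "my-index" and the source/checkpoint settings are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-write-demo").getOrCreate()
stream = spark.readStream.format("rate").load()

query = (
    stream.writeStream
    .format("es")                 # elasticsearch-hadoop Spark sink
    .outputMode("update")         # in this mode the connector overrides the
                                  # documented es.write.operation default of
                                  # "index" to "upsert"
    .option("checkpointLocation", "/tmp/es-demo-checkpoint")
    .start("my-index")
)
```

No es.write.operation is set here, so a reader of the docs would expect "index"; the effective operation is "upsert".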

Steps to reproduce

Code:

N/A; this is a documentation fix.

Stack trace:

N/A

Activity


jbaiera commented on Mar 19, 2024


This could be better detailed in the docs for sure.

When using update mode in Spark SQL, the connector changes the operation to "upsert" because 1) it needs that request mode to satisfy the invariants defined by Spark, and 2) it anticipates that you want that setting in that mode, so it sets it for you rather than making you declare in multiple places that you want to update data.
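The override described above can be sketched as a small resolution function. This is an illustrative helper only, not the connector's actual code, and the behavior of an explicit user setting in update mode is an assumption here:

```python
# Illustrative sketch of the effective-default logic described above;
# NOT the elasticsearch-hadoop implementation.
from typing import Optional


def effective_write_operation(output_mode: str,
                              configured: Optional[str] = None) -> str:
    """Resolve the effective es.write.operation for a Spark write.

    `configured` is a user-set es.write.operation, if any. The documented
    default is "index", but in Spark's "update" output mode the connector
    overrides the default to "upsert".
    """
    if configured is not None:
        # Assumption for illustration: an explicit setting wins.
        return configured
    if output_mode == "update":
        return "upsert"  # override applied by the connector
    return "index"       # documented default
```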

Fun fact: there are actually quite a lot of places where we plug into Spark to modify the connector's behavior based on your API usage. Examples include pushing queries down to ES (by default we don't filter results on the server, but we generate queries from the query plan when we're able to) and limiting the fields returned from the server (we intercept Spark's field projection when it's available, so we don't pull a bunch of fields from each document that aren't needed for the operation). It's tough to list these all out because in some cases we merge existing configurations together, in other cases we override them, and sometimes we just offload the concern onto the library code so users don't have to worry about configuration.
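As an illustration of the pushdown and projection behavior mentioned above, a read along these lines would have its filter translated into an ES query and its projection trimmed server-side. The node address and index name are assumptions, not from this thread:

```python
# Sketch only: "localhost:9200" and "my-index" are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-read-demo").getOrCreate()

df = (
    spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")
    .option("pushdown", "true")   # translate Spark filters into ES queries
    .load("my-index")
)

# The filter is pushed down as an ES query and only `name` is projected,
# so the server does not return unneeded fields from each document.
df.filter(df["age"] > 30).select("name").show()
```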
