
Keep local inputs and outputs local in processing jobs when using local mode #2484

Open
@marcelgwerder

Description


Describe the feature you'd like
Keep local inputs and outputs local in processing jobs when using local mode.

How would this feature be used? Please describe.
Currently, all local inputs are uploaded to the default bucket specified on the SageMaker session; there is no differentiation between local mode and non-local mode.

# If the source is a local path, upload it to S3
# and save the S3 uri in the ProcessingInput source.
parse_result = urlparse(file_input.s3_input.s3_uri)
if parse_result.scheme != "s3":
    desired_s3_uri = s3.s3_path_join(
        "s3://",
        self.sagemaker_session.default_bucket(),
        self._current_job_name,
        "input",
        file_input.input_name,
    )
    s3_uri = s3.S3Uploader.upload(
        local_path=file_input.s3_input.s3_uri,
        desired_s3_uri=desired_s3_uri,
        sagemaker_session=self.sagemaker_session,
        kms_key=kms_key,
    )
    file_input.s3_input.s3_uri = s3_uri
normalized_inputs.append(file_input)

This has two issues:

  • The LocalSession constructor does not accept a default_bucket argument, so the bucket used for the upload can only be changed by modifying a private attribute of the session after creation, which is hacky to say the least. As far as I can tell, this workaround is also not documented anywhere.
  • It is wasteful to upload all local inputs to S3 only to download them into the container again in local mode. Local mode should be truly local when inputs/outputs are local paths.
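One possible direction is to branch on whether the session is local before normalizing each input. The sketch below is only an illustration of the requested behavior, not the SDK's actual API: the `is_local_session` flag and the injected `upload` callable are hypothetical stand-ins for a LocalSession check and `S3Uploader.upload`.

```python
from urllib.parse import urlparse


def normalize_input_uri(uri: str, is_local_session: bool, upload) -> str:
    """Return the URI a processing job should read an input from.

    Hypothetical helper: `upload` is a callable that copies a local path
    to S3 and returns the resulting s3:// URI.
    """
    if urlparse(uri).scheme == "s3":
        # Already an S3 URI: pass it through unchanged in either mode.
        return uri
    if is_local_session:
        # Local mode with a local path: keep it local, skip the S3 round trip.
        return uri
    # Non-local mode: upload the local path as the SDK does today.
    return upload(uri)


# Hypothetical uploader standing in for S3Uploader.upload.
fake_upload = lambda path: "s3://default-bucket/job/input" + path

# Local mode keeps the local path untouched.
print(normalize_input_uri("/tmp/data", is_local_session=True, upload=fake_upload))
# Non-local mode still uploads as before.
print(normalize_input_uri("/tmp/data", is_local_session=False, upload=fake_upload))
```

Outputs could then be handled symmetrically: if the destination is a local path and the session is local, the container output directory is simply bind-mounted rather than synced through S3.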

Describe alternatives you've considered

  • Use S3 paths for inputs and outputs, but this slows down local mode since the SDK downloads/uploads those inputs/outputs on every run. Local mode should enable quick local testing independent of cloud resources like S3 buckets.
