Describe the bug
Input objects configured with `s3_data_type="S3Prefix"`:
- in remote jobs, name the files they match using their full basename;
- in local mode, name matched files using only the suffix that remains after the matching prefix is removed.

This behaviour is exhibited by both `TrainingInput` and `ProcessingInput` objects using `s3_data_type="S3Prefix"`.
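Below is a minimal, hypothetical sketch of the kind of key-to-name mapping that would produce the outputs shown in the cases that follow; the values of `prefix` and `keys` are made up for illustration, and this is not the SDK's actual implementation.

```python
import os

# Hypothetical illustration of the two naming behaviours described above.
# `prefix` and `keys` mimic the Case 1 setup below; this is NOT the SDK's code.
prefix = "data/train"
keys = ["data/train-data.csv", "data/train-ground-truth.csv"]

# Remote jobs appear to keep the full basename of every matched key:
print([os.path.basename(key) for key in keys])  # ['train-data.csv', 'train-ground-truth.csv']

# Local mode appears to keep only the suffix left after stripping the prefix:
print([key[len(prefix):] for key in keys])      # ['-data.csv', '-ground-truth.csv']
```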
Here's an example to illustrate this issue:
Case 1 - partial match of filenames
Given the following S3 folder structure:
```
s3://{BUCKET_NAME}
`-- data
    |-- train-data.csv
    `-- train-ground-truth.csv
```
And an input definition which only partially matches the desired files (i.e. too short to fully match any of the files, but too long to match only the parent directory):
```python
TrainingInput(  # could be replaced by ProcessingInput - the behaviour is the same
    s3_data_type="S3Prefix",
    s3_data=f"s3://{BUCKET_NAME}/data/train",  # S3Prefix partially matches the names of the relevant files
),
```
when run in local mode, the training inputs folder will contain:
```python
# os.walk of /opt/ml/input/data - local mode
[
    ('/opt/ml/input/data', ['training'], []),
    ('/opt/ml/input/data/training', [], ['-data.csv', '-ground-truth.csv'])
]
```
The matching part of the prefix ("train") was removed from the filenames,
whereas in remote mode:
```python
# os.walk of /opt/ml/input/data - remote mode
[
    ('/opt/ml/input/data', ['training'], []),
    ('/opt/ml/input/data/training', [], ['train-data.csv', 'train-ground-truth.csv'])
]
```
filenames remained unchanged.
Case 2 - partial match of parent directory name
If the `s3_data` URI only partially matches a directory name, the behaviour is similar, but applies to directory names instead.
Given an expanded folder structure:
```
s3://{BUCKET_NAME}
|-- data
|   |-- train-data.csv
|   |-- train-ground-truth.csv
|   `-- raw-data.txt
`-- debug
    `-- last.log
```
and an S3 URI which partially matches a parent directory name (also matching the parent's sibling, on purpose):
```python
s3_data=f"s3://{BUCKET_NAME}/d"  # partial match of both the "data" and "debug" folders
```
will result in (local run):
```python
# os.walk of /opt/ml/input/data - local mode
[
    ('/opt/ml/input/data', ['training'], []),
    ('/opt/ml/input/data/training', ['ata', 'ebug'], []),
    ('/opt/ml/input/data/training/ata', [], ['train-ground-truth.csv', 'train-data.csv', 'raw-data.txt']),
    ('/opt/ml/input/data/training/ebug', [], ['last.log'])
]
```
note that "data" and "debug" got renamed to "ata" and "ebug", as suffixes of the ".../d" in the provided prefix, but the names of files contained within were not changed.
whereas remotely:
```python
# os.walk of /opt/ml/input/data - remote mode
[
    ('/opt/ml/input/data', ['training'], ['training-manifest']),
    ('/opt/ml/input/data/training', ['data', 'debug'], []),
    ('/opt/ml/input/data/training/data', [], ['train-data.csv', 'raw-data.txt', 'train-ground-truth.csv']),
    ('/opt/ml/input/data/training/debug', [], ['last.log'])
]
```
both "data" and "debug" retain their original names.
To reproduce
- Create the following directory structure in S3:
```
s3://{BUCKET_NAME}
|-- data
|   |-- train-data.csv
|   |-- train-ground-truth.csv
|   `-- raw-data.txt
`-- debug
    `-- last.log
```
- Create a jupyter notebook with the following contents:
```python
import sagemaker
from sagemaker.workflow.pipeline_context import LocalPipelineSession, PipelineSession
from sagemaker.pytorch import PyTorch
from sagemaker.workflow.steps import TrainingStep, TrainingInput
from sagemaker.workflow.pipeline import Pipeline

LOCAL_RUN = True  # or "False" to run remotely
BUCKET_NAME = "<bucket-id>"  # replace with a valid bucket ID
IAM_ROLE = sagemaker.get_execution_role()

pipeline_session = LocalPipelineSession() if LOCAL_RUN else PipelineSession()

pytorch_estimator = PyTorch(
    sagemaker_session=pipeline_session,
    role=IAM_ROLE,
    instance_type="local" if LOCAL_RUN else "ml.c5.xlarge",
    instance_count=1,
    framework_version="1.9.1",
    py_version="py38",
    entry_point="./entry_point.py",
    code_location=f"s3://{BUCKET_NAME}/data",
)

step = TrainingStep(
    name="dummy-training-step",
    step_args=pytorch_estimator.fit(
        inputs=TrainingInput(
            s3_data_type="S3Prefix",
            s3_data=f"s3://{BUCKET_NAME}/data/train"  # for a partial match on filenames
            # s3_data=f"s3://{BUCKET_NAME}/d"  # for a partial match on directory names
        ),
    )
)

pipeline = Pipeline(
    name="s3prefix-debug",
    steps=[step],
    sagemaker_session=pipeline_session
)

pipeline.upsert(
    role_arn=IAM_ROLE,
    description="debug pipeline"
)

execution = pipeline.start()
```
- Add an entry_point.py (the entry point referenced by the estimator above) and populate it with the following code. It's meant only to report the contents of the relevant part of the container's filesystem.
```python
import os
from pprint import pprint

path = '/opt/ml/input/'
print(f"os.walk of {path}")
pprint(list(os.walk(path)))
```
- Run the notebook twice: first with `LOCAL_RUN` set to `True`, then with `LOCAL_RUN` set to `False`. Save the log output of each run.
- Compare the `os.walk` results of both runs.
Expected behavior
Local execution mode should maintain the full basename of matched files/dirs - just like the remote mode does.
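For illustration only (a sketch of the expected mapping, not a proposed patch; the helper name and inputs are hypothetical), resolving each key relative to the directory containing the prefix preserves full basenames regardless of how much of the prefix matches:

```python
import posixpath

def expected_local_relpath(key: str, prefix: str) -> str:
    """Hypothetical mapping: resolve keys relative to the prefix's parent directory,
    so partial prefix matches never truncate file or directory names."""
    base_dir = posixpath.dirname(prefix)  # e.g. "data" for prefix "data/train", "" for "d"
    return posixpath.relpath(key, base_dir or ".")

# Case 1: prefix "data/train" -> full basenames are preserved
print(expected_local_relpath("data/train-data.csv", "data/train"))  # train-data.csv
# Case 2: prefix "d" -> directory names are preserved
print(expected_local_relpath("debug/last.log", "d"))                # debug/last.log
```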
System information
A description of your system. Please provide:
- SageMaker Python SDK version: v2.126
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version: 1.9.1
- Python version: 3.8
- CPU or GPU: CPU
- Custom Docker image (Y/N): N