
Names of files partially matched by S3Prefix differ between local and remote job environments #3612

Open
@macsaktr

Description

Describe the bug

Input objects configured with s3_data_type="S3Prefix":

  • in remote jobs, will name matched files using their full basename;
  • in local mode, will name matched files using only the suffix that remains after the prefix is removed.

This behaviour is exhibited by both TrainingInput and ProcessingInput objects using s3_data_type="S3Prefix".

Here's an example to illustrate this issue:

Case 1 - partial match of filenames

Given the following S3 folder structure:

s3://{BUCKET_NAME}
`-- data
    `-- train-data.csv
    `-- train-ground-truth.csv

And an input definition which only partially matches the desired files (i.e. too short to fully match any of the files, but too long to match only the parent directory):

TrainingInput(  # could be replaced by ProcessingInput - the behaviour is the same
    s3_data_type="S3Prefix",
    s3_data=f"s3://{BUCKET_NAME}/data/train" # S3Prefix partially matches the names of relevant files
),

When run in local mode, the training inputs folder will contain:

#  os.walk of /opt/ml/input/data - local mode
[
  ('/opt/ml/input/data', ['training'], []),
  ('/opt/ml/input/data/training', [], ['-data.csv', '-ground-truth.csv'])
]

The matching part of the prefix ("train") was removed from the filenames,

whereas in remote mode:

# `os.walk` of /opt/ml/input/data - remote mode
[
 ('/opt/ml/input/data', ['training'], []),
 ('/opt/ml/input/data/training', [], ['train-data.csv', 'train-ground-truth.csv'])
]

the filenames remained unchanged.
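For reference, here is a minimal sketch of the two naming strategies suggested by the outputs above (an assumption drawn from the observed results, not from the SDK's actual implementation): remote jobs appear to keep each key's basename, while local mode appears to strip the matched prefix from the key.

import os

# Hypothetical illustration of the two naming strategies observed above.
prefix_key = "data/train"  # key portion of s3://{BUCKET_NAME}/data/train
matched_keys = ["data/train-data.csv", "data/train-ground-truth.csv"]

for key in matched_keys:
    remote_style = os.path.basename(key)  # -> "train-data.csv", "train-ground-truth.csv"
    local_style = key[len(prefix_key):]   # -> "-data.csv", "-ground-truth.csv"
    print(f"{remote_style!r} (remote) vs {local_style!r} (local)")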

Case 2 - partial match of parent directory name

If the s3_data URI only partially matches a directory name, the behaviour is similar, but applies to the directory names instead.
Given an expanded folder structure:

s3://{BUCKET_NAME}
`-- data
    `-- train-data.csv
    `-- train-ground-truth.csv
    `-- raw-data.txt
`-- debug
     `-- last.log

and an S3 URI which partially matches a parent directory name (also matching the parent's sibling, on purpose)

s3_data=f"s3://{BUCKET_NAME}/d"  # partial match of both the "data" and "debug" folders

will result in (local run):

# os.walk of /opt/ml/input/data - local mode
[
 ('/opt/ml/input/data', ['training'], []),
 ('/opt/ml/input/data/training', ['ata', 'ebug'], []),
 ('/opt/ml/input/data/training/ata', [], ['train-ground-truth.csv', 'train-data.csv', 'raw-data.txt']),
 ('/opt/ml/input/data/training/ebug', [], ['last.log'])
]

Note that "data" and "debug" got renamed to "ata" and "ebug" (the suffixes left after removing the ".../d" part of the provided prefix), but the names of the files contained within were not changed.

whereas remotely:

# os.walk of /opt/ml/input/data - remote mode
[
 ('/opt/ml/input/data', ['training'], ['training-manifest']),  
 ('/opt/ml/input/data/training', ['data', 'debug'], []),  
 ('/opt/ml/input/data/training/data',  [],  ['train-data.csv', 'raw-data.txt', 'train-ground-truth.csv']),  
 ('/opt/ml/input/data/training/debug', [], ['last.log'])
]

Both "data" and "debug" retain their original names.
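For comparison, the mapping that would reproduce the remote layout can be expressed as resolving every matched key relative to the parent of the prefix. This is only a sketch of the expected behaviour, assuming a flat list of matched keys; it is not the SDK's actual download code.

import posixpath

# Hypothetical sketch: map matched S3 keys to local paths the way remote mode does,
# i.e. relative to the parent "directory" of the prefix, so directory names stay intact.
prefix_key = "d"  # key portion of s3://{BUCKET_NAME}/d
matched_keys = [
    "data/train-data.csv",
    "data/train-ground-truth.csv",
    "data/raw-data.txt",
    "debug/last.log",
]

channel_root = "/opt/ml/input/data/training"
prefix_parent = posixpath.dirname(prefix_key)  # "" here; "data" for the "data/train" prefix

for key in matched_keys:
    relative = posixpath.relpath(key, prefix_parent or ".")
    print(posixpath.join(channel_root, relative))
# /opt/ml/input/data/training/data/train-data.csv
# ...
# /opt/ml/input/data/training/debug/last.log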

To reproduce

  1. Create the following directory structure in S3 (one way to do this with boto3 is sketched after this list)
s3://{BUCKET_NAME}
`-- data
    `-- train-data.csv
    `-- train-ground-truth.csv
    `-- raw-data.txt
`-- debug
     `-- last.log
  2. Create a Jupyter notebook with the following contents:
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import LocalPipelineSession, PipelineSession
from sagemaker.workflow.steps import TrainingStep

LOCAL_RUN = True  # set to False to run remotely
BUCKET_NAME = "<bucket-id>"  # replace with a valid bucket ID
IAM_ROLE = sagemaker.get_execution_role()

pipeline_session = LocalPipelineSession() if LOCAL_RUN else PipelineSession()

pytorch_estimator = PyTorch(
    sagemaker_session=pipeline_session,
    role=IAM_ROLE,
    instance_type="local" if LOCAL_RUN else "ml.c5.xlarge",
    instance_count=1,
    framework_version="1.9.1",
    py_version="py38",
    entry_point="./entry_point.py",
    code_location=f"s3://{BUCKET_NAME}/data"
)

step = TrainingStep(
    name="dummy-training-step",
    step_args=pytorch_estimator.fit(
        inputs=TrainingInput(
            s3_data_type="S3Prefix",
            s3_data=f"s3://{BUCKET_NAME}/data/train"  # for a partial match on filenames 
            # s3_data=f"s3://{BUCKET_NAME}/d"  # for a partial match on directory names 
        ),
    )
)

pipeline = Pipeline(
    name="s3prefix-debug",
    steps=[step],
    sagemaker_session=pipeline_session
)

pipeline.upsert(
    role_arn=IAM_ROLE, 
    description="debug pipeline"
)

execution = pipeline.start()
  3. Add an entry_point.py file and populate it with the following code. It is meant only to report the contents of the relevant part of the container's filesystem:
import os
from pprint import pprint
path = '/opt/ml/input/'
print(f"os.walk of {path}") 
pprint(list(os.walk(path)))
  4. Run the notebook twice: first with LOCAL_RUN set to True, then with LOCAL_RUN set to False. Save the log output of each run.
  5. Compare the os.walk output of both runs.
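For step 1, one way to create the test layout is sketched below with boto3; the bucket name is a placeholder and the file contents are irrelevant, since only the key names matter for the reproduction.

import boto3

BUCKET_NAME = "<bucket-id>"  # replace with a valid bucket ID
s3 = boto3.client("s3")

keys = [
    "data/train-data.csv",
    "data/train-ground-truth.csv",
    "data/raw-data.txt",
    "debug/last.log",
]
for key in keys:
    # Dummy content; the reproduction only depends on the key names.
    s3.put_object(Bucket=BUCKET_NAME, Key=key, Body=b"placeholder\n")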

Expected behavior

Local execution mode should preserve the full basenames of matched files and directories, just as remote mode does.

System information

A description of your system. Please provide:

  • SageMaker Python SDK version: v2.126
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch
  • Framework version: 1.9.1
  • Python version: 3.8
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N
