
Names of files partially matched by S3Prefix differ between local and remote job environments #3612

Open
@macsaktr

Description

Describe the bug

Input objects configured with s3_data_type="S3Prefix":

  • in remote jobs, will name matched files using their full basename;
  • in local mode, will name matched files using only the suffix that remains after the prefix is removed.

This behaviour is exhibited by both TrainingInput and ProcessingInput objects using s3_data_type="S3Prefix".

Here's an example to illustrate this issue:

Case 1 - partial match of filenames

Given the following S3 folder structure:

s3://{BUCKET_NAME}
`-- data
    `-- train-data.csv
    `-- train-ground-truth.csv

And an input definition which only partially matches the desired files (i.e. too short to fully match any of the files, but too long to match only the parent directory):

TrainingInput(  # could be replaced by ProcessingInput - the behaviour is the same
    s3_data_type="S3Prefix",
    s3_data=f"s3://{BUCKET_NAME}/data/train" # S3Prefix partially matches the names of relevant files
),

When run in local mode, the training inputs folder will contain:

#  os.walk of /opt/ml/input/data - local mode
[
  ('/opt/ml/input/data', ['training'], []),
  ('/opt/ml/input/data/training', [], ['-data.csv', '-ground-truth.csv'])
]

The matching part of the prefix ("train") was removed from the filenames,

whereas in remote mode:

# `os.walk` of /opt/ml/input/data - remote mode
[
 ('/opt/ml/input/data', ['training'], []),
 ('/opt/ml/input/data/training', [], ['train-data.csv', 'train-ground-truth.csv'])
]

the filenames remained unchanged.
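For reference, here is a minimal sketch of the two naming strategies suggested by the outputs above (an assumption drawn from the observed results, not from the SDK's actual implementation): remote jobs appear to keep each key's basename, while local mode appears to strip the matched prefix from the key.

import os

# Hypothetical illustration of the two naming strategies observed above.
prefix_key = "data/train"  # key portion of s3://{BUCKET_NAME}/data/train
matched_keys = ["data/train-data.csv", "data/train-ground-truth.csv"]

for key in matched_keys:
    remote_style = os.path.basename(key)  # -> "train-data.csv", "train-ground-truth.csv"
    local_style = key[len(prefix_key):]   # -> "-data.csv", "-ground-truth.csv"
    print(f"{remote_style!r} (remote) vs {local_style!r} (local)")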

Case 2 - partial match of parent directory name

If the s3_data URI only partially matches a directory name, the behaviour is similar, but applies to the directory names instead.
Given an expanded folder structure:

s3://{BUCKET_NAME}
`-- data
    `-- train-data.csv
    `-- train-ground-truth.csv
    `-- raw-data.txt
`-- debug
     `-- last.log

and an S3 URI which partially matches a parent directory name (also matching the parent's sibling, on purpose)

s3_data=f"s3://{BUCKET_NAME}/d"  # partial match of both the "data" and "debug" folders

will result in (local run):

# os.walk of /opt/ml/input/data - local mode
[
 ('/opt/ml/input/data', ['training'], []),
 ('/opt/ml/input/data/training', ['ata', 'ebug'], []),
 ('/opt/ml/input/data/training/ata', [], ['train-ground-truth.csv', 'train-data.csv', 'raw-data.txt']),
 ('/opt/ml/input/data/training/ebug', [], ['last.log'])
]

Note that "data" and "debug" got renamed to "ata" and "ebug" (the suffixes left after removing the ".../d" part of the provided prefix), but the names of the files contained within were not changed.

whereas remotely:

# os.walk of /opt/ml/input/data - remote mode
[
 ('/opt/ml/input/data', ['training'], ['training-manifest']),  
 ('/opt/ml/input/data/training', ['data', 'debug'], []),  
 ('/opt/ml/input/data/training/data',  [],  ['train-data.csv', 'raw-data.txt', 'train-ground-truth.csv']),  
 ('/opt/ml/input/data/training/debug', [], ['last.log'])
]

Both "data" and "debug" retain their original names.
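For comparison, the mapping that would reproduce the remote layout can be expressed as resolving every matched key relative to the parent of the prefix. This is only a sketch of the expected behaviour, assuming a flat list of matched keys; it is not the SDK's actual download code.

import posixpath

# Hypothetical sketch: map matched S3 keys to local paths the way remote mode does,
# i.e. relative to the parent "directory" of the prefix, so directory names stay intact.
prefix_key = "d"  # key portion of s3://{BUCKET_NAME}/d
matched_keys = [
    "data/train-data.csv",
    "data/train-ground-truth.csv",
    "data/raw-data.txt",
    "debug/last.log",
]

channel_root = "/opt/ml/input/data/training"
prefix_parent = posixpath.dirname(prefix_key)  # "" here; "data" for the "data/train" prefix

for key in matched_keys:
    relative = posixpath.relpath(key, prefix_parent or ".")
    print(posixpath.join(channel_root, relative))
# /opt/ml/input/data/training/data/train-data.csv
# ...
# /opt/ml/input/data/training/debug/last.log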

To reproduce

  1. Create the following directory structure in S3 (one way to do this with boto3 is sketched after this list)
s3://{BUCKET_NAME}
`-- data
    `-- train-data.csv
    `-- train-ground-truth.csv
    `-- raw-data.txt
`-- debug
     `-- last.log
  2. Create a Jupyter notebook with the following contents:
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import LocalPipelineSession, PipelineSession
from sagemaker.workflow.steps import TrainingStep

LOCAL_RUN = True  # set to False to run remotely
BUCKET_NAME = "<bucket-id>"  # replace with a valid bucket ID
IAM_ROLE = sagemaker.get_execution_role()

pipeline_session = LocalPipelineSession() if LOCAL_RUN else PipelineSession()

pytorch_estimator = PyTorch(
    sagemaker_session=pipeline_session,
    role=IAM_ROLE,
    instance_type="local" if LOCAL_RUN else "ml.c5.xlarge",
    instance_count=1,
    framework_version="1.9.1",
    py_version="py38",
    entry_point="./entry_point.py",
    code_location=f"s3://{BUCKET_NAME}/data"
)

step = TrainingStep(
    name="dummy-training-step",
    step_args=pytorch_estimator.fit(
        inputs=TrainingInput(
            s3_data_type="S3Prefix",
            s3_data=f"s3://{BUCKET_NAME}/data/train"  # for a partial match on filenames 
            # s3_data=f"s3://{BUCKET_NAME}/d"  # for a partial match on directory names 
        ),
    )
)

pipeline = Pipeline(
    name="s3prefix-debug",
    steps=[step],
    sagemaker_session=pipeline_session
)

pipeline.upsert(
    role_arn=IAM_ROLE, 
    description="debug pipeline"
)

execution = pipeline.start()
  3. Add an entry_point.py file and populate it with the following code. It is meant only to report the contents of the relevant part of the container's filesystem:
import os
from pprint import pprint
path = '/opt/ml/input/'
print(f"os.walk of {path}") 
pprint(list(os.walk(path)))
  4. Run the notebook twice: first with LOCAL_RUN set to True, then with LOCAL_RUN set to False. Save the log output of each run.
  5. Compare the os.walk output of both runs.
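For step 1, one way to create the test layout is sketched below with boto3; the bucket name is a placeholder and the file contents are irrelevant, since only the key names matter for the reproduction.

import boto3

BUCKET_NAME = "<bucket-id>"  # replace with a valid bucket ID
s3 = boto3.client("s3")

keys = [
    "data/train-data.csv",
    "data/train-ground-truth.csv",
    "data/raw-data.txt",
    "debug/last.log",
]
for key in keys:
    # Dummy content; the reproduction only depends on the key names.
    s3.put_object(Bucket=BUCKET_NAME, Key=key, Body=b"placeholder\n")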

Expected behavior

Local execution mode should preserve the full basenames of matched files and directories, just as remote mode does.

System information

A description of your system. Please provide:

  • SageMaker Python SDK version: v2.126
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch
  • Framework version: 1.9.1
  • Python version: 3.8
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N
