Description
What did you find confusing? Please describe.
So, I've been trying to figure out the right JSON format to use when calling the Amazon SageMaker Factorization Machine model for predictions with AWS CLI. I thought the documentation was clear, but when I put it into practice, things didn't work out as expected. I'm not sure if I'm missing something or if my code has some other issue.
Describe how documentation can be improved
It would be helpful to have a working example of a factorization machine being trained and then called through the AWS CLI, or through the Boto3 client's invoke_endpoint, which is equivalent in the way the body of the request is submitted.
Additional context
Below is the reproducible code. Everything is run inside a Jupyter notebook on SageMaker Studio.
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip
%cd ml-100k
!shuf ua.base -o ua.base.shuffled
!head -10 ua.base.shuffled
!head -10 ua.test
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
import boto3, csv, io, json
import numpy as np
from scipy.sparse import lil_matrix
nbUsers=943
nbMovies=1682
nbFeatures=nbUsers+nbMovies
nbRatingsTrain=90570
nbRatingsTest=9430
# For each user, build a list of rated movies.
# We'd need this to add random negative samples.
moviesByUser = {}
for userId in range(nbUsers):
    moviesByUser[str(userId)]=[]
with open('ua.base.shuffled','r') as f:
    samples=csv.reader(f,delimiter='\t')
    for userId,movieId,rating,timestamp in samples:
        moviesByUser[str(int(userId)-1)].append(int(movieId)-1)
def loadDataset(filename, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    with open(filename,'r') as f:
        samples=csv.reader(f,delimiter='\t')
        for userId,movieId,rating,timestamp in samples:
            X[line,int(userId)-1] = 1
            X[line,int(nbUsers)+int(movieId)-1] = 1
            if int(rating) >= 4:
                Y.append(1)
            else:
                Y.append(0)
            line=line+1
    Y=np.array(Y).astype('float32')
    return X,Y
X_train, Y_train = loadDataset('ua.base.shuffled', nbRatingsTrain, nbFeatures)
X_test, Y_test = loadDataset('ua.test',nbRatingsTest,nbFeatures)
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nbRatingsTrain, nbFeatures)
assert Y_train.shape == (nbRatingsTrain, )
# count_nonzero counts the positive labels (ones), not the zeros
one_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (nbRatingsTrain-one_labels, one_labels))
print(X_test.shape)
print(Y_test.shape)
assert X_test.shape == (nbRatingsTest, nbFeatures)
assert Y_test.shape == (nbRatingsTest, )
# count_nonzero counts the positive labels (ones), not the zeros
one_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nbRatingsTest-one_labels, one_labels))
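To make the one-hot layout built by loadDataset concrete, here is a minimal standalone sketch of the column arithmetic. The helper name one_hot_columns is hypothetical and not part of the notebook; it only mirrors the two index expressions used above.

```python
# Dataset sizes taken from the notebook above.
NB_USERS = 943
NB_MOVIES = 1682
NB_FEATURES = NB_USERS + NB_MOVIES  # 2625 columns in total


def one_hot_columns(user_id, movie_id, nb_users=NB_USERS):
    """Return the two column indices set to 1 for a (user, movie) pair.

    IDs are 1-based in the MovieLens files; columns are 0-based.
    This helper is illustrative only and does not exist in the notebook.
    """
    user_col = user_id - 1                # users occupy columns 0..942
    movie_col = nb_users + movie_id - 1   # movies occupy columns 943..2624
    return user_col, movie_col


print(one_hot_columns(1, 1))       # (0, 943)
print(one_hot_columns(943, 1682))  # (942, 2624)
```

Every row therefore has exactly two non-zero entries, which is why a sparse request with two keys and a shape of 2625 describes a full sample.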
bucket = '<your-bucket>'
prefix = 'sagemaker/fm-movielens'
train_key = 'train.protobuf'
train_prefix = '{}/{}'.format(prefix, 'train3')
test_key = 'test.protobuf'
test_prefix = '{}/{}'.format(prefix, 'test3')
output_prefix = 's3://{}/{}/output'.format(bucket, prefix)
def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)
test_data = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)
print(train_data)
print(test_data)
print('Output: {}'.format(output_prefix))
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/factorization-machines:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/factorization-machines:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/factorization-machines:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/factorization-machines:latest'}
fm = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                   get_execution_role(),
                                   train_instance_count=1,
                                   train_instance_type='ml.c4.xlarge',
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())
fm.set_hyperparameters(feature_dim=nbFeatures,
                       predictor_type='binary_classifier',
                       mini_batch_size=1000,
                       num_factors=64,
                       epochs=100)
fm.fit({'train': train_data, 'test': test_data})
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer
class FMSerializer(JSONSerializer):
    def serialize(self, data):
        js = {'instances': []}
        for row in data:
            js['instances'].append({'features': row.tolist()})
        return json.dumps(js)
fm_predictor = fm.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    serializer=FMSerializer(),
    deserializer=JSONDeserializer())
Everything up to this point works. I can get predictions through the predictor's predict method as follows:
result = fm_predictor.predict(X_test[1000:1010].toarray())
print(result)
print (Y_test[1000:1010])
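For comparison, the predictor call above should correspond roughly to a raw invoke_endpoint request carrying the same dense JSON body. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. The function names and the endpoint name "my-fm-endpoint" are hypothetical placeholders, and the network call itself is not executed here.

```python
import json


def build_dense_request(rows, endpoint_name):
    """Build keyword arguments for sagemaker-runtime invoke_endpoint.

    rows: list of dense feature vectors (plain lists of numbers).
    endpoint_name: placeholder; replace with the deployed endpoint's name.
    """
    payload = {"instances": [{"features": list(row)} for row in rows]}
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Accept": "application/json",
        "Body": json.dumps(payload),
    }


def invoke(rows, endpoint_name):
    """Send the request. Needs boto3 and AWS credentials; not run in this sketch."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(**build_dense_request(rows, endpoint_name))
    return json.loads(response["Body"].read())


# Build (but do not send) a request for one toy three-feature row.
request = build_dense_request([[0.0, 1.0, 0.0]], "my-fm-endpoint")
print(request["Body"])
```

With the notebook's data, the rows argument could be X_test[1000:1010].toarray().tolist(), which matches what FMSerializer produces.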
But if I try to use the AWS CLI like this:
aws sagemaker-runtime invoke-endpoint ^
    --endpoint-name <your-endpoint-name> ^
    --body "{\"instances\": [ {\"features\": {\"keys\": [1, 2000], \"shape\": [2625], \"values\": [1, 1]}}]}" ^
    --content-type application/json ^
    --accept application/json ^
    --profile <yourprofile> ^
    results
I get the following error:
Invalid base64: "{"instances": [ {"features": {"keys": [1, 2000], "shape": [2625], "values": [1, 1]}}]}"
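This message is consistent with AWS CLI version 2, which by default expects blob parameters such as --body to be base64-encoded. That is an assumption about the installed CLI version, not something confirmed in this report. If it holds, one option is to add --cli-binary-format raw-in-base64-out to the command. Another is to encode the body first, as in this minimal sketch. The helper names are hypothetical.

```python
import base64
import json


def sparse_instance(keys, values, shape):
    """Describe one sample in the sparse JSON layout used in the request above."""
    return {"features": {"keys": list(keys),
                         "shape": list(shape),
                         "values": list(values)}}


def encode_body(instances):
    """Serialize the instances to JSON, then base64-encode the UTF-8 bytes."""
    raw = json.dumps({"instances": instances})
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


# Same sample as in the CLI call: columns 1 and 2000 set to 1, out of 2625.
body = encode_body([sparse_instance([1, 2000], [1, 1], [2625])])
print(body)
```

The printed string could then be passed as the --body value in place of the raw JSON, which also avoids the quote escaping needed in cmd.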
Any guidance on what is wrong with this request would be appreciated.
Thank you!