Cloud Component
The Cloud Component is the component in charge of creating, coordinating, and terminating DML Sessions. It consists of a pool of Cloud Nodes, each with a one-to-one relationship to a Repo.
The Cloud Component's main function is to coordinate training between the Library Nodes and aggregate the weights received from them. While working, each Cloud Node holds state related to the DML Session in progress.
- The Dashboard API queries the Cloud Node for its status (busy or idle) via an HTTP request to one of the Cloud Node's endpoints.
- Explora submits DML Jobs to the Cloud Node, which the Cloud Node uses to create a new DML Session and execute it with the Libraries connected to it at the time. This is currently a WebSocket message but could be changed to an HTTP message.
- The Libraries use their API Key to discover and authenticate with their corresponding Cloud Node over WebSockets. They communicate constantly through WebSockets and HTTP over the course of a DML Session (e.g., the Cloud Node may notify (WS) the Libraries that a new round should start, the Libraries may download (HTTP) the new model from the Cloud Node, and then send the new weights (WS) back to the Cloud Node once training is done).
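The round-based message flow described above can be sketched as plain JSON payloads. The message types and field names below are illustrative assumptions, not the Cloud Node's actual wire format:

```python
# Hypothetical JSON messages for one Federated Learning round; the "type"
# values and field names are illustrative assumptions, not the real protocol.
import json

def new_round_message(session_id, round_num, model_url):
    # Sent (WS) by the Cloud Node to tell Libraries a new round should start;
    # the model itself is then downloaded (HTTP) from model_url.
    return json.dumps({
        "type": "NEW_ROUND",
        "session_id": session_id,
        "round": round_num,
        "model_url": model_url,
    })

def weights_message(session_id, round_num, weights):
    # Sent (WS) by a Library back to the Cloud Node once training is done.
    return json.dumps({
        "type": "NEW_WEIGHTS",
        "session_id": session_id,
        "round": round_num,
        "weights": weights,
    })
```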
Cloud Nodes are built using Python 3.6, Autobahn, and Flask. They are deployed upon the creation of a Repo through the Dashboard Component and mass-updated on every new commit to the cloud-node repository on GitHub.
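Since the HTTP side of the stack is Flask, the status endpoint the Dashboard API polls could look roughly like the sketch below. The route name and session flag are assumptions for illustration, not the actual Cloud Node code:

```python
# Minimal sketch of a Cloud Node status endpoint, assuming Flask.
# The "/status" route and the session flag are illustrative assumptions.
from flask import Flask, jsonify

app = Flask(__name__)

# In the real Cloud Node this would be the state of the active DML Session.
current_session = None

@app.route("/status")
def status():
    # Busy while a DML Session is in progress, idle otherwise.
    return jsonify({"busy": current_session is not None})
```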
At the moment, Cloud Nodes only support Keras models sent by Explora and can only work with Libraries that train using TensorFlow.js.
See Cloud Node's Endpoints Docs.
Each Cloud Node is deployed programmatically by the Dashboard API through AWS Elastic Beanstalk, in the us-west-1 region, under the cloud-node application, and into a new environment named after the newly created Repo ID. The URL of a Cloud Node then becomes <repo_id>.au4c4pd2ch.us-west-1.elasticbeanstalk.com.
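A hypothetical helper deriving a Cloud Node's URL from its Repo ID, following the naming scheme above (the CNAME prefix is taken directly from the text; the helper name is an assumption):

```python
# Build a Cloud Node URL from a Repo ID per the Elastic Beanstalk naming
# scheme described above; the function name is illustrative.
def cloud_node_url(repo_id):
    return "{}.au4c4pd2ch.us-west-1.elasticbeanstalk.com".format(repo_id)
```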
To guarantee that the latest version of the Cloud Node is deployed automatically during the creation of a Repo, the Dashboard API pulls the Cloud Node source from S3 (from s3://cloud-node-deployment/cloud-node/cloudnode_build.zip), which is updated automatically on every change to the cloud-node GitHub repo (see below for more details).
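For illustration, the Elastic Beanstalk parameters the Dashboard API might assemble when deploying a Cloud Node could look like this. This is a sketch based on the description above, not the real deployment code; the function name and dict shape are assumptions:

```python
# Sketch of the deployment parameters for a new Cloud Node environment;
# names are assumptions based on the description above, not actual code.
def eb_environment_params(repo_id, version_label):
    return {
        "ApplicationName": "cloud-node",   # shared EB application
        "EnvironmentName": repo_id,        # environment named after the Repo ID
        "VersionLabel": version_label,
        # The application version itself would be created from the S3 bundle:
        # s3://cloud-node-deployment/cloud-node/cloudnode_build.zip
    }
```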
Cloud Nodes are mass-updated and self-updated through AWS CodePipeline.
On the creation of a Repo and its corresponding Cloud Node, the cloud-node-deploy CodePipeline deployment trigger is updated so that the newly created Cloud Node is upgraded on every push to the master branch of the cloud-node GitHub repo.
To guarantee that Cloud Nodes yet to be created have the latest version of the source code, this pipeline also uploads a zip file of the cloud-node source code to S3 (to s3://cloud-node-deployment/cloud-node/cloudnode_build.zip), which is pushed to AWS Elastic Beanstalk on the creation of a Cloud Node (see the section above for more details).
Security in the Cloud Nodes is minimal at the moment: Secure WebSocket (WSS) between Explora and Cloud Nodes, or between Libraries and Cloud Nodes, is not implemented yet; the JWT token is not checked when the Dashboard makes a request to a Cloud Node; and the API Key is not authorized when a Library talks to a Cloud Node. There also needs to be a way for the Library to authenticate the Cloud Node.
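As a sketch of what the missing JWT check could look like, here is a minimal HS256 verification using only the standard library. The shared-secret scheme and payload shape are assumptions, and a real fix would likely use a vetted JWT library rather than hand-rolled code:

```python
# Hand-rolled HS256 JWT sign/verify sketch (illustrative only); a real
# Cloud Node would more likely use a maintained JWT library.
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))


def sign_jwt(payload: dict, secret: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    signing_input = "{}.{}".format(header, body).encode()
    sig = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return "{}.{}.{}".format(header, body, sig)


def verify_jwt(token: str, secret: bytes) -> Optional[dict]:
    # Returns the payload if the signature checks out, None otherwise.
    try:
        header, body, sig = token.split(".")
    except ValueError:
        return None
    signing_input = "{}.{}".format(header, body).encode()
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(_b64url_decode(body))
```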
- Scale better
- Idea 1: Larger instance types
- Idea 2: Separate Coordinator from Aggregator
- Idea 3: Unlink Cloud Nodes from Repos and create a Cloud Node Pool from which any Repo can rent nodes depending on the size of the job
- Implement an aggregation algorithm/system that doesn't require Cloud Nodes
- Fix Known Issues
- Security is not implemented in the Cloud Nodes yet (no WSS, JWT checks, or API Key authorization)
- It's unclear what happens to an active DML Session when its Cloud Node is being updated (the session may terminate abruptly)
- Federated Learning rounds do not time out: if too many nodes disconnect in the middle of a session, the round halts until enough nodes connect and train for that round
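A possible shape for the missing round timeout, sketched with an injectable clock so it can be tested without waiting. The function name, threshold, and defaults are illustrative assumptions, not a proposal for the actual implementation:

```python
# Sketch of a round timeout: collect weight updates until enough Libraries
# have reported or the deadline passes. Names and defaults are assumptions.
import time

def collect_round_updates(updates, min_updates, timeout_s, clock=time.monotonic):
    deadline = clock() + timeout_s
    received = []
    for update in updates:  # e.g. weights arriving over WebSockets
        received.append(update)
        if len(received) >= min_updates:
            return received  # enough nodes trained this round
        if clock() >= deadline:
            break
    return None  # time out instead of halting the round forever
```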