Skip to content

Commit 9ccf899

Browse files
authored
Route callFunctions through the planner (#345)
* skeleton for the planner work * scheduler: fix compilation issues * tests: fix compilation issues * tests: planner, proto, redis, and state tests working * tests: util tests running * tests: some mpi tests working * tests: mpi tests working * planner: update endpoint to use new callbatch method * tests: towards migration test working * debugging migration tests * tests: migration tests in the scheduler passing * tests: scheduler tests passing * nits: run clang-format * git: bump minor version * dist-tests: all mpi tests (minus migration) working * dist-tests: mpi function migration tests running * dist-tests: all mpi dist tests working * mpi: remove obsolete migration check period * dist-tests: dist tests passing commenting out remote threads * nits: run clang format * tests: fix failing json test * tests: fix failing mpi test * nits: run clang format * planner: add client/server method to get scheduling decision and add tests * tests: remove unnecessary mpi sleep between tests * threads: remove register/deregister and use a common result management via planner (all tests passing locally without sanitisers) * tests: fix typo in tests * batch: allow different messages to have different function names * tests: set enough resources for executor tests * nit: run clang format * tasks: overwrite cpu count in tests * tasks: override cpu count in tests * cleaup + fix env. variable setting * set environment variable to a string * tests: fix failing tests * tests: add enough slots for executor reaping test * scheduler: remove unnecessary accounting structures * mpi: clean-up todos and comments * tests: wait for mpi messages to be executed when counting the number of messages sent to avoid race conditions * tests: more mpi message waiting * mpi: remove comment * planner: clean-up * threads: re-introduce remote threads (most tests and dist-tests passing, still need to uncomment more) * dist-tests: re-introduce all remote threads tests * tests: fix executor tests failing after planner resource managing update * tests: fix scheduler tests after planner changes * tests: fix snapshot tests after changes * tests: fix util tests * migration: don't release slots when setting a migration message result, as we already release them when making the decision * dist-tests: fix multiple mpi world migration too * dist-tests: fix race condition in mpi dist test * executor: add test using executeThreads * planner-cli: remove callFunctions signature with hints * scheduler: clean-up includes in main scheduler file * snapshot: clean-up and make clear that we currently don't delete snapshots * self-review: cleanup * more self-review * planner: add log level to print state * scheduler: use defined timeouts instead of hardcoded numbers * tests: remove race in mpi test * planner: add info logging around migration opportunities * dist-tests: make migration test less flaky
1 parent a4e9d23 commit 9ccf899

File tree

70 files changed

+2100
-3512
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

70 files changed

+2100
-3512
lines changed

.env

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
FAABRIC_VERSION=0.6.1
2-
FAABRIC_CLI_IMAGE=faasm.azurecr.io/faabric:0.6.1
1+
FAABRIC_VERSION=0.7.0
2+
FAABRIC_CLI_IMAGE=faasm.azurecr.io/faabric:0.7.0
33
COMPOSE_PROJECT_NAME=faabric-dev
44
CONAN_CACHE_MOUNT_SOURCE=./conan-cache/

.github/workflows/tests.yml

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,7 @@ jobs:
2323
if: github.event.pull_request.draft == false
2424
runs-on: ubuntu-latest
2525
container:
26-
image: faasm.azurecr.io/faabric:0.6.1
26+
image: faasm.azurecr.io/faabric:0.7.0
2727
credentials:
2828
username: ${{ secrets.ACR_SERVICE_PRINCIPAL_ID }}
2929
password: ${{ secrets.ACR_SERVICE_PRINCIPAL_PASSWORD }}
@@ -36,7 +36,7 @@ jobs:
3636
if: github.event.pull_request.draft == false
3737
runs-on: ubuntu-latest
3838
container:
39-
image: faasm.azurecr.io/faabric:0.6.1
39+
image: faasm.azurecr.io/faabric:0.7.0
4040
credentials:
4141
username: ${{ secrets.ACR_SERVICE_PRINCIPAL_ID }}
4242
password: ${{ secrets.ACR_SERVICE_PRINCIPAL_PASSWORD }}
@@ -50,7 +50,7 @@ jobs:
5050
if: github.event.pull_request.draft == false
5151
runs-on: ubuntu-latest
5252
container:
53-
image: faasm.azurecr.io/faabric:0.6.1
53+
image: faasm.azurecr.io/faabric:0.7.0
5454
credentials:
5555
username: ${{ secrets.ACR_SERVICE_PRINCIPAL_ID }}
5656
password: ${{ secrets.ACR_SERVICE_PRINCIPAL_PASSWORD }}
@@ -73,7 +73,7 @@ jobs:
7373
REDIS_QUEUE_HOST: redis
7474
REDIS_STATE_HOST: redis
7575
container:
76-
image: faasm.azurecr.io/faabric:0.6.1
76+
image: faasm.azurecr.io/faabric:0.7.0
7777
credentials:
7878
username: ${{ secrets.ACR_SERVICE_PRINCIPAL_ID }}
7979
password: ${{ secrets.ACR_SERVICE_PRINCIPAL_PASSWORD }}
@@ -113,7 +113,7 @@ jobs:
113113
REDIS_QUEUE_HOST: redis
114114
REDIS_STATE_HOST: redis
115115
container:
116-
image: faasm.azurecr.io/faabric:0.6.1
116+
image: faasm.azurecr.io/faabric:0.7.0
117117
credentials:
118118
username: ${{ secrets.ACR_SERVICE_PRINCIPAL_ID }}
119119
password: ${{ secrets.ACR_SERVICE_PRINCIPAL_PASSWORD }}
@@ -167,7 +167,7 @@ jobs:
167167
REDIS_QUEUE_HOST: redis
168168
REDIS_STATE_HOST: redis
169169
container:
170-
image: faasm.azurecr.io/faabric:0.6.1
170+
image: faasm.azurecr.io/faabric:0.7.0
171171
credentials:
172172
username: ${{ secrets.ACR_SERVICE_PRINCIPAL_ID }}
173173
password: ${{ secrets.ACR_SERVICE_PRINCIPAL_PASSWORD }}

VERSION

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
0.6.1
1+
0.7.0

include/faabric/batch-scheduler/BatchScheduler.h

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -7,10 +7,11 @@
77

88
#define DO_NOT_MIGRATE -98
99
#define DO_NOT_MIGRATE_DECISION \
10-
SchedulingDecision(DO_NOT_MIGRATE, DO_NOT_MIGRATE)
10+
faabric::batch_scheduler::SchedulingDecision(DO_NOT_MIGRATE, DO_NOT_MIGRATE)
1111
#define NOT_ENOUGH_SLOTS -99
1212
#define NOT_ENOUGH_SLOTS_DECISION \
13-
SchedulingDecision(NOT_ENOUGH_SLOTS, NOT_ENOUGH_SLOTS)
13+
faabric::batch_scheduler::SchedulingDecision(NOT_ENOUGH_SLOTS, \
14+
NOT_ENOUGH_SLOTS)
1415

1516
namespace faabric::batch_scheduler {
1617

@@ -70,7 +71,7 @@ class BatchScheduler
7071
std::shared_ptr<faabric::BatchExecuteRequest> req);
7172

7273
virtual std::shared_ptr<SchedulingDecision> makeSchedulingDecision(
73-
const HostMap& hostMap,
74+
HostMap& hostMap,
7475
const InFlightReqs& inFlightReqs,
7576
std::shared_ptr<faabric::BatchExecuteRequest> req) = 0;
7677

@@ -111,7 +112,7 @@ class BatchScheduler
111112
std::shared_ptr<SchedulingDecision> decisionB) = 0;
112113

113114
virtual std::vector<Host> getSortedHosts(
114-
const HostMap& hostMap,
115+
HostMap& hostMap,
115116
const InFlightReqs& inFlightReqs,
116117
std::shared_ptr<faabric::BatchExecuteRequest> req,
117118
const DecisionType& decisionType) = 0;

include/faabric/batch-scheduler/BinPackScheduler.h

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ class BinPackScheduler final : public BatchScheduler
1111
{
1212
public:
1313
std::shared_ptr<SchedulingDecision> makeSchedulingDecision(
14-
const HostMap& hostMap,
14+
HostMap& hostMap,
1515
const InFlightReqs& inFlightReqs,
1616
std::shared_ptr<faabric::BatchExecuteRequest> req) override;
1717

@@ -21,7 +21,7 @@ class BinPackScheduler final : public BatchScheduler
2121
std::shared_ptr<SchedulingDecision> decisionB) override;
2222

2323
std::vector<Host> getSortedHosts(
24-
const HostMap& hostMap,
24+
HostMap& hostMap,
2525
const InFlightReqs& inFlightReqs,
2626
std::shared_ptr<faabric::BatchExecuteRequest> req,
2727
const DecisionType& decisionType) override;

include/faabric/batch-scheduler/SchedulingDecision.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -101,7 +101,7 @@ class SchedulingDecision
101101

102102
std::set<std::string> uniqueHosts();
103103

104-
void print();
104+
void print(const std::string& logLevel = "debug");
105105
};
106106

107107
}

include/faabric/mpi/MpiWorld.h

Lines changed: 1 addition & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -201,9 +201,7 @@ class MpiWorld
201201

202202
/* Function Migration */
203203

204-
void prepareMigration(
205-
int thisRank,
206-
std::shared_ptr<faabric::PendingMigrations> pendingMigrations);
204+
void prepareMigration(int thisRank);
207205

208206
private:
209207
int id = -1;
@@ -267,8 +265,5 @@ class MpiWorld
267265
int count,
268266
MPI_Status* status,
269267
MPIMessage::MPIMessageType messageType = MPIMessage::NORMAL);
270-
271-
/* Function migration */
272-
bool hasBeenMigrated = false;
273268
};
274269
}

include/faabric/planner/Planner.h

Lines changed: 28 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,10 @@
11
#pragma once
22

3+
#include <faabric/batch-scheduler/SchedulingDecision.h>
34
#include <faabric/planner/PlannerState.h>
45
#include <faabric/planner/planner.pb.h>
56
#include <faabric/proto/faabric.pb.h>
7+
#include <faabric/snapshot/SnapshotRegistry.h>
68

79
#include <shared_mutex>
810

@@ -31,7 +33,7 @@ class Planner
3133
void printConfig() const;
3234

3335
// ----------
34-
// Util
36+
// Util public API
3537
// ----------
3638

3739
bool reset();
@@ -64,6 +66,12 @@ class Planner
6466
std::shared_ptr<faabric::BatchExecuteRequestStatus> getBatchResults(
6567
int32_t appId);
6668

69+
std::shared_ptr<faabric::batch_scheduler::SchedulingDecision>
70+
getSchedulingDecision(std::shared_ptr<BatchExecuteRequest> req);
71+
72+
std::shared_ptr<faabric::batch_scheduler::SchedulingDecision> callBatch(
73+
std::shared_ptr<BatchExecuteRequest> req);
74+
6775
private:
6876
// There's a singleton instance of the planner running, but it must allow
6977
// concurrent requests
@@ -72,12 +80,31 @@ class Planner
7280
PlannerState state;
7381
PlannerConfig config;
7482

83+
// Snapshot registry to distribute snapshots in THREADS requests
84+
faabric::snapshot::SnapshotRegistry& snapshotRegistry;
85+
86+
// ----------
87+
// Util private API
88+
// ----------
89+
7590
void flushHosts();
7691

7792
void flushExecutors();
7893

94+
// ----------
95+
// Host membership private API
96+
// ----------
97+
7998
// Check if a host's registration timestamp has expired
8099
bool isHostExpired(std::shared_ptr<Host> host, long epochTimeMs = 0);
100+
101+
// ----------
102+
// Request scheduling private API
103+
// ----------
104+
105+
void dispatchSchedulingDecision(
106+
std::shared_ptr<faabric::BatchExecuteRequest> req,
107+
std::shared_ptr<faabric::batch_scheduler::SchedulingDecision> decision);
81108
};
82109

83110
Planner& getPlanner();

include/faabric/planner/PlannerApi.h

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,5 +13,7 @@ enum PlannerCalls
1313
// Scheduling calls
1414
SetMessageResult = 8,
1515
GetMessageResult = 9,
16+
GetSchedulingDecision = 10,
17+
CallBatch = 11,
1618
};
1719
}

include/faabric/planner/PlannerClient.h

Lines changed: 15 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
#pragma once
22

3+
#include <faabric/batch-scheduler/SchedulingDecision.h>
34
#include <faabric/planner/planner.pb.h>
5+
#include <faabric/snapshot/SnapshotClient.h>
46
#include <faabric/transport/MessageEndpointClient.h>
57
#include <faabric/util/PeriodicBackgroundThread.h>
68

@@ -31,12 +33,14 @@ class KeepAliveThread : public faabric::util::PeriodicBackgroundThread
3133
};
3234

3335
/*
34-
* Local state associated with the current host, used to cache results and
35-
* avoid unnecessary interactions with the planner server.
36+
* Local state associated with the current host, used to store useful state
37+
* like cached results to unnecessary interactions with the planner server.
3638
*/
3739
struct PlannerCache
3840
{
3941
std::unordered_map<uint32_t, MessageResultPromisePtr> plannerResults;
42+
// Keeps track of the snapshots that have been pushed to the planner
43+
std::set<std::string> pushedSnapshots;
4044
};
4145

4246
/*
@@ -79,10 +83,19 @@ class PlannerClient final : public faabric::transport::MessageEndpointClient
7983
faabric::Message getMessageResult(const faabric::Message& msg,
8084
int timeoutMs);
8185

86+
faabric::batch_scheduler::SchedulingDecision callFunctions(
87+
std::shared_ptr<faabric::BatchExecuteRequest> req);
88+
89+
faabric::batch_scheduler::SchedulingDecision getSchedulingDecision(
90+
std::shared_ptr<faabric::BatchExecuteRequest> req);
91+
8292
private:
8393
std::mutex plannerCacheMx;
8494
PlannerCache cache;
8595

96+
// Snapshot client for the planner snapshot server
97+
std::shared_ptr<faabric::snapshot::SnapshotClient> snapshotClient;
98+
8699
faabric::Message doGetMessageResult(
87100
std::shared_ptr<faabric::Message> msgPtr,
88101
int timeoutMs);

include/faabric/planner/PlannerServer.h

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,12 @@ class PlannerServer final : public faabric::transport::MessageEndpointServer
3434
std::unique_ptr<google::protobuf::Message> recvGetMessageResult(
3535
std::span<const uint8_t> buffer);
3636

37+
std::unique_ptr<google::protobuf::Message> recvGetSchedulingDecision(
38+
std::span<const uint8_t> buffer);
39+
40+
std::unique_ptr<google::protobuf::Message> recvCallBatch(
41+
std::span<const uint8_t> buffer);
42+
3743
private:
3844
faabric::planner::Planner& planner;
3945
};

include/faabric/planner/PlannerState.h

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
11
#pragma once
22

3+
#include <faabric/batch-scheduler/BatchScheduler.h>
34
#include <faabric/planner/planner.pb.h>
45
#include <faabric/proto/faabric.pb.h>
56

@@ -23,5 +24,8 @@ struct PlannerState
2324
// Map holding the hosts that have registered interest in getting an app
2425
// result
2526
std::map<int, std::vector<std::string>> appResultWaiters;
27+
28+
// Map keeping track of the requests that are in-flight
29+
faabric::batch_scheduler::InFlightReqs inFlightReqs;
2630
};
2731
}

include/faabric/scheduler/FunctionCallApi.h

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -6,9 +6,6 @@ enum FunctionCalls
66
NoFunctionCall = 0,
77
ExecuteFunctions = 1,
88
Flush = 2,
9-
Unregister = 3,
10-
GetResources = 4,
11-
PendingMigrations = 5,
12-
SetMessageResult = 6,
9+
SetMessageResult = 3,
1310
};
1411
}

include/faabric/scheduler/FunctionCallClient.h

Lines changed: 0 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -20,21 +20,9 @@ std::vector<
2020
std::pair<std::string, std::shared_ptr<faabric::BatchExecuteRequest>>>
2121
getBatchRequests();
2222

23-
std::vector<std::pair<std::string, faabric::EmptyRequest>>
24-
getResourceRequests();
25-
26-
std::vector<std::pair<std::string, std::shared_ptr<faabric::PendingMigrations>>>
27-
getPendingMigrationsRequests();
28-
29-
std::vector<std::pair<std::string, faabric::UnregisterRequest>>
30-
getUnregisterRequests();
31-
3223
std::vector<std::pair<std::string, std::shared_ptr<faabric::Message>>>
3324
getMessageResults();
3425

35-
void queueResourceResponse(const std::string& host,
36-
faabric::HostResources& res);
37-
3826
void clearMockRequests();
3927

4028
// -----------------------------------
@@ -52,14 +40,8 @@ class FunctionCallClient : public faabric::transport::MessageEndpointClient
5240

5341
void sendFlush();
5442

55-
faabric::HostResources getResources();
56-
57-
void sendPendingMigrations(std::shared_ptr<faabric::PendingMigrations> req);
58-
5943
void executeFunctions(std::shared_ptr<faabric::BatchExecuteRequest> req);
6044

61-
void unregister(faabric::UnregisterRequest& req);
62-
6345
void setMessageResult(std::shared_ptr<faabric::Message> msg);
6446
};
6547

include/faabric/scheduler/FunctionCallServer.h

Lines changed: 0 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -23,16 +23,8 @@ class FunctionCallServer final
2323
std::unique_ptr<google::protobuf::Message> recvFlush(
2424
std::span<const uint8_t> buffer);
2525

26-
std::unique_ptr<google::protobuf::Message> recvGetResources(
27-
std::span<const uint8_t> buffer);
28-
29-
std::unique_ptr<google::protobuf::Message> recvPendingMigrations(
30-
std::span<const uint8_t> buffer);
31-
3226
void recvExecuteFunctions(std::span<const uint8_t> buffer);
3327

34-
void recvUnregister(std::span<const uint8_t> buffer);
35-
3628
void recvSetMessageResult(std::span<const uint8_t> buffer);
3729
};
3830
}

0 commit comments

Comments
 (0)