Single scheduler - multiple worker architecture with GRPC and Go

Single scheduler (conductor) and multiple workers
Single scheduler (conductor) and multiple workers

Part 2 - The scheduler overview

Full codegithub.com/KorayGocmen/scheduler-worker-grpc
Scheduler exposes an HTTP-server that gives a gateway to the scheduler-worker-cluster. HTTP requests to the scheduler api is translated and proxied into workers. - Scheduler runs an HTTP server as well as a GRPC server. - Scheduler keeps record of the available workers in the cluster. - Workers can register and deregister on the scheduler.
Data Structures When a new worker registers to the scheduler, a worker object is created. Pointer to this worker object is kept in the workers map with key being the worker id and value being the pointer to the worker. Worker objects holds the id of the worker and the address of the worker's GRPC server to start/end/query jobs.
workers
var (
  workersMutex = &sync.Mutex{}
  workers      = make(map[string]*worker)
)

// worker holds the information about registered workers
//    - id: uuid assigned when the worker first register.
//    - addr: workers network address, later used to create grpc client to the worker
type worker struct {
  id   string
  addr string
}
Configuration for the scheduler is done through the config.toml file. I wrote about how I approach configuration in Go projects. You can read more hereHandling configuration in Go
grpc_server: - addr: Address on which the GRPC server will be run. - use_tls: Whether the GRPC server should use TLS. If true, crt_file and key_file should be provided. - crt_file: Path to the certificate file for TLS. - key_file: Path to the key file for TLS. http_server: - addr: Address on which the HTTP server will be run on.
scheduler/config.toml
[grpc_server]
addr = "127.0.0.1:50000"
use_tls = false
crt_file = "server.pem"
key_file = "server.key"

[http_server]
addr = "127.0.0.1:3000"
GRPC Server Only 2 GRPC requests required in the scheduler GRPC server which are used to register and deregister workers on the scheduler.
Scheduler GRPC server definitions
service Scheduler {
  rpc RegisterWorker(RegisterReq) returns (RegisterRes) {}
  rpc DeregisterWorker(DeregisterReq) returns (DeregisterRes) {}
}

message RegisterReq {
  string address = 1;
}

message RegisterRes {
  bool success = 1;
  string workerID = 2;
}

message DeregisterReq {
  string workerID = 1;
}

message DeregisterRes {
  bool success = 1;
}
RegisterWorker is called by any worker that comes online, the scheduler then assigns a unique worker ID to that worker and registers the worker in the workers map with the id and the GRPC address passed in the request. This address will later be used to start/stop/query jobs on the worker. DeregisterWorker is called by any worker that goes offline, the workerID specified in the deregister request will be removed from known workers.
HTTP API There are 3 requests that can be made to the scheduler via HTTP. These are to start/stop/query jobs on a specific worker. A job is calling a specified file via a specified command on a worker. For example: if you have a python worker, scheduler can call a python script via python3 and query the results later. Start job: POST "/start" Request: - "Run which file on which worker with what command" - Provide: command / path / worker_id Response: - returns job_id: which can be used to query or stop the job later.
example start request
{
  "worker_id": "d5feef60-3029-11e9-b210-d663bd873d93",
  "command": "bash",
  "path": "worker/scripts/count.sh"
}
example start response - success
{
  "job_id": "4c5ced1c-5ea9-40f8-90ce-63d09cea26f6"
}
example start response - fail
{
  "error": "worker not found"
}
Stop job: POST "/stop" Request: - "Stop which job on which worker" - Provide: worker_id / job_id Response: - returns success: boolean response.
example stop request
{
  "worker_id": "d5feef60-3029-11e9-b210-d663bd873d93",
  "job_id": "4c5ced1c-5ea9-40f8-90ce-63d09cea26f6"
}
example stop response - success
{
  "success": true
}
example stop response - fail
{
  "error": "worker not found"
}
Query job: POST "/stop" Request: - "Query which job on which worker" - Provide: worker_id / job_id Response: - returns done: boolean for if the job is done. - returns error: boolean for if the job had an error. - returns error_text: string for the error thrown by the job.
example query request
{
  "worker_id": "d5feef60-3029-11e9-b210-d663bd873d93",
  "job_id": "4c5ced1c-5ea9-40f8-90ce-63d09cea26f6"
}
example query response - success
{
  "done": true,
  "error": true,
  "error_text": "signal: killed"
}
example query response - fail
{
  "error": "worker not found"
}
HTTP request and response structs on scheduler
// apiStartJobReq expected API payload for `/start`
type apiStartJobReq struct {
  Command  string `json:"command"`
  Path     string `json:"path"`
  WorkerID string `json:"worker_id"`
}

// apiStartJobRes returned API payload for `/start`
type apiStartJobRes struct {
  JobID string `json:"job_id"`
}

// apiStopJobReq expected API payload for `/stop`
type apiStopJobReq struct {
  JobID    string `json:"job_id"`
  WorkerID string `json:"worker_id"`
}

// apiStopJobRes returned API payload for `/stop`
type apiStopJobRes struct {
  Success bool `json:"success"`
}

// apiQueryJobReq expected API payload for `/query`
type apiQueryJobReq struct {
  JobID    string `json:"job_id"`
  WorkerID string `json:"worker_id"`
}

// apiQueryJobRes returned API payload for `/query`
type apiQueryJobRes struct {
  Done      bool   `json:"done"`
  Error     bool   `json:"error"`
  ErrorText string `json:"error_text"`
}

// apiError is used as a generic api response error
type apiError struct {
  Error string `json:"error"`
}

Koray Gocmen
Koray Gocmen

University of Toronto, Computer Engineering.

Architected and implemented reliable infrastructures and worked as the lead developer for multiple startups.