# GlusterFS

## 🪵 GlusterFS Setup for Shared Storage (Ray Cluster)

In a Ray distributed setup, all worker nodes need access to certain shared resources used by the application. This includes:

- `.env` (environment variables for models and settings)
- `.hydra_config` (application configuration)
- Uploaded files (`/data`)
- Model weights (e.g. `/model_weights` if using the HF local cache)
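After completing the steps below, the shared mount should end up looking roughly like this (an illustrative sketch; the exact contents depend on your application):

```
/ray_mount/
├── .env             # environment variables for models and settings
├── .hydra_config/   # application configuration
├── data/            # uploaded files
└── model_weights/   # model weights (e.g. HF local cache)
```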
### 1️⃣ Setup VPN (if required)

If your Ray nodes are not on the same local network, set up a VPN between them first.

➡ Refer to the dedicated VPN setup guide.

You can skip this step if your nodes are already on the same LAN.
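Either way, it is worth confirming that every node can reach the others before continuing. A minimal check, assuming placeholder addresses `10.0.0.1`–`10.0.0.4` (substitute your actual LAN or VPN IPs):

```bash
# Run from each node: every peer should answer both pings
for ip in 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4; do
  ping -c 2 "$ip"
done
```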
### 2️⃣ Setup GlusterFS (Distributed Filesystem)

GlusterFS lets you share and replicate storage across multiple nodes, providing redundancy and better fault tolerance.

This guide assumes:

- You have 4 machines on the same private network
- You want all of them to share `/ray_mount`
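The commands below refer to the node addresses repeatedly, so it can be convenient to keep them in shell variables. The addresses here are placeholders; replace them with your own:

```bash
# Placeholder IPs for the four nodes — substitute your own
NODE1=10.0.0.1   # e.g. the Ray head
NODE2=10.0.0.2
NODE3=10.0.0.3
NODE4=10.0.0.4
```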
#### 🔧 Install GlusterFS and start the service

Run this on all 4 machines:

```bash
sudo apt update
sudo apt install -y glusterfs-server
sudo systemctl enable --now glusterd
```
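Before peering, you can confirm the daemon came up on each machine:

```bash
# Should print "active" on every node
sudo systemctl is-active glusterd

# Prints the installed GlusterFS version
gluster --version
```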
#### 🤝 Connect all nodes into a trusted pool

From one node (e.g. the Ray head), run:
```bash
gluster peer probe <IP_OF_NODE_2>
gluster peer probe <IP_OF_NODE_3>
gluster peer probe <IP_OF_NODE_4>
```

Confirm with:
```bash
gluster peer status
```

Each peer should be listed with `State: Peer in Cluster (Connected)`.

#### Create bricks on each node

On each node, run:
```bash
sudo mkdir -p /gluster/bricks/ray_mount
```
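If the head node has SSH access to the others, a small loop can save running this by hand on each machine. This is a sketch that assumes the `NODE2`–`NODE4` placeholder variables from earlier, an `ubuntu` login, and passwordless sudo on the remote nodes:

```bash
# Create the brick directory on the remaining nodes from the head
for ip in "$NODE2" "$NODE3" "$NODE4"; do
  ssh "ubuntu@$ip" "sudo mkdir -p /gluster/bricks/ray_mount"
done
```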
#### 📦 Create the replicated GlusterFS volume

From one node (e.g. the Ray head):
```bash
gluster volume create rayvol replica 4 \
  <IP1>:/gluster/bricks/ray_mount \
  <IP2>:/gluster/bricks/ray_mount \
  <IP3>:/gluster/bricks/ray_mount \
  <IP4>:/gluster/bricks/ray_mount \
  force
```

The `force` flag is needed here because the bricks live on the root filesystem, which GlusterFS would otherwise refuse.

Start the volume:
```bash
gluster volume start rayvol
```
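Before mounting, check that the volume was created and started as expected:

```bash
# Shows the volume type (Replicate), replica count, and all four bricks
gluster volume info rayvol

# Shows whether each brick process is online
gluster volume status rayvol
```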
#### Mount the volume on all nodes

Install the client tools:
```bash
sudo apt install -y glusterfs-client
```

Create the mount point:
```bash
sudo mkdir -p /ray_mount
```

Mount it (on each node):
```bash
sudo mount -t glusterfs <ANY_NODE_IP>:/rayvol /ray_mount
```

To make this permanent across reboots:
```bash
echo "<ANY_NODE_IP>:/rayvol /ray_mount glusterfs defaults,_netdev 0 0" | sudo tee -a /etc/fstab
```

✅ Replace `<ANY_NODE_IP>` with one of your node IPs in the GlusterFS cluster.
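A quick way to confirm the mount is really shared: write a file on one node and read it from another.

```bash
# On one node: write a marker file into the shared mount
echo "hello from $(hostname)" | sudo tee /ray_mount/.gluster_test

# On any other node: the same content should appear immediately
cat /ray_mount/.gluster_test
```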
#### Copy required data to the shared folder

From any node:

```bash
sudo cp -r .hydra_config /ray_mount/
sudo cp .env /ray_mount/
sudo mkdir /ray_mount/data /ray_mount/model_weights
sudo chown -R ubuntu:ubuntu /ray_mount
```

✅ Ensure that ownership is set to the user running the Ray workers (e.g. `ubuntu`) so that all nodes can read and write.
Now all Ray nodes have consistent access to the required data and configuration via `/ray_mount`, backed by a fault-tolerant, distributed filesystem.
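As a final check, confirm that the user running the Ray workers can write to the mount on every node (assuming `ubuntu`, as above):

```bash
# Run on each node; both commands should succeed silently
sudo -u ubuntu touch /ray_mount/data/.write_test
sudo -u ubuntu rm /ray_mount/data/.write_test
```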