SCARIe on network-aware Grids
SCARIe is a Grid-based software correlator for radio-telescope imaging. It requires high-throughput communication, but also specific services such as soft real-time behavior or constant throughput. In addition, the application needs to claim and release resources on the fly as the result of an optimization process.
Processing a telescope signal requires high bandwidth because of the high sample rates used by the telescopes. Currently, the SCARIe application transports the telescope signals over TCP streams between the nodes, but other communication protocols, such as RTP, could be implemented as well.
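The per-pair TCP streaming between an input node and a correlator node can be sketched as follows. This is a minimal illustration, not SCARIe's actual implementation: the block size, node roles, and loopback wiring are assumptions made for the demo.

```python
import socket
import threading

BLOCK = 4096          # bytes per sample block (illustrative, not SCARIe's real frame size)
N_BLOCKS = 8          # number of blocks streamed in this demo

def input_node(conn):
    """Input node: push raw sample blocks over an established TCP stream."""
    for i in range(N_BLOCKS):
        conn.sendall(bytes([i % 256]) * BLOCK)
    conn.close()

def correlator_node(sock):
    """Correlator node: read fixed-size blocks until the stream ends."""
    received = 0
    buf = b""
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        buf += chunk
        while len(buf) >= BLOCK:
            buf = buf[BLOCK:]      # a real correlator would process the block here
            received += 1
    return received

# Wire the two roles together over the loopback interface.
server = socket.socket()
server.bind(("127.0.0.1", 0))      # ephemeral port
server.listen(1)
port = server.getsockname()[1]

client = socket.socket()
client.connect(("127.0.0.1", port))
conn, _ = server.accept()

sender = threading.Thread(target=input_node, args=(conn,))
sender.start()
blocks = correlator_node(client)
sender.join()
client.close()
server.close()
print(blocks)  # 8
```

Swapping this transport for RTP would change only the two node functions; the workload distribution of Figure 1 stays the same.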
Figure 1. Distributing the SCARIe application on a grid: one input node per telescope to distribute the workload, a number of correlator nodes for signal processing, and a single output node that merges the results.
Although the first experiments with SCARIe on the DAS3 grid [SC08] worked for the minimal setup (4 telescopes streaming 256 Mbps each, see Figure 1), one of the most important problems of SCARIe on a grid concerns the networking capabilities, especially when moving to higher data rates. Although the grid offers 1 Gbps and 10 Gbps networking, the network does not provide constant throughput to the SCARIe application, because other nodes also consume bandwidth in unpredictable ways. Therefore, the nodes were statically assigned to the experiment by setting up firewall rules.
The future demands of the SCARIe application already envision the use of 32 telescopes, each streaming up to 4 Gbps. Moreover, the more telescopes participate in an experiment, the more flexible the system has to become in order to cope with telescopes joining and leaving during an experiment (the Earth rotates, and only part of the telescopes can observe the target in the sky at any given time). To allow the SCARIe application to run on grids under such future demands, the grid must provide the following characteristics:
· Constant throughput between the nodes involved in the SCARIe application;
· Flexibility in choosing specific network characteristics to be guaranteed, such as low delay or high throughput;
· The ability to add/remove nodes on the fly during the experiment, as part of an optimization process driven by changes in the application's networking and computational requirements.
A grid could provide such networking characteristics if the network resources were integrated into the grid middleware. Network resources could then be claimed dynamically and on demand by any application, just as computational resources are used today.
Network resource control in Grid middleware
We propose to provide control over network resources in distributed computing by (1) enhancing a grid middleware with a network broker and (2) using a traffic manipulation system, called Streamline, installed on every distributed node.
WS-VLAM (http://staff.science.uva.nl/~gvlam/wsvlam) is a grid workflow execution environment that supports the coordinated execution of distributed Grid-enabled components combined into a workflow. Each Grid application is encapsulated in a container, “NAME”, which takes care of state updates to the workflow system and provides an execution environment that allows the manipulation of various aspects of the application. For example, the container can implement a socket-interposing mechanism to insert tokens into the traffic.
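The socket-interposing idea can be sketched as a thin wrapper around an ordinary socket: the application keeps calling send(), while the wrapper transparently injects tokens that the workflow system can spot in the traffic. The token format below is hypothetical; the container's real token encoding is not specified here.

```python
import socket

TOKEN = b"\x00TOK\x00"   # hypothetical in-band marker, not the container's real format

class InterposedSocket:
    """Wrap a socket so every application write is preceded by a control token,
    and tokens are stripped again before data reaches the receiving application."""
    def __init__(self, sock):
        self._sock = sock

    def send(self, payload):
        # Inject the token in front of the application payload.
        self._sock.sendall(TOKEN + payload)

    def recv(self, n):
        data = self._sock.recv(n)
        # Strip tokens so the application sees only its own data.
        return data.replace(TOKEN, b"")

# Demonstrate with a connected socket pair on the local host.
a, b = socket.socketpair()
tx, rx = InterposedSocket(a), InterposedSocket(b)
tx.send(b"sample-block")
out = rx.recv(1024)
print(out)  # b'sample-block'
```

Because the interposition is transparent, the application itself needs no changes to participate in the workflow system's traffic monitoring.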
1 – The user deploys an experiment: the application plus basic infrastructure requirements;
2 – WS-VLAM maps the experiment, using the Actuator, onto the available Grid resources detected by the Profiler;
3 – Control loops may occur, in which WS-VLAM acts as a controller that adjusts the resources so as to satisfy the application's demands regardless of changes in the environment;
4 – The Broker manages the computational resources;
5 – The NetBroker programs the networking infrastructure of the Grid.
Each grid node supports the applications running under WS-VLAM supervision and provides the application-specific network services through application components (ACs), as supported by network elements (NEs).
Streamline (http://netstreamline.org) is a software package that allows traffic manipulation at different levels, from sockets down to IP packets and Ethernet frames. Streamline operates in both kernel and user space. A host runs Streamline as a kernel module that can be controlled via a specific interface, SLCLI, to set the rules needed to manipulate IP packets. A host can also run an “SL monitor” that receives remote commands from an “SL controller”.
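The monitor/controller split can be sketched as a small line-based command channel. This is a stand-in for illustration only: the command names and newline-delimited protocol below are invented, not Streamline's actual SLCLI interface.

```python
import socket
import threading

def sl_monitor(server):
    """Hypothetical SL monitor: accept one controller connection and collect
    newline-delimited rule commands (a real monitor would hand each rule to
    the Streamline kernel module)."""
    rules = []
    conn, _ = server.accept()
    for line in conn.makefile("r"):
        cmd = line.strip()
        if cmd == "QUIT":
            break
        rules.append(cmd)
    conn.close()
    return rules

server = socket.socket()
server.bind(("127.0.0.1", 0))       # ephemeral port
server.listen(1)
port = server.getsockname()[1]

result = []
monitor = threading.Thread(target=lambda: result.extend(sl_monitor(server)))
monitor.start()

# The SL controller side: push two packet-manipulation rules, then disconnect.
ctrl = socket.socket()
ctrl.connect(("127.0.0.1", port))
ctrl.sendall(b"ADD_RULE drop src=10.1.0.5\nADD_RULE dup dst=10.10.0.7\nQUIT\n")
ctrl.close()
monitor.join()
server.close()
print(result)
```

The important property for our purposes is that a remote controller can change a node's packet-handling rules at runtime, which is what lets the middleware reprogram traffic paths during an experiment.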
To test the proposed solution, we designed and implemented a small testbed representing a minimal Grid in which 8 nodes are interconnected through 2 networks: the default network uses a shared 1 Gbps switch, and a second network uses a network processor programmed to route IP packets, also at 1 Gbps.
A first experiment measures the network performance between applications interconnected in pairs (e.g., DAS1-DAS2, DAS3-DAS4, DAS5-DAS6, DAS7-DAS8) in the following scenario:
1 – The WS-VLAM management starts the applications and sets up the paths one by one on the default network (10.1.0.x);
2 – When the measured network performance (throughput) drops below an application threshold, WS-VLAM receives a notice from the application and starts “offloading” paths from the 10.1.0.x network onto the 10.10.0.x network.
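The offloading decision of step 2 can be sketched as follows. The threshold and throughput figures are illustrative, and the policy shown is an assumption; WS-VLAM's actual decision logic is not reproduced here.

```python
THRESHOLD_MBPS = 700.0   # illustrative per-application minimum acceptable throughput

def offload_paths(paths, measured_mbps):
    """Move every under-performing path from the default (10.1.0.x) network
    onto the programmable (10.10.0.x) network."""
    for pair in paths:
        if measured_mbps[pair] < THRESHOLD_MBPS and paths[pair] == "10.1.0.x":
            paths[pair] = "10.10.0.x"
    return paths

paths = {("DAS1", "DAS2"): "10.1.0.x",
         ("DAS3", "DAS4"): "10.1.0.x",
         ("DAS5", "DAS6"): "10.1.0.x"}
measured = {("DAS1", "DAS2"): 930.0,   # alone on the switch: fine
            ("DAS3", "DAS4"): 480.0,   # sharing the 1 Gbps switch: starved
            ("DAS5", "DAS6"): 450.0}   # sharing the 1 Gbps switch: starved

print(offload_paths(paths, measured))
```

Only the starved pairs move, which is the behavior the experiment exercises as more pairs contend for the shared switch.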
Because of the shared 1 Gbps switch, the per-path performance decreases as more paths are established and exchange traffic at maximum rate. The switch offers a single network service: best effort. This shows that a grid middleware controlling the network resources in grids is needed in order to offer specific network services on behalf of the applications.
A second experiment shows how WS-VLAM manages network resources on behalf of grid applications (here, a send/receive TCP application). The application has a throughput/delay threshold; when the measured value drops below the set point, the application sends asynchronous events up to the workflow manager (WS-VLAM) in order to request help in reallocating network resources. The reallocation consists of choosing a different path for the pair connectivity: the 10.1.0.x network (shared gigabit switch) or the 10.10.0.x network (programmable network processor).
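The asynchronous-event path of this second experiment can be sketched with a simple queue between the application and the manager. The event names, set point, and path-flipping rule are illustrative assumptions standing in for the WS-VLAM mechanism.

```python
import queue
import threading

events = queue.Queue()
SET_POINT_MBPS = 800.0   # illustrative throughput set point

def application(pair, measured_mbps):
    """Application side: report a threshold violation upward, asynchronously."""
    if measured_mbps < SET_POINT_MBPS:
        events.put(("below_threshold", pair))

def workflow_manager(current_path):
    """Manager side (standing in for WS-VLAM): react to one event by switching
    the pair between the shared switch and the network processor."""
    kind, pair = events.get()            # blocks until an event arrives
    if kind == "below_threshold":
        current_path[pair] = ("10.10.0.x" if current_path[pair] == "10.1.0.x"
                              else "10.1.0.x")
    return current_path

path = {("send", "recv"): "10.1.0.x"}
app = threading.Thread(target=application, args=(("send", "recv"), 430.0))
app.start()
path = workflow_manager(path)
app.join()
print(path)  # {('send', 'recv'): '10.10.0.x'}
```

The point of the design is that the application only signals distress; the choice of which network to use stays with the workflow manager, which has the global view of the resources.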