📋 WP5 Student Tasks

WP5 Introduction

A SubTimeFrame (STF) object starts its life on a First Level Processor (FLP) node. It contains a collection of detector data for a specific time interval. Processing tasks on the FLP can also create extra data objects that are associated with the STF object currently being handled. Once processing on the FLP is finished, the STF object is transferred to an Event Processing Node (EPN), where the STF objects from all FLP nodes that correspond to the same time interval are aggregated into a single TimeFrame (TF) object. The tasks of WP5 include the handling of STF objects from their creation until they are transmitted and aggregated into TF objects on the EPNs.

SubTimeFrame (STF) and TimeFrame (TF) verification

Keywords:
Distributed algorithms, High performance computing, C++, SQLite, Console tool: C++, GoLang, GUI: Qt

Data collection

For development and verification of the STF→TF data distribution, it is necessary to independently verify that the content of each TF indeed includes all the data from the FLPs. To achieve this, we propose collecting a summary of every STF object (STFSummary) that describes all data associated with the STF. The STFSummary object should contain only the metadata associated with the STF, i.e. copies of the O2 headers described by the O2 Data Model. Optionally, it can also contain checksums of the data blocks. In addition, the summary object will include the HBMap object. The storage format for the STFSummary shall allow fast navigation and support custom queries. Our proposal is to develop a simple SQL schema and use the SQLite database.
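As a starting point for discussion, the proposed storage could be sketched as below. The table name `stf_block`, all column names, and the helper functions are assumptions made for illustration; they are not taken from the O2 Data Model, and the HBMap object is left out because its layout is not described here.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the actual O2 header layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stf_block (
    tf_id        INTEGER NOT NULL,  -- time interval the STF belongs to
    flp_id       TEXT    NOT NULL,  -- FLP node that produced the STF
    data_origin  TEXT    NOT NULL,  -- copied from the block header
    data_desc    TEXT    NOT NULL,  -- copied from the block header
    sub_spec     INTEGER NOT NULL,  -- copied from the block header
    payload_size INTEGER NOT NULL,  -- size of the data block in bytes
    checksum     TEXT               -- optional, NULL when not collected
);
CREATE INDEX IF NOT EXISTS idx_stf_block_tf ON stf_block (tf_id, flp_id);
"""


def open_summary_db(path=":memory:"):
    """Open (or create) a summary database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_block(conn, tf_id, flp_id, origin, desc, sub_spec, size, checksum=None):
    """Record the metadata of one data block; the payload itself is not stored."""
    conn.execute(
        "INSERT INTO stf_block VALUES (?, ?, ?, ?, ?, ?, ?)",
        (tf_id, flp_id, origin, desc, sub_spec, size, checksum),
    )


def blocks_per_flp(conn, tf_id):
    """Example query: block count and total payload size per FLP for one TF."""
    return conn.execute(
        "SELECT flp_id, COUNT(*), SUM(payload_size) FROM stf_block "
        "WHERE tf_id = ? GROUP BY flp_id ORDER BY flp_id",
        (tf_id,),
    ).fetchall()
```

The index on `(tf_id, flp_id)` is one possible way to keep per-TF navigation fast; the right set of indexes would depend on the queries that turn out to be common.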

Automatic verification

All STFSummary objects will then be transported to the same destination EPN, in parallel with the STF objects. There, the STFSummary objects should be merged into a single FLP_TFSummary object.
Once the TF object is created from all STFs, the same summarization process shall be repeated on the complete TF in order to produce the EPN_TFSummary object.
A simple verification process should check whether the FLP_TFSummary and EPN_TFSummary objects are equivalent. The check might include a comparison of data block checksums, if they are collected.
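One way to read "equivalent" is that both summaries describe the same multiset of data blocks, regardless of order. The sketch below follows that reading; the dictionary keys (`flp_id`, `origin`, `description`, `size`, `checksum`) and the function names are hypothetical and only serve the illustration.

```python
from collections import Counter


def block_key(block, use_checksums):
    """Build a hashable identity for one summarized data block."""
    key = (block["flp_id"], block["origin"], block["description"], block["size"])
    if use_checksums:
        key += (block.get("checksum"),)
    return key


def diff_summaries(flp_summary, epn_summary, use_checksums=False):
    """Return (missing, unexpected) as multisets of block keys.

    missing:    blocks reported by the FLPs but absent from the TF
    unexpected: blocks present in the TF but never reported by an FLP
    """
    flp = Counter(block_key(b, use_checksums) for b in flp_summary)
    epn = Counter(block_key(b, use_checksums) for b in epn_summary)
    # Counter subtraction keeps only positive counts, so duplicates are handled.
    return flp - epn, epn - flp


def summaries_equivalent(flp_summary, epn_summary, use_checksums=False):
    """True when both summaries describe exactly the same blocks."""
    missing, unexpected = diff_summaries(flp_summary, epn_summary, use_checksums)
    return not missing and not unexpected
```

Comparing multisets rather than ordered lists means the check does not depend on the order in which STFs arrive at the EPN.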

Manual inspection

For debugging and development purposes, it would be beneficial for the user to be able to check for a mismatch between the FLP_TFSummary and EPN_TFSummary objects. For this, a user tool for inspecting the summary objects should be developed.
Requirements:
  • Reporting. The tool should report differences between the two summary objects.
  • Custom queries. The tool should accept custom user queries (SQL queries are acceptable, given the predefined schema), e.g. get the count of objects of a specific type, calculate the total size of all objects of the same type, etc.
  • Configuration file. The ability to read different verification criteria from a configuration file (in the form of SQL queries) is highly desirable.
  • GUI. Optional…
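The custom-query and configuration-file requirements could be combined roughly as follows. This is only a sketch: the JSON layout (`name`, `query`, `expect`), the simplified `block` table, and the function names are assumptions, and each query is expected to return a single value.

```python
import json
import sqlite3

# Hypothetical configuration format: one entry per verification criterion.
CONFIG_TEXT = """
[
  {"name": "rawdata_blocks",
   "query": "SELECT COUNT(*) FROM block WHERE description = 'RAWDATA'",
   "expect": 2},
  {"name": "total_size",
   "query": "SELECT SUM(size) FROM block",
   "expect": 700}
]
"""


def demo_db():
    """Build a tiny in-memory summary database (simplified, assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE block (origin TEXT, description TEXT, size INTEGER)")
    conn.executemany(
        "INSERT INTO block VALUES (?, ?, ?)",
        [("TPC", "RAWDATA", 100), ("TPC", "RAWDATA", 200), ("ITS", "CLUSTERS", 400)],
    )
    return conn


def run_criteria(conn, config_text):
    """Run every configured query and report whether it matched its expectation."""
    results = {}
    for criterion in json.loads(config_text):
        # Each query must return exactly one row with exactly one column.
        (value,) = conn.execute(criterion["query"]).fetchone()
        results[criterion["name"]] = value == criterion["expect"]
    return results
```

Because the queries come from a user-supplied file, a real tool would probably open the database read-only so that a mistyped query cannot modify the summaries.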

Load balancing and traffic shaping

Keywords:
Simulation, Modeling, Distributed consensus protocols, Distributed resource allocation and planning

Load balancing of EPN utilization

It is envisioned that a new TimeFrame is created every ~22 ms, i.e. at a rate of about 44 Hz. Each new TF must be assigned to an EPN with the goal of spreading the processing evenly over the whole EPN cluster (on the order of 1500 nodes). Due to variability in TF sizes, processing performance, and network transfer throughput, a simple round-robin selection of an EPN will not balance the load. For this purpose, a global map of EPN utilization will have to be built and updated in real time. Furthermore, all FLPs (about 250 nodes) will need to agree on a single EPN for each TF, i.e. at an average rate of 44 Hz.
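With about 1500 EPNs and 44 TFs per second, each EPN receives a new TF only about every 1500 / 44 ≈ 34 s on average, so a single unusually expensive TF can leave a node behind for a long time. The toy simulation below compares round-robin with a selection driven by a utilization map. It is a deliberately simplified model: the cost values are made up, finished work is never removed from the map, and the map is assumed to be perfectly up to date.

```python
import heapq
import random


def round_robin(costs, n_epn):
    """Assign TF i to EPN i mod n_epn, ignoring how busy each EPN is."""
    load = [0.0] * n_epn
    for i, cost in enumerate(costs):
        load[i % n_epn] += cost
    return load


def least_loaded(costs, n_epn):
    """Assign each TF to the EPN with the smallest accumulated cost so far."""
    heap = [(0.0, epn) for epn in range(n_epn)]  # stands in for the utilization map
    heapq.heapify(heap)
    for cost in costs:
        load, epn = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, epn))
    result = [0.0] * n_epn
    for load, epn in heap:
        result[epn] = load
    return result


def spread(load):
    """Difference between the busiest and the least busy EPN."""
    return max(load) - min(load)


if __name__ == "__main__":
    rng = random.Random(5)
    # Hypothetical per-TF processing costs with a factor-of-four variability.
    costs = [rng.uniform(0.5, 2.0) for _ in range(4400)]
    print("round-robin spread :", spread(round_robin(costs, 150)))
    print("least-loaded spread:", spread(least_loaded(costs, 150)))
```

In this model the spread of the least-loaded policy can never exceed the cost of the most expensive single TF, whereas round-robin has no such bound. A realistic study would also need to model stale utilization data and the time it takes the FLPs to agree.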

Load balancing of Interconnect (traffic shaping)

An additional criterion for EPN selection is imposed by the topology of the FLP→EPN network. The EPN selection algorithm should keep the network traffic equally distributed over all network devices (L2 or InfiniBand switches) and interconnection links. Additionally, the process should take different failure domains into account (an EPN server, a rack of EPNs, or one of the leaf or higher-level switches).
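As a rough illustration of how topology could enter the selection, the sketch below scores each candidate EPN by its own load plus the load of the switch it is attached to, and skips every EPN inside a failed domain. The field names (`id`, `rack`, `switch`, `load`), the `link_weight` parameter, and the linear scoring rule are all assumptions; only a single switch level is modeled.

```python
def select_epn(epns, link_load, failed_domains=frozenset(), link_weight=1.0):
    """Pick the EPN with the lowest combined node and link load.

    epns:           list of dicts with keys "id", "rack", "switch", "load"
    link_load:      mapping switch name -> current utilization of its links
    failed_domains: names of failed EPNs, racks, or switches
    Returns the chosen EPN id, or None when no healthy EPN is left.
    """
    candidates = [
        e for e in epns
        if e["id"] not in failed_domains
        and e["rack"] not in failed_domains
        and e["switch"] not in failed_domains
    ]
    if not candidates:
        return None

    def score(epn):
        # The id is used as a tie-breaker so that every caller picks the same node.
        return (epn["load"] + link_weight * link_load[epn["switch"]], epn["id"])

    return min(candidates, key=score)["id"]
```

Setting `link_weight` to zero reduces this to pure load balancing, which makes it easy to compare the two behaviours in a simulation.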

Tasks

Investigate/propose/model/simulate:
  • Global EPN utilization map. Propose methods for maintaining the global EPN utilization map. The algorithm should take into account the relatively large number of EPN nodes. The proposed solution should be resistant to split-brain and other network partition conditions.
  • EPN selection. Evaluate the feasibility and robustness of different EPN selection methods. All FLP nodes must agree on the selected EPN. Investigate the usability of consistent hashing techniques, such as Rendezvous hashing.
  • Network traffic shaping. Propose and evaluate methods for incorporating network traffic shaping into the EPN selection process. This task must consider arbitrary network topologies.
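To make the Rendezvous hashing suggestion concrete: every FLP computes a score for each (TF, EPN) pair with the same hash function and picks the EPN with the highest score, so all FLPs reach the same choice without exchanging messages, provided they share the same list of live EPNs. The sketch below shows the plain, unweighted variant; the use of SHA-256, the key format, and the EPN names are assumptions, and the utilization map is not taken into account.

```python
import hashlib


def score(tf_id, epn):
    """Deterministic pseudo-random score for one (TF, EPN) pair."""
    digest = hashlib.sha256(f"{tf_id}:{epn}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def select_epn(tf_id, epns):
    """Rendezvous (highest random weight) hashing: the highest score wins."""
    return max(epns, key=lambda epn: score(tf_id, epn))


if __name__ == "__main__":
    epns = [f"epn{i:04d}" for i in range(20)]
    before = {tf: select_epn(tf, epns) for tf in range(1000)}

    # Simulate the failure of one EPN: only the TFs it owned are reassigned.
    survivors = [e for e in epns if e != "epn0007"]
    after = {tf: select_epn(tf, survivors) for tf in range(1000)}
    moved = [tf for tf in before if before[tf] != after[tf]]
    print(len(moved), "of 1000 TFs were reassigned")
```

Removing an EPN only remaps the TFs that were assigned to it, which is the property that makes the technique attractive here. Agreement still depends on all FLPs holding the same membership list, so a split-brain condition would have to be handled separately.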

