This project investigated the optimization of data processing capabilities for a repository containing over 2.5 million radiation portal monitor (RPM) daily files, collectively occupying over 6 terabytes of storage. Because of the large volume of files, processing times were prohibitive, prompting the exploration of methods to significantly reduce processing duration while expanding the capacity to handle a substantially larger portion of the files concurrently. We present a data processing workflow that integrates multiple modern techniques into a cohesive system and harnesses cluster computing methodologies to enhance processing efficiency. The workflow begins by ingesting raw RPM daily files and parsing them into a format optimized for streamlined analysis in Python. Leveraging cluster computing capabilities, the process loads and combines batches of files onto individual cores, where algorithms can be applied. The algorithms output their results as new files for each batch of data; these result files provide a foundation for subsequent analysis. The final piece of this workflow for our use case is an interactive visualization tool. Using the algorithmic outputs and the optimized access to parsed files, this tool enables rapid and comprehensive data analysis and also serves as a feedback mechanism for algorithm development. By combining existing data processing tools and methodologies within a structured framework, this system offers an efficient solution for handling the large volume of RPM daily files. The integration of an optimized file format, cluster computing, and interactive visualizations streamlines the workflow, facilitating expedited analysis and insights. This workflow has improved reproducibility and reduced run times from days or weeks down to hours or minutes.
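The batch-oriented pattern the abstract describes (load a batch of files per worker, apply an algorithm, write one result file per batch) can be sketched as follows. This is a minimal illustration, not the project's actual implementation: `parse_daily_file` and the JSON result format are hypothetical stand-ins for the real RPM parser and optimized file format, and a local worker pool stands in for the cluster scheduler that would distribute batches across cores in practice.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import json

def parse_daily_file(path):
    # Hypothetical parser: a real RPM daily file needs a format-specific
    # parser; here we just count lines as a placeholder "record count".
    with open(path) as f:
        return {"file": Path(path).name, "records": sum(1 for _ in f)}

def process_batch(batch_id, paths, out_dir):
    # Load and combine one batch of files, apply the algorithm, and write
    # the results as a new file for that batch (the basis for later analysis).
    results = [parse_daily_file(p) for p in paths]
    out = Path(out_dir) / f"batch_{batch_id}.json"
    out.write_text(json.dumps(results))
    return str(out)

def run_workflow(paths, out_dir, batch_size=100, workers=4):
    # Split the file list into batches and process them concurrently.
    # On a cluster, each batch would instead be dispatched to its own core
    # via a process pool or a job scheduler; threads keep this sketch simple.
    batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_batch, i, b, out_dir)
                   for i, b in enumerate(batches)]
        return [f.result() for f in futures]
```

Because each batch writes an independent result file, reruns are reproducible and downstream tools (such as the interactive visualization) can read results without reparsing the raw files.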
Year: 2024