Tools and environments for data processing and analysis
Several funding initiatives support limited data processing by the NF-OSI to transform data into new forms of value, but the vast majority of value comes from researchers in the community reanalyzing or otherwise reusing the data. To help with such efforts, this guide summarizes known tools and environments for creating new data and knowledge from data shared on the portal. Each option represents a different path with its own considerations and constraints. One especially helpful consideration is adoption and support: whether an option has official integration with Synapse, is known to be used successfully in practice by the community, or has been evaluated by our staff as a potentially useful resource for the community and a candidate for more official support.
This documentation will be updated to reflect new developments and experiences. If you would like to share your experience, suggest community additions and improvements, or have questions, feel free to reach out to nf-osi@sagebionetworks.org.
Comparison Table
| Tool / Environment | Category | Adoption & Support | Data Access Method | Standout Features | Best Use Cases | Limitations / Considerations | Resources |
|---|---|---|---|---|---|---|---|
| | Visualization Platform | External integration | Web UI; linked from certain portal Datasets | Interactive genomics visualization, cohort exploration, mutation analysis | Quick exploratory analysis, clinical summaries, mutation screening | Can't customize analyses; only certain datasets available (from pre-processed data) | |
| | Managed Platform | Synapse integration since 2024 | DRS connection + UI | Collaborative workspaces, ergonomic pre-built R and Python environments | Collaborative genomics projects, custom analyses | Cloud costs; learning curve | |
| | Managed Platform | Synapse integration (coming end of 2025 to early 2026) | DRS connection + UI | Off-the-shelf standard pipelines (including nf-core), easy collaboration, guided analysis | Standard genomics analyses, pre-built workflows | Limited pipeline flexibility; cloud costs; newer platform | In development |
| | Managed Platform | Synapse integration (coming 2026) | DRS connection + UI | Collaborative workspaces, thousands of pipelines | Collaborative projects, custom analyses, projects whose grant or award has facilitated billing with Terra | Cloud costs; learning curve | |
| Institutional HPC | Self-Managed Compute | Community validated | Command line/API client download | High computational power, flexible tool installation, familiar environment | Large-scale workflows, custom pipeline development | Institution-specific access; harder to collaborate and reproduce externally | |
| Private Cloud (AWS/GCP/Azure) | Self-Managed Compute | Community validated | Cloud storage transfer | Scalable compute, collaborative sharing, flexible analysis environments | Team collaborations, scalable analyses, custom environments | Cloud costs; requires cloud expertise | |
| Local Machine | Self-Managed Compute | Community validated | Command line/API client/web UI download | Complete control, offline analysis, familiar environment | Small datasets, method development, proof-of-concept work | Hardware limitations; slow for large datasets; hardest to share work | |
| | Managed Platform | Staff evaluated | Command line/API client | Reproducible research environment, standard pipeline support (nf-core) | Reproducible analyses, standardized methods, includes publication workflows | Pay-per-compute model | |
| Biomni (open-source version) | Agentic AI Platform, Self-Managed Compute | Staff evaluated | Command line/API client | TBD | TBD | API costs; better suited to analysis than to low-level raw data processing | In development |
| | Agentic AI Platform, Managed Platform | Staff evaluated | Command line/API client | TBD | TBD | API costs; better suited to analysis than to low-level raw data processing | In development |
| Jataware Biome | Agentic AI Platform, Managed Platform | Staff evaluated | Command line/API client | TBD | TBD | API costs; better suited to analysis than to low-level raw data processing | In development |
Helpful Notes
DRS (Data Repository Service) Connections
DRS is a standardized API that enables secure, direct data access between repositories and analysis platforms. When a platform has a "DRS connection," you no longer need to manually download and upload data: the analysis platform can access files directly from Synapse, subject to your permissions.
You authorize the connection once, then data appears automatically in your analysis workspace without manual file transfers. The advantages include:

- Faster workflow setup (no waiting for downloads)
- Reduced storage costs (data stays in Synapse until needed)
- Better data provenance tracking
- Automatic permission enforcement
Example workflow with DRS: Select your Synapse dataset → authorize platform access → data appears ready for analysis
Example workflow without DRS: Download data from Synapse → upload to analysis platform → begin analysis
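
To make the DRS step more concrete, here is a minimal Python sketch of what a platform does behind the scenes when it resolves a DRS reference. It relies on the GA4GH DRS convention that a hostname-based URI `drs://<host>/<id>` maps to `https://<host>/ga4gh/drs/v1/objects/<id>`, and that the returned object lists one or more `access_methods`. The host `drs.example.org`, the object ID, and the sample object are hypothetical placeholders, not actual Synapse values, and the sketch makes no network calls.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map a hostname-based DRS URI to the HTTPS URL of its object record."""
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


def pick_access_method(drs_object: dict, preferred=("https", "s3", "gs")) -> dict:
    """Return the first access method whose type matches the preference order."""
    methods = drs_object.get("access_methods", [])
    for wanted in preferred:
        for method in methods:
            if method.get("type") == wanted:
                return method
    raise LookupError("no supported access method in this DRS object")


# Hypothetical object record, shaped like a DRS response (illustration only).
EXAMPLE_OBJECT = {
    "id": "syn123.1",
    "access_methods": [
        {"type": "s3", "access_id": "a1"},
        {"type": "https", "access_id": "a2"},
    ],
}

if __name__ == "__main__":
    print(drs_uri_to_url("drs://drs.example.org/syn123.1"))
    print(pick_access_method(EXAMPLE_OBJECT)["access_id"])
```

In practice the platform performs these lookups for you using the authorization you granted, which is why no manual transfer step appears in the DRS workflow.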
Platforms like Cavatica use DRS to provide seamless integration, while self-managed compute environments require manual data transfer using command-line tools or APIs.
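
For the self-managed route, a manual download with the Synapse Python client might look like the sketch below. It assumes the third-party `synapseclient` package is installed and that a personal access token is stored in an environment variable named `SYNAPSE_AUTH_TOKEN`; the helper name `download_from_synapse` and the ID `syn123` are hypothetical. The ID check here accepts only unversioned IDs of the form `syn` followed by digits, which is a simplification for illustration.

```python
import os
import re

# Simplified pattern: "syn" followed by digits (no version suffix).
SYNAPSE_ID = re.compile(r"syn\d+")


def download_from_synapse(synapse_id: str, dest_dir: str = ".") -> str:
    """Download one Synapse file entity into dest_dir and return its local path."""
    if not SYNAPSE_ID.fullmatch(synapse_id):
        raise ValueError(f"unexpected Synapse ID: {synapse_id!r}")

    # Imported lazily so the ID check above works without the package installed.
    import synapseclient  # third-party: pip install synapseclient

    syn = synapseclient.Synapse()
    # Assumption: the token was exported beforehand as SYNAPSE_AUTH_TOKEN.
    syn.login(authToken=os.environ["SYNAPSE_AUTH_TOKEN"])
    entity = syn.get(synapse_id, downloadLocation=dest_dir)
    return entity.path


if __name__ == "__main__":
    # Replace the placeholder ID with a file you have permission to download.
    print(download_from_synapse("syn123", dest_dir="data"))
```

After the download, the file still has to be moved to wherever the analysis runs (an HPC scratch directory, a cloud bucket, and so on), which is the extra step that a DRS connection removes.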