`ProteinStructureStore` speed

`ProteinStructureStore` uses HDF5 for lazy IO of structures, but HDF5.jl is a wrapper around a pre-built binary that doesn't support parallelization, which effectively bottlenecks IO speed. A rough test showed that for a dataset with ~20 properties and ~300 KB per structure, reading 100 structures takes ~1 second, i.e. about 10 ms per structure.

At the moment, this format and structure are viable for repositories of protein structures with chain- and residue-wise information that is otherwise expensive to gather. They might not be optimal for direct use in workflows that require high-throughput protein data look-ups.

Programs that require fast IO might have to serialize into some faster intermediate format.
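
As a minimal sketch of that idea, the snippet below caches already-loaded structures with Julia's `Serialization` standard library. The `structures` vector is a hypothetical stand-in (named tuples with made-up `name` and `coords` fields), not the actual `ProteinStructureStore` API; a real workflow would read the structures from the store once and then reuse the cache.

```julia
using Serialization

# Hypothetical stand-in for structures read from a ProteinStructureStore.
# The field names `name` and `coords` are assumptions for illustration only.
structures = [(name = "protein_$i", coords = rand(Float32, 3, 100)) for i in 1:100]

# One-time cost: write all structures into a single serialized cache file.
cache_path = joinpath(mktempdir(), "structures.jls")
serialize(cache_path, structures)

# Subsequent runs read the cache directly, without going through HDF5.
cached = deserialize(cache_path)
```

Note that the `Serialization` format is not guaranteed to be readable across Julia versions, so such a file is best treated as a disposable cache rather than as a long-term storage format.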
