-
Notifications
You must be signed in to change notification settings - Fork 769
Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON)
#4712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-10-25 11:17:29 UTC |
Linter Bot Results:Hi @marinegor! Thanks for making this PR. We linted your code and found the following: Some issues were found with the formatting of your code.
Please have a look at the Please note: The |
richardjgowers
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far, will require a small test file to check reader/parser halves.
| np.array, | ||
| list( | ||
| zip( | ||
| *[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm struggling to follow the logic here, a comment breaking down what this double nested loop iteration into a zip is doing would be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a little comment explaining that
package/pyproject.toml
Outdated
| "pytng>=0.2.3", | ||
| "gsd>3.0.0", | ||
| "rdkit>=2020.03.1", | ||
| "gemmi", # for mmcif format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will probably be optional, so other imports will have to respect that too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what to do with that -- it's already in the [project.optional-dependencies] table, is there an example of making it more optional?
|
thanks to amazing contribution by @PardhavMaradani it seems that the current version is working now! Seems that some extra attributes and/or my own implementation of I've re-requested reviews from @orbeckst @yuxuanzhuang @ljwoods2. One open question that I have: technically, this topology/coordinates parser can also parse PDB without any changes to it, like that: import MDAnalysis as mda
mda.Universe('testsuite/MDAnalysisTests/data/mmcif/1BD2.pdb.gz', format='mmcif')
# <Universe with 6378 atoms>do we want to add this option to the existing PDBParser (e.g. introduce there a |
|
I'd open a separate issue for the PDB discussion – it will only delay this one. If it can be treated separately then do it separately. (Orthogonality is great!) I don't think I have time to review so please don't wait for me. |
| ------- | ||
| MDAnalysis Topology object | ||
| """ | ||
| structure = gemmi.read_structure(self.filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumping so this doesn't get lost, still think this should accept a path obj (unless I'm missing something)
|
oh, sorry @ljwoods2, I forgot to reply earlier!
what do you actually mean by that? like, should we explicitly convert a string to path, that's it? or are there more checks to be done?
|
|
@marinegor pushed directly here and then reverted, oops. Let me know what you think of the PR, this will allow passing streams, pathlib.Paths, etc, as the existing PDBParser can already handle (also via openany) |
|
not sure who to ping for review -- I feel like @yuxuanzhuang @richardjgowers and @ljwoods2 you've been participating in the process, so you might be able to do it after some months as well? :) |
|
Just to get back to that: @orbeckst @yuxuanzhuang @ljwoods2 could you please review that again? Or alternatively, I could add I'd be in favor or merging it as is, especially since it also would help with #4943 I guess. After it's merged, we can get to discussion in #5141 and start thinking about perhaps making this reader default, since benchmarks show it's also faster -- e.g. 140 vs 190 ms for Universe). |
|
I had another look over this and was looking through some of the I did some some small tests with the Changing just the coordinate reader testing on
|
ljwoods2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, although I'm not opposed to switching to FlatStructure if others are interested
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a quick look and I think from the technical side this should be able to get go in soon. I have mostly comments on docs and formatting.
Please also
- add to Table of Supported File Formats
- add to Table of Supported Topology Formats
- open issue on the User Guide to add the formats (format reference and IIRC there's some other stuff that needs to be done)
| default: 'cython' | ||
| filelock: | ||
| default: 'filelock' | ||
| fasteners: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this been fixed properly? Maybe rebase against current develop, given how old the PR is.
| ========================================================================== | ||
| MDAnalysis reads coordinates from MMCIF (macromolecular Crystallographic Information File) files, also known as PDBx/mmCIF format, | ||
| using the ``gemmi`` library as a backend. MMCIF is a more modern and flexible alternative to the PDB format, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
link to gemmi
| Capabilities | ||
| ------------ | ||
| The MMCIF reader implementation uses the gemmi library to parse files and extract coordinates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The MMCIF reader implementation uses the gemmi library to parse files and extract coordinates | |
| The MMCIF reader implementation uses the :mod:`gemmi` library to parse files and extract coordinates |
Possibly add gemmi docs to intersphinx links?
| Fixes | ||
|
|
||
| Enhancements | ||
| * Add reading from mmCIF, also known as PDBx format, using crystallographic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for single-frame files
| Basic structure loading:: | ||
| .. code-block:: python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| residue_segindex=segidx, | ||
| ) | ||
|
|
||
| def _get_structure(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated across topology and coordinates. Could we have it in only one place and import it from the other? I think the PDBParser is already doing something like that and follow precedent.
| You can also use :mod:`~MDAnalysis.topology.MMCIFParser` to parse PDB files | ||
| that you're having troubles parsing with standard PDB parser. ``MMCIFParser`` | ||
| uses ``gemmi`` library (https://github.com/project-gemmi/gemmi) that is developed | ||
| together with RCSB, and might work better for your particular situation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need an explicit example, eg (?) Universe(PDB, topology_format=MDAnalysis.topology.MMCIFParser.MMCIFParser, format=MDAnalysis.coordinates.MMCIF.MMCIFReader) or something like that.
| from . import MinimalParser | ||
| from . import ITPParser | ||
| from . import FHIAIMSParser | ||
| from . import MMCIFParser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add newline at end? whatever black says...
|
|
||
|
|
||
| class MMCIFParser(TopologyReaderBase): | ||
| """Parser that obtains a list of atoms from a standard MMCIF/PDBx file using ``gemmi`` library (https://github.com/project-gemmi/gemmi). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put links under names instead of bare links.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are there changes in TPR?
Fixes #2367 and also extends #4303 and solves #5089
Changes made in this Pull Request:
gemmilibrary (link) to parse mmcif filesclass MMCIFReader(base.SingleFrameReaderBase)andclass MMCIFParser(TopologyReaderBase)classes for thatAs a bonus, this implementation would potentially allow to read any of the gemmi-supported formats (source):
Also, this (with slight modifications) also would allow reading mmcif with multiple models sharing the same topology, as well as more feature-rich parsing of PDBs (the same code without changes can be used for parsing altlocs, charges, etc, from all of these formats).
However, I'm slightly lost on what's to be done next for this PR to be merged, so I'm asking if someone could help me navigate here (tagging @richardjgowers here as author of original PDBx implementation 4303).
PR Checklist
Developers certificate of origin
📚 Documentation preview 📚: https://mdanalysis--4712.org.readthedocs.build/en/4712/