Re: re improved carbohydrate data

Greg Couch
Sat, Dec 5, 2020 10:12 AM

I wasn't subscribed to ccp4bb, so my posts didn't make it there. I am
now, so in the future they will show up.

If 5A5A started out as a PDB file, then I would expect the mmCIF
version's auth_asym_id to match the ATOM record chain ID.  If it
started out as an mmCIF file, then it depends on the conversion
process.  I'm still learning about the glycan representation, so I don't
have any expectations of the conversion process yet.

The null label_seq_id is a big problem because it makes it impossible
to reliably tell which residue an atom is in without looking at
non-mandatory columns in the atom_site table.  That is already the case
with solvent residues.  ChimeraX uses the non-mandatory auth_seq_id
column to help disambiguate non-polymer residues.  Other groups have
talked about adding another column to the atom_site table.  Whatever
technique is used, all producers and consumers of mmCIF files should use
the same one.  Otherwise, it is easy to accidentally put an atom in the
wrong residue or accidentally combine residues.
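
To make the failure mode concrete, here is a minimal sketch in Python.
The rows are made-up values, not taken from 5A5A, and group_residues is
a hypothetical helper, not ChimeraX's actual code.  It groups atoms
into residues first by the label_* columns alone and then with
auth_seq_id added to the key.

```python
from collections import defaultdict

# Hypothetical atom_site rows (illustrative values only):
# (label_asym_id, label_comp_id, label_seq_id, auth_seq_id, label_atom_id)
ROWS = [
    ("A", "ALA", "1", "10", "CA"),
    ("A", "GLY", "2", "11", "CA"),
    ("B", "NAG", ".", "1", "C1"),
    ("B", "NAG", ".", "1", "O5"),
    ("B", "NAG", ".", "2", "C1"),
    ("B", "NAG", ".", "2", "O5"),
]


def group_residues(rows, use_auth_seq_id):
    """Group atom names by a residue key built from the chosen columns."""
    residues = defaultdict(list)
    for asym, comp, label_seq, auth_seq, atom in rows:
        key = (asym, comp, label_seq)
        if use_auth_seq_id:
            # Fall back on the non-mandatory column to tell residues apart.
            key += (auth_seq,)
        residues[key].append(atom)
    return dict(residues)


label_only = group_residues(ROWS, use_auth_seq_id=False)
with_auth = group_residues(ROWS, use_auth_seq_id=True)

# With label_* columns only, both NAG residues share the key ("B", "NAG", ".").
print(len(label_only), "residues using label_* columns only")
print(len(with_auth), "residues once auth_seq_id is included")
```

With only the label_* columns, the two NAG residues in chain B collapse
into a single residue that has two C1 and two O5 atoms, which is the
"accidentally combine residues" case.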

It might be possible for the PDB to "fix" label_seq_id.  It is defined
as a reference into the entity_poly_seq table.  Solvent and glycan
residues aren't polymers, so they aren't in that table.  If the
definition of label_seq_id were altered to be dependent on the entity
type (might not be possible with the current CIF dictionary technology,
haven't dug that deep yet), then it could be non-null for non-polymer
residues.  And then the label_* columns could form a proper unique
database key for each row of the atom_site table.  That would be awesome
if it ever happens.
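
As a rough illustration of what "proper unique database key" would
mean, the sketch below checks whether a set of label_* columns
identifies every row uniquely.  The rows, the choice of key columns,
and the duplicate_keys helper are all assumptions made for the example;
the non-null label_seq_id values in the second data set are
hypothetical, since the current dictionary does not assign them to
non-polymer residues.

```python
from collections import Counter

# Assumed candidate key; a multi-model file would need a model column too.
LABEL_KEY = (
    "label_asym_id",
    "label_comp_id",
    "label_seq_id",
    "label_atom_id",
    "label_alt_id",
)


def make_row(seq_id, atom_id):
    """Build one hypothetical atom_site row for a NAG residue in chain B."""
    return {
        "label_asym_id": "B",
        "label_comp_id": "NAG",
        "label_seq_id": seq_id,
        "label_atom_id": atom_id,
        "label_alt_id": ".",
    }


def duplicate_keys(rows, key_columns):
    """Return every key tuple that occurs in more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Two NAG residues as they appear today: label_seq_id is null (".").
null_rows = [make_row(".", "C1"), make_row(".", "C1")]

# The same residues if label_seq_id were non-null (hypothetical values).
fixed_rows = [make_row("1", "C1"), make_row("2", "C1")]

print(duplicate_keys(null_rows, LABEL_KEY))   # the C1 key collides
print(duplicate_keys(fixed_rows, LABEL_KEY))  # no collisions
```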

    -- Greg

On 12/4/2020 12:32 PM, Marcin Wojdyr wrote:

Hi Greg,

the discussion was cross-posted between ccp4bb and pdb-l, and it seems
I'm not subscribed to the latter. I read your email in the web archive
of pdb-l.
I agree with what you wrote, but looking into an example file:
https://files.rcsb.org/view/5A5A.cif
and comparing it with data from
ftp://snapshots.rcsb.org/20200101/pub/pdb/data/structures/divided/
I see that it's auth_asym_id that was changed.
Additionally, in the new chains label_seq_id is null (.) even if they
have more than one residue.

Marcin
