Fwd: [ccp4bb] Coming July 29: Improved Carbohydrate Data at the PDB -- N-glycans are now separate chains if more than one residue

Jasmine Young
Tue, Dec 8, 2020 8:55 PM

Dear PDB Data Users:

Thank you for providing feedback on the results of an archival-level
carbohydrate remediation project that led to the re-release of over
14,000 PDB structures in July 2020. This update includes diverse
oligosaccharides: glycosylation; metabolites such as maltose, sucrose,
cellulose fragments; glycosaminoglycans, such as fragments of heparin
and heparan sulfate; epitope patterns such as A/B blood group antigens
and the H-type or Lewis-type stems; and many artificial carbohydrates
mimicking or countering natural products
(https://www.wwpdb.org/documentation/carbohydrate-remediation).

Starting in 2017, this PDB remediation aimed to standardize the
biochemical nomenclature of the carbohydrate components following the
IUPAC-IUBMB recommendations established by the carbohydrate community
(https://media.iupac.org/publications/pac/1996/pdf/6810x1919.pdf), and
to provide uniform representation of oligosaccharides to improve the
identification and searchability of oligosaccharides modeled in the PDB
structures.  During the remediation planning, wwPDB consulted community
users and the PDBx/mmCIF Working Group and made data files available on
GitHub in early 2020 for community feedback. wwPDB has collaborated with
Robert Woods at the University of Georgia in the US, researchers at The
Noguchi Institute and Soka University in Japan, and Thomas Lutteke in
Germany to generate uniform linear descriptors for the oligosaccharide
sequences.

To achieve these community goals, each oligosaccharide is represented as
a branched entity with complete biochemical description and each
glycosidic linkage specified. The full representation of carbohydrates
is provided in the mmCIF format file, but this is not possible in legacy
PDB format files, as the format has been frozen since 2012
(https://www.wwpdb.org/documentation/file-formats-and-the-pdb).

Proper indexing is necessary for branched entity representation and for
generation of linear descriptors; hence the numbering starts at the
reducing end (#1), where the glycosylation occurs, and ascends toward
the non-reducing end. Unique chain IDs are assigned to branched entities
(oligosaccharides) to avoid overlap between their residue numbering and
that of protein residues, and to enable consistent numbering for every
oligosaccharide. For example, in PDB ID 6WPS there are 5
oligosaccharides associated with the same protein chain A; consistent
ordering and numbering can only be retained by giving each
oligosaccharide a unique chain ID in both PDBx/mmCIF and PDB format
files.

For archival consistency, a single-monosaccharide is defined as a
non-polymer and treated consistently with other non-polymer ligands in
the PDB. A single-monosaccharide occurring at a glycosylation site has a
unique chain ID in the PDBx/mmCIF file (_atom_site.label_asym_id) but
not in the PDB format file.

Using PDB ID 6WPS as an example, the PDBx/mmCIF data item
_atom_site.label_asym_id, which corresponds to column #7 in the
atom_site coordinates section, has the asym ID ‘Y’ for the 1st instance
of a single-monosaccharide, the NAG bound to ASN 61 of protein chain ‘A’.
The ‘Y’ value is unique to this monosaccharide. The additional chain ID
(_atom_site.auth_asym_id) in the PDBx/mmCIF file, which maps to the PDB
format file, is chain ‘A’ for this NAG, consistent with how any other
non-polymer ligand associated with protein chain A is represented.

loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
...
HETATM 27655 C C1  . NAG Y  6 .    ? 191.103 162.375 206.665 1.00 47.28  ? 1301 NAG A C1  1
HETATM 27656 C C2  . NAG Y  6 .    ? 191.067 161.665 208.065 1.00 47.22  ? 1301 NAG A C2  1
HETATM 27657 C C3  . NAG Y  6 .    ? 190.138 160.434 207.960 1.00 47.42  ? 1301 NAG A C3  1
HETATM 27658 C C4  . NAG Y  6 .    ? 188.730 160.906 207.541 1.00 48.73  ? 1301 NAG A C4  1
HETATM 27659 C C5  . NAG Y  6 .    ? 188.838 161.622 206.176 1.00 48.66  ? 1301 NAG A C5  1
HETATM 27660 C C6  . NAG Y  6 .    ? 187.494 162.153 205.709 1.00 48.17  ? 1301 NAG A C6  1
HETATM 27661 C C7  . NAG Y  6 .    ? 193.233 161.885 209.217 1.00 47.40  ? 1301 NAG A C7  1
HETATM 27662 C C8  . NAG Y  6 .    ? 194.594 161.311 209.471 1.00 47.45  ? 1301 NAG A C8  1
HETATM 27663 N N2  . NAG Y  6 .    ? 192.418 161.218 208.414 1.00 47.36  ? 1301 NAG A N2  1
HETATM 27664 O O3  . NAG Y  6 .    ? 190.069 159.774 209.231 1.00 47.22  ? 1301 NAG A O3  1
HETATM 27665 O O4  . NAG Y  6 .    ? 187.867 159.778 207.435 1.00 48.89  ? 1301 NAG A O4  1
HETATM 27666 O O5  . NAG Y  6 .    ? 189.760 162.757 206.285 1.00 47.83  ? 1301 NAG A O5  1
HETATM 27667 O O6  . NAG Y  6 .    ? 186.953 163.102 206.622 1.00 49.06  ? 1301 NAG A O6  1
HETATM 27668 O O7  . NAG Y  6 .    ? 192.879 162.950 209.739 1.00 47.58  ? 1301 NAG A O7  1
...
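
As a rough illustration of the two chain identifiers in the atom_site rows above, the following Python sketch pairs each label_asym_id with the author-style (auth_asym_id, auth_comp_id, auth_seq_id) it maps to. It is only a sketch under simplifying assumptions: the function name label_to_auth is made up for this example, the column order is copied from the loop header above, and rows are split on whitespace, which would not be safe for mmCIF values containing quoted spaces. A real mmCIF parser would be the better choice for whole files.

```python
from collections import defaultdict

# Column order copied from the _atom_site loop header shown above.
ATOM_SITE_TAGS = [
    "group_PDB", "id", "type_symbol", "label_atom_id", "label_alt_id",
    "label_comp_id", "label_asym_id", "label_entity_id", "label_seq_id",
    "pdbx_PDB_ins_code", "Cartn_x", "Cartn_y", "Cartn_z", "occupancy",
    "B_iso_or_equiv", "pdbx_formal_charge", "auth_seq_id", "auth_comp_id",
    "auth_asym_id", "auth_atom_id", "pdbx_PDB_model_num",
]

# Two of the 6WPS rows quoted above.
ROWS = """\
HETATM 27655 C C1  . NAG Y  6 .    ? 191.103 162.375 206.665 1.00 47.28  ? 1301 NAG A C1  1
HETATM 27656 C C2  . NAG Y  6 .    ? 191.067 161.665 208.065 1.00 47.22  ? 1301 NAG A C2  1
"""


def label_to_auth(rows_text, tags=ATOM_SITE_TAGS):
    """Map each label_asym_id to the set of (auth chain, residue, number)."""
    mapping = defaultdict(set)
    for line in rows_text.splitlines():
        fields = line.split()
        if len(fields) != len(tags):
            continue  # skip "..." and anything that is not a full row
        row = dict(zip(tags, fields))
        mapping[row["label_asym_id"]].add(
            (row["auth_asym_id"], row["auth_comp_id"], row["auth_seq_id"])
        )
    return dict(mapping)


if __name__ == "__main__":
    print(label_to_auth(ROWS))
```

For the rows above, label_asym_id ‘Y’ maps to a single author-style residue: NAG 1301 in chain ‘A’.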

Author-provided chain ID and residue numbering for oligosaccharides are
retained in the PDBx/mmCIF file (_pdbx_branch_scheme.auth_asym_id and
_pdbx_branch_scheme.auth_seq_num, respectively). Users can map how
carbohydrates are described in the corresponding primary citation to the
PDBx/mmCIF files using the _pdbx_branch_scheme mapping category. wwPDB
strongly encourages depositors to use the wwPDB-assigned chain IDs and
residue numbers in any publication material.

For example, in PDB entry 6WPS:

loop_
_pdbx_branch_scheme.asym_id
_pdbx_branch_scheme.entity_id
_pdbx_branch_scheme.mon_id
_pdbx_branch_scheme.num
_pdbx_branch_scheme.pdb_asym_id
_pdbx_branch_scheme.pdb_mon_id
_pdbx_branch_scheme.pdb_seq_num
_pdbx_branch_scheme.auth_asym_id
_pdbx_branch_scheme.auth_mon_id
_pdbx_branch_scheme.auth_seq_num
_pdbx_branch_scheme.hetero
J 4 NAG 1 I NAG 1 A NAG 1310 n
J 4 NAG 2 I NAG 2 A NAG 1311 n
K 4 NAG 1 J NAG 1 A NAG 1312 n
K 4 NAG 2 J NAG 2 A NAG 1313 n
L 4 NAG 1 K NAG 1 A NAG 1315 n
L 4 NAG 2 K NAG 2 A NAG 1316 n
M 4 NAG 1 M NAG 1 A NAG 1317 n
M 4 NAG 2 M NAG 2 A NAG 1318 n
N 5 NAG 1 N NAG 1 A NAG 1321 n
N 5 NAG 2 N NAG 2 A NAG 1322 n
N 5 BMA 3 N BMA 3 A BMA 1323 n
N 5 MAN 4 N MAN 4 A MAN 1325 n
N 5 MAN 5 N MAN 5 A MAN 1324 n
N 5 FUC 6 N FUC 6 A FUC 1320 n
O 4 NAG 1 O NAG 1 B NAG 1310 n
O 4 NAG 2 O NAG 2 B NAG 1311 n
P 4 NAG 1 P NAG 1 B NAG 1312 n
P 4 NAG 2 P NAG 2 B NAG 1313 n
Q 4 NAG 1 Q NAG 1 B NAG 1315 n
Q 4 NAG 2 Q NAG 2 B NAG 1316 n
R 4 NAG 1 R NAG 1 B NAG 1317 n
R 4 NAG 2 R NAG 2 B NAG 1318 n
S 5 NAG 1 S NAG 1 B NAG 1321 n
S 5 NAG 2 S NAG 2 B NAG 1322 n
S 5 BMA 3 S BMA 3 B BMA 1323 n
S 5 MAN 4 S MAN 4 B MAN 1325 n
S 5 MAN 5 S MAN 5 B MAN 1324 n
S 5 FUC 6 S FUC 6 B FUC 1320 n
...
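
To make the mapping concrete, here is a minimal Python sketch that turns _pdbx_branch_scheme rows like those above into a lookup from the author's (chain ID, residue number) to the wwPDB-assigned (chain ID, residue number, monomer). The function name author_to_wwpdb is invented for this example, the column order is taken from the loop header above, and the whitespace split assumes no quoted values. It also assumes each author (chain, number) pair occurs only once in the entry.

```python
# Column order copied from the _pdbx_branch_scheme loop header shown above.
BRANCH_TAGS = [
    "asym_id", "entity_id", "mon_id", "num",
    "pdb_asym_id", "pdb_mon_id", "pdb_seq_num",
    "auth_asym_id", "auth_mon_id", "auth_seq_num", "hetero",
]

# The six rows of one 6WPS oligosaccharide, as quoted above.
ROWS = """\
N 5 NAG 1 N NAG 1 A NAG 1321 n
N 5 NAG 2 N NAG 2 A NAG 1322 n
N 5 BMA 3 N BMA 3 A BMA 1323 n
N 5 MAN 4 N MAN 4 A MAN 1325 n
N 5 MAN 5 N MAN 5 A MAN 1324 n
N 5 FUC 6 N FUC 6 A FUC 1320 n
"""


def author_to_wwpdb(rows_text):
    """Map (auth chain, auth number) -> (wwPDB chain, wwPDB number, monomer)."""
    lookup = {}
    for line in rows_text.splitlines():
        fields = line.split()
        if len(fields) != len(BRANCH_TAGS):
            continue  # skip "..." and incomplete lines
        row = dict(zip(BRANCH_TAGS, fields))
        key = (row["auth_asym_id"], int(row["auth_seq_num"]))
        lookup[key] = (
            row["pdb_asym_id"], int(row["pdb_seq_num"]), row["pdb_mon_id"]
        )
    return lookup


if __name__ == "__main__":
    for auth, wwpdb in sorted(author_to_wwpdb(ROWS).items()):
        print(auth, "->", wwpdb)
```

With these rows, the author's residue A 1325 corresponds to MAN 4 of chain N, and A 1320 to FUC 6 of chain N, which shows that the author numbering need not follow the reducing-end-first order.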

As some users pointed out, a single NAG could be just the part of the
glycan that the author chose to build, as most natural N-glycans must
have a stem consisting of a common core of 5 monosaccharides or its
fucosylated version, such as those modeled in PDB ID 6WPS. However, the
PDB is a 3D-atomic coordinate archive in which the model coordinates are
built based on supporting experimental data. Therefore, carbohydrates
are described as-is in the modeled structures without reference to
missing components of the presumed oligosaccharide sequence. If the
author only builds a monosaccharide, then this monosaccharide is
described as a non-polymer ligand.

Glycosylation annotation has been provided to facilitate searches of all
glycosylation sites. A total of 45,000 glycosylation sites have been
annotated in _struct_conn.pdbx_role in over 7500 PDB structures to
identify all glycosylation sites and the monosaccharides bound at such
sites. The annotation specifies the glycosylation sites, the
monosaccharide identity and the chain IDs in either PDB format or mmCIF
format. In PDB ID 6WPS, a user can search for N-Glycosylation in
‘_struct_conn.pdbx_role’ and find 16 glycosylation sites between ASN and
NAG on chain A alone.
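
A search of this kind could look roughly like the Python sketch below. It assumes the _struct_conn rows have already been read into dictionaries keyed by item name (for example with an mmCIF parser); the function name n_glycosylation_sites is made up here. The first sample row reflects the ASN 61 / NAG 1301 link on chain A described earlier in this message; the second row (a disulfide) is purely illustrative and is not taken from 6WPS.

```python
# Sample rows keyed by _struct_conn item names.
SAMPLE_STRUCT_CONN = [
    {   # ASN 61 / NAG 1301 on chain A, as described in this message
        "conn_type_id": "covale", "pdbx_role": "N-Glycosylation",
        "ptnr1_auth_asym_id": "A", "ptnr1_auth_comp_id": "ASN",
        "ptnr1_auth_seq_id": "61",
        "ptnr2_auth_asym_id": "A", "ptnr2_auth_comp_id": "NAG",
        "ptnr2_auth_seq_id": "1301",
    },
    {   # ILLUSTRATIVE ONLY: an invented disulfide row with no role
        "conn_type_id": "disulf", "pdbx_role": "?",
        "ptnr1_auth_asym_id": "A", "ptnr1_auth_comp_id": "CYS",
        "ptnr1_auth_seq_id": "100",
        "ptnr2_auth_asym_id": "A", "ptnr2_auth_comp_id": "CYS",
        "ptnr2_auth_seq_id": "200",
    },
]


def n_glycosylation_sites(rows, chain=None):
    """Return sorted (chain, ASN number) pairs annotated as N-Glycosylation."""
    sites = set()
    for row in rows:
        if row.get("pdbx_role") != "N-Glycosylation":
            continue
        # The ASN may be listed as either partner, so check both.
        for partner in ("ptnr1", "ptnr2"):
            if row[f"{partner}_auth_comp_id"] != "ASN":
                continue
            asym = row[f"{partner}_auth_asym_id"]
            if chain is None or asym == chain:
                sites.add((asym, int(row[f"{partner}_auth_seq_id"])))
    return sorted(sites)


if __name__ == "__main__":
    print(n_glycosylation_sites(SAMPLE_STRUCT_CONN, chain="A"))
```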

The wwPDB encourages the community to use PDBx/mmCIF format files rather
than the frozen legacy PDB file format. The legacy format cannot support
large structures. Currently, PDB-format files are not available for
large structures that have more than 62 chains or more than 99,999
atoms. In addition, the legacy format cannot support ligand ID codes
longer than 3 characters, which will be needed in the coming years.
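
The limits just listed can be expressed as a small check. The Python sketch below is only an illustration: the function name legacy_pdb_blockers and its arguments are invented for this example, and the ligand ID "A1ABC" in the usage lines is a hypothetical code, not a real chemical component.

```python
# Limits of the legacy PDB format as stated in this message.
MAX_LEGACY_CHAINS = 62
MAX_LEGACY_ATOMS = 99_999
MAX_LEGACY_LIGAND_ID_LEN = 3


def legacy_pdb_blockers(n_chains, n_atoms, ligand_ids):
    """Return the reasons a structure cannot be written in legacy PDB format.

    An empty list means none of the three limits is exceeded.
    """
    blockers = []
    if n_chains > MAX_LEGACY_CHAINS:
        blockers.append(f"{n_chains} chains (limit {MAX_LEGACY_CHAINS})")
    if n_atoms > MAX_LEGACY_ATOMS:
        blockers.append(f"{n_atoms} atoms (limit {MAX_LEGACY_ATOMS})")
    long_ids = sorted(
        {lig for lig in ligand_ids if len(lig) > MAX_LEGACY_LIGAND_ID_LEN}
    )
    if long_ids:
        blockers.append(
            "ligand IDs longer than 3 characters: " + ", ".join(long_ids)
        )
    return blockers


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise each limit.
    print(legacy_pdb_blockers(10, 50_000, ["NAG", "BMA"]))
    print(legacy_pdb_blockers(63, 100_000, ["NAG", "A1ABC"]))
```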

We thank you again for your feedback. The wwPDB is committed to
improving data representation in the PDB archive. Please do not hesitate
to contact us at info@wwpdb.org.

Regards,

Jasmine

---==========================
Jasmine Young, Ph.D.
Biocuration Team Lead
RCSB Protein Data Bank
Research Professor
Institute for Quantitative Biomedicine
Rutgers, The State University of New Jersey
174 Frelinghuysen Rd
Piscataway, NJ 08854-8087

Email:jasmine@rcsb.rutgers.edu
Phone: (848)445-0103 ext 4920
Fax: (732)445-4320

---==========================

On 12/4/20 3:15 PM, Marcin Wojdyr wrote:
> On Fri, 4 Dec 2020 at 19:16, Dale Tronrud <detBB@daletronrud.com> wrote:
>> Creating meaning in the chain names "A, B, C, Ag1, Ag2, Ag3" is
>> exactly the problem.
> It's not about "creating meaning" but about consistent naming. For humans.
>
>> "chain names" ( or "entity identifiers" if I
>> recall the mmCIF terminology correctly) are simply database "indexes".
> No, entity is a somewhat different thing (multiple chains can point to
> the same entity). entity_id is specified in addition to label_asym_id
> and auth_asym_id.
> asym = "structural element in the asymmetric unit" (so-called chain).
>
>> The values of indices are meaningless in themselves, they are just
>> unique values that can be used to unambiguously identify a record. In
>> principle, you could just assign random ISO characters (I don't think
>> mmCIF allows unicode) and the mmCIF would be considered identical.
> And then you'd use this random string also in a publication when
> referring to the chain, and in the user interface?
>
>> You are trying to force meaning to the characters with an index, and
>> that puts multiple types of information in a single field. As Robbie
>> said already exists, if you want to encode connectivity into the data
>> base you have to add records that define that connectivity.  That places
>> the connectivity information explicitly in the data models and allows
>> standard data base tools to track and validate.
> No one was proposing to replace connectivity with names.
> It was about naming that will be easier to work with for people.
>
>> learn the sequence you have to go to the mmCIF records that define the
>> connectivity between residues.  It is entirely possible that "3" comes
>> before "1" because these indexes don't contain any information, other
>> than being unique within the chain.
> In mmCIF you have label_seq_id that must be both unique and
> sequential. So 3 is always the third residue wrt to the full sequence.
>
> ########################################################################
>
> To unsubscribe from the CCP4BB list, click the following link:
> https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1
>
> This message was issued to members of http://www.jiscmail.ac.uk/CCP4BB, a mailing list hosted by http://www.jiscmail.ac.uk, terms & conditions are available at https://www.jiscmail.ac.uk/policyandsecurity/
