1 Vaiman D, Pietrokovski S, Cohen B, Benech P, Chebath J
Synergism of type I and type II interferons in stimulating the activity of the same DNA enhancer
FEBS Lett 265:12-16, 1990 (Medline ID: 90306300)
Abstract
Type I and type II interferons (IFNs) can act synergistically to activate the transcription of the 2-5A synthetase gene. We used in vivo functional assays of sequences from the gene promoter region to determine which DNA segment mediates the gene induction by IFN gamma and the synergistic effect. We found that the type I IFN-inducible enhancer (or IRS) of the 2-5A synthetase gene also confers inducibility by type II IFN to a reporter CAT gene, though the time course and dose response of the induction by the two IFNs are quite different. A clear synergism of the two IFNs in stimulating the IRS is observed at low doses of the two IFNs.

2 Pietrokovski S, Hirshon J, Trifonov EN
Linguistic measure of taxonomic and functional relatedness of nucleotide sequences
J Biomol Struct Dyn 7:1251-1268, 1990 (Medline ID: 2363847)
Abstract
The frequencies of "words", oligonucleotides within nucleotide sequences, reflect the genetic information contained in the sequence "texts". Nucleotide sequences are characteristically represented by their contrast word vocabularies. Comparison of the sequences by correlating their contrast vocabularies is shown to reflect well the relatedness (unrelatedness) between the sequences. A single value, the linguistic similarity between the sequences, is suggested as a measure of sequence relatedness. Sequences as short as 1000 bases can be characterized and quantitatively related to other sequences by this technique. The linguistic sequence similarity value is used for analysis of taxonomically and functionally diverse nucleotide sequences. The similarity value is shown to be very sensitive to the relatedness of the source species, thus providing a convenient tool for taxonomic classification of species by their sequence vocabularies. Functionally diverse sequences appear distinct by their linguistic similarity values. This can be a basis for a quick screening technique for functional characterization of the sequences and for mapping functionally distinct regions in long sequences.
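
The idea of comparing sequences through their word vocabularies can be illustrated with a minimal sketch. It is not the contrast-vocabulary measure of the paper: as a simplifying assumption it correlates plain oligonucleotide frequencies, and the function names and the default word length k=2 are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt

def word_frequencies(seq, k=2):
    """Relative frequency of each overlapping k-letter word over A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)  # words with other letters are ignored
    if total == 0:
        raise ValueError("sequence contains no valid words of length k")
    return [counts[w] / total for w in words]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)

def vocabulary_similarity(seq_a, seq_b, k=2):
    """Illustrative stand-in for a linguistic similarity value (assumption)."""
    return pearson(word_frequencies(seq_a, k), word_frequencies(seq_b, k))
```

Because no alignment is computed, the cost grows linearly with sequence length, which is in keeping with the quick screening use the abstract suggests.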

3 Pietrokovski S, Trifonov EN
Imported sequences in the mitochondrial yeast genome identified by nucleotide linguistics
Gene 122:129-137, 1992 (Medline ID: 1452019)
Abstract
In addition to universally appearing mitochondrial (mt) genes, origins of replication and transcription start regions typical of all mt genome variants of the yeast Saccharomyces cerevisiae, the mt genomes of some of the strains contain variable sequences. These sequences are apparently largely dispensable. They are mainly composed of group-I and -II introns and intergenic open reading frames (ORFs). Many of the introns contain ORFs, some of which were shown by genetic and biochemical means to be involved in splicing and transposition of the mt introns. Some of the optional sequences are hypothesized to be mobile genetic elements. Nucleotide (nt) sequences of the mt genome of S. cerevisiae were examined by analyzing occurrences of oligodeoxyribonucleotide (oligo) 'words'. This linguistic technique had been found to be sensitive to both function and origin of the sequence [Pietrokovski et al., J. Biomol. Struct. Dyn. 7 (1990) 1251-1268]. A clear difference is found between the oligo vocabularies of the optional and basic yeast mt sequences. The difference is mainly located in protein coding segments of the optional sequences which contain conserved amino acid motifs, characteristic of intronic and intergenic ORFs. The use of nt linguistics to detect the sequence dissimilarity and its causes in yeast mitochondria provides fast and straightforward results, identifying the intronic and intergenic ORFs as DNA sequences of foreign, non-mt origin.

4 Pietrokovski S
Comparing nucleotide and protein sequences by linguistic methods
J Biotechnol 35:257-272, 1994 (Medline ID: 7765062)
Abstract
Nucleotide and amino acid sequences can be analyzed and compared by their oligomer compositions. Such methods are fundamentally different from comparison methods based on sequence alignment. They are analogous to the linguistic analysis of human texts. The methods have a wide range of sensitivity and can identify homologous as well as functionally and taxonomically related sequences. Significant sequence dissimilarity can also be identified enabling detection of foreign DNA sequences in genomes, genetic libraries and databases. The simplicity and speed of linguistic methods make them very suitable for database searching and maintenance and as a preliminary step to more specific and time-consuming analysis methods.

5 Pietrokovski S
Conserved sequence features of inteins (protein introns) and their use in identifying new inteins and related proteins
Protein Science 3:2340-2350, 1994 (Medline ID: 7756989)
Abstract
Inteins (protein introns) are internal portions of protein sequences that are posttranslationally excised while the flanking regions are spliced together, making an additional protein product. Inteins have been found in a number of homologous genes in yeast, mycobacteria, and extreme thermophile archaebacteria. The inteins are probably multifunctional, autocatalyzing their own splicing, and some were also shown to be DNA endonucleases. The splice junction regions and two regions similar to homing endonucleases were thought to be the only common sequence features of inteins. This work analyzed all published intein sequences with recently developed methods for detecting weak, conserved sequence features. The methods complemented each other in the identification and assessment of several patterns characterizing the intein sequences. New intein conserved features are discovered and the known ones are quantitatively described and localized. The general sequence description of all the known inteins is derived from the motifs and their relative positions. The intein sequence description is used to search the sequence databases for intein-like proteins. A sequence region in a mycobacterial open reading frame possessing all of the intein motifs and absent from sequences homologous to both of its flanking sequences is identified as an intein. A newly discovered putative intein in red algae chloroplasts is found not to contain the endonuclease motifs present in all other inteins. The yeast HO endonuclease is found to have an overall intein-like structure and a few viral polyprotein cleavage sites are found to be significantly similar to the intein amino-end splice junction motif. The intein features described may serve for detection of intein sequences.

6 Henikoff S, Henikoff JG, Alford WJ, Pietrokovski S
Automated construction and graphical presentation of protein blocks from unaligned sequences
Gene 163:GC17-GC26, 1995 (Medline ID: 7590261)
Abstract
Protein blocks consist of multiply aligned sequence segments that correspond to the most highly conserved regions of protein families. Typically, a set of related proteins has more than one region in common and their relationship can be represented as a series of ungapped blocks separated by unaligned regions. Blockmaker is an automated system available by electronic mail (blockmaker@howard.fhcrc.org) and the World Wide Web (http://www.blocks.fhcrc.org) that finds blocks in a group of related protein sequences submitted by the user. It adapts and extends existing algorithms to make them useful to biologists looking for conserved regions in a group of related protein sequences. Two sets of blocks are returned, one in which candidate blocks are detected using the MOTIF algorithm and the other using a Gibbs sampler algorithm that has been adapted for full automation. This use of two block-finding methods based on completely different principles provides a 'reality check,' whereby a block detected by both methods is considered to be correct. Resulting blocks can be displayed using the information-based 'sequence logo' method, adapted to incorporate sequence weights, which provides an intuitive visual description of both the residue and the conservation information at each position. Blocks generated by this system are useful in diverse applications, such as searching databases and designing degenerate PCR primers. As an example, blocks made from amino acid sequences related to Caenorhabditis elegans Tc1 transposase were used to search GenBank, revealing that several fish and amphibian genomic sequences harbor previously unreported Tc1 homologs.
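
The conservation value that a sequence logo plots at each position can be sketched as the maximum possible entropy minus the observed entropy of the weighted residue distribution. The following is a minimal illustration only: the function name is hypothetical, the small-sample correction used by logo programs is omitted, and the way Blockmaker derives its sequence weights is not stated in the abstract, so the weights are simply passed in.

```python
from collections import defaultdict
from math import log2

def column_information(column, weights=None, alphabet_size=20):
    """Information content, in bits, of one block column.

    column  -- string with one residue per aligned sequence
    weights -- optional per-sequence weights (assumed given; uniform if omitted)
    """
    if weights is None:
        weights = [1.0] * len(column)
    if not column or len(weights) != len(column):
        raise ValueError("need a non-empty column and one weight per residue")
    totals = defaultdict(float)
    for residue, weight in zip(column.upper(), weights):
        totals[residue] += weight
    weight_sum = sum(totals.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    # Shannon entropy of the weighted residue distribution
    entropy = -sum((w / weight_sum) * log2(w / weight_sum)
                   for w in totals.values() if w > 0)
    return log2(alphabet_size) - entropy
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, and down-weighting a redundant sequence changes the value in the same way as removing part of it from the column.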

7 Pietrokovski S, Henikoff JG, Henikoff S
The Blocks database--a system for protein classification
Nucl Acids Res 24:197-200, 1996 (Medline ID: 8594578)
Abstract
The Blocks Database contains multiple alignments of conserved regions in protein families. The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences.

8 Pietrokovski S
A new intein in Cyanobacteria and its significance for the spread of inteins
Trends Genet 12:287-288, 1996 (Medline ID: 8783935)
Summary
A new intein (protein intron) is identified inside a dnaB gene of a thermophilic cyanobacterium. It is the first intein reported in cyanobacteria and the first eubacterial intein outside mycobacteria. The intein is integrated in exactly the same position as an intein from a chloroplast of a red alga. The last common ancestor of cyanobacteria and red algae existed about 1.25 to 2.1 billion years ago. This leads to observations and questions about the inteins' origin, mode of genetic spread and possible selective advantage to their hosts.

9 Pietrokovski S
Searching Databases of Conserved Sequence Regions by Aligning Protein Multiple-Alignments
Nucl Acids Res 24:3836-3845, 1996 (Medline ID: 8871566)
Abstract
A general searching method for comparing multiple sequence alignments was developed to detect sequence relationships between conserved protein regions. Multiple alignments are treated as sequences of amino acid distributions and aligned by comparing pairs of such distributions. Four different comparison measures were tested and the Pearson correlation coefficient chosen. The method is sensitive, detecting weak sequence relationships between protein families. Relationships are detected beyond the range of conventional sequence database searches, illustrating the potential usefulness of the method. The previously undetected relation between flavoprotein subunits of two oxidoreductase families points to the potential active site in one of the families. The similarity between the bacterial RecA, DnaA and Rad51 protein families reveals a region in DnaA and Rad51 proteins likely to bind and unstack single-stranded DNA. Helix-turn-helix DNA binding domains from diverse proteins are readily detected and shown to be similar to each other. Glycosylasparaginase and gamma-glutamyltransferase enzymes are found to be similar in their proteolytic cleavage sites. The method has been fully implemented on the World Wide Web at URL: http://blocks.fhcrc.org/blocks-bin/LAMA_search
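
Treating alignment columns as amino acid distributions and scoring column pairs with the Pearson correlation coefficient can be sketched as follows. This is an assumption-laden illustration, not the LAMA implementation: it uses raw residue frequencies without weights or pseudocounts, tries only ungapped offsets, averages the column scores, and assigns no statistical significance. All function names and the min_overlap parameter are hypothetical.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_distribution(column):
    """Amino acid frequencies for one alignment column."""
    residues = [r for r in column.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("column contains no standard amino acids")
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant vector (a sketch choice)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def block_columns(block):
    """Transpose equal-length aligned segments into column strings."""
    if len({len(segment) for segment in block}) != 1:
        raise ValueError("all segments in a block must have the same length")
    return ["".join(col) for col in zip(*block)]

def best_ungapped_alignment(block_a, block_b, min_overlap=3):
    """Slide block_b along block_a; return (mean column correlation, offset).

    Column i of block_a is paired with column i - offset of block_b.
    Returns None if no offset gives at least min_overlap paired columns.
    """
    dist_a = [column_distribution(c) for c in block_columns(block_a)]
    dist_b = [column_distribution(c) for c in block_columns(block_b)]
    best = None
    for offset in range(-(len(dist_b) - 1), len(dist_a)):
        pairs = [(dist_a[i], dist_b[i - offset])
                 for i in range(len(dist_a)) if 0 <= i - offset < len(dist_b)]
        if len(pairs) < min_overlap:
            continue
        score = sum(pearson(p, q) for p, q in pairs) / len(pairs)
        if best is None or score > best[0]:
            best = (score, offset)
    return best
```

For example, sliding the three-column block ["CDE", "CDE", "CDE"] along ["ACDEF", "ACDEY", "GCDEF"] gives the best mean correlation at offset 1, where the conserved C, D and E columns coincide.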

10 Henikoff JG, Pietrokovski S, Henikoff S
Recent enhancements to the Blocks database servers
Nucl Acids Res 25:222-226, 1997 (Medline ID: 9016540)
Abstract
The Blocks Database contains multiple alignments of conserved regions in protein families and can be searched by e-mail (blocks@howard.fhcrc.org) and World Wide Web (http://blocks.fhcrc.org) servers to classify protein and nucleotide sequences. Recent enhancements to the servers include: Improved calculation of position-specific scoring matrices from blocks; availability of the Prints protein fingerprint database for searching in Blocks Database format; a representative sequence for a protein family biassed towards the Blocks of the family which can be used to search sequence databases; a tree constructed from the Blocks for a protein family; links to related World Wide Web pages for a family; and the new Local Alignment of Multiple Alignments (LAMA) system to search a Block against a database of Blocks.

11 Pietrokovski S, Henikoff S
A helix-turn-helix DNA-binding motif predicted for transposases of DNA transposons
Mol Gen Genet 254:689-695, 1997 (Medline ID: 9202385)
Abstract
A helix-turn-helix (HTH) DNA-binding motif is identified in Tc1, mariner and pogo DNA transposons. The findings are supported by results of various sequence analysis methods. Tc1 transposases are also predicted to contain another DNA-binding region. These findings are in accord with experimental evidence obtained from Tc1A, Tc3A and pogo transposases. The pogo family transposases, but not the pogo type transcription factors, contain the HTH motif, suggesting that HTH structures are essential for Tc1/mariner/pogo transposition. Analysis of multiple sequence alignments enabled the identification of the HTH motif in distantly related protein sequences.

12 Pietrokovski S
Modular organization of inteins and C-terminal autocatalytic domains
Protein Science 7:64-71, 1998 (Medline ID: 9514260)
Abstract
Analysis of the conserved sequence features of inteins (protein "introns") reveals that they are composed of three distinct modular domains. The N-terminal (N) and C-terminal (C) domains are predicted to perform different parts of the autocatalytic protein splicing reaction. An optional endonuclease domain (EN) is shown to correspond to different types of homing endonucleases in different inteins. The N-terminal domain (N) contains motifs predicted to catalyze the first steps of protein splicing, leading to the cleavage of the intein N-terminus from its protein host. Intein N-domain motifs are also found in C-terminal autocatalytic domains (CADs) present in hedgehog and other protein families. Specific residues in the N-domain of intein and CADs are proposed to form a charge relay system involved in cleaving their N-termini. The intein C-terminal domain (C) is apparently unique to inteins and contains motifs that catalyze the final protein splicing steps: the ligation of the intein flanks and the cleavage of its C-terminus to release the free intein and spliced host protein. All intein EN-domains known thus far have dodecapeptide (DOD, LAGLI-DADG) type homing endonuclease motifs. This work identifies an EN-domain with an HNH homing-endonuclease motif and two new small inteins with no EN-domains. One of these small inteins might be inactive or a "pseudo intein". The presented results suggest a modular architecture for inteins, clarify their origin and relationship to other protein families, and extend recent experimental findings on the functional roles of intein N-, C- and EN- motifs.

13 Smith DR, Doucette-Stamm LA, Deloughery C, Lee H, Dubois J, Aldredge T, Bashirzadeh R, Blakely D, Cook R, Gilbert K, Harrison D, Hoang L, Keagle P, Lumm W, Pothier B, Qiu D, Spadafora R, Vicaire R, Wang Y, Wierzbowski J, Gibson R, Jiwani N, Caruso A, Bush D, Safer H, Patwell D, Prabhakar S, McDougall S, Shimer G, Goyal A, Pietrokovski S, Church GM, Daniels CJ, Mao J, Rice P, Nolling J, Reeve JN
Complete genome sequence of Methanobacterium thermoautotrophicum strain deltaH: Functional analysis and comparative genomics
J Bacteriology 179:7135-7155, 1997 (Medline ID: 9371463)
Abstract
The complete 1,751,377 bp sequence of the genome of the thermophilic archaeon Methanobacterium thermoautotrophicum strain deltaH has been determined by a whole genome shotgun sequencing approach. 1,855 open reading frames (ORFs) have been identified that appear likely to encode polypeptides, 807 (44%) of which have been assigned putative functions based on their similarities to database sequences with assigned functions. 547 (29%) of the ORF-encoded amino acid sequences are related to database sequences with unknown functions, and 501 (27%) have little or no homology to database sequences. Comparisons with eucaryal, bacterial and archaeal specific databases reveal that 1,013 of the putative ORF-encoded gene products (54% of the total) have sequences most similar to polypeptide sequences described previously in other Archaea, and 210 (11%) have sequences with significant similarity only to archaeal polypeptides. Comparisons with the Methanococcus jannaschii genome data underline the extensive divergence that has occurred between the two methanogens. Only 352 (19%) of M. thermoautotrophicum ORFs encode sequences that are >50% identical to M. jannaschii ORF-encoded sequences, and only 14 (<1%) polypeptides are predicted to have sequences that are >70% identical in the two methanogens. There is also little conservation in the relative locations of orthologous genes within the two methanogen genomes. When the M. thermoautotrophicum ORF-encoded sequences are evaluated in terms of their similarity to bacterial versus eucaryal polypeptide sequences, 786 (42%) are more similar to bacterial sequences and 241 (13%) are more similar to eucaryal sequences.
The majority of gene products predicted to be proteins involved in cofactor and small molecule biosyntheses, intermediary metabolism, transport, nitrogen fixation, regulatory functions and interactions with the environment have sequences more similar to bacterial than eucaryal sequences, whereas many proteins predicted to be involved in DNA metabolism, transcription, and translation have sequences more similar to eucaryal than bacterial sequences. Most M. thermoautotrophicum ORFs appear to be preceded by ribosome binding sites and ORFs predicted to encode functionally related gene products are frequently clustered in what appear to be multigene transcriptional units. These include ORFs that encode polypeptides with sequences related to proteins found in Eucarya but not in Bacteria. The M. thermoautotrophicum genome is predicted to encode 24 polypeptides that could form two-component sensor kinase-response regulator systems, homologs of the bacterial Hsp70-response proteins DnaK and DnaJ, homologs of eucaryal DNA replication initiation Cdc6 proteins, an X-family repair-type DNA polymerase and an unusual archaeal B-type DNA polymerase formed by two separate polypeptides encoded by genes that are ~0.65 Mb apart. These are all molecular features notably absent in M. jannaschii. DNA replication and genome organization in M. thermoautotrophicum appear to have eucaryal features, based on the predicted presence of two Cdc6 homologs and three histones, whereas the presence of an ftsZ gene indicates a bacterial type of cell division initiation. DNA-dependent RNA polymerase (RNAP) subunits A', A'', B', B'' and H are encoded in a typical archaeal RNAP operon, and a second A' subunit encoding gene, that contains frameshifts, is present at a remote location. There are two rRNA operons, separated by only ~110 kb, and both contain a tRNAala (UGC) gene between the 16S and 23S rRNA genes. Immediately upstream of one rRNA operon is the 7S RNA gene and a tRNAser (GCU) gene.
There are 39 tRNA genes, ten in two 5-gene clusters and 16 and 2-gene clusters. The remainder, apart from the rRNA operon associated tRNA genes, are dispersed apparently as single gene transcriptional units around the genome. Introns are present between positions 37 and 38 of the mature anticodon loop in the elongation tRNAmet, tRNAtrp and tRNApro (GGG) genes, and the tRNApro (GGG) gene contains a second intron, at an unprecedented location, between nucleotides 32 and 33 of the mature anticodon loop. There is no selenocysteinyl-tRNA gene, nor evidence for classically organized IS elements, prophages or plasmids. The M. thermoautotrophicum genome contains one intein, located in the alpha chain of ribonucleoside-diphosphate reductase, and 2 extended repeats (3.6 kb and 8.6 kb in length) that are members of a repeat family that has 18 representatives in the M. jannaschii genome.

14 Henikoff S, Greene EA, Pietrokovski S, Bork P, Attwood TK, Hood L
Gene families: the taxonomy of protein paralogs and chimeras
Science 278:609-614, 1997 (Medline ID: 9381171)
Abstract
Ancient duplications and rearrangements of protein-coding segments have resulted in complex gene family relationships. Duplications can be tandem or dispersed and can involve entire coding regions or modules that correspond to folded protein domains. As a result, gene products may acquire new specificities, altered recognition properties or modified functions. Extreme proliferation of some families within an organism, perhaps at the expense of other families, may correspond to functional innovations during evolution. The underlying processes are still at work, and the large fraction of human and other genomes consisting of transposable elements may be a manifestation of the evolutionary benefits of genomic flexibility.

15 Greene EA, Pietrokovski S, Henikoff S, Bork P, Attwood TK, Hood L, Bairoch A
GENOME MAPS 8: Building gene families (wall chart)
Science 278:615, 1997 (Medline ID: 9381172)
Abstract
Genome sequencing projects and other large-scale efforts are generating hundreds of thousands of sequences of new proteins from diverse organisms. The task of discovering the structure and function of an unknown protein is aided by the fact that most new genes are related to other genes, and these relationships can often be detected via sequence similarity. Perhaps half of all known genes encode members of some 3000 major families. Family members share sequence and structural similarities, suggesting divergence from a common ancestor. Unlike proteins that are direct counterparts in different organisms, there can be many members of a gene family within one organism that carry out distinct, yet similar, functions. For the organism itself, the existence of gene families provides a way of generating diversity in function and specificity from a limited number of building blocks, which is essential for the evolutionary success of a genome. Within large eukaryotic genomes, gene family size varies tremendously, ranging from a unique member to thousands of members. Even smaller genomes harbor families that comprise several percent of their genome.

16 Henikoff S, Pietrokovski S, Henikoff JG
Superior performance in protein homology detection with the Blocks database servers
Nucleic Acids Research 26:309-312, 1998 (Medline ID: 9399861)
Abstract
The Blocks Database World Wide Web (http://www.blocks.fhcrc.org) and e-mail (blocks@blocks.fhcrc.org) servers provide tools for the detection and analysis of protein homology based on alignment blocks representing conserved regions of proteins. During the past year, searching has been augmented by supplementation of the Blocks Database with blocks from the Prints Database for a total of 4754 blocks from 1163 families. Blocks from both the Blocks and Prints Databases and blocks that are constructed from sequences submitted to Block Maker can be used for blocks-versus-blocks searching of these databases with LAMA, and for viewing logos and bootstrap trees. Sensitive searches of up-to-date protein sequence databanks are carried out via direct links to the MAST server using position-specific scoring matrices and to the BLAST and PSI-BLAST servers using consensus-embedded sequence queries. Utilizing the trypsin family to evaluate performance, we illustrate the superiority of blocks-based tools over expert pairwise searching or Hidden Markov Models.

17 Stoddard BL, Pietrokovski S
Breaking up is hard to do
Nature Structural Biology 5:3-6, 1998 (Medline ID: 9437416)
A Research News and Views review and commentary
Abstract
The structure of the gyrA intein provides a view of its active site prior to the initial step of the protein splicing pathway. An intact leaving group at the N-terminal splice junction displays a highly strained cis geometry that is potentially critical for driving the isomerization of the scissile peptide bond, inducing subsequent cleavage.

18 Pietrokovski S, Henikoff JG, Henikoff S
Exploring protein homology with the Blocks database
Trends In Genetics 14:162-163, 1998 (Medline ID: 9594665)

19 Rose TM, Schultz ER, Henikoff JG, Pietrokovski S, McCallum CM, Henikoff S
Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly-related sequences
Nucleic Acids Research 26:1628-1635, 1998 (Medline ID: 9512532)
Abstract
We describe a new primer design strategy for PCR amplification of unknown targets that are related to multiply-aligned protein sequences. Each primer consists of a short 3' degenerate core region and a longer 5' consensus clamp region. Only 3-4 highly conserved amino acid residues are necessary for design of the core, which is stabilized by the clamp during annealing to template molecules. During later rounds of amplification, the non-degenerate clamp permits stable annealing to product molecules. We demonstrate the practical utility of this hybrid primer method by detection of diverse reverse transcriptase-like genes in a human genome, and by detection of C5 DNA methyltransferase homologs in various plant DNAs. In each case, amplified products were sufficiently pure to be cloned without gel fractionation. This COnsensus-DEgenerate Hybrid Oligonucleotide Primer (CODEHOP) strategy has been implemented as a computer program that is accessible over the World-Wide Web (http://blocks.fhcrc.org/codehop.html) and is directly linked from the BlockMaker multiple sequence alignment site for hybrid primer prediction beginning with a set of related protein sequences.
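
Why a core of only 3-4 residues keeps the primer pool small can be seen by counting the distinct nucleotide sequences that encode a short peptide under the standard genetic code. The sketch below is illustrative arithmetic only: the function name is hypothetical, and the degeneracy calculation used by the CODEHOP program itself is not described in the abstract and may differ.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
CODON_COUNTS = {
    "M": 1, "W": 1,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "I": 3,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "L": 6, "R": 6, "S": 6,
}

def core_degeneracy(peptide):
    """Number of distinct coding sequences for a peptide (hypothetical helper)."""
    try:
        return prod(CODON_COUNTS[aa] for aa in peptide.upper())
    except KeyError as err:
        raise ValueError(f"not a standard amino acid: {err.args[0]}") from None
```

A core built on W, D, M and K is encoded by only 1 x 2 x 1 x 2 = 4 sequences, whereas one built on L, S and R would need 6 x 6 x 6 = 216, which is why conserved residues with few codons make better cores.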

20 Pietrokovski S
Identification of a virus intein and a possible variation in the protein-splicing reaction
Current Biology 8:R634-R635, 1998 (Medline ID: 9740808)
Summary
The first identified animal virus intein is reported. On the basis of the sequence relation of the viral intein and its protein host to other proteins, viruses are proposed as vehicles of intein dispersion. A new type of residue in the intein's carboxy-terminal end suggests a variation to the protein-splicing mechanism.

21 Henikoff JG, Henikoff S, Pietrokovski S
New features of the Blocks Database servers
Nucleic Acids Research 27:226-228, 1999 (Medline ID: 9847186)
Abstract
Blocks are ungapped multiple sequence alignments representing conserved protein regions, and the Blocks Database consists of blocks from documented protein families. World Wide Web (http://www.blocks.fhcrc.org) and e-mail (blocks@blocks.fhcrc.org) servers provide tools for homology searching and for analyzing protein family relationships. New enhancements include a multiple alignment processor that extends the use of these tools to imported multiple alignments of families not present in the database and a PCR primer designer that implements a new strategy for gene isolation.

22 Kowalski JC, Belfort M, Stapleton SA, Holpert M, Dansereau JT, Pietrokovski S, Baxter S, Derbyshire V
Configuration of the catalytic GIY-YIG domain of intron endonuclease I-TevI: coincidence of computational and molecular findings
Nucleic Acids Research 27:2115-2125, 1999 (Medline ID: 10219084)
Abstract
I-TevI is a member of the GIY-YIG family of homing endonucleases. It is folded into two structural and functional domains, an N-terminal catalytic domain and a C-terminal DNA-binding domain, separated by a flexible linker. In this study we have used genetic analyses, computational sequence analysis and NMR spectroscopy to define the configuration of the N-terminal domain and its relationship to the flexible linker. The catalytic domain is an alpha/beta structure contained within the first 92 amino acids of the 245-amino acid protein followed by an unstructured linker. Remarkably, this structured domain corresponds precisely to the GIY-YIG module defined by sequence comparisons of 57 proteins including more than 30 newly reported members of the family. Although much of the unstructured linker is not essential for activity, residues 93-116 are required, raising the possibility that this region may adopt an alternate conformation upon DNA binding. Two invariant residues of the GIY-YIG module, Arg27 and Glu75, located in alpha-helices, have properties of catalytic residues. Furthermore, the GIY-YIG sequence elements for which the module is named form part of a three-stranded antiparallel beta-sheet that is important for I-TevI structure and function.

23 Henikoff S, Henikoff JG, Pietrokovski S
Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations
Bioinformatics 15:471-479, 1999 (Medline ID: 10383472)
Abstract
Motivation: As databanks grow, sequence classification and prediction of function by searching protein family databases becomes increasingly valuable. The original Blocks Database, which contains ungapped multiple alignments for families documented in Prosite, can be searched to classify new sequences. However, Prosite is incomplete, and families from other databases are now available to expand coverage of the Blocks Database.
Results: To take advantage of protein family information present in several existing compilations, we have used five databases to construct Blocks+, a unified database that is built on the PROTOMAT/BLOSUM scoring model and that can be searched using a single algorithm for consistent sequence classification. The LAMA blocks-versus-blocks searching program identifies overlapping protein families, making possible a non-redundant hierarchical compilation. Blocks+ consists of all blocks derived from PROSITE, blocks from Prints not present in PROSITE, blocks from Pfam-A not present in PROSITE or Prints, and so on for ProDom and Domo, for a total of 1995 protein families represented by 8909 blocks, doubling the coverage of the original Blocks Database. A challenge for any procedure aimed at non-redundancy is to retain related but distinct families while discarding those that are duplicates. We illustrate how using multiple compilations can minimize this potential problem by examining the SNF2 family of ATPases, which is detectably similar to distinct families of helicases and ATPases.
Availability: http://blocks.fhcrc.org/
Contact: steveh@fhcrc.org.

24 Kelman Z, Pietrokovski S, Hurwitz J
Isolation and characterization of a split B-type DNA polymerase from the archaeon Methanobacterium thermoautotrophicum deltaH
J Biological Chemistry 274:28751-28761, 1999 (Medline ID: 10497247)
Abstract
We describe here the isolation and characterization of a B-type DNA polymerase (PolB) from the archaeon Methanobacterium thermoautotrophicum deltaH (Mth). A unique feature of PolB, not yet found in other polymerases, is that it is encoded by two different genes. The two genes were cloned and the proteins overexpressed and purified individually and as a complex. Similar to other polymerases from family B, PolB possesses both polymerase and 3'-5' exonuclease activities. We demonstrate that both polypeptides are needed to form an active polymerase. We found that a homologue of replication protein A from Mth (mthRPA) binds to PolB and inhibits its pol activity. The inhibition of DNA synthesis by mthRPA can be relieved in the presence of Mth homologues of replication factor C (mthRFC) and proliferating cell nuclear antigen (mthPCNA). The possible roles of PolB in Mth replication are discussed.

25 Amitai G, Pietrokovski S
Fine-tuning an engineered intein
Nature Biotechnology 17:854-855, 1999 (Medline ID: 10471922)
A Research News and Views review and commentary

26 Henikoff JG, Greene EA, Pietrokovski S, Henikoff S
Increased coverage of protein families with the Blocks Database servers
Nucleic Acids Research 28:228-230, 2000 (Medline ID: 10592233)
Abstract
The Blocks Database WWW (http://blocks.fhcrc.org) and Email (blocks@blocks.fhcrc.org) servers provide tools to search DNA and protein queries against the Blocks+ Database of multiple alignments, which represent conserved protein regions. Blocks+ nearly doubles the number of protein families included in the database by adding families from the Pfam-A, ProDom and Domo databases to those from PROSITE and PRINTS. Other new features include improved Block Searcher statistics, searching with NCBI's IMPALA program and 3D display of blocks on PDB structures.

27Sapir T, Horesh D, Caspi M, Atlas R, Burgess HA, Grayer Wolf S, Francis F, Chelly J, Elbaum M, Pietrokovski S, Reiner O
Doublecortin mutations cluster in evolutionarily conserved functional domains
Human Molecular Genetics 9:703-712, 2000 (Medline ID: 10749977)
Abstract
Mutations in the X-linked gene doublecortin (DCX) result in lissencephaly in males or subcortical laminar heterotopia (`double cortex') in females. Various types of mutation were identified and the sequence differences included nonsense, splice site and missense mutations throughout the gene. Recently, we and others have demonstrated that DCX interacts with and stabilizes microtubules. Here, we performed a detailed sequence analysis of DCX and DCX-like proteins from various organisms and defined an evolutionarily conserved Doublecortin (DC) domain. The domain typically appears in the N-terminus of proteins and consists of two tandemly repeated 80 amino acid regions. In the large majority of patients, missense mutations in DCX fall within the conserved regions. We hypothesized that these repeats may be important for microtubule binding. We expressed DCX or DCLK (KIAA0369) repeats in vitro and in vivo. Our results suggest that the first repeat binds tubulin but not microtubules and enhances microtubule polymerization. To study the functional consequences of DCX mutations, we overexpressed seven of the reported mutations in COS7 cells and examined their effect on the microtubule cytoskeleton. The results demonstrate that some of the mutations disrupt microtubules. The most severe effect was observed with a tyrosine to histidine mutation at amino acid 125 (Y125H). Produced as a recombinant protein, this mutation disrupts microtubules in vitro at high molar concentration. The positions of the different mutations are discussed according to the evolutionarily defined DC-repeat motif. The results from this study emphasize the importance of DCX-microtubule interaction during normal and abnormal brain development.

28Henikoff JG, Pietrokovski S, McCallum CM, Henikoff S
Blocks-based methods for detecting protein homology
Electrophoresis 21:1700-1706, 2000 (Medline ID: 10870957)
Abstract
The most highly conserved regions of proteins can be represented as blocks of aligned sequence segments, typically with multiple blocks for a given protein family. The Blocks Database World Wide Web (http://blocks.fhcrc.org) and e-mail (blocks@blocks.fhcrc.org) servers provide tools to search DNA and protein queries against the Blocks+ Database of multiple alignments. We describe features for detection of distant relationships using blocks. Blocks+ includes protein families from the PROSITE, Prints, Pfam-A, ProDom and Domo databases. Other features include searching Blocks+ with the BLIMPS and NCBI's IMPALA programs, sequence logos, phylogenetic trees, three-dimensional display of blocks on PDB structures, and a polymerase chain reaction (PCR) primer design strategy based on blocks.

29Pietrokovski S, Shilo B-Z
Identification of new signaling components in the Drosophila genome sequence
Functional & Integrative Genomics 1:250-255, 2001 (published online: 14 September 2000) (Pubmed ID: 11793244)
Abstract
The availability of the complete sequence of the Drosophila genome, and the assignment of putative reading frames, provides an opportunity to search for new members in families of proteins generating signaling cascades. The six major pathways that dictate patterning were examined: receptor tyrosine kinases, TGFβ, Wnt, Toll, Hedgehog and Notch. Several new components were identified for the first four pathways, including ligands, receptors, cytoplasmic components and transcription factors. Most notable is the identification of a vascular endothelial growth factor (VEGF) receptor tyrosine kinase, two insulin/IGF I receptors without cytoplasmic protein kinase domains, and a family of proteins similar to Rhomboid (a protein involved in cleavage of TGFα-like ligands). A new TGFβ family ligand, two new Wnts and a Frizzled receptor were also identified. Finally, for the Toll pathway, two new potential Spatzle-like ligands and two new receptors were identified. The number of new components is limited, and in the case of the Hedgehog and Notch pathways no new members were identified. This indicates that for the signaling pathways which determine pattern formation, the exhaustive genetic screens have identified most of the components. Thus, functional redundancy between signaling components belonging to the same family is limited, as mutations in each member usually give rise to a detectable phenotype.

30Kunin V, Chan B, Sitbon E, Lithwick G, Pietrokovski S
Consistency analysis of similarity between multiple alignments - prediction of protein function and fold structure from analysis of local sequence motifs
J Molecular Biology 307:939-949, 2001 (Pubmed ID: 11273712)
Abstract
A new method to analyze the similarity between multiply-aligned protein motifs (blocks) was developed. It identifies sets of consistently aligned blocks. These are found to be protein regions of similar function and structure that appear in different contexts. For example, the Rossmann fold ligand-binding region is found similar to TIM barrel and methylase regions, various protein families are predicted to have a TIM-barrel fold and the structural relation between the ClpP protease and crotonase folds is identified from their sequence. Besides identifying local structure features, sequence similarity across short sequence-regions (less than twenty amino acids) also predicts structure similarity of whole domains (folds) a few hundred amino acids long. Most of these relations could not be identified by other advanced sequence-to-sequence and sequence-to-multiple alignments comparisons. We describe the method (termed cyrca), present examples of our findings and discuss their implications.

31Pietrokovski S
Intein spread and extinction in evolution
Trends In Genetics 17:465-472, 2001 (Pubmed ID: 11485819)
Abstract
Inteins are selfish DNA elements found within coding regions. They are translated with their host protein, but then catalyze their own excision and the formation of a peptide bond between their flanking protein regions. Understanding what drives and selects inteins is relevant for assessing whether they have unidentified biological functions and whether they can invade and become established in new genes and organisms. Inteins are suggested to have been present and more common in the progenitors of eukaryotes and prokaryotes. In these cells inteins had some beneficial function or had evolved from an unknown beneficial protein. Since then this putative benefit has been lost and inteins are gradually becoming extinct. The proteins in which inteins are currently found are proposed to be proteins vital for the survival of the organism, where intein removal is most difficult.

32Adato A, Vreugde S, Joensuu T, Avidan N, Hamalainen R, Belenkiy O, Olender T, Bonne-Tamir B, Ben-Asher E, Espinos C, Mill Lehesjoki A, Flannery JG, Avraham KB, Pietrokovski S, Sankila E, Beckmann JS, Lancet D
USH3A transcripts encode clarin-1, a four-transmembrane-domain protein with a possible role in sensory synapses
European Journal of Human Genetics 10:339-350, 2002 (Pubmed ID: 12080385)
Abstract
Usher syndrome type 3 (USH3) is an autosomal recessive disorder characterized by the association of post-lingual progressive hearing loss, progressive visual loss due to retinitis pigmentosa and variable presence of vestibular dysfunction. Because the previously defined transcripts do not account for all USH3 cases, we performed further analysis and revealed the presence of additional exons embedded in longer human and mouse USH3A transcripts and three novel USH3A mutations. Expression of Ush3a transcripts was localized by whole mount in situ hybridization to cochlear hair cells and spiral ganglion cells. The full length USH3A transcript encodes clarin-1, a four-transmembrane-domain protein, which defines a novel vertebrate-specific family of three paralogues. Limited sequence homology to stargazin, a cerebellar synapse four-transmembrane-domain protein, suggests a role for clarin-1 in hair cell and photoreceptor cell synapses, as well as a common pathophysiological pathway for different Usher syndromes.

33Henikoff JG, Greene EA, Taylor N, Pietrokovski S, Henikoff S
Using the Blocks Database to Recognize Functional Domains
Current Protocols in Bioinformatics UNIT 2.2, 2002
Abstract
Blocks are ungapped multiple alignments of related protein sequence segments that correspond to the most conserved regions of the proteins. The Blocks Database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins. Protocols in this unit describe the analysis of proteins and families using Blocks-based tools, including searching, exploring relationships with trees, making new blocks, and designing PCR primers from blocks for isolating homologous sequences.

34Amitai G, Belenkiy O, Dassa B, Shainskaya A, Pietrokovski S
Distribution and function of new bacterial intein-like protein domains
Molecular Microbiology 47:61-73 2003. (Pubmed ID: 12492854)
Abstract
Hint protein domains appear in inteins and in the C-terminal region of Hedgehog and Hedgehog-like animal developmental proteins. Intein Hint domains are responsible and sufficient for protein-splicing of their host-protein flanks. In Hedgehog proteins the Hint domain autocatalyses its cleavage from the N-terminal domain of the Hedgehog protein by attaching a cholesterol molecule to it. We identified two new types of Hint domains. Both types have active site sequence features of Hint domains but also possess distinguishing sequence features. The new domains appear in more than 50 different proteins from diverse bacteria, including pathogenic species of humans and plants, such as Neisseria meningitidis and Pseudomonas syringae. These new domains are termed bacterial intein-like (BIL) domains. Bacterial intein-like domains are present in variable protein regions and are typically flanked by domains that also appear in secreted proteins such as filamentous haemagglutinin and calcium binding RTX repeats. Phylogenetic and genomic analysis of BIL sequences suggests that they were positively selected for in different lineages. We cloned two BIL domains of different types and showed them to be active. One of the domains efficiently cleaved itself from its C-terminal flank and could also protein-splice its two flanks, in E. coli and in a cell free system. We discuss several possible biological roles for BIL domains including microevolution and post translational modification for generating protein variability.

35Sitbon E, Pietrokovski S
New types of conserved sequence domains in DNA-binding regions of homing endonucleases
Trends in Biochemical Sciences 28:473-477 2003 (Pubmed ID: 13678957)
Abstract
We have identified four new types of short conserved sequence domains in homing endonucleases and related proteins. These domains are modular, appearing in various combinations. One domain includes a motif known by structure as a novel sequence-specific DNA-binding helix. Sequence similarity shows two other domains to be new types of helix-turn-helix DNA-binding domains. We term the new domains nuclease-associated modular DNA-binding domains (NUMODs).

36Caspi J, Amitai G, Belenkiy O, Pietrokovski S
Distribution of split DnaE inteins in cyanobacteria
Molecular Microbiology 50:1569-1577, 2003 (Pubmed ID: 14651639)
Abstract
Inteins are genetic elements found inside the coding regions of different host proteins and are translated in frame with them. The intein encoded protein region is removed by an autocatalytic protein-splicing reaction that ligates the host protein flanks with a peptide bond. This reaction can also occur in trans with the intein and host protein split in two. Following translation of the two genes the two intein parts ligate their flanking protein parts to each other, producing the mature protein. Naturally split inteins are only known in the DNA polymerase III alpha subunit (polC or dnaE gene) of a few cyanobacteria. Analyzing the phylogenetic distribution and probable genetic propagation mode of these split inteins we conclude they are genetically fixed in several large cyanobacterial lineages. To test our hypothesis we sequenced parts of the dnaE genes from five diverse cyanobacteria and found all species to have the same type of split intein. Our results suggest the occurrence of a genetic rearrangement in the ancestor of a large division of cyanobacteria. This event fixed the dnaE gene in a unique two-genes one-protein configuration in the progenitor of many cyanobacteria. Our hypothesis, findings, and cloning procedure we established allow the identification and acquisition of many naturally split inteins. Having a large and diverse repertoire of these unique inteins will enable studies of their distinct activity and enhance their use in biotechnology.

37Amitai G, Dassa B, Pietrokovski S
Protein-splicing of inteins with atypical glutamine and aspartate C-terminal residues
J Biological Chemistry 279:3121-3131 2004 (published online: October 30, 2003) (Pubmed ID: 14593103)
Abstract
Inteins are protein-splicing domains present in many proteins. They self catalyze their excision from the host protein, ligating their former flanks by a peptide bond. The C-terminal residue of inteins is typically an asparagine (Asn). Cyclization of this residue to succinimide causes the final detachment of inteins from their hosts. We studied protein-splicing activity of two inteins with atypical C-terminal residues. One having a C-terminal glutamine (Gln), isolated from Chilo-Iridescent virus (CIV), and another unique intein, first reported here, with a C-terminal aspartate, isolated from Carboxydothermus hydrogenoformans (Chy). Protein-splicing activity was examined in the wild-type inteins and in several mutants with N- and C-terminal amino acid substitutions. We demonstrate that both wild-type inteins can protein-splice, probably by new variations of the typical protein-splicing mechanism. Substituting the atypical C-terminal residue to the typical Asn retained protein-splicing only in the CIV intein. All diverse C-terminal substitutions in the Chy intein (Asp345 to Asn, Gln, Glu, and Ala) abolished protein-splicing and generated N- and C-terminal cleavage. The observed C-terminal cleavage in the Chy intein ending with Ala cannot be explained by cyclization of this residue. We present and discuss several new models for reactions in the protein-splicing pathway.

38Dassa B, Haviv H, Amitai G, Pietrokovski S
Protein splicing and auto-cleavage of bacterial intein-like domains lacking a C'-flanking nucleophilic residue
J Biological Chemistry 279:32001-32007 2004 (published online: May 18, 2004) (Pubmed ID: 15150275)
Abstract
Bacterial intein-like (BIL) domains are newly identified homologs of intein protein-splicing domains. The two known types of BIL domains together with inteins and hedgehog (Hog) auto-processing domains form the HINT superfamily. BIL domains are distinct from inteins and Hogs in sequence, phylogenetic distribution and host protein type, but little is known about their biochemical activity. Here we experimentally study the auto-processing activity of four BIL domains. An A-type BIL domain from Clostridium thermocellum showed both protein-splicing and auto-cleavage activities. The splicing is notable since this domain has a native Ala C'-flanking residue, rather than a nucleophilic residue, which is absolutely necessary for intein protein-splicing. B-type BIL domains from Rhodobacter sphaeroides and Rhodobacter capsulatus cleaved their N' or C' ends. We propose an alternative protein-splicing mechanism for A-type BIL domains. After an initial N-S acyl shift, creating a thioester bond at the domain N' end, the domain's C' end is cleaved by Asn cyclization. Next, the resulting amino end of the C' flank attacks the thioester bond at the domain's N' end. This aminolysis step splices the two flanks of the domain. B-type BIL domains cleavage activity is explained in context of the canonical intein protein-splicing mechanism. Our results suggest that the different HINT domains have related biochemical activities of proteolytic cleavages, ligation and splicing. Yet the predominant reactions diverged in each HINT type, according to their specific biological roles. We suggest that BIL domains cleavage and splicing reactions are mechanisms for post-translationally generating protein variability, particularly in extracellular bacterial proteins.

39Dassa B, Yanai I, Pietrokovski S
New type of poly ubiquitin-like genes with intein-like autoprocessing domains
Trends In Genetics 20:538-542, 2004 (published online: September 15, 2004) (Pubmed ID: 15475112)
Abstract
Genome analysis of ciliates identified a new type of poly ubiquitin-like genes. These contain tandem repeats of ubiquitin-like domains interspersed with auto-catalytic intein-like domains. Inteins and related protein domains post-translationally process their own precursor proteins by protein-splicing, cleavage and ligation reactions. The structure of these poly ubiquitin-like genes suggests their precursor products undergo maturation and conjugation in cis. This novel gene structure also illustrates the genetic modularity of ubiquitin-like and intein-like domains. Our suggested auto-processing of ubiquitin-like polyproteins is a new potential general way for controlling protein functions.

40Amitai G, Shemesh A, Sitbon E, Shklyar M, Netanely D, Venger I, Pietrokovski S
Network analysis of protein structures identifies functional residues
J Molecular Biology 344:1136-1145 2004 (published online: November 6, 2004) (Pubmed ID: 15544817)
Abstract
Identifying active site residues strictly from protein three-dimensional structure is a difficult task, especially for proteins that have few or no homologues. We transformed protein structures into residue interaction graphs (RIGs), where amino acid residues are graph nodes and their interactions with each other are the graph edges. We found that active site, ligand-binding and evolutionarily conserved residues typically have high closeness values. Residues with high closeness values interact directly or by a few intermediates with all other residues of the protein. Combining closeness and surface accessibility identified active site residues in 70% of 178 representative structures. Detailed structural analysis of specific enzymes also located other types of functional residues. These include the substrate binding sites of acetylcholinesterases and subtilisin, and the regions whose structural changes activate MAP kinase and glycogen phosphorylase. Our approach uses single protein structures, and does not rely on sequence conservation, comparison to other similar structures or any prior knowledge. Residue closeness is distinct from various sequence and structure measures and can thus complement them in identifying key protein residues. Closeness integrates the effect of the entire protein on single residues. Such natural structural design may be evolutionarily maintained to preserve interaction redundancy and contribute to optimal setting of functional sites.
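
The closeness idea in this abstract can be illustrated with a minimal sketch. It assumes the standard graph-theory definition of closeness (number of other nodes divided by the sum of shortest-path distances to them) on an unweighted, connected graph; the abstract does not give the exact formula or the contact criteria used to build a RIG, and the five-residue graph below is a hypothetical toy, not data from the paper.

```python
from collections import deque


def closeness(graph):
    """Closeness centrality of every node in an unweighted, connected graph.

    graph maps each node to a list of its neighbours (adjacency list).
    """
    result = {}
    for source in graph:
        # Breadth-first search gives shortest-path lengths from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical toy "residue interaction graph": nodes are residues,
# edges are interactions. Real RIGs have hundreds of nodes.
toy_rig = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

scores = closeness(toy_rig)
for residue in sorted(scores, key=scores.get, reverse=True):
    print(residue, round(scores[residue], 3))
```

In this toy graph the two central nodes B and D reach every other node in few steps and so score highest, which is the property the abstract associates with functional residues.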

41 Dassa B, Pietrokovski S
Origin and Evolution of Inteins and other Hint domains
In: Homing Endonucleases and Inteins, Series:Nucleic Acids and Molecular Biology, Vol. 16, Edited by Belfort M, Derbyshire V, Stoddard B, Wood D. Springer-Verlag ISBN: 3-540-25106-5 2005
Abstract
Intein protein-splicing domains are part of the Hint superfamily. This superfamily includes three other characterized families: Hog-Hint and two types of Bacterial intein-like (BIL) domains. Hint domains share the same structure fold and common sequence features, and have similar biochemical activities. They post-translationally auto-process the proteins in which they are present by protein-splicing, self-cleavage or ligation activities. Yet, each Hint family apparently has its own distinct biological role. We discuss the evolution of the different Hint families, the origin of primordial Hint domains themselves, and their possible activities and biological functions.

42Grinberg M, Schwarz M, Zaltsman Y, Eini T, Pietrokovski S, Gross A
Mitochondrial carrier homolog 2 is a target of tBID in cells signaled to die by TNFα
Molecular Cell Biology 25:4579-4590 2005 (Pubmed ID: 15899861)
Abstract
BID, a proapoptotic BCL-2 family member, plays an essential role in the tumor necrosis factor alpha (TNFα)/Fas death receptor pathway in vivo. Activation of the TNF-R1 receptor results in the cleavage of BID into truncated BID (tBID), which translocates to the mitochondria and induces the activation of BAX or BAK. In TNFα-activated FL5.12 cells, tBID becomes part of a 45-kDa cross-linkable mitochondrial complex. Here we describe the biochemical purification of this complex and the identification of mitochondrial carrier homolog 2 (Mtch2) as part of this complex. Mtch2 is a conserved protein that is similar to members of the mitochondrial carrier protein family. Our studies with mouse liver mitochondria indicate that Mtch2 is an integral membrane protein exposed on the surface of mitochondria. Using blue-native gel electrophoresis we revealed that in viable FL5.12 cells Mtch2 resides in a protein complex of ca. 18 kDa and that the addition of TNFα to these cells leads to the recruitment of tBID and BAX to this complex. Importantly, this recruitment was partially inhibited in FL5.12 cells stably expressing BCL-XL. These results implicate Mtch2 as a mitochondrial target of tBID and raise the possibility that the Mtch2-resident complex participates in the mitochondrial apoptotic program.

43Frenkel-Morgenstern M, Voet H, Pietrokovski S
Enhanced statistics for local alignment of multiple alignments improves prediction of protein function and structure
Bioinformatics 21:2950-2956 2005 (published online: May 3, 2005) (Pubmed ID: 15870168)
Abstract
Motivation: Improved comparisons of multiple sequence alignments (profiles) with other profiles can identify subtle relationships between protein families and motifs significantly beyond the resolution of sequence-based comparisons.
Results: The local alignment of multiple alignments (LAMA) method was modified to estimate alignment score significance by applying a new measure based on Fisher's combining method. To verify the new procedure, we used known protein structures, sequence annotations and cyclical relations consistency analysis (CYRCA) sets of consistently aligned blocks. Using the new significance measure improved the sensitivity of LAMA without altering its selectivity. The program performed better than other profile-to-profile methods (COMPASS and Prof_sim) and a sequence-to-profile method (PSI-BLAST). The testing was large scale and used several parameters, including pseudo-counts profile calculations and local ungapped blocks or more extended gapped profiles. This comparison provides guidelines to the relative advantages of each method for different cases. We demonstrate and discuss the unique advantages of using block multiple alignments of protein motifs.
Availability: http://bioinformatics.weizmann.ac.il/blocks/LAMA
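
Fisher's combining method, named in the Results above, can be sketched as follows. This is the textbook form of the method for independent p-values: the statistic X = -2 Σ ln(p_i) follows a chi-square distribution with 2k degrees of freedom, whose upper tail has a closed form for even degrees of freedom. How LAMA derives the individual p-values and which ones it combines is not stated in the abstract, so the function name and inputs here are illustrative assumptions only.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    X = -2 * sum(ln p_i) is chi-square distributed with 2k degrees of
    freedom under the null. For even degrees of freedom the upper tail is
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not pvalues or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in the interval (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0   # (X/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total


# Two modest p-values combine into stronger joint evidence.
combined = fisher_combined_pvalue([0.05, 0.05])
print(combined)
```

A single p-value is returned unchanged, and for two p-values the result reduces to p1·p2·(1 − ln(p1·p2)), which is a convenient sanity check.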

44Nagasaki K, Shirai Y, Tomaru Y, Nishida K, Pietrokovski S
Algal viruses with distinct intraspecies host specificities include identical intein elements
Applied and Environmental Microbiology 71:3599-3607 2005 (Pubmed ID: 16000767)
Abstract
HaV is a large double-stranded DNA virus infecting the single-cell bloom-forming raphidophyte (golden brown alga) Heterosigma akashiwo. Molecular phylogenetic sequence analysis of HaV DNA polymerase showed that it forms a sister group with Phycodnaviridae algal viruses. All ten examined HaV strains, with distinct intraspecies host specificities, included in their DNA polymerase genes an intein (protein intron). The 232 amino acids inteins differed from each other by no more than a single nucleotide change. All inteins were present in the same conserved position, coding for an active-site motif, which also includes inteins in Mimivirus (a very large double-stranded DNA virus of amoebae), and several archaeal DNA polymerases. The HaV intein is closely related to the Mimivirus intein and both are apparently monophyletic to the archaeal inteins. These observations suggest horizontal transfers of inteins between viruses of different families and between archaea and viruses, and that viruses might be reservoirs and intermediates in intein horizontal transmissions. The homing endonuclease domain of the HaV intein alleles is mostly deleted. The mechanism keeping their sequences basically identical in HaV strains specific for different hosts is yet unknown. One possibility is that rapid and local changes in the HaV genome change its host specificity. This is the first report of inteins found in viruses infecting eukaryotic algae.

45Frenkel-Morgenstern M, Singer A, Bronfeld H, Pietrokovski S
One-Block CYRCA: automated procedure for identifying multiple-block alignments from single block queries
Nucleic Acids Research 33(Web Server issue):W281-W283 2005 (Pubmed ID: 15980470)
Abstract
One-Block CYRCA is an automated procedure for identifying multiple-block alignments from single block queries (http://bioinfo.weizmann.ac.il/blocks/OneCYRCA). It is based on the LAMA and CYRCA block-to-block alignment methods. The procedure identifies whether the query blocks can form new multiple-block alignments (block sets) with blocks from a database, or join pre-existing database block sets. Using pre-computed LAMA block alignments and CYRCA sets from the Blocks database reduces the computation time. LAMA and CYRCA are highly sensitive and selective methods that can augment many other sequence analysis approaches.

46Slavikova S, Shy G, Yao Y, Glozman R, Levanony H, Pietrokovski S, Elazar Z, Galili G
The autophagy-associated Atg8 gene family operates both under favourable growth conditions and under starvation stresses in Arabidopsis plants
J Experimental Botany 56:2839-2849 2005 (Pubmed ID: 16157655)
Abstract
Arabidopsis plants possess a family of nine AtAtg8 gene homologues of the yeast autophagy-associated Apg8/Aut7 gene. To gain insight into how these genes function in plants, first, the expression patterns of five AtAtg8 homologues were analysed in young Arabidopsis plants grown under favourable growth conditions or following exposure to prolonged darkness or sugar starvation. Promoters, plus the entire coding regions (exons and introns) of the AtAtg8 genes, were fused to the β-glucuronidase reporter gene and transformed into Arabidopsis plants. In all plants, grown under favourable growth conditions, β-glucuronidase staining was much more significant in roots than in shoots. Different genes showed distinct spatial and temporal expression patterns in roots. In some transgenic plants, β-glucuronidase staining in leaves was induced by prolonged darkness or sugar starvation. Next, Arabidopsis plants were transformed with chimeric gene-encoding Atg8f protein fused to N-terminal green fluorescent protein and C-terminal haemagglutinin epitope tags. Analysis of these plants showed that, under favourable growth conditions, the Atg8f protein is efficiently processed and is localized to autophagosome-resembling structures, both in the cytosol and in the central vacuole, in a similar manner to its processing and localization under starvation stresses. Moreover, treatment with a cocktail of proteasome inhibitors did not prevent the turnover of this protein, implying that its turnover takes place in the vacuoles, as occurs in yeasts. The results suggest that, in plants, the cellular processes involving the Atg8 genes function efficiently in young, non-senescing tissues, both under favourable growth conditions and under starvation stresses.

47Bakhrat A, Baranes K, Krichevsky O, Rom I, Schlenstedt G, Pietrokovski S, Raveh D
Nuclear import of Ho endonuclease utilizes two NLS signals and four importins of the ribosomal import system
J Biological Chemistry 281:12218-12226 2006 (published online: February 28, 2006) (Pubmed ID: 16507575)
Abstract
Activity of Ho, the yeast mating switch endonuclease, is restricted to a narrow time window of the cell cycle. Ho is unstable and despite being a nuclear protein is exported to the cytoplasm for proteasomal degradation. We report here the molecular basis for the highly efficient nuclear import of Ho and the relation between its short half-life and passage through the nucleus. The Ho nuclear import machinery is functionally redundant, being based on two bipartite nuclear localization signals (NLSs), recognized by four importins of the ribosomal import system. Ho degradation is regulated by the DNA damage response and Ho retained in the cytoplasm is stabilized, implying that Ho acquires its crucial degradation signals in the nucleus. Ho arose by domestication of a fungal VMA1 intein. A comparison of the primary sequences of Ho and fungal VMA1 inteins shows that the Ho NLSs are highly conserved in all Ho proteins, but are absent from VMA1 inteins. Thus adoption of a highly efficient import strategy occurred very early in the evolution of Ho. This may have been a crucial factor in establishment of homothallism in yeast, and a key event in the rise of the Saccharomyces sensu stricto.

48 Citri A, Harari D, Shochat G, Ramakrishnan P, Gan J, Eisenstein M, Kimchi A, Wallach D, Pietrokovski S, Yarden Y
Hsp90 recognizes a common surface on client kinases
J Biological Chemistry 281:14361-14369 2006 (published online: March 21, 2006) (Pubmed ID: 16551624)
Abstract
Hsp90 is a highly abundant chaperone, whose clientele includes hundreds of cellular proteins, many of which are central players in key signal transduction pathways, and the majority of which are protein kinases. In light of the variety of Hsp90 clientele, the mechanism of selectivity of the chaperone towards its client proteins is a major open question. Focusing on human kinases, we demonstrate that the chaperone recognizes a common surface in the amino-terminal lobe of kinases from diverse families, including two newly identified clients, NIK and DAPK, and the oncoprotein HER2/ErbB-2. Surface electrostatics determine the interaction with the Hsp90 chaperone complex, such that introduction of a negative charge within this region disrupts recognition. Compiling information on the Hsp90 dependence of 105 protein kinases, including 16 kinases whose relationship to Hsp90 is first examined in this study, reveals that surface features, rather than a contiguous amino-acid sequence, define the capacity of the Hsp90 chaperone machine to recognize client kinases. Analyzing Hsp90 regulation of two major signaling cascades, the MAP-kinase and PI-3 kinase, leads us to propose that the selectivity of the chaperone to specific kinases is functional, namely: Hsp90 controls kinases that function as hubs, integrating multiple inputs. These lessons bear significance to pharmacological attempts to target the chaperone in human pathologies, such as cancer.

49 Eyal E, Frenkel-Morgenstern M, Sobolev V, Pietrokovski S
A pair-to-pair amino acids substitution matrix and its applications for protein structure prediction
Proteins 67:142-153 2007 (accepted August 2006, published online: January 22, 2007) (Pubmed ID: 17243158)
Abstract
We present a new structurally derived pair-to-pair substitution matrix (P2PMAT). This matrix is constructed from a very large amount of integrated high quality multiple sequence alignments (Blocks) and protein structures. It evaluates the likelihoods of all 160,000 pair-to-pair substitutions. P2PMAT matrix implicitly accounts for evolutionary conservation, correlated mutations, and residue-residue contact potentials. The usefulness of the matrix for structural predictions is shown in this article. Predicting protein residue-residue contacts from sequence information alone, by our method (P2PConPred) is particularly accurate in the protein cores, where it performs better than other basic contact prediction methods (increasing accuracy by 25-60%). The method mean accuracy for protein cores is 24% for 59 diverse families and 34% for a subset of proteins shorter than 100 residues. This is above the level that was recently shown to be sufficient to significantly improve ab initio protein structure prediction. We also demonstrate the ability of our approach to identify native structures within large sets of (300-2000) protein decoys. On the basis of evolutionary information alone our method ranks the native structure in the top 0.3% of the decoys in 4/10 of the sets, and in 8/10 of sets the native structure is ranked in the top 10% of the decoys. The method can, thus, be used to assist filtering wrong models, complementing traditional scoring functions.
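
The figure of 160,000 pair-to-pair substitutions is consistent with counting ordered pairs of the 20 standard amino acids: 20 × 20 = 400 pairs, and 400 × 400 = 160,000 pair-to-pair entries. The short sketch below only reproduces that arithmetic; treating pairs as ordered is an assumption made here to match the stated total, and the abstract does not describe how the matrix is actually indexed or stored.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Every ordered residue pair, e.g. "AC" and "CA" are counted separately
# (assumption: ordered pairs, chosen because it reproduces the stated total).
pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

# A pair-to-pair matrix has one entry per (original pair, substituted pair).
substitutions = len(pairs) ** 2

print(len(pairs), substitutions)
```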

50 Dassa B, Amitai G, Caspi J, Schueler-Furman O, Pietrokovski S
Trans protein splicing of cyanobacterial split inteins in endogenous and exogenous combinations
Biochemistry 46:322-330 2007 (published online: January 2, 2007) (Pubmed ID: 17198403)
Abstract
Inteins are autocatalytic protein domains that post-translationally excise from protein precursors and ligate their flanking regions with a peptide bond, in a process called protein splicing. Intein-containing DNA polymerases of cyanobacteria and nanoarchaea are naturally split into two separate genes at their intein domain. Such naturally occurring split inteins rapidly self-associate and reconstitute protein-splicing activity in trans. Here, we analyze the in vitro protein-splicing activity of three naturally split inteins from diverse cyanobacteria: Oscillatoria limnetica, Thermosynechococcus vulcanus, and Nostoc sp. PCC7120. N- and C-terminal halves of these split inteins were mixed in nine combinations, resulting in three endogenous (wild-type) and six exogenous combinations. Protein splicing was detected in all split-intein combinations, despite a 30-50% sequence variation between the homologous proteins. Splicing activity proceeded under a variety of conditions, including the presence of denaturants and reductants and high temperature, ionic strength, and viscosity. Still, in a high concentration of salt (2 M) or urea (6 M), specific combinations spliced significantly better than others. Additionally, copper ions were found to inhibit trans splicing in a reversible double-lock reaction. Our comparative analysis of naturally split inteins in endogenous and exogenous combinations demonstrates the modularity of trans protein-splicing elements and their robust activity. It suggests tight interactions between split-intein halves and conditions for modifying the specificity of intein parts. These results promote the biotechnological use of split inteins for controlled assembly of protein fragments either in vivo or in vitro and under moderate or extreme conditions.

51 Sitbon E, Pietrokovski S
Occurrence of protein structure elements in conserved sequence regions
BMC Structural Biology 7:3 2007 (published online: January 9, 2007) (Pubmed ID: 17210087)
Abstract
BACKGROUND: Conserved protein sequence regions are extremely useful for identifying and studying functionally and structurally important regions. By means of an integrated analysis of large-scale protein structure and sequence data, structural features of conserved protein sequence regions were identified. RESULTS: Helices and turns were found to be underrepresented in conserved regions, while strands were found to be overrepresented. Similar numbers of loops were found in conserved and random regions. CONCLUSION: These results can be understood in light of the structural constraints on different secondary structure elements, and their role in protein structural stabilization and topology. Strands can tolerate fewer sequence changes and nonetheless keep their specific shape and function. They thus tend to be more conserved than helices, which can keep their shape and function with more changes. Loop behavior can be explained by the presence of both constrained and freely changing loops in proteins. Our detailed statistical analysis of diverse proteins links protein evolution to the biophysics of protein thermodynamic stability and folding. The basic structural features of conserved sequence regions are also important determinants of protein structure motifs and their function.

52 Frenkel-Morgenstern M, Magid R, Eyal E, Pietrokovski S
Refining intra-protein contact prediction by graph analysis
BMC Bioinformatics 8:S6 2007 (published online: May 24, 2007) (Pubmed ID: 17570865)
Abstract
BACKGROUND: Accurate prediction of intra-protein residue contacts from sequence information will allow the prediction of protein structures. Basic predictions of such specific contacts can be further refined by jointly analyzing predicted contacts, and by adding information on the relative positions of contacts in the protein primary sequence. RESULTS: We introduce a method for graph analysis refinement of intra-protein contacts, termed GARP. Our previously presented intra-contact prediction method by means of pair-to-pair substitution matrix (P2PConPred) was used to test the GARP method. In our approach, the top contact predictions obtained by a basic prediction method were used as edges to create a weighted graph. The edges were scored by a mutual clustering coefficient that identifies highly connected graph regions, and by the density of edges between the sequence regions of the edge nodes. A test set of 57 proteins with known structures was used to determine contacts. GARP improves the accuracy of the P2PConPred basic prediction method in whole proteins from 12% to 18%. CONCLUSION: Using a simple approach we increased the contact prediction accuracy of a basic method by 1.5 times. Our graph approach is simple to implement, can be used with various basic prediction methods, and can provide input for further downstream analyses.
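The edge-scoring idea in this abstract can be sketched in a few lines. The abstract does not state which mutual clustering coefficient GARP uses, so the Jaccard-style variant below is an assumption, and the function name `jaccard_edge_scores` and the toy contact list are hypothetical. It is not the GARP implementation, and it omits the second score based on edge density between sequence regions.

```python
from collections import defaultdict

def jaccard_edge_scores(edges):
    """Score each edge by the overlap of its endpoints' neighborhoods.

    Assumed Jaccard-style mutual clustering coefficient:
    |N(u) & N(v)| / |N(u) | N(v)|, excluding u and v themselves.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    scores = {}
    for u, v in edges:
        nu = neighbors[u] - {v}
        nv = neighbors[v] - {u}
        union = nu | nv
        scores[(u, v)] = len(nu & nv) / len(union) if union else 0.0
    return scores

# Hypothetical top predicted contacts between residue positions.
predicted_contacts = [(1, 2), (1, 3), (2, 3), (3, 4)]
scores = jaccard_edge_scores(predicted_contacts)
for edge, score in sorted(scores.items()):
    print(edge, score)
```

Edges inside the densely connected triangle (1, 2, 3) score higher than the isolated edge (3, 4), which is the kind of signal a refinement step could use to re-rank basic predictions.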

53 Eyal E, Pietrokovski S, Bahar I
Rapid assessment of correlated amino acids from Pair-to-Pair (P2P) substitution matrices
Bioinformatics 23:1837-1839 2007 (published online: May 12, 2007) (Pubmed ID: 17496318)
Abstract
Identification of correlated amino acids in proteins has been a topic of broad interest in view of its functional implications and importance in protein design. A new set of pair-to-pair (P2P) substitution matrices for amino acids was recently introduced as a useful tool for inferring information on such correlated sites. We present a web site developed for automated application of these matrices to analysis of query sequences. The site offers options for graphical analysis of correlations, as well as visualization of correlated amino acids on representative, structurally characterized, members of the examined family of sequences.

Availability: http://www.ccbb.pitt.edu/p2p


54 Shoval Y, Pietrokovski S, Kimchi A
ZIPk: a unique case of murine-specific divergence of a conserved vertebrate gene
PLoS Genetics 3:e180 doi:10.1371/journal.pgen.0030180 2007 (published online: September 7, 2007) (Pubmed ID: 17953487)
Abstract
Zipper interacting protein kinase (ZIPK, also known as death-associated protein kinase 3 [DAPK3]) is a Ser/Thr kinase that functions in programmed cell death. Since its identification eight years ago, contradictory findings regarding its intracellular localization and molecular mode of action have been reported, which may be attributed to unpredicted differences among the human and rodent orthologs. By aligning the sequences of all available ZIPK orthologs, from fish to human, we discovered that rat and mouse sequences are more diverged from the human ortholog relative to other, more distant, vertebrates. To test experimentally the outcome of this sequence divergence, we compared rat ZIPK to human ZIPK in the same cellular settings. We found that while ectopically expressed human ZIPK localized to the cytoplasm and induced membrane blebbing, rat ZIPK localized exclusively within nuclei, mainly to promyelocytic leukemia oncogenic bodies, and induced significantly lower levels of membrane blebbing. Among the unique murine (rat and mouse) sequence features, we found that a highly conserved phosphorylation site, previously shown to have an effect on the cellular localization of human ZIPK, is absent in murines but not in earlier diverging organisms. Recreating this phosphorylation site in rat ZIPK led to a significant reduction in its promyelocytic leukemia oncogenic body localization, yet did not confer full cytoplasmic localization. Additionally, we found that while rat ZIPK interacts with PAR-4 (also known as PAWR) very efficiently, human ZIPK fails to do so. This interaction has clear functional implications, as coexpression of PAR-4 with rat ZIPK caused nuclear to cytoplasm translocation and induced strong membrane blebbing, thus providing the murine protein a possible adaptive mechanism to compensate for its sequence divergence. 
We have also cloned zebrafish ZIPK and found that, like the human and unlike the murine orthologs, it localizes to the cytoplasm, and fails to bind the highly conserved PAR-4 protein. This further supports the hypothesis that murine ZIPK underwent specific divergence from a conserved consensus. In conclusion, we present a case of species-specific divergence occurring in a specific branch of the evolutionary tree, accompanied by the acquisition of a unique protein-protein interaction that enables conservation of cellular function.

55 Ilouz R, Pietrokovski S, Eisenstein M, Eldar-Finkelman H
New insights into the autoinhibition mechanism of glycogen synthase kinase-3β
J Molecular Biology 383:999-1007 (2008) (published online: 9 September, 2008) (Pubmed ID: 18793648)
Abstract
It has been suggested that phosphorylation at serine 9 near the N-terminus of glycogen synthase kinase-3β (GSK-3β) mimics the prephosphorylation of its substrate and, therefore, the N-terminus functions as a pseudosubstrate. The molecular basis for the pseudosubstrate's binding to the catalytic core and autoinhibition has not been fully defined. Here, we combined biochemical and computational analyses to identify the potential residues within the N-terminus and the catalytic core engaged in autoinhibition of GSK-3β. Bioinformatic analysis found Arg4, Arg6, and Ser9 in the pseudosubstrate sequence to be extremely conserved through evolution. Mutations at Arg4 and Arg6 to alanine enhanced GSK-3β kinase activity and impaired its ability to autophosphorylate at Ser9. In addition, and unlike wild-type GSK-3β, these mutants were unable to undergo autoinhibition by phosphorylated Ser9. We further show that Gln89 and Asn95, located within the catalytic core, interact with the pseudosubstrate. Mutation at these sites prevented inhibition by phosphorylated Ser9. Furthermore, the respective mutants were not inhibited by a phosphorylated pseudosubstrate peptide inhibitor. Finally, computational docking of the pseudosubstrate into the catalytic active site of the kinase suggested specific interactions between Arg6 and Asn95 and of Arg4 to Asp181 (apart from the interaction of phosphorylated serine 9 with the "phosphate binding pocket"). Altogether, our study supports a model of GSK-3-pseudosubstrate autoregulation that involves phosphorylated Ser9, Arg4, and Arg6 within the N-terminus and identified the specific contact sites within the catalytic core.

56 Dori-Bachash M, Dassa B, Pietrokovski S, Jurkevitch E
Proteome-based comparative analyses of growth stages reveal new cell-cycle dependent functions in the predatory bacterium Bdellovibrio bacteriovorus
Applied and Environmental Microbiology 74:7152-7162 (2008) (published online: 3 October, 2008) (Pubmed ID: 18836011)
Abstract
Bdellovibrio and like organisms are obligate predators of bacteria, that are ubiquitously found in the environment. Most exhibit a peculiar dimorphic life-cycle during which free swimming attack phase (AP) cells search and invade bacterial prey cells. The invader develops in the prey as a filamentous polynucleoid-containing cell that finally splits into progeny cells. Therapeutic and biocontrol applications of Bdellovibrio in human and animal, and plant health, respectively, have been proposed but more knowledge on this peculiar cell cycle is needed to develop such applications. A proteomic approach was applied to study cell cycle dependent expression of the Bdellovibrio bacteriovorus' proteome in synchronous cultures of a facultative host-independent (HI) strain able to grow in the absence of prey. Two-dimensional gel electrophoresis, mass spectrometry and temporal expression of selected genes in predicted operons were analyzed. In total, about 21% of the in-silico predicted proteome was covered. One hundred and ninety six proteins were identified, including 63 hitherto unknown proteins and 140 life stage-dependent spots. Of those, 47 were differentially expressed, including chemotaxis, attachment, growth and replication-related, cell wall and regulatory proteins. Novel cell cycle-dependent adhesion, gliding, mechanosensing, signaling and hydrolytic functions were assigned. The HI model was further studied by comparing HI and wild-type AP-cells, revealing that proteins involved in DNA replication and signaling were deregulated in the former. A complementary analysis of the secreted proteome identified 59 polypeptides, including cell contact proteins and hydrolytic enzymes specific to predatory bacteria.

57 Dori-Bachash M, Dassa B, Peleg O, Pineiro SA, Jurkevitch E, Pietrokovski S
Bacterial intein-like domains of predatory bacteria: a new domain type characterized in Bdellovibrio bacteriovorus
Functional & Integrative Genomics 9:153-166 (2009) (published online: 20 January, 2009) (Pubmed ID: 19153786)
Abstract
We report a new family of bacterial intein-like domains (BILs) identified in ten proteins of four diverse predatory bacteria. BILs belong to the HINT (Hedgehog/Intein) superfamily of domains that post-translationally self-process their protein molecules by protein splicing and self-cleavage. The new, C-type, BILs appear with other domains, including putative predator-specific domain 1 (PPS-1), a new domain typically appearing immediately upstream of C-type BILs. The Bd2400 protein of the obligate predator Bdellovibrio bacteriovorus includes a C-type BIL and a PPS-1 domains at its C-terminal part, and a signal peptide and two polycystic kidney disease domains at its N-terminal part. We demonstrate the in vivo transcription, translation, secretion, and processing of the B. bacteriovorus protein, and the in vitro autocatalytic N-terminal cleavage activity of its C-type BIL. Interestingly, whereas the Bd2400 gene is constitutively expressed, its protein product is differentially processed throughout the dimorphic life cycle of the B. bacteriovorus predator. The modular structure of the protein, its localization, and complex processing suggest that it may be involved in the interaction between the predator and its prey.

58 Dassa B, London N, Stoddard BL, Schueler-Furman O, Pietrokovski S
Fractured genes: a novel genomic arrangement involving new split inteins and a new homing endonuclease family
Nucleic Acids Research 37:2560-2573 (2009) (published online: 5 March, 2009) (Pubmed ID: 19264795)
Abstract
Inteins are genetic elements, inserted in-frame into protein-coding genes, whose products catalyze their removal from the protein precursor via a protein-splicing reaction. Intein domains can be split into two fragments and still ligate their flanks by a trans-protein-splicing reaction. A bioinformatic analysis of environmental metagenomic data revealed 26 different loci with a novel genomic arrangement. In each locus, a conserved enzyme coding region is broken in two by a split intein, with a free-standing endonuclease gene inserted in between. Eight types of DNA synthesis and repair enzymes have this 'fractured' organization. The new types of naturally split-inteins were analyzed in comparison to known split-inteins. Some loci include apparent gene control elements brought in with the endonuclease gene. A newly predicted homing endonuclease family, related to very-short patch repair (Vsr) endonucleases, was found in half of the loci. These putative homing endonucleases also appear in group-I introns, and as stand-alone inserts in the absence of surrounding intervening sequences. The new fractured genes organization appears to be present mainly in phage, shows how endonucleases can integrate into inteins, and may represent a missing link in the evolution of gene breaking in general, and in the creation of split-inteins in particular.

59 Eldar-Finkelman H, Licht-Murava A, Pietrokovski S, Eisenstein M
Substrate Competitive GSK-3 Inhibitors: Strategy and Implications
Biochim Biophys Acta 1804:598-603 (2010) (published online: 18 September, 2009) (Pubmed ID: 19770076)
Abstract
Glycogen synthase kinase-3 (GSK-3) is a highly conserved protein serine/threonine kinase ubiquitously distributed in eukaryotes as a constitutively active enzyme. Abnormally high GSK-3 activity has been implicated in several pathological disorders, including diabetes and neuron degenerative and affective disorders. This led to the hypothesis that inhibition of GSK-3 may have therapeutic benefit. Most GSK-3 inhibitors developed so far compete with ATP and often show limited specificity. Our goal is to develop inhibitors that compete with GSK-3 substrates, as this type of inhibitor is more specific and may be useful for clinical applications. We have employed computational, biochemical, and molecular analyses to gain in-depth understanding of GSK-3's substrate recognition. Here we argue that GSK-3 is a promising drug discovery target and describe the strategy and practice for developing specific substrate-competitive inhibitors of GSK-3.

60 Salzberg Y, Eldar T, Karminsky O, Bar-Sheshet Itach S, Pietrokovski S, Don J
Meig1 Deficiency Causes a Severe Defect in Mouse Spermatogenesis
Developmental Biology 338:158-167 (2010) (published online 22 Nov. 2009) (Pubmed ID: 20004656)
Abstract
Meig1 is a mouse gene, abundantly expressed in the testis. It encodes two alternative transcripts that are expressed differentially in the somatic and germinal compartments of the testis. These transcripts share the same coding region but differ in their 5’ un-translated regions, due to alternative promoters. Here we show that MEIG1 is a highly conserved short metazoan protein with a conserved core of 81 residues. It is present from chordates to radial symmetry animals, with an intriguing absence in insects and nematodes. It is also present in two earlier diverging protist lineages. To elucidate the role of MEIG1 during gamete production we established a knockout mouse line by eliminating the common coding region. Our results identified Meig1 as a critical spermatogenic gene, whose absence results in complete male infertility. Differentiation of spermatocytes to haploid spermatids seemed complete, although with significantly increased apoptosis of germ cells, but further differentiation into later stages was generally blocked. The caudal epididymis was apparently missing spermatozoa, and the very few that were obtained were immotile and exhibited a wide range of severe morphological abnormalities. Meig1 is, therefore, a highly conserved gene which is indispensable for sperm production and hence for male fertility in mice.

61 Tori K, Dassa B, Johnson MA, Southworth MW, Brace LE, Ishino Y, Pietrokovski S, Perler FB
Splicing of the Mycobacteriophage Bethlehem DnaB intein: identification of a new mechanistic class of inteins that contain an obligate Block F nucleophile
J Biological Chemistry 285:2515-2526 (2010) (published online 23 Nov. 2009) (Pubmed ID: 19940146)
Abstract
Inteins are single turnover enzymes that splice out of protein precursors during maturation of the host protein (extein). The Cys or Ser at the N-terminus of most inteins initiates a four-step protein splicing reaction by forming a (thio)ester bond at the N-terminal splice junction. Several recently identified inteins cannot perform this acyl rearrangement because they do not begin with Cys, Thr or Ser. This study analyzes one of these, the Mycobacteriophage Bethlehem DnaB intein, which we describe here as the prototype for a new class of inteins based on sequence comparisons, reactivity and mechanism. These Class 3 inteins are characterized by a non-nucleophilic N-terminal residue that co-varies with a non-contiguous Trp, Cys, Thr triplet (WCT) and a Thr or Ser as the first C-extein residue. Several mechanistic differences were observed compared to standard inteins or previously studied atypical KlbA Ala1 inteins: (a) cleavage at the N-terminal splice junction in the absence of all standard N- and C-terminal splice junction nucleophiles, (b) activation of the N-terminal splice junction by a variant Block B motif that includes the WCT triplet Trp, (c) decay of the branched intermediate by thiols or Cys despite an ester linkage at the C-extein branch point, and (d) an absolute requirement for the WCT triplet Block F Cys. Based on biochemical data and confirmed by molecular modeling, we propose roles for these newly identified conserved residues, a novel protein splicing mechanism that includes a second branched intermediate, and an intein classification with 3 mechanistic categories.

62 Harel A, Dalah I, Pietrokovski S, Safran M, Lancet D
Omics Data Management and annotation
In: Bioinformatics for Omics Data, Series: Methods in Molecular Biology, Volume 719, edited by Mayer B. Humana Press, New York, NY (2011) ISBN: 978-1-61779-026-3 (Pubmed ID: 21370079)
Abstract
Technological Omics breakthroughs, including next generation sequencing, bring avalanches of data which need to undergo effective data management to ensure integrity, security, and maximal knowledge-gleaning. Data management system requirements include flexible input formats, diverse data entry mechanisms and views, user friendliness, attention to standards, hardware and software platform definition, as well as robustness. Relevant solutions elaborated by the scientific community include Laboratory Information Management Systems (LIMS) and standardization protocols facilitating data sharing and managing. In project planning, special consideration has to be made when choosing relevant Omics annotation sources, since many of them overlap and require sophisticated integration heuristics. The data modeling step defines and categorizes the data into objects (e.g. genes, articles, disorders) and creates an application flow. A data storage/warehouse mechanism must be selected, such as file-based systems and relational databases, the latter typically used for larger projects. Omics project life cycle considerations must include the definition and deployment of new versions, incorporating either full or partial updates. Finally, quality assurance procedures must validate data and feature integrity, as well as system performance expectations. These data management principles are illustrated with examples from the life cycle of the GeneCards Omics project (www.genecards.org), a comprehensive, widely used compendium of annotative information about human genes. For example, the GeneCards infrastructure has recently been changed from text files to relational database, enabling better organization and views of the growing data. Omics data handling benefits from the wealth of web-based information, the vast amount of public domain software, increasingly affordable hardware, and effective use of data management and annotation principles as outlined in this chapter.

63 Wurtzel O, Dori-Bachash M, Pietrokovski S, Jurkevitch E, Sorek R
Mutation detection with next-generation resequencing through a mediator genome
PLoS One 5:e15628 (2010) (published online 31 Dec. 2010) (Pubmed ID: 21209874)
Abstract
The affordability of next generation sequencing (NGS) is transforming the field of mutation analysis in bacteria. The genetic basis for phenotype alteration can be identified directly by sequencing the entire genome of the mutant and comparing it to the wild-type (WT) genome, thus identifying acquired mutations. A major limitation for this approach is the need for an a-priori sequenced reference genome for the WT organism, as the short reads of most current NGS approaches usually prohibit de-novo genome assembly. To overcome this limitation we propose a general framework that utilizes the genome of relative organisms as mediators for comparing WT and mutant bacteria. Under this framework, both mutant and WT genomes are sequenced with NGS, and the short sequencing reads are mapped to the mediator genome. Variations between the mutant and the mediator that recur in the WT are ignored, thus pinpointing the differences between the mutant and the WT. To validate this approach we sequenced the genome of Bdellovibrio bacteriovorus 109J, an obligatory bacterial predator, and its prey-independent mutant, and compared both to the mediator species Bdellovibrio bacteriovorus HD100. Although the mutant and the mediator sequences differed in more than 28,000 nucleotide positions, our approach enabled pinpointing the single causative mutation. Experimental validation in 53 additional mutants further established the implicated gene. Our approach extends the applicability of NGS-based mutant analyses beyond the domain of available reference genomes.
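The filtering step described in this abstract amounts to a set difference between two lists of variant calls. The following is a minimal sketch of that idea only. The function name `mutant_specific_variants`, the tuple layout and the toy calls are hypothetical, and read mapping and variant calling against the mediator genome are assumed to have been done already.

```python
def mutant_specific_variants(mutant_vs_mediator, wt_vs_mediator):
    """Keep variants called in the mutant that do not recur in the wild type.

    Each variant is a hashable tuple: (position, mediator_base, observed_base).
    """
    return sorted(set(mutant_vs_mediator) - set(wt_vs_mediator))

# Hypothetical variant calls relative to the mediator genome.
wt_calls = [(101, "A", "G"), (250, "C", "T"), (900, "G", "A")]
mutant_calls = [(101, "A", "G"), (250, "C", "T"), (512, "T", "C"), (900, "G", "A")]

candidates = mutant_specific_variants(mutant_calls, wt_calls)
print(candidates)
```

With these toy inputs, the three differences shared by mutant and wild type cancel out and only the call at position 512 remains as a candidate mutation.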

64 Azoulay-Alfaguter I, Yaffe Y, Licht-Murava A, Urbanska M, Jaworski J, Pietrokovski S, Hirschberg K, Eldar-Finkelman H
Distinct molecular regulation of GSK-3α isozyme controlled by its N-terminal region. Functional role in calcium/calpain signaling
J. Biological Chemistry 286:13470-13480 (2011) (Published online January 25, 2011) (Pubmed ID: 21266584)
Abstract
Glycogen synthase kinase-3 is expressed as two isozymes alpha and beta. They share high similarity in their catalytic domains, but differ in their N- and C-terminal regions, with GSK-3α having an extended glycine-rich N-terminus. Here we undertook live cell imaging combined with molecular and bioinformatics studies to understand the distinct functions of the GSK-3 isozymes focusing on GSK-3α-N-terminal region. We found that unlike GSK-3β, which shuttles between the nucleus and cytoplasm, GSK-3α was excluded from the nucleus. Deletion of the N-terminal region of GSK-3α resulted in nuclear localization, and treatment with Leptomycin B led to accumulation of GSK-3α in the nucleus. GSK-3α rapidly accumulated in the nucleus in response to calcium or serum deprivation, and accumulation was strongly inhibited by the calpain inhibitor calpeptin. This nuclear accumulation was not mediated by cleavage of the N-terminal region or phosphorylation of GSK-3α. Rather, we show that calcium-induced GSK-3α nuclear accumulation was governed by GSK-3α-binding with as yet unknown calpain-sensitive protein or proteins; this binding was mediated by the N-terminal region. Bioinformatic and experimental analyses indicated that nuclear exclusion of GSK-3α was likely an exclusive characteristic of mammalian GSK-3α. Finally, we show that nuclear localization of GSK-3α reduced the nuclear pool of β-catenin and its target cyclin D1. Taken together, these data suggest that the N-terminal region of GSK-3α is responsible for its nuclear exclusion and that binding with a calcium/calpain sensitive product enables GSK-3α nuclear retention. We further uncovered a novel link between calcium and nuclear GSK-3α-mediated inhibition of the canonical Wnt/β-catenin pathway.

65 Shoval Y, Berissi H, Kimchi A, Pietrokovski S
New modularity of DAP-kinases: alternative splicing of the DRP-1 gene produces a ZIPk-like isoform
PLoS ONE 6:e17344 (2011) (published online 8 Mar. 2011) (Pubmed ID: 21408167)
Abstract
DRP-1 and ZIPk are two members of the Death Associated Protein Ser/Thr Kinase (DAP-kinase) family, which function in different settings of cell death including autophagy. DAP kinases are very similar in their catalytic domains but differ substantially in their extra-catalytic domains. This difference is crucial for the significantly different modes of regulation and function among DAP kinases. Here we report the identification of a novel alternatively spliced kinase isoform of the DRP-1 gene, termed DRP-1β. The alternative splicing event replaces the whole extra catalytic domain of DRP-1 with a single coding exon that is closely related to the sequence of the extra catalytic domain of ZIPk. As a consequence, DRP-1β lacks the calmodulin regulatory domain of DRP-1, and instead contains a leucine-zipper-like motif similar to the protein binding region of ZIPk. Several functional assays proved that this new isoform retained the biochemical and cellular properties which are common to DRP-1 and ZIPk, including myosin light chain phosphorylation, and activation of membrane blebbing and autophagy. In addition, DRP-1β also acquired binding to the ATF4 transcription factor, a feature characteristic of ZIPk but not DRP-1. Thus, a splicing event of the DRP-1 produces a ZIPk like isoform. DRP-1β is highly conserved in evolution, present in all known vertebrate DRP-1 loci. We detected the corresponding mRNA and protein in embryonic mouse brains and in human embryonic stem cells thus confirming the in vivo utilization of this isoform. The discovery of module conservation within the DAPk family members illustrates a parsimonious way to increase the functional complexity within protein families. It also provides crucial data for modeling the expansion and evolution of DAP kinase proteins within vertebrates, suggesting that DRP-1 and ZIPk most likely evolved from their ancient ancestor gene DAPk by two gene duplication events that occurred close to the emergence of vertebrates.

66 Bialik S, Pietrokovski S, Kimchi A
Myosin drives autophagy in a pathway linking Atg1 to Atg9
EMBO J. 30:629-630 (2011) (Pubmed ID: 21326172)
Abstract
Autophagy is a cellular process in which specialized autodegradative vesicles, the autophagosomes, are formed. Much progress has been made in understanding the molecular mechanism controlling autophagy, particularly the role of the Atg genes. In this issue, Tang et al. identify a signalling pathway linking two main regulators: the Atg1 kinase, essential for the induction of the autophagosome, and the transmembrane protein Atg9, whose shuttling between the Golgi and the forming autophagosome provides a source of membrane for the new vesicle. This study provides the missing piece of the puzzle: Atg1 phosphorylates and activates a myosin light chain kinase, which in turn activates myosin to drive transport of Atg9.

67 Tsaadon Alon L, Pietrokovski S, Barkan S, Avrahami L, Kaidanovich-Beilin O, Woodgett JR, Barnea A, Eldar-Finkelman H
Selective loss of glycogen synthase kinase-3α in birds reveals distinct roles for GSK-3 isozymes in tau phosphorylation
FEBS Lett. 585:1158-1162 (2011) (published online 16 Mar. 2011) (Pubmed ID: 21419127)
Abstract
Mammalian glycogen synthase kinase-3 (GSK-3) is a critical regulator of neuronal signaling, cognition and behavior. It exists as two isozymes GSK-3α and GSK-3β, but their distinct biological functions are not fully known. Here, we examined the evolutionary significance of each of these isozymes. Surprisingly, we found that unlike other vertebrates that harbor both GSK-3 genes, the GSK-3α gene is missing in birds. GSK-3-mediated tau phosphorylation was significantly lower in bird brains than in mouse brains, a phenomenon that was reproduced in GSK-3α knockout mice. In bird embryos tau was strongly phosphorylated, altogether, suggesting that GSK-3 isozymes play distinct roles in tau phosphorylation. Birds are GSK-3α knockout organisms and may serve as a novel model to study the distinct functions of GSK-3 isozymes.

68 Shpilka T, Weidberg H, Pietrokovski S, Elazar Z
Atg8: an autophagy-related ubiquitin-like protein family
Genome Biology 12:226 (2011) (published online 27 July 2011) (Pubmed ID: 21867568)
Abstract
Autophagy-related (Atg) proteins are eukaryotic factors participating in various stages of the autophagic process. Thus far 34 Atgs have been identified in yeast, including the key autophagic protein Atg8. The Atg8 gene family encodes ubiquitin-like proteins that share a similar structure consisting of two amino-terminal α helices and a ubiquitin-like core. Atg8 family members are expressed in various tissues, where they participate in multiple cellular processes, such as intracellular membrane trafficking and autophagy. Their role in autophagy has been intensively studied. Atg8 proteins undergo a unique ubiquitin-like conjugation to phosphatidylethanolamine on the autophagic membrane, a process essential for autophagosome formation. Whereas yeast has a single Atg8 gene, many other eukaryotes contain multiple Atg8 orthologs. Atg8 genes of multicellular animals can be divided, by sequence similarities, into three subfamilies: microtubule-associated protein 1 light chain 3 (MAP1LC3 or LC3), γ-aminobutyric acid receptor-associated protein (GABARAP) and Golgi-associated ATPase enhancer of 16 kDa (GATE-16), which are present in sponges, cnidarians (such as sea anemones, corals and hydras) and bilateral animals. Although genes from all three subfamilies are found in vertebrates, some invertebrate lineages have lost the genes from one or two subfamilies. The amino terminus of Atg8 proteins varies between the subfamilies and has a regulatory role in their various functions. Here we discuss the evolution of Atg8 proteins and summarize the current view of their function in intracellular trafficking and autophagy from a structural perspective.

69 Taylor GK, Heiter DF, Pietrokovski S, Stoddard BL
Activity, specificity and structure of I-Bth0305I: a representative of a new homing endonuclease family
Nucleic Acids Research 39:9705-9719 (2011) (published online 2 September 2011) (Pubmed ID: 21890897)
Abstract
A novel family of putative homing endonuclease genes was recently discovered during analyses of metagenomic and genomic sequence data. One such protein is encoded within a group I intron that resides in the recA gene of the Bacillus thuringiensis 03058-36 bacteriophage. Named I-Bth0305I, the endonuclease cleaves a DNA target in the uninterrupted recA gene at a position immediately adjacent to the intron insertion site. The enzyme displays a multidomain, homodimeric architecture and footprints a DNA region of ~60 bp. Its highest specificity corresponds to a 14-bp pseudopalindromic sequence that is directly centered across the DNA cleavage site. Unlike many homing endonucleases, the specificity profile of the enzyme is evenly distributed across much of its target site, such that few single base pair substitutions cause a significant decrease in cleavage activity. A crystal structure of its C-terminal domain confirms a nuclease fold that is homologous to very short patch repair (Vsr) endonucleases. The domain architecture and DNA recognition profile displayed by I-Bth0305I, which is the prototype of a homing lineage that we term the 'EDxHD' family, are distinct from previously characterized homing endonucleases.

70 Samach A, Melamed-Bessudo C, Avivi-Ragolski N, Pietrokovski S, Levy AA
Identification of plant RAD52 homologs and characterization of the Arabidopsis thaliana RAD52-Like genes
The Plant Cell doi:10.1105/tpc.111.091744 (2011) (published online December 2011) (Pubmed ID: 22202891)
Abstract
RAD52 mediates RAD51 loading onto single-stranded DNA ends, thereby initiating homologous recombination and catalyzing DNA annealing. RAD52 is highly conserved among eukaryotes, including animals and fungi. This article reports that RAD52 homologs are present in all plants whose genomes have undergone extensive sequencing. Computational analyses suggest a very early RAD52 gene duplication, followed by later lineage-specific duplications, during the evolution of higher plants. Plant RAD52 proteins have high sequence similarity to the oligomerization and DNA binding N-terminal domain of RAD52 proteins. Remarkably, the two identified Arabidopsis thaliana RAD52 genes encode four open reading frames (ORFs) through differential splicing, each of which specifically localized to the nucleus, mitochondria, or chloroplast. The A. thaliana RAD52-1A ORF provided partial complementation to the yeast rad52 mutant. A. thaliana mutants and RNA interference lines defective in the expression of RAD52-1 or RAD52-2 showed reduced fertility, sensitivity to mitomycin C, and decreased levels of intrachromosomal recombination compared with the wild type. In summary, computational and experimental analyses provide clear evidence for the presence of functional RAD52 DNA-repair homologs in plants.