DeepLoc: prediction of protein subcellular localization using deep learning
José Juan Almagro Armenteros (Department of Bio and Health Informatics, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark; The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen N, Denmark. To whom correspondence should be addressed. Email: [email protected]), Casper Kaae Sønderby (The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen N, Denmark), Søren Kaae Sønderby (The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen N, Denmark), Henrik Nielsen (Department of Bio and Health Informatics, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark), Ole Winther (The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen N, Denmark; DTU Compute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark)
Bioinformatics, Volume 33, Issue 21, 01 November 2017, Pages 3387–3395, https://doi.org/10.1093/bioinformatics/btx431

Received: 16 March 2017; Revision received: 06 June 2017; Accepted: 03 July 2017; Published: 07 July 2017

A correction has been published: Bioinformatics, Volume 33, Issue 24, 15 December 2017, Page 4049, https://doi.org/10.1093/bioinformatics/btx548
Abstract
Motivation: The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics due to its relevance in proteomics research. Many machine learning methods have been successfully applied in this task, but in most of them, predictions rely on annotation of homologues from knowledge databases. For novel proteins where no annotated homologues exist, and for predicting the effects of sequence variants, it is desirable to have methods for predicting protein properties from sequence information only.

Results: Here, we present a prediction algorithm using deep neural networks to predict protein subcellular localization relying only on sequence information. At its core, the prediction model uses a recurrent neural network that processes the entire protein sequence and an attention mechanism identifying protein regions important for the subcellular localization. The model was trained and tested on a protein dataset extracted from one of the latest UniProt releases, in which experimentally annotated proteins follow more stringent criteria than previously. We demonstrate that our model achieves a good accuracy (78% for 10 categories; 92% for membrane-bound or soluble), outperforming current state-of-the-art algorithms, including those relying on homology information.

Availability and implementation: The method is available as a web server at http://www.cbs.dtu.dk/services/DeepLoc. Example code is available at https://github.com/JJAlmagro/subcellular_localization. The dataset is available at http://www.cbs.dtu.dk/services/DeepLoc/data.php.
1 Introduction
Proteins fulfil a wide diversity of functions inside the various compartments of eukaryotic cells. The function of a protein depends on the compartment or organelle where it is located, as it provides a physiological context for its function. However, aberrant protein subcellular localization can affect the function that a protein exhibits and contributes to the pathogenesis of many human diseases, such as metabolic, cardiovascular and neurodegenerative diseases, as well as cancer (Hung and Link, 2011). Therefore, predicting the subcellular localization of proteins is an essential task which has been extensively studied in bioinformatics (Emanuelsson et al., 2007; Imai and Nakai, 2010; Wan and Mak, 2015).

Most of the current machine learning methods for subcellular localization prediction extract a fixed number of features from the protein sequences and use this fixed-length representation as input to a non-linear classifier such as a support vector machine (SVM). However, sequence-based models, which process one position at a time, are more natural for this task, as they can learn and make inferences from input of varying length. Unfortunately, these models have not been competitive with non-linear classifiers until recently. In this paper we take advantage of progress in deep learning, specifically recurrent neural networks (RNNs) with long short-term memory (LSTM) cells, attention models and convolutional neural networks (CNNs), to propose an end-to-end sequence-based model. LSTMs contain memory cells that can hold information from past inputs to the network for, in principle, an arbitrary number of positions (Hochreiter and Schmidhuber, 1997). Attention (Bahdanau et al., 2014) makes it possible to detect sorting signals in proteins regardless of their position in the sequence. In addition, CNNs are able to train filters that detect short motifs in the input sequence irrespective of where they occur, and have shown promising performance for protein subcellular localization when combined with LSTMs (Sønderby et al., 2015). We also propose a hierarchical tree likelihood mimicking the biology of the sorting pathway and a transfer learning approach to jointly predict subcellular localization and whether the protein is membrane-bound or soluble.

In the following we discuss some of the caveats with the datasets used in previous subcellular localization tools. First, many methods use homology information for prediction, either by directly using subcellular location annotations of retrieved hits in a database search, as in LocTree3 (with an accuracy of 80% for 18 locations) (Goldberg et al., 2014), or by taking hints from other types of annotation such as GO terms, as in iLoc-Euk and YLoc (Briesemeister et al., 2010; Chou et al., 2011), or PubMed abstracts linked to the protein's Swiss-Prot entry, as in SherLoc (Briesemeister et al., 2009). These methods are appropriate for annotated proteins or proteins with annotated close homologues. Nonetheless, it should be taken into account that the performance will be much lower for sequences without well-annotated homologues: precisely the sequences for which it would be most relevant to have working prediction methods. In addition, any homology-based method will have a very limited chance of being able to predict the consequences of mutations affecting sorting signals, because the wild type and the variant would probably pick up the same homologues in a database search.

Second, the performances of machine learning algorithms are crucially dependent on the datasets used to train and test them. For protein subcellular localization a key aspect is that proteins should have experimental evidence for their subcellular location, so that predictions are not based on predictions in a circular fashion. However, current methods use data from UniProt (The UniProt Consortium, 2017) prior to release 2014_09, where a major change in the annotation standards took place. Before the change, an annotation was regarded as experimental if it lacked qualifiers such as 'Potential', 'Probable' or 'By similarity'; after the change, only annotations with a specific literature reference were annotated as being experimental (evidence code ECO:0000269). This resulted in a considerable decrease in the number of proteins with subcellular location regarded as experimentally confirmed, thus raising the issue that current methods may in part be trained and tested on questionable examples. Another aspect of the dataset issue is that the amount of homology between training data and test data should be kept at a minimum (Hobohm et al., 1992). The measured test performance should be a true measure of the predictive performance on new proteins, and not just a measure of how good the method is at finding homologues with the same subcellular location. Unfortunately, the Höglund dataset (Höglund et al., 2006), which has been used in the training and test of several methods (Blum et al., 2009; Briesemeister et al., 2009, 2010; Shatkay et al., 2007; Sønderby et al., 2015), is only homology reduced to 80% identity. This means that rather close homologues to the training data will occur in the test set, which results in overly optimistic performances that do not reflect the true generalization to new unseen proteins. An example of a state-of-the-art method that uses this dataset is SherLoc2, which reports an accuracy of 93% for 11 locations.

This paper has four major contributions:

1. We construct a new dataset from a recent version of UniProt where proteins have experimental evidence for their subcellular locations according to the new, stricter definition. We perform stringent homology partitioning to avoid overfitting, providing realistic accuracy measures on new proteins.
2. We show that models trained on the Höglund dataset have poor generalization performance on our new dataset. This reflects the high level of homology and possibly erroneous annotations in the old dataset.
3. We develop deep recurrent neural networks for the protein subcellular localization task with a number of novel state-of-the-art model features. This includes convolutional motif detectors, selective attention on sequence regions important for subcellular localization prediction and a novel hierarchical sorting likelihood. These features are used for interpretation of the model and predictions. Our networks show improved prediction accuracy without using homology information.
4. We implement the resulting model as a user-friendly web server called DeepLoc. (Concurrently with our work, Kraus et al. (2017) have introduced a method for protein subcellular location prediction from cell image data, also called DeepLoc.)

2 Materials and methods
2.1 Neural network models
The deep learning neural network model used is described in detail below. Figure 1 and the following description give a summary of the architecture: The input is sequence length (=1000) × size of amino acid vocabulary (=20). The CNN extracts motif information using 120 filters of different sizes (20 for each of the sizes 1, 3, 5, 9, 15 and 21). This gives a 1000 × 120 feature map. Another convolutional layer of 128 filters of size 3 × 120 is applied to this feature map. This gives a 1000 × 128 feature map, which is used as input to the recurrent layer. The recurrent neural network scans the sequence using 256 LSTM units in both directions, giving in total a 1000 × 512-dimensional output. The attention decoding layer uses an LSTM with 512 units through 10 decoding steps, and the attention mechanism feed-forward neural network (FFN) has 256 units. The final fully connected dense layer is composed of 512 units, and the two output layers have one unit (membrane-bound) and 10 units (subcellular localization).
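As a rough illustration of the attention weighting used in this architecture (the scoring, normalization and context-vector steps detailed in Section 2.1), the following NumPy sketch computes one decoding step with toy dimensions; the sizes and random parameters are illustrative stand-ins, not trained values or the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

L, H = 50, 8      # toy sequence length and encoder width (the paper uses 1000 and 512)
D_h, A = 6, 4     # toy decoder width and attention FFN width

h = rng.normal(size=(L, H))      # encoder hidden states h_1 .. h_L
d = rng.normal(size=(D_h,))      # decoder hidden state at the current step
W_e = rng.normal(size=(H, A))    # trainable attention parameters
W_d = rng.normal(size=(D_h, A))
v = rng.normal(size=(A,))

# Unnormalized score per position: e_t = tanh(h_t W_e + d W_d) v^T
e = np.tanh(h @ W_e + d @ W_d) @ v

# Softmax over sequence positions gives the importance weights alpha_t
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Context vector c = sum_t alpha_t h_t: fixed size regardless of L
c = alpha @ h
```

Note how the context vector `c` has the encoder's width no matter how long the sequence is, which is what frees the classifier from a fixed-size input representation.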
Fig. 1. (A) The convolutional neural network (CNN) extracts motif information using different motif sizes. (B) The recurrent neural network scans the sequence in both directions, extracting the spatial dependencies between amino acids. (C) The attention mechanism assigns higher importance to amino acids that are relevant for the prediction. At each decoding step, the attention weights α are generated based on the hidden states from the RNN and the hidden states from the previous decoding step. The weighted average of these weights at the last decoding step is used as input to a fully connected dense layer. (D) All the information gathered from the protein sequence is passed to a softmax function and a hierarchical tree of sorting pathways to calculate the final prediction.

We learn a subcellular localization model which predicts the subcellular localization using the amino acid sequence as input:

$y = f_\theta(X)$,  (1)

where $y$ is the predicted localization, $f$ is the prediction model parametrized by parameters $\theta$ and $X$ is the input data sequence of size $L \times N$, where $L$ is the protein length and $N$ is the number of input features per sequence position. The parameters $\theta$ are optimized using stochastic gradient descent with cross-entropy loss between the true and predicted localization distribution.

In practice, the length of protein sequences can vary from tens to thousands of amino acids, posing a challenge for many prediction algorithms requiring a fixed-size input representation. Instead, recurrent neural networks (RNNs), which naturally handle varying input sequence lengths, were used. The network applies a recurrent calculation at each sequence position $t$:

$h_t = f_E(x_t, h_{t-1}), \quad t = 1 \ldots L$  (2)

where $f_E$ is an RNN denoted the encoder, $x_t$ is the input features of $X$ at position $t$ and $h = [h_1, \ldots, h_L]$ is the hidden states of the RNN, where $h_t$ is a vector of the same length as the number of hidden units in the RNN. The encoder can be viewed as a trainable feature extractor encoding the amino acid sequence into a feature space suitable for subcellular localization prediction. Naively, the final subcellular location $y$ could be predicted by applying a classifier $f_y$ to the final hidden state of the encoder $h_L$:

$y = f_y(h_L)$.  (3)

However, this approach is not ideal for several reasons. Firstly, the RNN has to remember all useful information across the entire, often very long, input sequence. In subcellular localization this is especially problematic, since most of the information is known to reside in the beginning (N-terminus) and end (C-terminus) of the sequence. Secondly, all information about the protein has to be compressed into the same size vector regardless of the length of the protein. Two different solutions were used to alleviate these problems: bidirectional RNNs and attention RNNs.

In bidirectional RNNs, the protein sequence is processed both forwards and backwards by two separate RNNs, and the input to the final classifier is then the concatenated outputs of the last hidden state of both RNNs. The forwards and backwards RNNs will then be better at remembering motifs in the C-terminus and N-terminus, respectively. Nevertheless, for long sequences these algorithms still have to remember information across many steps. To solve this problem, as well as identify protein regions important for classification, we augmented the bidirectional RNN encoder with an attentive decoder (Bahdanau et al., 2014). Using the last hidden state of the encoder $h_L$ as input, the attentive decoder $f_D$ is run for $D$ decoding steps. Note that $D$ does not depend on the input sequence length $L$. At each step, the hidden state of the attentive decoder $d_r$ is used by an attention function $f_A$ to assign a normalized importance weight to each sequence position of the encoder hidden states $h = [h_1, \ldots, h_L]$:

$d_r = f_D(h_L, d_{r-1}, c_{r-1}), \quad r = 1 \ldots D$  (4)

$e_{t,r} = f_A(h_t, d_r) = \tanh(h_t W_e + d_{r-1} W_d)\, v^T$  (5)

$\alpha_{t,r} = \dfrac{\exp(e_{t,r})}{\sum_{t'=1}^{L} \exp(e_{t',r})}$,  (6)

where $d_r$ is the hidden state of the decoder at step $r$, and the matrices $W_d$ and $W_e$ and the column vector $v$ are the trainable parameters of the attention function. $d_r$ is a vector of the same size as the number of hidden units in the decoder LSTM, which can be different from the dimensionality of the encoder $h_t$. $\alpha_{t,r}$ are the normalized importance weights, and $c_r$ is a weighted average of the encoder RNN hidden states calculated as

$c_r = \sum_{t=1}^{L} \alpha_{t,r} h_t$.  (7)

The initial value of $c_r$, i.e. $c_0$, is a learned parameter vector that is trained as part of the neural network model. The subcellular localization is then predicted using the weighted average of the encoder RNN hidden states at the last step of the decoder:

$y = f_y(c_D)$.  (8)

This allows the model to selectively assign weight to sequence positions important for classification, which reduces the need for remembering all information across the entire length of the sequence. Both $f_E$ and $f_D$ are implemented as a special type of RNN unit called long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997). LSTMs share the same chain structure as RNNs, but the recurrent calculation is augmented with an internal memory cell capturing long-range dependencies.

Furthermore, convolutional filters were used to detect protein motifs. Here a filter, akin to a position-specific scoring matrix, is slid across the sequence. It will then detect a motif regardless of its position in the sequence. The weights of each filter can be adjusted to find the motifs that help to better predict each class. These new features created with a CNN can represent the inputs in a more abstract way, which, in combination with LSTMs, has been shown to be beneficial for protein classification (Sønderby et al., 2015).

2.2 Hierarchical tree likelihood
To include information from protein sorting pathways into our model, a hierarchical tree with multiple nodes was developed. Each node represents a binary decision attempting to assign the protein to the right pathway, from high-level to detailed classification. As an example, the first binary decision in the tree classifies proteins into the secretory or non-secretory pathway, whereas the last nodes separate related compartments such as mitochondria and chloroplasts; see Figure 1, panel D. The leaf nodes correspond to the final subcellular localizations, and the likelihood is calculated as the joint probability of decisions in the tree. So, for example, if we have decisions A, B, y, then according to the tree decomposition the probability of $y$ given input sequence $X$ is given by

$P(y|X) = P(y|B, X)\, P(B|A, X)\, P(A|X)$.  (9)

An example path is A = Non-secretory pathway, B = N-terminal sequence and y = Mitochondrion. Each of the nine binary classifiers is implemented by a logistic output connected to the fully connected dense layer. By construction, the tree probabilities are normalized: $\sum_y P(y|X) = 1$.

2.3 Datasets
2.3.1 DeepLoc dataset
The protein data used to train DeepLoc were extracted from the UniProt database, release 2016_04 (The UniProt Consortium, 2017). The protein dataset was filtered using the following criteria: eukaryotic, not fragments (these could have the N-terminus or C-terminus missing), encoded in the nucleus, longer than 40 amino acids and experimentally annotated (ECO:0000269). Similar locations or subclasses of the same location were mapped to 10 main locations in order to increase the number of proteins per compartment. Furthermore, proteins were classified as membrane or soluble if they were found in either the membrane or the lumen of the organelle; if no information was provided, they were tagged as unknown. Finally, proteins with more than one subcellular localization were filtered out. A total of 13 858 proteins were obtained after the filtering process. The mapped sublocations and the number of proteins in each main localization are summarized in Table 1.

Table 1. Number of proteins in each location, and sublocations that were grouped together under the same main location

Location | No. of proteins | Sublocations
Nucleus | 4043 | Envelope, inner and outer membrane, matrix, lamina, chromosome, nucleus speckle
Cytoplasm | 2542 | Cytoplasm (cytosol and cytoskeleton)
Extracellular | 1973 | Extracellular
Mitochondrion | 1510 | Envelope, inner and outer membrane, matrix, intermembrane space
Cell membrane | 1340 | Apical, apicolateral, basal, basolateral, lateral, cell membrane, cell projection
Endoplasmic reticulum (ER) | 862 | ER membrane and lumen, microsome, rough ER, smooth ER, sarcoplasmic reticulum
Plastid | 757 | Plastid membrane, stroma and thylakoid
Golgi apparatus | 356 | Golgi apparatus membrane and lumen
Lysosome/Vacuole | 321 | Contractile, lytic and protein storage vacuole, vacuole lumen and membrane, lysosome lumen and membrane
Peroxisome | 154 | Peroxisome matrix and membrane
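The sublocation-to-main-location grouping summarized in Table 1 can be sketched as a keyword lookup; the mapping below is a deliberately abbreviated, hypothetical example covering only a few classes, not the authors' actual rule set.

```python
# Partial, illustrative grouping in the spirit of Table 1 (hypothetical
# keyword lists, not the full mapping used to build the dataset).
MAIN_LOCATIONS = {
    "Nucleus": ["nucleus", "nuclear envelope", "chromosome", "nucleus speckle"],
    "Cytoplasm": ["cytoplasm", "cytosol", "cytoskeleton"],
    "Mitochondrion": ["mitochondrion", "intermembrane space"],
    "Endoplasmic reticulum": ["endoplasmic reticulum", "rough er", "microsome"],
}

def map_to_main_location(sublocation):
    """Return the main location for a sublocation string, or None if unmapped."""
    s = sublocation.lower()
    for main, keywords in MAIN_LOCATIONS.items():
        if any(k in s for k in keywords):
            return main
    return None

# Sublocations collapse onto their main compartment
assert map_to_main_location("Mitochondrion inner membrane") == "Mitochondrion"
assert map_to_main_location("Cytosol") == "Cytoplasm"
```

Proteins whose annotations map to more than one main location would then be dropped, per the single-localization filter described above.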
To ensure that the model generalizes to new data, a stringent homology partitioning was performed. Homologous proteins that fulfil a certain threshold of similarity were clustered as detailed below. Then, each cluster of homologous proteins was assigned to one of five folds, ensuring that similar proteins were not mixed between the different folds. PSI-CD-HIT (Li and Godzik, 2006) was used to cluster proteins at a 30% identity or 10^-6 E-value cutoff, with the alignment required to cover 80% of the shorter (redundant) sequence, which produced 8410 clusters for the whole dataset. The five folds generated had approximately the same number of proteins in each location. Four were used for training and validation, and one was held out for testing.

2.3.2 Höglund dataset
The Höglund dataset (Höglund et al., 2006) has been used to train both the MultiLoc and RNN prediction methods in Höglund et al. (2006) and Sønderby et al. (2015). This dataset consists of 5959 proteins with 11 possible locations (cytoplasm, nucleus, extracellular, mitochondria, plasma membrane, ER, chloroplast, Golgi apparatus, lysosome, vacuole and peroxisome) and is homology reduced to 80% identity. Apart from grouping together lysosomal and vacuolar proteins, no modifications were made to the dataset.

2.4 Comparison to current prediction algorithms
The performance of our models was compared with a number of current prediction algorithms using the following approaches: LocTree2 (Goldberg et al., 2012), MultiLoc2 (Blum et al., 2009) and SherLoc2 (Briesemeister et al., 2009) were run with local command-line versions installed on our own server, while CELLO (Yu et al., 2006), iLoc-Euk (Chou et al., 2011) and WoLF PSORT (Horton et al., 2007) were run on their web servers. YLoc (Briesemeister et al., 2010) was run offline by the maintainer of the web service. Results for YLoc are given with the option to include GO terms turned on. For MultiLoc2 and SherLoc2, a newer version of InterProScan (5.21-60) was used instead of the recommended one (4.4) due to compatibility problems with the older version. As a reference, the performance on the Höglund test set was measured on our local installation, obtaining an accuracy of 0.8300 for MultiLoc2 and 0.9179 for SherLoc2. In the cases where current methods predict more than ten locations, the predicted locations were mapped onto our ten locations. Two of the methods, iLoc-Euk and WoLF PSORT, in some cases predict dual locations (such as cytoplasm/nucleus). Since proteins with dual locations were filtered out in the construction of the dataset, those predictions were counted as erroneous, unless both predicted locations mapped to the same location in our classification.

2.5 Experiments
Two different sets of experiments were carried out. The first experiments were used for model selection, comparing the relative performances of the following model architectures:

- Feed-forward neural network (FFN)
- Bidirectional LSTM neural network (BLSTM)
- BLSTM neural network with attention mechanism (A-BLSTM)
- Convolutional BLSTM neural network with attention mechanism (ConvA-BLSTM)

Using the best model architectures, the second set of experiments was designed to test the generalization performance of models trained on either our new DeepLoc dataset or the Höglund dataset.

Hyperparameters were optimized on three of four splits of the training data, and the performance was evaluated on the last validation split. The hyperparameter selection was done using uni-dimensional search, where one hyperparameter was changed and the rest were kept fixed. If a hyperparameter had not yet been tested, the median value in the range of that hyperparameter was chosen. Each hyperparameter setting was run for 150 epochs (epoch = full pass over the training set), and the performance was measured as the highest seen performance on the validation set. This strategy was used for computational reasons, since a full grid search over all parameters was not computationally feasible. After the best hyperparameters were identified, a final run of experiments was used to identify the best combination of amino acid encodings among BLOSUM62 (Henikoff and Henikoff, 1992), sparse, protein profiles or HSDM encoding (Prlić et al., 2000). We further found that protein profiles gave the highest performance and included these as input features for the final models. The profiles were generated using the same method as the TOPCONS web server (Tsirigos et al., 2015).

The test performance was measured by training four models on the training set using the four different combinations of training and validation sets. The reported test performance is the average of the four models evaluated on the held-out test set. We stress that we never optimized any parameters on the test set, leaving the reported performances unbiased. To decrease the training time, the maximum protein length was set to 1000. If a protein exceeded this length, amino acids from the middle of the sequence were removed in order not to lose information about the N-terminal and C-terminal sorting signals. 9.98% of the proteins were truncated using this rule.

The performance measurements used to assess our models were accuracy and the Gorodkin measure (Gorodkin, 2004). For the binary prediction, the accuracy and the Matthews correlation coefficient (MCC) (Matthews, 1975) were used. The Gorodkin measure can be seen as a generalization of MCC that applies to K categories, which is more informative than the accuracy when there is an imbalance of classes. For K = 2, the Gorodkin measure squared is the 'generalized squared correlation' (GC2) of Baldi et al. (2000). All models were implemented in Python 2.7.11 using the neural network library Lasagne 0.2 (Dieleman et al., 2015) and Theano 0.9.0 (Theano Development Team, 2016) for efficient GPU implementation.

3 Results
We designed experiments to address the following questions:

- What are the relative performances of the proposed neural network model architectures? → Section 3.1
- How does the generalization performance of models trained on either the DeepLoc or Höglund dataset compare? → Section 3.2
- How does the final DeepLoc model compare to current state-of-the-art protein subcellular prediction models? → Section 3.3

3.1 Model selection
In Table 2 we compare the performances of different model architectures trained on the DeepLoc dataset. Note that we are interested in the relative performance of the models. Due to this, we only used BLOSUM62 encodings as input features, which resulted in a slightly degraded performance compared to the final performances described in the following sections.

Table 2. Comparison of performances for different model architectures using BLOSUM62 input features

Model | Subcellular location accuracy | Subcellular location Gorodkin | Membrane accuracy | Membrane MCC
FFN | 0.5234 | 0.4229 | 0.7301 | 0.4509
BLSTM | 0.6925 | 0.6278 | 0.9004 | 0.8023
A-BLSTM | 0.7290 | 0.6729 | 0.9163 | 0.8345
ConvA-BLSTM | 0.7289 | 0.6780 | 0.9111 | 0.8218
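The Gorodkin measure reported in Table 2 is the K-category correlation coefficient computed from the confusion matrix. A sketch, assuming the standard formulation of Gorodkin (2004), in which the statistic reduces to the Matthews correlation coefficient for K = 2:

```python
import numpy as np

def gorodkin(C):
    """K-category correlation (Gorodkin, 2004) from a K x K confusion matrix C,
    where C[i, j] counts class-i items predicted as class j.
    Reduces to the Matthews correlation coefficient for K = 2."""
    C = np.asarray(C, dtype=float)
    N = C.sum()
    t = C.sum(axis=1)                  # true-class counts
    p = C.sum(axis=0)                  # predicted-class counts
    cov_xy = N * np.trace(C) - t @ p   # covariance between truth and prediction
    cov_xx = N * N - t @ t
    cov_yy = N * N - p @ p
    return cov_xy / np.sqrt(cov_xx * cov_yy)

# A perfect 3-class prediction gives a correlation of 1.0
assert np.isclose(gorodkin(np.diag([5, 3, 2])), 1.0)
```

Unlike plain accuracy, this statistic penalizes a classifier that scores well merely by favouring the majority classes, which matters for the imbalanced location counts in Table 1.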
The A-BLSTM and the ConvA-BLSTM models achieved the highest performance predicting the subcellular localization, with accuracies of 0.7290 and 0.7289, respectively. Comparing these results with the performance of the BLSTM without attention (accuracy 0.6925), we see that attention improves performance. These results confirm the benefit of selective, context-dependent attention for protein classification. All of the A-BLSTM models performed significantly better than the baseline FFN model, which achieved an accuracy of 0.5234. This is expected, since FFN models do not take into account the order of the amino acids, whereas the LSTM models naturally consider the relationships between amino acids. Furthermore, we observed that including 10 decoding steps in the attention mechanism increased the accuracy (a difference of 1%) in comparison with a single decoding step. Increasing the decoding steps beyond 10 resulted in a reduction in the accuracy. Lastly, the A-BLSTM and ConvA-BLSTM models predicted whether the proteins were membrane-bound or soluble with accuracies of 0.9163 and 0.9111, respectively. From the amino acid encoding comparison, we found that the ConvA-BLSTM model using protein profiles encoding had the highest accuracy, with a difference of 2% compared to the A-BLSTM model. Therefore, we decided to use this encoding and this model for the rest of the experiments.

3.2 Dataset comparison
To compare the generalization performance of models trained on either the DeepLoc or the Höglund dataset, we trained a ConvA-BLSTM model on each dataset and evaluated the performances on the test sets from both datasets. Table 3 shows that (i) the Höglund training set achieves a good test performance only on the Höglund test set and (ii) the DeepLoc training set achieves a good test performance on test sets with stringent independence between training and test sets.

Table 3. Comparison of generalization performances using the ConvA-BLSTM model between the DeepLoc dataset and the Höglund dataset

Training set | Test set | Accuracy | Gorodkin
DeepLoc | DeepLoc | 0.7511 | 0.6988
Höglund | DeepLoc | 0.6426 | 0.5756
DeepLoc | Höglund | 0.8301 | 0.8010
Höglund | Höglund | 0.9138 | 0.8979

Note: Sequence profiles were used as input features.
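All models compared in Tables 2 and 3 were trained on sequences preprocessed with the truncation rule from Section 2.5 (maximum length 1000, residues removed from the middle so that both termini survive). A minimal sketch, assuming an even split of the length budget between the two termini, which the paper does not specify:

```python
def truncate_middle(seq, max_len=1000):
    """Cap sequence length by deleting residues from the middle, preserving the
    N- and C-terminal regions that carry most sorting signals.
    The 50/50 split between termini is an assumption for illustration."""
    if len(seq) <= max_len:
        return seq
    half = max_len // 2
    return seq[:half] + seq[-(max_len - half):]

s = "M" * 600 + "K" * 600                # 1200-residue toy sequence
t = truncate_middle(s, max_len=1000)
assert len(t) == 1000
assert t.startswith("M" * 500) and t.endswith("K" * 500)
```

This keeps the input tensor bounded at 1000 × N while leaving signal peptides (N-terminal) and, e.g., ER-retention signals (C-terminal) intact.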
These results show that models trained on the Höglund dataset generalize poorly compared to models trained on the DeepLoc dataset. As a qualitative comparison of the two datasets, we visualized the context vectors $c_r$ for ConvA-BLSTM models trained on both datasets, as seen in Figure 2. The compartments are notably more separated for the model trained on the Höglund dataset compared to the model trained on the DeepLoc dataset.

Fig. 2. t-SNE representation of the context vector $c_r$ for a ConvA-BLSTM trained on the DeepLoc and Höglund datasets, visualized for the respective test sets.

3.3 DeepLoc model
From the model comparisons we identified the CONV A-BLSTM as the best performing model architecture. To further improve prediction accuracy we trained an ensemble of 16 models using nested cross-validation. Eight of the models were trained using a softmax output distribution (class probability from the softmax function) and eight using the hierarchical tree distribution (joint probability of multiple logistic functions). Further, we mitigated the effect of the class imbalances by using a cost matrix (Zhou and Liu, 2006) to recalculate the class probabilities based on the number of samples in the training set. The full ensemble achieved an accuracy of 0.7797 and a Gorodkin measure of 0.7347 on the subcellular localization task, and an accuracy of 0.9234 and an MCC of 0.8435 on the membrane-bound or soluble prediction. We found that the softmax models had a slightly higher accuracy than the hierarchical tree models, with the two 8-model ensembles achieving accuracies of 0.7717 and 0.7695, respectively. Table 4 shows the accuracy and the MCC for each binary decision in the hierarchical tree model. We experimented with increasing the ensemble size but found no improvement in performance.

Table 4. Accuracy and MCC of each node in the hierarchical tree

Node                                Accuracy   MCC
Secretory/Non-secretory pathway     0.9502     0.8902
Intracellular/Extracellular         0.9507     0.8979
N-terminal sequences                0.9544     0.8784
Intermediate compartment            0.7982     0.5824
PTS                                 0.9784     0.4085
Mitochondrion/Chloroplast signals   0.9537     0.8955
Cell membrane/Lysosome              0.8575     0.5002
ER/Golgi                            0.8559     0.6376
NLS                                 0.8138     0.6031
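The paper does not spell out the exact cost matrix used for the class-imbalance correction; a common choice in this setting (Zhou and Liu, 2006) is to weight each predicted class probability by the inverse of its training-set frequency and renormalize. A minimal sketch under that assumption (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def rebalance(probs, train_counts):
    """Recalculate predicted class probabilities with cost-sensitive
    weights set to the inverse of the training-set class frequencies
    (an illustrative choice; the paper's exact cost matrix may differ)."""
    counts = np.asarray(train_counts, dtype=float)
    weights = counts.sum() / counts  # rarer classes get larger weights
    adjusted = np.asarray(probs) * weights
    return adjusted / adjusted.sum(axis=-1, keepdims=True)

# A class with fewer training samples gains probability mass:
rebalance([0.6, 0.4], train_counts=[300, 100])  # ≈ [0.333, 0.667]
```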
The training time for the full ensemble was 80 hours, approximately five hours per model. When testing, the ensemble takes three seconds per protein on average to perform a prediction. Nonetheless, this ensemble used protein profiles, which were already generated for this dataset. This profile generation is the most time-consuming step, usually taking approximately 30 seconds per protein. If a hit with the PFAM database is not found, the profile generation uses UniRef90 instead. This can take even longer and can therefore be problematic for large protein datasets. To solve this, we trained the same ensemble using BLOSUM62 encoding. This model has an accuracy of 0.7360 and a Gorodkin measure of 0.6832 on the subcellular localization task, and an accuracy of 0.9130 and an MCC of 0.8237 on the membrane-bound or soluble prediction. By omitting the profile generation, we achieved a faster prediction at the cost of a decrease in accuracy.

Tables 5 and 6 show the confusion matrices of the full ensemble described above for subcellular localization and membrane-bound prediction, respectively. The primary sources of error are confusion of the nucleus and cytoplasm, lysosome/vacuole misclassified as cell membrane, and Golgi misclassified as cytoplasm. In Figure 3, we show the attention vector α, i.e. how important different regions of the sequence are for the classification. In general, the DeepLoc model assigns large importance to the N-terminal for secreted proteins, whereas e.g. membrane proteins have regions of importance interspersed across the protein length.

Table 5. Confusion matrix of the test set on the final DeepLoc model using profile encoding (predicted-location columns in the same order as the rows)

Location                Number of predicted proteins              Sens.  MCC
Nucleus                 680 103   4   5   2   8   1   2   2   1   0.842  0.784
Cytoplasm                94 361   7  18   5   4   3   8   1   7   0.711  0.608
Extracellular             3   5 365   5   5   4   2   0   4   0   0.929  0.907
Mitochondrion             9  21   0 247   0   5  14   2   1   3   0.818  0.812
Cell membrane             5  15   6   1 203  20   1   4  18   0   0.744  0.732
Endoplasmic reticulum     3   6   6   3  18 120   1   7   8   1   0.694  0.654
Plastid                   1   2   0   8   0   0 140   0   1   0   0.921  0.883
Golgi apparatus           4  17   1   0   9   8   1  26   4   0   0.371  0.414
Lysosome/Vacuole          0   7  11   1  20   9   0   4  12   0   0.188  0.194
Peroxisome                0  13   0   4   1   4   0   0   0   8   0.267  0.321

Note: Sens., sensitivity.
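The BLOSUM62 encoding used for the faster ensemble replaces each residue with its row of the BLOSUM62 substitution matrix (Henikoff and Henikoff, 1992), giving a fixed vector per amino acid with no database search. A sketch using only a four-residue slice of the matrix for brevity (the real encoding uses all 20 rows; the function name is our own):

```python
import numpy as np

# Four-residue slice of BLOSUM62 (Henikoff and Henikoff, 1992);
# the full encoding table has one 20-dimensional row per amino acid.
BLOSUM62_SLICE = {
    "A": [ 4, -1, -2, -2],
    "R": [-1,  5,  0, -2],
    "N": [-2,  0,  6,  1],
    "D": [-2, -2,  1,  6],
}

def encode(seq, table=BLOSUM62_SLICE):
    """Encode a protein sequence as a (length x k) matrix by replacing
    each residue with its substitution-matrix row."""
    return np.array([table[aa] for aa in seq], dtype=float)

encode("ANDR").shape  # → (4, 4)
```

Because the lookup is a fixed table, this encoding is essentially free compared with the roughly 30 seconds per protein needed for profile generation.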
Table 6. Confusion matrix for the membrane-bound predictor

Type             Predicted soluble   Predicted membrane-bound
Soluble          968                 38
Membrane-bound   96                  647
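The membrane-bound MCC reported above can be reproduced directly from the counts in Table 6 with the standard Matthews (1975) formula, taking membrane-bound as the positive class:

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den

# Counts from Table 6: 647 membrane-bound proteins predicted correctly,
# 96 missed, 968 soluble predicted correctly, 38 falsely called membrane-bound.
mcc(tp=647, fp=38, fn=96, tn=968)  # ≈ 0.84, matching the reported MCC of 0.8435
```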
Fig. 3. Sequence importance across the protein sequence of the DeepLoc test set when making the prediction. The x-axis is the sequence position, and along the y-axis the proteins in the test set are sorted according to protein localization. For visualization, proteins shorter than 1000 amino acids are padded from the middle, so the N-terminus and C-terminus align. Proteins longer than 1000 amino acids have the middle part removed.

To compare the performance of the final DeepLoc model to other approaches, we benchmarked a number of current prediction algorithms on the DeepLoc test set, as seen in Table 7. The accuracy of the final DeepLoc model (0.7797) is significantly better than that of all other methods, with iLoc-Euk achieving the second best accuracy of 0.6820.

Table 7. Accuracy and Gorodkin measure achieved by current predictors and the final DeepLoc model on the DeepLoc test set

Method       Accuracy   Gorodkin
LocTree2     0.6120     0.5250
MultiLoc2    0.5592     0.4869
SherLoc2     0.5815     0.5112
YLoc         0.6122     0.5330
CELLO        0.5521     0.4543
iLoc-Euk     0.6820     0.6412
WoLF PSORT   0.5671     0.4785
DeepLoc      0.7797     0.7347
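The middle-padding and middle-cropping used to align the N- and C-termini in Figure 3 can be sketched as follows (a hypothetical helper written for illustration, not code from the paper):

```python
import numpy as np

def middle_pad(values, target=1000, pad_value=0.0):
    """Pad or crop a per-residue vector in the middle so the N-terminus
    stays left-aligned and the C-terminus right-aligned, as done for the
    Figure 3 visualization."""
    v = np.asarray(values, dtype=float)
    n = len(v)
    half = target // 2
    if n >= target:
        # Long protein: keep both ends, drop the middle part.
        return np.concatenate([v[:half], v[n - (target - half):]])
    # Short protein: insert padding in the middle.
    mid = n // 2
    pad = np.full(target - n, pad_value)
    return np.concatenate([v[:mid], pad, v[mid:]])
```

Applied row by row to the attention vectors, this yields the fixed-width image in Figure 3 where terminal sorting signals line up across proteins of different lengths.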
4 Discussion
In this paper we have introduced the DeepLoc dataset: a well assembled protein collection with reliable subcellular localization information. Secondly, we have provided a deep neural network based prediction algorithm achieving state-of-the-art performance on this new dataset. The context-dependent annotation vector generated by the attention mechanism is able to represent a protein based on its subcellular localization. In addition, the attention based prediction method allows visualization of the biologically plausible regions used to predict the subcellular localization of the proteins, which we believe will provide relevant information.

The comparison of the generalization performances for models trained on our new DeepLoc dataset and the Höglund dataset showed that DeepLoc trained models generalized much better than the Höglund trained model. Here we discuss a number of explanations for these findings. Firstly, with the UniProt database change the Höglund dataset could contain many wrongly annotated proteins, which generates a model that learns to predict the wrong labels. Secondly, the homology reduction threshold of 80% used for constructing the Höglund dataset might not be stringent enough, since it produces similar training and test examples. In Figure 2, we compared the attention context vector for models trained on either the DeepLoc or Höglund datasets. For the Höglund trained model, all locations are almost perfectly separated, implying that there is little variation within the Höglund dataset classes and that the training and test sets are relatively similar. This supports the finding of poor generalization performance for models trained on the Höglund dataset. Hence, we believe that the high performance reported for algorithms trained on this dataset actually results from overfitting. The true variation within each protein class is larger, as indicated by the better generalization performance for models trained on the DeepLoc dataset. This is further corroborated by the poorer separation of classes for the DeepLoc trained models in the same figure.

We compared the performance of the final DeepLoc model with other current prediction algorithms in Table 7. We found that the DeepLoc model performs significantly better than the other approaches. Here we note that the DeepLoc performance is a true test set performance, whereas the performances of the other methods may be overestimated, since some sequences in our test set may have been included in their training sets. Further, we emphasize that the DeepLoc method is a purely sequence-based method and does not rely on annotation information from homologous proteins. Due to the stringent homology partitioning applied in the dataset construction, the model should generalize to new proteins without known close homologues. We note that we also compared the performance against the LocTree3 prediction method (Goldberg et al., 2014), which is a combination of LocTree2 and a BLAST search of a database of proteins with known subcellular location. However, as 75% of the proteins in the DeepLoc test set are also in the LocTree3 BLAST database, the measured accuracy was artificially high at 91%, since LocTree3 simply retrieves the same subcellular location used for labelling our test set.

The compartment-specific prediction performance of the final DeepLoc model is shown in Table 5. The main source of error is the low performance on the Golgi apparatus, lysosome/vacuole and peroxisome. One possible cause is the low number of samples used to train these classes. However, this finding could also be associated with the similarity between the proteins from these locations and other compartments. For example, Table 5 shows that the lysosome/vacuole is usually misclassified as cell membrane and the peroxisome as cytoplasm. In addition to the mentioned under-represented classes, proteins from the cytoplasm and nucleus are also difficult to differentiate (Fig. 2, Table 4) because they both lack N-terminal sorting signals. The only difference between them is the nuclear localization signal (NLS), which is a highly variant short sequence that can be located in multiple regions of the protein sequence, making it hard to recognize.

Figure 3 shows the positions in the sequence that the attention mechanism focuses on to generate the attention context vector c_r. For the nuclear and cytoplasmic proteins, the model focuses on the beginning of the sequence (checking for the absence of an N-terminal sorting signal). Moreover, the model also gives importance to small regions across the sequence. The main difference is that there is a higher density of these regions in nucleus examples than in cytoplasm, which could indicate that the model is able to identify some of the most represented NLS. Figure 3 allows us to visualize what regions in the sequence are relevant for each subcellular localization to perform the prediction. For the extracellular proteins, the model focuses mainly on the signal peptide, which can be seen as a small region at the N-terminus of the sequence. In contrast, the attention is scattered across the sequence for plasma membrane proteins, which could indicate that the algorithm is detecting the transmembrane helices. For the ER proteins, we can see attention at the N-terminus, where the signal peptide is located, and also some attention at the C-terminus, which could indicate the presence of KDEL or KKXX signals. Golgi proteins have the importance on the N-terminus slightly shifted to the right, in comparison with other proteins from the secretory pathway, as they are mostly type II transmembrane proteins with signal anchors. Mitochondrial and chloroplastic proteins have large regions at the N-terminus, which clearly correlates with the mitochondrial and chloroplastic transit peptides. The lysosomal/vacuolar proteins do not seem to have a clear important region across their sequences. Finally, for peroxisomal proteins, some regions at the N-terminus and at the C-terminus are observed, which could mean that the model is detecting PTS2 and PTS1 signals.

5 Conclusion
We have shown that convolutional BLSTM neural networks with an attention mechanism are able to accurately predict protein subcellular localization, and whether a protein is membrane-bound or soluble, using only the sequence information. Further, we have introduced the DeepLoc dataset. The DeepLoc model trained on this dataset is able to generalize better than models trained on previous datasets for subcellular localization. In addition, DeepLoc obtained the highest accuracy on the independent test set when compared with the current methods.

There are several perspectives of this project that we would like to pursue in the future. One of those is to make better use of existing knowledge about sorting signals. DeepLoc 1.0 is trained in a relatively 'naive' way, where the networks have been provided only with protein profiles and their location labels. It would be beneficial to explicitly model known sorting signals such as N-terminal signal peptides and transit peptides. In addition, it should be investigated whether performance can be enhanced by training several models with a narrower taxonomical scope instead of treating all eukaryotes with one model. Obviously, animals and fungi do not have plastids, and some false predictions could be avoided by disallowing plastid predictions for these groups, but more subtle differences between sorting signals are also known to exist. However, there is a trade-off between the precision of the taxonomical scope and the sizes of the training datasets. For taxonomic groups with limited numbers of data with experimentally known subcellular location, it may be necessary to employ semi-supervised learning, where unlabelled data from genome sequences are used along with labelled data.

Acknowledgements
The authors wish to thank Konstantinos Tsirigos and Arne Elofsson of Stockholm University for permission to use their fast profile construction method in DeepLoc, even though it has not been published yet. In addition, they want to thank Fabian Aicheler of University of Tübingen for kindly running the DeepLoc test set on YLoc.

Funding
S.K.S. and O.W. were supported by a grant from the Novo Nordisk Foundation and the NVIDIA Corporation with the donation of TITAN X GPUs.

Conflict of Interest: none declared.

References
Bahdanau, D. et al. (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Baldi, P. et al. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.
Blum, T. et al. (2009) MultiLoc2: integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction. BMC Bioinformatics, 10, 1.
Briesemeister, S. et al. (2009) SherLoc2: a high-accuracy hybrid method for predicting subcellular localization of proteins. J. Proteome Res., 8, 5363–5366.
Briesemeister, S. et al. (2010) YLoc – an interpretable web server for predicting subcellular localization. Nucleic Acids Res., 38, W497–W502.
Chou, K.-C. et al. (2011) iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS ONE, 6, e18258.
Dieleman, S. et al. (2015) Lasagne: First Release. Zenodo, Geneva, Switzerland.
Emanuelsson, O. et al. (2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protoc., 2, 953–971.
Goldberg, T. et al. (2012) LocTree2 predicts localization for all domains of life. Bioinformatics, 28, i458–i465.
Goldberg, T. et al. (2014) LocTree3 prediction of localization. Nucleic Acids Res., 42, W350–W355.
Gorodkin, J. (2004) Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem., 28, 367–374.
Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89, 10915–10919.
Hobohm, U. et al. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.
Hochreiter, S. and Schmidhuber, J. (1997) Long short-term memory. Neural Comput., 9, 1735–1780.
Höglund, A. et al. (2006) MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics, 22, 1158–1165.
Horton, P. et al. (2007) WoLF PSORT: protein localization predictor. Nucleic Acids Res., 35, W585–W587.
Hung, M.-C. and Link, W. (2011) Protein localization in disease and therapy. J. Cell Sci., 124, 3381–3392.
Imai, K. and Nakai, K. (2010) Prediction of subcellular locations of proteins: where to proceed? Proteomics, 10, 3970–3983.
Kraus, O.Z. et al. (2017) Automated analysis of high-content microscopy data with deep learning. Mol. Syst. Biol., 13, 924.
Li, W. and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
Matthews, B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.
Prlić, A. et al. (2000) Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng., 13, 545–550.
Shatkay, H. et al. (2007) SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics, 23, 1410–1417.
Sønderby, S.K. et al. (2015) Convolutional LSTM networks for subcellular localization of proteins. In: International Conference on Algorithms for Computational Biology, volume 9199 of Lecture Notes in Computer Science, pp. 68–80. Springer.
Theano Development Team (2016) Theano: a Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688.
The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169.
Tsirigos, K.D. et al. (2015) The TOPCONS web server for consensus prediction of membrane protein topology and signal peptides. Nucleic Acids Res., 43, W401–W407.
Wan, S. and Mak, M.-W. (2015) Machine Learning for Protein Subcellular Localization Prediction. De Gruyter, Berlin, Germany.
Yu, C.-S. et al. (2006) Prediction of protein subcellular localization. Proteins, 64, 643–651.
Zhou, Z.-H. and Liu, X.-Y. (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowledge Data Eng., 18, 63–77.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).
Issue Section: Sequence analysis
Associate Editor: John Hancock