DeepLoc: prediction of protein subcellular localization using deep learning


José Juan Almagro Armenteros (1,2,*), Casper Kaae Sønderby (2), Søren Kaae Sønderby (2), Henrik Nielsen (1), Ole Winther (2,3)

(1) Department of Bio and Health Informatics, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
(2) The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen N, Denmark
(3) DTU Compute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
(*) To whom correspondence should be addressed.

Bioinformatics, Volume 33, Issue 21, 01 November 2017, Pages 3387–3395, https://doi.org/10.1093/bioinformatics/btx431
Received: 16 March 2017; Revision received: 06 June 2017; Accepted: 03 July 2017; Published: 07 July 2017
A correction has been published: Bioinformatics, Volume 33, Issue 24, 15 December 2017, Page 4049, https://doi.org/10.1093/bioinformatics/btx548
Abstract

Motivation: The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics due to its relevance in proteomics research. Many machine learning methods have been successfully applied to this task, but in most of them predictions rely on annotation of homologues from knowledge databases. For novel proteins where no annotated homologues exist, and for predicting the effects of sequence variants, it is desirable to have methods for predicting protein properties from sequence information only.

Results: Here, we present a prediction algorithm using deep neural networks to predict protein subcellular localization relying only on sequence information. At its core, the prediction model uses a recurrent neural network that processes the entire protein sequence and an attention mechanism identifying protein regions important for the subcellular localization. The model was trained and tested on a protein dataset extracted from one of the latest UniProt releases, in which experimentally annotated proteins follow more stringent criteria than previously. We demonstrate that our model achieves a good accuracy (78% for 10 categories; 92% for membrane-bound or soluble), outperforming current state-of-the-art algorithms, including those relying on homology information.

Availability and implementation: The method is available as a web server at http://www.cbs.dtu.dk/services/DeepLoc. Example code is available at https://github.com/JJAlmagro/subcellular_localization. The dataset is available at http://www.cbs.dtu.dk/services/DeepLoc/data.php.

1 Introduction

Proteins fulfil a wide diversity of functions inside the various compartments of eukaryotic cells. The function of a protein depends on the compartment or organelle where it is located, as it provides a physiological context for its function. However, aberrant protein subcellular localization can affect the function that a protein exhibits and contributes to the pathogenesis of many human diseases, such as metabolic, cardiovascular and neurodegenerative diseases, as well as cancer (Hung and Link, 2011). Therefore, predicting the subcellular localization of proteins is an essential task which has been extensively studied in bioinformatics (Emanuelsson et al., 2007; Imai and Nakai, 2010; Wan and Mak, 2015). Most of the current machine learning methods for subcellular localization prediction extract a fixed number of features from the protein sequences and use this fixed-length representation as input to a non-linear classifier such as a support vector machine (SVM). However, sequence-based models, which process one position at a time, are more natural for this task as they can learn and make inferences from input of varying length. Unfortunately, these models have not been competitive with non-linear classifiers until recently.

In this paper we take advantage of progress in deep learning, specifically recurrent neural networks (RNNs) with long short-term memory (LSTM) cells, attention models and convolutional neural networks (CNNs), to propose an end-to-end sequence-based model. LSTMs contain memory cells that can hold information from past inputs to the network for, in principle, an arbitrary number of positions (Hochreiter and Schmidhuber, 1997). Attention (Bahdanau et al., 2014) makes it possible to detect sorting signals in proteins regardless of their position in the sequence. In addition, CNNs are able to train filters that detect short motifs in the input sequence irrespective of where they occur, and have shown promising performance for protein subcellular localization when combined with LSTMs (Sønderby et al., 2015). We also propose a hierarchical tree likelihood mimicking the biology of the sorting pathway and a transfer learning approach to jointly predict subcellular localization and whether the protein is membrane-bound or soluble.

In the following we discuss some of the caveats with the datasets used in previous subcellular localization tools.
First, many methods use homology information for prediction, either by directly using subcellular location annotations of retrieved hits in a database search, as in LocTree3 (with an accuracy of 80% for 18 locations) (Goldberg et al., 2014), or by taking hints from other types of annotation such as GO terms, as in iLoc-Euk and YLoc (Briesemeister et al., 2010; Chou et al., 2011), or PubMed abstracts linked to the protein's Swiss-Prot entry, as in SherLoc (Briesemeister et al., 2009). These methods are appropriate for annotated proteins or proteins with annotated close homologues. Nonetheless, it should be taken into account that the performance will be much lower for sequences without well-annotated homologues, precisely the sequences for which it would be most relevant to have working prediction methods. In addition, any homology-based method will have very limited chances of being able to predict the consequences of mutations affecting sorting signals, because the wild type and the variant would probably pick up the same homologues in a database search.

Second, the performance of machine learning algorithms depends crucially on the datasets used to train and test them. For protein subcellular localization, a key aspect is that proteins should have experimental evidence for their subcellular location, so that predictions are not based on predictions in a circular fashion. However, current methods use data from UniProt (The UniProt Consortium, 2017) prior to release 2014_09, when a major change in the annotation standards took place. Before the change, an annotation was regarded as experimental if it lacked qualifiers such as 'Potential', 'Probable' or 'By similarity'; after the change, only annotations with a specific literature reference were annotated as being experimental (evidence code ECO:0000269). This resulted in a considerable decrease in the number of proteins with a subcellular location regarded as experimentally confirmed, thus raising the issue that current methods may in part be trained and tested on questionable examples.

Another aspect of the dataset issue is that the amount of homology between training data and test data should be kept at a minimum (Hobohm et al., 1992). The measured test performance should be a true measure of the predictive performance on new proteins and not just a measure of how good the method is at finding homologues with the same subcellular location. Unfortunately, the Höglund dataset (Höglund et al., 2006), which has been used in the training and test of several methods (Blum et al., 2009; Briesemeister et al., 2009, 2010; Shatkay et al., 2007; Sønderby et al., 2015), is only homology reduced to 80% identity. This means that rather close homologues to the training data will occur in the test set, which results in overly optimistic performances that do not reflect the true generalization to new, unseen proteins. An example of a state-of-the-art method that uses this dataset is SherLoc2, which reports an accuracy of 93% for 11 locations.

This paper has four major contributions:

We construct a new dataset from a recent version of UniProt where proteins have experimental evidence for their subcellular locations according to the new, stricter definition. We perform stringent homology partitioning to avoid overfitting, providing realistic accuracy measures on new proteins.

We show that models trained on the Höglund dataset have poor generalization performance on our new dataset. This reflects the high level of homology and possibly erroneous annotations in the old dataset.

We develop deep recurrent neural networks for the protein subcellular localization task with a number of novel state-of-the-art model features. This includes convolutional motif detectors, selective attention on sequence regions important for subcellular localization prediction and a novel hierarchical sorting likelihood. These features are used for interpretation of the model and predictions. Our networks show improved prediction accuracy without using homology information.

We implement the resulting model as a user-friendly web server called DeepLoc. (Concurrently with our work, Kraus et al. (2017) have introduced a method for protein subcellular location from cell image data, also called DeepLoc.)
2 Materials and methods

2.1 Neural network models

The deep learning neural network model used is described in detail below. Figure 1 and the following description give a summary of the architecture used. The input is sequence length (= 1000) × size of the amino acid vocabulary (= 20). The CNN extracts motif information using 120 filters of different sizes (20 for each of the sizes 1, 3, 5, 9, 15 and 21). This gives a 1000 × 120 feature map. Another convolutional layer of 128 filters of size 3 × 120 is applied to this feature map. This gives a 1000 × 128 feature map which is used as input to the recurrent layer. The recurrent neural network scans the sequence using 256 LSTM units in both directions, giving in total a 1000 × 512 dimensional output. The attention decoding layer uses an LSTM with 512 units through 10 decoding steps, and the attention mechanism feed-forward neural network (FFN) has 256 units. The final fully connected dense layer is composed of 512 units, and the two output layers have one unit (membrane-bound) and 10 units (subcellular localization).

Fig. 1. (A) The convolutional neural network (CNN) extracts motif information using different motif sizes. (B) The recurrent neural network scans the sequence in both directions, extracting the spatial dependencies between amino acids. (C) The attention mechanism assigns higher importance to amino acids that are relevant for the prediction. At each decoding step, the attention weights α are generated based on the hidden states from the RNN and the hidden states from the previous decoding step. The weighted average of these weights at the last decoding step is used as input to a fully connected dense layer. (D) All the information gathered from the protein sequence is passed to a softmax function and a hierarchical tree of sorting pathways to calculate the final prediction.

We learn a subcellular localization model which predicts the subcellular localization using the amino acid sequence as input:

y = f_θ(X),   (1)

where y is the predicted localization, f is the prediction model parametrized by parameters θ and X is the input data sequence of size L × N, where L is the protein length and N is the number of input features per sequence position. The parameters θ are optimized using stochastic gradient descent with cross-entropy loss between the true and predicted localization distributions.

In practice, the length of protein sequences can vary from tens to thousands of amino acids, posing a challenge for many prediction algorithms requiring a fixed-size input representation. Instead, recurrent neural networks (RNNs), which naturally handle varying input sequence lengths, were used. The network applies a recurrent calculation at each sequence position t:

h_t = f_E(x_t, h_{t-1}),   t = 1…L,   (2)

where f_E is an RNN denoted the encoder, x_t is the input features of X at position t and h = [h_1, …, h_L] are the hidden states of the RNN, where h_t is a vector of the same length as the number of hidden units in the RNN. The encoder can be viewed as a trainable feature extractor encoding the amino acid sequence into a feature space suitable for subcellular localization prediction. Naively, the final subcellular location y could be predicted by applying a classifier f_y to the final hidden state of the encoder h_L:

y = f_y(h_L).   (3)
However, this approach is not ideal for several reasons. Firstly, the RNN has to remember all useful information across the entire, often very long, input sequence. In subcellular localization this is especially problematic, since most of the information is known to reside in the beginning (N-terminus) and end (C-terminus) of the sequence. Secondly, all information about the protein has to be compressed into the same size vector regardless of the length of the protein. Two different solutions were used to alleviate these problems: bidirectional RNNs and attention RNNs.

In bidirectional RNNs, the protein sequence is processed both forwards and backwards by two separate RNNs, and the input to the final classifier is then the concatenation of the last hidden states of both RNNs. The forwards and backwards RNNs will then be better at remembering motifs in the C-terminus and N-terminus, respectively. Nevertheless, for long sequences these models still have to remember information across many steps. To solve this problem, as well as identify protein regions important for classification, we augmented the bidirectional RNN encoder with an attentive decoder (Bahdanau et al., 2014). Using the last hidden state of the encoder h_L as input, the attentive decoder f_D is run for D decoding steps. Note that D does not depend on the input sequence length L. At each step, the hidden state of the attentive decoder d_r is used by an attention function f_A to assign a normalized importance weight to each sequence position of the encoder hidden states h = [h_1, …, h_L]:

d_r = f_D(h_L, d_{r-1}, c_{r-1}),   r = 1…D,   (4)

e_{t,r} = f_A(h_t, d_r) = tanh(h_t W_e + d_{r-1} W_d) v^T,   (5)

α_{t,r} = exp(e_{t,r}) / Σ_{t'=1..L} exp(e_{t',r}),   (6)

where d_r is the hidden state of the decoder at step r, and the matrices W_d and W_e and the column vector v are the trainable parameters of the attention function. d_r is a vector of the same size as the number of hidden units in the decoder LSTM, which can be different from the dimensionality of the encoder h_t. α_{t,r} are the normalized importance weights, and c_r is a weighted average of the encoder RNN hidden states calculated as

c_r = Σ_{t=1..L} α_{t,r} h_t.   (7)

The initial value of c_r, i.e. c_0, is a learned parameter vector that is trained as part of the neural network model. The subcellular localization is then predicted using the weighted average of the encoder RNN hidden states at the last step of the decoder:

y = f_y(c_D).   (8)

This allows the model to selectively assign weight to sequence positions important for classification, which reduces the need to remember all information across the entire length of the sequence. Both f_E and f_D are implemented as a special type of RNN unit called Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997). LSTMs share the same chain structure as RNNs, but the recurrent calculation is augmented with an internal memory cell capturing long-range dependencies.

Furthermore, convolutional filters were used to detect protein motifs. Here a filter, akin to a position-specific scoring matrix, is slid across the sequence. It will then detect a motif regardless of its position in the sequence. The weights of each filter can be adjusted to find the motifs that help to better predict each class. These new features created with a CNN can represent the inputs in a more abstract way, which, in combination with LSTMs, has been shown to be beneficial for protein classification (Sønderby et al., 2015).
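To make equations (4)–(7) concrete, the following is a minimal NumPy sketch of a single attention decoding step. The toy dimensions, the fixed decoder state and all variable names are illustrative assumptions; the actual DeepLoc implementation uses Lasagne/Theano LSTM cells and updates d_r with the decoder LSTM at every step (Eq. 4).

```python
import numpy as np

def attention_step(H, d_prev, W_e, W_d, v):
    """One attention decoding step (cf. Eqs. 5-7).

    H      : (L, n_enc)  encoder hidden states h_1..h_L
    d_prev : (n_dec,)    decoder hidden state from the previous step
    W_e    : (n_enc, n_att), W_d : (n_dec, n_att), v : (n_att,)
    Returns the attention weights alpha (L,) and the context vector c_r (n_enc,).
    """
    # e_{t,r} = tanh(h_t W_e + d_{r-1} W_d) v^T                  (Eq. 5)
    e = np.tanh(H @ W_e + d_prev @ W_d) @ v          # shape (L,)
    # alpha_{t,r}: softmax over sequence positions                (Eq. 6)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # c_r = sum_t alpha_{t,r} h_t                                 (Eq. 7)
    c = alpha @ H                                    # shape (n_enc,)
    return alpha, c

# Toy dimensions, assumed for illustration only.
L, n_enc, n_dec, n_att = 100, 512, 512, 256
rng = np.random.default_rng(0)
H = rng.normal(size=(L, n_enc))
d = rng.normal(size=(n_dec,))
W_e = 0.01 * rng.normal(size=(n_enc, n_att))
W_d = 0.01 * rng.normal(size=(n_dec, n_att))
v = 0.01 * rng.normal(size=(n_att,))

for r in range(10):                     # 10 decoding steps, as in DeepLoc
    alpha, c = attention_step(H, d, W_e, W_d, v)
    # In the full model, d would be updated by the decoder LSTM (Eq. 4);
    # it is kept fixed here purely to keep the sketch short.
print(alpha.shape, c.shape)             # (100,) (512,)
```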
2.2 Hierarchical tree likelihood

To include information from protein sorting pathways in our model, a hierarchical tree with multiple nodes was developed. Each node represents a binary decision attempting to assign the protein to the right pathway, from high-level to detailed classification. As an example, the first binary decision in the tree classifies proteins into the secretory or non-secretory pathway, whereas the last nodes separate related compartments such as mitochondria and chloroplasts; see Figure 1, panel D. The leaf nodes correspond to the final subcellular localizations, and the likelihood is calculated as the joint probability of decisions in the tree. So, for example, if we have decisions A, B, y, then according to the tree decomposition the probability of y given input sequence X is given by

P(y|X) = P(y|B,X) P(B|A,X) P(A|X).   (9)

An example path is A = Non-secretory pathway, B = N-terminal sequence and y = Mitochondrion. Each of the nine binary classifiers is implemented by a logistic output connected to the fully connected dense layer. By construction, the tree probabilities are normalized: Σ_y P(y|X) = 1. (A small numerical sketch of this decomposition is given after Table 1 below.)

2.3 Datasets

2.3.1 DeepLoc dataset

The protein data used to train DeepLoc were extracted from the UniProt database, release 2016_04 (The UniProt Consortium, 2017). The protein dataset was filtered using the following criteria: eukaryotic, not fragments (these could have the N-terminus or C-terminus missing), encoded in the nucleus, longer than 40 amino acids and experimentally annotated (ECO:0000269). Similar locations or subclasses of the same location were mapped to 10 main locations in order to increase the number of proteins per compartment. Furthermore, proteins were classified as membrane or soluble if they were found in either the membrane or the lumen of the organelle; if no information was provided, they were tagged as unknown. Finally, proteins with more than one subcellular localization were filtered out. A total of 13 858 proteins were obtained after the filtering process. The mapped sublocations and the number of proteins in each main localization are summarized in Table 1.

Table 1. Number of proteins in each location and sublocations that were grouped together under the same main location

Location | No. of proteins | Sublocations
Nucleus | 4043 | Envelope, inner and outer membrane, matrix, lamina, chromosome, nucleus speckle
Cytoplasm | 2542 | Cytoplasm (cytosol and cytoskeleton)
Extracellular | 1973 | Extracellular
Mitochondrion | 1510 | Envelope, inner and outer membrane, matrix, intermembrane space
Cell membrane | 1340 | Apical, apicolateral, basal, basolateral, lateral, cell membrane, cell projection
Endoplasmic reticulum (ER) | 862 | ER membrane and lumen, microsome, rough ER, smooth ER, sarcoplasmic reticulum
Plastid | 757 | Plastid membrane, stroma and thylakoid
Golgi apparatus | 356 | Golgi apparatus membrane and lumen
Lysosome/Vacuole | 321 | Contractile, lytic and protein storage vacuole, vacuole lumen and membrane, lysosome lumen and membrane
Peroxisome | 154 | Peroxisome matrix and membrane
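The sketch below illustrates the hierarchical tree likelihood of Section 2.2 (Eq. 9) for a small, hypothetical fragment of a sorting tree; the node structure and probabilities are assumptions made for illustration, whereas the actual DeepLoc tree has nine learned binary nodes (listed later in Table 4).

```python
# Minimal sketch of Eq. (9): leaf probabilities as products of binary decisions.
# The tree below is a simplified, hypothetical fragment of the full DeepLoc tree.
def leaf_probabilities(p_secretory, p_extracellular, p_nterminal, p_mito):
    """Each argument is the probability of taking one branch at a binary node."""
    p = {}
    # Secretory branch
    p["Extracellular"] = p_secretory * p_extracellular
    p["Cell membrane"] = p_secretory * (1 - p_extracellular)
    # Non-secretory branch with an N-terminal transit-peptide decision
    p["Mitochondrion"] = (1 - p_secretory) * p_nterminal * p_mito
    p["Plastid"]       = (1 - p_secretory) * p_nterminal * (1 - p_mito)
    p["Cytoplasm"]     = (1 - p_secretory) * (1 - p_nterminal)
    return p

probs = leaf_probabilities(0.2, 0.7, 0.6, 0.8)
print(probs)
print(sum(probs.values()))  # 1.0 by construction: the tree probabilities are normalized
```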
To ensure that the model generalizes to new data, a stringent homology partitioning was performed. Homologous proteins that fulfil a certain threshold of similarity were clustered as detailed below. Then, each cluster of homologous proteins was assigned to one of five folds, ensuring that similar proteins were not mixed between the different folds. PSI-CD-HIT (Li and Godzik, 2006) was used to cluster proteins at 30% identity or a 10^-6 E-value cutoff, with the alignment required to cover 80% of the shorter (redundant) sequence, which produced 8410 clusters for the whole dataset. The five folds generated had approximately the same number of proteins in each location. Four were used for training and validation, and one was held out for testing. (A sketch of this cluster-to-fold assignment is given at the end of this section.)

2.3.2 Höglund dataset

The Höglund dataset (Höglund et al., 2006) has been used to train both the MultiLoc and RNN prediction methods in Höglund et al. (2006) and Sønderby et al. (2015). This dataset consists of 5959 proteins with 11 possible locations (cytoplasm, nucleus, extracellular, mitochondria, plasma membrane, ER, chloroplast, Golgi apparatus, lysosome, vacuole and peroxisome) and is homology reduced to 80% identity. Apart from grouping together lysosomal and vacuolar proteins, no modifications were made to the dataset.
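As a concrete illustration of the homology partitioning described above, the sketch below assigns whole clusters to folds so that homologues never cross a fold boundary. The greedy size-balancing heuristic, the toy clusters and all names are illustrative assumptions; the actual procedure additionally balanced the number of proteins per location across folds.

```python
from collections import defaultdict

def assign_clusters_to_folds(clusters, n_folds=5):
    """clusters: dict mapping cluster_id -> list of protein ids.
    Returns a fold index per protein, keeping every cluster inside a single fold."""
    fold_sizes = [0] * n_folds
    fold_of_protein = {}
    # Greedy heuristic: place the largest clusters first into the emptiest fold.
    for cid, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
        fold = fold_sizes.index(min(fold_sizes))
        for prot in members:
            fold_of_protein[prot] = fold
        fold_sizes[fold] += len(members)
    return fold_of_protein

# Hypothetical toy clusters standing in for a PSI-CD-HIT run (30% identity / 1e-6 E-value).
clusters = {"c1": ["P1", "P2", "P3"], "c2": ["P4"], "c3": ["P5", "P6"],
            "c4": ["P7"], "c5": ["P8", "P9"]}
folds = assign_clusters_to_folds(clusters)
per_fold = defaultdict(list)
for prot, f in folds.items():
    per_fold[f].append(prot)
print(dict(per_fold))  # homologues (same cluster) always share a fold
```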
2.4 Comparison to current prediction algorithms

The performance of our models was compared with a number of current prediction algorithms using the following approaches: LocTree2 (Goldberg et al., 2012), MultiLoc2 (Blum et al., 2009) and SherLoc2 (Briesemeister et al., 2009) were run with local command-line versions installed on our own server, while CELLO (Yu et al., 2006), iLoc-Euk (Chou et al., 2011) and WoLF PSORT (Horton et al., 2007) were run on their web servers. YLoc (Briesemeister et al., 2010) was run offline by the maintainer of the web service. Results for YLoc are given with the option to include GO terms turned on. For MultiLoc2 and SherLoc2, a newer version of InterProScan (5.21-60) was used instead of the recommended one (4.4) due to compatibility problems with the older version. As a reference, the performance on the Höglund test set was measured on our local installations, obtaining an accuracy of 0.8300 for MultiLoc2 and 0.9179 for SherLoc2. In the cases where current methods predict more than ten locations, the predicted locations were mapped onto our ten locations. Two of the methods, iLoc-Euk and WoLF PSORT, in some cases predict dual locations (such as cytoplasm/nucleus). Since proteins with dual locations were filtered out in the construction of the dataset, those predictions were counted as erroneous, unless both predicted locations mapped to the same location in our classification.

2.5 Experiments

Two different sets of experiments were carried out. The first experiments were used for model selection, comparing the relative performances of the following model architectures:

Feed-forward neural network (FFN)
Bidirectional LSTM neural network (BLSTM)
BLSTM neural network with attention mechanism (A-BLSTM)
Convolutional BLSTM neural network with attention mechanism (ConvA-BLSTM)

Using the best model architectures, the second set of experiments was designed to test the generalization performance of models trained on either our new DeepLoc dataset or the Höglund dataset.

Hyperparameters were optimized on three of the four splits of the training data and the performance was evaluated on the last validation split. The hyperparameter selection was done using a uni-dimensional search where one hyperparameter was changed and the rest were kept fixed. If a hyperparameter had not yet been tested, the median value in the range of that hyperparameter was chosen. Each hyperparameter setting was run for 150 epochs (epoch = full pass over the training set) and the performance was measured as the highest seen performance on the validation set. This strategy was used for computational reasons, since a full grid search over all parameters was not computationally feasible.
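The uni-dimensional (coordinate-wise) hyperparameter search can be summarized by the sketch below; the hyperparameter ranges, the toy objective function and the exact loop structure are assumptions made for illustration and are not the search script used for DeepLoc.

```python
# Coordinate-wise hyperparameter search: vary one hyperparameter at a time,
# keeping the others fixed at their current best (initially median) values.
search_space = {                         # hypothetical ranges, for illustration only
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3],
    "n_lstm_units":  [128, 256, 512],
    "dropout":       [0.0, 0.25, 0.5],
}

def train_and_validate(config):
    # Stand-in for the real objective (150 training epochs + validation accuracy);
    # an arbitrary smooth toy function so the sketch runs end to end.
    score = -abs(config["learning_rate"] - 1e-3)
    score -= abs(config["dropout"] - 0.25)
    score += config["n_lstm_units"] / 1e4
    return score

def unidimensional_search(search_space, objective):
    # Start from the median value of every range.
    best = {k: v[len(v) // 2] for k, v in search_space.items()}
    best_score = objective(best)
    for name, values in search_space.items():
        for value in values:
            candidate = dict(best, **{name: value})
            score = objective(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score

best, score = unidimensional_search(search_space, train_and_validate)
print(best, round(score, 4))
```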
After the best hyperparameters were identified, a final run of experiments was used to identify the best combination of amino acid encodings among BLOSUM62 (Henikoff and Henikoff, 1992), sparse, protein profiles or HSDM encoding (Prlić et al., 2000). We further found that protein profiles gave the highest performance and included these as input features for the final models. The profiles were generated using the same method as the TOPCONS web server (Tsirigos et al., 2015).

The test performance was measured by training four models on the training set using the four different combinations of training and validation sets. The reported test performance is the average of the four models evaluated on the held-out test set. We stress that we never optimized any parameters on the test set, leaving the reported performances unbiased.

To decrease the training time, the maximum protein length was set to 1000. If a protein exceeded this length, amino acids from the middle of the sequence were removed in order not to lose information about the N-terminal and C-terminal sorting signals. 9.98% of the proteins were truncated using this rule.

The performance measurements used to assess our models were accuracy and the Gorodkin measure (Gorodkin, 2004). For the binary prediction, accuracy and the Matthews correlation coefficient (Matthews, 1975) (MCC) were used. The Gorodkin measure can be seen as a generalization of MCC that applies to K categories, which is more informative than the accuracy when there is an imbalance of classes. For K = 2, the Gorodkin measure squared is the 'generalized squared correlation' (GC2) of Baldi et al. (2000). (A short sketch of how the Gorodkin measure is computed from a confusion matrix is given after Table 2.)

All models were implemented in Python 2.7.11 using the neural network library Lasagne 0.2 (Dieleman et al., 2015) and Theano 0.9.0 (Theano Development Team, 2016) for efficient GPU implementation.

3 Results

We designed experiments to address the following questions:

What are the relative performances of the proposed neural network model architectures? → Section 3.1
How do the generalization performances of models trained on either the DeepLoc or Höglund datasets compare? → Section 3.2
How does the final DeepLoc model compare to current state-of-the-art protein subcellular prediction models? → Section 3.3

3.1 Model selection

In Table 2 we compare the performances of different model architectures trained on the DeepLoc dataset. Note that we are interested in the relative performance of the models. For this reason, we only used BLOSUM62 encodings as input features, which resulted in a slightly degraded performance compared to the final performances described in the following sections.

Table 2. Comparison of performances for different model architectures using BLOSUM62 input features

Model | Subcellular location accuracy | Subcellular location Gorodkin | Membrane accuracy | Membrane MCC
FFN | 0.5234 | 0.4229 | 0.7301 | 0.4509
BLSTM | 0.6925 | 0.6278 | 0.9004 | 0.8023
A-BLSTM | 0.7290 | 0.6729 | 0.9163 | 0.8345
ConvA-BLSTM | 0.7289 | 0.6780 | 0.9111 | 0.8218
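The Gorodkin measure reported in Table 2 and the following tables is the K-category correlation coefficient of Gorodkin (2004). A minimal sketch of one common way to compute it from a confusion matrix is given below; to the best of our knowledge this coincides with the multiclass Matthews correlation as implemented in scikit-learn, but the snippet and its toy counts are illustrative rather than the exact code used in the paper.

```python
import numpy as np

def gorodkin(confusion):
    """K-category correlation coefficient (Gorodkin, 2004) from a K x K confusion
    matrix with true classes in rows and predicted classes in columns."""
    C = np.asarray(confusion, dtype=float)
    n = C.sum()            # total number of samples
    c = np.trace(C)        # correctly classified samples
    t = C.sum(axis=1)      # true counts per class (row sums)
    p = C.sum(axis=0)      # predicted counts per class (column sums)
    num = c * n - t @ p
    den = np.sqrt((n**2 - p @ p) * (n**2 - t @ t))
    return num / den if den > 0 else 0.0

# Toy 3-class example (hypothetical counts, not taken from the paper's tables).
C = [[50, 3, 2],
     [4, 40, 6],
     [1, 5, 30]]
print(round(gorodkin(C), 4))
```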
The A-BLSTM and the ConvA-BLSTM models achieved the highest performance predicting the subcellular localization, with accuracies of 0.7290 and 0.7289, respectively. Comparing these results with the performance of the BLSTM without attention (accuracy 0.6925), we see that attention improves performance. These results confirm the benefit of selective, context-dependent attention for protein classification. All of the A-BLSTM models performed significantly better than the baseline FFN model, which achieved an accuracy of 0.5234. This is expected, since FFN models do not take into account the order of the amino acids, whereas the LSTM models naturally consider the relationships between amino acids. Furthermore, we observed that including 10 decoding steps in the attention mechanism increased the accuracy (a difference of 1%) in comparison with a single decoding step. Increasing the decoding steps beyond 10 resulted in a reduction in the accuracy. Lastly, the A-BLSTM models predicted whether the proteins were membrane-bound or soluble with accuracies of 0.9163 and 0.9111, respectively.

From the amino acid encoding comparison, we found that the ConvA-BLSTM model using protein profiles encoding had the highest accuracy, with a difference of 2% compared to the A-BLSTM model. Therefore, we decided to use this encoding and this model for the rest of the experiments.

3.2 Dataset comparison

To compare the generalization performance of models trained on either the DeepLoc or the Höglund datasets, we trained a ConvA-BLSTM model on each dataset and evaluated the performances on the test sets from both datasets. Table 3 shows that (i) the Höglund training set achieves a good test performance only on the Höglund test set and (ii) the DeepLoc training set achieves a good test performance on test sets with stringent independence between training and test sets.

Table 3. Comparison of generalization performances using the ConvA-BLSTM model between the DeepLoc dataset and the Höglund dataset

Training set | Test set | Accuracy | Gorodkin
DeepLoc | DeepLoc | 0.7511 | 0.6988
Höglund | DeepLoc | 0.6426 | 0.5756
DeepLoc | Höglund | 0.8301 | 0.8010
Höglund | Höglund | 0.9138 | 0.8979

Note: Sequence profiles were used as input features.
These results show that models trained on the Höglund dataset generalize poorly compared to models trained on the DeepLoc dataset. As a qualitative comparison of the two datasets, we visualized the context vectors c_r for ConvA-BLSTM models trained on both datasets, as seen in Figure 2. The compartments are notably more separated for the model trained on the Höglund dataset than for the model trained on the DeepLoc dataset.

Fig. 2. t-SNE representation of the context vector c_r for a ConvA-BLSTM trained on the DeepLoc and Höglund datasets and visualized for the respective test sets.

3.3 DeepLoc model

From the model comparisons we identified the ConvA-BLSTM as the best performing model architecture. To further improve prediction accuracy, we trained an ensemble of 16 models using nested cross-validation. Eight of the models were trained using a softmax output distribution (class probability from a softmax function) and eight of the models using the hierarchical tree distribution (joint probability of multiple logistic functions). Further, we mitigate the effect of the class imbalance by using a cost matrix (Zhou and Liu, 2006) to recalculate the class probabilities based on the number of samples in the training set (a simple sketch of such a reweighting is given after Table 4). The full ensemble achieved an accuracy of 0.7797 and a Gorodkin measure of 0.7347 on the subcellular localization, and an accuracy of 0.9234 and an MCC of 0.8435 on the membrane-bound or soluble prediction. We found that the softmax models had a slightly higher accuracy than the hierarchical tree models, with the two 8-model ensembles achieving accuracies of 0.7717 and 0.7695, respectively. We show in Table 4 the accuracy and the MCC for each binary decision in the hierarchical tree model. We experimented with increasing the ensemble size but found no improvement in performance.

Table 4. Accuracy and MCC of each node in the hierarchical tree

Node | Accuracy | MCC
Secretory/Non-secretory pathway | 0.9502 | 0.8902
Intracellular/Extracellular | 0.9507 | 0.8979
N-terminal sequences | 0.9544 | 0.8784
Intermediate compartment | 0.7982 | 0.5824
PTS | 0.9784 | 0.4085
Mitochondrion/Chloroplast signals | 0.9537 | 0.8955
Cell membrane/Lysosome | 0.8575 | 0.5002
ER/Golgi | 0.8559 | 0.6376
NLS | 0.8138 | 0.6031
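The class-imbalance correction above cites Zhou and Liu (2006); the snippet below shows one plausible instantiation in which predicted class probabilities are rescaled by inverse training-set frequency and renormalized. The weighting scheme is an assumption for illustration, not the exact cost matrix used in DeepLoc; the example class counts are taken from Table 1.

```python
import numpy as np

def reweight_probabilities(probs, train_counts):
    """Rescale predicted class probabilities by inverse training-set frequency.

    probs        : (K,) predicted class probabilities from the network
    train_counts : (K,) number of training samples per class
    """
    counts = np.asarray(train_counts, dtype=float)
    weights = counts.sum() / counts      # rarer classes get larger weights
    adjusted = np.asarray(probs) * weights
    return adjusted / adjusted.sum()     # renormalize to a distribution

# Hypothetical example with three imbalanced classes
# (counts for Nucleus, Golgi apparatus and Peroxisome from Table 1).
train_counts = [4043, 356, 154]
probs = [0.50, 0.30, 0.20]
print(reweight_probabilities(probs, train_counts).round(3))
```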
The training time for the full ensemble was 80 hours, approximately five hours per model. When testing, the ensemble takes three seconds per protein on average to perform a prediction. Nonetheless, this ensemble used protein profiles, which were already generated for this dataset. The profile generation is the most time-consuming step, usually taking approximately 30 seconds per protein. If a hit in the PFAM database is not found, the profile generation uses UniRef90 instead. This can take even longer and can therefore be problematic for large protein datasets. To address this, we trained the same ensemble using BLOSUM62 encoding. This model has an accuracy of 0.7360 and a Gorodkin measure of 0.6832 on the subcellular localization, and an accuracy of 0.9130 and an MCC of 0.8237 on the membrane-bound or soluble prediction. By omitting the profile generation, we achieve a faster prediction at the cost of a decrease in accuracy.

Tables 5 and 6 show the confusion matrices of the full ensemble described above for subcellular localization and membrane-bound prediction, respectively. The primary sources of error are confusion of the nucleus and cytoplasm, lysosome/vacuole misclassified as cell membrane and Golgi misclassified as cytoplasm. In Figure 3, we show the attention vector α, i.e. how important different regions of the sequence are for the classification. In general, the DeepLoc model assigns large importance to the N-terminus for secreted proteins, whereas e.g. membrane proteins have regions of importance interspersed across the protein length.

Table 5. Confusion matrix of the test set on the final DeepLoc model using profiles encoding

Location | Number of predicted proteins | Sens. | MCC
Nucleus | 680 103 4 5 2 8 1 2 2 1 | 0.842 | 0.784
Cytoplasm | 94 361 7 18 5 4 3 8 1 7 | 0.711 | 0.608
Extracellular | 3 5 365 5 5 4 2 0 4 0 | 0.929 | 0.907
Mitochondrion | 9 21 0 247 0 5 14 2 1 3 | 0.818 | 0.812
Cell membrane | 5 15 6 1 203 20 1 4 18 0 | 0.744 | 0.732
Endoplasmic reticulum | 3 6 6 3 18 120 1 7 8 1 | 0.694 | 0.654
Plastid | 1 2 0 8 0 0 140 0 1 0 | 0.921 | 0.883
Golgi apparatus | 4 17 1 0 9 8 1 26 4 0 | 0.371 | 0.414
Lysosome/Vacuole | 0 7 11 1 20 9 0 4 12 0 | 0.188 | 0.194
Peroxisome | 0 13 0 4 1 4 0 0 0 8 | 0.267 | 0.321

Note: Sens., sensitivity. Rows are true locations; the ten counts in each row are the numbers of proteins predicted in each of the ten locations, in the same order as the rows.
Table 6. Confusion matrix for the membrane-bound predictor

Type | Predicted soluble | Predicted membrane-bound
Soluble | 968 | 38
Membrane-bound | 96 | 647

Fig. 3. Sequence importance across the protein sequence of the DeepLoc test set when making the prediction. The x-axis is the sequence position, and along the y-axis are the proteins in the test set, sorted according to protein localization. For visualization, proteins shorter than 1000 amino acids are padded from the middle, so that the N-terminus and C-terminus align. Proteins longer than 1000 amino acids have the middle part removed.

To compare the performance of the final DeepLoc model to other approaches, we benchmarked a number of current prediction algorithms on the DeepLoc test set, as seen in Table 7. The accuracy of the final DeepLoc model (0.7797) is significantly better than all other methods, with iLoc-Euk achieving the second best accuracy of 0.6820.

Table 7. Accuracy and Gorodkin measure achieved by current predictors and the final DeepLoc model on the DeepLoc test set

Method | Accuracy | Gorodkin
LocTree2 | 0.6120 | 0.5250
MultiLoc2 | 0.5592 | 0.4869
SherLoc2 | 0.5815 | 0.5112
YLoc | 0.6122 | 0.5330
CELLO | 0.5521 | 0.4543
iLoc-Euk | 0.6820 | 0.6412
WoLF PSORT | 0.5671 | 0.4785
DeepLoc | 0.7797 | 0.7347

Note: DeepLoc achieves the highest score in both columns.
4 Discussion

In this paper we have introduced the DeepLoc dataset: a well-assembled protein collection with reliable subcellular localization information. Secondly, we have provided a deep neural network based prediction algorithm achieving state-of-the-art performance on this new dataset. The context-dependent annotation vector generated by the attention mechanism is able to represent a protein based on its subcellular localization. In addition, the attention-based prediction method allows visualization of the biologically plausible regions used to predict the subcellular localization of the proteins, which we believe will provide relevant information.

The comparison of the generalization performances for models trained on our new DeepLoc dataset and the Höglund dataset showed that DeepLoc-trained models generalized much better than the Höglund-trained model. Here we discuss a number of explanations for these findings. Firstly, with the UniProt database change, the Höglund dataset could contain many wrongly annotated proteins, which produces a model that learns to predict the wrong labels. Secondly, the homology reduction threshold of 80% used for constructing the Höglund dataset might not be stringent enough, since it produces similar training and test examples. In Figure 2, we compared the attention context vector for models trained on either the DeepLoc or Höglund datasets. For the Höglund-trained model, all locations are almost perfectly separated, implying that there is little variation within the Höglund dataset classes and that the training and test sets are relatively similar. This supports the finding of poor generalization performance for models trained on the Höglund dataset. Hence, we believe that the high performance reported for algorithms trained on this dataset actually results from overfitting. The true variation within each protein class is larger, as indicated by the better generalization performance for models trained on the DeepLoc dataset. This is further corroborated by the poorer separation of classes for the DeepLoc-trained models in the same figure.

We compared the performance of the final DeepLoc model with other current prediction algorithms in Table 7. We found that the DeepLoc model performs significantly better than the other approaches. Here we note that the DeepLoc performance is a true test set performance, whereas the performances of the other methods may be overestimated, since some sequences in our test set may have been included in their training sets. Further, we emphasize that the DeepLoc method is a purely sequence-based method and does not rely on annotation information from homologous proteins. Due to the stringent homology partitioning applied in the dataset construction, the model should generalize to new proteins without known close homologues. We note that we also compared the performance against the LocTree3 prediction method (Goldberg et al., 2014), which is a combination of LocTree2 and a BLAST search of a database of proteins with known subcellular location. However, as 75% of the proteins in the DeepLoc test set are also in the LocTree3 BLAST database, the measured accuracy was artificially high at 91%, since LocTree3 simply retrieves the same subcellular location used for labelling our test set.

The compartment-specific prediction performance of the final DeepLoc model is shown in Table 5. The main source of error is the low performance on the Golgi apparatus, lysosome/vacuole and peroxisome. One possible cause is the low number of samples used to train these classes. However, this finding could also be associated with the similarity between the proteins from these locations and other compartments. For example, Table 5 shows that the lysosome/vacuole is usually misclassified as cell membrane and the peroxisome as cytoplasm. In addition to the mentioned under-represented classes, proteins from the cytoplasm and nucleus are also difficult to differentiate (Fig. 2, Table 4), because they both lack N-terminal sorting signals. The only difference between them is the nuclear localization signal (NLS), which is a highly variable short sequence that can be located in multiple regions of the protein sequence, making it hard to recognize.
Figure 3 shows the positions in the sequence that the attention mechanism focuses on to generate the attention context vector c_r. For the nuclear and cytoplasmic proteins, the model focuses on the beginning of the sequence (checking for the absence of an N-terminal sorting signal). Moreover, the model also gives importance to small regions across the sequence. The main difference is that there is a higher density of these regions in nucleus examples than in cytoplasm, which could indicate that the model is able to identify some of the most represented NLSs. Figure 3 allows us to visualize which regions in the sequence are relevant for each subcellular localization when performing the prediction. For the extracellular proteins, the model focuses mainly on the signal peptide, which can be seen as a small region at the N-terminus of the sequence. In contrast, the attention is scattered across the sequence for plasma membrane proteins, which could indicate that the algorithm is detecting the transmembrane helices. For the ER proteins we can see attention at the N-terminus, where the signal peptide is located, and also some attention at the C-terminus, which could indicate the presence of KDEL or KKXX signals. Golgi proteins have the importance at the N-terminus slightly shifted to the right in comparison with other proteins from the secretory pathway, as they are mostly type II transmembrane proteins with signal anchors. Mitochondrial and chloroplastic proteins have large regions at the N-terminus, which clearly correlates with the mitochondrial and chloroplastic transit peptides. The lysosomal/vacuolar proteins do not seem to have a clear important region across their sequences. Finally, for peroxisomal proteins, some regions at the N-terminus and at the C-terminus are observed, which could mean that the model is detecting PTS2 and PTS1 signals.

5 Conclusion

We have shown that convolutional BLSTM neural networks with an attention mechanism are able to accurately predict the protein subcellular localization, and whether a protein is membrane-bound or soluble, using only the sequence information. Further, we have introduced the DeepLoc dataset. The DeepLoc model trained on this dataset is able to generalize better than models trained on previous datasets for subcellular localization. In addition, DeepLoc obtained the highest accuracy on the independent test set when compared with the current methods.

There are several perspectives of this project that we would like to pursue in the future. One of those is to make better use of existing knowledge about sorting signals. DeepLoc 1.0 is trained in a relatively 'naive' way, where the networks have been provided only with protein profiles and their location labels. It would be beneficial to explicitly model known sorting signals such as N-terminal signal peptides and transit peptides. In addition, it should be investigated whether performance can be enhanced by training several models with a narrower taxonomical scope instead of treating all eukaryotes with one model. Obviously, animals and fungi do not have plastids, and some false predictions could be avoided by disallowing plastid predictions for these groups, but more subtle differences between sorting signals are also known to exist. However, there is a trade-off between the precision of the taxonomical scope and the sizes of the training datasets. For taxonomic groups with limited amounts of data with experimentally known subcellular location, it may be necessary to employ semi-supervised learning, where unlabelled data from genome sequences are used along with labelled data.

Acknowledgements

The authors wish to thank Konstantinos Tsirigos and Arne Elofsson of Stockholm University for permission to use their fast profile construction method in DeepLoc, even though it has not been published yet. In addition, they want to thank Fabian Aicheler of the University of Tübingen for kindly running the DeepLoc test set on YLoc.

Funding
S.K.S. and O.W. were supported by a grant from the Novo Nordisk Foundation and by the NVIDIA Corporation with the donation of TITAN X GPUs.

Conflict of Interest: none declared.

References

Bahdanau, D. et al. (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Baldi, P. et al. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.
Blum, T. et al. (2009) MultiLoc2: integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction. BMC Bioinformatics, 10, 1.
Briesemeister, S. et al. (2009) SherLoc2: a high-accuracy hybrid method for predicting subcellular localization of proteins. J. Proteome Res., 8, 5363–5366.
Briesemeister, S. et al. (2010) YLoc–an interpretable web server for predicting subcellular localization. Nucleic Acids Res., 38, W497–W502.
Chou, K.-C. et al. (2011) iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS ONE, 6, e18258.
Dieleman, S. et al. (2015) Lasagne: First Release. Zenodo, Geneva, Switzerland.
Emanuelsson, O. et al. (2007) Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protoc., 2, 953–971.
Goldberg, T. et al. (2012) LocTree2 predicts localization for all domains of life. Bioinformatics, 28, i458–i465.
Goldberg, T. et al. (2014) LocTree3 prediction of localization. Nucleic Acids Res., 42, W350–W355.
Gorodkin, J. (2004) Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem., 28, 367–374.
Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89, 10915–10919.
Hobohm, U. et al. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.
Hochreiter, S. and Schmidhuber, J. (1997) Long short-term memory. Neural Comput., 9, 1735–1780.
Höglund, A. et al. (2006) MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics, 22, 1158–1165.
Horton, P. et al. (2007) WoLF PSORT: protein localization predictor. Nucleic Acids Res., 35, W585–W587.
Hung, M.-C. and Link, W. (2011) Protein localization in disease and therapy. J. Cell Sci., 124, 3381–3392.
Imai, K. and Nakai, K. (2010) Prediction of subcellular locations of proteins: where to proceed? Proteomics, 10, 3970–3983.
Kraus, O.Z. et al. (2017) Automated analysis of high-content microscopy data with deep learning. Mol. Syst. Biol., 13, 924.
Li, W. and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
Matthews, B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (BBA) - Protein Struct., 405, 442–451.
Prlić, A. et al. (2000) Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng., 13, 545–550.
Shatkay, H. et al. (2007) SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics, 23, 1410–1417.
Sønderby, S.K. et al. (2015) Convolutional LSTM networks for subcellular localization of proteins. In: International Conference on Algorithms for Computational Biology, volume 9199 of Lecture Notes in Computer Science, pp. 68–80. Springer.
The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169.
Theano Development Team (2016) Theano: a Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688.
Tsirigos, K.D. et al. (2015) The TOPCONS web server for consensus prediction of membrane protein topology and signal peptides. Nucleic Acids Res., 43, W401–W407.
Wan, S. and Mak, M.-W. (2015) Machine Learning for Protein Subcellular Localization Prediction. De Gruyter, Berlin, Germany.
Yu, C.-S. et al. (2006) Prediction of protein subcellular localization. Proteins, 64, 643–651.
Zhou, Z.-H. and Liu, X.-Y. (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowledge Data Eng., 18, 63–77.

© The Author 2017. Published by Oxford University Press. All rights reserved.


