Video Titling and Question-Answering — 臺灣博碩士論文知識加值系統 (National Digital Library of Theses and Dissertations in Taiwan)



Thesis record:

Author: Zeng, Kuo-Hao (曾國豪)
Thesis title: Video Titling and Question-Answering (影片標題產生與問答)
Advisor: Sun, Min (孫民)
Committee members: Chen, Kuan-Wen (陳冠文); Lin, Chia-Wen (林嘉文); Chen, Yun-Nung (陳縕儂)
Oral defense date: 2017-07-06
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Electrical Engineering (電機工程學系)
Field / discipline: Engineering / Electrical and Computer Engineering
Document type: Academic thesis
Year of publication: 2017 (graduating academic year 105)
Language: English
Pages: 51
Chinese keywords: 電腦視覺 (computer vision)、深度學習 (deep learning)、遞迴式神經網路 (recurrent neural network)、影片 (video)、標題 (title)、問答 (question answering)
English keywords: CV, DL, RNN, Video, Title, Question-Answering
Times cited: 0 | Views: 285 | Downloads: 31 | Bookmarked: 0

摘要 (Abstract, translated from Chinese):

Video titling and question answering are two important tasks toward high-level visual data understanding.

To address these two tasks, we propose a large-scale dataset and, in this work, demonstrate several models on it.

A good video title compactly describes the most salient event and captures the viewer's attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although automatically generating video titles is a very useful task, it has received much less attention than video captioning.

We address video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task.

First, we make video captioners highlight-sensitive by priming them with a highlight detector; our framework allows a single model to be trained jointly for title generation and video highlight localization.
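To make the joint objective concrete, here is a minimal PyTorch-style sketch, not the thesis implementation: a per-segment highlight scorer produces weights that modulate the video features fed to a sequence-to-sequence captioner, and the caption cross-entropy and a highlight-detection loss are summed into a single training objective. All module names, dimensions, and the unweighted sum of the two losses are illustrative assumptions.

```python
# Minimal sketch of highlight-sensitive captioning with a joint loss.
# Hypothetical architecture; dimensions and loss weighting are assumptions.
import torch
import torch.nn as nn

class HighlightSensitiveCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hid=512, vocab=10000):
        super().__init__()
        # Per-segment highlight scorer (one logit per video segment).
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hid), nn.ReLU(), nn.Linear(hid, 1))
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)
        self.embed = nn.Embedding(vocab, hid)
        self.decoder = nn.GRU(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, segs, captions):
        # segs: (B, T, feat_dim) segment features; captions: (B, L) token ids.
        hl_logits = self.scorer(segs).squeeze(-1)         # (B, T)
        weights = torch.sigmoid(hl_logits).unsqueeze(-1)  # highlight weights
        _, h = self.encoder(segs * weights)               # encode weighted video
        dec, _ = self.decoder(self.embed(captions[:, :-1]), h)  # teacher forcing
        return self.out(dec), hl_logits

model = HighlightSensitiveCaptioner()
ce, bce = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()
segs = torch.randn(4, 20, 2048)                    # toy batch of 4 videos
caps = torch.randint(0, 10000, (4, 12))            # toy title token ids
hl = torch.randint(0, 2, (4, 20)).float()          # per-segment highlight labels
word_logits, hl_logits = model(segs, caps)
loss = ce(word_logits.reshape(-1, 10000), caps[:, 1:].reshape(-1)) \
     + bce(hl_logits, hl)                          # joint title + highlight loss
loss.backward()
```

Because the highlight weights gate the encoder input, one backward pass updates both the detector and the captioner, which is the joint-training property described above.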

Second, we induce high sentence diversity in video captioners so that the generated titles are also diverse and catchy. This means that a large number of sentences is required to learn the sentence structure of titles. Hence, we propose a novel sentence augmentation method that trains the captioner with additional sentence-only examples that have no corresponding videos.
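One plausible way to realize such sentence-only training is sketched below: a single learnable placeholder stands in for the missing video features, so augmented sentences still update the shared decoder. The placeholder mechanism is an assumption made for illustration, not necessarily how the thesis associates augmented sentences with visual input.

```python
# Sketch of sentence augmentation: sentence-only batches reuse a learnable
# placeholder in place of real video features (an illustrative assumption).
import torch
import torch.nn as nn

class AugmentableCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hid=512, vocab=10000):
        super().__init__()
        self.placeholder = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)
        self.embed = nn.Embedding(vocab, hid)
        self.decoder = nn.GRU(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, sent, video=None):
        # video: (B, T, feat_dim), or None for a sentence-only batch.
        if video is None:
            video = self.placeholder.expand(sent.size(0), 1, -1)
        _, h = self.encoder(video)
        dec, _ = self.decoder(self.embed(sent[:, :-1]), h)
        return self.out(dec)

model = AugmentableCaptioner()
titles = torch.randint(0, 10000, (4, 12))   # sentence-only examples
logits = model(titles)                      # trains the language decoder alone
```

In training, such sentence-only batches (for example, extra crawled titles) would simply be interleaved with ordinary video-sentence batches; gradients from both flow into the same decoder, which is what lets the extra sentences teach it the structure of titles.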

For the video question-answering task, we propose to learn a deep model that answers free-form natural-language questions about the content of a video. We automatically harvest a large number of freely available online videos together with their descriptions, so that a large number of candidate QA pairs can be generated automatically rather than annotated by hand.
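As a toy illustration of how QA pairs can fall out of descriptions without manual annotation, the cloze-style sketch below replaces a detected entity with a wh-word. This is only an assumption-laden stand-in, not the question generator used in the thesis; it requires spaCy with its small English model installed, and the crude phrasing it produces is itself an example of the "non-perfect" pairs discussed next.

```python
# Toy cloze-style QA generation from a video description (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
WH = {"PERSON": "who", "GPE": "where", "LOC": "where",
      "FAC": "where", "DATE": "when", "TIME": "when"}

def generate_qa(description):
    """Turn each recognized entity into a (question, answer) candidate."""
    pairs = []
    for ent in nlp(description).ents:
        wh = WH.get(ent.label_)
        if wh is None:
            continue
        question = (description[:ent.start_char] + wh +
                    description[ent.end_char:]).rstrip(". ") + "?"
        pairs.append((question, ent.text))
    return pairs

print(generate_qa("A man plays guitar in Central Park on Sunday."))
# e.g. [('A man plays guitar in where on Sunday?', 'Central Park'),
#       ('A man plays guitar in Central Park on when?', 'Sunday')]
```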

Next, we use these candidate QA pairs to train several video-based QA methods extended from MN, VQA, SA, and SS. To handle non-perfect candidate QA pairs, we propose a self-paced learning procedure that iteratively identifies them and mitigates their effect on training. To demonstrate our ideas, we collected the large-scale Video Titles in the Wild (VTW) dataset of 18,100 automatically crawled user-generated videos and titles.
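The self-paced procedure can be sketched in a few lines: each example keeps its weight only while its current loss is below a threshold that grows over training, so noisy auto-generated QA pairs are excluded early and admitted, if at all, only once the model is stronger. The per-example-loss method and the threshold schedule below are hypothetical; only the hard 0/1 weighting itself is the standard self-paced formulation.

```python
# Sketch of hard self-paced weighting for noisy QA pairs.
import torch

def self_paced_weights(losses, lam):
    # v_i = 1 if loss_i < lambda else 0 (the hard self-paced regime).
    return (losses.detach() < lam).float()

def train_epoch(model, optimizer, batches, lam):
    for questions, answers in batches:
        # per_example_loss is a hypothetical API returning a (B,) loss vector.
        losses = model.per_example_loss(questions, answers)
        v = self_paced_weights(losses, lam)            # 0/1 sample weights
        loss = (v * losses).sum() / v.sum().clamp(min=1.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# lam starts small and grows between epochs (e.g. lam *= mu with mu > 1),
# so progressively harder / noisier QA pairs enter training over time.
```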

We then use an automatic QA generator to produce a large number of QA pairs for training, and additionally collect manually generated QA pairs from Amazon Mechanical Turk.

On VTW, our methods consistently improve title-prediction accuracy and achieve the best performance under both automatic metrics and human evaluation; our sentence augmentation method also outperforms the baselines on the M-VAD dataset. Finally, the results show that our self-paced learning procedure is effective and that the extended SS model outperforms various baseline models.

外文摘要 (English abstract):

Video titling and question answering are two important tasks toward high-level visual data understanding. To address these two tasks, we propose a large-scale dataset and demonstrate several models on it in this work.

A great video title describes the most salient event compactly and captures the viewer's attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although generating a video title automatically is a very useful task, it is much less addressed than video captioning. We address video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task. First, we make video captioners highlight sensitive by priming them with a highlight detector. Our framework allows for jointly training a model for title generation and video highlight localization. Second, we induce high sentence diversity in video captioners, so that the generated titles are also diverse and catchy. This means that a large number of sentences might be required to learn the sentence structure of titles. Hence, we propose a novel sentence augmentation method to train a captioner with additional sentence-only examples that come without corresponding videos.

For the video question-answering task, we propose to learn a deep model to answer free-form natural-language questions about the contents of a video. We built a program that automatically harvests a large number of videos and descriptions freely available online; a large number of candidate QA pairs are then generated automatically from the descriptions rather than manually annotated. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN, VQA, SA, and SS. In order to handle non-perfect candidate QA pairs, we propose a self-paced learning procedure to iteratively identify them and mitigate their effects in training. To demonstrate our idea, we collected a large-scale Video Titles in the Wild (VTW) dataset of 18,100 automatically crawled user-generated videos and titles. We then utilize an automatic QA generator to generate a large number of QA pairs for training and collect manually generated QA pairs from Amazon Mechanical Turk.

On VTW, our methods consistently improve title prediction accuracy and achieve the best performance in both automatic and human evaluation. Our sentence augmentation method also outperforms the baselines on the M-VAD dataset. Finally, the video question-answering results show that our self-paced learning procedure is effective, and the extended SS model outperforms various baselines.
目次 (Table of Contents):

摘要 (Abstract in Chinese) ii
Abstract iv
1 Introduction 1
  1.1 Motivation 1
  1.2 Problem Description 3
  1.3 Main Contribution 4
2 Related Work 7
  2.1 Video Captioning 7
  2.2 Video Highlight Detection 8
  2.3 Video Captioning Datasets 9
  2.4 Image-QA 10
  2.5 Question Generation 10
  2.6 Video-QA 11
3 Dataset Collection 12
  3.1 Video Titling Dataset 13
    3.1.1 Collection of Curated UGVs 13
    3.1.2 Dataset Comparison 14
  3.2 Video Question Answering Dataset 16
    3.2.1 Questions Generation (QG) 17
    3.2.2 Questions and Answers Analysis 18
4 Method 21
  4.1 From Caption to Title 21
  4.2 Video Captioning 21
  4.3 Highlight Sensitive Captioning 23
  4.4 Sentence Augmentation 24
  4.5 Mitigating the Effect of Non-perfect QA Pairs 26
  4.6 Extended Methods 27
5 Experiment 29
  5.1 Video Titling 29
    5.1.1 Implementation of Highlight Detector 32
    5.1.2 Baseline Methods 33
    5.1.3 Implementation of S2VT and SA 34
    5.1.4 Results 35
  5.2 Video Question Answering 39
    5.2.1 Implementation Details 39
    5.2.2 Training Details 41
    5.2.3 Evaluation Metrics 41
    5.2.4 Results 42
6 Conclusion and Future Work 45
  6.1 Conclusion 45
  6.2 Future Work 45
References 47


