Extraction of information from born-digital PDF documents for reproducible research - Publication - Bridge of Knowledge

Abstract

Born-digital PDF documents might reasonably be expected to preserve enough of the useful data units of their source originals to produce executable papers for reproducible research. Unfortunately, developers of authoring tools may adopt arbitrary PDF generation strategies, producing a plethora of internal data representations. Common information units such as text paragraphs, tables, function graphs, and flow diagrams may therefore require numerous heuristics to handle each vendor-specific PDF file's content properly. We propose a generic Reverse MVC interpretation pattern that makes it possible to cope with that arbitrariness in a systematic way. It is a component of a larger framework we have been developing for making executable papers out of PDF documents without injecting any extra data or code into the PDF file.

Citations

  • CrossRef: 0
  • Web of Science: 0
  • Scopus: 0

Full text

Publication version
Accepted or Published Version
License
Creative Commons: CC-BY-NC-ND

Details

Category:
Articles
Type:
publication in another foreign scientific journal (foreign language only)
Published in:
Journal of Advanced Management, vol. 4, iss. 3, pages 238-244
ISSN: 2168-0787
Title of issue:
ICIME 2014: 2014 6th International Conference on Information Management and Engineering, pages 238-244
Language:
English
Publication year:
2016
Bibliographic description:
Wiszniewski B., Siciarek J. Extraction of information from born-digital PDF documents for reproducible research. Journal of Advanced Management, 2016, Vol. 4, iss. 3, pp. 238-244
DOI:
10.12720/joams.4.3.238-244
Verified by:
Gdańsk University of Technology
