The core of rebib is a parser that reads bibliographic data and sorts it into fields according to matching regular expressions.
The first stage is a small step that reads the embedded bibliography from the LaTeX document. It also filters out commented-out lines so that unintended entries are not read.
Lastly, the data is split into entries, using the LaTeX macro \bibitem as the marker for a new entry, and the resulting list of entries is stored in a variable.
file_name <- rebib:::get_texfile_name(your_article_path)
bib_items <- rebib:::extract_embeded_bib_items(your_article_path,file_name)
bib_items[[1]]
#> [1] "\\bibitem[Ihaka, Ross and Gentleman, Robert]{ihaka:1996}"
#> [2] "Ihaka, Ross and Gentleman, Robert"
#> [3] "\\newblock \\emph{R: A Language for Data Analysis and Graphics.}"
#> [4] "\\newblock \\emph{Journal of Computational and Graphical Statistics}, 3:\\penalty0"
#> [5] "299--314, 1996."
#> [6] "\\newblock URL : \\url{https://doi.org/10.1080/10618600.1996.10474713}"
bib_items[[2]]
#> [1] "\\bibitem[R Core Team]{R}"
#> [2] "R Core Team"
#> [3] "\\newblock R: A Language and Environment for Statistical Computing"
#> [4] "\\newblock \\emph{R Foundation for Statistical Computing}, Vienna, Austria \\penalty0 2016."
#> [5] "\\newblock URL : \\url{https://www.R-project.org/}, ISBN 3-900051-07-0"
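The splitting step described above can be sketched in base R. This is a minimal illustration and not rebib's actual implementation: the helper name split_bib_items and the sample input are assumptions, and the input is assumed to be the lines found inside the thebibliography environment.

```r
# Hypothetical helper (not part of rebib): split bibliography lines into
# one character vector per \bibitem entry.
split_bib_items <- function(bib_lines) {
  # Drop lines that are commented out (a leading % after optional spaces)
  bib_lines <- bib_lines[!grepl("^[[:space:]]*%", bib_lines)]
  # Drop blank lines
  bib_lines <- bib_lines[nzchar(trimws(bib_lines))]
  # Each line starting with \bibitem opens a new entry
  starts <- grepl("^[[:space:]]*\\\\bibitem", bib_lines)
  group <- cumsum(starts)
  # Lines before the first \bibitem belong to no entry (group 0)
  keep <- group > 0
  unname(split(bib_lines[keep], group[keep]))
}

example_lines <- c(
  "% \\bibitem{old} a commented-out entry",
  "\\bibitem[R Core Team]{R}",
  "R Core Team",
  "\\newblock R: A Language and Environment for Statistical Computing",
  "",
  "\\bibitem[Ihaka, Ross and Gentleman, Robert]{ihaka:1996}",
  "Ihaka, Ross and Gentleman, Robert"
)
items <- split_bib_items(example_lines)
length(items)
#> [1] 2
```

The commented-out entry never reaches the result because the comment filter runs before the split.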
Each chunk of bibliographic data is then passed to a parser, which breaks it down using regular expressions. The logic is to use the LaTeX macro \newblock as a marker and to identify text elements by their position relative to it.
The first value to be parsed is the unique_id, also called the citation reference, which is used to cite the entry inside the article. It is usually on the first or second line of the entry. The position of the unique_id determines the position of the author names.
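As an illustration of how a citation key could be read, here is a small sketch. The function extract_unique_id is hypothetical and not rebib's own code. It assumes the key appears in braces after an optional bracketed label, and it joins the first two lines to cover a header that wraps onto the second line.

```r
# Hypothetical helper (not part of rebib): read the citation key from
# the \bibitem header of one entry.
extract_unique_id <- function(bib_item) {
  # The header may wrap, so look at the first two lines together
  header <- paste(head(bib_item, 2), collapse = " ")
  # \bibitem, an optional [label], optional spaces, then {key}
  pattern <- "\\\\bibitem(\\[[^]]*\\])?[[:space:]]*\\{([^}]*)\\}"
  m <- regmatches(header, regexec(pattern, header))[[1]]
  # m holds: full match, the optional label, the key
  if (length(m) == 0) NA_character_ else m[3]
}

id_one_line <- extract_unique_id(c(
  "\\bibitem[Ihaka, Ross and Gentleman, Robert]{ihaka:1996}",
  "Ihaka, Ross and Gentleman, Robert"
))
id_wrapped <- extract_unique_id(c(
  "\\bibitem[R Core Team]",
  "{R}",
  "R Core Team"
))
id_one_line
#> [1] "ihaka:1996"
```

Labels that themselves contain a closing bracket are not handled by this pattern.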
After reading the unique_id, the parser attempts to read the author name(s), which may span up to two lines (as is the case in most articles).
Next, the title is extracted based on the position of the \newblock markers or the end of the bib chunk.
This way, the crucial elements of the bibliographic entry (unique_id, author names and title) are parsed out.
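The position-based logic for authors and title can be sketched as follows. This is an assumption-laden illustration, not rebib's parser: split_author_title is a hypothetical name, the \bibitem header is assumed to occupy the first line only, and the title is assumed to be the first \newblock.

```r
# Hypothetical helper (not part of rebib): find author and title by their
# position relative to the \newblock markers.
split_author_title <- function(bib_item) {
  blocks <- grep("\\\\newblock", bib_item)
  if (length(blocks) == 0) {
    return(list(author = NA_character_, title = NA_character_,
                rest = character(0)))
  }
  first <- blocks[1]
  # Authors: lines between the header and the first \newblock, two at most
  author_lines <- if (first > 2) bib_item[2:min(first - 1, 3)] else character(0)
  # Title: from the first \newblock up to the next one, or the end of the chunk
  title_end <- if (length(blocks) > 1) blocks[2] - 1 else length(bib_item)
  title <- paste(bib_item[first:title_end], collapse = " ")
  # Strip the macros and braces, then a trailing full stop
  title <- gsub("\\\\newblock|\\\\emph|[{}]", "", title)
  title <- sub("\\.$", "", trimws(title))
  n <- length(bib_item)
  rest <- if (title_end < n) bib_item[(title_end + 1):n] else character(0)
  list(author = paste(author_lines, collapse = " "), title = title, rest = rest)
}

entry <- c(
  "\\bibitem[Ihaka, Ross and Gentleman, Robert]{ihaka:1996}",
  "Ihaka, Ross and Gentleman, Robert",
  "\\newblock \\emph{R: A Language for Data Analysis and Graphics.}",
  "\\newblock \\emph{Journal of Computational and Graphical Statistics}, 3:\\penalty0",
  "299--314, 1996."
)
parts <- split_author_title(entry)
parts$title
#> [1] "R: A Language for Data Analysis and Graphics"
```

Whatever follows the title is returned in rest, which corresponds to the remaining data discussed next.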
The remaining data is stored as journal internally and written as publisher in the new BibTeX file.
bib_items[[1]][4:6]
#> [1] "\\newblock \\emph{Journal of Computational and Graphical Statistics}, 3:\\penalty0"
#> [2] "299--314, 1996."
#> [3] "\\newblock URL : \\url{https://doi.org/10.1080/10618600.1996.10474713}"
A series of filters for the ISBN, URL, pages and year fields is then applied to the remaining data. A field that is not found is set to NULL and skipped when the BibTeX file is written. Common LaTeX elements are also filtered out, and whatever remains is stored in a structured format, ready to be written to a file.
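The filters described above might look roughly like the sketch below. The helper names first_capture and extract_fields and the regular expressions are assumptions chosen for illustration. rebib's own patterns may differ.

```r
# Hypothetical helper: return one capture group of the first match,
# or NULL when the pattern is not found.
first_capture <- function(pattern, text, group = 1) {
  m <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(m) == 0) NULL else m[group + 1]
}

# Hypothetical filter set (not rebib's code) for the remaining lines.
extract_fields <- function(remaining) {
  text <- paste(remaining, collapse = " ")
  url <- first_capture("\\\\url\\{([^}]+)\\}", text)
  # Remove the URL so digits inside it are not mistaken for a year or pages
  text <- sub("\\\\url\\{[^}]*\\}", "", text)
  list(
    # A four-digit number starting with 18, 19 or 20, not part of a longer number
    year  = first_capture("(^|[^0-9])((18|19|20)[0-9]{2})([^0-9]|$)", text, group = 2),
    URL   = url,
    isbn  = first_capture("ISBN[: ]*([0-9][0-9Xx-]+)", text),
    # Page ranges written LaTeX-style with a double hyphen
    pages = first_capture("([0-9]+--[0-9]+)", text)
  )
}

article <- extract_fields(c(
  "\\newblock \\emph{Journal of Computational and Graphical Statistics}, 3:\\penalty0",
  "299--314, 1996.",
  "\\newblock URL : \\url{https://doi.org/10.1080/10618600.1996.10474713}"
))
manual <- extract_fields(c(
  "\\newblock \\emph{R Foundation for Statistical Computing}, Vienna, Austria \\penalty0 2016.",
  "\\newblock URL : \\url{https://www.R-project.org/}, ISBN 3-900051-07-0"
))
article$pages
#> [1] "299--314"
manual$isbn
#> [1] "3-900051-07-0"
```

Because list() keeps NULL elements, a missing field stays in the structure as NULL and can be skipped later by the writer.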
bib_entry <- rebib:::bib_handler(bib_items)
bib_entry
#> $book
#> $book[[1]]
#> $book[[1]]$unique_id
#> [1] "ihaka:1996"
#>
#> $book[[1]]$author
#> [1] "Ihaka, Ross and Gentleman, Robert"
#>
#> $book[[1]]$title
#> [1] "R: A Language for Data Analysis and Graphics"
#>
#> $book[[1]]$journal
#> [1] "Journal of Computational and Graphical Statistics 3: :"
#>
#> $book[[1]]$year
#> [1] "1996"
#>
#> $book[[1]]$URL
#> [1] "https://doi.org/10.1080/10618600.1996.10474713"
#>
#> $book[[1]]$isbn
#> NULL
#>
#> $book[[1]]$pages
#> [1] "299--314"
#>
#>
#> $book[[2]]
#> $book[[2]]$unique_id
#> [1] "R"
#>
#> $book[[2]]$author
#> [1] "R Core Team"
#>
#> $book[[2]]$title
#> [1] "R: A Language and Environment for Statistical Computing"
#>
#> $book[[2]]$journal
#> [1] "R Foundation for Statistical Computing Vienna Austria :"
#>
#> $book[[2]]$year
#> [1] "2016"
#>
#> $book[[2]]$URL
#> [1] "https://www.R-project.org/"
#>
#> $book[[2]]$isbn
#> [1] "3-900051-07-0"
#>
#> $book[[2]]$pages
#> NULL
#>
#>
#> $book[[3]]
#> $book[[3]]$unique_id
#> [1] "Tremblay:2012"
#>
#> $book[[3]]$author
#> [1] "A.~Tremblay"
#>
#> $book[[3]]$title
#> [1] "LMERConvenienceFunctions: A suite of functions to back-fit fixed effects and forward-fit random effects, as well as other miscellaneous functions., "
#>
#> $book[[3]]$journal
#> [1] "R package version 1.6.8.2"
#>
#> $book[[3]]$year
#> [1] "2012"
#>
#> $book[[3]]$URL
#> [1] "http://CRAN.R-project.org/package=LMERConvenienceFunctions"
#>
#> $book[[3]]$isbn
#> NULL
#>
#> $book[[3]]$pages
#> NULL
After reading the bibliographic entries and extracting meaningful values from them, we can finally write a structured file in the BibTeX format.
The writer processes the parsed entries one at a time and writes the data fields that are available for each. The default entry type is book, which should not cause any problems in the web articles.
rebib:::bibtex_writer(bib_entry,file_path)
cat(readLines(paste(your_article_path,"example.bib",sep="/")),sep = "\n")
#> @book{ihaka:1996,
#> author = {{Ihaka, Ross and Gentleman, Robert}},
#> title = {{R: A Language for Data Analysis and Graphics}},
#> publisher = {Journal of Computational and Graphical Statistics 3: :},
#> pages = {299--314},
#> year = {1996},
#> url = {https://doi.org/10.1080/10618600.1996.10474713}
#> }
#> @book{R,
#> author = {R {Core Team}},
#> title = {{R: A Language and Environment for Statistical Computing}},
#> publisher = {R Foundation for Statistical Computing Vienna Austria :},
#> year = {2016},
#> url = {https://www.R-project.org/},
#> isbn = {3-900051-07-0}
#> }
#> @book{Tremblay:2012,
#> author = {A.~{Tremblay}},
#> title = {{LMERConvenienceFunctions: A suite of functions to back-fit fixed effects and forward-fit random effects, as well as other miscellaneous functions., }},
#> publisher = {R package version 1.6.8.2},
#> year = {2012},
#> url = {http://CRAN.R-project.org/package=LMERConvenienceFunctions}
#> }
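The writing step shown above, where NULL fields are skipped and journal is written as publisher, can be sketched as follows. The function format_bibtex_entry is a hypothetical name and a simplified illustration. Unlike the output above, it does not add the extra protective braces around the author and title.

```r
# Hypothetical formatter (not rebib's writer): build one @book entry from
# a parsed entry, skipping fields that are NULL.
format_bibtex_entry <- function(entry) {
  # c() silently drops NULL values, so missing fields disappear here
  fields <- c(
    author    = entry$author,
    title     = entry$title,
    publisher = entry$journal,  # stored as journal, written as publisher
    pages     = entry$pages,
    year      = entry$year,
    url       = entry$URL,
    isbn      = entry$isbn
  )
  body <- sprintf("  %s = {%s}", names(fields), fields)
  paste0("@book{", entry$unique_id, ",\n",
         paste(body, collapse = ",\n"), "\n}")
}

parsed <- list(
  unique_id = "R",
  author = "R Core Team",
  title = "R: A Language and Environment for Statistical Computing",
  journal = "R Foundation for Statistical Computing Vienna Austria :",
  year = "2016",
  URL = "https://www.R-project.org/",
  isbn = "3-900051-07-0",
  pages = NULL
)
bibtex_text <- format_bibtex_entry(parsed)
cat(bibtex_text, "\n")
```

The resulting string could be passed to writeLines() to store it in a file. No pages line is produced because that field is NULL.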
I expect authors who are converting their document to review the generated file and check it for errors or missing values.