Armin Goralczyk | 19 Dec 11:01 2007

Failure message in R on Mac with xmlTreeParse

Hello

In the following R-help thread, the possibilities of analyzing
publications from PubMed via XML were discussed:

http://www.nabble.com/Analyzing-Publications-from-Pubmed-via-XML-to14328779.html#a14343090

Using xmlTreeParse in a function results in an error message on my
Mac that does not occur in R for Windows:

> esearch <- function (term){
+ 	srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
+ 	srch.mode <- "db=pubmed&retmax=10000&retmode=xml&term="
+ 	doc <-xmlTreeParse(paste(srch.stem,srch.mode,term,sep=""),isURL = TRUE,
+ 		useInternalNodes = TRUE)
+ 	sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue)
+ 	}
>
> term <- 'meyer'
> pmid <- esearch(term) # works fine
>
> term <- 'meyer[au]'
> pmid <- esearch(term)
Error in .Call("RS_XML_ParseTree", as.character(file), handlers,
as.logical(ignoreBlanks),  :
  error in creating parser for
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=10000&retmode=xml&term=meyer[au]
>
I/O warning : failed to load external entity
"http%3A//eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi%3Fdb=pubmed&retmax=10000&retmode=xml&term=meyer%5Bau%5D"

Duncan Temple Lang | 19 Dec 22:23 2007

Re: Failure message in R on Mac with xmlTreeParse


The [au] portion seems to be causing the problem.
Escape the [ and ] by mapping them to %5B and %5D respectively
_before_ handing the URL string to xmlTreeParse().  (The error message
suggests that the internals have already performed this conversion, but
doing it yourself should work: I can reproduce your error message, and
I get the desired result by escaping the [ and ] first.)
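
A minimal sketch of that escaping step, using only base R. The helper
name escape_brackets is hypothetical (it is not part of the XML
package), and it handles only the two characters discussed here:

```r
# Hypothetical helper: percent-encode square brackets in a query term.
# fixed = TRUE makes gsub() treat "[" and "]" as literal characters
# rather than as regular-expression syntax.
escape_brackets <- function(term) {
  term <- gsub("[", "%5B", term, fixed = TRUE)
  gsub("]", "%5D", term, fixed = TRUE)
}

srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
srch.mode <- "db=pubmed&retmax=10000&retmode=xml&term="

term <- escape_brackets("meyer[au]")   # "meyer%5Bau%5D"
url  <- paste(srch.stem, srch.mode, term, sep = "")
```

The resulting url string is what would then be handed to
xmlTreeParse(url, isURL = TRUE, useInternalNodes = TRUE).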

There is more information about what needs to be escaped at
http://publib.boulder.ibm.com/infocenter/discover/v8r4/index.jsp?topic=/com.ibm.discovery.ds.ref.doc/t_RG_Escape_Sequences.htm

The HTTP/FTP code built into the xmlTreeParse(), htmlTreeParse() and
xmlEventParse() functions (it comes from libxml2) is minimalistic.
For better or worse, the same code is also used in R to implement
url() connections.  It handles nothing in HTTP beyond simple
requests.  So when I run into problems with xmlTreeParse() and a URL,
I first fetch the content of the document using the RCurl package.

For example,
library(RCurl)
getURL("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=10000&retmode=xml&term=meyer[au]")

does fetch the document and the result can be passed directly to
xmlTreeParse().
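
Put together, the fetch-then-parse approach might look like the sketch
below. It assumes the RCurl and XML packages are installed; the
function name esearch_curl is hypothetical, and the URL is the one
used in this thread, so it may need adjusting if the service changes.
Passing asText = TRUE is an assumption made here to make explicit that
the first argument is XML content rather than a file name or URL.

```r
# Sketch only: assumes the RCurl and XML packages are installed.
# esearch_curl is a hypothetical name, not a function from this thread.
esearch_curl <- function(term) {
  srch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
  srch.mode <- "db=pubmed&retmax=10000&retmode=xml&term="
  # Fetch the raw XML as a character string via libcurl.
  txt <- RCurl::getURL(paste(srch.stem, srch.mode, term, sep = ""))
  # Parse the string itself (not a file or URL) into an internal document.
  doc <- XML::xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
  # Collect the text of every <Id> node.
  XML::xpathSApply(doc, "//Id", XML::xmlValue)
}
```

A call such as pmid <- esearch_curl("meyer[au]") would then replace the
original esearch() call.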

RCurl is an interface to libcurl, a very solid, stable and
feature-rich library for performing HTTP, HTTPS, FTP, ... client
queries.  It allows us to do in R, programmatically, pretty much
anything a Web browser can do.
