1 Jun 2011 17:49
Re: question about storage of corpora
Damir Cavar <dcavar <at> indiana.edu>
2011-06-01 15:49:58 GMT
Sorry Tine, and the others, here are just some comments on some of the recent arguments related to your posting and some replies:

Space and XML as storage: Well, I just bought a fast SATA 2 TB disk for around 100 Euro for my private purposes, to extend my existing 1.5 and 1 TB disks and to back them up. A DB also takes space, and it is not true that the space is reduced to almost nothing, just to maybe one Xth (is it 1/3 or 1/4 in your cases?). Whether it is 240 GB or 40 GB, I don't see the need to put time and effort into mapping the XML to a RelDB just to save some space.

As for the CLC that I mentioned: it is raw TEI P5 XML, and I do not need to store it in a DB to get good performance, as the online interface shows: http://riznica.ihjj.hr/ (choose any of the subcorpora; the "complete" one should be over 100 mil.). Rendering the results most of the time includes an XSLT call on the raw XML data to create the HTML view. The documents are raw XML TEI P5 files on the server, and the rendering to HTML is done with every request, without our server contaminating Zagreb with smoke. You'll probably wait for the connection, not for the server to do the job (except for the collocation analyses in the extended menu). And there is no relational DB that I needed to set up and maintain, just a storage folder for the XML and a generated binary index.

Speed: Observing a decrease of speed in any DB for any type of data storage based on the size of the data is usually a sign of poor engineering and/or poor hardware. XML DBs and other DBs do not differ there: if you index any field (XML attributes, full text, tags), the search passes through, for example, a hash function and should be as fast as your hashing function is. This is true for any DB, relational, XML-based, etc.
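To illustrate the indexing point, here is a minimal sketch in Python of a hash-based full-text index over raw XML documents. It is only an illustration under stated assumptions: the tiny in-memory corpus, the file names, and the functions `build_index` and `lookup` are all made up for this example, and it does not describe how the CLC interface or BaseX actually build their indexes.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature corpus: file name -> raw XML content.
# In a real setup these would be XML files in a storage folder.
DOCS = {
    "doc1.xml": '<text><p n="1">the corpus is stored as raw XML</p></text>',
    "doc2.xml": '<text><p n="1">an index maps each token to its documents</p></text>',
    "doc3.xml": '<text><p n="1">raw XML files stay on the disk</p></text>',
}


def build_index(docs):
    """Build a hash map from lowercased token to the set of files containing it."""
    index = defaultdict(set)
    for doc_id, xml in docs.items():
        root = ET.fromstring(xml)
        # itertext() yields the text content of all elements, ignoring the tags.
        text = " ".join(root.itertext()).lower()
        for token in re.findall(r"\w+", text):
            index[token].add(doc_id)
    return index


def lookup(index, token):
    """One hash lookup; the cost does not grow with the number of documents."""
    return sorted(index.get(token.lower(), ()))


INDEX = build_index(DOCS)
print(lookup(INDEX, "xml"))    # files that contain the token "xml"
print(lookup(INDEX, "index"))  # files that contain the token "index"
```

The index is built once; after that, each query is a single dictionary lookup on the token's hash, whether the corpus holds three documents or millions. The same idea applies to indexed XML attributes or tag names: use them as the keys instead of the text tokens.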
I cannot imagine that access to binary DB tables in a RelDB should be significantly faster than direct access through the operating system's file I/O to XML files on a disk somewhere (I put my full faith in the current OSes and their good handling of file I/O and cache management, which is true even for Linux nowadays).

Evaluation: If you want to test large sizes and speed, just put BaseX on your desktop, no complicated installation [...]