Development Speed Advantage? XML or SQLite?

Discussion in 'Software' started by OneSeventeen, 22 Apr 2006.

  1. OneSeventeen

    OneSeventeen Oooh Shiny!

    Joined:
    3 Apr 2002
    Posts:
    3,454
    Likes Received:
    2
    As I posted in another thread, I'm working on making some bible software, but for the time being I'm just focusing on writing a PHP script to do something similar to www.biblegateway.com's scripture searching.

    On this page:
    http://bible.woventhorns.com/
    I'm using one 5MB XML file to find a book from the bible and display it.

    On this page:
    http://bible.woventhorns.com/lookup.php
    I'm using a SQLite database to search for a specific verse and display it. (only one verse for now)

    It feels like the SQLite database is slower than the XML... Would it be smarter to just break the XML file up by book and use the PHP simplexml functions?

    I guess I'll still need to see what it's like once I figure out the scripting to look up sets of scripture, such as Gen 1:34-3:34. Maybe that's where a database could come in handy, but I still think XML feels faster.


    Any tips on the XML vs. SQLite speed/performance battle? Am I missing some secret to making SQLite fast? Or am I just imagining things thinking that it takes longer to query the database rather than parse the XML file?
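For reference, here is a minimal sketch of the SQLite side of a lookup like the one on lookup.php, using PDO (PHP 5.1+). The `verses(book, chapter, verse, text)` schema is an assumption, not the actual one from bible.woventhorns.com, and an in-memory database stands in for the real file so the sketch runs as-is:

```php
<?php
// Sketch of a single-verse lookup with PDO + SQLite (PHP 5.1+).
// Hypothetical schema: verses(book, chapter, verse, text).
// An in-memory database stands in for the real file so this runs as-is.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE verses (book TEXT, chapter INT, verse INT, text TEXT)');
// An index on the lookup columns is what makes SQLite fast here.
$db->exec('CREATE INDEX idx_ref ON verses (book, chapter, verse)');
$db->exec("INSERT INTO verses VALUES ('Proverbs', 6, 6,
    'Go to the ant, thou sluggard; consider her ways, and be wise:')");

// A prepared statement avoids re-parsing the SQL and escapes the input.
$stmt = $db->prepare(
    'SELECT text FROM verses WHERE book = ? AND chapter = ? AND verse = ?');
$stmt->execute(array('Proverbs', 6, 6));
$row = $stmt->fetch(PDO::FETCH_ASSOC);
echo $row['text'], "\n";
```

If the lookup feels slow without an index, that index is usually the missing "secret": without it SQLite scans the whole table for every query.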
     
  2. Ramble

    Ramble Ginger Nut

    Joined:
    5 Dec 2005
    Posts:
    5,596
    Likes Received:
    42
    I imagine that XML files have lower overhead than a database.
     
  3. FredsFriend

    FredsFriend What's a Dremel?

    Joined:
    20 Jul 2005
    Posts:
    486
    Likes Received:
    0
    For small amounts of data an XML file would be quicker than a database. When large amounts of data are involved, a database will be far quicker.
    As it sounds like you are only using it for a small application, a simple XML file should be adequate.
     
  4. John Cena

    John Cena What's a Dremel?

    Joined:
    1 Jun 2004
    Posts:
    818
    Likes Received:
    0
    Depends on the size of the data and how many concurrent reads are done.

    The more data you have, the slower the XML format tends to be, especially if you split it and there are many concurrent reads. Don't forget that the file has to be parsed as well, taking more CPU power.

    I always use a database even for simple stuff, because of maintainability in the future + I would think it'd be faster.

    I've never used SQLite, but in SQL Server 2005 you can build views and specifically state that they stay cached.
     
    Last edited: 23 Apr 2006
  5. RTT

    RTT #parp

    Joined:
    12 Mar 2001
    Posts:
    14,120
    Likes Received:
    74
    Almost always for small amounts of data, file system access beats database overheads.

    Have you tried the crude way of measuring the execution time of a php script using microtime() ?
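That crude timing approach might look like this; note that `microtime(true)` needs PHP 5 (on PHP 4 you would `explode()` the string form yourself), and the `usleep()` call is just a stand-in for the lookup being measured:

```php
<?php
// Crude wall-clock timing with microtime(), as suggested.
// microtime(true) needs PHP 5; on PHP 4 you would explode() the string form.
$start = microtime(true);

usleep(50000); // stand-in for the XML or SQLite lookup being measured (50 ms)

$elapsed = microtime(true) - $start;
printf("Lookup took %.4f seconds\n", $elapsed);
```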
     
  6. John Cena

    John Cena What's a Dremel?

    Joined:
    1 Jun 2004
    Posts:
    818
    Likes Received:
    0
    So RTT, you mean to tell me that with 1000 concurrent users requesting XML vs SQL data, the XML data will be retrieved faster?

    I've done plenty of tests regarding XML vs SQL (ASP and .NET). SQL wins by far.
     
  7. RTT

    RTT #parp

    Joined:
    12 Mar 2001
    Posts:
    14,120
    Likes Received:
    74
    Yes actually - emphasis on small amounts of data, as the thread poster and I have both said :)

    Do show us your test results.
     
  8. John Cena

    John Cena What's a Dremel?

    Joined:
    1 Jun 2004
    Posts:
    818
    Likes Received:
    0
    Oh OK. As long as it's a small environment with like 20 users... sure.

    I guarantee you that after 8 months of data collection in XML format, maintainability will be zero. Scalability and performance will drop exponentially as the number of users goes up.
     
  9. OneSeventeen

    OneSeventeen Oooh Shiny!

    Joined:
    3 Apr 2002
    Posts:
    3,454
    Likes Received:
    2
    I'd like to remind everyone, I am writing bible software, with each bible having its own XML file or SQLite database.

    Due to the nature of this being scripture lookup software, the database/XML file will be read-only, so data collection and maintenance are not necessary.

    The more I look at it, the more I think it may be irrelevant. For single queries, SQLite fluctuates between 4.340164899826 and 0.13911104202271 seconds, and simpleXML fluctuates between 4.0169820785522 and 0.4659538269043 seconds.

    That makes SQLite look a tiny bit slower in the end, but since it wavers back and forth... I'm not sure.

    On top of that, I think querying a database will be easier when trying to pull large sections out.

    In the future if this gets used a lot, I may break each version of the bible up into 66 separate databases, so maybe it will have less overhead when pulling the data.

    Look up a verse with SQLite
    Look up a verse with simpleXML
     
  10. John Cena

    John Cena What's a Dremel?

    Joined:
    1 Jun 2004
    Posts:
    818
    Likes Received:
    0
    Your design mechanics are very wrong.


    Why do you need to have 66 databases? OF COURSE that's going to be slow, because of the number of DB connections + connection pooling.

    I would have one database and normalize the bible scripture table.

    BUT if you're going to use 66 databases, please go use XML... you'll thank me later.
     
  11. Hepath

    Hepath Minimodder

    Joined:
    20 Oct 2003
    Posts:
    730
    Likes Received:
    0
    *grinding rusty wheels of memory start here... *

    You've not said what kind of reader you are using here. As it's static, fast forward-only readers should be used (e.g. SAX parsers, which do not read the whole document into memory)...

    Have to agree with John though - in the end a database will be much more flexible. Trying to create complex XPath queries is not fun.
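For the forward-only idea in PHP, a sketch using XMLReader (the pull parser bundled with PHP 5.1+, which streams the document rather than loading it all into memory the way simplexml does). The `<book>`/`<chapter>`/`<verse>` markup here is hypothetical, not the actual file format:

```php
<?php
// Forward-only read with XMLReader (bundled with PHP 5.1+): it streams the
// document rather than loading it all into memory the way simplexml does.
// The <book>/<chapter>/<verse> markup below is hypothetical.
$xml = '<bible><book name="Proverbs"><chapter n="6">'
     . '<verse n="6">Go to the ant, thou sluggard</verse>'
     . '</chapter></book></bible>';
$file = tempnam(sys_get_temp_dir(), 'bible');
file_put_contents($file, $xml);

$reader = new XMLReader();
$reader->open($file);
$found = null;
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT
            && $reader->name == 'verse'
            && $reader->getAttribute('n') == '6') {
        $found = $reader->readString(); // text content of this verse node
        break; // stop parsing as soon as the verse is found
    }
}
$reader->close();
unlink($file);
echo $found, "\n";
```

The win over simplexml is that parsing stops at the first match instead of building a tree for the whole 5MB file.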
     
  12. simon w

    simon w What's a Dremel?

    Joined:
    3 Nov 2003
    Posts:
    1,302
    Likes Received:
    0
    As John mentioned, think more about concurrent requests/scalability.

    Do you open a new file handle for *every* request, or throw the queries at a DB server* that has a query cache?

    * that assumes MySQL is available in your production environment
     
    Last edited: 26 Apr 2006
  13. OneSeventeen

    OneSeventeen Oooh Shiny!

    Joined:
    3 Apr 2002
    Posts:
    3,454
    Likes Received:
    2
    Okay, just a quick breakdown of things influencing my search for a quick and simple way to do this:
    • I currently have a limit on queries per hour in MySQL, and expect this to exceed those limits, so MySQL is out. (this limit will be lifted within the year, but I want to work with what I've got for now, so I can figure out the searching mechanism)
    • I will only pull one book at a time
    • I will eventually want to pull multiple verses (possibly across multiple chapters, but not multiple books)
    • I will eventually have multiple languages of multiple translations of the bible
    Therefore, a query cache is kind of out of the question for now.
    Having 66 databases shouldn't be a performance hit since I will only need to pull one of those databases per page visit.

    I have a separate database that will maintain a list of book names and the number of the book each belongs to (which is the number of the database file to use), and will have all the various languages, abbreviations, and common misspellings.

    So, with this in mind, and keeping in mind an RDBMS is not on the table, are there really any reasons not to use 66 different databases? (One for each book of the bible)
     
  14. John Cena

    John Cena What's a Dremel?

    Joined:
    1 Jun 2004
    Posts:
    818
    Likes Received:
    0
    A connection is VERY EXPENSIVE in terms of performance. Connection pooling fixes this BUT is also limited. 66 databases + 20 users = server hell. So you will be opening and closing DB connections out the wazoo.

    In .NET it would result in exceptions thrown everywhere, because the default pool is 15 connections. Whether you can modify that depends on your host, and raising it also increases the RAM requirement.
     
  15. eek

    eek CAMRA ***.

    Joined:
    23 Jan 2002
    Posts:
    1,600
    Likes Received:
    14
    Do you really have a need for 66 databases?!?! That's insane. I'd hate to be the person who has to manage them, let alone normalise them all!!!

    Seeing as SQLite uses the local filesystem and so does XML, the speed is really just going to come down to the algorithms used to access and process the data. I'm guessing that in real terms the performance difference will be negligible.

    Personally I would go with SQLite, as you will be able to retrieve data in one big query rather than having to open and process lots of XML files and manually recurse and sort through the data. If you optimize all your queries, I'm fairly sure that for complicated things such as JOINs and whatnot, SQL will be the way to go - especially with a properly normalised database, so you don't keep pulling out redundant data!
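A sketch of that "one big query" approach for a range that spans chapters (e.g. Proverbs 6:6 - 7:1), again assuming the hypothetical `verses(book, chapter, verse, text)` schema. Folding `(chapter, verse)` into `chapter*1000+verse` gives one sortable number; this assumes no chapter has 1000+ verses (the longest, Psalm 119, has 176):

```php
<?php
// One query for a range that spans chapters (e.g. Proverbs 6:6 - 7:1).
// Hypothetical schema again: verses(book, chapter, verse, text).
// Folding (chapter, verse) into chapter*1000+verse gives one sortable
// number; this assumes no chapter has 1000+ verses.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE verses (book TEXT, chapter INT, verse INT, text TEXT)');
foreach (array(array(6, 6), array(6, 7), array(7, 1)) as $ref) {
    $db->exec("INSERT INTO verses VALUES ('Proverbs', $ref[0], $ref[1], 'sample text')");
}

$stmt = $db->prepare(
    'SELECT chapter, verse, text FROM verses
     WHERE book = ? AND chapter * 1000 + verse BETWEEN ? AND ?
     ORDER BY chapter, verse');
$stmt->execute(array('Proverbs', 6 * 1000 + 6, 7 * 1000 + 1));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
echo count($rows), " verses\n"; // 3 verses
```

Doing the same range pull across XML chapter elements would mean walking and filtering nodes by hand.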
     
  16. OneSeventeen

    OneSeventeen Oooh Shiny!

    Joined:
    3 Apr 2002
    Posts:
    3,454
    Likes Received:
    2
    I guess I'm too newbie-ish when it comes to data normalization and whatnot. What is normalization, and how does it apply to read-only data?

    Also, 66 databases meaning one for each book of the bible, so there would be 2 connections per page visit: first the database that ties the book name to the book ID, and second the database that I want to pull data from.

    TBH, I may even store the searches, establish a list of the 100 most popular queries, then put those in separate flat text files, and just display the text file if it exists, otherwise pull the data from the database.

    So I guess my only real question is: if I already have all 66 read-only databases (that will never be modified), and only one of those 66 will be used during each search, why would that cause more of a load than using one massive database?

    I would assume accessing a 50K database and a 300K database one time each would be quicker than accessing a single 9MB database (especially considering this is SQLite, not a client/server RDBMS).

    Another note, is I will not be searching text in the bible, just scripture reference. Such as Proverbs 6:6-11.

    Using One Database:
    1. Open 9MB bible database file in SQLite
    2. Look up which book Proverbs is (20) in the book table
    3. Look up chapter 6 verses 6-11 in the bible table
    4. Close Connection
    5. Display data


    Using 67 Databases:
    1. Open 50KB database of books and their ID number
    2. Look up which book Proverbs is (20)
    3. Close book reference database
    4. Open 119KB database of Proverbs
    5. Look up chapter 6 verses 6-11 in the bible table
    6. Close Connection
    7. Display data

    The largest database is Jeremiah's, at 545KB; other than the 10 largest databases, all others are under 300KB.

    If one 50KB plus one 200KB SQLite connection per visit really does cost more than one 9MB SQLite connection, please let me know why that is, so I can learn more about data management and how to make the data load quicker.
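For what it's worth, the two-connection flow above might look like this with PDO; the schemas and the per-book filename scheme are assumptions, and in-memory databases stand in for the real 50KB/119KB files so the sketch runs as-is:

```php
<?php
// The two-connection flow sketched with PDO. The schemas and the per-book
// filename scheme are assumptions; in-memory databases stand in for the
// real files so this runs as-is.

// Steps 1-3: open the small book-index database and find the book number.
$index = new PDO('sqlite::memory:');
$index->exec('CREATE TABLE books (id INT, name TEXT)');
$index->exec("INSERT INTO books VALUES (20, 'Proverbs')");
$stmt = $index->prepare('SELECT id FROM books WHERE name = ?');
$stmt->execute(array('Proverbs'));
$bookId = (int) $stmt->fetchColumn();
$index = null; // releasing the PDO object closes the connection

// Steps 4-6: open the per-book database (e.g. "books/$bookId.db") and
// pull the verses. Again an in-memory database stands in for the file.
$book = new PDO('sqlite::memory:');
$book->exec('CREATE TABLE verses (chapter INT, verse INT, text TEXT)');
$book->exec("INSERT INTO verses VALUES (6, 6, 'Go to the ant, thou sluggard')");
$verses = $book->query(
    'SELECT text FROM verses WHERE chapter = 6 AND verse BETWEEN 6 AND 11'
)->fetchAll(PDO::FETCH_COLUMN);
$book = null;

// Step 7: display.
foreach ($verses as $text) {
    echo $text, "\n";
}
```

Opening a SQLite file is just an fopen-style operation, not a network connection, so the real cost per visit is two file opens rather than two "expensive" server connections.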
     
  17. John Cena

    John Cena What's a Dremel?

    Joined:
    1 Jun 2004
    Posts:
    818
    Likes Received:
    0
    Let me sum it up for you.

    For EACH database you manipulate or even TOUCH... you need a DB connection. 2 searches = 2 DB connections. 3 searches = 3 DB connections.

    A connection is VERY expensive to open, which is why there is something called connection pooling: when a connection is closed it isn't immediately destroyed; instead it goes back to a special pool where it waits to serve the next request. This can improve performance, but it increases RAM and CPU usage because the connections in the pool have to be maintained. Also be warned that most hosts limit the number of connections in a pool, so you might get network exceptions on a busy site.

    Normalization (third normal form being the most common) is when you move data that does not depend on a table's primary key out into separate tables with keys of their own. This prevents redundant data and can increase performance significantly if done properly. It also opens up lots of flexibility and better maintainability in the future.

    For example:

    Say t_student has a column for degree 1, degree 2, degree 3... but that wouldn't make sense, right? It's less flexible because it only allows 3 degrees. So you separate the degrees into a different table keyed on the student ID.

    So when normalized to 3NF, it would look like this:

    t_student
    StudentID, student name, student city

    t_student_degrees
    StudentDegreeID, StudentID (foreign key), DegreeID (foreign key)

    t_degrees
    DegreeID, Degree Name


    t_student_degrees contains all of a student's degrees and is matched to t_student by the StudentID foreign key.

    Now you would join all 3 tables together using a SQL view... this view can be cached = better performance.
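That 3NF schema and view can be sketched in SQLite terms like this (with the caveat that a SQLite view is just a saved query, not cached or indexed the way a SQL Server 2005 view can be; the sample names and data are made up):

```php
<?php
// The 3NF student/degrees schema sketched in SQLite via PDO. Note a SQLite
// view is just a saved query - it is not cached or indexed the way a SQL
// Server 2005 view can be - but the join itself is the same idea.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE t_student (StudentID INTEGER PRIMARY KEY, name TEXT, city TEXT)');
$db->exec('CREATE TABLE t_degrees (DegreeID INTEGER PRIMARY KEY, name TEXT)');
$db->exec('CREATE TABLE t_student_degrees (
    StudentDegreeID INTEGER PRIMARY KEY,
    StudentID INTEGER REFERENCES t_student,
    DegreeID INTEGER REFERENCES t_degrees)');

$db->exec("INSERT INTO t_student VALUES (1, 'Alice', 'Leeds')");
$db->exec("INSERT INTO t_degrees VALUES (1, 'BSc')");
$db->exec("INSERT INTO t_degrees VALUES (2, 'MSc')");
$db->exec("INSERT INTO t_student_degrees VALUES (1, 1, 1)");
$db->exec("INSERT INTO t_student_degrees VALUES (2, 1, 2)");

// The view joins all three tables, so callers never see the link table.
$db->exec('CREATE VIEW v_student_degrees AS
    SELECT s.name AS student, d.name AS degree
    FROM t_student s
    JOIN t_student_degrees sd ON sd.StudentID = s.StudentID
    JOIN t_degrees d ON d.DegreeID = sd.DegreeID');

$rows = $db->query('SELECT * FROM v_student_degrees')->fetchAll(PDO::FETCH_ASSOC);
echo count($rows), " rows\n"; // 2 rows: Alice holds both degrees
```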

    So for 1 DB, you open 1 connection and then close it.
    For 66 DBs, you open and close a connection up to 66 times. That's like 100000% negative performance lol.

    Sorry I suck at explaining lol.
     
  18. eek

    eek CAMRA ***.

    Joined:
    23 Jan 2002
    Posts:
    1,600
    Likes Received:
    14
    Normalisation is simply a method for designing database relations in a manner which reduces redundancy and hence makes it all more efficient. There are different levels of normal form, with one of the most common being Boyce-Codd normal form (BCNF). One of the simplest things to do is look at every relation and, if there is any possibility of a value appearing in a tuple more than once, it should probably be moved out to its own relation (not always true depending on what the data is, but it can be a good indicator).

    If you are only accessing 1 of the databases per page then there is no problem with having 66 databases in terms of speed etc. It may be a headache trying to manage them all but if they are read-only then that is of little concern!

    I can't imagine a smaller database taking any more or less time to connect to than a larger one. Theoretically the smaller one should probably be faster to query, but again that is all dependent on the structure of the database :) I wouldn't worry about it all too much to be honest!!

    *edit* beaten to it lol!
    I don't think I have ever seen anyone using connection pooling with PHP; I guess to implement it you would need a PHP script running in the background 100% of the time maintaining all the connections?

    Anyway, yeah if you were to use connection pooling then it is much better to pool connections for just one database rather than having to have a pool of ten or so connections for each of the 66 databases!!

    If you aren't pooling the connections then as I said above, I don't think that having 66 databases will be much of an issue if you only need to connect to one at a time, beyond maintenence of them all of course!!!
     
    Last edited: 27 Apr 2006
  19. John Cena

    John Cena What's a Dremel?

    Joined:
    1 Jun 2004
    Posts:
    818
    Likes Received:
    0
    The poster said that he will be searching the "scriptures", which are stored in different DBs.

    That means he will look in all 66 DBs = 66 connections opened and closed.

    2 users searching at the same time = over 120 connections opened and closed.

    Even for a small application... 66 DBs does matter.
     
  20. eek

    eek CAMRA ***.

    Joined:
    23 Jan 2002
    Posts:
    1,600
    Likes Received:
    14
    Maybe I got the wrong idea, but I thought that the database that needed accessing was known, hence not all 66 databases had to be queried every time, just the one requested?
     