Wednesday, September 21, 2011

Options for Slow Performance Parsing Large XML - Part 3

The final piece to this chapter. In my previous post, Options for Slow Performance Parsing Large XML - Part 2, I showed how registering a schema, and having that process create a table, could be used to increase the speed of parsing XML compared to parsing that XML via PL/SQL. Now I'm going to look at whether using a table without an associated schema makes any difference in parsing/inserting speed.

I again used the following block of code for running my tests:
declare 
   l_result_xml   XMLTYPE;
   l_ns_xml       XMLTYPE;
   l_start        NUMBER;
   l_end          NUMBER;

begin
   -- Build up the XML
   -- Has 2100+ row nodes
   ...snipped...

   l_start := dbms_utility.get_time;
   INSERT INTO LOAD_TEMP_RESULT VALUES (l_result_xml);
   l_end   := dbms_utility.get_time;
   dbms_output.put_line('INSERT INTO TEMP TIME in sec: ' || to_char((l_end - l_start) / 100));

   l_start := dbms_utility.get_time;
   INSERT INTO import_queue
   (...20 columns...) 
   SELECT CBIQ_ID_SEQ, CBRH_ID_SEQ,
          ...18 columns ...
     FROM LOAD_TEMP_RESULT ltr,
          XMLTABLE('/resultset/row'
                    PASSING ltr.object_value  -- OBJECT_VALUE is the default column name for an XMLType object table
                   COLUMNS
                   CBIQ_ID_SEQ          NUMBER(11) PATH 'CBIQ_ID_SEQ', 
                   CBRH_ID_SEQ          NUMBER(11) PATH 'CBRH_ID_SEQ',
                   ... 18 more columns ...);
   l_end   := dbms_utility.get_time;
   dbms_output.put_line('INSERT INTO TIME in sec: ' || to_char((l_end - l_start) / 100));
   rollback;
end;

The only thing I changed between runs was the CREATE TABLE statement for LOAD_TEMP_RESULT. The following table summarizes the various CREATE TABLE statements I used and the resulting times, where TEMP TIME is the output for "INSERT INTO TEMP TIME in sec:" and TIME is the output for "INSERT INTO TIME in sec:":
DDL                                                                                TEMP TIME (sec)  TIME (sec)
CREATE TABLE load_temp_result OF XMLTYPE;                                          0.05             16
CREATE TABLE load_temp_result OF XMLTYPE XMLTYPE STORE AS BASICFILE CLOB;          0.09             15.82
CREATE TABLE load_temp_result OF XMLTYPE XMLTYPE STORE AS SECUREFILE CLOB;         2.36             15.46
CREATE TABLE load_temp_result OF XMLTYPE XMLTYPE STORE AS BASICFILE BINARY XML;    0.27             0.63
CREATE TABLE load_temp_result OF XMLTYPE XMLTYPE STORE AS SECUREFILE BINARY XML;   0.24             0.63

As I am using 11.1.0.6, the default behavior for this version of Oracle is to store XMLTypes as a CLOB. This is confirmed by the fact that the first row (default storage) matches most closely with the second row (explicit BASICFILE CLOB storage). Based on the results, it is easy to see that creating an XMLType table (or an XMLType column) where the XMLType is stored as BINARY XML is well worth looking at in terms of performance when validation is not needed.
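As a side note, if you want to confirm which storage Oracle actually used for a given XMLType table, a couple of dictionary queries can help. This is just a sketch: USER_LOBS only applies to the LOB-backed (CLOB) storage options, while DBMS_METADATA shows the full DDL in any case.

-- SECUREFILE = 'YES' means SecureFile storage, 'NO' means BasicFile
SELECT table_name, column_name, securefile
  FROM user_lobs
 WHERE table_name = 'LOAD_TEMP_RESULT';

-- Or dump the complete DDL Oracle generated for the table
SELECT dbms_metadata.get_ddl('TABLE', 'LOAD_TEMP_RESULT') FROM dual;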

I'm not going to go into a discussion of the differences between BASICFILE and SECUREFILE, but the highlights of what SECUREFILE offers can be found at LOB Features Introduced in Oracle Database 11g Release 1. Oracle recommends that SECUREFILE be used, as I have picked up from the XML DB forum on OTN.

To recap the results from the previous two posts on this topic:
  • Using XMLTable in a SQL statement to parse a PL/SQL XMLType variable took 1032 seconds (16+ minutes).
  • Using PL/SQL to loop through each node in the XML, parse the information out, and then INSERT it into a table took 4 seconds.
  • Using a registered schema and its associated table took 0.5 seconds.
For the given test case (approximately 2100 row nodes in the XML) on the system I was using (11.1.0.6 on a Red Hat OS), a schema registration that creates an associated table just edged out a Binary XML XMLType-based table, which totaled 0.87 seconds. This is likely due to the additional information Oracle knows about the XML from the schema. As this was also the first version in which Binary XML existed, some of the gap could be due to that fact as well. Maybe if I get access to a newer version of Oracle, I can test that theory.

Friday, September 16, 2011

Options for Slow Performance Parsing Large XML - Part 2

As I covered previously in Options for Slow Performance Parsing Large XML, I was looking for a PL/SQL-only solution, given the business requirements, to parse an XML with just over 2100 repeating nodes and insert information from each node into a table in the DB for additional processing. The first attempt, using only XMLTable to parse the PL/SQL XMLType variable, took over 1000 seconds (16+ minutes). The second attempt parsed the XML by looping through it in PL/SQL and used a FORALL to insert the data into the table; this took just over 4 seconds.

That was fine for my situation, but what if you need something faster? As I previously said:
One option is to register a schema in the database, create a table based off that schema, store the XML in there and parse it that way (if still needed).

To verify that answer, I took the long way to get there, just to see what happens. I started by creating a simple schema for the XML, doing a basic schema registration, converting the XML to schema-based XML, and parsing that PL/SQL XMLType. The basics of that approach were:

DECLARE
   l_schema  CLOB;
BEGIN
   l_schema := '<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdb="http://xmlns.oracle.com/xdb">
 <xs:element name="resultset" type="resultSetType" xdb:defaultTable="LOAD_TEMP_RESULT"/>
 <xs:complexType name="resultSetType">
  <xs:sequence>
   <xs:element name="row" type="rowType" maxOccurs="unbounded"/>
  </xs:sequence>
 </xs:complexType>
 <xs:complexType name="rowType">
  <xs:sequence>
   ...16 elements of xs:int, xs:string, xs:short, xs:date ...
  </xs:sequence>
 </xs:complexType>
</xs:schema>';

   dbms_xmlSchema.registerSchema('blog_pref.xsd', l_schema, gentypes => false, gentables => false);
END;
/
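One housekeeping note: a schema cannot be registered twice under the same URL, so while iterating on these tests the previous registration has to be dropped first. A minimal sketch using the documented dbms_xmlschema call (DELETE_CASCADE_FORCE also drops any generated tables/types):

BEGIN
   -- Drop the registration plus anything generated from it
   dbms_xmlschema.deleteSchema('blog_pref.xsd', dbms_xmlschema.DELETE_CASCADE_FORCE);
END;
/

With the schema registered, the parsing test itself was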

declare 
   l_result_xml   XMLTYPE;
   l_ns_xml       XMLTYPE;
   l_start        NUMBER;
   l_end          NUMBER;

begin
   -- Build up the XML
   -- Has 2100+ row nodes
   ...snipped...

   l_start := dbms_utility.get_time;
   l_ns_xml := l_result_xml.createSchemaBasedXML('blog_pref.xsd');
   
   INSERT INTO import_queue
   (...20 columns...) 
   SELECT CBIQ_ID_SEQ, CBRH_ID_SEQ,
          ...18 columns ...
     FROM XMLTABLE('/resultset/row'
                   PASSING l_ns_xml
                   COLUMNS
                   CBIQ_ID_SEQ          NUMBER(11) PATH 'CBIQ_ID_SEQ', 
                   CBRH_ID_SEQ          NUMBER(11) PATH 'CBRH_ID_SEQ',
                   ... 18 more columns ...);
   l_end   := dbms_utility.get_time;
   dbms_output.put_line('INSERT INTO TIME in sec: ' || to_char((l_end - l_start) / 100));
   rollback;
end;
The time for a run was
INSERT INTO TIME in sec: 1024.09

This was still 16+ minutes, so no improvement at all. I wasn't expecting any improvement in overall time because
A) the schema had been registered without creating any types or tables, and
B) the code was still parsing a PL/SQL XMLType variable.

The only difference now is that the XML is associated with a schema, but the object itself is still simply a PL/SQL XMLType variable. The next approach was to use the default schema registration. The only change to the above was to use
dbms_xmlSchema.registerSchema('blog_pref.xsd', l_schema);
so that any needed tables/types would be generated. Running the above code to put the data into the import_queue table resulted in a run of
INSERT INTO TIME in sec: 1057.13

This was still 16+ minutes, so again no improvement at all. I wasn't expecting any improvement in overall time because I had only removed condition A) from above, and condition B) was still in effect.
To remove condition B), I rewrote the code to insert into the table created during the registerSchema process and then read from that table, using XMLTable to parse the XML. I used xdb:defaultTable="LOAD_TEMP_RESULT" in the schema to name the table that Oracle created. The code ended up looking like:
declare 
   l_result_xml   XMLTYPE;
   l_ns_xml       XMLTYPE;
   l_start        NUMBER;
   l_end          NUMBER;

begin
   -- Build up the XML
   -- Has 2100+ row nodes
   ...snipped...

   l_start := dbms_utility.get_time;
   INSERT INTO LOAD_TEMP_RESULT VALUES (l_result_xml);
   l_end   := dbms_utility.get_time;
   dbms_output.put_line('INSERT INTO TEMP TIME in sec: ' || to_char((l_end - l_start) / 100));

   l_start := dbms_utility.get_time;
   INSERT INTO import_queue
   (...20 columns...) 
   SELECT CBIQ_ID_SEQ, CBRH_ID_SEQ,
          ...18 columns ...
     FROM LOAD_TEMP_RESULT ltr,
          XMLTABLE('/resultset/row'
                    PASSING ltr.object_value  -- OBJECT_VALUE is the default column name for an XMLType object table
                   COLUMNS
                   CBIQ_ID_SEQ          NUMBER(11) PATH 'CBIQ_ID_SEQ', 
                   CBRH_ID_SEQ          NUMBER(11) PATH 'CBRH_ID_SEQ',
                   ... 18 more columns ...);
   l_end   := dbms_utility.get_time;
   dbms_output.put_line('INSERT INTO TIME in sec: ' || to_char((l_end - l_start) / 100));
   rollback;
end;
The time for a run broke down as
INSERT INTO TEMP TIME in sec: .31
INSERT INTO TIME in sec: .17


So the time Oracle spent parsing the XML into the temporary table, plus the time to retrieve the data from it and insert it into the import_queue table, was less than 0.5 seconds for 2100 rows. That is a very good improvement over 4 seconds, though it does require the registration of a schema and a table associated with that schema.
Still to be covered is the other option I mentioned, to
store the XML as binary XML (available in 11g).
Hopefully I'll get to that soon.

Thursday, June 30, 2011

XMLTable and outer join methods

While assisting with a question on the OTN forums (Using XML clob in loop), I had to pause briefly to remember how to do outer joins when passing optional (i.e., the child may not exist) information from one XMLTable to another XMLTable.

The syntax I first used was the proprietary Oracle method of using (+). So the FROM clause looked something like
XMLTABLE(...COLUMNS ... xmlfrag XMLTYPE PATH ... ) x, 
XMLTABLE(... PASSING x.xmlfrag ...) (+) y
So now the query returns data from "x" even if there was no data in "y". Then I got to wondering what the ANSI-standard way of writing that XMLTable join would be. It turns out to be just a simple
XMLTABLE(...COLUMNS ... xmlfrag XMLTYPE PATH ... ) x
LEFT OUTER JOIN
XMLTABLE(... PASSING x.xmlfrag ...) y
ON 1=1
The ON clause exists because some type of join condition is required for a LEFT/RIGHT OUTER JOIN. Normally a 1=1 would cause a Cartesian join between the two tables, but in this case it doesn't because the join condition is already defined: the second XMLTable is correlated to the first XMLTable via the PASSING clause.
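To make this concrete, here is a small self-contained example (the XML and names are made up for illustration; each parent node may or may not contain a child node):

SELECT x.pname, y.cname
  FROM XMLTABLE('/root/parent'
                PASSING XMLTYPE('<root>
                                   <parent><name>A</name>
                                     <child><name>A1</name></child>
                                   </parent>
                                   <parent><name>B</name></parent>
                                 </root>')
                COLUMNS
                pname   VARCHAR2(10) PATH 'name',
                xmlfrag XMLTYPE      PATH 'child') x
  LEFT OUTER JOIN
       XMLTABLE('/child'
                PASSING x.xmlfrag
                COLUMNS
                cname VARCHAR2(10) PATH 'name') y
  ON 1=1;

Parent B still shows up in the output with a NULL cname; with an inner join it would be dropped.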

Pretty simple.

* Caution *
Be wary when using the ANSI join syntax on older versions of Oracle. Here is one example of why: Problem with XMLTABLE and LEFT OUTER JOIN. I know I've seen talk/blogs regarding bugs in Oracle's implementation of the ANSI join syntax in certain situations. As with any SQL statement, you always need to verify your output.

Wednesday, May 18, 2011

Options for Slow Performance Parsing Large XML

First, let me state that this is a situational solution, not a general solution to all performance problems when parsing large XML and inserting the data into the DB. I was brought onto a project to code the PL/SQL portion of a business process that would be kicked off a few times a year via a screen. This process would call a remote web service to see if a set of data had been updated. If the data had been updated since the last run, the code would call the web service again to retrieve all the information in XML format. This information would be parsed and stored into a table in the DB, and a few automatic rules would be applied to update the data. Once the process finished, the user would perform manual steps to complete the larger process.

The XML that was being returned from the web service was in a simple format of
<resultset>
 <row>
   ...up to 16 nodes...
 </row>
 <row>
   ...up to 16 nodes...
 </row>
</resultset>

There were some mandatory nodes and many optional nodes. At the time of testing, there were 2,183 row nodes being returned. The coding/testing was done on 11.1.0.6.

Taking the lazy route, I went with the simple XMLTable method for parsing the XML and storing it in the table. This was done via the basic
INSERT INTO ... 
SELECT ... 
  FROM XMLTABLE(... PASSING l_result_xml ...);
where l_result_xml was defined as an XMLType.
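As a self-contained miniature of that pattern (the table and node names here are hypothetical, trimmed down to two columns):

CREATE TABLE demo_queue (statute VARCHAR2(100), end_dt DATE);

DECLARE
   l_result_xml  XMLTYPE := XMLTYPE(
      '<resultset>
         <row><STATUTE>12.34</STATUTE><END_DT>2011-01-01</END_DT></row>
         <row><STATUTE>56.78</STATUTE></row>
       </resultset>');
BEGIN
   INSERT INTO demo_queue (statute, end_dt)
   SELECT statute, end_dt
     FROM XMLTABLE('/resultset/row'
                   PASSING l_result_xml
                   COLUMNS
                   statute VARCHAR2(100) PATH 'STATUTE',
                   end_dt  DATE          PATH 'END_DT');
END;
/

The second row has no END_DT node, so that column simply comes back NULL.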

During my testing, the total run time was 1459 seconds. While this was a seldom-run process, it was still too slow to hand to the client. Given that this PL/SQL code included a web service call, a MERGE, and an UPDATE statement, I wasn't sure where the slowdown was, so I used
dbms_utility.get_time;

to get some basic timing information.
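The pattern is simple; get_time returns hundredths of a second, so the difference is divided by 100:

l_start := dbms_utility.get_time;
-- statement being timed goes here
l_end   := dbms_utility.get_time;
dbms_output.put_line('TIME in sec: ' || to_char((l_end - l_start) / 100));

The time for a run broke down as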

HTTP TIME in sec: 30.88
INSERT INTO TIME in sec: 1031.99
MERGE TIME in sec: 0.03
UPDATE TIME in sec: 0.15


The time spent dealing with the web service seemed reasonable, and the MERGE/UPDATE times were great given the amount of SQL involved in them. The INSERT INTO above was killing performance though. Having a pretty good guess why, I ran an explain plan on the SQL statement and saw the dreaded
COLLECTION ITERATOR PICKLER FETCH

in the explain plan. All those in-memory operations on the XML were killing performance, as expected.
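If you want to reproduce this for yourself, a minimal sketch (any transient, non-schema-based XMLType passed to XMLTABLE on this version should show the same operation):

EXPLAIN PLAN FOR
SELECT x.val
  FROM XMLTABLE('/resultset/row'
                PASSING XMLTYPE('<resultset><row><val>1</val></row></resultset>')
                COLUMNS val NUMBER PATH 'val') x;

SELECT plan_table_output FROM TABLE(dbms_xplan.display);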

For several reasons, registering a schema was not high on the list of options for better performance. My next approach was to manually parse the XML in PL/SQL into memory and then use a FORALL to store that into the database. For that approach I set up a structure like the following (simplified for posting):
TYPE import_q_rec IS RECORD (statute          dbms_sql.Varchar2_Table,
                             end_dt           dbms_sql.Date_Table);
l_import_q_tab       import_q_rec;
l_result_xml         XMLTYPE;
l_temp_nd            XMLTYPE;
l_row_index          PLS_INTEGER := 1;

-- Walk the row nodes by position until no more exist
WHILE l_result_xml.existsNode('/resultset/row[' || to_char(l_row_index) || ']') > 0
LOOP
  l_temp_nd := l_result_xml.extract('/resultset/row[' || to_char(l_row_index) || ']');
  l_import_q_tab.statute(l_row_index) := f_extractxmltypestrval(l_temp_nd, '/row/STATUTE/text()', NULL);
  l_row_index := l_row_index + 1;
END LOOP;

-- Bulk-insert everything parsed above in one round trip
FORALL i IN 1..l_row_index-1
  INSERT INTO import_queue (statute, end_dt)
  VALUES (l_import_q_tab.statute(i), l_import_q_tab.end_dt(i));

This structure loops through each row node in the XML and parses out the child nodes that exist. As some nodes were optional, the logic hidden within f_extractxmltypestrval avoids raising errors when a desired node does not exist.
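The implementation of f_extractxmltypestrval wasn't posted, but a helper with that behavior might look something like this sketch (mine, not the project's actual code):

FUNCTION f_extractxmltypestrval(p_xml     IN XMLTYPE,
                                p_xpath   IN VARCHAR2,
                                p_default IN VARCHAR2)
   RETURN VARCHAR2
IS
BEGIN
   -- Only extract when the node exists; otherwise return the caller's default
   IF p_xml IS NOT NULL AND p_xml.existsNode(p_xpath) = 1 THEN
      RETURN p_xml.extract(p_xpath).getStringVal();
   END IF;
   RETURN p_default;
END f_extractxmltypestrval;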

Once all this information was parsed and loaded into memory, the FORALL would store it into the table.

The first clean run through the code was noticeably faster, with a total run time of 37 seconds. The time for a run broke down as

HTTP TIME in sec: 32.43
PARSE XML TIME in sec: 3.59
FORALL TIME in sec: 0.47
MERGE TIME in sec: 0.06
UPDATE TIME in sec: 0.27


So the time spent parsing the XML and storing it into the table was nearly 1032 seconds for the first approach and approximately 4 seconds for this one. Given how infrequently the process runs, that was good enough performance to call this method done.

So what did this show? When Oracle has to parse large amounts of XML that is stored as a CLOB or a PL/SQL XMLType variable, performance will not be great (at best). This is confirmed by Mark Drake via this XMLDB thread. If you are looking for performance increases, then you have to look at other options. The above is one option that worked well for the situation at hand. One option is to register a schema in the database, create a table based off that schema, store the XML in there and parse it that way (if still needed). A second option would be to store the XML as binary XML (available in 11g). How those compare to the pure PL/SQL approach is a good question, and hopefully I'll get around to looking into those soon.

Edit: June 9, 2011
As stated above, my testing was done on 11.1.0.6. In this OTN thread, see the answer from Mark (mdrake) regarding some internal performance changes made in 11.1.0.7 that would alter my original results. The changes should make the runs faster; I'd be interested to see how much faster.

Thursday, January 13, 2011

Memory Leaks, part 2

While keeping up on the OTN forums, I saw one post that was working with DOMDocuments and used .freeNode. Having a system in production that doesn't use .freeNode, I wondered whether that omission contributes to the known small memory leak an OS session accumulates over time. So I went back to my previous post, Memory Leaks, and revisited and expanded the test to see whether not using .freeNode was causing memory leaks.

Systems:
The following was tested on two different machines, both running the same version of Red Hat Linux. The older machine ran 10.2.0.4 and had 4 GB of memory. The newer machine ran 11.1.0.6 and had 8 GB of memory, along with a better processor, of course.

Test procedure:
I ran the following script three times on both systems, updating it between runs as noted in the comments. On the older system, the loop ran 300,000 iterations; on the newer system, 600,000 iterations (which still ran faster than the older system). The comments show which two lines of code I added for run #2 and the one line I added for run #3. After each run, I would close that window/session and start a new one. I'm using PL/SQL Developer, set up so each window runs as a separate database session.

DECLARE
   l_dom_doc  dbms_xmldom.DOMDocument;
   l_node     dbms_xmldom.DOMNode;  -- added run 2

   PROCEDURE p_local_open_close IS
   BEGIN
      l_dom_doc := dbms_xmldom.newDOMDocument;
      l_node := dbms_xmldom.makeNode(dbms_xmldom.createElement(l_dom_doc, 'root'));  -- added run 2
      dbms_xmldom.freeNode(l_node);  -- added run 3
      dbms_xmldom.freeDocument(l_dom_doc);
   END p_local_open_close;

BEGIN
   DBMS_APPLICATION_INFO.SET_MODULE(module_name => 'user_domDocument_mem_leak_test',
                                    action_name => 'SLEEPING');
   dbms_lock.sleep(20);
   /* Sleeps so there is time to find the OS PID via
      select s.sid, s.process, s.serial#, s.module, s.action, p.spid os_pid
        from gv$session s,
             gv$process p
       where s.module = 'user_domDocument_mem_leak_test'
         and s.paddr = p.addr;
   */
   dbms_application_info.set_action('RUNNING');
   FOR i IN 1..300000  -- adjust as needed for your system
   LOOP
      p_local_open_close;
   END LOOP;
   dbms_application_info.set_action('COMPLETE');
END;


To determine memory usage, I used top to watch the VIRT (total virtual memory) value for the OS process. The SQL in the script's comment shows how to find the OS process based on the Oracle session that is running. If your version of top doesn't support top -p <pid>, another way, which my coworker turned up, is ps -o pid,vsz -p <pid>. This shows the PID and VSZ (virtual size) columns. I'm sure there are other ways too.
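An alternative that stays inside the database is to watch the PGA counters that v$process exposes, from a second session while the test runs (a sketch; note these track Oracle-managed PGA, which may not line up exactly with what top reports as VIRT):

SELECT s.sid, s.module, p.spid,
       p.pga_used_mem, p.pga_alloc_mem, p.pga_max_mem
  FROM gv$session s
  JOIN gv$process p
    ON s.paddr = p.addr
   AND s.inst_id = p.inst_id
 WHERE s.module = 'user_domDocument_mem_leak_test';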

Run results:
Runs #2 and #3 were identical in memory usage to run #1. On the 10.2.0.4 system, VIRT would start out at around 438m and increase to 470m over the run. On the 11.1.0.6 system, memory would start at 1181m and stay there.

Conclusion:
It would appear that using dbms_xmldom.freeNode is completely optional, as Oracle reclaims that memory correctly on its own. This is determined by the fact that memory usage from the first run, which had no dbms_xmldom.DOMNode in it at all, was exactly the same as in the second and third runs.

Additional observation:
I was expecting memory usage on the 10.2.0.4 system to remain constant like it did on the 11.1.0.6 system. As my previous post said, the fix was applied to a 10.2.0.3 system that has since been upgraded to 10.2.0.4. As 10.2.0.4 was released sometime in early 2008 (or so it appears), it is quite possible the patch applied to 10.2.0.3 was not included in the 10.2.0.4 release. Looks like I will have to verify this on the client's system and see about getting it fixed if it still occurs there as well.