Deduplicating with group_by, func.min, and having

published by whitemice on Tue, 01/24/2017 - 14:45

You have a text file with four million records that you want to load into a table in an SQLite database. Some of these records are duplicates (based on certain fields), and the file is not ordered. Because of the size of the data, loading the entire file into memory doesn't work very well, and with this many records a check-at-insert while loading is prohibitively slow. What does work pretty well is to load all the data first and then deduplicate it. An auto-increment record id is what makes this possible: it gives every row, duplicates included, a unique value to keep or delete by.

class VendorSKU(scratch_base):
    __tablename__ = 'sku'
    id      = Column(Integer, primary_key=True, autoincrement=True)
...
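
For illustration, the load step might look something like the sketch below. It uses the standard library sqlite3 module directly rather than SQLAlchemy so that it runs on its own; the table name cross_ref, the column list, the load_rows helper, and the batch size are all assumptions made for the example, not the original loader.

```python
import sqlite3
from itertools import islice


def load_rows(conn, rows, batch_size=1000):
    """Insert every row, duplicates included, and return the number inserted."""
    # Hypothetical table: the auto-increment id is the only part that matters here.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cross_ref ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " source TEXT, sku TEXT, oem TEXT, part TEXT)"
    )
    rows = iter(rows)
    total = 0
    while True:
        # Pull one batch at a time so the whole file never sits in memory.
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        conn.executemany(
            "INSERT INTO cross_ref (source, sku, oem, part) VALUES (?, ?, ?, ?)",
            batch,
        )
        conn.commit()
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
sample = [
    ("acme", "A1", "O1", "P1"),
    ("acme", "A1", "O1", "P1"),  # duplicate of the first row, loaded anyway
    ("acme", "A2", "O2", "P2"),
]
loaded = load_rows(conn, sample, batch_size=2)
```

No duplicate check happens during the load; every row simply gets the next id, and the cleanup is left for afterwards.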

Once all the data is loaded into the table, the deduplication is straightforward using min and group by.

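from sqlalchemy import and_, func

# scratch is the session; VendorCross is a mapped class with
# id, source, sku, oem, and part columns.
# Find each group of duplicates and the lowest id within it.
query = scratch.query(
    func.min(VendorCross.id),
    VendorCross.sku,
    VendorCross.oem,
    VendorCross.part,
).filter(VendorCross.source == source).group_by(
    VendorCross.sku,
    VendorCross.oem,
    VendorCross.part,
).having(func.count(VendorCross.id) > 1)

counter = 0
for (keep_id, sku, oem, part) in query.all():
    counter += 1
    # Delete every record in the group except the one with the lowest id
    scratch.query(VendorCross).filter(
        and_(
            VendorCross.source == source,
            VendorCross.sku == sku,
            VendorCross.oem == oem,
            VendorCross.part == part,
            VendorCross.id != keep_id,
        )
    ).delete()
    if not (counter % 1000):
        # Commit every 1,000 groups, SQLite does not like big transactions
        scratch.commit()
# Commit the deletes left over from the last partial batch
scratch.commit()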

This incantation removes all the records from each group except for the one with the lowest id. The trick for good performance is to batch many deletes into each transaction: commit only after every so many groups processed [in this case 1,000], and remember to commit once more at the end to catch the deletes from the last iteration.
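
If you want to watch the keep-the-lowest-id logic work on a handful of rows, here is a minimal self-contained sketch. It uses the standard library sqlite3 module and raw SQL instead of the SQLAlchemy session; the table name cross_ref, the dedupe helper, and the sample rows are assumptions made up for the example.

```python
import sqlite3


def dedupe(conn, source, commit_every=1000):
    """Delete all but the lowest-id row in each duplicated (sku, oem, part) group."""
    groups = conn.execute(
        "SELECT MIN(id), sku, oem, part FROM cross_ref"
        " WHERE source = ?"
        " GROUP BY sku, oem, part"
        " HAVING COUNT(id) > 1",
        (source,),
    ).fetchall()
    removed = 0
    for counter, (keep_id, sku, oem, part) in enumerate(groups, start=1):
        cur = conn.execute(
            "DELETE FROM cross_ref"
            " WHERE source = ? AND sku = ? AND oem = ? AND part = ? AND id != ?",
            (source, sku, oem, part, keep_id),
        )
        removed += cur.rowcount
        if counter % commit_every == 0:
            # Batch many deletes into each transaction
            conn.commit()
    # Catch the deletes from the last partial batch
    conn.commit()
    return removed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cross_ref ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " source TEXT, sku TEXT, oem TEXT, part TEXT)"
)
conn.executemany(
    "INSERT INTO cross_ref (source, sku, oem, part) VALUES (?, ?, ?, ?)",
    [
        ("acme", "A1", "O1", "P1"),
        ("acme", "A2", "O2", "P2"),
        ("acme", "A1", "O1", "P1"),   # duplicate of id 1
        ("acme", "A1", "O1", "P1"),   # duplicate of id 1
        ("other", "A1", "O1", "P1"),  # different source, left alone
    ],
)
removed = dedupe(conn, "acme")
remaining = [row[0] for row in conn.execute("SELECT id FROM cross_ref ORDER BY id")]
```

One thing to keep in mind with either version: the delete matches on equality, so a group whose sku, oem, or part is NULL would be found by the group by but not matched by the delete.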

Tags: python, sqlalchemy, sqlite

whitemiceconsulting.com

Adam Tauno Williams
