You are here

Deduplicating with group_by, func.min, and having

You have a text file with four million records and you want to load this data into a table in an SQLite database. But some of these records are duplicates (based on certain fields) and the file is not ordered. Due to the size of the data loading the entire file into memory doesn't work very well. And due to the number of records doing a check-at-insert when loading the data is also prohibitively slow. But what does work pretty well is just to load all the data and then deduplicate it. Having an auto-increment record id is what makes this possible.

class VendorSKU(scratch_base):
    __tablename__ = 'sku'
    id      = Column(Integer, primary_key=True, autoincrement=True)
...

Once all the data gets loaded into the table the deduplication is straight-forward using minimum and group by.

query = scratch.query(
    func.min( VendorCross.id ),
    VendorCross.sku,
    VendorCross.oem,
    VendorCross.part ).filter(VendorCross.source == source).group_by(
        VendorCross.sku,
        VendorCross.oem,
        VendorCross.part ).having(
            func.count(VendorCross.id) > 1 )
counter = 0
for (id, sku, oem, part, ) in query.all( ):
    counter += 1
    scratch.query(VendorCross).filter(
        and_(
            VendorCross.source == source, 
            VendorCross.sku == sku,
            VendorCross.oem == oem,
            VendorCross.part == part,
            VendorCross.id != id ) ).delete( )
    if not (counter % 1000):
        # Commit every 1,000 records, SQLite does not like big transactions
        scratch.commit()
scratch.commit()

This incantation removes all the records from each group except for the one with the lowest id. The trick for good performance is to batch many deletes into each transaction - only commit every so many [in this case 1,000] groups processed; just also remember to commit at the end to catch the deletes from the last iteration.

Theme by Danetsoft and Danang Probo Sayekti inspired by Maksimer