Storing periodically retrieved data

shadow_ray

Hi,

 

I'm working with a limited web API. It returns a full database table as CSV. I want to write a script that downloads it periodically and stores it efficiently for later use. I'm interested in how the data changes over time, which is why I have to download it periodically. Efficient storage is my issue. Any ideas?


The DB tables are usually around a few hundred KB. I'm planning on using Python and Docker.

 

 

ಠ_ಠ

Depends on what you mean by efficient. I would say storing it in a database is the most efficient, especially if you want to retrieve and compare it later. Otherwise you could just store the CSV, or maybe convert it to JSON to be more programmer-friendly.
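For the file-based option, a minimal sketch of the download-and-convert step could look like this (the URL is a placeholder, and I'm assuming the CSV has a header row):

    import csv
    import json
    import urllib.request

    URL = "https://example.com/api/table.csv"  # hypothetical endpoint

    with urllib.request.urlopen(URL) as resp:
        text = resp.read().decode("utf-8")

    # DictReader uses the header row as keys, giving one dict per row
    rows = list(csv.DictReader(text.splitlines()))

    with open("snapshot.json", "w") as f:
        json.dump(rows, f, indent=2)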

 

~edit: Since you mention Docker: Simply create a second Docker container with a database (e.g. MariaDB). Data should always be stored outside the container. Containers should be seen as ephemeral, meaning you can destroy them and recreate them and your data should still be intact.
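For example (image tag, host path and password are all just placeholders), a bind mount keeps the database files on the host, so the container can be destroyed and recreated without losing data:

    docker run -d --name mariadb \
      -e MYSQL_ROOT_PASSWORD=changeme \
      -v /srv/mariadb-data:/var/lib/mysql \
      mariadb:latest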

So the composite key in my DB should contain the original key and a timestamp/version. I have to compare the new records to the old ones, and if there is a difference I can create a new record with a new timestamp. Am I correct?

But how can I deal with deletions?

ಠ_ಠ

1 hour ago, shadow_ray said:

So the composite key in my DB should contain the original key and a timestamp/version. I have to compare the new records to the old ones, and if there is a difference I can create a new record with a new timestamp. Am I correct?

But how can I deal with deletions?

Yeah, that sounds about right and is what I'd use as well. I'd probably use "key + version" as the composite key but add a timestamp anyway (for debugging, if nothing else). For deletions you could add a "deleted" column that is set to "1" when a row is no longer returned.
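As a rough sketch of that schema (SQLite here just for illustration; the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect("snapshots.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS records (
            id      TEXT    NOT NULL,  -- original key from the CSV
            rev     INTEGER NOT NULL,  -- version, starting at 1
            fetched TEXT    NOT NULL,  -- timestamp of the fetch
            deleted INTEGER NOT NULL DEFAULT 0,
            payload TEXT,              -- the rest of the row, e.g. as JSON
            PRIMARY KEY (id, rev)
        )
    """)
    conn.commit()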

 

After you fetch values from the remote service, you can then do the following:

  • When the database is empty, simply add all records with version 1, current timestamp and deleted = 0

  • Otherwise, fetch the record with the highest version (revision) for each key, skipping those whose latest revision has deleted = 1:

SELECT <columns>
    FROM table t1
    WHERE t1.rev = (SELECT max(rev) FROM table t2 WHERE t2.id = t1.id)
    AND t1.deleted = 0
  • When a record is unmodified, do nothing
  • When a record is new, add it into the database with version 1, current timestamp, deleted = 0
  • When a record is modified, add a new record with version+1, current timestamp and deleted = 0
  • When a record is missing, add a new record with version+1, current timestamp and deleted = 1

Alternatively, you could mark the last version as deleted in place, but then you'd lose the information about when the deletion happened. Instead of copying the whole record, the deletion marker could also contain only the primary key, version, timestamp and deleted = 1.
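Putting the steps above together, a minimal Python sketch of one sync pass (SQLite for brevity, using the "records" schema sketched earlier; the URL and the "id" key column are placeholders):

    import csv
    import json
    import sqlite3
    import time
    import urllib.request

    URL = "https://example.com/api/table.csv"  # hypothetical endpoint
    KEY = "id"                                 # hypothetical primary-key column

    def fetch_remote():
        """Download the CSV and return {key: row-dict}."""
        with urllib.request.urlopen(URL) as resp:
            text = resp.read().decode("utf-8")
        return {row[KEY]: row for row in csv.DictReader(text.splitlines())}

    def latest_revisions(conn):
        """Return {key: (rev, deleted, payload)} for the highest revision of each key."""
        sql = """SELECT id, rev, deleted, payload FROM records t1
                 WHERE t1.rev = (SELECT max(rev) FROM records t2 WHERE t2.id = t1.id)"""
        return {r[0]: (r[1], r[2], r[3]) for r in conn.execute(sql)}

    def sync(conn):
        now = time.strftime("%Y-%m-%d %H:%M:%S")
        remote = fetch_remote()
        local = latest_revisions(conn)

        for key, row in remote.items():
            payload = json.dumps(row, sort_keys=True)
            if key not in local:
                # new record -> version 1
                conn.execute("INSERT INTO records VALUES (?, 1, ?, 0, ?)",
                             (key, now, payload))
            else:
                rev, deleted, old = local[key]
                if deleted or old != payload:
                    # modified (or re-appeared) record -> version + 1
                    conn.execute("INSERT INTO records VALUES (?, ?, ?, 0, ?)",
                                 (key, rev + 1, now, payload))
                # unmodified -> do nothing

        for key, (rev, deleted, _) in local.items():
            if key not in remote and not deleted:
                # record disappeared -> add a deletion marker
                conn.execute("INSERT INTO records VALUES (?, ?, ?, 1, NULL)",
                             (key, rev + 1, now))
        conn.commit()

    sync(sqlite3.connect("snapshots.db"))

Storing the row payload as canonical JSON (sort_keys=True) makes the "is it modified?" check a simple string comparison.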

If it's just megabytes, you could probably use SQLite instead of a big MariaDB instance.

 

Yeah, add some columns to keep track of what got deleted, something like last_seen, last_modified, first_seen. They could be integers for simplicity, as in: write 14 Nov 2020 02:14 (the time as I write this) as 20201114021400... that should be good enough for once-a-second updates.
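In Python that integer format is a one-liner, e.g.:

    import time

    # e.g. 14 Nov 2020 02:14:00 -> 20201114021400
    stamp = int(time.strftime("%Y%m%d%H%M%S"))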

 

 
