Storing periodically retrieved data

shadow_ray

Hi,

 

I'm working with a limited web API. It returns a full database table as CSV. I want to write a script that downloads it periodically and stores it efficiently for later use. I'm interested in how the data changes over time, which is why I have to download it periodically. Efficient storage is my issue. Any ideas?


The DB tables are usually around a few hundred KB. I'm planning on using Python and Docker.

 

 

ಠ_ಠ

Depends on what you mean by efficient. I would say storing it in a database is the most efficient, especially if you want to retrieve and compare it later. Otherwise you could just store the CSV, or maybe convert it to JSON to be more programmer-friendly.
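For the file-based option, a minimal sketch of the download-and-convert step could look like this (the URL is a placeholder, and I'm assuming the CSV has a header row):

    import csv
    import json
    import urllib.request

    URL = "https://example.com/api/table.csv"  # hypothetical endpoint

    with urllib.request.urlopen(URL) as resp:
        text = resp.read().decode("utf-8")

    # DictReader uses the header row as keys, giving one dict per row
    rows = list(csv.DictReader(text.splitlines()))

    with open("snapshot.json", "w") as f:
        json.dump(rows, f, indent=2)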

 

~edit: Since you mention Docker: Simply create a second Docker container with a database (e.g. MariaDB). Data should always be stored outside the container. Containers should be seen as ephemeral, meaning you can destroy them and recreate them and your data should still be intact.
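For example (image tag, host path and password are all just placeholders), a bind mount keeps the database files on the host, so the container can be destroyed and recreated without losing data:

    docker run -d --name mariadb \
      -e MYSQL_ROOT_PASSWORD=changeme \
      -v /srv/mariadb-data:/var/lib/mysql \
      mariadb:latest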

So the composite key in my DB should contain the original key and a timestamp/version. I have to compare the new records to the old ones, and if there is a difference I can create a new record with a new timestamp. Am I correct?

But how can I deal with deletions?

ಠ_ಠ

1 hour ago, shadow_ray said:

So the composite key in my DB should contain the original key and a timestamp/version. I have to compare the new records to the old ones, and if there is a difference I can create a new record with a new timestamp. Am I correct?

But how can I deal with deletions?

Yeah, that sounds about right and is what I'd use as well. I'd probably use "key + version" as the composite key but add a timestamp anyway (for debugging, if nothing else). For deletions you could add a "deleted" column that is set to "1" when a row is no longer returned.
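As a rough sketch of that schema (SQLite here just for illustration; the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect("snapshots.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS records (
            id      TEXT    NOT NULL,  -- original key from the CSV
            rev     INTEGER NOT NULL,  -- version, starting at 1
            fetched TEXT    NOT NULL,  -- timestamp of the fetch
            deleted INTEGER NOT NULL DEFAULT 0,
            payload TEXT,              -- the rest of the row, e.g. as JSON
            PRIMARY KEY (id, rev)
        )
    """)
    conn.commit()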

 

After you fetch values from the remote service, you can then do the following:

  • When the database is empty, simply add all records with version 1, current timestamp and deleted = 0

  • Otherwise, fetch the record with the highest version (revision) for each key, skipping those whose latest revision has deleted = 1:

SELECT <columns>
    FROM table t1
    WHERE t1.rev = (SELECT max(rev) FROM table t2 WHERE t2.id = t1.id)
    AND t1.deleted = 0
  • When a record is unmodified, do nothing
  • When a record is new, add it into the database with version 1, current timestamp, deleted = 0
  • When a record is modified, add a new record with version+1, current timestamp and deleted = 0
  • When a record is missing, add a new record with version+1, current timestamp and deleted = 1

Alternatively, you could mark the last version as deleted in place, but then you'd lose the information about when the deletion happened. Instead of copying the whole record, the deletion marker could also contain only the primary key, version, timestamp and deleted = 1.
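Putting the steps above together, a minimal Python sketch of one sync pass (SQLite for brevity, using the "records" schema sketched earlier; the URL and the "id" key column are placeholders):

    import csv
    import json
    import sqlite3
    import time
    import urllib.request

    URL = "https://example.com/api/table.csv"  # hypothetical endpoint
    KEY = "id"                                 # hypothetical primary-key column

    def fetch_remote():
        """Download the CSV and return {key: row-dict}."""
        with urllib.request.urlopen(URL) as resp:
            text = resp.read().decode("utf-8")
        return {row[KEY]: row for row in csv.DictReader(text.splitlines())}

    def latest_revisions(conn):
        """Return {key: (rev, deleted, payload)} for the highest revision of each key."""
        sql = """SELECT id, rev, deleted, payload FROM records t1
                 WHERE t1.rev = (SELECT max(rev) FROM records t2 WHERE t2.id = t1.id)"""
        return {r[0]: (r[1], r[2], r[3]) for r in conn.execute(sql)}

    def sync(conn):
        now = time.strftime("%Y-%m-%d %H:%M:%S")
        remote = fetch_remote()
        local = latest_revisions(conn)

        for key, row in remote.items():
            payload = json.dumps(row, sort_keys=True)
            if key not in local:
                # new record -> version 1
                conn.execute("INSERT INTO records VALUES (?, 1, ?, 0, ?)",
                             (key, now, payload))
            else:
                rev, deleted, old = local[key]
                if deleted or old != payload:
                    # modified (or re-appeared) record -> version + 1
                    conn.execute("INSERT INTO records VALUES (?, ?, ?, 0, ?)",
                                 (key, rev + 1, now, payload))
                # unmodified -> do nothing

        for key, (rev, deleted, _) in local.items():
            if key not in remote and not deleted:
                # record disappeared -> add a deletion marker
                conn.execute("INSERT INTO records VALUES (?, ?, ?, 1, NULL)",
                             (key, rev + 1, now))
        conn.commit()

    sync(sqlite3.connect("snapshots.db"))

Storing the row payload as canonical JSON (sort_keys=True) makes the "is it modified?" check a simple string comparison.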

If it's just megabytes, you could probably use SQLite instead of a big MariaDB instance.

 

Yeah, add some columns to keep track of what got deleted, something like last_seen, last_modified, first_seen. They could be integers for simplicity, as in: write 14 Nov 2020 02:14 (the time as I write this) as 20201114021400... that should be good enough for once-a-second updates.
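In Python that integer format is a one-liner, e.g.:

    import time

    # e.g. 14 Nov 2020 02:14:00 -> 20201114021400
    stamp = int(time.strftime("%Y%m%d%H%M%S"))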

 

 
