Jump to content

How to store a trie

Wictorian
 Share

26 minutes ago, Wictorian said:

a trie

A what? 

Main System (Byarlant): Ryzen 7 3800XT | Asus B350-F Strix | EK 240mm Basic AIO | 16GB G.Skill DDR4 3200MT/s CAS-14 | XFX RX 5600 XT THICC II | Samsung 960 PRO 512GB / Samsung 970 EVO 500GB / Crucial MX500 2TB / Crucial MX500 500GB / WD White 7200RPM 8TB | Corsair RM750X | Mellanox ConnectX-3 10G NIC | Hyte Y60 Case | Dell U3415W Monitor | Microsoft Modern Keyboard

 

Laptop (Narrative): Lenovo Flex 5 81X20005US | Ryzen 5 4500U | 16GB RAM (soldered) | Vega 6 Graphics | SKHynix P31 1TB NVMe SSD | Intel AX200 Wifi (all-around awesome machine)

 

TrueNAS Server (Veda): Xeon E3-1241v3 | Supermicro X10SLL-F | Corsair H60 | 32GB Micron DDR3L ECC 1600MHz | 4x 10TB WD Whites / 4x 14TB Seagate Exos / 2x 1TB HGST 2.5" / 1x Samsung PM961 128GB SSD / 1x Kingston 16GB SSD | Seasonic Prime Fanless 500W | Mellanox ConnectX-3 10G NIC | LSI 9207-8i LBA | Fractal Design Node 804 Case (side panels swapped to show off drives)


Media Center/Video Capture (Jesta Cannon): Ryzen 5 1600X | ASRock B450M Pro4 R2.0 | Noctua NH-L12S | 16GB Crucial DDR4 3200MT/s CAS-22 | EVGA GTX750Ti SC | UMIS NVMe SSD 256GB / Seagate 1.5TB HDD | Corsair CX450M | Hauppauge ImpactVCB-PCIe | Mellanox ConnectX-2 10G NIC | LG UH12NS30 BD-ROM | Silverstone Sugo SG-11 Case

 

Camera: Sony ɑ7II (w/ Meike Grip) | Sony SEL24240 | Samyang 35mm ƒ/2.8 | Sony SEL50F18F | Sony SEL2870 (kit lens) | PNY Elite Perfomance SDXC cards

 

Network: Google Fiber Webpass (500/500) > Unifi Security Gateway (Office) > Unifi Switch 8 60W (Office) > Unifi Switch Lite (Closet) > Unifi Switch Lite (Bedroom) + Unifi Switch 8 (Media Center)


Tablet (---): Samsung Galaxy Tab A 8"
 

Spoiler

Laptop (Rozen-ZuluSony VAIO VPCF13WFX | Core i7-740QM | 8GB Patriot DDR3 | GT 425M | Kingston 120GB SSD | Blu-ray Drive | Intel 7260 Wifi (lived a good life, retired with honor)


Tablet (ReGZ): Asus T102HA (BIOS clock doesn't tick, loses time when sleep/off) (I kill tablets with disturbing regularity)

Tablet (Unicorn): Surface Pro 2 (battery will reset total capacity to current charge, leading Windows to think it's always 100% charged until it dies)

Tablet (Loto): Dell Venue 8 Pro (screen discoloration issues, wouldn't update to Windows 10)

Tablet: iPad 2 16GB (WiFi died, basically useless after that)

Testbed/Old Desktop (Kshatriya): Xeon X5470 @ 4.0GHz | ZALMAN CNPS9500 | Gigabyte EP45-UD3L | 8GB Nanya DDR2 400MHz | XFX HD6870 DD | OCZ Vertex 3 Max-IOPS 120GB | Corsair CX430M (?) | HooToo USB 3.0 PCIe Card | NZXT H230 Case (mostly intact, but some parts have been scavenged)

Link to comment
Share on other sites

Link to post
Share on other sites

4 hours ago, AbydosOne said:

A what? 

I assume this: https://en.m.wikipedia.org/wiki/Trie

When you're working with such recursive data structures, you might want to also learn about Common Table Expressions (CTE)

 

Since this is basically a tree, a table with a key, some value(s) and a reference to a key should suffice.

~edit:

trie {
  id: int,
  parentId: int,
  …
  <values>
}

 

image.png.523933cada16ccc09bebb7b4e2d0a10f.png

 

For example this trie could be stored as:

 Id | ParentId | Value
  0 |     null | 
  1 |        0 | t
  2 |        0 | i
  3 |        0 | A
  4 |        1 | to
  5 |        1 | te
  6 |        2 | in
  7 |        5 | tea
  8 |        5 | ted
  9 |        5 | ten
 10 |        6 | inn

Remember to either quote or @mention others, so they are notified of your reply

Link to comment
Share on other sites

Link to post
Share on other sites

36 minutes ago, Eigenvektor said:

Oh, there's a name for that concept... surprised I haven't heard it before...

 

36 minutes ago, Eigenvektor said:

For example this trie could be stored as:

Couldn't you iterate up the parent IDs and store only one character at each node? Or maybe I'm missing the point of a trie (I just like iterating up pointer trails for some reason, so that's how I end up imagining algorithms).

 Id | ParentId | Value
  0 |     null | 
  1 |        0 | t
  2 |        0 | i
  3 |        0 | A
  4 |        1 | o
  5 |        1 | e
  6 |        2 | n
  7 |        5 | a
  8 |        5 | d
  9 |        5 | n
 10 |        6 | n

Main System (Byarlant): Ryzen 7 3800XT | Asus B350-F Strix | EK 240mm Basic AIO | 16GB G.Skill DDR4 3200MT/s CAS-14 | XFX RX 5600 XT THICC II | Samsung 960 PRO 512GB / Samsung 970 EVO 500GB / Crucial MX500 2TB / Crucial MX500 500GB / WD White 7200RPM 8TB | Corsair RM750X | Mellanox ConnectX-3 10G NIC | Hyte Y60 Case | Dell U3415W Monitor | Microsoft Modern Keyboard

 

Laptop (Narrative): Lenovo Flex 5 81X20005US | Ryzen 5 4500U | 16GB RAM (soldered) | Vega 6 Graphics | SKHynix P31 1TB NVMe SSD | Intel AX200 Wifi (all-around awesome machine)

 

TrueNAS Server (Veda): Xeon E3-1241v3 | Supermicro X10SLL-F | Corsair H60 | 32GB Micron DDR3L ECC 1600MHz | 4x 10TB WD Whites / 4x 14TB Seagate Exos / 2x 1TB HGST 2.5" / 1x Samsung PM961 128GB SSD / 1x Kingston 16GB SSD | Seasonic Prime Fanless 500W | Mellanox ConnectX-3 10G NIC | LSI 9207-8i LBA | Fractal Design Node 804 Case (side panels swapped to show off drives)


Media Center/Video Capture (Jesta Cannon): Ryzen 5 1600X | ASRock B450M Pro4 R2.0 | Noctua NH-L12S | 16GB Crucial DDR4 3200MT/s CAS-22 | EVGA GTX750Ti SC | UMIS NVMe SSD 256GB / Seagate 1.5TB HDD | Corsair CX450M | Hauppauge ImpactVCB-PCIe | Mellanox ConnectX-2 10G NIC | LG UH12NS30 BD-ROM | Silverstone Sugo SG-11 Case

 

Camera: Sony ɑ7II (w/ Meike Grip) | Sony SEL24240 | Samyang 35mm ƒ/2.8 | Sony SEL50F18F | Sony SEL2870 (kit lens) | PNY Elite Perfomance SDXC cards

 

Network: Google Fiber Webpass (500/500) > Unifi Security Gateway (Office) > Unifi Switch 8 60W (Office) > Unifi Switch Lite (Closet) > Unifi Switch Lite (Bedroom) + Unifi Switch 8 (Media Center)


Tablet (---): Samsung Galaxy Tab A 8"
 

Spoiler

Laptop (Rozen-ZuluSony VAIO VPCF13WFX | Core i7-740QM | 8GB Patriot DDR3 | GT 425M | Kingston 120GB SSD | Blu-ray Drive | Intel 7260 Wifi (lived a good life, retired with honor)


Tablet (ReGZ): Asus T102HA (BIOS clock doesn't tick, loses time when sleep/off) (I kill tablets with disturbing regularity)

Tablet (Unicorn): Surface Pro 2 (battery will reset total capacity to current charge, leading Windows to think it's always 100% charged until it dies)

Tablet (Loto): Dell Venue 8 Pro (screen discoloration issues, wouldn't update to Windows 10)

Tablet: iPad 2 16GB (WiFi died, basically useless after that)

Testbed/Old Desktop (Kshatriya): Xeon X5470 @ 4.0GHz | ZALMAN CNPS9500 | Gigabyte EP45-UD3L | 8GB Nanya DDR2 400MHz | XFX HD6870 DD | OCZ Vertex 3 Max-IOPS 120GB | Corsair CX430M (?) | HooToo USB 3.0 PCIe Card | NZXT H230 Case (mostly intact, but some parts have been scavenged)

Link to comment
Share on other sites

Link to post
Share on other sites

11 minutes ago, AbydosOne said:

Oh, there's a name for that concept... surprised I haven't heard it before...

 

Couldn't you iterate up the parent IDs and store only one character at each node?

Sure, you could do that. But depending on how you're using that thing, it could be much faster to have what you need right away rather than recursing through the tree. It's essentially a space vs speed tradeoff.

 

~edit: Let's say you need to check if an entry for "tea" already exists, in my example that could be done with a simple select, in your example you'd need to recurse through the entries, starting with "t", to see if you can form that combination.

Remember to either quote or @mention others, so they are notified of your reply

Link to comment
Share on other sites

Link to post
Share on other sites

5 hours ago, Wictorian said:

How should one store a trie in a database?

Is there any context into why you are trying to insert it into a database?

 

I mean there could be a few ways to store it.  If you are ultimately trying to store it as a dictionary, might as well just store it the way a database would work (in a table)...and then rebuild the trie on demand if you need it.

 

e.g. having a table with keys dictionary_id, word
So then you can do a select word from dictionary where dictionary_id = 123;  To get the entire dictionary.  Then iterate through creating the trie.

 

I mean you could I guess use a binary datatype, but I mean ultimately it comes back to needing to still re-import it most likely.

3735928559 - Beware of the dead beef

Link to comment
Share on other sites

Link to post
Share on other sites

1 hour ago, Eigenvektor said:

Sure, you could do that. But depending on how you're using that thing, it could be much faster to have what you need right away rather than recursing through the tree. It's essentially a space vs speed tradeoff.

This is a really good point, creating a recursive will only optimize the storage space, not the query performance. Recursive query is extremely costly on database. If you have a thousand sub level that could be many tens of thousand of query while some speed optimized structure can do it in a single query.

Link to comment
Share on other sites

Link to post
Share on other sites

1 hour ago, wanderingfool2 said:

Is there any context into why you are trying to insert it into a database?

 

I mean there could be a few ways to store it.  If you are ultimately trying to store it as a dictionary, might as well just store it the way a database would work (in a table)...and then rebuild the trie on demand if you need it.

 

e.g. having a table with keys dictionary_id, word
So then you can do a select word from dictionary where dictionary_id = 123;  To get the entire dictionary.  Then iterate through creating the trie.

 

I mean you could I guess use a binary datatype, but I mean ultimately it comes back to needing to still re-import it most likely.

what I mean by database is some dort of file on the hard drive so I dont have to reconstruct it every time I run a script.

Link to comment
Share on other sites

Link to post
Share on other sites

36 minutes ago, Wictorian said:

what I mean by database is some dort of file on the hard drive so I dont have to reconstruct it every time I run a script.

Well I mean insertion into a Trie is pretty lightweight for inserts...it's an O(n) efficiency.

 

I mean you could always get tricky by loading all the data in some set structure so that you can just read in the file and have it in the layout you need [would vary by language].

 

Ultimately if your thought was storing it as a flat file/similar, I would just reconstruct it each time.  Unless there is a bunch of words/phrases that is really slowing it down (or calling it to load a bunch of times) then you might want to consider different methods.

 

You could do things like loading in to memory (easier in things like c/C++ kind of thing)...or other ways that I could think of would really be a lot more effort to implement than just adding in again...ie the performance gains from doing such a thing, is likely not worth the time it would take to get it optimized as such.

 

 

3735928559 - Beware of the dead beef

Link to comment
Share on other sites

Link to post
Share on other sites

8 hours ago, Wictorian said:

what I mean by database is some dort of file on the hard drive so I dont have to reconstruct it every time I run a script.

While at the end of the day a database is technically "just a (bunch of) file(s)", it really isn't. You should normally distinguish between the two.

 

A database is optimized for fast retrieval (e.g. indexes) of individual rows and has its own query language. This can allow you to quickly load some subset of the data without having to keep everything in memory. Depending on how you're using the data, a flat file may "good enough", but as I pointed out above, databases support CTEs which can make retrieval of recursive data much more efficient.

 

In any case, you can still use the same overall structure for a flat file and e.g. use a simple CSV. You can use its columns and rows pretty much the same way as you would a single database table.

 

As @wanderingfool2 said, consider reconstructing it instead. If you're loading everything into memory on startup, loading it from disk is only really worth it if that is faster than reconstructing it on the fly. If you only need a subset and there's a good way to query what you need, then an actual database can start to make sense.

Remember to either quote or @mention others, so they are notified of your reply

Link to comment
Share on other sites

Link to post
Share on other sites

14 hours ago, Wictorian said:

what I mean by database is some dort of file on the hard drive so I dont have to reconstruct it every time I run a script.

If you need to load or modify portion on that data a database is what you need. If you will always and only use that tree when fully loaded you want a simple serialization.

Link to comment
Share on other sites

Link to post
Share on other sites

If you want to store it in a database you can use: Neo4J which is a graph database letting you store "nodes" and "links".

 

Alternativly and probably simpler, you can use graphviz which I use the python library (pypi) to parse graphs / construct and parse etc. These can be saved to ".dot" files and uses the "DOT Language".

Link to comment
Share on other sites

Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share


×