Hello, I'm looking for IT wordlists. Don't worry, it's not for password cracking.

I have a large database of plaintext documents and I'd like to select the ones that are about some IT service, job, etc. So I thought wordlists would be ideal for this.

The documents are in Czech, so a Czech IT wordlist would be ideal, but even English IT wordlists would be very helpful.

If you have any other suggestions for how to select relevant IT documents, please share them too.

Link to comment
https://linustechtips.com/topic/777774-it-wordlists/

How many documents are you dealing with? What kind of documents? And what are your needs in terms of precision and recall of your queries? Depending on what you need, a word list might be a very poor solution, e.g., if you're looking at customer feedback or reviews or really free-form text. Depending on what you need, I might recommend you look into topic modeling, which is a set of algorithms (e.g., Latent Dirichlet Allocation and Author-Topic Modeling) specifically designed to discover the "topics" that a set of documents is about, and to classify documents based on what "topics" appear in them. It gets around the need for a word list, but it does bring some other challenges that may or may not be something you want to, or can afford to, deal with.

Link to comment
https://linustechtips.com/topic/777774-it-wordlists/#findComment-9813966

There are more than 500,000 documents; they are plain-text contracts. The contracts cover various topics, like construction work or even laundry services. I'm looking for any IT-related contracts.

For topic modelling, wouldn't I still need some kind of keywords to determine the topic?

Link to comment
https://linustechtips.com/topic/777774-it-wordlists/#findComment-9815122

Okay. That still doesn't quite answer the other question, which is what your needs are in terms of precision and recall. E.g., if you need to get every contract related to IT and no false positives, then topic modeling may not be the best tool (for that, you'd need metadata on the documents--probably hand-annotated). But if you just need to get a pretty good share of the contracts related to IT and some false positives are okay, then it might be a good option.
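
To make precision and recall concrete, here's a minimal sketch in Python. The document IDs are made up for illustration; in practice the "relevant" set would come from a hand-labelled sample of your contracts.

```python
def precision_recall(selected, relevant):
    """Compare the documents a filter selected with the truly relevant ones."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    # Precision: what fraction of the selected documents are actually relevant.
    precision = true_positives / len(selected) if selected else 0.0
    # Recall: what fraction of the relevant documents the filter found.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, made up for illustration.
it_contracts = {1, 2, 3, 4}        # hand-labelled as IT-related
filter_output = {2, 3, 4, 7, 9}    # what some filter returned

precision, recall = precision_recall(filter_output, it_contracts)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.60 recall=0.75
```

Here the filter found three of the four IT contracts (recall 0.75) but also returned two unrelated ones (precision 0.60).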

Topic modeling does not require word lists. TM algorithms are designed to learn what the topics are, specifically to avoid the need for things like a wordlist. You generally feed them your documents (though you need to do some preprocessing on them--a standard pipeline in NLP is tokenization, stemming/lemmatization, generating a document-term matrix, applying a TF-IDF transform/weighting, then normalizing your feature values) and some of the model parameters, then they do some number-crunching. What you get out is 1) a list of what the "topics" are, what words are associated with those topics, and how much so (you have to look over these lists manually and figure out what the topics are about in order to make them meaningful to you); 2) a list of how much each topic is showing up in your documents.
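
The preprocessing pipeline could look roughly like the sketch below, using only the Python standard library. It's an illustration under assumptions, not a production setup: the function names and sample texts are made up, the IDF formula is one common smoothed variant, and stemming/lemmatization is left out entirely (for Czech, which is heavily inflected, you'd want a real lemmatizer at that step).

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters; Unicode-aware, so Czech diacritics survive.
    return re.findall(r"[^\W\d_]+", text.lower())


def tfidf_matrix(texts):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    token_lists = [tokenize(t) for t in texts]
    vocab = sorted({tok for toks in token_lists for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}

    # Document-term matrix of raw counts (one row per document).
    counts = []
    for toks in token_lists:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        counts.append(row)

    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n_docs = len(texts)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

    # Apply the TF-IDF weighting, then scale each row to unit length.
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        rows.append([x / norm for x in weighted])
    return vocab, rows


# Made-up sample texts standing in for contracts.
texts = [
    "Server maintenance and software licence",
    "Laundry services for the hospital",
    "Software development and server hosting",
]
vocab, rows = tfidf_matrix(texts)
print(len(vocab), "terms,", len(rows), "documents")
```

On a corpus of 500,000+ documents you'd want sparse matrices from an NLP library rather than Python lists, but the steps are the same.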

There's a non-trivial learning curve if you're not already familiar with NLP tools and ideas, so if you're on a deadline then this might not be worth it.  Plus, using LDA, the most commonly used algorithm for topic modeling at the moment, you have to manually tell it how many topics there are, and it'll take rather a long time to run on 500,000+ documents (it's taken several hours on my pretty well-powered workstation to run a solid LDA analysis on ~75,000 news articles), so doing the data mining can be very time-consuming if you need to tweak the model's parameters after a run.
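
To show where the number of topics comes in, here's a toy collapsed Gibbs sampler for LDA in plain Python. Treat it as a sketch for understanding only: the function name, the hyperparameter defaults (`alpha`, `beta`), and the tiny corpus are all assumptions of mine, and for real data you'd use an established library implementation instead.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over already-tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]   # times word w is assigned to topic k
    topic_total = [0] * K                      # total tokens assigned to topic k

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, k = word_id[w], assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wi] -= 1
                topic_total[k] -= 1
                # Resample its topic given all the other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wi] += 1
                topic_total[k] += 1

    # Output 1: the most frequent words per topic (you still have to name the topics).
    top_words = [
        [vocab[j] for j in sorted(range(V), key=lambda j: -topic_word[k][j])[:3]]
        for k in range(K)
    ]
    # Output 2: the estimated topic mixture of each document.
    mixtures = [
        [(c + alpha) / (len(doc) + K * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return top_words, mixtures


# Made-up, pre-tokenized mini corpus.
docs = [
    ["server", "software", "licence", "server"],
    ["laundry", "washing", "linen", "laundry"],
    ["software", "hosting", "server"],
    ["linen", "washing", "hospital"],
]
top_words, mixtures = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, words)
```

Every token gets revisited on every iteration, which is why run time grows with corpus size, and changing `num_topics` means running the whole thing again.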

Though, a quick thought on where some word lists might be found: Wikipedia and Wiktionary might have some Category: pages for computer-related terminology.
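
If you do go the word list route, the filtering itself is simple. A minimal sketch, assuming a hypothetical seed list and an arbitrary threshold (both are placeholders, not recommendations):

```python
import re

# Hypothetical seed list; a real one would be much longer, and in Czech for Czech contracts.
IT_TERMS = {"software", "server", "hardware", "database", "network", "licence", "hosting"}


def it_score(text, terms=IT_TERMS):
    """Fraction of the tokens in `text` that appear in the word list."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok in terms for tok in tokens) / len(tokens)


def select_it_documents(texts, threshold=0.1):
    """Keep the documents whose share of word-list terms reaches the threshold."""
    return [t for t in texts if it_score(t) >= threshold]


# Made-up sample texts standing in for contracts.
contracts = [
    "Server hosting and database backup",
    "Laundry services for the hospital",
]
print(select_it_documents(contracts))
# prints: ['Server hosting and database backup']
```

Exact matching like this will miss inflected forms, which matters a lot in Czech, so you'd probably want to lemmatize both the word list and the documents first. Raising the threshold trades recall for precision.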

Link to comment
https://linustechtips.com/topic/777774-it-wordlists/#findComment-9817830