Everything posted by Azgoth 2

  1. Python has a portable version (look for the "embedded ZIP" download--it's Windows-only, though Python itself runs on every OS). The 64-bit Windows version comes in at 12.7MB after unzipping. Pros: Code is very easy to write and maintain. Very robust built-in tools for file and string manipulation, with more powerful tools in the standard library's os, sys, and re modules (for general OS interfaces, miscellaneous system/interpreter functionality, and regular expressions, respectively)--there's a quick sketch of the kind of thing they handle after this post. Excellent documentation. The portable version doesn't seem to generate any compiled bytecode files (the normal install does)--but if it does, there's a command-line flag (python -B your_file.py) to turn that off. The compiled bytecode just reduces startup time--there's no runtime benefit. Python doesn't quite have Java's level of "write once, run anywhere," but it's close: in 2+ years of writing code across Windows and Linux machines I've only ever run into a small handful of things that needed to be changed (and they were very small, quick fixes). Cons: Runtime is a bit on the slow end, though recent versions of the language keep adding speedups. This may or may not be an issue depending on your use cases--if you have very high performance needs, it might be worth benchmarking with some test scripts. I've never noticed the file/directory manipulation to be a slowdown in my own work, and the string manipulation is plenty fast (I do a bunch of natural language processing work, so strings are my bread and butter). The portable version does not seem to come with pip, the CLI package manager for Python libraries, and I'm not sure if you can copy-paste source directories from PyPI to add them in. Thankfully, the standard library should be more than enough for file/directory/string manipulation, as long as that's what you stick to.
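A minimal sketch of that kind of stdlib-only file/string work (the directory path and the regex are made up for illustration):

import os
import re

target_dir = "C:/some/folder"               # hypothetical directory
pattern = re.compile(r"invoice\s+(\d+)")    # hypothetical pattern

# rename every .txt file whose contents match the pattern
for entry in os.listdir(target_dir):
    if not entry.endswith(".txt"):
        continue
    full_path = os.path.join(target_dir, entry)
    with open(full_path, "r", encoding="utf-8") as f:
        text = f.read()
    match = pattern.search(text)
    if match:
        new_name = os.path.join(target_dir, "invoice_%s.txt" % match.group(1))
        os.rename(full_path, new_name)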
  2. I'm a big fan of Geany as an editor. It's a pretty light IDE; it does lack some of the features that bigger programs like Visual Studio or JetBrains' IDEs have (granted, they're all features I instantly turn off, so for me that's a plus). I've run it on my original Raspberry Pi B+ without issue. It's basically a fancy text editor--syntax highlighting, automatic indentation, code folding, project directory navigation--but it's designed to work with any language you can dream of and lets you specify shell commands to compile and run your program. That last point is a bit of a quirk at first, but it's nice to be able to do pretty much all my programming in a single program, no matter the language, and it's a bit more friendly than Vim/Emacs. It should detect Java by default and have the right build/run commands configured out of the box, I'd imagine.
  3. Short answer: no, you don't need one. Longer answer: you are unlikely to see much benefit from one unless 1) you've always had a numpad and thus never learned to type on the number row, or 2) you are constantly entering numeric values--e.g., hard-coding some arrays, manually specifying numeric coefficients in mathematical equations, or doing data-entry type activities. For that, a numpad tends to speed me up a lot, but only when I'm entering many numbers in succession without needing to jump back to the main keyboard area. So if you want a TKL, get a TKL. You can always get a dedicated, perfectly usable USB numpad for pretty cheap if you find yourself really needing one later. Or, I think there are some keyboards that have a detachable numpad section.
  4. How does OpenMV deal with images? Does it use something like a 2D array of pixel values? If so, you could use some of the tools from the Numpy or Scipy libraries to clip your pixel values at a certain level (you'd probably need to play with the level a bit to find a good one)--there's a quick sketch of that after this post. Or, if you can use the PIL library:

from PIL import Image, ImageEnhance

im = Image.open("/path/to/saved/image")
contrast = ImageEnhance.Contrast(im)
contrast = contrast.enhance(FACTOR)  # set FACTOR > 1 to enhance contrast, < 1 to decrease

# either save the image...
contrast.save("/path/to/new/location")

# or cast to a numpy array
import numpy as np
im_array = np.array(contrast.getdata())
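For the clipping idea, a minimal sketch assuming you can get the image into a 2D numpy array of grayscale values (the cap of 200 is made up and would need tuning):

import numpy as np

# toy 2D array standing in for the image's pixel values (0-255)
pixels = np.array([[ 10, 250, 128],
                   [240,   5, 220]])

# cap anything brighter than the chosen level
clipped = np.clip(pixels, 0, 200)
print(clipped)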
  5. Your first command might be looking for the wrong package. On Debian-based distros, at least, the package name for pip is python3-pip--not sure if it's the same in the CentOS repos, but look for something like that. Also make sure you don't have a bash alias that points "pip3" at the Python 2.6 pip--it seems like that might be happening, since when you type "pip3 install requests" it says it's looking in the Python 2.6 directories. Also, after installing, run "pip install -U pip" (per the error message) to make sure you're on the newest version of pip. You can double-check which version of Python your pip install is configured for with "pip -V" (my output is pip 9.0.1 from /home/localuser/.local/lib/python3.5/site-packages (python 3.5), which tells me it's Python 3.5). You can also try running pip explicitly through Python with "python3 -m pip [pip commands/options]". That'll use your PATH variable to find the Python 3 executable, which should know where its corresponding version of pip is stored if you've installed it.
  6. Libraries for AI:

For general data and numeric work: the Scipy stack (numpy, scipy, matplotlib, and pandas in particular)--necessary for doing really any work with data (and AI is all about data). Numpy gives you native multidimensional arrays and lots of very fast, efficient operations on those arrays (e.g., dot products, matrix norms, convolutions). Scipy has a lot of general scientific functions (e.g., Fourier analysis, Voronoi tessellations, and function optimization routines), and also sparse matrix formats (for storing large data sets that have a lot of zeros in a memory-efficient way). Matplotlib is a very large but extremely powerful library for visualizing data in pretty much whatever way you can imagine--scatter plots, line plots, 3D surfaces, shared axes, etc. It's a big, big library that can be a bit confusing at times, but it's very good to know. (Alternately, look into related libraries like Seaborn or Pygal--there are others, but Matplotlib is the one I use most, so I don't know the others that well.) Pandas gives you dataframe objects that make dealing with tabular data, particularly non-numeric data, pretty easy (though Numpy is better for strictly numeric data).

For non-neural machine learning: scikit-learn. An excellent library that's got some of the best documentation I've ever found for anything, and a lot of pretty good implementations of classic ML algorithms (e.g., SVM, decision trees, Naive Bayes, t-SNE, K-Means, and way more).

For neural networks: Tensorflow or Theano if you want to build networks completely from scratch. Both are general-purpose GPU-accelerated math libraries with a focus on deep learning/neural networks. Theano is older and a bit more general-purpose; Tensorflow is newer, made by Google, and more geared at deep learning/neural networks. There are pros and cons to both, but they'll both be pretty comparable at the end of the day. Tensorflow is becoming far more common in industry (it's the de facto platform for a huge amount of neural network work these days), so that might make it a more attractive option. Keras is a library that acts as a very nice frontend to either Theano or Tensorflow (you can pick--I think it uses Tensorflow by default, but it's very easy to change). The Keras people have already done a lot of the legwork of building different neural network architectures for you, so you don't have to go build an LSTM layer or GRU layer by hand--just use the one that comes in Keras (a minimal Keras sketch follows at the end of this post).

For general speedups of non-Keras/Theano/Tensorflow code: Cython. It adds static typing to Python and gives you a Python-to-C transpiler that does what I've been told are some very aggressive optimizations. You'll spend more time actually writing your code than if you used pure Python, but if you need something to run super duper fast, this is an incredible tool.

And there are a bunch of other libraries for dealing with specific data sources, e.g., NLTK/spaCy/gensim for natural language processing. Just google around a bit if you need something more specific.

For general Python resources, the official Python documentation is actually not a bad place to start--they have an introduction/tutorial to the language that's pretty basic, but enough to get you up and running reasonably fast. O'Reilly has a lot of very good Python books, but I don't think any of them are free. No Starch Press also has a lot of Python books, some (but not all) of which might be free.
There are also lots of websites around that have problem sets designed to be solved using programming (they're usually not written with any specific language in mind): Project Euler for math-heavy stuff, Rosalind for biology/genetics, and Kaggle for machine learning are all ones that come to mind. Also look around on EdX, Coursera, and MIT's OpenCourseWare--there should be some free classes there you can use. For AI resources, all the above still applies (minus the main Python documentation), in addition to the PyData conferences. The PyData YouTube channel posts a lot of videos, a few of which are more tutorial-oriented. (PyData is a group of conferences for data scientists, with a focus on Python as the primary tool.) The main PyCon (the general Python language conference) might also have some occasional AI-ish tutorials or talks.
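To give a feel for Keras (as mentioned above), here's a minimal sketch of defining and training a small LSTM classifier, assuming the Keras 2-style API; the layer sizes, input shape, and data are made up for illustration:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# toy data: 100 sequences, 20 timesteps, 8 features each
X = np.random.rand(100, 20, 8)
y = np.random.randint(0, 2, size=(100,))

model = Sequential()
model.add(LSTM(32, input_shape=(20, 8)))   # the LSTM layer you don't have to build by hand
model.add(Dense(1, activation="sigmoid"))  # binary output

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16)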
  7. The single best way, in my experience, to learn how to use any programming language is a project-based approach. Pick or find a thing you want to do--maybe program a game, or a Twitter bot, a generative art maker, or some quality-of-life programs you expect to use often--and then google the hell out of how to do the different parts of it. The fact that you're enrolled in a course right now might give you some natural projects/things to do, but there are a bunch of websites out there that collect programming-oriented problems (Project Euler for computational mathematics; Rosalind for computational bioinformatics/genetics; Kaggle for machine learning; etc etc) if you want something different (and something that a lot of people have already done, so there should be plenty of resources available). The downside to a do-it-yourself project-based approach, though, is that you'll tend to learn a few parts of the language very deeply, but you may never pick up on the breadth of what can be done with the language. A more formal class will tend to, in my experience, give you the inverse, where you get a good sampling of everything but don't necessarily get to dive super deep into any one thing during the course itself. Usually, if you can find some sort of official documentation for your language (e.g., Python's documentation, which is excellent), that's a good reference to always keep on hand. From time to time you should skim through parts of the documentation you're not familiar with--you may not use all the features of a language, but you might learn about something you wouldn't come across otherwise.
  8. You can use most distros as a server. Debian and Slackware are commonly used; they don't have distinct server variants, but they lend themselves very nicely to being configured as servers. As for distros that are specifically designed for server use: Ubuntu Server, Red Hat Enterprise Linux, and SUSE Linux Enterprise all have a corporate support infrastructure behind them, which makes them appealing to a lot of businesses. CentOS is a community-driven fork of Red Hat that's very popular, too.
  9. Since you're interested in GPU-accelerated math and neural nets: as mentioned, you won't be able to get anything serious done on a Pi due to its very low specs (at least in terms of building large or complex models), but you can get started with the basics. That said: for straight GPU-accelerated math, look into Tensorflow and Theano. They're both great libraries for GPU math, each with pros and cons that you'll want to read up on a bit. In short, though: Theano is older and more mature, but Tensorflow is developed by Google and is rapidly taking over other GPU-accelerated math libraries/frameworks. Tensorflow is also incompatible with Python 3.6 (unless you install it from source and compile it manually), which is frustrating. There's also PyTorch, which is still in the early release stages, I believe. Despite the name, I don't think it has any relation to Lua's venerable Torch framework for deep learning--it seems to be developed from the ground up for Python. As a word of warning, I don't think any of these libraries have very good support for OpenCL--they're all heavily CUDA-oriented, so you'll need an Nvidia GPU to get any real use out of the GPU acceleration. (But I believe they can all run in CPU mode, though that'll be horrendously slow and basically make non-trivial recurrent neural nets impossible to use.) For a neural-net-specific library, look into Keras. It's basically a nice frontend for Theano/Tensorflow that obviates the need for you to actually write the various layers of your network by hand. You'll also want to look into the Scipy stack of libraries: numpy, scipy, matplotlib, and pandas, in particular. They're absolutely indispensable for doing any data work in Python. Just make sure you're either installing them in a *nix environment or you grab the numpy and scipy .whl files from Christoph Gohlke's site--they require Fortran and C compilers, plus a bunch of linear algebra and other math libraries that are a fucking nightmare to install on Windows (and which are overwhelmingly written specifically for *nix environments). A quick Tensorflow sanity-check sketch follows this post.
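That sanity check, as a minimal sketch assuming the TF 1.x-style API that's current as of this post--it just runs one matrix multiply and logs whether it landed on the CPU or a GPU:

import tensorflow as tf

# log_device_placement prints which device (CPU/GPU) each op runs on
config = tf.ConfigProto(log_device_placement=True)

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)  # simple GPU-friendly op

with tf.Session(config=config) as sess:
    print(sess.run(b))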
  10. Your options are extremely limited given those specs. Tiny Core has already been mentioned, but I'll second it--though be warned it comes with basically nothing installed, not even a lot of the command-line tools that you get in other distros. Debian with a minimal install (look for the "network install" .iso image) and no GUI might work, as might a minimalist Arch install (as in, one where you don't put much stuff on it). Non-Linux OSs that should work include FreeDOS, a free and open-source implementation of MS-DOS, and KolibriOS, an operating system written entirely in assembly that has hilariously low system requirements (but Kolibri is a very young/early-development-stage OS, so it may not be of much use).
  11. As long as you don't have multiple matches per line to worry about, that should work--it'll print each match to a new line in stdout, based on my quick tests. Admittedly I don't use awk/gawk much, so I wasn't aware it had issues with multiple matches per line until I looked it up just now. Frankly, I'm starting to lean towards just writing a script in Python or whatever language you're comfortable with to do the matching for you:

#!/usr/bin/python3
import re

f = open("/path/to/file", "r").read()
for i in re.findall(r'https://www\.twitch\.tv/videos/(.*?)",', f):
    print(i)

# or for multiple files in a single directory
import os
import re

files = [i.path for i in os.scandir("/path/to/files")]
for F in files:
    f = open(F, "r").read()
    for i in re.findall(r'https://www\.twitch\.tv/videos/(.*?)",', f):
        print(i)

That has no issue matching across multiple lines or multiple matches within a line, again based on some of my quick and hacky testing.
  12. Ah, I see what's happening. I was testing this on a single random twitch.tv video URL--in your text it's replacing the URL with just what comes after the /videos/ part. Sed is really meant for manipulating text--for just matching substrings, you'll want (g)awk:

gawk 'match($0, /https:\/\/www\.twitch\.tv\/videos\/([^\"]*)\",/, arr) {print arr[1]}' output

Regular expressions in (g)awk are surrounded by / characters, so there's a lot of ugly escaping. The quotation marks are also escaped because they normally delimit literal strings.
  13. Use the -E flag to get extended regexp syntax, and get rid of the backslashes around your parentheses. Then use the special \1 escape as the replacement string to print out what the capturing parentheses found (\1-\9 refer to matched sub-strings; with capturing parentheses they refer to what the parentheses captured):

sed -E 's_https://www.twitch.tv/videos/(.*)"_\1_'
  14. After some quick testing with awk:

awk 'match($0, /\*([0-9]+)~/, res) { print res[1] }' file1 file2 ...

Where: match($0, /\*([0-9]+)~/, res) matches the regular expression \*([0-9]+)~ (regular expressions are enclosed by // in awk, and the * is escaped so it's treated as a literal asterisk) against $0 (i.e., the current line) and saves the result to the array res. (Note that this three-argument form of match() is a gawk extension.) { print res[1] } prints element 1 of the result array--because of the capturing parentheses, that's what was inside the parentheses (res[0] is the full match, as if the capturing parentheses weren't there). file1 file2 ... are the files you need to find this pattern in. For the line G50*N*20160207*18885~ , this should return just 18885.
  15. Okay. That still doesn't quite answer the other question, which is what your needs are in terms of precision and recall. E.g., if you need to get every contract related to IT and no false positives, then topic modeling may not be the best tool (for that, you'd need metadata on the documents--probably hand-annotated). But if you just need to get a pretty good number of topics related to IT and some false positives are okay, then it might be a good option. Topic modeling does not require word lists. TM algorithms are designed to learn what the topics are, specifically to avoid the need for things like a word list. You generally feed them your documents (though you need to do some preprocessing first--a standard pipeline in NLP is tokenization, stemming/lemmatization, generating a document-term matrix, applying a TF-IDF transform/weighting, then normalizing your feature values) and some model parameters, then they do some number-crunching. What you get out is 1) a list of the "topics", what words are associated with each topic, and how strongly (you have to look over these lists manually and figure out what the topics are about in order to make them meaningful to you), and 2) a list of how much each topic shows up in each of your documents. (A minimal scikit-learn sketch of this pipeline follows this post.) There's a non-trivial learning curve if you're not already familiar with NLP tools and ideas, so if you're on a deadline then this might not be worth it. Plus, using LDA, the most commonly used algorithm for topic modeling at the moment, you have to tell it manually how many topics there are, and it'll take rather a long time to run on 500,000+ documents (it's taken several hours on my pretty well-powered workstation to run a solid LDA analysis on ~75,000 news articles), so the data mining can be very time-consuming if you need to tweak the model's parameters after a run. Though, a quick thought on where some word lists might be found: Wikipedia and Wiktionary might have some Category: pages for computer-related terminology.
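That pipeline, as a minimal sketch using scikit-learn's LDA implementation--the toy documents, the 10-topic setting, and the top-5 word listing are all made up for illustration, and the parameter/method names assume a scikit-learn of roughly this era (they've shifted slightly across versions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy corpus standing in for the real contract documents
docs = [
    "server maintenance and network support services",
    "road paving and asphalt repair contract",
    "software licensing and IT helpdesk support",
]

# LDA works on raw term counts (a document-term matrix)
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# you have to pick the number of topics yourself
lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(dtm)   # rows: documents, columns: topic proportions

# top words per topic, to be inspected and labeled by hand
terms = vectorizer.get_feature_names()   # get_feature_names_out() in newer versions
for topic_idx, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(topic_idx, top_words)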
  16. How many documents are you dealing with? What kind of documents? And what are your needs in terms of precision and recall of your queries? Depending on what you need, a word list might be a very poor solution--e.g., if you're looking at customer feedback, reviews, or really free-form text. Depending on what you need, I might recommend you look into topic modeling, a set of algorithms (e.g., Latent Dirichlet Allocation and Author-Topic Modeling) specifically designed to discover the "topics" that a set of documents is about, and to classify documents based on what "topics" appear in them. It gets around the need for a word list, but it does bring some other challenges that may or may not be something you want to/can afford to deal with.
  17. There is if you only allow certain domains and leave others blocked, though. It lets you block on a per-domain basis, so you can have some things on a page blocked while other things are allowed. Example: I have linustechtips.com added to my whitelist, but not google-analytics.com or googletagservices.com, all of which try to run scripts here on the LTT forums. It obviously won't do any good if you permanently whitelist everything on every site you visit, but the same goes for literally any blocking/security service. And I always have cross-site scripting and possible clickjacking blocked regardless of domain. Unless I'm thinking of something else, entrance and exit nodes can be set up by malicious agents (e.g., NSA/FBI/etc.), who can then monitor traffic through them. But that's why you still use HTTPS when connecting through Tor--to make sure your data is encrypted en route into and out of the network, making it harder for a compromised entry/exit node to become a problem for you. Unless there was some development I missed, I don't believe the core security of Tor is known to be compromised. (Though if it is, I would love to see more details, since that seems like a pretty major thing.)
  18. NoScript does block all Flash/etc. by default, but it lets you whitelist domains with just one or two clicks to un-break sites (Privacy Badger, incidentally, does too, because it can also sometimes break sites). I personally don't find NoScript to be at all intrusive, but I've admittedly been using it for a very long time, so I've just gotten used to it.
  19. Derp, yes, I just forgot a word. I'd actually argue that Ghostery isn't as good a choice as Privacy Badger + NoScript/uBlock Origin. For one, Ghostery is closed-source (which may not be a big deal for everyone, but definitely makes me raise an eyebrow), while Privacy Badger/etc. are all open-source and released under various versions of the GNU GPL. Ghostery also collects data about what domains are blocked by users and makes that data available to various groups--some are benign-seeming, like researchers and the Better Business Bureau, but by some accounts this also includes advertising partners and other business entities (citation admittedly needed on this part, since I can't recall where I heard it). I don't know if the data they collect is purely aggregate or anonymous or whatever, but it raises a few red flags for me either way, and to me at least seems to defeat the point of using a privacy/anonymization browser extension. This may not be a problem for everyone, I grant, but I think people should be aware of it so they can judge for themselves; personally, it's a deal-breaker for me.
  20. Incognito mode is not private. It doesn't save browsing history to your local machine, but nothing else changes--nothing that wouldn't normally be encrypted is encrypted, for instance. Tor will provide a far greater level of anonymity than a VPN, but at the cost of some functionality you're used to. E.g., you can't and shouldn't do really high-bandwidth stuff through Tor--a lot of streaming and really any torrenting puts a lot of stress on the network--and gaming will suffer from massive latencies. The Tor Browser also doesn't let you save any data or settings between sessions, to minimize browser fingerprinting. Oh, and some sites will just straight-up break if you're using Tor, because some sites categorically reject Tor traffic. (This is a fairly small number of sites, though.) VPNs are much more flexible in terms of bandwidth, but less gung-ho about anonymization. They focus more on privacy, i.e., making sure your data is encrypted in transit. Tor tries to hide who you are at all points along the way; VPNs essentially encrypt your data and relay it through another server (often just one) somewhere. VPNs also usually cost money and are closed-source, so do your research before deciding on one. The closed-source thing means you have to trust the company not to be misrepresenting their product (and when has a company ever done that?), and you have to rely on third-party whitepapers and recent security audits, which may not be available for all providers. Also, don't bother with free VPNs. You should always assume that if you're not paying for a service online, you're not the customer, you're the product--your data is what's being sold to someone else. There are some browser extensions you can use that don't require either Tor or a VPN, though--HTTPS Everywhere, Privacy Badger (both by the Electronic Frontier Foundation), and uBlock Origin or NoScript (they do similar but slightly different things) are a pretty strong set. But anyways, Tor is completely free, so there's no reason not to give it a shot and see if it's for you. If not, you can look into a VPN service.
  21. On a single-core CPU with those specs, basically anything will chug along. And as has been mentioned, YouTube videos will always struggle on that processor, just due to its low specs. Chrome/Firefox/Chromium are probably not going to perform all that well either--look into some of the lightweight browsers like Midori. All that said, you've got some options that might be worth a look: SliTaz: a very lightweight, fully-featured OS that takes up very little disk space. I've used this before on ancient single-core computers and it tends to work pretty well while also being generally usable for basic tasks. Tiny Core: made by Damn Small Linux developers who left because they felt DSL was too bloated. (Unlike DSL, this one is still actively maintained.) Extremely minimal--you can get an image of ~12MB--but as a result it's lacking a lot of functionality. If you just need to do document editing and some very light web browsing, though, it should do; just be aware that you may have to install all the programs you want by hand, especially if you get one of the smaller images. Puppy Linux: less lightweight than SliTaz or Tiny Core, but more user-friendly. You could also try Arch, which does an extremely minimalistic installation (but takes a good bit of work to install compared to other distros). Debian has a "network install" image that's pretty similar in this regard--it installs just the core Debian components, and you pick what to install on top of it. But the configuration is rather more hands-on and time-consuming this way. Really, though, if you can, your best option is to find a computer with better specs. Dell and HP should have some laptops at or around $200 new. If you live near somewhere like a Free Geek, you might be able to get a more powerful computer, albeit a refurbished one, for a comparable price. Or really anywhere that sells used and/or refurbished electronics.
  22. Reading PDFs is very possible, but it can be a pain in the ass due to there being a range of different PDF encodings, some of which actually require Acrobat to decode (otherwise they're unreadable). Pulling information out of the PDF falls into the realm of information retrieval and might be, far and away, the most difficult step, depending on the PDFs you're dealing with. If you're dealing with a finite, known set of possible inputs, you can just write rules that explicitly deal with each one. E.g., if you're dealing only with invoices, and you know the invoices can only be formatted in one of several ways, you can just check what the layout is and go to the appropriate processing steps. If you're dealing with more general and unstructured data--e.g., raw text--then you'll need to delve deep into information retrieval and possibly natural language processing/understanding. Things can get messy and complicated there if you're building your own IR engine, and you'll need to do a lot of quality assurance/testing even if you're borrowing someone else's, to make sure it works in your specific circumstances. Drafting and sending e-mails is pretty straightforward--that's just text formatting. You could have some form letters and plop the relevant information from the PDFs into them. As for actually sending the e-mails, I haven't had to do that myself, but it's absolutely a thing you can do, and it's probably not all that difficult either (a rough sketch follows this post).
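That rough sketch, using just Python's standard library--the addresses, SMTP server, and credentials are placeholders, and extracted_total stands in for whatever value gets pulled out of the PDF upstream (this assumes Python 3.6+ for email.message.EmailMessage):

import smtplib
from email.message import EmailMessage

# stand-ins for values pulled out of a PDF earlier in the pipeline
extracted_total = "1,204.50"
recipient = "billing@example.com"      # placeholder address

msg = EmailMessage()
msg["From"] = "reports@example.com"    # placeholder address
msg["To"] = recipient
msg["Subject"] = "Invoice summary"
msg.set_content(
    "Hello,\n\nThe total on the attached invoice comes to $%s.\n\nRegards,\nAutomated reporting"
    % extracted_total
)

# placeholder SMTP server and credentials
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("reports@example.com", "app-password")
    server.send_message(msg)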
  23. See @fizzlesticks' reply above; try using a collections.Counter and its .most_common() method (which returns items sorted by frequency).
  24. I always forget about everything other than defaultdict in the collections module. A Counter is certainly better if the goal is frequency analysis, yes--more readable than a dict comprehension (but so much less fun...), and it has that nice .most_common() method (quick example below).
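For reference, a minimal sketch of the Counter approach (the ciphertext string is made up):

from collections import Counter

raw = "wkh txlfn eurzq ira"   # made-up ciphertext
freqs = Counter(raw)

# most_common() returns (character, count) pairs sorted by descending frequency
for char, count in freqs.most_common():
    print(char, count)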
  25. As in, you're just trying to count the number of times each character appears (i.e., for frequency analysis)? If so, a dictionary is your best bet:

# Using a dict comprehension to make the dictionary
raw = raw_input("Please enter the ciphered string to crack: ")
freqs = {i: raw.count(i) for i in set(raw)}

# Or, iterating through the string
raw = raw_input("Please enter the ciphered string to crack: ")
freqs = {}
for i in raw:
    try:
        freqs[i] += 1
    except KeyError:
        # character not in the dictionary's keys yet, initialize count to 1
        freqs[i] = 1