
Azgoth 2

Member
  • Posts

    317
  1. Python has a portable version (look for the "embedded ZIP" download), with versions for every OS. The 64-bit Windows version comes in at 12.7MB after unzipping.
     Pros: Code is very easy to write and maintain. Very robust built-in tools for file and string manipulation, with more powerful tools in the standard library's os, sys, and re modules (for general OS interfaces, miscellaneous system/interpreter functionality, and regular expressions, respectively). Excellent documentation. The portable version doesn't seem to generate any compiled bytecode files (the normal install does)--and if it does, there's a command-line flag (python -B your_file.py) to suppress it. The compiled bytecode only reduces startup time--there's no runtime benefit either way. Python doesn't quite have Java's level of "write once, run anywhere," but it's close: in 2+ years of writing code across Windows and Linux machines I've only run into a small handful of things that needed changing (and they were very small, quick fixes).
     Cons: The runtime is a bit on the slow end, though recent versions of the language keep adding speedups. Whether that matters depends on your use case--if you have very high performance needs, it's worth benchmarking with some test scripts. I've never noticed file/directory manipulation being a bottleneck in my own work, and the string manipulation is plenty fast (I do a lot of natural language processing work, so strings are my bread and butter). The portable version doesn't seem to come with pip, the command-line package manager for Python libraries, and I'm not sure whether you can copy source directories from PyPI to add libraries manually. Thankfully the standard library should be more than enough for file/directory/string manipulation, as long as you stick to that--there's a small sketch of what that looks like just below this post.
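     A minimal sketch of the kind of file/directory/string work the standard library handles on its own--the directory path and the date pattern here are just placeholders for illustration:
     import os
     import re

     # Walk a directory tree and collect every .txt file (path is hypothetical).
     txt_files = []
     for dirpath, dirnames, filenames in os.walk("/path/to/project"):
         for name in filenames:
             if name.endswith(".txt"):
                 txt_files.append(os.path.join(dirpath, name))

     # Simple string/regex work: pull out anything that looks like an ISO date.
     date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
     for path in txt_files:
         with open(path, "r") as f:
             text = f.read()
         for match in date_pattern.findall(text):
             print(path, match)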
  2. I'm a big fan of Geany as an editor. It's a pretty light IDE that lacks some of the features bigger programs like Visual Studio or JetBrains' IDEs have (granted, they're all features I instantly turn off, which is why I love it). I've run it on my original Raspberry Pi B+ without issue. It's basically a fancy text editor--syntax highlighting, automatic indentation, code folding, project directory navigation--but it's designed to work with any language you can dream of and lets you specify shell commands to compile and run your program. That last point is a bit of a quirk at first, but it's nice to be able to do pretty much all my programming in a single program, no matter the language, which is a bit more friendly than Vim/Emacs. It should detect Java by default and have the right build/run commands configured out of the box, I'd imagine.
  3. Short answer: no, you don't need one. Longer answer: you're unlikely to see much benefit from one unless 1) you've always had a numpad and thus never learned to type on the number row, or 2) you're constantly entering numeric values--e.g., hard-coding some arrays, manually specifying numeric coefficients in mathematical equations, or doing data-entry type activities. For that kind of work a numpad tends to speed me up a lot, but only when I'm entering many numbers in succession without needing to jump back to the main keyboard area. So if you want a TKL, get a TKL. You can always get a dedicated, perfectly usable USB numpad for pretty cheap if you find yourself really needing one later. Or, I think there are some keyboards that have a detachable numpad section.
  4. How does OpenMV deal with images? Does it use something like a 2D array of pixel values? If so, you could use some of the tools from the Numpy or Scipy libraries to clip your pixel values at a certain level (you'd probably need to play with the level a bit to find a good one)--there's a rough sketch of that just below this post. Or, you could use the PIL library:
     from PIL import Image, ImageEnhance
     import numpy as np

     im = Image.open("/path/to/saved/image")
     contrast = ImageEnhance.Contrast(im)
     contrast = contrast.enhance(FACTOR)  # set FACTOR > 1 to enhance contrast, < 1 to decrease

     # either save the image...
     contrast.save("/path/to/new/location")
     # or cast to a numpy array
     im_array = np.array(contrast.getdata())
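     And here's the pixel-clipping idea with numpy--this assumes you can get the image out of OpenMV as a numpy array; the toy array and the 128 cutoff are placeholders you'd tune:
     import numpy as np

     # Stand-in for an image as a 2D array of 0-255 grayscale pixel values.
     img = np.array([[12, 200, 255],
                     [0, 180, 90]], dtype=np.uint8)

     # Clip everything above the chosen level; 128 is arbitrary and needs tuning.
     clipped = np.clip(img, 0, 128)
     print(clipped)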
  5. Your first command might be looking for the wrong package. On Debian-based distros, at least, the package name for pip is python3-pip--not sure if it's the same in the CentOS repos, but look for something like that. Also make sure you don't have a bash alias pointing "pip3" at the Python 2.6 pip--it seems like that might be happening, since when you type "pip3 install requests" it says it's looking in the Python 2.6 directories. After installing, run "pip install -U pip" (per the error message) just to make sure you're on the newest version of pip. You can double-check which version of Python your pip install is configured for with "pip -V" (my output is pip 9.0.1 from /home/localuser/.local/lib/python3.5/site-packages (python 3.5), which tells me it's Python 3.5). You can also try running pip explicitly through Python with "python3 -m pip [pip commands/options]". That'll use your PATH variable to find the Python 3 executable, which should know where its corresponding version of pip is stored if you've installed it. There's also a quick way to check from inside Python which interpreter and site-packages directory you're actually on--see just below this post.
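     A quick sanity check you can run with whichever interpreter you're trying to install into (standard library only, nothing to install first):
     import sys
     import site

     print(sys.version)              # which Python version this actually is
     print(sys.executable)           # path to the python binary being used
     print(site.getsitepackages())   # where packages get installed for this interpreter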
  6. Libraries for AI:
     For general data and numeric work: the Scipy stack (numpy, scipy, matplotlib, and pandas in particular)--necessary for doing really any work with data (and AI is all about data). Numpy gives you native multidimensional arrays and lots of very fast, efficient operations on those arrays (e.g., dot products, matrix norms, convolutions). Scipy has a lot of general scientific functions (e.g., Fourier analysis, Voronoi tessellations, and function optimization routines), plus sparse matrix formats (for storing large data sets that have a lot of zeros in a memory-efficient way). Matplotlib is a very large but extremely powerful library for visualizing data in pretty much whatever way you can imagine--scatter plots, line plots, 3D surfaces, shared axes, etc. It's a big, big library that can be a bit confusing at times, but it's very good to know. (Alternately, look into related libraries like Seaborn or Pygal--there are others, but Matplotlib is the one I use most, so I don't know them as well.) Pandas gives you dataframe objects that make dealing with tabular data, particularly non-numeric data, pretty easy (though Numpy is better for strictly numeric data).
     For non-neural machine learning: scikit-learn. An excellent library that's got some of the best documentation I've ever found for anything, and a lot of pretty good implementations of classic ML algorithms (e.g., SVM, decision trees, Naive Bayes, t-SNE, K-Means, and way more). There's a tiny scikit-learn sketch at the end of this post.
     For neural networks: Tensorflow or Theano if you want to build networks completely from scratch. Both are general-purpose GPU-accelerated math libraries with a focus on deep learning/neural networks. Theano is older and a bit more general-purpose; Tensorflow is newer, made by Google, and more geared toward deep learning/neural networks. There are pros and cons to both, but they'll be pretty comparable at the end of the day. Tensorflow is becoming far more common in industry (it's the de facto platform for a huge amount of neural network work these days), so that might make it the more attractive option. Keras is a library that acts as a very nice frontend to either Theano or Tensorflow (you can pick--I think it uses Tensorflow by default, but it's very easy to change). The Keras people have already done a lot of the legwork of building different neural network architectures for you, so you don't have to build an LSTM or GRU layer by hand--just use the one that comes with Keras.
     For general speedups of non-Keras/Theano/Tensorflow code: Cython. It adds static typing to Python and gives you a Python-to-C transpiler that does what I've been told are some very aggressive optimizations. You'll spend more time actually writing your code than if you used pure Python, but if you need something to run super duper fast, this is an incredible tool.
     And there are a bunch of other libraries for dealing with specific data sources, e.g., NLTK/spaCy/gensim for natural language processing. Just google around a bit if you need something more specific.
     For general Python resources, the official Python documentation is actually not a bad place to start--it has an introduction/tutorial to the language that's pretty basic, but enough to get you up and running reasonably fast. O'Reilly has a lot of very good Python books, but I don't think any of them are free. No Starch Press also has a lot of Python books, some (but not all) of which might be free.
     There are also lots of websites around that have problem sets designed to be solved using programming (they're usually not written with any specific language in mind): Project Euler for math-heavy stuff, Rosalind for biology/genetics, and Kaggle for machine learning are the ones that come to mind. Also look around on EdX, Coursera, and MIT's OpenCourseWare--there should be some free classes there you can use. For AI resources, all the above still applies (minus the main Python documentation), in addition to the PyData conferences. The PyData YouTube channel posts a lot of videos, a few of which are more tutorial-oriented. (PyData is a group of conferences for data scientists, with a focus on Python as the primary tool.) The main PyCon (the general Python language conference) might also have some occasional AI-ish tutorials or talks.
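     The promised scikit-learn sketch--a classic SVM on the bundled iris dataset, just to show the general fit/score workflow (the dataset and model choice are arbitrary):
     from sklearn.datasets import load_iris
     from sklearn.model_selection import train_test_split
     from sklearn.svm import SVC

     # Load a small toy dataset and split off a held-out test set.
     X, y = load_iris(return_X_y=True)
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

     model = SVC(kernel="rbf")            # a classic non-neural ML algorithm
     model.fit(X_train, y_train)          # train on the training split
     print(model.score(X_test, y_test))   # accuracy on the held-out split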
  7. The single best way, in my experience, to learn how to use any programming language is a project-based approach. Pick or find a thing you want to do--maybe program a game, or a Twitter bot, a generative art maker, or some quality-of-life programs you expect to use often--and then google the hell out of how to do the different parts of it. The fact that you're enrolled in a course right now might give you some natural projects/things to do, but there are a bunch of websites out there that collect programming-oriented problems (Project Euler for computational mathematics; Rosalind for computational bioinformatics/genetics; Kaggle for machine learning; etc etc) if you want something different (and something that a lot of people have already done, so there should be plenty of resources available). The downside to a do-it-yourself project-based approach, though, is that you'll tend to learn a few parts of the language very deeply, but you may never pick up on the breadth of what can be done with the language. A more formal class will tend to, in my experience, give you the inverse, where you get a good sampling of everything but don't necessarily get to dive super deep into any one thing during the course itself. Usually, if you can find some sort of official documentation for your language (e.g., Python's documentation, which is excellent), that's a good reference to always keep on hand. From time to time you should skim through parts of the documentation you're not familiar with--you may not use all the features of a language, but you might learn about something you wouldn't come across otherwise.
  8. You can use most distros as a server. Debian and Slackware are commonly used; they don't have distinct server variants, but they lend themselves very nicely to being configured as servers. As for distros that are specifically designed for server use, Ubuntu Server, Red Hat Enterprise Linux, and SUSE Linux Enterprise all have a corporate support infrastructure behind them, which makes them appealing to a lot of businesses. CentOS is a community-driven fork of Red Hat that's very popular, too.
  9. Since you're interested in GPU-accelerated math and neural nets, as mentioned, you won't be able to get anything serious done on a Pi due to its very low specs (at least in terms of building large or complex models), but you can get started with the basics. That said: for straight GPU-accelerated math, look into Tensorflow and Theano. They're both great libraries for GPU math, each with pros and cons that you'll want to read up on a bit. In short, though: Theano is older and more mature, but Tensorflow is developed by Google and is rapidly overtaking the other GPU-accelerated math libraries/frameworks. Tensorflow is also incompatible with Python 3.6 at the moment (unless you compile it from source yourself), which is frustrating. There's also PyTorch, which is still in the early release stages, I believe. Despite the name, I don't think it has any relation to Lua's venerable Torch framework for deep learning--it seems to be developed from the ground up for Python. As a word of warning, I don't think any of these libraries have very good support for OpenCL--they're all heavily CUDA-oriented, so you'll need an Nvidia GPU to get any real use out of the GPU acceleration. (They can all run in CPU mode, I believe, though that'll be horrendously slow and basically make non-trivial recurrent neural nets impossible to use.) For a neural-net-specific library, look into Keras. It's basically a nice frontend for Theano/Tensorflow that obviates the need to write the various layers of your network by hand--there's a tiny Keras sketch just below this post. You'll also want to look into the Scipy stack of libraries: numpy, scipy, matplotlib, and pandas in particular. They're absolutely indispensable for doing any data work in Python. Just make sure you're either installing them in a *nix environment or you grab the numpy and scipy .whl files from Christoph Gohlke's site--they require Fortran and C compilers, plus a bunch of linear algebra and other math libraries that are a fucking nightmare to install on Windows (and which are overwhelmingly written specifically for *nix environments).
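     A tiny Keras sketch, assuming Keras with a Tensorflow or Theano backend is installed--the layer sizes and the random toy data are placeholders, not anything meaningful:
     import numpy as np
     from keras.models import Sequential
     from keras.layers import Dense

     # Toy data: 100 samples with 8 features each, binary labels.
     X = np.random.rand(100, 8)
     y = np.random.randint(0, 2, size=(100,))

     model = Sequential()
     model.add(Dense(16, activation="relu", input_shape=(8,)))  # one hidden layer
     model.add(Dense(1, activation="sigmoid"))                  # binary output

     model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
     model.fit(X, y, epochs=5, batch_size=16, verbose=0)
     print(model.evaluate(X, y, verbose=0))  # [loss, accuracy] on the training data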
  10. Your options are extremely limited given those specs. Tiny Core has already been mentioned, but I'll second it--though be warned it comes with basically nothing installed, not even a lot of the command line tools that you get in other distros. Debian with a minimal install (look for the "network install" .iso image) and no GUI might work, as might a minimalist Arch install (as in, one where you don't put much stuff on it). Non-Linux OSes that should work include FreeDOS, a free and open source implementation of MS-DOS, and Kolibri OS, an operating system written entirely in assembly that has hilariously low system requirements (but Kolibri is a very young, early-development-stage OS, so it may not be of much use).
  11. As long as you don't have multiple matches per line to worry about, that should work--it'll print each match to a new line on stdout, based on my quick tests. Admittedly I don't use awk/gawk much, so I wasn't aware it had issues with multiple matches per line until I looked it up just now. Frankly I'm starting to lean towards just writing a script in Python or whatever language you're comfortable with to do the matching for you:
      #!/usr/bin/python3
      import re

      f = open("/path/to/file", "r").read()
      for i in re.findall(r'https://www\.twitch\.tv/videos/(.*?)",', f):
          print(i)

      # or, for multiple files in a single directory
      import os
      import re

      files = [i.path for i in os.scandir("/path/to/files")]
      for F in files:
          f = open(F, "r").read()
          for i in re.findall(r'https://www\.twitch\.tv/videos/(.*?)",', f):
              print(i)
      That has no issue matching across multiple lines or multiple matches within a line, again based on some of my quick and hacky testing.
  12. Ah, I see what's happening. I was testing this on a single random twitch.tv video URL--in your text it's replacing the URL with just what comes after the /videos/ part. Sed is really meant for manipulating text; for just matching substrings, you'll want (g)awk:
      gawk 'match($0, /https:\/\/www\.twitch\.tv\/videos\/([^\"]*)\",/, arr) {print arr[1]}' output
      Regular expressions in (g)awk are surrounded by / characters, so there's a lot of ugly escaping. The quotation marks are also escaped because they normally act as literal string delimiters.
  13. Use the -E flag to get extended regexp syntax, and get rid of the backslashes around your parentheses. Then use the special \1 escape in the replacement string to print out what the capturing parentheses found. (\1-\9 refer to matched sub-strings; with capturing parentheses they refer to what the parentheses captured.)
      sed -E 's_https://www.twitch.tv/videos/(.*)"_\1_'
  14. After some quick testing with awk:
      awk 'match($0, /\*([0-9]+)~/, res) { print res[1] }' file1 file2 ...
      Where: match($0, /\*([0-9]+)~/, res) matches the regular expression \*([0-9]+)~ (regular expressions are enclosed by // in awk, and the * is escaped so it's treated as a literal asterisk) against $0 (the current input line) and saves the result to the array res. { print res[1] } prints element 1 of the result array, which is what the capturing parentheses matched (element 0 is the full match, as if the capturing parentheses weren't there). file1 file2 ... are the files you need to find this pattern in. For the line G50*N*20160207*18885~ , this should return just 18885.
  15. Okay. That still doesn't quite answer the other question, which is what your needs are in terms of precision and recall. E.g., if you need to get every contract related to IT with no false positives, then topic modeling may not be the best tool (for that, you'd need metadata on the documents--probably hand-annotated). But if you just need to get a pretty good number of topics related to IT and some false positives are okay, then it might be a good option. Topic modeling does not require word lists. TM algorithms are designed to learn what the topics are, specifically to avoid the need for things like a wordlist. You generally feed them your documents (though you need to do some preprocessing first--a standard pipeline in NLP is tokenization, stemming/lemmatization, generating a document-term matrix, applying a TF-IDF transform/weighting, then normalizing your feature values) along with some model parameters, and they do some number-crunching. What you get out is 1) a list of what the "topics" are, what words are associated with those topics, and how strongly (you have to look over these lists manually and figure out what the topics are about in order to make them meaningful to you); and 2) a list of how much each topic shows up in each of your documents. There's a non-trivial learning curve if you're not already familiar with NLP tools and ideas, so if you're on a deadline then this might not be worth it. Plus, with LDA, the most commonly used algorithm for topic modeling at the moment, you have to manually tell it how many topics there are, and it'll take rather a long time to run on 500,000+ documents (it's taken several hours on my pretty well-powered workstation to run a solid LDA analysis on ~75,000 news articles), so the data mining can be very time-consuming if you need to tweak the model's parameters after a run. There's a small sketch of what such a pipeline looks like in code just below this post. Though, a quick thought on where some word lists might be found: Wikipedia and Wiktionary might have some Category: pages for computer-related terminology.
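     For concreteness, a small topic-modeling sketch using scikit-learn's LDA--the toy documents and the two-topic setting are placeholders, and the stemming/TF-IDF steps from the pipeline above are omitted to keep it short:
     from sklearn.feature_extraction.text import CountVectorizer
     from sklearn.decomposition import LatentDirichletAllocation

     # Toy stand-ins for real contract documents.
     docs = [
         "the contractor will supply network hardware and software support",
         "construction of the new office building begins in march",
         "software licensing and server maintenance for the agency",
     ]

     # Tokenize and build a document-term matrix.
     vectorizer = CountVectorizer(stop_words="english")
     dtm = vectorizer.fit_transform(docs)

     # Fit LDA; you have to pick the number of topics yourself.
     lda = LatentDirichletAllocation(n_components=2, random_state=0)
     doc_topics = lda.fit_transform(dtm)

     # Top words per topic--these are what you'd inspect by hand to label the topics.
     terms = vectorizer.get_feature_names_out()
     for idx, topic in enumerate(lda.components_):
         print("topic", idx, [terms[i] for i in topic.argsort()[-5:]])

     print(doc_topics)  # how much each topic shows up in each document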