Probably way more effort than it's worth. 

 

If you have access to a web server with MySQL and phpMyAdmin, you may be able to import the XML file into a MySQL database (I vaguely remember this being possible; I haven't used phpMyAdmin in ages), then use some code to display the database as paged results and go through them. You may also be able to split the database via phpMyAdmin into multiple separate databases (using some basic queries), then export each one individually as a smaller XML file.

 

But, effort. 

Well, my dad might have access to a few of those types of servers :D How does a 500Mb/s internet connection just for hosting an .XML file for myself sound? :D If only I could actually do that...


So... I downloaded Wikipedia in the form of a .XML file. It's around 40GB... I can't view the file because Notepad says it's too big, and Microsoft Word doesn't allow files larger than 512MB.

Where did you download it from? I'd like to have an XML copy of Wikipedia, even though it will be immediately out of date.

Desktop: Intel Core i7-6700K, ASUS Z170-A, ASUS STRIX GTX 1080 Ti, 16GB DDR4 RAM, 512 GB Samsung 840 Pro, Seasonic X series 650W PSU, Fractal Design Define R4, 2x5TB HDD

Hypervisor 1: Intel Xeon E5-2630L, ASRock EPC612D8, 16GB DDR4 ECC RAM, Intel RT3WB080 8-port RAID controller plus expansion card, Norco RPC-4020 case, 20x2TB WD Red HDD

Other spare hypervisors: Dell Poweredge 2950, HP Proliant DL380 G5

Laptops: ThinkPads, lots of ThinkPads

 


This link will auto-download it. Be careful though: you'll want 50GB of free space before you download it. When it uncompresses, it makes a "temp" file that doesn't get deleted. http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
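If the free space is a problem, one possible workaround is to skip the extraction and read the compressed file directly. Below is a minimal sketch, assuming Python 3 and its standard `bz2` module; the function name `head_bz2` is made up for illustration. It only decompresses as much as it needs to return the first few lines, so it should work as a quick way to peek at the dump.

```python
import bz2
from itertools import islice


def head_bz2(path, n=20):
    """Return the first n text lines of a .bz2 file.

    The file is decompressed lazily, so only the beginning of the
    archive is read; nothing is written to disk.
    """
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        # islice stops after n lines instead of reading the whole file.
        return [line.rstrip("\n") for line in islice(f, n)]
```

Something like `print("\n".join(head_bz2("enwiki-latest-pages-articles.xml.bz2", 40)))` should then show the top of the dump without needing the extra 40GB.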


http://stackoverflow.com/questions/12612229/parsing-a-large-40gb-xml-text-file-in-python

 

The only other option is to use a programming language to parse the file incrementally (line by line or element by element), without having to load the entire file into memory.
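A rough sketch of that idea in Python, using `xml.etree.ElementTree.iterparse` from the standard library. It assumes the dump follows the usual MediaWiki export layout (`<page>` elements that each contain a `<title>`); the helper names `local_name` and `iter_titles` are just for illustration, and this is not necessarily the approach from the linked answer.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(source):
    """Yield page titles one at a time from a MediaWiki-style XML dump.

    `source` may be a filename or an open binary file object, so an
    object returned by bz2.open(path, "rb") should work as well.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            for child in elem:
                if local_name(child.tag) == "title":
                    yield child.text
                    break
            # Drop finished pages from the tree so memory use stays flat.
            root.clear()
```

Because it is a generator, something like `for title in iter_titles("enwiki-latest-pages-articles.xml"): print(title)` starts printing right away instead of waiting for the whole 40GB to load.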

Interested in Linux, SteamOS and Open-source applications? Go here

Gaming Rig - CPU: i5 3570k @ Stock | GPU: EVGA Geforce 560Ti 448 Core Classified Ultra | RAM: Mushkin Enhanced Blackline 8GB DDR3 1600 | SSD: Crucial M4 128GB | HDD: 3TB Seagate Barracuda, 1TB WD Caviar Black, 1TB Seagate Barracuda | Case: Antec Lanboy Air | KB: Corsair Vengeance K70 Cherry MX Blue | Mouse: Corsair Vengeance M95 | Headset: Steelseries Siberia V2

 

 


It's taking a LONG time to extract that file. It didn't take long to download at 100Mbps.


Probably way more effort than it's worth. 

 

If you have access to a web server with MySQL and phpMyAdmin, you may be able to import the XML file into a MySQL database (I vaguely remember this being possible; I haven't used phpMyAdmin in ages), then use some code to display the database as paged results and go through them. You may also be able to split the database via phpMyAdmin into multiple separate databases (using some basic queries), then export each one individually as a smaller XML file.

 

But, effort. 

 

Edit: Querying a 40GB database might be problematic; most large databases (including those behind large forums) are nowhere close to the 40GB mark.

I just happen to have exactly this.  I'm going to have to try it.
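For anyone without a MySQL server handy, here is a small sketch of the "import it, then show paged results" idea. It swaps in SQLite (Python's built-in `sqlite3` module) instead of MySQL and phpMyAdmin, and the table layout and the names `build_db` and `fetch_page` are assumptions made up for this example, not anything from phpMyAdmin.

```python
import sqlite3


def build_db(conn, pages):
    """Create a `pages` table and bulk-insert (title, body) pairs.

    `pages` can be any iterable, including a generator that streams
    pages out of the XML dump.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO pages (title, body) VALUES (?, ?)", pages)
    conn.commit()


def fetch_page(conn, page_number, page_size=50):
    """Return one screenful of (id, title) rows; page_number starts at 0."""
    offset = page_number * page_size
    return conn.execute(
        "SELECT id, title FROM pages ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
```

Usage would be something like `conn = sqlite3.connect("wiki.db")`, then `build_db(conn, pairs)` once and `fetch_page(conn, 0)`, `fetch_page(conn, 1)`, and so on. Large `OFFSET` values tend to get slow on big tables, so paging with `WHERE id > last_seen_id` may scale better.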


Found a program that might work: XMLmax. I'm going to try it and see how it does.

 

http://www.xponentsoftware.com/TrialDownload.aspx
