How to read only the last x number of lines

I have text and CSV files which are multiple GiB in size (one CSV is 2.5 GiB and another is 3.2 GiB presently, and growing) and I am only interested in certain lines in these files. If I load them into an array, my programme crashes because I don't have enough RAM. So how do I load only certain lines, let's say the last 50 or the last 100, in a way that doesn't require at minimum 5 GiB of RAM? (C# strings use 2 bytes per character versus the CSV's 1 byte on disk, so roughly double the file size in RAM.) That won't be possible on some of the systems I am planning to run this code on, especially since I expect the 3.2 GiB file to be 50 GiB soonish, so at minimum 100 GiB of RAM would be needed.

 

Long and short: what is the least resource-intensive way to load only certain lines? I don't mind if the code is quite long, as long as it uses minimal RAM. (It's a long story what these files are, but yeah.)



Well, you can either just read the file line by line and only store lines once a certain condition is met, or you could run the system command 'tail -n 50 filename.txt'. You could also use fseek() to move the file pointer to the end of the file and then step backwards.
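
For instance, a minimal C# sketch of the first option (the method name is illustrative); a Queue caps memory at the last N lines no matter how big the file gets:

using System;
using System.Collections.Generic;
using System.IO;

static string[] TailByScanning(string filePath, int lineCount)
{
    // Only ever hold the most recent lineCount lines in memory
    var lastLines = new Queue<string>(lineCount);
    using (var reader = new StreamReader(filePath))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (lastLines.Count == lineCount)
            {
                lastLines.Dequeue(); // drop the oldest line
            }
            lastLines.Enqueue(line);
        }
    }
    return lastLines.ToArray();
}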



Count the lines = n, then start reading from n - 100? Maybe there's a better way (rough sketch of this below).

 

I thought of tail too, but I'm not sure if every system has that.
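
For completeness, a rough C# sketch of that count-then-skip idea (names are illustrative, and it assumes the file isn't being appended to between the two passes); note it reads the file twice, so a single pass with a rolling buffer is usually cheaper:

using System;
using System.Collections.Generic;
using System.IO;

static List<string> TailTwoPass(string filePath, int lineCount)
{
    // First pass: count the lines without storing any of them
    long totalLines = 0;
    using (var reader = new StreamReader(filePath))
    {
        while (reader.ReadLine() != null) totalLines++;
    }

    // Second pass: skip everything before line (totalLines - lineCount)
    long skip = Math.Max(0, totalLines - lineCount);
    var lastLines = new List<string>(lineCount);
    using (var reader = new StreamReader(filePath))
    {
        string line;
        for (long i = 0; (line = reader.ReadLine()) != null; i++)
        {
            if (i >= skip) lastLines.Add(line);
        }
    }
    return lastLines;
}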



determine the file size

use SetFilePointer or seek or whatever function your programming language uses to seek to (end of file - some amount of data, let's say 64 KB)

read that amount of data into memory

scan the data and figure out the line separator: whether it's a Windows enter (carriage return 0x0D + line feed 0x0A) or just one of the two characters (Linux often uses only a line feed 0x0A to mark a new line) - see https://en.wikipedia.org/wiki/Newline

knowing the line separator, you can now separate this chunk of text into lines, ignoring the first line because you don't know whether it's a "complete" line or not; most likely it's incomplete. When you jump back some fixed amount of bytes in the file (for example exactly 64 KB), you'll usually land in the middle of a line; only in rare circumstances would you land exactly on the first character of a line.

if you didn't read enough lines, repeat the process: set the file pointer to end of file - 128 KB, read another 64 KB, and append to its last line the characters of the first line from the previous read - the one you ignored because you weren't sure it was a complete line.
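
A sketch of this backwards block-reading scheme in C# (the 64 KB block size, the UTF-8 assumption and the method name are illustrative choices, not requirements):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static string[] TailBySeeking(string filePath, int lineCount, int blockSize = 64 * 1024)
{
    using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        var lines = new List<string>(); // lines[0] is the earliest line seen so far
        long end = stream.Length;

        // Read blocks from the end, moving backwards, until we have a couple more
        // lines than requested: the first line of the earliest block may be cut
        // off, and a trailing newline in the file adds one empty entry.
        while (end > 0 && lines.Count < lineCount + 2)
        {
            int toRead = (int)Math.Min(blockSize, end);
            long start = end - toRead;
            stream.Seek(start, SeekOrigin.Begin);

            var buffer = new byte[toRead];
            int read = 0, n;
            while (read < toRead && (n = stream.Read(buffer, read, toRead - read)) > 0)
            {
                read += n;
            }

            // Split on LF and trim a trailing CR, so LF and CR+LF files both work.
            // (Assumes ASCII/UTF-8 text, where a newline byte is unambiguous.)
            var chunkLines = Encoding.UTF8.GetString(buffer, 0, read)
                                     .Split('\n')
                                     .Select(l => l.TrimEnd('\r'))
                                     .ToList();

            // The block boundary probably fell mid-line: glue the last piece of
            // this block onto the first piece carried over from the previous one.
            if (lines.Count > 0)
            {
                lines[0] = chunkLines[chunkLines.Count - 1] + lines[0];
                chunkLines.RemoveAt(chunkLines.Count - 1);
            }
            lines.InsertRange(0, chunkLines);
            end = start;
        }

        // A file ending in a newline leaves one empty trailing entry; drop it,
        // then keep only the last lineCount lines.
        if (lines.Count > 0 && lines[lines.Count - 1].Length == 0)
        {
            lines.RemoveAt(lines.Count - 1);
        }
        return lines.Skip(Math.Max(0, lines.Count - lineCount)).ToArray();
    }
}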

 

Be careful with the set-file-pointer or seek function: in some APIs or programming languages it is (or used to be) 32-bit, so it would not work past 2^31 bytes, or around 2 GB. Some functions instead give you two variables, high and low, which together make up 64 bits.

For example, see: https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-setfilepointer

 

ps. You can't jump straight to a specific line without reading the whole file and determining where the newline characters are.

You don't have to read the whole file into memory, but you do have to read it in chunks (for example 4 KB, 32 KB, 64 KB or 512 KB at a time, because drivers and file systems are optimized to read units that are powers of two: storage uses 512-byte or 4 KB sectors, and SSDs are arranged in 4 KB or 512 KB blocks and pages) and memorize at which byte each line begins. So you'd need approximately 4 bytes of memory per line of text (8 bytes per line if the file can grow past 4 GB) to keep track of where each line starts. A sketch is below.
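
A short C# sketch of building such a line-start index in chunks (illustrative; assumes LF or CR+LF line endings):

using System.Collections.Generic;
using System.IO;

static List<long> IndexLineStarts(string filePath, int chunkSize = 64 * 1024)
{
    var starts = new List<long> { 0 }; // line 0 starts at byte 0
    using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        var buffer = new byte[chunkSize];
        long position = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] == (byte)'\n')
                {
                    starts.Add(position + i + 1); // the next line starts right after the LF
                }
            }
            position += read;
        }
    }
    return starts;
}

Once built, starts[i] is the byte offset you can Seek() to in order to start reading line i directly.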

 

 


@79wjd @mariushm

I am not running Linux on the machines that will be running this programme, it's Windows, and I can't be arsed installing the tools to get tail on Windows. And unfortunately I have no idea how to implement anything you are suggesting, mariushm.

 

I found a somewhat easy way that doesn't require too much RAM; it just might take a while to get to the end of the file (30 s for a 2.5 GiB file), which I can live with, and it gets faster on future runs once the file is cached.

 

Here's the method I ended up with, for anyone who wants to know in the future. It's not fast for massive files, but it gets the job done fast enough:

 

using System;
using System.IO;

/// <summary>
/// Reads the file line by line, keeping only the last LineNO lines in ARRAY,
/// so at most LineNO lines are ever held in memory.
/// </summary>
/// <param name="FilePath">Path of the file to read</param>
/// <param name="LineNO">Number of lines to keep</param>
/// <param name="ARRAY">Receives the last LineNO lines</param>
static void getlastLines(string FilePath, Int64 LineNO, ref string[] ARRAY)
{
  using (StreamReader Reader = new StreamReader(FilePath))
  {
    string Line = Reader.ReadLine();
    while (Line != null)
    {
      Int32 Length = ARRAY.Length;
      if (Length < LineNO)
      {
        // Array not full yet: grow it by one and append the line
        Array.Resize(ref ARRAY, Length + 1);
        ARRAY[Length] = Line;
      }
      else
      {
        // Array full: drop the oldest line, then append the new one
        RemovexLines(ARRAY, ref ARRAY, 1);
        Array.Resize(ref ARRAY, ARRAY.Length + 1);
        ARRAY[ARRAY.Length - 1] = Line;
      }
      Line = Reader.ReadLine();
    }
  }
}

/// <summary>
/// Removes x lines from the start of the array.
/// </summary>
/// <param name="ARRAYin">Source array</param>
/// <param name="ARRAYout">Receives the shortened array</param>
/// <param name="x">Number of lines to remove from the start</param>
private static void RemovexLines(string[] ARRAYin, ref string[] ARRAYout, Int32 x)
{
  string[] ARRAYtemp = new string[] { };
  Int32 Length = ARRAYin.Length;
  Int32 NewLength = Length - x;
  if (Length <= x)
  {
    // Removing as many lines as exist (or more) leaves an empty array
    ARRAYout = ARRAYtemp;
    return;
  }
  if (x == 0)
  {
    ARRAYout = ARRAYin;
    return;
  }
  // Copy everything from index x onwards into a new, shorter array
  Int64 C1 = x;
  Int64 C2 = 0;
  Array.Resize(ref ARRAYtemp, NewLength);
  do
  {
    ARRAYtemp[C2] = ARRAYin[C1];
    C1++;
    C2++;
  } while (C1 < Length);
  ARRAYout = ARRAYtemp;
}

 


Use a FileStream instead; that gives you Seek, reading data into a byte array, etc.

See https://docs.microsoft.com/en-us/dotnet/api/system.io.filestream?view=netframework-4.7.2

And here's an example of reading a chunk of data from a random position in a file: https://www.codeproject.com/Questions/543821/ReadplusBytesplusfromplusLargeplusBinaryplusfilepl

You can use .Length, which gives you the file size, so you can simply pass (length of file - 32 KB) with SeekOrigin.Begin to .Seek, and then read just one chunk of 32 KB or whatever size you want.

You can convert your byte array to a string and then use .Split or whatever .NET has (it's explode() in PHP, which is what I'm familiar with) to separate the chunk of data into lines.
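
Putting those pieces together, a minimal C# sketch (the method name, the 32 KB chunk size and the UTF-8 encoding are assumptions):

using System;
using System.IO;
using System.Linq;
using System.Text;

static string[] TailOneChunk(string filePath, int lineCount, int chunkSize = 32 * 1024)
{
    using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        // Seek to (length of file - 32 KB), clamped for small files
        int toRead = (int)Math.Min(chunkSize, stream.Length);
        stream.Seek(stream.Length - toRead, SeekOrigin.Begin);

        var buffer = new byte[toRead];
        int read = 0, n;
        while (read < toRead && (n = stream.Read(buffer, read, toRead - read)) > 0)
        {
            read += n;
        }

        // Convert the bytes to text and split into lines (LF and CR+LF both handled)
        var lines = Encoding.UTF8.GetString(buffer, 0, read)
                            .Split('\n')
                            .Select(l => l.TrimEnd('\r'))
                            .ToArray();

        // A file that ends with a newline produces one empty trailing entry
        if (lines.Length > 0 && lines[lines.Length - 1].Length == 0)
        {
            lines = lines.Take(lines.Length - 1).ToArray();
        }

        // Skip the first line (probably cut off by the seek), keep the last lineCount
        return lines.Skip(Math.Max(1, lines.Length - lineCount)).ToArray();
    }
}

If the chunk turns out to contain fewer than lineCount lines, you'd need to read a bigger chunk, or fall back to the block-by-block approach described earlier.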

 



The simplest way is to read a block from the end of the file and then scan it for lines, as already mentioned.

Whipped together a little C++ example to get you going - Ask if you have trouble understanding something:

#include <stdexcept>
#include <fstream>
#include <string>
#include <vector>
#include <iostream>
#include <sstream>
#include <algorithm>

class FileHelper
{
public:

    struct OpenFailed : std::runtime_error { using std::runtime_error::runtime_error; };
    struct TellFailed : std::runtime_error { using std::runtime_error::runtime_error; };
    struct SeekFailed : std::runtime_error { using std::runtime_error::runtime_error; };
    struct ReadFailed : std::runtime_error { using std::runtime_error::runtime_error; };

    FileHelper(const std::string& fileName) :
        mStream(fileName, std::ifstream::ate | std::ifstream::binary), // binary keeps CR+LF bytes intact for the line splitting below
        mFileName(fileName)
    {
        if (!mStream.is_open())
        {
            throw OpenFailed(mFileName);
        }

        mSize = Tell();
    }

    std::ifstream::off_type
    GetSize() const
    {
        return mSize;
    }

    std::string
    Read(int nChars, std::ifstream::off_type position)
    {
        Seek(position);
        std::string buffer(nChars, ' ');
        if (!mStream.read(&buffer[0], nChars))
        {
            throw ReadFailed(mFileName);
        }

        return buffer;
    }

    FileHelper(const FileHelper&) = delete;
    FileHelper(FileHelper&&) = default;
    FileHelper& operator = (const FileHelper&) = delete;
    FileHelper& operator = (FileHelper&&) = default;

private:

    std::ifstream::off_type
    Tell()
    {
        const auto position = mStream.tellg();
        if (position == -1)
        {
            throw TellFailed(mFileName);
        }

        return position;
    }

    void
    Seek(std::ifstream::off_type position)
    {
        if (!mStream.seekg(position))
        {
            throw SeekFailed(mFileName);
        }
    }

    std::ifstream mStream;
    std::string mFileName;
    std::ifstream::off_type mSize;
};

std::vector<std::string>
ReadLastLines(const std::string& fileName, int nLines, int blockSize)
{
    FileHelper fh(fileName);
    const auto bytesToRead = (fh.GetSize() < blockSize) ? fh.GetSize() : blockSize;
    const auto buffer = fh.Read(bytesToRead, fh.GetSize() - bytesToRead);

    std::stringstream ss(buffer);
    std::vector<std::string> extractedLines;
    std::string currentLine;
    while (std::getline(ss, currentLine))
    {
        // getline consumes the LF; strip a trailing CR so CR+LF files work too
        if (!currentLine.empty() && currentLine.back() == '\r')
        {
            currentLine.pop_back();
        }
        extractedLines.emplace_back(currentLine);
    }

    const auto nLinesToCopy = std::min<std::size_t>(extractedLines.size(), nLines);
    std::vector<std::string> lastLines(nLinesToCopy);
    std::copy(extractedLines.end() - nLinesToCopy, extractedLines.end(), lastLines.begin());
    return lastLines;
}

int
main()
{
    const auto nLines = 50; //Amount of lines to extract
    const auto blockSize = 1024 * 1024; //1M tail block

    try
    {
        const auto lastLines = ReadLastLines("input.csv", nLines, blockSize);

        std::cout << "\nGot " << lastLines.size() << " Lines\n";
        auto i = 0;
        for (const auto& line : lastLines)
        {
            std::cout << ++i << ": " << line << "\n";
        }

        return 0;
    }
    catch(const FileHelper::OpenFailed&)
    {
        std::cout << "Opening file failed!\n";
        return 1;
    }
    catch(const FileHelper::TellFailed&)
    {
        std::cout << "Getting file position failed!\n";
        return 2;
    }
    catch(const FileHelper::SeekFailed&)
    {
        std::cout << "Seeking file position failed!\n";
        return 3;
    }
    catch(const FileHelper::ReadFailed&)
    {
        std::cout << "reading from file failed!\n";
        return 4;
    }
}

 

