Java: reading a large text file with 70 million lines of text

I have a large test file with 70 million lines of text. I have to read the file line by line.

I used two different approaches:

    InputStreamReader isr = new InputStreamReader(new FileInputStream(FilePath), "unicode");
    BufferedReader br = new BufferedReader(isr);
    while ((cur = br.readLine()) != null);

and

    LineIterator it = FileUtils.lineIterator(new File(FilePath), "unicode");
    while (it.hasNext())
        cur = it.nextLine();

Is there another approach that could make this task faster?

Best regards,

1) I am sure there is no difference in terms of speed; both use FileInputStream and buffering internally

2) You can take measurements and see for yourself
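One way to take those measurements is a small timing harness like the sketch below. It is only a rough illustration, not a rigorous benchmark: the class name `ReadTiming`, its helper methods, and the default file name `test.txt` are made up for this example, and plain `System.nanoTime()` timing is sensitive to JIT warm-up and to the operating system's file cache.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

final class ReadTiming {

    /** A line-counting strategy that is allowed to throw IOException. */
    interface LineCounter {
        long count(Path file) throws IOException;
    }

    /** Counts lines with BufferedReader.readLine(). */
    static long countWithBufferedReader(Path file) throws IOException {
        long lines = 0;
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (br.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    /** Counts lines with the Files.lines() stream (Java 8+). */
    static long countWithStream(Path file) throws IOException {
        try (Stream<String> stream = Files.lines(file, StandardCharsets.UTF_8)) {
            return stream.count();
        }
    }

    /** Runs the counter several times and returns the best elapsed time in nanoseconds. */
    static long bestOf(int runs, LineCounter counter, Path file) throws IOException {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            counter.count(file);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        // "test.txt" is a hypothetical default; pass the real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "test.txt");
        System.out.println("BufferedReader: "
                + bestOf(5, ReadTiming::countWithBufferedReader, file) + " ns");
        System.out.println("Files.lines:    "
                + bestOf(5, ReadTiming::countWithStream, file) + " ns");
    }
}
```

Taking the best of several runs reduces noise from warm-up, but for anything you intend to publish a dedicated benchmarking tool is probably the safer choice.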

3) Although there are no performance benefits, I like the Java 1.7 approach

    try (BufferedReader br = Files.newBufferedReader(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
        for (String line = null; (line = br.readLine()) != null;) {
            //
        }
    }

4) Scanner-based version

    try (Scanner sc = new Scanner(new File("test.txt"), "UTF-8")) {
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
        }
        // note that Scanner suppresses exceptions
        if (sc.ioException() != null) {
            throw sc.ioException();
        }
    }

5) This may be faster than the rest

    try (SeekableByteChannel ch = Files.newByteChannel(Paths.get("test.txt"))) {
        ByteBuffer bb = ByteBuffer.allocateDirect(1000);
        for (;;) {
            StringBuilder line = new StringBuilder();
            int n = ch.read(bb);
            // add chars to line
            // ...
        }
    }

It requires a bit of coding, but it can be genuinely faster thanks to ByteBuffer.allocateDirect, which allows the operating system to read bytes from the file into the ByteBuffer directly, without copying
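To give an idea of the "bit of coding" involved, here is one possible way to fill in the skeleton. It is a sketch under assumptions, not a tuned implementation: the class `ChannelLineReader` and its method names are invented for this example, the file is assumed to be UTF-8 (splitting on the byte `\n` is not safe for UTF-16), and the byte-at-a-time loop favors clarity over speed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

final class ChannelLineReader {

    /** Reads the file through a direct ByteBuffer and passes each line to the action. */
    static void forEachLine(Path file, Consumer<String> action) throws IOException {
        try (SeekableByteChannel ch = Files.newByteChannel(file)) {
            ByteBuffer bb = ByteBuffer.allocateDirect(64 * 1024);
            // Bytes of the current line; survives across buffer refills.
            ByteArrayOutputStream line = new ByteArrayOutputStream(256);
            while (ch.read(bb) != -1) {
                bb.flip();                      // switch the buffer from filling to draining
                while (bb.hasRemaining()) {
                    byte b = bb.get();
                    if (b == '\n') {
                        action.accept(decode(line));
                        line.reset();
                    } else {
                        line.write(b);
                    }
                }
                bb.clear();                     // ready for the next read
            }
            if (line.size() > 0) {              // last line without a trailing newline
                action.accept(decode(line));
            }
        }
    }

    /** Decodes the collected bytes as UTF-8, dropping a trailing '\r' from "\r\n" endings. */
    private static String decode(ByteArrayOutputStream line) {
        byte[] bytes = line.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--;
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }
}
```

A call such as `ChannelLineReader.forEachLine(Paths.get("test.txt"), line -> { /* ... */ });` would then visit every line. Whether this actually beats BufferedReader depends on the system, so it is worth measuring before adopting it.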

6) Parallel processing would definitely increase the speed. Create a big byte buffer, run several tasks that read bytes from the file into that buffer in parallel, and when they are done find the first end of line, make a String, find the next …
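A variation on that idea is sketched below: instead of one shared buffer, each task gets its own byte range of the file and handles every line that starts inside that range. This is a swapped-in simplification, not the exact scheme described above. The class `ParallelLineReader` and its methods are hypothetical names, the file is assumed to be UTF-8 with `\n` line endings, and the handler must be thread-safe because lines arrive in no particular order.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ParallelLineReader {

    /** Splits the file into byte ranges, processes them concurrently, returns the line count. */
    static long process(Path file, int tasks, Consumer<String> handler)
            throws IOException, InterruptedException, ExecutionException {
        long size = Files.size(file);
        long chunk = Math.max(1, (size + tasks - 1) / tasks);
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long start = 0; start < size; start += chunk) {
                final long from = start;
                final long to = Math.min(size, start + chunk);
                results.add(pool.submit(() -> processRange(file, from, to, handler)));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Handles every line whose first byte lies in [from, to). */
    static long processRange(Path file, long from, long to, Consumer<String> handler)
            throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long pos = Math.max(0, from - 1);
            ch.position(pos);
            InputStream in = new BufferedInputStream(Channels.newInputStream(ch), 1 << 16);
            int b;
            if (from > 0) {
                // Start one byte early and skip to just past the next '\n', so a line
                // that began in the previous range is left to the task owning that range.
                do {
                    b = in.read();
                    pos++;
                } while (b != -1 && b != '\n');
                if (b == -1) {
                    return 0;
                }
            }
            ByteArrayOutputStream line = new ByteArrayOutputStream(256);
            long lines = 0;
            while (pos < to) {                  // pos is always the offset of a line start
                line.reset();
                boolean readAny = false;
                while ((b = in.read()) != -1) {
                    pos++;
                    readAny = true;
                    if (b == '\n') {
                        break;
                    }
                    line.write(b);
                }
                if (!readAny) {
                    break;                      // unexpected end of file
                }
                handler.accept(new String(line.toByteArray(), StandardCharsets.UTF_8));
                lines++;
            }
            return lines;
        }
    }
}
```

A line that straddles a range boundary is read in full by the task in whose range it starts, so no line is split or processed twice. On a single spinning disk the concurrent reads may well be slower than one sequential pass, so treat the speed-up as something to verify rather than assume.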

If you are looking at performance, you could have a look at the java.nio.* packages, which are supposedly faster than java.io.*
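As one example of what java.nio offers, the sketch below memory-maps the file and counts newline bytes. The class `MappedLineCounter` and the 1 GiB window size are choices made for this illustration only; the windowing is there because a single mapping cannot exceed Integer.MAX_VALUE bytes, and counting `\n` bytes assumes an ASCII-compatible encoding such as UTF-8.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class MappedLineCounter {

    // Size of each mapped window; an arbitrary choice below the 2 GiB mapping limit.
    private static final long WINDOW = 1L << 30;

    /** Counts '\n' bytes by mapping the file into memory one window at a time. */
    static long countNewlines(Path file) throws IOException {
        long lines = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; ) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (map.hasRemaining()) {
                    if (map.get() == '\n') {
                        lines++;
                    }
                }
                pos += len;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countNewlines(Paths.get(args[0])));
    }
}
```

This only counts lines; turning the mapped bytes into String objects adds decoding cost that can easily dominate, which is one reason the "supposedly" in the sentence above deserves a measurement.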

There is an article that compares different ways of reading files. It will help you find the best solution.

Article: Java tip: How to read files quickly

I had a similar problem, but I only needed the bytes from the file. I read through the links provided in the various answers and ultimately tried writing one similar to #5 in Evgeniy's answer. They weren't kidding, it took a lot of code.

The basic premise is that each line of text is of unknown length. I start with a SeekableByteChannel, read data into a ByteBuffer, then loop over it looking for EOL. When a line is a "carryover" between loop iterations, the code increments a counter, then moves the SeekableByteChannel position back and reads the whole record in one go.

It is verbose … but it works. It was plenty fast for what I needed, but I'm sure there are more improvements that can be made.

The process method is stripped down to the basics needed to kick off reading the file.

    private long startOffset;
    private long endOffset;
    private SeekableByteChannel sbc;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(1024);

    public void process() throws IOException {
        startOffset = 0;
        sbc = Files.newByteChannel(FILE, EnumSet.of(READ));
        byte[] message = null;
        while ((message = readRecord()) != null) {
            // do something
        }
    }

    public byte[] readRecord() throws IOException {
        endOffset = startOffset;
        boolean eol = false;
        boolean carryOver = false;
        byte[] record = null;

        while (!eol) {
            byte data;
            buffer.clear();
            final int bytesRead = sbc.read(buffer);
            if (bytesRead == -1) {
                return null;
            }
            buffer.flip();

            for (int i = 0; i < bytesRead && !eol; i++) {
                data = buffer.get();
                if (data == '\r' || data == '\n') {
                    eol = true;
                    endOffset += i;

                    if (carryOver) {
                        final int messageSize = (int) (endOffset - startOffset);
                        sbc.position(startOffset);
                        final ByteBuffer tempBuffer = ByteBuffer.allocateDirect(messageSize);
                        sbc.read(tempBuffer);
                        tempBuffer.flip();
                        record = new byte[messageSize];
                        tempBuffer.get(record);
                    } else {
                        record = new byte[i];
                        // Need to move the buffer position back since the get moved it forward
                        buffer.position(0);
                        buffer.get(record, 0, i);
                    }

                    // Skip past the newline characters
                    if (isWindowsOS()) {
                        startOffset = (endOffset + 2);
                    } else {
                        startOffset = (endOffset + 1);
                    }

                    // Move the file position back
                    sbc.position(startOffset);
                }
            }

            if (!eol && sbc.position() == sbc.size()) {
                // We have hit the end of the file, just take all the bytes
                record = new byte[bytesRead];
                eol = true;
                buffer.position(0);
                buffer.get(record, 0, bytesRead);
            } else if (!eol) {
                // The EOL marker wasn't found, continue the loop
                carryOver = true;
                endOffset += bytesRead;
            }
        }

        // System.out.println(new String(record));
        return record;
    }

This article is a great way to start.

Also, you should create test cases in which you read the first 10k lines (or some other number, but it shouldn't be too small) and measure the read times accordingly.

Threading might be a good way to go, but it's important that we know what you'll be doing with the data.

Another thing to consider is how you will store that amount of data.

I actually researched this topic for months in my free time and came up with a benchmark. Here is code to compare all the different ways of reading a file line by line. Individual performance may vary depending on the underlying system. I ran it on an HP laptop with Windows 10, Java 8 and an Intel i5. Here is the code.

    import java.io.*;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;
    import java.util.regex.Pattern;
    import java.util.stream.Stream;

    public class ReadComplexDelimitedFile {
        private static long total = 0;
        private static final Pattern FIELD_DELIMITER_PATTERN = Pattern.compile("\\^\\|\\^");

        @SuppressWarnings("unused")
        private void readFileUsingScanner() {
            String s;
            try (Scanner stdin = new Scanner(new File(this.getClass().getResource("input.txt").getPath()))) {
                while (stdin.hasNextLine()) {
                    s = stdin.nextLine();
                    String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                    total = total + fields.length;
                }
            } catch (Exception e) {
                System.err.println("Error");
            }
        }

        //Winner
        private void readFileUsingCustomBufferedReader() {
            try (CustomBufferedReader stdin = new CustomBufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
                String s;
                while ((s = stdin.readLine()) != null) {
                    String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                    total += fields.length;
                }
            } catch (Exception e) {
                System.err.println("Error");
            }
        }

        private void readFileUsingBufferedReader() {
            try (BufferedReader stdin = new BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
                String s;
                while ((s = stdin.readLine()) != null) {
                    String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                    total += fields.length;
                }
            } catch (Exception e) {
                System.err.println("Error");
            }
        }

        private void readFileUsingLineReader() {
            try (LineNumberReader stdin = new LineNumberReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
                String s;
                while ((s = stdin.readLine()) != null) {
                    String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                    total += fields.length;
                }
            } catch (Exception e) {
                System.err.println("Error");
            }
        }

        private void readFileUsingStreams() {
            try (Stream<String> stream = Files.lines((new File(this.getClass().getResource("input.txt").getPath())).toPath())) {
                total += stream.mapToInt(s -> FIELD_DELIMITER_PATTERN.split(s, 0).length).sum();
            } catch (IOException e1) {
                e1.printStackTrace();
            }
        }

        private void readFileUsingBufferedReaderFileChannel() {
            try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
                try (FileChannel inputChannel = fis.getChannel()) {
                    try (CustomBufferedReader stdin = new CustomBufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
                        String s;
                        while ((s = stdin.readLine()) != null) {
                            String[] fields = FIELD_DELIMITER_PATTERN.split(s, 0);
                            total = total + fields.length;
                        }
                    }
                } catch (Exception e) {
                    System.err.println("Error");
                }
            } catch (Exception e) {
                System.err.println("Error");
            }
        }

        public static void main(String args[]) {
            //JVM warmup
            for (int i = 0; i < 100000; i++) {
                total += i;
            }
            // We know scanner is slow-Still warming up
            ReadComplexDelimitedFile readComplexDelimitedFile = new ReadComplexDelimitedFile();
            List<Long> longList = new ArrayList<>(50);
            for (int i = 0; i < 50; i++) {
                total = 0;
                long startTime = System.nanoTime();
                //readComplexDelimitedFile.readFileUsingScanner();
                long stopTime = System.nanoTime();
                long timeDifference = stopTime - startTime;
                longList.add(timeDifference);
            }
            System.out.println("Time taken for readFileUsingScanner");
            longList.forEach(System.out::println);

            // Actual performance test starts here
            longList = new ArrayList<>(10);
            for (int i = 0; i < 10; i++) {
                total = 0;
                long startTime = System.nanoTime();
                readComplexDelimitedFile.readFileUsingBufferedReaderFileChannel();
                long stopTime = System.nanoTime();
                long timeDifference = stopTime - startTime;
                longList.add(timeDifference);
            }
            System.out.println("Time taken for readFileUsingBufferedReaderFileChannel");
            longList.forEach(System.out::println);
            longList.clear();

            for (int i = 0; i < 10; i++) {
                total = 0;
                long startTime = System.nanoTime();
                readComplexDelimitedFile.readFileUsingBufferedReader();
                long stopTime = System.nanoTime();
                long timeDifference = stopTime - startTime;
                longList.add(timeDifference);
            }
            System.out.println("Time taken for readFileUsingBufferedReader");
            longList.forEach(System.out::println);
            longList.clear();

            for (int i = 0; i < 10; i++) {
                total = 0;
                long startTime = System.nanoTime();
                readComplexDelimitedFile.readFileUsingStreams();
                long stopTime = System.nanoTime();
                long timeDifference = stopTime - startTime;
                longList.add(timeDifference);
            }
            System.out.println("Time taken for readFileUsingStreams");
            longList.forEach(System.out::println);
            longList.clear();

            for (int i = 0; i < 10; i++) {
                total = 0;
                long startTime = System.nanoTime();
                readComplexDelimitedFile.readFileUsingCustomBufferedReader();
                long stopTime = System.nanoTime();
                long timeDifference = stopTime - startTime;
                longList.add(timeDifference);
            }
            System.out.println("Time taken for readFileUsingCustomBufferedReader");
            longList.forEach(System.out::println);
            longList.clear();

            for (int i = 0; i < 10; i++) {
                total = 0;
                long startTime = System.nanoTime();
                readComplexDelimitedFile.readFileUsingLineReader();
                long stopTime = System.nanoTime();
                long timeDifference = stopTime - startTime;
                longList.add(timeDifference);
            }
            System.out.println("Time taken for readFileUsingLineReader");
            longList.forEach(System.out::println);
        }
    }

I had to rewrite BufferedReader to avoid synchronization and a couple of boundary conditions that are not necessary. (At least that's what I felt. It is not unit tested, so use it at your own risk.)

    import com.sun.istack.internal.NotNull;

    import java.io.*;
    import java.util.Iterator;
    import java.util.NoSuchElementException;
    import java.util.Spliterator;
    import java.util.Spliterators;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;
    import java.util.stream.Stream;
    import java.util.stream.StreamSupport;

    /**
     * Reads text from a character-input stream, buffering characters so as to
     * provide for the efficient reading of characters, arrays, and lines.
     *
     * <p> The buffer size may be specified, or the default size may be used. The
     * default is large enough for most purposes.
     *
     * <p> In general, each read request made of a Reader causes a corresponding
     * read request to be made of the underlying character or byte stream. It is
     * therefore advisable to wrap a CustomBufferedReader around any Reader whose read()
     * operations may be costly, such as FileReaders and InputStreamReaders. For
     * example,
     *
     * <pre>
     * CustomBufferedReader in
     *   = new CustomBufferedReader(new FileReader("foo.in"));
     * </pre>
     *
     * will buffer the input from the specified file. Without buffering, each
     * invocation of read() or readLine() could cause bytes to be read from the
     * file, converted into characters, and then returned, which can be very
     * inefficient.
     *
     * <p> Programs that use DataInputStreams for textual input can be localized by
     * replacing each DataInputStream with an appropriate CustomBufferedReader.
     *
     * @author Mark Reinhold
     * @see FileReader
     * @see InputStreamReader
     * @see java.nio.file.Files#newBufferedReader
     * @since JDK1.1
     */
    public class CustomBufferedReader extends Reader {

        private final Reader in;

        private char cb[];
        private int nChars, nextChar;

        private static final int INVALIDATED = -2;
        private static final int UNMARKED = -1;
        private int markedChar = UNMARKED;
        private int readAheadLimit = 0; /* Valid only when markedChar > 0 */

        /**
         * If the next character is a line feed, skip it
         */
        private boolean skipLF = false;

        /**
         * The skipLF flag when the mark was set
         */
        private boolean markedSkipLF = false;

        private static int defaultCharBufferSize = 8192;
        private static int defaultExpectedLineLength = 80;
        private ReadWriteLock rwlock;

        /**
         * Creates a buffering character-input stream that uses an input buffer of
         * the specified size.
         *
         * @param in A Reader
         * @param sz Input-buffer size
         * @throws IllegalArgumentException If {@code sz <= 0}
         */
        public CustomBufferedReader(@NotNull final Reader in, int sz) {
            super(in);
            if (sz <= 0)
                throw new IllegalArgumentException("Buffer size <= 0");
            this.in = in;
            cb = new char[sz];
            nextChar = nChars = 0;
            rwlock = new ReentrantReadWriteLock();
        }

        /**
         * Creates a buffering character-input stream that uses a default-sized
         * input buffer.
         *
         * @param in A Reader
         */
        public CustomBufferedReader(@NotNull final Reader in) {
            this(in, defaultCharBufferSize);
        }

        /**
         * Fills the input buffer, taking the mark into account if it is valid.
         */
        private void fill() throws IOException {
            int dst;
            if (markedChar <= UNMARKED) {
                /* No mark */
                dst = 0;
            } else {
                /* Marked */
                int delta = nextChar - markedChar;
                if (delta >= readAheadLimit) {
                    /* Gone past read-ahead limit: Invalidate mark */
                    markedChar = INVALIDATED;
                    readAheadLimit = 0;
                    dst = 0;
                } else {
                    if (readAheadLimit <= cb.length) {
                        /* Shuffle in the current buffer */
                        System.arraycopy(cb, markedChar, cb, 0, delta);
                        markedChar = 0;
                        dst = delta;
                    } else {
                        /* Reallocate buffer to accommodate read-ahead limit */
                        char ncb[] = new char[readAheadLimit];
                        System.arraycopy(cb, markedChar, ncb, 0, delta);
                        cb = ncb;
                        markedChar = 0;
                        dst = delta;
                    }
                    nextChar = nChars = delta;
                }
            }

            int n;
            do {
                n = in.read(cb, dst, cb.length - dst);
            } while (n == 0);
            if (n > 0) {
                nChars = dst + n;
                nextChar = dst;
            }
        }

        /**
         * Reads a single character.
         *
         * @return The character read, as an integer in the range
         * 0 to 65535 (0x00-0xffff), or -1 if the
         * end of the stream has been reached
         * @throws IOException If an I/O error occurs
         */
        public char readChar() throws IOException {
            for (; ; ) {
                if (nextChar >= nChars) {
                    fill();
                    if (nextChar >= nChars)
                        return (char) -1;
                }
                return cb[nextChar++];
            }
        }

        /**
         * Reads characters into a portion of an array, reading from the underlying
         * stream if necessary.
         */
        private int read1(char[] cbuf, int off, int len) throws IOException {
            if (nextChar >= nChars) {
                /* If the requested length is at least as large as the buffer, and
                   if there is no mark/reset activity, and if line feeds are not
                   being skipped, do not bother to copy the characters into the
                   local buffer. In this way buffered streams will cascade
                   harmlessly. */
                if (len >= cb.length && markedChar <= UNMARKED && !skipLF) {
                    return in.read(cbuf, off, len);
                }
                fill();
            }
            if (nextChar >= nChars) return -1;
            int n = Math.min(len, nChars - nextChar);
            System.arraycopy(cb, nextChar, cbuf, off, n);
            nextChar += n;
            return n;
        }

        /**
         * Reads characters into a portion of an array.
         *
         * <p> This method implements the general contract of the corresponding
         * {@link Reader#read(char[], int, int) read} method of the
         * {@link Reader} class. As an additional convenience, it
         * attempts to read as many characters as possible by repeatedly invoking
         * the read method of the underlying stream. This iterated
         * read continues until one of the following conditions becomes
         * true: <ul>
         *
         * <li> The specified number of characters have been read,
         *
         * <li> The read method of the underlying stream returns
         * -1, indicating end-of-file, or
         *
         * <li> The ready method of the underlying stream
         * returns false, indicating that further input requests
         * would block.
         *
         * </ul> If the first read on the underlying stream returns
         * -1 to indicate end-of-file then this method returns
         * -1. Otherwise this method returns the number of characters
         * actually read.
         *
         * <p> Subclasses of this class are encouraged, but not required, to
         * attempt to read as many characters as possible in the same fashion.
         *
         * <p> Ordinarily this method takes characters from this stream's character
         * buffer, filling it from the underlying stream as necessary. If,
         * however, the buffer is empty, the mark is not valid, and the requested
         * length is at least as large as the buffer, then this method will read
         * characters directly from the underlying stream into the given array.
         * Thus redundant CustomBufferedReaders will not copy data
         * unnecessarily.
         *
         * @param cbuf Destination buffer
         * @param off  Offset at which to start storing characters
         * @param len  Maximum number of characters to read
         * @return The number of characters read, or -1 if the end of the
         * stream has been reached
         * @throws IOException If an I/O error occurs
         */
        public int read(char cbuf[], int off, int len) throws IOException {
            int n = read1(cbuf, off, len);
            if (n <= 0) return n;
            while ((n < len) && in.ready()) {
                int n1 = read1(cbuf, off + n, len - n);
                if (n1 <= 0) break;
                n += n1;
            }
            return n;
        }

        /**
         * Reads a line of text. A line is considered to be terminated by any one
         * of a line feed ('\n'), a carriage return ('\r'), or a carriage return
         * followed immediately by a linefeed.
         *
         * @param ignoreLF If true, the next '\n' will be skipped
         * @return A String containing the contents of the line, not including
         * any line-termination characters, or null if the end of the
         * stream has been reached
         * @throws IOException If an I/O error occurs
         * @see java.io.LineNumberReader#readLine()
         */
        String readLine(boolean ignoreLF) throws IOException {
            StringBuilder s = null;
            int startChar;

            bufferLoop:
            for (; ; ) {

                if (nextChar >= nChars)
                    fill();
                if (nextChar >= nChars) { /* EOF */
                    if (s != null && s.length() > 0)
                        return s.toString();
                    else
                        return null;
                }
                boolean eol = false;
                char c = 0;
                int i;

                /* Skip a leftover '\n', if necessary */

                charLoop:
                for (i = nextChar; i < nChars; i++) {
                    c = cb[i];
                    if ((c == '\n')) {
                        eol = true;
                        break charLoop;
                    }
                }

                startChar = nextChar;
                nextChar = i;

                if (eol) {
                    String str;
                    if (s == null) {
                        str = new String(cb, startChar, i - startChar);
                    } else {
                        s.append(cb, startChar, i - startChar);
                        str = s.toString();
                    }
                    nextChar++;
                    return str;
                }

                if (s == null)
                    s = new StringBuilder(defaultExpectedLineLength);
                s.append(cb, startChar, i - startChar);
            }
        }

        /**
         * Reads a line of text. A line is considered to be terminated by any one
         * of a line feed ('\n'), a carriage return ('\r'), or a carriage return
         * followed immediately by a linefeed.
         *
         * @return A String containing the contents of the line, not including
         * any line-termination characters, or null if the end of the
         * stream has been reached
         * @throws IOException If an I/O error occurs
         * @see java.nio.file.Files#readAllLines
         */
        public String readLine() throws IOException {
            return readLine(false);
        }

        /**
         * Skips characters.
         *
         * @param n The number of characters to skip
         * @return The number of characters actually skipped
         * @throws IllegalArgumentException If n is negative.
         * @throws IOException              If an I/O error occurs
         */
        public long skip(long n) throws IOException {
            if (n < 0L) {
                throw new IllegalArgumentException("skip value is negative");
            }
            rwlock.readLock().lock();
            long r = n;
            try {
                while (r > 0) {
                    if (nextChar >= nChars)
                        fill();
                    if (nextChar >= nChars) /* EOF */
                        break;
                    if (skipLF) {
                        skipLF = false;
                        if (cb[nextChar] == '\n') {
                            nextChar++;
                        }
                    }
                    long d = nChars - nextChar;
                    if (r <= d) {
                        nextChar += r;
                        r = 0;
                        break;
                    } else {
                        r -= d;
                        nextChar = nChars;
                    }
                }
            } finally {
                rwlock.readLock().unlock();
            }
            return n - r;
        }

        /**
         * Tells whether this stream is ready to be read. A buffered character
         * stream is ready if the buffer is not empty, or if the underlying
         * character stream is ready.
         *
         * @throws IOException If an I/O error occurs
         */
        public boolean ready() throws IOException {
            rwlock.readLock().lock();
            try {
                /*
                 * If newline needs to be skipped and the next char to be read
                 * is a newline character, then just skip it right away.
                 */
                if (skipLF) {
                    /* Note that in.ready() will return true if and only if the next
                     * read on the stream will not block.
                     */
                    if (nextChar >= nChars && in.ready()) {
                        fill();
                    }
                    if (nextChar < nChars) {
                        if (cb[nextChar] == '\n')
                            nextChar++;
                        skipLF = false;
                    }
                }
            } finally {
                rwlock.readLock().unlock();
            }
            return (nextChar < nChars) || in.ready();
        }

        /**
         * Tells whether this stream supports the mark() operation, which it does.
         */
        public boolean markSupported() {
            return true;
        }

        /**
         * Marks the present position in the stream. Subsequent calls to reset()
         * will attempt to reposition the stream to this point.
         *
         * @param readAheadLimit Limit on the number of characters that may be
         *                       read while still preserving the mark. An attempt
         *                       to reset the stream after reading characters
         *                       up to this limit or beyond may fail.
         *                       A limit value larger than the size of the input
         *                       buffer will cause a new buffer to be allocated
         *                       whose size is no smaller than limit.
         *                       Therefore large values should be used with care.
         * @throws IllegalArgumentException If {@code readAheadLimit < 0}
         * @throws IOException              If an I/O error occurs
         */
        public void mark(int readAheadLimit) throws IOException {
            if (readAheadLimit < 0) {
                throw new IllegalArgumentException("Read-ahead limit < 0");
            }
            rwlock.readLock().lock();
            try {
                this.readAheadLimit = readAheadLimit;
                markedChar = nextChar;
                markedSkipLF = skipLF;
            } finally {
                rwlock.readLock().unlock();
            }
        }

        /**
         * Resets the stream to the most recent mark.
         *
         * @throws IOException If the stream has never been marked,
         *                     or if the mark has been invalidated
         */
        public void reset() throws IOException {
            rwlock.readLock().lock();
            try {
                if (markedChar < 0)
                    throw new IOException((markedChar == INVALIDATED)
                            ? "Mark invalid"
                            : "Stream not marked");
                nextChar = markedChar;
                skipLF = markedSkipLF;
            } finally {
                rwlock.readLock().unlock();
            }
        }

        public void close() throws IOException {
            rwlock.readLock().lock();
            try {
                in.close();
            } finally {
                cb = null;
                rwlock.readLock().unlock();
            }
        }

        public Stream<String> lines() {
            Iterator<String> iter = new Iterator<String>() {
                String nextLine = null;

                @Override
                public boolean hasNext() {
                    if (nextLine != null) {
                        return true;
                    } else {
                        try {
                            nextLine = readLine();
                            return (nextLine != null);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }
                }

                @Override
                public String next() {
                    if (nextLine != null || hasNext()) {
                        String line = nextLine;
                        nextLine = null;
                        return line;
                    } else {
                        throw new NoSuchElementException();
                    }
                }
            };
            return StreamSupport.stream(Spliterators.spliteratorUnknownSize(
                    iter, Spliterator.ORDERED | Spliterator.NONNULL), false);
        }
    }

And now the results (times in nanoseconds, as printed by the program):

    Time taken for readFileUsingBufferedReaderFileChannel
    2902690903
    1845190694
    1894071377
    1815161868
    1861056735
    1867693540
    1857521371
    1794176251
    1768008762
    1853089582
    Time taken for readFileUsingBufferedReader
    2022837353
    1925901163
    1802266711
    1842689572
    1899984555
    1843101306
    1998642345
    1821242301
    1820168806
    1830375108
    Time taken for readFileUsingStreams
    1992855461
    1930827034
    1850876033
    1843402533
    1800378283
    1863581324
    1810857226
    1798497108
    1809531144
    1796345853
    Time taken for readFileUsingCustomBufferedReader
    1759732702
    1765987214
    1776997357
    1772999486
    1768559162
    1755248431
    1744434555
    1750349867
    1740582606
    1751390934
    Time taken for readFileUsingLineReader
    1845307174
    1830950256
    1829847321
    1828125293
    1827936280
    1836947487
    1832186310
    1820276327
    1830157935
    1829171481

    Process finished with exit code 0

Inference: the test was run on a 200 MB file. The test was repeated several times. The data looked like this:

    Start Date^|^Start Time^|^End Date^|^End Time^|^Event Title ^|^All Day Event^|^No End Time^|^Event Description^|^Contact ^|^Contact Email^|^Contact Phone^|^Location^|^Category^|^Mandatory^|^Registration^|^Maximum^|^Last Date To Register
    9/5/2011^|^3:00:00 PM^|^9/5/2011^|^^|^Social Studies Dept. Meeting^|^N^|^Y^|^Department meeting^|^Chris Gallagher^|^cgallagher@schoolwires.com^|^814-555-5179^|^High School^|^2^|^N^|^N^|^25^|^9/2/2011

Bottom line: there is not much difference between BufferedReader and my CustomBufferedReader (roughly 1.75 s per run versus 1.8 to 2.0 s in the timings above). The gain is minuscule, so you can simply use BufferedReader to read your file.

Trust me, you don't have to break your head over this. Use BufferedReader with readLine; it is properly tested. At worst, if you feel you can improve it, just override it and change it to use StringBuilder instead of StringBuffer, just to shave off half a second.